I was originally planning to tweet this by itself:




But I realized I actually wanted to say some earnest, not-shitposty things about each of these data structures, so I figured I should take it to my neglected blog instead. If you just wanted the clickbait version, you can stop reading now.


So what makes a data structure good to name-drop in an interview? I would say that it has to be mildly obscure, so that you sound like an erudite data structures hipster, but it can’t be too obscure, lest your interviewer ask you to really explain the implementation details or trade-offs to the point that you reveal your ignorance. It’s best when you mention a data structure that’s somewhat obscure, but that your interviewer has heard of, because then your name-dropping validates their knowledge. They want to think of themselves as smart, and they’ve heard of this data structure, so when you show your knowledge of it, they deem you smart by the transitive principle. To this end, I’m not going to cover any truly obscure data structures, because that would defeat the purpose of why I’m writing this blog post to begin with.


Other than that, it should have real-world use cases so that there’s a legitimate reason for you to mention it in the context of a technical interview. It shouldn’t be too pedestrian either, or you won’t sound impressive for knowing something deemed too “undergrad” (pffttt you only know linked lists? get out of my startup, we only do blockchain here).


A bloom filter is a probabilistic version of a set. A set contains elements and can tell you in O(1) time and O(N) space whether or not it contains a given element. A bloom filter can also tell you whether it contains an element in O(1) time, but in far less than O(N) space!


Who would really use this?


✨Google Chrome! ✨Chrome needs to protect you from visiting spam websites without sacrificing speed or space. Imagine if every time you clicked on a link, Chrome had to make a network call to check its massive database of spam URLs before allowing you to visit the page. Further, imagine if Chrome’s solution for improving latency was to store that entire list of spam URLs locally. That’s not feasible! Instead, Chrome stores a bloom filter of potential spam URLs locally. Bloom filters are both time- and space-efficient, so Chrome can quickly check whether a given URL is spam. For normal URLs, the bloom filter’s response of “not spam” is sufficient. If a URL gets flagged as “maybe spam”, then Google can check its real database before moving forward. It turns out you can do great things when you’re willing to sacrifice absolutes! (Yeah, yeah, only a Sith deals in absolutes.)


Implementation details that you can scroll past


The Wikipedia article for bloom filters describes the implementation details with a whole lot of jargon, so I’m going to quickly describe the implementation in plain English here. You should check out Wikipedia if you want more precise details; I’m going to gloss over a lot of information because this blog post is quickly turning out to not be clickbait.


Let’s say you want to insert an element into your bloom filter. First, imagine you have k distinct, deterministic hash functions. When you use each hash function on an element, you get a different value (collisions are okay). You use the output of each hash function as an index into an array, and for each index i, you set array[i] to true. You’re done! Insertion is O(1) because the only work you do on each insertion is running a constant number of hash functions and setting a constant number of array indices.


How would you check whether your bloom filter contains that element? Run it through all of the same hash functions again! Your hash functions are deterministic, so the same input should return the same output. So now, for each index you have, you can check if your bloom filter’s array is set to true at that index. If every slot of the array for your hash functions’ outputs is true, then you can say with some high probability that the element is likely to have been inserted into your bloom filter in the past. However, there’s always a chance of a false positive, which would happen if those array slots were all set to true because the indices were used when some other elements were inserted. The great feature of bloom filters is that there will never be false negatives, though: there’s no way to find an array slot that’s false when that element had previously been inserted.

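Here’s a minimal Python sketch of that insert-and-check flow. The salted-SHA-256 trick is just my stand-in for k distinct hash functions; a real implementation would use faster, non-cryptographic hashes:

```python
import hashlib


class BloomFilter:
    """Sketch of a bloom filter: m bit-slots, k salted hash functions."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _indexes(self, item):
        # Derive k deterministic indices by salting one hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # O(1): run k hashes, set k array slots to true.
        for idx in self._indexes(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # True means "maybe present" (false positives possible);
        # False means "definitely never inserted" (no false negatives).
        return all(self.bits[idx] for idx in self._indexes(item))


urls = BloomFilter()
urls.add("http://totally-legit.example")
print(urls.might_contain("http://totally-legit.example"))  # True
print(urls.might_contain("http://unseen.example"))         # almost certainly False
```

Note that `might_contain` can only ever be wrong in one direction: a true answer might be a false positive, but a false answer is always right.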

You have to do some cool math to figure out how many hash functions and how big of an array you will need to guarantee certain probabilities. Wikipedia goes into greater detail here and I think their proof is worth reading.
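If you just want the punchline of that math, the standard formulas give you, for n expected elements and a target false-positive rate p, a bit array of size m = −n·ln(p)/(ln 2)² and k = (m/n)·ln 2 hash functions. A quick sketch:

```python
import math


def bloom_sizing(n, p):
    """Return (m bits, k hash functions) for n expected elements
    at target false-positive rate p, per the standard formulas."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


# A million URLs at a 1% false-positive rate needs roughly
# 9.6 million bits (~1.2 MB) and 7 hash functions.
print(bloom_sizing(1_000_000, 0.01))
```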


mouth breathing intensifies


A prefix trie is a data structure that allows you to quickly look up a string by its prefix and also find strings that share a common prefix.


My first pro tip for this data structure is to refer to it as a “prefix trie” as opposed to just a “trie”. That way, you suggest to the interviewer that you are the type of person who knows about algorithms related to both prefixes and suffixes, and also you like to be precise about your hipster data structures. Suffix trees are also a pretty interesting topic, but the implementation details are so gory that I wouldn’t be able to do it justice. That’s why I just talk about prefix tries and bluff knowing about suffix trees.


Who would really use this?


✨Genomics researchers!✨It turns out that modern genomic research relies heavily on string algorithms and data structures because you’re trying to find insights from the millions of nucleotides that make up a genome sequence. With genome data, you often want to align sequences, find differences, or find repeated patterns. If you want to learn more about this, you can start by reading up on sequence alignment, and then look into courses such as “Algorithms for Bioinformatics” (offered at multiple schools).


If you want some really exciting bonus reading, I’d highly recommend reading about pharmacogenomics. With advances in genome sequencing and string algorithms, we can actually use an individual’s genome to predict whether they have the right genes to react properly to a medication. For example, if their genome is missing a gene for producing an enzyme that processes a certain drug, they might experience side effects. If we knew what genes were important, we could give them a different drug! We currently do exactly this for warfarin, a blood-thinning medication.


(I have to confess that the connection between prefix tries and genomics is somewhat tenuous, but I wanted to motivate string algorithms in general. If you want the canonical use case for prefix tries, then yeah whatever, a prefix trie can be used to implement an autocomplete feature. I’m bored again.)


Implementation stuff


Imagine you have a tree where every node has an array of 26 children, one for each letter of the alphabet. (You can change 26 to a different value if you want to include other characters. After all, you are the Ashton Kutcher of your string data structure.) To represent a word in your trie, you walk down the tree, adding a node for each letter of the word. For example, picture a trie containing “to”, “tea”, and “ted”: all three words share the “t” node, and “tea” and “ted” also share “te”.


For the word “tea”, you start at the root, navigate to the “t” node, then “te”, and finally “tea”. So searching for a word takes O(N) time (where N is the length of the word), and if the word’s prefix doesn’t exist, you can bail early. If I look up “zzzzzzzz”, the trie can stop looking for my term after “zz”.
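That walk can be sketched in a few lines of Python (I’m using a dict of children instead of a literal 26-slot array, but the idea is the same):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a letter to the next TrieNode
        self.is_word = False  # True if an inserted word ends here


class PrefixTrie:
    """Minimal prefix trie sketch."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # O(N) in the word's length; bails early on a missing prefix.
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False  # "zzzzzzzz" stops at the first missing child
            node = node.children[ch]
        return node.is_word


t = PrefixTrie()
for w in ("to", "tea", "ted"):
    t.insert(w)
print(t.contains("tea"))  # True
print(t.contains("te"))   # False: "te" is only a prefix, not an inserted word
```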


A ring buffer is less a novel data structure and more a clever way to use a normal array that makes it well-suited for data streaming.


Who would really use this?


✨Maybe Netflix?! ✨I googled “netflix ring buffer” and found an implementation they published, but when has a company ever actually used its open source code internally in production, anyway?


Totally unrelated, but Veneur, Stripe’s metrics pipeline that we use in production and that just happens to be open source, uses a ring buffer for collecting tracing data. #weirdflexbutokay #sponsored


Is anyone even reading these sections? Wikipedia has a cool gif, and its description is sufficient for this data structure, so I won’t rewrite what they’ve said. This blog post is already getting pretty long! (If enough people complain, I’ll come back and edit this section.)

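Okay, fine, for the early complainers, here’s a minimal Python sketch of the idea: a fixed-size array where writes wrap around and overwrite the oldest entry, which is exactly what you want for “keep the last N things” streaming workloads:

```python
class RingBuffer:
    """Fixed-size buffer over a plain array; when full, new writes
    overwrite the oldest element instead of growing the array."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.start = 0  # index of the oldest element
        self.size = 0

    def push(self, item):
        end = (self.start + self.size) % self.capacity
        self.buf[end] = item
        if self.size < self.capacity:
            self.size += 1
        else:
            # Full: we just overwrote the oldest element, so advance start.
            self.start = (self.start + 1) % self.capacity

    def to_list(self):
        # Oldest-to-newest view of the buffer's contents.
        return [self.buf[(self.start + i) % self.capacity]
                for i in range(self.size)]


rb = RingBuffer(3)
for x in [1, 2, 3, 4, 5]:
    rb.push(x)
print(rb.to_list())  # [3, 4, 5] -- the two oldest values were overwritten
```

Both `push` and reading the newest element are O(1), with no shifting or reallocation, which is the whole appeal for streaming.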

With this blog post, I like to think I’ve saved you about $20,000 in university tuition, gotten you excited about at least one real-world use case that you might not have heard of before, and, most importantly, helped you win your next computer science penis measurement contest.


If you’re looking for more practical interview advice, you might enjoy my other posts on the subject. Both are tailored for an entry-level audience.