I was originally planning to tweet this by itself:

STARTUPS HATE HER: DATA STRUCTURES TO NAME-DROP WHEN YOU WANT TO SOUND SMART IN AN INTERVIEW

But I realized I actually wanted to say some earnest, not-shitposty things about each of these data structures, so I figured I should take it to my neglected blog instead. If you just wanted the clickbait version, you can stop reading now.

So what makes a data structure good to name-drop in an interview? I would say that it has to be mildly obscure, so that you sound like an erudite data structures hipster, but it can’t be too obscure, lest your interviewer ask you to really explain the implementation details or trade-offs to the point that you reveal your ignorance. It’s best when you mention a data structure that’s somewhat obscure, but that your interviewer has heard of, because then your name-dropping validates their knowledge. They want to think of themselves as smart, and they’ve heard of this data structure, so when you show your knowledge of it, they deem you smart by the transitive principle. To this end, I’m not going to cover any truly obscure data structures, because that would defeat the purpose of why I’m writing this blog post to begin with.

Other than that, it should have real-world use cases so that there’s a legitimate reason for you to mention it in the context of a technical interview. It shouldn’t be too pedestrian either, or you won’t sound impressive for knowing something deemed too “undergrad” (pffttt you only know linked lists? get out of my startup, we only do blockchain here).

A bloom filter is a probabilistic version of a set. A set contains elements and can tell you, in O(1) time and O(N) space, whether or not it contains a given element. A bloom filter can also tell you whether it contains an element, but in O(1) time and O(1) space!

Who would really use this?

✨Google Chrome! ✨Chrome needs to protect you from visiting spam websites without sacrificing speed or space. Imagine if every time you clicked on a link, Chrome had to make a network call to check its massive database of spam URLs before allowing you to visit the page. Further, imagine if Chrome’s solution for improving latency was to store that entire list of spam URLs locally. That’s not feasible! Instead, Chrome stores a bloom filter of potential spam URLs locally. Bloom filters are both time- and space-efficient, so Chrome can quickly check whether a given URL is spam. For normal URLs, the bloom filter’s response of “not spam” is sufficient. If a URL gets flagged as “maybe spam”, then Google can check its real database before moving forward. It turns out you can do great things when you’re willing to sacrifice absolutes! (Yeah, yeah, only a Sith deals in absolutes.)

Implementation details that you can scroll past

The Wikipedia article for bloom filters describes the implementation details with a whole lot of jargon, so I’m going to quickly describe the implementation in plain English here. You should check out Wikipedia if you want more precise details; I’m going to gloss over a lot of information because this blog post is quickly turning out to not be clickbait.

Let’s say you want to insert an element into your bloom filter. First, imagine you have k distinct, deterministic hash functions. When you use each hash function on an element, you get a different value (collisions are okay). You use the output of each hash function as an index into an array, and for each index i, you set array[i] to true. You’re done! Insertion is O(1) because the only work you do on each insertion is running a constant number of hash functions and setting a constant number of array indices.

How would you check whether your bloom filter contains that element? Run it through all of the same hash functions again! Your hash functions are deterministic, so the same input should return the same output. So now, for each index you have, you can check if your bloom filter’s array is set to true at that index. If every slot of the array for your hash functions’ outputs is true, then you can say with high probability that the element was inserted into your bloom filter in the past. However, there’s always a chance of a false positive, which would happen if those array slots were all set to true because the indices were used when some other elements were inserted. The great feature of bloom filters is that there will never be false negatives, though: there’s no way to find an array slot that’s false when that element had previously been inserted.
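To make the insert-and-check dance concrete, here's a minimal Python sketch. Everything in it is my own illustration rather than how Chrome or any real library does it: the `BloomFilter` class, the `add` and `might_contain` names, the default sizes, and especially the trick of faking several hash functions by salting SHA-256 with a counter are all assumptions made for readability.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a boolean array plus several hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size              # length of the boolean array
        self.num_hashes = num_hashes  # how many hash functions to simulate
        self.bits = [False] * size

    def _indexes(self, item):
        # Assumption: fake distinct hash functions by salting one hash with 0, 1, 2, ...
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        # Insertion: set every slot that the hash functions point at.
        for i in self._indexes(item):
            self.bits[i] = True

    def might_contain(self, item):
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[i] for i in self._indexes(item))


spam = BloomFilter()
spam.add("http://spam.example/win-a-prize")
print(spam.might_contain("http://spam.example/win-a-prize"))  # True: no false negatives
```

Looking up a URL that was never added will almost always return `False`, but a `True` is possible if other insertions happened to set the same slots. That's the false positive described above.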

You have to do some cool math to figure out how many hash functions and how big of an array you will need to guarantee certain probabilities. Wikipedia goes into greater detail here and I think their proof is worth reading.
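If you'd rather see the punchline than the proof, here's a small sketch of the standard sizing approximations: for n expected elements and a target false positive rate p, the array needs about m = -n ln(p) / (ln 2)^2 slots, and the number of hash functions that minimizes false positives is about k = (m / n) ln 2. The function name `bloom_parameters` and its argument names are mine, and these are approximations, not guarantees.

```python
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (array_size, num_hashes) using the standard approximations."""
    n, p = expected_items, false_positive_rate
    # Array size in slots: m = -n * ln(p) / (ln 2)^2, rounded up.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # Number of hash functions: k = (m / n) * ln 2, rounded to the nearest integer.
    k = max(1, round((m / n) * math.log(2)))
    return m, k


m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)
```

For a million URLs and a 1% false positive rate, this works out to roughly 9.6 slots per element (about 1.2 MB if each slot is a single bit) and 7 hash functions.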

mouth breathing intensifies

A prefix trie is a data structure that allows you to quickly look up a string by its prefix and also find strings that share a common prefix.

My first pro tip for this data structure is to refer to it as a “prefix trie” as opposed to just a “trie”. That way, you suggest to the interviewer that you are the type of person who knows about algorithms related to both prefixes and suffixes, and also that you like to be precise about your hipster data structures. Suffix trees are also a pretty interesting topic, but the implementation details are so gory that I wouldn’t be able to do them justice. That’s why I just talk about prefix tries and bluff knowing about suffix trees.

Who would really use this?

✨Genomics researchers!✨It turns out that modern genomic research relies heavily on string algorithms and data structures because you’re trying to find insights from the millions of nucleotides that make up a genome sequence. With genome data, you often want to align sequences, find differences, or find repeated patterns. If you want to learn more about this, you can start by reading up on , and then look into courses such as “” or “Algorithms for Bioinformatics” (offered at multiple schools).

If you want some really exciting bonus reading, I’d highly recommend reading about . With advances in genome sequencing and string algorithms, we can actually use an individual’s genome to determine whether they have the right genes to react properly to a medication. For example, if their genome is missing a gene for producing an enzyme that processes a certain drug, they might experience side effects. If we knew what genes were important, we could give them a different drug! We currently do exactly this for , a blood thinning medication.

(I have to confess that the connection between prefix tries and genomics is somewhat tenuous, but I wanted to motivate string algorithms in general. If you want the canonical use case for prefix tries, then yeah whatever, a prefix trie can be used to implement a . I’m bored again.)

Implementation stuff

Imagine you have a tree where every node has an array of 26 children, one for each letter of the alphabet. (You can change 26 to be a different value if you want to include other characters. After all, you are the Ashton Kutcher of your string data structure.) To represent a word in your trie, you would walk down the tree and add a node for each letter of the word. For example, here’s this image I stole from :

For the word “tea”, you start at the root, navigate to the “t” node, then “e”, and finally “a”. So searching for a word takes O(N) time (where N is the length of the word), and if the word’s prefix doesn’t exist, you can bail early. If I look up “zzzzzzzz”, the trie can stop looking for my term after “zz”.
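Here's a minimal Python sketch of the 26-children-per-node idea, in case it helps to see the walk spelled out. The class and method names (`TrieNode`, `PrefixTrie`, `insert`, `contains`, `has_prefix`) are my own, and the code assumes words contain only lowercase letters a through z.

```python
class TrieNode:
    def __init__(self):
        self.children = [None] * 26  # one slot per lowercase letter a-z
        self.is_word = False         # True if some inserted word ends here


class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    @staticmethod
    def _slot(letter):
        # Assumption: input is lowercase a-z only.
        return ord(letter) - ord("a")

    def insert(self, word):
        node = self.root
        for letter in word:
            i = self._slot(letter)
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.is_word = True

    def _walk(self, text):
        # Follow one child per letter; bail early if the path runs out.
        node = self.root
        for letter in text:
            node = node.children[self._slot(letter)]
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = PrefixTrie()
for word in ["tea", "ten", "to"]:
    trie.insert(word)
print(trie.contains("tea"))       # True
print(trie.contains("te"))        # False: "te" is only a prefix
print(trie.has_prefix("te"))      # True
print(trie.contains("zzzzzzzz"))  # False
```

The `is_word` flag is what separates a stored word from a mere prefix of one. A fixed 26-slot array keeps lookups simple, though a dictionary per node would waste less memory when most slots are empty.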

A ring buffer is less a new data structure than a clever way to use a normal array: you treat the array as if its end wrapped around to its beginning, which makes it well suited to data streaming.

Who would really use this?

✨Maybe Netflix?! ✨I googled “netflix ring buffer” and found they published, but when has a company ever used their open source code internally in production, anyway?

Totally unrelated, but , Stripe’s metrics pipeline that we use in production and just happens to be open source, uses a for collecting tracing data. #weirdflexbutokay #sponsored

Is anyone even reading these sections? has a cool gif and their descriptions are sufficient for this data structure, so I won’t rewrite what they’ve said. This blog post is already getting pretty long! (If enough people complain, I’ll come back and edit this section.)
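For anyone who does want a concrete picture, here's a minimal Python sketch of a fixed-size ring buffer. The names (`RingBuffer`, `push`, `pop`) are mine, and the choice to overwrite the oldest item when the buffer is full is an assumption; some implementations block or raise an error instead.

```python
class RingBuffer:
    """Fixed-size FIFO queue backed by a plain list that wraps around."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed to be at least 1
        self.items = [None] * capacity
        self.start = 0                # index of the oldest item
        self.count = 0                # number of items currently stored

    def push(self, item):
        # The next free slot sits just past the newest item, wrapping around.
        end = (self.start + self.count) % self.capacity
        self.items[end] = item
        if self.count == self.capacity:
            # Assumption: when full, the newest item overwrites the oldest.
            self.start = (self.start + 1) % self.capacity
        else:
            self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("pop from empty ring buffer")
        item = self.items[self.start]
        self.items[self.start] = None
        self.start = (self.start + 1) % self.capacity
        self.count -= 1
        return item


buffer = RingBuffer(3)
for value in [1, 2, 3, 4]:
    buffer.push(value)      # pushing 4 overwrites 1, the oldest item
print(buffer.pop())         # 2
```

Nothing is ever shifted or reallocated. Both `push` and `pop` just move an index with modular arithmetic, which is why the structure suits a steady stream of data.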

With this blog post, I like to think I’ve saved you about $20,000 in university tuition, gotten you excited about at least one real-world use case that you might not have heard of before, and, most importantly, helped you win your next computer science penis measurement contest.

If you’re looking for more practical interview advice, you might enjoy and . Both are tailored for an entry-level audience.
