Sets are useful mathematical abstractions. They are also useful and ubiquitous programming abstractions. Typically, we are concerned with the membership of elements within a set. Is the element we’re talking about in the set, or not?
Think about sets as unordered collections of unique elements. You could construct a set of all blue items in the world. The group of all real numbers is a set. The group of all integers is a set. So on and so forth.
Sets provide constant-time lookup operations on average. Let’s say your program builds up a group (read: not a list) of a user’s favorite artists based on songs he or she has liked. When the user favorites a song performed by an artist already in the collection, you don’t want to duplicate any work, so you use a set to ensure the items in your collection are unique. If you need to check if something is in your set, you can rest assured it will be fast.
Sets never contain duplicates, and they do not maintain any degree of order.
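The favorite-artists scenario can be sketched with Python’s built-in `set` (the song and artist names here are invented purely for illustration):

```python
# Build up a set of favorite artists from liked songs.
# Names here are made up for illustration.
liked_songs = [
    ("Song A", "Artist 1"),
    ("Song B", "Artist 2"),
    ("Song C", "Artist 1"),  # same artist liked again
]

favorite_artists = set()
for _song, artist in liked_songs:
    favorite_artists.add(artist)  # duplicates are ignored automatically

print(len(favorite_artists))           # 2 -- "Artist 1" appears only once
print("Artist 1" in favorite_artists)  # True, and the lookup is fast
```

Note that the set holds two artists even though three songs were liked: the duplicate insert was a no-op.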
A hash function is an arbitrary function `f` that takes an arbitrary input `x` and outputs a number within set bounds. For example, let’s define `g`, a function that takes an arbitrary string `x` and converts it to a number between 0 and 1,000,000.
Our hash function requires:

* Determinism. Like all mathematical functions, a given `x` will always produce the same output.
* There is no relationship between the ordering of inputs and the ordering of outputs. `g("hello")` and `g("helloo")` might map to values nowhere near each other — similar inputs need not produce similar outputs.
* A uniform distribution of outputs. A random sampling of inputs must produce a random sampling of outputs. There cannot be any clumping on a number line of outputs.
It’s nice to think of hash functions as black boxes that deterministically produce (almost) unique outputs based only on their inputs. They’re useful when you want to transform an erratically formed dataset into a common format. (Like taking state names such as Nebraska and New Hampshire and turning them into hex strings like `3e99`. Same size, same format.)
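As a sketch, here is one way to build a function like `g` in Python. The choice of SHA-256 as the underlying hash is an assumption — the properties above don’t require any particular algorithm, just determinism, bounded output, and a uniform spread:

```python
import hashlib

def g(x: str) -> int:
    """Map an arbitrary string to a number between 0 and 1,000,000.

    SHA-256 is one possible choice of underlying hash; any function
    with the deterministic / unordered / uniform properties would do.
    """
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    # Reduce the huge digest into the range [0, 1_000_000].
    return int.from_bytes(digest, "big") % 1_000_001

print(g("hello") == g("hello"))        # True: deterministic
print(0 <= g("helloo") <= 1_000_000)   # True: bounded
```

Running `g` twice on the same string always yields the same number, and every output lands inside the stated bounds.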
Sets, specifically hash-sets, use hash functions to ‘insert’ elements. Let’s say I take a guess and say I think my set will never hold more than 100k elements. So I make space for 100k contiguous ‘slots’. If I want to ‘insert’ the string “A” into the set (so I can check for membership afterwards), I do the following:
- Hash “A”. This produces a number that depends on our hash function. Let’s say `hash("A") = 121103`. Since there are only 100,000 slots in my set, I have to take `121103 % 100000 = 21103`. At the 21,103rd slot in my array of slots, I add “A”. Similarly, `hash("B") = 6`, and `6 % 100000 = 6`, so at the 6th slot, I add “B”.
Now - let’s say `hash("KL") = 100006`, and `100006 % 100000 = 6`. This is called a collision. Remember, our hash function is bound over some range and our set is constrained to the number of slots we allocated. Since we use modulo arithmetic to keep everything in line, collisions can occur: two items can hash to the same value, or they can end up in the same slot in the array.
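The modulo arithmetic behind that collision, using the hypothetical hash values from the running example:

```python
TABLE_SIZE = 100_000

# Hypothetical hash values from the running example.
print(121103 % TABLE_SIZE)  # 21103 -> slot for "A"
print(6 % TABLE_SIZE)       # 6     -> slot for "B"
print(100006 % TABLE_SIZE)  # 6     -> "KL" collides with "B"
```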
Many implementations of hash-sets resolve this by making their array of slots two-dimensional. Each slot contains more slots, or “buckets”, where items that resolved to the same index in the set reside in an ordered list. In this case, the bucket at index 6 would look like `["B", "KL"]`.
How do we check for membership? Let’s say we want to see if “KL” is in the set. We perform almost the same steps as insertion. First we hash “KL” using our hash function. Then we take the result modulo the size of our set. That brings us to the appropriate bucket. Here, we must walk every element in the bucket to see if “KL” exists. First we ask: is `"B" == "KL"`? No, so move on. Is `"KL" == "KL"`? Yes, so return `true` for the membership check.
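Putting insertion and the membership walk together, a minimal hash-set sketch with bucket chaining might look like this. The fixed table size and the use of Python’s built-in `hash` are assumptions for illustration, not the only way to do it:

```python
class HashSet:
    def __init__(self, num_slots: int = 100_000):
        # One bucket (a list) per slot; collisions chain within a bucket.
        self.num_slots = num_slots
        self.slots = [[] for _ in range(num_slots)]

    def _index(self, item) -> int:
        # Hash, then wrap into the allocated range with modulo.
        return hash(item) % self.num_slots

    def insert(self, item) -> None:
        bucket = self.slots[self._index(item)]
        if item not in bucket:  # keep elements unique
            bucket.append(item)

    def exists(self, item) -> bool:
        # Walk every element in the one relevant bucket.
        bucket = self.slots[self._index(item)]
        return any(element == item for element in bucket)

s = HashSet()
s.insert("A")
s.insert("B")
s.insert("KL")
print(s.exists("KL"))  # True
print(s.exists("Z"))   # False
```

The walk in `exists` only touches one bucket, which is exactly why keeping buckets small matters for performance.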
When talking about the time complexity of the `exists?` operation on a hash-set, you really want to talk about the average case.

In our above “KL” example, we hashed the item to find the index. The number of steps in that operation did not depend on the size of the set (`n`), so that is constant time, or `O(1)`.

However, we did have to walk the bucket, and that does depend on the size of the bucket. The hope is that a properly implemented hash-set will, on average, keep bucket sizes bounded by a constant, making the overall time complexity constant, or `O(1)`. I waved my hands here a bit, but the purpose of this post is simply to understand sets so that we can understand bloom filters. If you’d like to read more…
Compare this to storing items in an ordered collection, such as a list. Every time you insert an element, you just add it to the end of the list. If you want to check if an item is in the list, you have to walk the entire list. This operation scales with the size of the list, so it is `O(n)`.
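The list-based alternative, for contrast — insertion appends to the end, and the membership check walks the whole collection, so the check here is `O(n)`:

```python
items = []

def list_insert(item):
    items.append(item)  # O(1): just add to the end

def list_exists(item):
    # O(n): potentially walks every element in the list
    for element in items:
        if element == item:
            return True
    return False

list_insert("A")
list_insert("B")
print(list_exists("B"))   # True, found by scanning the list
print(list_exists("KL"))  # False, after scanning everything
```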
What about bloom filters?
In the next post I’ll actually connect this to bloom filters, which are just a clever and unique kind of set.