Fast - FPSet can compute an intersection in O( min(n) * m ) time. In space, the algorithm is O(m). m is the number of sets. n is the number of elements in each set.
Persistent - Each set can be larger than available memory because the intersection happens while data is streamed from the disk. FPSet does not need to have the sets entirely in memory to perform the intersection.
Plots
These plots show the runtime of FPSet (blue) compared with the runtime of Ruby’s Array intersect (red) as a function of the size of the set being intersected. Check the alt-text of the image for more details.
How it works
FPSet takes any Ruby enumerable, serializes each member, sorts it, de-duplicates it, compresses it, and then writes it to a file. Creating a new set isn’t quick. FPSet is intended for applications that need very fast intersections of large sets but don’t expect to change those sets very often.
When you ask FPSet to compute the set intersection of several files it opens all of those files and decompresses just enough of the first file to find the first element in that set. It doesn’t bother to deserialize that element. Instead, at the byte level, FPSet compares that element with the first element of the next set (again, decompressing just-enough.) If the element in the second set is less than the element being searched for then that lower element cannot be in the intersection. It is discarded and the next element is read. This continues until FPSet finds a match or an element that is greater than the element being searched for. If FPSet finds a match then it compares with the third set and so on. If FPSet instead finds a greater element then the element that was being searched for cannot be in the intersection. However, the greater element may be the next member of the intersection so FPSet makes that the new searched-for element and continues streaming elements from the first set looking for a value that is equal to or greater than this new searched-for element.
If we exhaust all of the elements in one set at any point during the algorithm we know that we’ve found the complete intersection. This is where the min(n) comes from in our time complexity.
Why C?
The core algorithm of FPSet is actually implemented in ANSI C. There are two good reasons for doing this. First, the byte level comparisons that are behind the sort and intersect steps of the algorithm can be very clearly and efficiently represented in C. Second, and most important for performance, the intersection algorithm in C is expressed in a way that doesn’t allocate or free memory (no GC churn!) I’m not sure how/if one can write a Ruby equivalent of this algorithm that doesn’t keep the garbage collector busy.
Compression
Using compression when reading and writing was an interesting optimization. By far, the most expensive part of this algorithm is reading data from the disk. FPSet is completely IO bound. Modern processors are incredibly fast relative to modern disks (even SSDs.) My first implementation of FPSet wrote exactly the bytes that the algorithm would later read and compare. I discovered that the intersection was significantly faster if I added a decompression step and thus have fewer bytes to read. In my benchmarks, the runtime of FPSet decreased by 30-40% when I zlib (level 2) compressed the data.
Set Contents
You can store anything in an FPSet that Ruby knows how to Marshal.dump. I don’t know if there’s any value-like thing that Ruby doesn’t know how to marshal so I think this is a pretty loose restriction. The byte representation produced by Marshal.dump tends to be sparse but the compression we use tightens it up and makes it efficient on the disk.
RubyGems
FPSet is available through RubyGems under the name rfpset. You get it in the usual way:
1
|
|
You can also build the gem from source:
1 2 |
|
Edit: My C is ANSI in the same way my shoes are ANSI… not at all. But, just use gcc like a normal person and you’ll be happy.