Friday, November 11, 2011

Crippling Hash Collisions in Rubinius

Last weekend, I was happily and innocently coding away in Ruby when I noticed that my tests were running one hundred times slower in one Ruby implementation than in another. Different implementations naturally have different strengths and weaknesses in performance, but when you find some code that runs 100 times slower, there is probably something going very wrong. I figured out the root cause of this problem, and I hope you will find it interesting.

I managed to simplify my code to the simplest possible thing that demonstrates the problem:

P = Struct.new :x, :y, :z
hash = {}
(0..15).each do |x|
  (0..15).each do |y|
    (0..15).each do |z|
      key = P.new x, y, z
      hash[key] = true
    end
  end
end

This code looks pretty innocent, right? If I were on the jury I would say "not guilty".

Here's how long the code took to run in various Ruby implementations on my computer (according to the Linux time command):

Implementation        Run time
ruby 1.9.2p290 (MRI)  0.020 s
rubinius 1.2.4        3.210 s
rubinius 2.0.0dev     0.894 s

Here are some gratuitous SVG bar graphs:

[SVG bar graphs of the run times omitted]
That's right, you need to use a logarithmic scale to even see the run time of this code under MRI.

I was very curious about why Rubinius was so slow, so I used the profiler built into Rubinius and discovered that most of the time was being spent searching for matching hash entries (Hash::Entry#match?). That's because the hash function in Rubinius 1.2.4 generated only 16 different hash-codes for the 4096 different keys I put into the hash. Every time a new key was added, it collided with approximately one sixteenth of the keys already in the hash, so it had to be compared to each of them to make sure it wasn't equal to any key already present. Rubinius 2.0.0dev did better, with 576 unique hash-codes. In MRI 1.9.2, all the hash-codes were unique.

If you didn't understand the paragraph above, read on because I'm going to explain how hash tables work.

As we know, the job of a hash table is to associate keys with values. The keys and values can be any Ruby object. Imagine the hash table is a city, and the key-value pairs are houses in the city. Whenever you access an element in the hash table, you are visiting one of the houses.

When you want to visit a house (key-value pair) in the city, the first thing you need to do is compute its hash code, which is like the street address of the house. In Ruby, this is done by calling the key's hash method, which returns an integer. The roads in the hash city are designed so that once you know the address of a house, you can quickly find all the houses that have that address.
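To make that concrete, here is a quick illustration of calling a key's hash method (the exact integer you see will vary between Ruby implementations, and in modern Rubies between runs of the same program):

```ruby
# Asking a key for its "street address". The specific integer is
# implementation-dependent, but it is always an Integer, and equal
# keys must always return equal hash-codes.
key = [1, 2, 3]
p key.hash                     # some Integer; value varies by interpreter
p key.hash == [1, 2, 3].hash   # => true
```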

In our hash city, it is possible that two or more houses will have the same address. This is called a hash collision, but unlike other collisions it doesn't leave body parts lying in the middle of the road. The consequence of a hash collision is relatively mild: after you find the houses that have the address you are looking for, you just need to look at each house individually to see whether it is the one you want. This can increase the time it takes to find the right house. It applies equally to both read and write operations, but in our example code we only did write operations.

For example, if you are looking for a house (key-value pair) with address (hash-code) 34, you would quickly walk to the houses that have address 34. You would then look at each house (key) and see if it is the house you are looking for (using the eql? method). If you find the house, you can stop searching, but otherwise you will have to look at all the houses with that address.
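The search described above can be sketched in code. Here is a toy hash table using separate chaining (my assumption for illustration; real Ruby implementations are more sophisticated, but the "walk to the address, then eql? each house" idea is the same):

```ruby
# A toy hash table: an array of buckets ("streets"), where each bucket
# holds the entries ("houses") whose hash-codes map to that bucket.
class ToyHash
  def initialize(num_buckets = 16)
    @buckets = Array.new(num_buckets) { [] }
  end

  def []=(key, value)
    bucket = @buckets[key.hash % @buckets.size]     # walk to the address
    entry = bucket.find { |k, _| k.eql?(key) }      # inspect each house
    if entry
      entry[1] = value                              # same key: overwrite
    else
      bucket << [key, value]                        # new house on this street
    end
  end

  def [](key)
    bucket = @buckets[key.hash % @buckets.size]     # walk to the address
    entry = bucket.find { |k, _| k.eql?(key) }      # inspect each house
    entry && entry[1]
  end
end

h = ToyHash.new
h["dog"] = 1
h["cat"] = 2
p h["dog"]  # => 1
```

If many keys share one hash-code, they all pile into the same bucket, and that find call degenerates into scanning a long list, which is exactly the slowdown the benchmark hit.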

There are different ways to generate hash-codes and for a given set of data some of those methods will produce more hash collisions than others.

In Ruby 1.9, the 4096 keys generated in the code above all had different hash-codes so there were no hash collisions.
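You can check the distribution yourself on any interpreter with a small diagnostic (not part of the original benchmark; the struct gets a fresh name here so it stands alone):

```ruby
# Count how many distinct hash-codes the 4096 benchmark keys produce
# on whatever Ruby interpreter runs this.
Point = Struct.new :x, :y, :z
codes = []
(0..15).each do |x|
  (0..15).each do |y|
    (0..15).each do |z|
      codes << Point.new(x, y, z).hash
    end
  end
end
puts "#{codes.uniq.size} distinct hash-codes for #{codes.size} keys"
```

On MRI 1.9.2 this reported 4096 distinct codes; on Rubinius 1.2.4, only 16.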

In Rubinius 1.2.4, because the hash method was implemented differently, the 4096 keys in the code above had only *16* different addresses. Imagine a city with 4096 houses, but only 16 addresses. It would take forever to find anyone's house!
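The post doesn't show Rubinius's actual hash algorithm, but one classic way to end up with only 16 addresses is to combine member hash-codes with plain XOR: if each small integer hashes to roughly itself, then x ^ y ^ z for coordinates in 0..15 can never leave the range 0..15. Purely as an illustration of that failure mode (an assumed implementation, not necessarily what Rubinius did):

```ruby
# A deliberately weak struct hash: XOR the members together.
# For three members each in 0..15, the result is a 4-bit number,
# so at most 16 distinct "addresses" exist for all 4096 keys.
weak_codes = []
(0..15).each do |x|
  (0..15).each do |y|
    (0..15).each do |z|
      weak_codes << (x ^ y ^ z)
    end
  end
end
p weak_codes.uniq.size  # => 16
```

Mixing in multiplications by primes or bit rotations between members is the usual fix, because it spreads the combined value across the full integer range instead of letting the members cancel each other out.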

I hope this illuminated something for you. Let me know what you think in the comments!

8 comments:

  1. Wow, thank you for finally explaining this. I realize that I've gradually picked up the gist of what a hash mapping was over the years, but I never had someone just sit down and explain it.

    Here's my question, because it's still a bit magical to me. By analogy to C++ I usually either store groups of objects together in arrays, in which the objects themselves are stored in equal sized containers; Store them separately allocated on demand but link to them from a static array of pointers to objects (in this case only one pointer width is required regardless whether the container is filled or not); Or use a linked list of some kind.

    In each of these cases though, the only natural way of locating an object is by applying its index to the array/list offset. As far as I can see, this amounts to a very basic kind of a key-value pair; But putting a hash table as a buffer between this linear listing and the user presents some design questions.

    If we create a pointer array (mostly empty) and have a hash table with two objects corresponding to keys 1 and 23, then assign the pointers at indexes 1 and 23 of the pointer array to the corresponding objects; Then we have a fantastically efficient hash function, because as soon as we know the key of an object we know where it is stored. On the other hand, this isn't practical since the programmer could choose to build a hashed list of two objects and indicate them by, say, keys 1 and 2^10. This pointer array would then require 4 gigabytes of memory to store.

    The best alternative I can imagine would be to create an array of objects (or a linked list) containing unordered key and pointer-to-value pairs, and then sorting the entire list every time a new object is added to the list to allow some key-search algorithm to efficiently find a given key.

    If the dictionary was stored as a linked list, then the sort function would have to create a static array of pointers to dictionary objects to gain a wider view of what the dictionary looked like so that it could re-sort it. Rearranging the dictionary objects shouldn't be too difficult then, because each rearrangement action would only require at max 4 pointer redirections (once the whole array of pointers of key/pointer-to-value pairs had been generated).

    In any case, this all seems awfully messy to me, though it might work. The next level up then, in python the keys to a dictionary could just as easily be 1, 23, 2^10, or "dog". Even all simultaneously. This clearly would present a bit of a challenge for the hash table's search/sorting algorithms.

    Could this be how hash tables are implemented though?

    ReplyDelete
  2. Thanks for the comment Noah! As I said in the post, the "roads" (i.e. data structure) need to be designed so that if you have the integer hash-code of the key, you can efficiently find the object that represents the key-value pair.

    A linked list is NOT an efficient way to do this, because most operations on a linked list take O(N) time. For example, finding the element at position N/5 takes O(N) time because you have to traverse every node starting at 0 all the way up to N/5.

    I didn't talk about this in my original post, but I think for a hash table, you might use something like a Binary Search Tree.

    I'm pretty sure a binary search tree allows you to locate, insert, and delete items in O(log N) time, which is much better than O(N).

    Does it make sense yet?

    ReplyDelete
  3. Oh yes, I remember talking with you about binary trees once. I haven't had a chance to actually implement one yet, so it's hard to see all the consequences of it, or all the advantages, but let's give it a try:

    If I keyed a two object pool of values with the two numerical keys 1 and 2^10, then their two locations in the binary tree would be:

    [0,0,0,0,0,0,0,0,0,0,1]
    and
    [1,0,0,0,0,0,0,0,0,0,0]
    (Big Endian)

    Could you store those two keys in a two-branch tree? If you can, then I might seriously need to start implementing binary trees. What would you do (as in python) if you were given the key "dog" for a particular object?

    ReplyDelete
  4. While it's true that a Hash data structure should (mostly) avoid collisions to perform well, I don't think you found the root cause in Rubinius. I'd be curious to see your 1.2.4 profile results. The gist below basically shows that the vast majority of the cost was actually computing the Struct#hash value.

    https://gist.github.com/1360070

    ReplyDelete
  5. FWIW, it appears to perform normally in JRuby. I had to give it a try.

    I modified your benchmark to run the loops 1000 times, since in both JRuby and Rubinius you're paying some startup + warmup time for a short run. JRuby beats 1.9.3 in that case.

    JRuby = 0m5.653s

    Ruby 1.9.3 = 0m6.477s

    Unfortunately Rubinius performs poorly at this scale...I'm not sure why. Perhaps there's still a bug?

    Rubinius = 1m55.583s

    ReplyDelete
  6. Brian,

    Just because you change the behavior of one function and the overall code gets faster, it doesn't imply that the function was costly. Here's a trivial example of that:

    def h; 99999999; end
    h.times { puts }

    If I change h to return 0 then the code gets MUCH faster, but you wouldn't say that h is costly.

    Here are the results of profiling my code in Rubinius 1.2.4:
    https://gist.github.com/1360989

    Notice that Struct#hash is only called 4096 times and the execution time is negligible.

    A lot of other functions are called about 522,000 times, and this can be explained by a very simple model for hash collisions. Every time you add a key to the hash, there are on average about 2048 other keys in the hash and roughly one sixteenth of those will have the same hash-code. Since you are adding 4096 keys, this means you will have about 4096*2048/16 = 524,288 collisions, each requiring a call to Struct#eql?. This is a perfect example of O(N^2) behavior, suitable for textbooks.

    If you change Rubinius's Struct#hash (and possibly Fixnum#hash) methods to be more like MRI's, you will reduce the number of collisions and thus reduce this astronomical number.

    The next steps for fixing this are to think about what characteristics a good hash function has, compare the algorithm of rubinius's Struct#hash and Fixnum#hash to their counterparts in other implementations of Ruby, and propose changes to those functions in rubinius if we find that there really is a problem.

    ReplyDelete
  7. Charles Oliver Nutter and Brian: thanks for reading my post and investigating it!

    Charles, which version of Rubinius were you using in your tests?

    ReplyDelete
  8. This comment has been removed by a blog administrator.

    ReplyDelete