Friday, November 11, 2011

Crippling Hash Collisions in Rubinius

Last weekend, I was happily and innocently coding away in Ruby, when I noticed that my tests were running one hundred times slower in one Ruby implementation than in another. The different implementations naturally have different weaknesses and strengths in performance, but when you find some code that runs 100 times slower there is probably something going very wrong. I figured out the root cause of this problem and I hope you will find it interesting.

I managed to simplify my code to the simplest possible thing that demonstrates the problem:

P = Struct.new :x, :y, :z
hash = {}
(0..15).each do |x|
  (0..15).each do |y|
    (0..15).each do |z|
      key = P.new x, y, z
      hash[key] = true
    end
  end
end

This code looks pretty innocent, right? If I were on the jury I would say "not guilty".

Here's how long the code took to run in various Ruby implementations on my computer (according to the Linux time command):

Implementation Run time
ruby 1.9.2p290 (MRI) 0.020 s
rubinius 1.2.4 3.210 s
rubinius 2.0.0dev 0.894 s

Here are some gratuitous SVG bar graphs:




That's right, you need to use a logarithmic scale to even see the run time of this code under MRI.

I was very curious about why Rubinius was so slow, so I used the profiler built in to Rubinius and discovered that most of the time was being spent searching for matching hash entries (Hash::Entry#match?). That's because the hash function in Rubinius 1.2.4 only generated 16 different hash-codes for the 4096 different keys I put into the hash. Every time a new key was added to the hash, there was a hash collision with approximately one sixteenth of the keys already in the hash, so the new key had to be compared to each of them to make sure it wasn't equal to any of the keys already in the hash. Rubinius 2.0.0dev did better, with 576 unique hash-codes. In MRI 1.9.2, all the hash-codes were unique.

If you didn't understand the paragraph above, read on because I'm going to explain how hash tables work.

As we know, the job of a hash table to associate keys to values. The keys and values can be any Ruby object. Imagine the hash table is a city, and the key-value pairs are houses in the city. Whenever you access an element in the hash table, you are visiting one of the houses.

When you want to visit a house (key-value pair) in the city, the first thing you need to do is compute its hash code, which is like the street address of the house. In Ruby, this is done by calling the key's hash method, which returns an integer. The roads in the hash city are designed so that once you know the address of a house, you can quickly find all the houses that have that address.

In our hash city, it possible that two or more houses will have the same address. This is called a hash collision, but unlike other collisions it doesn't leave body parts lying in the middle of the road. The consequence of a hash collision is relatively mild: after you find the houses that have the address you are looking for, you just need to look at each house individually to see whether it is the house you are looking for. This can increase the time it takes to find the right house. This applies equally to both read and write operations, but in our example code we only did write operations.

For example, if you are looking for a house (key-value pair) with address (hash-code) 34, you would quickly walk to the houses that have address 34. You would then look at each house (key) and see if it is the house you are looking for (using the eql? method). If you find the house, you can stop searching, but otherwise you will have to look at all the houses with that address.

There are different ways to generate hash-codes and for a given set of data some of those methods will produce more hash collisions than others.

In Ruby 1.9, the 4096 keys generated in the code above all had different hash-codes so there were no hash collisions.

In Rubinius 1.2.4, because the hash method was implemented differently, the 4096 keys in the code above had only *16* different addresses. Imagine a city with 4096 houses, but only 16 addresses. It would take forever to find anyone's house!

I hope this illuminated something for you. Let me know what you think in the comments!