SimHash or the way to compare quickly two datasets

SimHash is an algorithm created by Moses Charikar, from Google. It is an effective tool to compare easily and very fast two datasets.

What does this really mean? Let's talk a bit about it.

The similarity problem

Computers are very pragmatic. They can determine very fast whether two elements are different or not. It's a binary world: equal or different.

Imagine now we would like to have an idea of the similarity between these two elements. That would be a much bigger problem: a computer is not designed to do such comparison by nature. It's difficult, so it takes resources.

For a better understanding, let's take the following sentences as an example:

Shakespeare produced most of his known work between 1589 and 1613  
Shakespeare produced most of his work after 1589  

How does the computer will determine if the texts are (strictly) equals? It will browse the first text and compare each letter of it with the letter in the same position in the second text.

But the point here is that if the computer find a difference, it stops. It does not use anymore resources, it knows that the strings are strictly different.

That's where similarity is a challenge. If we check for strict equality, we only need to stop on the first difference. If we want a similarity estimation, we have to check the entire text. In math it would be:

$$similarity(A, B) = \frac{A \cap B}{A \cup B}$$

For big datasets, the part \(A \cap B\) is very hard to compute. That's where SimHash is useful.

With SimHash, we will create two fingerprints to replace the datasets A and B:

$$simhash(A) \cap simhash(B)$$

Thus we will compare much smaller elements and the comparison time will be dramatically reduced.

SimHash or the way to create fingerprints

To compare fingerprints, we need an algorithm that generate them using a big dataset. The first idea would be to use hash (md5, sha1), but theses algorithms does not represent well a text as if the text change a bit, the hash change a lot.

SimHash does not behave like this: instead, it's a bit like a very compressed version of the dataset (it does not change a lot if the text does not change a lot).

The official SimHash algorithm is:

  • Define a fingerprint size (for instance 32 bits)
  • Create an array V[] filled with this size of zeros
  • For each element in the dataset, we create a unique hash with md5,sha1 of any other hash algorithm that give same-sized results
  • For each hash, for each bit i in this hash:
    • If the bit is 0, we add 1 to V[i]
    • If the bit is 1, we take 1 from V[i]
  • For each bit j of the global fingerprint:
    • If V[j] >= 0, we set V[j] = 1
    • If V[j] < 0, we set V[j] = 0

It gives us a fingerprint characterizing our text, an approximation of the text data. This fingerprint is a binary number, for instance: 10101011100010001010000101111100.

Now, to find

$$simhash(A) \cap simhash(B)$$

we only have to use a XOR operation:

    10101011100010001010000101111100
XOR 10101011100010011110000101111110  
  = 00000000000000010100000000000010

Here, the 1s in the XOR result are the differences between the two fingerprints. To get an idea of the difference between the original texts, we just have to count the number of 1 and divide it by the total size.

Here, we have 3 ones for 32 characters : the estimation of the difference is 3 / 32 = 0,09375, so the estimation of the similarity is 1 - 3 / 32 = 0,90625
(a bit more than 90%). We have our similarity index!

Real world usage

SimHash is currently used by Google to compare page with its database, to avoid duplicate contents. But we can use it too!

I created a small PHP library to use it programmatically in PHP: SimHashPHP.

The main usage of SimHash is to compare things in a database. For instance, let's imagine we want to find the most similar articles of the one we are currently reading. It appears complicated at the first sight using only SQL. However with SimHash it's not that difficult: we just have to store a fingerprint for each article, and use the XOR operation in SQL to count the 1 in the binary result.

For instance:

SELECT id, title, (LENGTH(CONV(fp ^ ?, 10, 2)) - LENGTH(REPLACE(CONV(fp ^ ?, 10, 2), '1', ''))) / LENGTH('1') AS comparison  
FROM articles  
ORDER BY comparison ASC