A.I. & Optimization

Advanced Machine Learning, Data Mining, and Online Advertising Services

Clustering by Compression

In this post we describe an interesting approach to compute information distance between two objects.

Normalized Compression Distance

In information theory, there is a powerful notion called normalized compression distance which can be used to compute the distance between two objects where objects can be two pictures, two programs, two documents, two musics and etc. Here, we briefly describe the distance notion and explain how to compute it. We also show an application of ncd.

NCD Description

Loosely speaking we want to compute how difficult it is to transform the first object to the other object from information point of view. Lower information changes we need for transforming objects, more similar the objects are. One practical way to measure the information distance between two objects is to use a compressor like Huffman coding. One can compute NCD distance using following equation:


      
      

Here, represents the binary length of compressed version of object . The compressor is the only parameter in the above equation where one for instance can use zlib for the computation.

One Application!

For testing how ncd works in practice, we collected over 80k tweets where people posted tweets about them being sick or having flu. We use ncd measure to find similar tweets. We wrote a simple python script where we used zlib cas the ompression for ncd computation. Following table shows a few pair of tweets along with their computed ncd. Please note when we consider objects to be similar.

Tweet#1 Tweet#2 NCD
I hate being sick! I hate being sick! 0.15
I think im getting sick, mehhh. Think im getting sick!!! 0.35
I hate being sick! hate being sick, fuck offff 0.36
I have the flu Got the flu 0.36
Headache -,- headache. headache. headache. 0.28
I got a headache): I got a big ass headache 0.37
I got a headache): Gtta headache 0.38
You're making my headache worse just please stop talking. Please shut up you're making my headache worse # thanks 0.32

Observations

As we see in our initial results, thanks to ncd's root in information theory it can find similar tweets. In the future, we have plan to continue this work and use ncd to run clustering by compression on flu-related tweets!