# A.I. & Optimization

## Clustering by Compression

In this post we describe an interesting approach to computing the information distance between two objects.

### Normalized Compression Distance

In information theory, there is a powerful notion called normalized compression distance (NCD) that can be used to compute the distance between two objects, where the objects can be two pictures, two programs, two documents, two pieces of music, and so on. Here, we briefly describe the distance notion and explain how to compute it. We also show an application of NCD.

### NCD Description

Loosely speaking, we want to compute how difficult it is to transform one object into the other from an information point of view: the less information change the transformation requires, the more similar the objects are. One practical way to measure the information distance between two objects is to use a compressor, such as Huffman coding. One can compute the NCD using the following equation:

$$NCD(x,y)=\dfrac{Z(xy)-\min(Z(x),\,Z(y))}{\max(Z(x),\,Z(y))}$$

Here, $Z(x)$ denotes the length in bits of the compressed version of object $x$, and $xy$ is the concatenation of $x$ and $y$. The compressor is the only parameter in the above equation; for instance, one can use zlib for the computation.
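The equation above translates almost directly into code. Below is a minimal sketch using zlib as the compressor $Z$; the function name `ncd` and the compression level are our choices, not something fixed by the definition:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, using zlib as the compressor Z.

    Z(x) is approximated by the byte length of zlib.compress(x).
    """
    zx = len(zlib.compress(x, 9))    # Z(x)
    zy = len(zlib.compress(y, 9))    # Z(y)
    zxy = len(zlib.compress(x + y, 9))  # Z(xy), compressing the concatenation
    return (zxy - min(zx, zy)) / max(zx, zy)

# Identical objects should score close to 0; unrelated ones closer to 1.
print(ncd(b"I hate being sick!", b"I hate being sick!"))
print(ncd(b"I hate being sick!", b"totally unrelated text about the weather"))
```

Note that with a real-world compressor the distance of an object to itself is not exactly 0, because compressing $xy$ carries some header and bookkeeping overhead; the value only approaches the theoretical ideal.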

### One Application!

To test how NCD works in practice, we collected over 80k tweets in which people posted about being sick or having the flu, and used the NCD measure to find similar tweets. We wrote a simple Python script that uses zlib as the compressor for the NCD computation. The following table shows a few pairs of tweets along with their computed NCD. Please note that when $NCD < 0.5$ we consider the objects to be similar.

| Tweet #1 | Tweet #2 | NCD |
| --- | --- | --- |
| I hate being sick! | I hate being sick! | 0.15 |
| I think im getting sick, mehhh. | Think im getting sick!!! | 0.35 |
| I hate being sick! | hate being sick, fuck offff | 0.36 |
| I have the flu | Got the flu | 0.36 |
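A pairwise comparison like the one behind this table can be sketched as follows. The 0.5 threshold comes from the text above; the helper name `similar_pairs` and the tiny sample list are ours, standing in for the real tweet collection:

```python
import zlib
from itertools import combinations

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with zlib as the compressor."""
    zx = len(zlib.compress(x, 9))
    zy = len(zlib.compress(y, 9))
    zxy = len(zlib.compress(x + y, 9))
    return (zxy - min(zx, zy)) / max(zx, zy)

def similar_pairs(tweets, threshold=0.5):
    """Return all pairs of tweets whose NCD falls below the threshold."""
    out = []
    for a, b in combinations(tweets, 2):
        d = ncd(a.encode(), b.encode())
        if d < threshold:
            out.append((a, b, round(d, 2)))
    return out

# Hypothetical sample, standing in for the 80k-tweet collection.
tweets = ["I hate being sick!", "hate being sick, fuck offff", "I have the flu"]
for a, b, d in similar_pairs(tweets):
    print(a, "|", b, "|", d)
```

Comparing all pairs is quadratic in the number of tweets, so on the full 80k collection one would want to bucket or sample the data rather than run it as-is.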