Advanced Machine Learning, Data Mining, and Online Advertising Services
In this post we explain the mathematics behind Latent Dirichlet Allocation (LDA) model using collapsed Gibbs sampling technique. Also we present a Java implementation of the LDA model.
One essential problem in NLP is to model the language behind a collection of documents, and specifically to extract the underlying topics from a collection of documents. Here, we model a document as a bag of words in which words ('s) are the observed variables while topics are latent variables. Blei et al. has proposed the LDA as a generative model for solving such NLP problems [1].
One way to approach the problem of topics modeling for a collection of documents is to employ a generative framework and that's how LDA solves the problem. To understand the intuitions behind the LDA model let's look at a science paper published by J. Kleinberg in the Nature: Navigation in a Small World.
With a quick glance, you can see that the paper talks about a few topics: "social networks", "graphs", "search", and "routing". In the real world, we expect that most of documents talk about a few topics. One can argue that our mind expresses a topic somehow with a list of probabilities over words vocabulary .
Fig 1 shows the graphical representation of the LDA model [1]. We can generate a document by first sampling from a multinomial topic distribution for that document
to find the topic
for word
in document
. Next, we can sample word
from
which is the probability distribution of topic
over vocabulary words. Thus, we can use LDA to generate documents given documents distributions over topics along with topics distributions over the vocabulary. This description has been captured in Fig 1 above.
The next key question is how one can train the LDA model. Here the only observed variables are words while
(topics) are latent variables. Also, both
(documents distributions over topics) and
(topics distributions over
) distributions are unknown and should be computed.
Let's define to be the number of words
in document
with topic
. So, we can compute the following counts:

Here, we've marginalized out and
. In words,
is the number of words in document
with topic
and
is the number of times word
has been assigned to topic
[2].
Given the current state of all but one variable (topic assigned to word i in doc j), the conditional probability of
is computed as below:
, where and
are computed as follows:
is the normalization constant:
Here, the subscript means that the corresponding dataum has been excluded in the coun summations
.
Fig 2 shows the LDA algorithm for sampling hidden topic of word i in document j using conditional probability of . Fig 2 shows all the steps required to scan
words in document
when first we compute the probability for each topic using the current state of
as shown in conditional probability. Next, we use the probability weights in order to determine the current topic of the given word with index
in the document
under process.
Running one iteration of Gibbs sampling takes steps where
is the number of words in a document and
is the total number of topics. So, it takes
steps to scan all documents where
is the number of documents in our corpus. We need to run Gibbs sampler for enough number of iterations in order to guarantee the convergence of probability distributions. This makes the overall running time to be
where
is the number of iterations.
We have implemented LDA model using Gibbs sampling technique in Java. You can pull the code from its github repo: LDA implementation in Java. Currently there is not any unit test in place but we have plan to add unit tests to the code in the future.
In the next post we will discuss different applications of the LDA model such as clustering documents and information retrieval. Also we will explain how one can pick parameter as the total number of topics.
Please contact us if you have any questions for the LDA implementation.