Journal of Machine Learning Research, 2003
=====================================
Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. It aims to solve problems that pLSA does not handle well, such as assigning probabilities to unseen documents.
The graphical-model structure of LDA is shown below:
alpha is the parameter of the Dirichlet prior over per-document topic mixtures; beta is a k×V word-probability matrix, where k is the number of topics and V is the vocabulary size.
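The generative process these parameters define can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's code; the values of k, V, and alpha are made-up assumptions.

```python
# Minimal sketch of LDA's generative process (toy values for illustration).
import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 5                               # number of topics, vocabulary size
alpha = np.full(k, 0.5)                   # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=k)  # k x V word-probability matrix

def generate_document(n_words):
    theta = rng.dirichlet(alpha)          # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta)        # choose a topic for this word
        w = rng.choice(V, p=beta[z])      # choose a word from that topic
        words.append(w)
    return words

doc = generate_document(10)
```

Each document gets its own theta drawn from the Dirichlet, which is exactly what distinguishes LDA from pLSA's per-training-document mixtures.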
For each word, summing over the k topics gives how likely that word is under the combination of topics, with each topic's word probability weighted by the probability of that topic in the document.
The full model then takes the following form, giving the distribution of a document:
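In the paper's notation, the marginal distribution of a document w = (w_1, ..., w_N) integrates over the topic mixture theta and sums over the per-word topic assignments z_n:

\[
p(\mathbf{w} \mid \alpha, \beta)
= \int p(\theta \mid \alpha)
\left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
\]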
The figures below illustrate how LDA uses the topic simplex to generate words:
The following shows the difference between pLSA and LDA:
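The contrast can be made concrete by writing out how each model produces a word. In pLSA, the topic mixture p(z | d) is a separate parameter for each training document d:

\[
\text{pLSA:}\quad p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
\]

so the number of parameters grows linearly with the corpus, and there is no natural way to assign a mixture to an unseen document. LDA instead draws the mixture from a Dirichlet prior, theta ~ Dir(alpha), so any new document simply gets its own theta from the same prior.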
========================================================
Comment:
LDA elegantly solves the problems of pLSA and appears to outperform it. Compared to pLSA, however, its model is more complicated and harder to understand, and the amount of computation required for inference is another significant issue.