SIGIR, 1999.
==================================================
This paper presents a statistical view of LSA which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, the probabilistic variant has a sound statistical foundation and defines a proper generative model of the data.
According to Bayes' rule, we can get the following equations, where d is a document, w is a word, and z is a hidden topic:
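$$P(d, w) = P(d)\,P(w \mid d), \qquad P(w \mid d) = \sum_{z} P(w \mid z)\,P(z \mid d)$$

Equivalently, applying Bayes' rule gives the symmetric parameterization

$$P(d, w) = \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z)$$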
We want to minimize the distance (measured by KL divergence) between the original data and the data reconstructed from the latent classes. This converts the original question into maximizing the following log-likelihood function:
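$$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w)$$

where n(d, w) denotes the number of times word w occurs in document d.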
We use the EM algorithm to solve the optimization problem.
E-step
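$$P(z \mid d, w) = \frac{P(z)\,P(d \mid z)\,P(w \mid z)}{\sum_{z'} P(z')\,P(d \mid z')\,P(w \mid z')}$$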
M-step
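$$P(w \mid z) = \frac{\sum_{d} n(d, w)\,P(z \mid d, w)}{\sum_{d, w'} n(d, w')\,P(z \mid d, w')}, \qquad
P(d \mid z) = \frac{\sum_{w} n(d, w)\,P(z \mid d, w)}{\sum_{d', w} n(d', w)\,P(z \mid d', w)}$$

$$P(z) = \frac{1}{R} \sum_{d, w} n(d, w)\,P(z \mid d, w), \qquad R \equiv \sum_{d, w} n(d, w)$$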
Note that the result of the traditional EM method depends heavily on the initial probabilities, so the authors proposed tempered EM (TEM), which is derived from deterministic annealing. The formula in the E-step then becomes
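$$P_{\beta}(z \mid d, w) = \frac{\left[P(z)\,P(d \mid z)\,P(w \mid z)\right]^{\beta}}{\sum_{z'} \left[P(z')\,P(d \mid z')\,P(w \mid z')\right]^{\beta}}$$

where the inverse temperature β ≤ 1 controls the annealing: β = 1 recovers standard EM, while lowering β smooths the posterior, which mitigates both overfitting and sensitivity to the initial values.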
The following is the illustration (physical meaning) of the hidden topics.

The conditional distributions P(w|z1), P(w|z2), ..., P(w|zn) can be viewed as basis vectors for representing a document. The reconstructed document distribution P(w|d) must lie in the convex region (a sub-simplex) spanned by these vectors, because the mixing weights P(z|d) are convex coefficients.
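To make the algorithm concrete, below is a minimal NumPy sketch of the tempered EM updates above (the function name, initialization scheme, and iteration count are illustrative choices, not from the paper):

```python
import numpy as np

def plsa_tem(n_dw, K, n_iters=50, beta=1.0, seed=0):
    """Fit PLSA with (tempered) EM.

    n_dw : (D, W) array of counts n(d, w)
    K    : number of hidden topics z
    beta : inverse temperature; beta = 1.0 gives standard EM
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    # Random initialization of P(z), P(d|z), P(w|z), each normalized
    p_z = rng.random(K); p_z /= p_z.sum()
    p_d_z = rng.random((K, D)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: P_beta(z|d,w) proportional to [P(z) P(d|z) P(w|z)]^beta
        joint = (p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        p_z_dw = joint / joint.sum(axis=0, keepdims=True)  # shape (K, D, W)

        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w)
        weighted = n_dw[None, :, :] * p_z_dw
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z
```

With beta = 1.0 this is plain EM; in the paper, beta is instead decreased gradually whenever performance on held-out data deteriorates, rather than being fixed in advance.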
============================================
Comment:
The physical meaning in this paper is strong and easy to understand. It is brilliant to introduce the idea of hidden topics and construct it on a probabilistic model. However, the paper does not give a good way to deal with unseen data.