SIGIR, 1999.
==================================================
This paper presents a statistical view of LSA which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, the probabilistic variant has a sound statistical foundation and defines a proper generative model of the data.
According to Bayes' rule, we can get the following equations, where d is a document, w is a word, and z is a hidden topic:
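$$P(d, w) = P(d)\,P(w \mid d), \qquad P(w \mid d) = \sum_{z} P(w \mid z)\,P(z \mid d)$$

Equivalently, applying Bayes' rule gives the symmetric parameterization

$$P(d, w) = \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z)$$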
We want to minimize the distance (measured by KL divergence) between the original data and the data reconstructed from the latent classes. This converts the original question into maximizing the following log-likelihood function:
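$$\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w)$$

where n(d, w) denotes the number of times word w occurs in document d.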
We use the EM algorithm to solve the optimization problem.
E-step
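$$P(z \mid d, w) = \frac{P(z)\,P(d \mid z)\,P(w \mid z)}{\sum_{z'} P(z')\,P(d \mid z')\,P(w \mid z')}$$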
M-step
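$$P(w \mid z) = \frac{\sum_{d} n(d, w)\,P(z \mid d, w)}{\sum_{d, w'} n(d, w')\,P(z \mid d, w')}, \qquad
P(d \mid z) = \frac{\sum_{w} n(d, w)\,P(z \mid d, w)}{\sum_{d', w} n(d', w)\,P(z \mid d', w)}$$

$$P(z) = \frac{1}{R} \sum_{d, w} n(d, w)\,P(z \mid d, w), \qquad R \equiv \sum_{d, w} n(d, w)$$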
Note that the result of the traditional EM method depends heavily on the initial probabilities, so the authors proposed tempered EM (TEM), which is derived from deterministic annealing. The formula in the E-step then becomes
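$$P_{\beta}(z \mid d, w) = \frac{\left[P(z)\,P(d \mid z)\,P(w \mid z)\right]^{\beta}}{\sum_{z'} \left[P(z')\,P(d \mid z')\,P(w \mid z')\right]^{\beta}}$$

where the inverse temperature β ≤ 1 controls the annealing: β = 1 recovers standard EM, while lowering β smooths the posterior, which mitigates both overfitting and sensitivity to the initial values.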
The following is the illustration (physical meaning) of the hidden topics.

The conditional distributions P(w|z1), P(w|z2), ..., P(w|zn) can be viewed as basis vectors for representing a document. The reconstructed document distribution P(w|d) must lie in the convex region (a sub-simplex) spanned by these vectors, because the mixing weights P(z|d) are convex coefficients.
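To make the algorithm concrete, below is a minimal NumPy sketch of the tempered EM updates above (the function name, initialization scheme, and iteration count are illustrative choices, not from the paper):

```python
import numpy as np

def plsa_tem(n_dw, K, n_iters=50, beta=1.0, seed=0):
    """Fit PLSA with (tempered) EM.

    n_dw : (D, W) array of counts n(d, w)
    K    : number of hidden topics z
    beta : inverse temperature; beta = 1.0 gives standard EM
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    # Random initialization of P(z), P(d|z), P(w|z), each normalized
    p_z = rng.random(K); p_z /= p_z.sum()
    p_d_z = rng.random((K, D)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # E-step: P_beta(z|d,w) proportional to [P(z) P(d|z) P(w|z)]^beta
        joint = (p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        p_z_dw = joint / joint.sum(axis=0, keepdims=True)  # shape (K, D, W)

        # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w)
        weighted = n_dw[None, :, :] * p_z_dw
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z
```

With beta = 1.0 this is plain EM; in the paper, beta is instead decreased gradually whenever performance on held-out data deteriorates, rather than being fixed in advance.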
============================================
Comment:
The physical meaning in this paper is strong and easy to understand. It is brilliant to introduce the idea of hidden topics and construct it on a probabilistic model. However, the paper does not give a good way to deal with unseen data.