Friday, June 8, 2012

Learning to rank: from pairwise approach to listwise approach

Learning to rank: from pairwise approach to listwise approach, Cao et al., ICML, 2007
==================================================

When only the pairwise approach is used to solve the ranking problem, several issues arise. First, the objective of learning is formalized as minimizing errors in the classification of document pairs, which differs from the true goal of minimizing errors in the ranking of documents. Second, the training process is computationally costly because it is based on document pairs, whose number can be very large. Third, the assumption that the document pairs are generated i.i.d. is too strong. Fourth, the number of generated document pairs varies largely from query to query, which results in a model biased toward queries with more document pairs.
Thus, in this paper, the authors propose the listwise approach.
The following describes the setting of listwise ranking:


where q(i) is a query in the training set, d(i) is the list of documents associated with query q(i), and the judgment y(i)j represents the relevance degree of document d(i)j to q(i).
x(i)j is the feature vector derived from the query-document pair (q(i), d(i)j), and z(i) is the list of scores output by the ranking function; the objective of learning is formalized as minimization of the total loss with respect to the training data (the loss formula is sketched below).
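In symbols, the total loss to be minimized over the ranking function f is roughly:

    \min_{f} \sum_{i=1}^{m} L\big(y^{(i)}, z^{(i)}\big), \qquad z^{(i)} = \big(f(x^{(i)}_{1}), \ldots, f(x^{(i)}_{n^{(i)}})\big)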

In order to measure the loss between two different rankings, the score lists are first converted into probability distributions, and the loss is then measured with cross entropy. The following definition converts a list of scores into a permutation probability distribution.
To simplify the computation, the top-one probability of each object is used instead of the full permutation probability.
With this representation, cross entropy can be used as the metric.
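Written out (following the paper's definitions, up to notation), the permutation probability of a permutation π given a score list s, the top-one probability of object j, and the resulting cross-entropy loss are:

    P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}

    P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^{n} \phi(s_k)}

    L\big(y^{(i)}, z^{(i)}\big) = - \sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(j) \, \log P_{z^{(i)}}(j)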

The φ function is defined as the exponential function, so the top-one probability becomes a softmax over the scores.
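With φ(s) = exp(s), the top-one probability is simply:

    P_s(j) = \frac{\exp(s_j)}{\sum_{k=1}^{n} \exp(s_k)}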


Gradient descent is then used to solve this optimization problem.
The ranking function fω is based on a Neural Network model with parameters ω.
Given a feature vector x(i)j, fω(x(i)j) assigns a score to it.
Finally, a linear combination of the features is learned that minimizes the loss measured by cross entropy.
The details of the algorithm (ListNet) are as follows.
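A minimal sketch of a ListNet-style training loop, assuming a linear scoring function and top-one probabilities (the variable names and hyper-parameters are illustrative, not from the paper):

import numpy as np

def top_one_prob(scores):
    """Top-one probability of each item given its scores (a softmax)."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def listnet_train(queries, n_features, lr=0.01, epochs=100):
    """Sketch of a ListNet-style training loop with a linear scoring function.

    queries: list of (X, y) pairs, where X is an (n_docs, n_features) feature
             matrix and y holds the relevance scores for one query.
    Returns the learned weight vector w.
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        for X, y in queries:
            p_y = top_one_prob(np.asarray(y, dtype=float))  # target distribution
            p_z = top_one_prob(X @ w)                        # model distribution
            grad = X.T @ (p_z - p_y)  # gradient of the cross entropy w.r.t. w
            w -= lr * grad
    return w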

 
==============================================
Comment:
The listwise learning style fixes the drawbacks that pairwise learning has. It is more efficient and has a more natural interpretation, since training is based directly on ranked lists.




 

Support vector learning for ordinal regression

Support vector learning for ordinal regression, R. Herbrich, ICANN, 1999.
=================================================
The paper aims to convert the ordinal regression problem into a classification problem, and then uses an SVM to solve the reduced problem.

The input pairs are as follows:

The loss is formulated as follows:
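Concretely, each training example is a difference of feature vectors labeled with the sign of the preference, and the empirical risk counts mis-ordered pairs (a reconstruction consistent with the paper's setup, not a verbatim copy):

    \big(x^{(1)}_i - x^{(2)}_i,\; z_i\big), \qquad z_i = \operatorname{sign}\big(y^{(1)}_i - y^{(2)}_i\big)

    R_{\mathrm{emp}}(w) = \frac{1}{\ell} \sum_{i=1}^{\ell} \mathbf{1}\big[\, z_i \, w^{\top} \big(x^{(1)}_i - x^{(2)}_i\big) \le 0 \,\big]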


The concept of this method is simple; a toy example is shown below.



It simply converts the ranking problem into a classification problem on pairwise difference vectors.
Solving for w yields the linear utility mapping, as follows:
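In the linear case this leads to a large-margin problem over the difference vectors (sketched here up to notation):

    \min_{w, \xi} \;\; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{\ell} \xi_i \quad \text{s.t.} \quad z_i \, w^{\top} \big(x^{(1)}_i - x^{(2)}_i\big) \ge 1 - \xi_i, \;\; \xi_i \ge 0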



The equation above is somewhat like the SVM optimization problem.
The picture below shows the result of the mapping.
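To make the reduction concrete, here is a minimal Python sketch; scikit-learn's LinearSVC stands in for the paper's large-margin solver, and the toy data is hypothetical:

import numpy as np
from sklearn.svm import LinearSVC

def make_pair_examples(X, y):
    """Turn ranked items into classification examples on difference vectors."""
    diffs, labels = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] != y[j]:
                diffs.append(X[i] - X[j])
                labels.append(1 if y[i] > y[j] else -1)
    return np.array(diffs), np.array(labels)

# Hypothetical toy data: four items with two features and ordinal labels
X = np.array([[0.2, 1.0], [0.5, 0.8], [0.9, 0.4], [1.2, 0.1]])
y = np.array([0, 1, 1, 2])

D, t = make_pair_examples(X, y)
clf = LinearSVC(fit_intercept=False).fit(D, t)  # learns w with w.(xi - xj) > 0 when xi is preferred
scores = X @ clf.coef_.ravel()                  # the learned utility induces a ranking
print(scores.argsort()[::-1])                   # item indices from highest to lowest rank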
========================================================
Comment:
The idea is smart and easy to understand. However, the paper uses too much notation and too many equations, which makes it hard to read. It would be better to explain the notions in a more accessible way.


Latent Dirichlet allocation

Latent Dirichlet allocation, D. Blei, A. Ng, and M. Jordan
Journal of Machine Learning Research, 2003

=====================================

Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. It aims to solve the problems that pLSA does not handle well, such as assigning probability to documents not seen during training.


The structure of LDA is shown below:
α is the parameter of the Dirichlet prior over per-document topic proportions, and β is a k×V word-probability matrix, where k is the number of topics and V is the size of the vocabulary.
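In symbols (following the paper), for a document of N words the joint distribution of the topic mixture θ, the topic assignments z, and the words w is:

    p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)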


For each word, summing over the k topics determines how likely the word is under the mixture of topics, weighted by the probability of each topic.
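That is, the probability of a single word given the topic proportions is:

    p(w_n \mid \theta, \beta) = \sum_{i=1}^{k} p(w_n \mid z_n = i, \beta) \, p(z_n = i \mid \theta) = \sum_{i=1}^{k} \theta_i \, \beta_{i, w_n}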


The model then becomes the following, which gives the marginal distribution of a document:
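Integrating over θ and summing over the topic assignments gives the marginal probability of a document:

    p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta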

The illustrative figure of how LDA uses the topic simplex to generate words is as follows:
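To make the generative process concrete, here is a minimal simulation of it (the toy topics and numbers are hypothetical, not from the paper):

import numpy as np

def generate_document(alpha, beta, n_words, rng=np.random.default_rng(0)):
    """Sample one document from the LDA generative process.

    alpha: (k,) Dirichlet parameter over topics
    beta:  (k, V) word-probability matrix, one row per topic
    """
    theta = rng.dirichlet(alpha)                  # per-document topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # topic assignment z_n
        w = rng.choice(beta.shape[1], p=beta[z])  # word w_n drawn from topic z_n
        words.append(w)
    return words

# Hypothetical toy model: 3 topics over a 5-word vocabulary
alpha = np.array([1.0, 1.0, 1.0])
beta = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],
                 [0.10, 0.10, 0.60, 0.10, 0.10],
                 [0.05, 0.05, 0.10, 0.20, 0.60]])
print(generate_document(alpha, beta, n_words=10))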
The following shows the difference between pLSA and LDA.
========================================================
Comment:
LDA is brilliant in that it solves the problems of pLSA and seems to outperform it. However, compared to pLSA, its model is more complicated and harder to understand, and the amount of computation required is another big issue.