Friday, June 8, 2012

Learning to rank: from pairwise approach to listwise approach

Learning to rank: from pairwise approach to listwise approach, Cao et al., ICML, 2007
==================================================

Using only the pairwise approach to solve the ranking (ordinal learning) problem induces several issues. First, the objective of learning is formalized as minimizing errors in the classification of document pairs, which differs from the true goal of minimizing errors in the ranking of documents. Second, the training process is computationally costly, because the number of document pairs can be very large. Third, the assumption that the document pairs are generated i.i.d. is too strong. Fourth, the number of generated document pairs varies largely from query to query, which results in a model biased toward queries with more document pairs.
Thus, in this paper, the authors propose the listwise approach.
The following is the concept of listwise ranking:


where q(i) are the queries, d(i) are the lists of documents associated with query q(i), and the judgment y(i)j represents the relevance degree of document d(i)j to q(i).
x(i)j are the feature vectors derived from each query-document pair and z(i) are the lists of scores produced by the ranking function; the objective of learning is formalized as minimizing the total loss over the training data (the formula at the right of the figure above).

In order to measure the loss between two different rankings, we first convert each ranking (score list) into a probability distribution over permutations, and then measure the loss using Cross Entropy. The following definition converts a ranking into a permutation probability distribution.
To simplify the computation, only the top-one probability is calculated, i.e. the probability that an object is ranked first.
Using this representation, we can then use Cross Entropy as the metric.

The phi-function is defined as an exponential function, so the top-one probability of an object is simply the softmax of its score over the whole score list.
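As a concrete illustration (not the authors' code, and with variable names of my own), here is a minimal numpy sketch of the top-one probabilities with an exponential phi and the cross entropy between the ground-truth list and the predicted list:

```python
import numpy as np

def top_one_prob(scores):
    """Top-one probabilities with phi = exp, i.e. a softmax over the score list."""
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

def listwise_cross_entropy(y_scores, z_scores):
    """Cross entropy between the two top-one distributions (ListNet-style loss)."""
    p_y = top_one_prob(np.asarray(y_scores, dtype=float))
    p_z = top_one_prob(np.asarray(z_scores, dtype=float))
    return -np.sum(p_y * np.log(p_z + 1e-12))

# example: ground-truth relevance scores vs. model scores for one query's documents
print(listwise_cross_entropy([3.0, 1.0, 0.0], [2.5, 0.5, 0.1]))
```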


Gradient descent is then used to solve this optimization problem.
The ranking function fw is based on a neural network model with parameters w.
Given a feature vector x(i)j, fw(x(i)j) assigns a score to it.
Finally, a linear combination of the features is learned that minimizes the loss measured by Cross Entropy.
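As a rough sketch of one training step, assuming a linear scoring function fw(x) = w·x and a single query (this is my own simplification, not the pseudocode from the paper), the gradient of the cross-entropy loss with respect to w is X^T (P_z - P_y):

```python
import numpy as np

def listnet_gradient_step(w, X, y_scores, lr=0.1):
    """One gradient-descent step on the listwise cross-entropy loss for a
    linear scoring function f_w(x) = w.x, on a single query's document list.
    X: (n_docs, n_features); y_scores: ground-truth relevance scores."""
    z = X @ w
    p_y = np.exp(y_scores - y_scores.max()); p_y /= p_y.sum()   # target top-one probs
    p_z = np.exp(z - z.max()); p_z /= p_z.sum()                 # model top-one probs
    grad = X.T @ (p_z - p_y)     # gradient of -sum_j p_y[j] * log p_z[j] w.r.t. w
    return w - lr * grad

# toy usage: 4 documents with 3 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([2.0, 1.0, 0.0, 0.0])
w = np.zeros(3)
for _ in range(200):
    w = listnet_gradient_step(w, X, y)
```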
The following is the detail of the algorithm.

 
==============================================
Comment:
The listwise learning style improves on the drawbacks that pairwise learning has. It is more efficient and has a more natural physical interpretation, since training is done directly on ranked lists.




 

Support vector learning for ordinal regression

Support vector learning for ordinal regression, R. Herbrich, ICANN, 1999.
=================================================
The paper aims to convert the ordinal regression problem into a classification problem, and then use an SVM to solve the reduced problem.

The input pairs are as follows:

The loss is formulated as follows:


The concept of this method is simple; below is a toy example:



It simply converts the ranking problem into a classification problem on pairs of examples.
Solving for w gives the linear mapping, as follows:



The optimization above is very similar to the standard SVM optimization problem.
The picture below shows the result of the mapping.
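To make the reduction concrete, here is a minimal sketch under my own assumptions about the data layout (not the paper's notation or code): build difference vectors for every pair of examples with different ranks, label each difference by which example is ranked higher, and train an ordinary linear SVM on those differences. The learned weight vector then scores items so that higher-ranked items receive higher scores.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_to_pairs(X, ranks):
    """Turn (feature, rank) data into difference vectors labelled by rank order."""
    diffs, labels = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if ranks[i] > ranks[j]:                    # i is ranked above j
                diffs.append(X[i] - X[j]); labels.append(+1)
                diffs.append(X[j] - X[i]); labels.append(-1)
    return np.array(diffs), np.array(labels)

# toy usage: synthetic ordinal labels from a hidden linear utility
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
utility = X @ np.array([1.0, 2.0, 0.0, 0.0, 0.0])
ranks = np.digitize(utility, bins=[-1.0, 1.0])         # three ordinal levels: 0, 1, 2
D, y = rank_to_pairs(X, ranks)
w = LinearSVC(C=1.0).fit(D, y).coef_.ravel()
scores = X @ w                                         # higher score => higher predicted rank
```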
========================================================
Comment:
The idea is smart and easy to understand, but the paper uses a lot of notation and many math equations, which makes it hard to read. It would help to explain the notions in a more general way.


Latent Dirichlet allocation

Latent Dirichlet allocation, D. Blei, A. Ng, and M. Jordan
Journal of Machine Learning Research, 2003

=====================================

Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. It aims to solve the problems that pLSA does not handle well, such as assigning probabilities to unseen documents.


The structure of LDA is as below:
alpha is the parameter vector of the Dirichlet distribution over topic proportions, and beta is the k x V word-probability matrix, where k is the number of topics and V is the size of the vocabulary.


Summing a word's probability over the k topics decides how likely that word is to occur under the combination of topics, with each topic weighted by its probability.
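To make the generative story concrete, here is a minimal sketch of how one document would be sampled under LDA; the vocabulary size, hyperparameters, and variable names are my own toy choices, not anything taken from the paper:

```python
import numpy as np

def generate_document(alpha, beta, doc_length, rng=None):
    """Sample one document from the LDA generative process.
    alpha: (k,) Dirichlet parameters; beta: (k, V) per-topic word distributions."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta = rng.dirichlet(alpha)                  # document-specific topic proportions
    words = []
    for _ in range(doc_length):
        z = rng.choice(len(alpha), p=theta)       # draw a topic for this word position
        w = rng.choice(beta.shape[1], p=beta[z])  # draw a word from that topic
        words.append(w)
    return words

# toy usage: 3 topics over a 10-word vocabulary
k, V = 3, 10
alpha = np.full(k, 0.5)
beta = np.random.default_rng(1).dirichlet(np.ones(V), size=k)
print(generate_document(alpha, beta, doc_length=8))
```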


The model then becomes the following; it models the distribution of a whole document:

The illustrative figures below show how LDA uses the topic simplex to generate words:
The following shows the difference between pLSA and LDA:
========================================================
Comment:
LDA is brilliant in that it fixes the problems of pLSA and seems to outperform it. But compared to pLSA, its model is more complicated and harder to understand, and the amount of computation is another big issue.




Friday, May 4, 2012

A Global Geometric Framework for Nonlinear Dimensionality Reduction

A Global Geometric Framework for Nonlinear Dimensionality Reduction
Joshua B. Tenenbaum,1* Vin de Silva,2 John C. Langford3
Science 2000
======================================================

Below are canonical problems in dimensionality reduction from the domain of visual perception:



The paper proposes a method called Isomap. It builds on classical MDS but seeks to preserve the intrinsic geometry of the data, as captured in the geodesic manifold distances between all pairs of data points. The algorithm has three steps: (1) construct a neighborhood graph over the data points, (2) compute the geodesic distance between every pair of points as the shortest path on that graph, and (3) apply classical MDS to the resulting distance matrix. The details are shown below:
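As a minimal sketch of those three steps using standard scientific-Python tools (a simplified stand-in for the paper's algorithm, assuming the k-nearest-neighbor graph is connected; the names are my own):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, n_components=2):
    # Step 1: neighborhood graph (edge weights = Euclidean distances)
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # Step 2: geodesic distances = shortest paths on the graph
    D = shortest_path(G, method='D', directed=False)
    # Step 3: classical MDS on the geodesic distance matrix
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                    # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]    # keep the largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# toy usage: embed a small 3-D point cloud into 2-D
X = np.random.default_rng(0).normal(size=(200, 3))
Y = isomap(X, n_neighbors=8)
```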


So we could simply see Isomap as an approach that combines the major algorithmic features of PCA and MDS, but computes the geodesic distances in advance. The figure below shows the physical meaning of geodesic distance and the effect of sampling density.



=======================================================
Summary and Comment:

Compared to MDS: Isomap has the ability to discover the underlying structure (latent variables) that is nonlinearly embedded in the feature space. It is a global method, which preserves the distances between all pairs of points.













Wednesday, April 25, 2012

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Nonlinear Dimensionality Reduction by Locally Linear Embedding
Sam T. Roweis1 and Lawrence K. Saul2
 Science, 2000
======================================================

Locally linear embedding (LLE) is an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs.

The figure below is a toy example that shows the dimensionality reduction from a 3-D space to a 2-D manifold subspace. The structure (neighborhood distances) of the data is kept through the reduction.

The local geometry of these patches is characterized by the linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function E(W) = sum_i | x_i - sum_j W_ij x_j |^2, shown below.

 
Below is an illustration of the reconstruction: for a given point, select its neighbors, reconstruct the point with linear weights, and then preserve this property in the embedded coordinates.
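In code, the weight-solving step for a single point might look roughly like the following sketch (a simplified stand-in with my own names, not the paper's algorithm); it solves the local least-squares problem and enforces the sum-to-one constraint:

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """Weights that best reconstruct x from its neighbors, summing to one.
    neighbors: (k, d) array of the k nearest neighbors of x."""
    Z = neighbors - x                        # shift the neighbors so x is at the origin
    C = Z @ Z.T                              # local covariance matrix (k x k)
    C += reg * np.trace(C) * np.eye(len(C))  # regularize in case C is singular
    w = np.linalg.solve(C, np.ones(len(C)))
    return w / w.sum()                       # enforce the sum-to-one constraint

# toy usage: a point and 3 neighbors in 2-D
x = np.array([0.0, 0.0])
nbrs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(reconstruction_weights(x, nbrs))
```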



A common example using face images:

==================================================
Summary and Comment:

The main goal of LLE is neighborhood preservation, which is based on fundamental manifold properties: it preserves the local geometry of each sample and its neighbors. The advantage is that it ignores the large-scale global geometry, which can be more efficient than calculating all pairwise distances.

But like many other manifold learning methods, it must assume that the data are well sampled and sufficiently dense, and that each sample and its neighbors lie on or close to a locally linear patch (sub-plane) of the manifold. This assumption might be too strong and can prevent the method from being applied generally to many problems.

Wednesday, April 11, 2012

Probabilistic latent semantic indexing

Probabilistic latent semantic indexing, T. Hofmann,
SIGIR, 1999.
==================================================
This paper presents a statistical view of LSA, which leads to a new model called Probabilistic Latent Semantic Analysis (pLSA). In contrast to standard LSA, the probabilistic variant has a sound statistical foundation and defines a proper generative model of the data.
According to Bayes' rule, we can get the following equations,
where d is a document, w is a word, and z is a hidden topic:

We want to minimize the distance (measured with KL divergence) between the original data and the data reconstructed using the latent classes. This is equivalent to maximizing the following log-likelihood function:

The EM algorithm is used to solve this optimization problem:
E-step

M-step

Note that the result of the traditional EM method depends heavily on the initial probabilities, so the author proposes tempered EM (TEM), which is derived from deterministic annealing. The formula in the E-step then becomes:



The following is the illustration (physical meaning) of the hidden topics.



P(w|z1), P(w|z2), ..., P(w|zn) can be viewed as basis vectors for representing a document. The reconstructed document lies on the simplex spanned by these vectors because of convexity.
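For concreteness, here is a compact numpy sketch of the plain (un-tempered) EM updates on a word-document count matrix; the variable names, initialization, and toy corpus are my own, not Hofmann's notation, and the dense 3-D posterior array is only suitable for tiny corpora:

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """counts: (n_docs, n_words) term counts. Returns P(z), P(d|z), P(w|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_docs, n_topics));  p_d_z /= p_d_z.sum(axis=0)
    p_w_z = rng.random((n_words, n_topics)); p_w_z /= p_w_z.sum(axis=0)
    for _ in range(n_iters):
        # E-step: P(z|d,w) proportional to P(z) P(d|z) P(w|z)
        post = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate the parameters from expected counts n(d,w) P(z|d,w)
        weighted = counts[:, :, None] * post
        p_w_z = weighted.sum(axis=0); p_w_z /= p_w_z.sum(axis=0)
        p_d_z = weighted.sum(axis=1); p_d_z /= p_d_z.sum(axis=0)
        p_z = weighted.sum(axis=(0, 1)); p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z

# toy usage: 20 documents over a 50-word vocabulary, 3 topics
counts = np.random.default_rng(1).poisson(1.0, size=(20, 50))
p_z, p_d_z, p_w_z = plsa_em(counts, n_topics=3)
```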

============================================

Comment:

The physical meaning in this paper is strong and easy to understand. It is brilliant to introduce the idea of hidden topics and build it into a probabilistic model. But the paper does not give a good way to deal with unseen documents.














Wednesday, March 28, 2012

Semi-Supervised Hashing for Scalable Image Retrieval

Semi-Supervised Hashing for Scalable Image Retrieval
Jun Wang, Columbia University
Sanjiv Kumar, Google Research
Shih-Fu Chang, Columbia University
CVPR, 2010
============================================
Hashing methods are popular because they allow approximate but highly efficient search.
There are two main categories:

1.Unsupervised hashing
   Unsupervised hashing methods show good performance with metric distances but, in image
   search, semantic similarity is usually given in terms of labeled pairs of images.

2.Supervised hashing
   Supervised hashing methods can handle such semantic similarity, but they are prone to
   overfitting when the labeled data is small or noisy.

In addition, these methods are usually very slow to train. The paper proposes a semi-supervised hashing method that addresses the problems mentioned above.

Background

1.LSH
A key ingredient of Locality Sensitive Hashing is mapping similar samples to the same bucket with high probability.

A typical category of LSH functions consists of random projections and thresholds, as follows:


A K-bit hash code (the hash-table key) is constructed by concatenating K such hash functions:


K should be large enough to reduce the size of each bucket, and the collision probability is as follows:
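Before moving on to SH, here is a minimal sketch of random-hyperplane LSH (my own simplified version, with the threshold fixed at zero), just to make the bucket construction concrete:

```python
import numpy as np

def make_lsh(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each bit is the sign of a random projection."""
    W = np.random.default_rng(seed).normal(size=(n_bits, dim))   # one random direction per bit
    def hash_fn(x):
        return tuple((W @ x > 0).astype(int))                    # K-bit code used as the bucket key
    return hash_fn

# toy usage: similar vectors tend to agree on most bits (and share a bucket when K is small)
h = make_lsh(dim=64, n_bits=16)
x = np.random.default_rng(1).normal(size=64)
x_near = x + 0.01 * np.random.default_rng(2).normal(size=64)
print(sum(a == b for a, b in zip(h(x), h(x_near))), "of 16 bits match")
```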


2.SH
Due to the limitations of random-projection-based LSH, SH was proposed to design compact binary
codes for ANN search. SH aims to minimize the following function:


where k = 1...K.
This is an NP-hard problem; after relaxing the constraints, the optimization can be solved using spectral graph analysis. Maybe this is where the name Spectral Hashing comes from.

3.RBMs
A third approach learns deep belief networks by stacking Restricted Boltzmann Machines (RBMs) to obtain compact binary codes. A practical implementation of RBMs has two critical stages:
  
     (1) unsupervised pre-training
     (2) supervised finetuning.

RBM-based binary encoding involves estimating a large number of weights, which results in an extremely costly training procedure and demands sufficient training data for fine-tuning.

Semi-Supervised Hashing

Given a set of n points, X = {x_i}, i = 1...n, x_i ∈ R^D, in which a fraction of the pairs are associated with two categories of label information, M and C.


We want to maximize the following objective:

Relaxing Objective Function

Here a simple relaxation is used: the sign of the projection is replaced with its signed magnitude.
With this relaxation, the new objective can be written directly as a function of W as:

Finally, we obtain the following solution:
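My reading of that solution, as a rough numpy sketch: the relaxed objective reduces to an eigenproblem on an "adjusted covariance" matrix that combines the pairwise label matrix S over the labeled points with an unsupervised variance term. The names and the eta trade-off weight here are my own placeholders, so treat this as an assumption rather than the paper's exact formulation:

```python
import numpy as np

def ssh_projections(X, X_l, S, n_bits, eta=1.0):
    """X: (D, n) all data; X_l: (D, l) labeled data; S: (l, l) pairwise label
    matrix (+1 for similar pairs, -1 for dissimilar pairs, 0 otherwise).
    Returns W: (D, n_bits), the top eigenvectors of the adjusted covariance."""
    M = X_l @ S @ X_l.T + eta * (X @ X.T)            # supervised term + unsupervised variance term
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:n_bits]]  # directions with the largest eigenvalues

def hash_codes(W, X):
    """Binary codes from the projections (one 0/1 bit per projection)."""
    return (W.T @ X > 0).astype(np.uint8)
```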

Experiments

We can see that the results are really good: SSH outperforms the other methods.


========================================
Comment

The paper proposes a smart and impressive way to combine the advantages of supervised and unsupervised methods, improving both efficiency and accuracy.
It should be an important method for practical applications.