Friday, June 8, 2012

Learning to rank: from pairwise approach to listwise approach

Learning to rank: from pairwise approach to listwise approach, Z. Cao et al., ICML, 2007
==================================================

Using only a pairwise method to solve the ordinal learning problem induces several problems. First, the objective of learning is formalized as minimizing errors in the classification of document pairs, which departs from the goal of minimizing errors in the ranking of documents. Second, the training process is computationally costly because it is based on document pairs. Third, the assumption that the document pairs are generated i.i.d. is too strong. Fourth, the number of generated document pairs varies largely from query to query, which results in a model biased toward queries with more document pairs.
Thus, in this paper, the authors propose the listwise approach.
The following describes the concept of listwise ranking:


where q(i) are the queries, d(i) is the list of documents associated with query q(i), and the judgment y(i)j represents the relevance degree of d(i)j to q(i).
x(i) are the corresponding feature vectors and z(i) is the list of scores obtained; the objective of learning is formalized as minimizing the total loss with respect to the training data (the formula in the figure above, at the right).
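Since the figure is not reproduced here, the total loss from the paper has the standard form (m queries, listwise loss L, ranking function f):

\sum_{i=1}^{m} L\big(y^{(i)}, z^{(i)}\big), \qquad z^{(i)} = \big(f(x^{(i)}_1), \ldots, f(x^{(i)}_{n^{(i)}})\big)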

In order to measure the loss between two different rankings, we first convert each ranking into a probability distribution and then measure the loss using Cross Entropy. The following definition converts a ranking into a probability distribution (a permutation probability).
To simplify the computation, the top-one permutation probability is calculated instead.
Using this representation, we can then use Cross Entropy as the metric.
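Reconstructed from the paper's definitions, the top-one probability of item j under a score list s, and the cross-entropy loss between the distributions induced by the judgments y and the model scores z, are:

P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^{n} \phi(s_k)}, \qquad
L(y, z) = -\sum_{j=1}^{n} P_y(j) \log P_z(j)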

The phi-function is defined as the exponential function.


Gradient descent is then used to solve this optimization problem.
The ranking function fw is based on a Neural Network model with parameters w.
Given a feature vector x(i)j, fw(x(i)j) assigns a score to it.
Finally, a linear combination is learned that minimizes the loss function measured by Cross Entropy.
The following is the detail of the algorithm.
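The algorithm figure is missing here, so below is a minimal Python sketch of one plausible reading of the training loop, assuming a linear scoring function (the paper actually uses a neural network) and the exponential phi, i.e. softmax top-one probabilities; helper names such as listnet_grad and the (X, y) per-query format are my own:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())              # top-one probabilities with phi = exp
    return e / e.sum()

def listnet_grad(X, y, w):
    """Gradient of the top-one cross-entropy loss for one query list.
    X: (n_docs, n_features) feature vectors, y: relevance judgments, w: weights."""
    P_y = softmax(y)                     # target distribution from the judgments
    P_z = softmax(X @ w)                 # model distribution from predicted scores
    return X.T @ (P_z - P_y)             # gradient of -sum_j P_y(j) log P_z(j)

def train(query_lists, n_features, lr=0.01, n_iter=100):
    """query_lists: iterable of (X, y) pairs, one per query (assumed format)."""
    w = np.zeros(n_features)
    for _ in range(n_iter):
        for X, y in query_lists:         # one gradient step per query, as in listwise training
            w -= lr * listnet_grad(X, y, w)
    return w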

 
==============================================
Comment:
The listwise learning style fixes the drawbacks that pairwise learning has. It is more efficient and has a clearer physical meaning because it trains naturally on ranked lists.




 

Support vector learning for ordinal regression

Support vector learning for ordinal regression, R. Herbrich, ICANN, 1999.
=================================================
The paper aims to convert the ordinal regression problem into a classification problem, and then uses an SVM to solve the reduced problem.

The input pairs are as follows:

The loss is formulated as follows:


The concept of this method is simple; below is a toy example.



It simply converts the ranking problem into a classification problem.
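As a hedged illustration of this reduction (not the paper's exact formulation), one can form difference vectors from ordered pairs and train a standard linear SVM on them; X_train, ranks_train and X_test below are assumed placeholders:

import numpy as np
from sklearn.svm import LinearSVC

def to_pairwise(X, ranks):
    """Turn ordinal data into binary classification on difference vectors."""
    X_diff, y_diff = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if ranks[i] > ranks[j]:                  # i should be ranked above j
                X_diff.append(X[i] - X[j]); y_diff.append(+1)
                X_diff.append(X[j] - X[i]); y_diff.append(-1)
    return np.array(X_diff), np.array(y_diff)

# X_train, ranks_train, X_test are assumed to be given by the application.
X_diff, y_diff = to_pairwise(X_train, ranks_train)
svm = LinearSVC(C=1.0).fit(X_diff, y_diff)
scores = X_test @ svm.coef_.ravel()                  # w . x induces the ranking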
Solving for w gives the linear mapping as follows:



The equation above is similar to the SVM optimization problem.
The picture below shows the result of the mapping.
========================================================
Comment:
The idea is smart and easy to understand, but the paper uses too much notation and too many equations, which makes it hard to read. It would be better to explain the notions in a more accessible way.


Latent Dirichlet allocation

Latent Dirichlet allocation, D. Blei, A. Ng, and M. Jordan
Journal of Machine Learning Research, 2003

=====================================

Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. It aims to solve problems that pLSA does not handle well, such as assigning probabilities to unseen documents.


The structure of LDA is as below:
alpha is the parameter of the Dirichlet distribution, and beta is the k*v word-probability matrix, where k is the number of topics and v is the vocabulary size.
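As a small illustration (my own sketch, not from the paper), the generative process for one document under these parameters can be written as:

import numpy as np

def generate_document(alpha, beta, n_words, rng=np.random.default_rng()):
    """alpha: (k,) Dirichlet parameter; beta: (k, v) topic-word probability matrix."""
    theta = rng.dirichlet(alpha)                  # per-document topic proportions
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # draw a topic for this word
        w = rng.choice(beta.shape[1], p=beta[z])  # draw a word from that topic
        words.append(w)
    return words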


Summing a word's probability over the k topics, weighted by the topic proportions, determines how likely the word is to occur under that combination of topics.


The model then becomes the following, which models the distribution of a document:
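Reconstructed from the paper, the marginal probability of a document w = (w_1, ..., w_N) is:

p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta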

The illustrative figure of how LDA uses the topic simplex to generate words is as follows:
The following shows the difference between pLSA and LDA:
========================================================
Comment:
LDA is brilliant in that it solves the problems of pLSA and seems to outperform it. But compared to pLSA, its model is more complicated and harder to understand, and the amount of computation is another big issue.




Friday, May 4, 2012

A Global Geometric Framework for Nonlinear Dimensionality Reduction

A Global Geometric Framework for Nonlinear Dimensionality Reduction
Joshua B. Tenenbaum,1* Vin de Silva,2 John C. Langford3
Science 2000
======================================================

Below are canonical problems in dimensionality reduction from the domain of visual perception:



The paper proposes a method called Isomap. It builds on classical MDS but seeks to preserve the intrinsic geometry of the data, as captured in the geodesic manifold distances between all pairs of data points. The algorithm has three steps, as below:
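The figure listing the three steps is not reproduced here; roughly they are (1) build a k-nearest-neighbor graph, (2) compute shortest-path (geodesic) distances on that graph, and (3) apply classical MDS to those distances. A minimal numpy/scikit-learn sketch of that reading (assuming the neighborhood graph is connected):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, n_components=2):
    # Step 1: k-nearest-neighbor graph with Euclidean edge weights.
    graph = kneighbors_graph(X, n_neighbors, mode="distance")
    # Step 2: geodesic distances approximated by graph shortest paths (Dijkstra).
    D = shortest_path(graph, method="D", directed=False)
    # Step 3: classical MDS on the geodesic distance matrix.
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                    # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))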


So we can simply see Isomap as an approach that combines the major algorithmic features of PCA
and MDS, but computes the geodesic distances first. The figure below shows the physical meaning of geodesic distance and the effect of sampling.



=======================================================
Summary and Comment:

Compared to MDS: Isomap has the ability to discover the underlying structure (latent variables) that is nonlinearly embedded in the feature space. It is a global method, which preserves all pairwise distances.













Wednesday, April 25, 2012

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Nonlinear Dimensionality Reduction by Locally Linear Embedding
Sam T. Roweis1 and Lawrence K. Saul2
 Science, 2000
======================================================

Locally linear embedding (LLE) is an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs.

The figure below is a toy example showing the dimension reduction from 3-D data to a 2-D manifold subspace. The structure (neighborhood distances) of the data is preserved through the reduction.

The local geometry of these patches is characterized by linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function
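Since the formula figure is missing, for reference the cost function from the paper is

E(W) = \sum_i \Big| \vec{x}_i - \sum_j W_{ij}\, \vec{x}_j \Big|^2, \qquad \text{with } \sum_j W_{ij} = 1 \text{ for each } i.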

 
Below is an illustration of the reconstruction: for a given point, select its neighbors, reconstruct it with linear weights, and preserve this property in the embedded coordinates.
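For a concrete feel (my own usage sketch, not from the paper), scikit-learn's implementation can be run like this; the Swiss-roll data is just an assumed example:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1500, noise=0.05)   # 3-D points on a rolled-up 2-D sheet
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)                             # 2-D neighborhood-preserving embedding
print(Y.shape, lle.reconstruction_error_)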



A common example with face images:

==================================================
Summary and Comment:

The main goal of LLE is neighborhood preservation, which is based on fundamental manifold properties: the local geometry of each sample and its neighbors is preserved. The advantage is that it ignores the large-scale global geometry, which can be more efficient than computing all pairwise distances.

But like many other manifold learning methods, it must assume that the data are well sampled and sufficient, and that each sample and its neighbors lie on or close to a locally
linear patch (sub-plane) of the manifold. This assumption might be too strong and may keep the method from being generally applicable to many problems.

Wednesday, April 11, 2012

Probabilistic latent semantic indexing

Probabilistic latent semantic indexing, T. Hofmann,
SIGIR, 1999.
==================================================
This paper presents a statistical view of LSA, which leads to a new model called Probabilistic Latent Semantic Analysis (PLSA). In contrast to standard LSA, the probabilistic variant has a sound statistical foundation and defines a proper generative model of the data.
According to Bayes' rule, we can get the following equations,
where d is a document, w is a word, and z is a hidden topic:

We want to minimize the distance (measured by KL divergence) between the original data and the data reconstructed using the latent classes. This is equivalent to maximizing the following log-likelihood function:
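Reconstructed from the paper, with n(d, w) the count of word w in document d:

\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w), \qquad P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)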

The EM algorithm is used to solve this optimization problem.
E-step

M-step
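The update formulas appear only in the (missing) figures, so here are the standard pLSA updates for reference:

\text{E-step:}\quad P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}

\text{M-step:}\quad P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \quad
P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w), \quad
P(z) \propto \sum_{d, w} n(d, w)\, P(z \mid d, w)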

Note that the result of the traditional EM method depends heavily on the initial probabilities, so the author proposes tempered EM (TEM), which is derived from deterministic annealing. The formula in the E-step then becomes



The following is the illustration (physical meaning) of the hidden topics.



P(w|z1), P(w|z2), ..., P(w|zn) can be viewed as basis vectors for representing a document. The reconstructed document should lie on the sub-simplex spanned by these vectors because of convexity.

============================================

Comment:

The physical meaning of this paper is strong and easy to understand. It is brilliant to introduce the idea of hidden topics and build it into a probabilistic model. But the paper does not give a good way to deal with unseen data.














Wednesday, March 28, 2012

Semi-Supervised Hashing for Scalable Image Retrieval

Semi-Supervised Hashing for Scalable Image Retrieval, Jun Wang, Columbia University
Sanjiv Kumar, Google Research
Shih-Fu Chang, Columbia University
CVPR, 2010
============================================
Hashing methods are popular because they allow approximate but highly efficient search.
There are two main existing categories:

1.Unsupervised hashing
   Unsupervised hashing methods show good performance with metric distances, but in image
   search, semantic similarity is usually given in terms of labeled pairs of images.

2.Supervised hashing
   Supervised hashing methods can handle such semantic similarity, but they are prone to
   overfitting when the labeled data is small or noisy.

In addition, these methods are usually very slow to train. The paper proposes a semi-supervised hashing method that addresses the problems mentioned above.

Background

1.LSH
A key ingredient of Locality Sensitive Hashing (LSH) is mapping similar samples to the same bucket with high probability.

A typical category of LSH functions consists of random projections and thresholds:


A hash table is then constructed from the K-bit codes.


K should be large enough to reduce the size of each bucket, and the collision probability is as follows:
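As an illustrative sketch (mine, not the paper's exact construction), random-hyperplane LSH that produces K-bit codes could look like this; vectors with a small angle between them then fall into the same bucket with high probability:

import numpy as np

class RandomProjectionLSH:
    """K-bit codes from random hyperplanes: h_k(x) = sign(w_k . x)."""
    def __init__(self, dim, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_bits, dim))   # one random hyperplane per bit

    def code(self, x):
        bits = (self.W @ x > 0).astype(int)           # threshold each projection at zero
        return tuple(bits)                            # usable as a bucket key

lsh = RandomProjectionLSH(dim=128, n_bits=32)
bucket = lsh.code(np.random.randn(128))               # similar vectors share many bits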


2.SH
Due to the limitations of random-projection-based LSH, SH (Spectral Hashing) was proposed to design compact binary
codes for ANN search. SH minimizes the following objective:

where k = 1...K.
This is an NP-hard problem; after relaxing the constraints, the optimization is solved using spectral graph analysis. This is probably where the name Spectral Hashing comes from.

3.RBMs
Deep belief networks are learned by stacking Restricted Boltzmann Machines (RBMs) to obtain compact binary codes. A practical implementation of RBMs has two critical stages:
  
     (1) unsupervised pre-training
     (2) supervised finetuning.

RBM-based binary encoding involves estimating a large number of weights, which results in an extremely costly
training procedure and demands sufficient training data for fine-tuning.

Semi-Supervised Hashing

Given a set of n points X = {xi}, i = 1...n, xi ∈ R^D, in which a fraction of the pairs are associated with two categories of label information, M (neighbor pairs) and C (non-neighbor pairs),


we want to maximize the following objective:

Relaxing Objective Function

Here a simple relaxation is used: we replace the sign of the projection with its signed magnitude.
With this relaxation, the new objective can be written directly as a function of W as

Finally, we can get
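As I understand the relaxed problem (my reading, so the exact form of the regularizer may differ from the paper), the projections W come from the top eigenvectors of an "adjusted covariance" that combines the labeled-pair term with a variance term. A rough numpy sketch:

import numpy as np

def ssh_projections(X, X_l, S, n_bits, eta=1.0):
    """X: (D, n) all data; X_l: (D, l) labeled data; S: (l, l) with +1 for
    neighbor pairs (M), -1 for non-neighbor pairs (C), 0 otherwise. (My reading.)"""
    M = X_l @ S @ X_l.T + eta * (X @ X.T)              # adjusted covariance (assumed form)
    eigvals, eigvecs = np.linalg.eigh(M)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]] # top n_bits eigenvectors
    return W                                           # hash bits: sign(W.T @ x)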

Experiments

We can see that the results were really good; SSH outperforms the other methods.


========================================
Comment

The paper proposes a smart and impressive way to take advantage of both supervised and unsupervised methods to improve efficiency and accuracy.
It will be an important method in practical applications.

Wednesday, March 21, 2012

Online dictionary learning for sparse coding

Online dictionary learning for sparse coding, J. Mairal et al.,
ICML 2009
=============================================
Sparse coding is a method that models data vectors as sparse linear combinations of basis elements, and it is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses
on learning the basis set, also called the dictionary, to adapt it to specific data. It proposes a new online optimization algorithm for dictionary learning, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples.

Contributions:
1. optimization of a smooth nonconvex objective function over a convex set
2. an iterative online algorithm that solves this problem by efficiently minimizing at each step a
   quadratic surrogate function of the empirical cost over the set of constraints
3. significantly faster than previous approaches to dictionary learning, on both small and large datasets
   of natural images

1.Problem Statement
2.Online Dictionary Learning:

1.Problem Statement

We define l(x, D) as the optimal value of the ℓ1 sparse coding problem:
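Reconstructed from the paper, with λ a regularization parameter:

l(x, D) \triangleq \min_{\alpha \in \mathbb{R}^{k}} \tfrac{1}{2} \| x - D\alpha \|_2^2 + \lambda \|\alpha\|_1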

The empirical cost function is the average of l(x, D) over the training samples.
Then a convex set of admissible dictionaries is defined:


Convert the problem into a convex optimization problem:


2.Online Dictionary Learning:

It is observed that a Cholesky-based implementation of the LARS-Lasso algorithm, which provides the whole regularization path, can be used to compute the sparse codes efficiently.

Optimizing the Algorithm

1.Handling Fixed-Size Datasets
When dealing with large training sets, it is impossible to store all the past coefficients, but we can reuse the data that appears frequently.

2.Mini-Batch Extension
Speed up the algorithm by drawing several signals at a time rather than only one (see the usage sketch after this list).

3.Purging the Dictionary of Unused Atoms
Replace dictionary atoms that are seldom used.
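As a usage-level sketch (assuming scikit-learn's MiniBatchDictionaryLearning, which implements this style of online mini-batch dictionary learning; the random data is a placeholder for image patches):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

X = np.random.randn(10000, 64)            # placeholder for e.g. 8x8 image patches, one per row
dico = MiniBatchDictionaryLearning(
    n_components=100,                     # dictionary size k
    alpha=1.0,                            # the l1 sparsity penalty (lambda)
    batch_size=256,                       # mini-batch extension
)
dico.fit(X)
D = dico.components_                      # learned dictionary, shape (100, 64)
codes = dico.transform(X[:5])             # sparse codes for a few samples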

The authors also give some clever proofs that are beyond my comprehension.
=====================================================
Pros and Cons:

Pros:
1. This is an extraordinary piece of work in the sparse coding area; it gives us a good and efficient way to apply sparse coding to real-world problems.

Cons:
1. The proofs are difficult to understand; it would be better to give more details, or even a tutorial, for readers.

Wednesday, March 14, 2012

Aggregating local descriptors into a compact image representation

Aggregating local descriptors into a compact image representation,
Herve Jegou, et al.,
Proc. IEEE CVPR'10
================================================
The goal of this paper is to solve the problem of image search on a very large scale, where three constraints have to be considered jointly: the accuracy of the search, its efficiency, and the memory
usage of the representation.

There are three main steps:
1. the representation, i.e., how to aggregate local image descriptors into a vector representation;
2. the dimensionality reduction of these vectors;
3. the indexing algorithm.

1. the representation

The paper proposes a descriptor, derived from both BOF and the Fisher kernel, that aggregates SIFT descriptors and produces a compact representation; it is called VLAD (vector of locally aggregated descriptors).
The vector v is subsequently L2-normalized: v := v/||v||2.
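A minimal sketch of the aggregation as described (my own code; the k-means centroids `centers` are assumed to be learned offline on a training set of SIFT descriptors):

import numpy as np

def vlad(descriptors, centers):
    """descriptors: (n, d) local SIFT descriptors of one image; centers: (k, d) visual words."""
    k, d = centers.shape
    v = np.zeros((k, d))
    # Assign each descriptor to its nearest centroid and accumulate the residual.
    nearest = np.argmin(((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    for i, x in zip(nearest, descriptors):
        v[i] += x - centers[i]
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalization: v := v / ||v||_2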


Then we need to transform an image vector into a code. The coding has the property that the nearest neighbors of a (non-encoded) query vector can be efficiently searched in a set of n encoded database vectors.

The ADC (asymmetric distance computation) approach is used for approximate nearest neighbor search:



2. the dimensionality reduction

Dimensionality reduction is an important step in approximate nearest neighbor search, as it impacts the subsequent indexing method. The standard tool for dimensionality reduction, principal component analysis (PCA), is used.



3. the indexing algorithm.

In the end, the paper focuses on the joint optimization of reduction and indexing: optimizing the dimension
D′ given a fixed constraint on the number of bits B used to represent the D-dimensional VLAD vector x, for instance B = 128 (16 bytes).


 =============================================
Comments:

Pros:
1. The paper is clear and mentions much related work for reference and comparison.
2. It combines and modifies other methods, and really results in good performance.

Cons:
1. It mentions the Fisher kernel, but I cannot find a clear explanation of its relation to the paper and the details.