Learning To Rank: From Pairwise Approach To Listwise Approach

Q: What are the main differences between the pairwise and listwise approaches in learning to rank, and why might the listwise approach be preferred?

The pairwise approach takes document pairs as instances in learning and formalizes the ranking problem as a classification problem. It labels each pair according to the relative relevance of the two documents and then applies a classification model, such as a Support Vector Machine, directly. However, this approach is computationally costly because the number of document pairs is large, and it minimizes classification errors on pairs rather than ranking errors. In contrast, the listwise approach uses entire lists of documents as instances, so it optimizes the ranked list that the model outputs. It employs a probabilistic model to transform ranking scores into probability distributions, which allows any metric between distributions to serve as the loss function. The listwise approach is preferred because it addresses the ranking problem directly and, in the paper's experiments on three data sets, performed better than the pairwise baselines.
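The pair construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_pairs` is a hypothetical helper, the judgments are made up, and tied documents are skipped (the paper does not say how ties are handled).

```python
from itertools import combinations

def make_pairs(judgments):
    """Build pairwise training instances (j, k, label) from one query's judgments.

    label is +1 if document j is judged more relevant than document k, else -1.
    Assumption: tied documents carry no preference and are skipped.
    """
    pairs = []
    for j, k in combinations(range(len(judgments)), 2):
        if judgments[j] == judgments[k]:
            continue
        pairs.append((j, k, 1 if judgments[j] > judgments[k] else -1))
    return pairs

# Two hypothetical queries with graded judgments (2 = best, 0 = worst).
short_query = make_pairs([2, 1, 0])
long_query = make_pairs([2, 1, 0, 1, 0, 0, 2, 1])
print(len(short_query), len(long_query))
```

The longer query contributes many more pairs than the shorter one, which is the imbalance the paper points to when it says a pairwise model can be biased toward queries with more document pairs.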

Q: How do probability models assist in formulating the listwise loss function within the listwise approach?

Probability models in the listwise approach transform a list of document scores into a probability distribution. The same transformation is applied to the scores produced by the ranking function and to the ground-truth judgments, so the listwise loss function can be any metric that compares the two distributions. The paper introduces two such models, permutation probability and top one probability. Computing the loss on probabilities rather than on raw scores aligns the objective more closely with the goal of predicting the correct ranking.

Q: What are the theoretical advantages of using the listwise approach compared to the pairwise approach regarding ranking probability distributions?

The main theoretical advantage of the listwise approach over the pairwise approach is that it models the ranking problem directly: it optimizes over the entire ranking list via probability distributions rather than over individual document pairs. The objective is therefore the error on the whole ordered list, which matches the goal of applications such as document retrieval. In addition, the permutation and top one probabilities represent the uncertainty of a ranking, which the binary classification framework of the pairwise approach does not capture.

Q: How does the concept of permutation probability contribute to the listwise approach in learning to rank?

Permutation probability gives the likelihood of a specific ranking permutation given a list of scores. It transforms the list of scores into a probability distribution over all possible permutations of the ranked objects. This lets the listwise approach model ranking uncertainty and provides a basis for the listwise loss function: the loss is the distance between the permutation probability distribution derived from the predicted scores and the one derived from the ground-truth judgments.
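As a rough illustration of the permutation probability in Definition 1 of the paper, the following Python sketch enumerates all permutations of three objects. It assumes φ = exp (the choice the paper makes for ListNet), 0-based object indices, and made-up scores; `permutation_probability` is a hypothetical helper name, not code from the paper.

```python
import math
from itertools import permutations

def permutation_probability(scores, perm, phi=math.exp):
    """P_s(pi) = prod_j phi(s_pi(j)) / sum_{k >= j} phi(s_pi(k)).

    perm[j] is the (0-based) object placed at position j.
    """
    weights = [phi(scores[obj]) for obj in perm]
    prob = 1.0
    for j in range(len(weights)):
        # Probability of picking the object at position j among those not yet placed.
        prob *= weights[j] / sum(weights[j:])
    return prob

scores = [3.0, 1.0, 2.0]  # hypothetical scores for objects 0, 1, 2
dist = {p: permutation_probability(scores, p) for p in permutations(range(3))}

for perm, prob in sorted(dist.items(), key=lambda item: -item[1]):
    print(perm, round(prob, 4))
```

With these scores the permutation sorted by descending score receives the largest probability and the reversed permutation the smallest, in line with Theorem 4, and the six probabilities sum to one as Lemma 2 states.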

Q: How is the ListNet method in learning to rank implemented to optimize the listwise loss function?

The ListNet method optimizes the listwise loss function with a Neural Network as the model and Gradient Descent as the optimization algorithm. The network assigns a score to each document, and the scores are transformed into top one probabilities with an exponential function. The resulting top one probabilities form a probability distribution over the documents of a query, and the loss is the Cross Entropy between this distribution and the one derived from the ground-truth judgments. Gradient descent then updates the network parameters iteratively to reduce this loss.

Q: What are the implications of the listwise approach for other ranking-related applications beyond document retrieval?

The paper lists several ranking applications besides document retrieval, including collaborative filtering, expert finding, sentiment analysis, and product rating. Because the listwise approach optimizes ranking lists directly through probability-based loss functions, it could in principle be applied in these settings as well. For example, in collaborative filtering it could be used to align recommendation lists more closely with user preferences, and in sentiment analysis it could be used to rank lists of reviews or comments. The paper itself reports experiments only on document retrieval.

Q: Discuss the role of neural networks and gradient descent in the implementation of the listwise approach within the ListNet method.

In the ListNet method, a neural network serves as the ranking function; in the paper's experiments a linear network is used for simplicity. The network computes a score for each document, and the scores are transformed into top one probabilities using an exponential function. Gradient descent is the optimization algorithm: it iteratively updates the model's parameters to reduce the listwise loss, that is, the Cross Entropy between the predicted and ground-truth top one distributions. The paper gives the time complexity of ListNet as O(m · n_max), compared with O(m · n_max^2) for RankNet, where m is the number of training queries and n_max is the maximum number of documents per query.
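A minimal pure-Python sketch of this training loop, assuming the linear scorer f_w(x) = <w, x> used in the paper's experiments. With that scorer, the gradient in Eq. (3) of the paper reduces to the sum over documents of (P_z(j) - P_y(j)) x_j. The toy queries, the function names, and the hyperparameter values are assumptions made for illustration, not the authors' implementation.

```python
import math

def top_one(scores):
    """Top one probabilities with phi = exp (a softmax over the score list)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score_list(w, X):
    """Linear ranking function f_w(x) = <w, x>, applied to every document."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in X]

def query_loss(w, X, y):
    """Cross entropy between the top one distributions of y and f_w(X)."""
    p_y, p_z = top_one(y), top_one(score_list(w, X))
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))

def query_gradient(w, X, y):
    """Gradient of query_loss: sum_j (P_z(j) - P_y(j)) * x_j."""
    p_y, p_z = top_one(y), top_one(score_list(w, X))
    return [sum((pz - py) * x[d] for py, pz, x in zip(p_y, p_z, X))
            for d in range(len(w))]

def train_listnet(queries, n_features, iterations=200, learning_rate=0.1):
    """Per-query gradient descent, following the structure of Algorithm 1."""
    w = [0.0] * n_features
    for _ in range(iterations):
        for X, y in queries:
            grad = query_gradient(w, X, y)
            w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical toy data: (feature vectors, judgments) for two queries.
queries = [
    ([[1.0, 0.2], [0.5, 0.9], [0.1, 0.4]], [2.0, 1.0, 0.0]),
    ([[0.9, 0.5], [0.2, 0.1], [0.6, 0.8]], [2.0, 0.0, 1.0]),
]
w = train_listnet(queries, n_features=2)
loss_before = sum(query_loss([0.0, 0.0], X, y) for X, y in queries)
loss_after = sum(query_loss(w, X, y) for X, y in queries)
print(loss_before, loss_after)
```

Each update touches every document of a query once, which is consistent with the linear dependence on the list length in the complexity the paper states for ListNet.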

Q: What challenges are associated with calculating permutation probabilities, and how does the top one probability address these challenges?

Calculating a permutation probability distribution directly can be computationally intractable, because it requires considering all possible permutations, and their number, n!, grows factorially with the number of documents n. To address this, the paper introduces the top one probability, the probability of a document being ranked at the top given the scores of all documents. Although it is defined as a sum over permutations, Theorem 6 of the paper shows that it can be computed directly from the scores without enumerating permutations. The n top one probabilities form a distribution over the documents, so they can still be compared with a metric such as Cross Entropy at much lower cost.

Q: In the context of learning to rank, how is the top one probability calculated and utilized?

The top one probability of an object is defined as the sum of the permutation probabilities of all permutations in which that object is ranked first. By Theorem 6 of the paper it equals P_s(j) = φ(s_j) / Σ_k φ(s_k), where φ is an increasing, strictly positive function; ListNet takes φ to be the exponential function. This probability is used in the listwise loss function of the ListNet method, which compares the predicted and ground-truth top one probabilities using Cross Entropy.
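The equivalence between the definition (a sum over permutations) and the closed form can be checked numerically on a small example. The sketch below assumes φ = exp, 0-based object indices, and arbitrary scores; the function names are hypothetical.

```python
import math
from itertools import permutations

def permutation_probability(scores, perm):
    """Permutation probability with phi = exp; perm[j] is the object at position j."""
    weights = [math.exp(scores[obj]) for obj in perm]
    prob = 1.0
    for j in range(len(weights)):
        prob *= weights[j] / sum(weights[j:])
    return prob

def top_one_by_enumeration(scores, j):
    """Definition: sum of P_s(pi) over all permutations that rank object j first."""
    n = len(scores)
    return sum(permutation_probability(scores, perm)
               for perm in permutations(range(n)) if perm[0] == j)

def top_one(scores):
    """Closed form: exp(s_j) / sum_k exp(s_k), computed in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.5, 2.0, -1.0, 1.2]  # hypothetical scores
closed_form = top_one(scores)
enumerated = [top_one_by_enumeration(scores, j) for j in range(len(scores))]
print(closed_form)
print(enumerated)
```

The enumeration visits n! permutations, while the closed form needs a single pass over the n scores, which is why the closed form is the one that scales.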

Q: What experimental evidence supports the effectiveness of the listwise approach over the pairwise approach in learning to rank?

Experiments on three data sets (TREC, OHSUMED, and CSearch) showed that the listwise method ListNet outperformed the pairwise methods Ranking SVM, RankBoost, and RankNet in document retrieval. Accuracy was measured with NDCG and, where binary relevance applies, MAP. The authors take the results as evidence that it is better to employ the listwise approach, with a loss function defined on probability models, than the pairwise approach.

The document discusses learning to rank which involves constructing models to rank objects. It describes the pairwise approach which uses object pairs as training instances and proposes a listwise approach that uses lists of objects. The paper introduces two probability models to define a listwise loss function for learning rankings and uses neural networks and gradient descent for the learning method.

Uploaded by

Laure Prétet


Learning to Rank: From Pairwise Approach to Listwise Approach

Zhe Cao* @...

Tao Qin* @.
Tsinghua University, Beijing, 100084, P. R. China
Tie-Yan Liu @.
Microsoft Research Asia, No.49 Zhichun Road, Haidian District, Beijing 100080, P. R. China
Ming-Feng Tsai* @....
National Taiwan University, Taipei 106, Taiwan
Hang Li @.
Microsoft Research Asia, No.49 Zhichun Road, Haidian District, Beijing 100080, P. R. China

Abstract

The paper is concerned with learning to rank, which is to construct a model or a function for ranking objects. Learning to rank is useful for document retrieval, collaborative filtering, and many other applications. Several methods for learning to rank have been proposed, which take object pairs as ‘instances’ in learning. We refer to them as the pairwise approach in this paper. Although the pairwise approach offers advantages, it ignores the fact that ranking is a prediction task on lists of objects. The paper postulates that learning to rank should adopt the listwise approach in which lists of objects are used as ‘instances’ in learning. The paper proposes a new probabilistic method for the approach. Specifically it introduces two probability models, respectively referred to as permutation probability and top one probability, to define a listwise loss function for learning. Neural Network and Gradient Descent are then employed as model and algorithm in the learning method. Experimental results on information retrieval show that the proposed listwise approach performs better than the pairwise approach.

Microsoft technical report. A short version of this work is published in ICML 2007.

1. Introduction

The central issues of many applications are ranking. These include document retrieval, collaborative filtering, expert finding, anti web spam, sentiment analysis, and product rating. In this paper, we address learning to rank and without loss of generality we take document retrieval as example.

Learning to rank, when applied to document retrieval, is a task as follows. Assume that there is a collection of documents. In retrieval (i.e., ranking), given a query, the ranking function assigns a score to each document, and ranks the documents in descending order of the scores. The ranking order represents relative relevance of documents with respect to the query. In learning, a number of queries are provided; each query is associated with a perfect ranking list of documents; a ranking function is then created using the training data, such that the model can precisely predict the ranking lists in the training data.

Due to its importance, learning to rank has been drawing broad attention in the machine learning community recently. Several methods based on what we call the pairwise approach have been developed and successfully applied to document retrieval. This approach takes document pairs as instances in learning, and formalizes the problem of learning to rank as that of classification. Specifically, in learning it collects document pairs from the ranking lists, and for each document pair it assigns a label representing the relative relevance of the two documents. It then trains a classification model with the labeled data and makes use of the classification model in ranking. The uses of Support Vector Machines (SVM), Boosting, and Neural Network as the classification model lead to the methods of Ranking SVM (Herbrich et al., 1999), RankBoost (Freund et al., 1998),
and RankNet (Burges et al., 2005).

There are advantages with taking the pairwise approach. First, existing methodologies on classification can be directly applied. Second, the training instances of document pairs can be easily obtained in certain scenarios (Joachims, 2002). However, there are also problems with the approach. First, the objective of learning is formalized as minimizing errors in classification of document pairs, rather than minimizing errors in ranking of documents. Second, the training process is computationally costly, as the number of document pairs is very large. Third, the assumption that the document pairs are generated i.i.d. is also too strong. Fourth, the number of generated document pairs varies largely from query to query, which will result in training a model biased toward queries with more document pairs (Cao et al., 2006).

In this paper, we propose employing what we call the listwise approach, in which document lists instead of document pairs are used as instances in learning. The major question then is how to define a listwise loss function, representing the difference between the ranking list output by a ranking model and the ranking list given as ground truth.

We propose a probabilistic method to calculate the listwise loss function. Specifically we transform both the scores of the documents assigned by a ranking function and the explicit or implicit judgments of the documents given by humans into probability distributions. We can then utilize any metric between probability distributions as the loss function. We consider the uses of two models for the transformation; one is referred to as permutation probability and the other top one probability.

We then propose a learning to rank method using the listwise loss function, with Neural Network as model and Gradient Descent as algorithm. We refer to it as ListNet.

We applied ListNet to document retrieval and compared the results of it with those of existing pairwise methods including Ranking SVM, RankBoost, and RankNet. The results on three data sets show that our method outperforms the existing methods, suggesting that it is better to employ the listwise approach than the pairwise approach in learning to rank.

The major contributions of this paper include (1) proposal of the listwise approach, (2) formulation of the listwise loss function on the basis of probability models, (3) development of the ListNet method, (4) empirical verification of the effectiveness of the approach.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 gives a general description on the listwise approach to learning to rank. Probability models for defining a listwise loss function are introduced in Section 4 and the learning method ListNet is explained in Section 5. Section 6 reports our experimental results. Finally, Section 7 makes conclusions.

2. Related Work

2.1. Learning to Rank

Learning to rank is a new and popular topic in machine learning. There is one major approach to learning to rank, referred to as the pairwise approach in this paper. For other approaches, see (Shashua & Levin, 2002; Crammer & Singer, 2001; Lebanon & Lafferty, 2002), for example.

In the pairwise approach, the learning task is formalized as classification of object pairs into two categories (correctly ranked and incorrectly ranked). Herbrich et al. (1999) proposed employing the approach and using the SVM techniques to build the classification model. The method is referred to as Ranking SVM. Freund et al. (1998) proposed performing the task in the same way but by means of Boosting. Burges et al. (2005) also adopted the approach and developed a method called RankNet. They employed Cross Entropy as loss function and Gradient Descent as algorithm to train a Neural Network model.

Learning to rank, particularly the pairwise approach, has been successfully applied to information retrieval. For instance, Joachims (2002) applied Ranking SVM to document retrieval. He developed a method of deriving document pairs for training from users’ click-through data. Burges et al. (2005) applied RankNet to large scale web search. Cao et al. (2006) adapted Ranking SVM to document retrieval by modifying the loss function. See also (Matveeva et al., 2006; Yu, 2005).

2.2. Probability Models on Ranking

In statistics, probability distributions for representing ranking lists of objects and methods for estimation of the distributions have been proposed. For example, following the work by Luce (1959), Plackett (1975) defined probability distributions on ranking lists of objects. He further introduced parameters to characterize the probability distributions and developed a method for estimating the parameters. Plackett applied the model and method to prediction on voting results. In this paper, we make use of similar probability distributions. However, the underlying structure (i.e., parameters) and the fundamental usage (i.e., transformation of scores to probability distributions) of our model differ from those of Plackett’s.

3. Listwise Approach

In this section, we give a general description on learning to rank, with document retrieval as example. Particularly
we describe in detail the listwise approach. In the following descriptions, we use a superscript to indicate the index of queries and a subscript to indicate the index of documents for a specific query.

In training, a set of queries $Q = \{q^{(1)}, q^{(2)}, \cdots, q^{(m)}\}$ is given. Each query $q^{(i)}$ is associated with a list of documents $d^{(i)} = (d_1^{(i)}, d_2^{(i)}, \cdots, d_{n^{(i)}}^{(i)})$, where $d_j^{(i)}$ denotes the $j$-th document and $n^{(i)}$ denotes the size of $d^{(i)}$. Furthermore, each list of documents $d^{(i)}$ is associated with a list of judgments (scores) $y^{(i)} = (y_1^{(i)}, y_2^{(i)}, \cdots, y_{n^{(i)}}^{(i)})$, where $y_j^{(i)}$ denotes the judgment on document $d_j^{(i)}$ with respect to query $q^{(i)}$. The judgment $y_j^{(i)}$ represents the relevance degree of $d_j^{(i)}$ to $q^{(i)}$, and can be a score explicitly or implicitly given by humans. For example, $y_j^{(i)}$ can be the number of clicks on $d_j^{(i)}$ when $d_j^{(i)}$ is retrieved and returned for query $q^{(i)}$ at a search engine (Joachims, 2002). The assumption is that the higher the click-on rate observed for $d_j^{(i)}$ and $q^{(i)}$, the stronger the relevance between them.

A feature vector $x_j^{(i)} = \Psi(q^{(i)}, d_j^{(i)})$ is created from each query-document pair $(q^{(i)}, d_j^{(i)})$, $i = 1, 2, \cdots, m$; $j = 1, 2, \cdots, n^{(i)}$. Each list of features $x^{(i)} = (x_1^{(i)}, \cdots, x_{n^{(i)}}^{(i)})$ and the corresponding list of scores $y^{(i)} = (y_1^{(i)}, \cdots, y_{n^{(i)}}^{(i)})$ then form an ‘instance’. The training set can be denoted as $T = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$.

We then create a ranking function $f$; for each feature vector $x_j^{(i)}$ (corresponding to document $d_j^{(i)}$) it outputs a score $f(x_j^{(i)})$. For the list of feature vectors $x^{(i)}$ we obtain a list of scores $z^{(i)} = (f(x_1^{(i)}), \cdots, f(x_{n^{(i)}}^{(i)}))$. The objective of learning is formalized as minimization of the total losses with respect to the training data.

$$\sum_{i=1}^{m} L(y^{(i)}, z^{(i)}) \qquad (1)$$

where $L$ is a listwise loss function.

In ranking, when a new query $q^{(i')}$ and its associated documents $d^{(i')}$ are given, we construct feature vectors $x^{(i')}$ from them and use the trained ranking function to assign scores to the documents $d^{(i')}$. Finally we rank the documents $d^{(i')}$ in descending order of the scores. We call the learning problem described above the listwise approach to learning to rank.

By contrast, in the pairwise approach, a new training data set $T'$ is created from $T$, in which each feature vector pair $x_j^{(i)}$ and $x_k^{(i)}$ forms a new instance where $j \neq k$, and $+1$ is assigned to the pair if $y_j^{(i)}$ is larger than $y_k^{(i)}$, otherwise $-1$. It turns out that the training data $T'$ is a data set of binary classification. A classification model like SVM can be created. As explained in Section 1, although the pairwise approach has advantages, it also suffers from drawbacks. The listwise approach can naturally deal with the problems, which will be made clearer in Section 6.

4. Probability Models

We propose using probability models to calculate the listwise loss function in Eq. (1). Specifically we map a list of scores to a probability distribution using one of the probability models described in this section and then take any metric between two probability distributions as a loss function. The two models are referred to as permutation probability and top one probability.

4.1. Permutation Probability

Suppose that the set of objects to be ranked are identified with the numbers $1, 2, ..., n$. A permutation $\pi$ on the objects is defined as a bijection from $\{1, 2, ..., n\}$ to itself. We write the permutation as $\pi = \langle \pi(1), \pi(2), ..., \pi(n) \rangle$. Here, $\pi(j)$ denotes the object at position $j$ in the permutation. The set of all possible permutations of $n$ objects is denoted as $\Omega_n$.

Suppose that there is a ranking function which assigns scores to the $n$ objects. We use $s$ to denote the list of scores $s = (s_1, s_2, ..., s_n)$, where $s_j$ is the score of the $j$-th object. Hereafter we sometimes make interchangeable the ranking function and the list of scores given by the ranking function.

We assume that there is uncertainty in the prediction of ranking lists (permutations) using the ranking function. In other words, any permutation is assumed to be possible, but different permutations may have different likelihood calculated based on the ranking function. We define the permutation probability, so that it has desirable properties for representing the likelihood of a permutation (ranking list), given the ranking function.

Definition 1 Suppose that $\pi$ is a permutation on the $n$ objects, and $\phi(\cdot)$ is an increasing and strictly positive function. Then, the probability of permutation $\pi$ given the list of scores $s$ is defined as

$$P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}$$

where $s_{\pi(j)}$ is the score of the object at position $j$ of permutation $\pi$.

Let us consider an example with three objects $\{1, 2, 3\}$ having scores $s = (s_1, s_2, s_3)$. The probabilities of permutations $\pi = \langle 1, 2, 3 \rangle$ and $\pi' = \langle 3, 2, 1 \rangle$ are calculated as follows:

$$P_s(\pi) = \frac{\phi(s_1)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_3)}{\phi(s_3)}.$$
$$P_s(\pi') = \frac{\phi(s_3)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2) + \phi(s_1)} \cdot \frac{\phi(s_1)}{\phi(s_1)}.$$

Lemma 2 The permutation probabilities $P_s(\pi)$, $\pi \in \Omega_n$ form a probability distribution over the set of permutations, i.e., for each $\pi \in \Omega_n$, we have $P_s(\pi) > 0$, and $\sum_{\pi \in \Omega_n} P_s(\pi) = 1$.

Theorem 3 Given any two permutations $\pi$ and $\pi' \in \Omega_n$, if (1) $\pi(p) = \pi'(q)$, $\pi(q) = \pi'(p)$, $p < q$; (2) $\pi(r) = \pi'(r)$, $r \neq p, q$; (3) $s_{\pi(p)} > s_{\pi(q)}$, then $P_s(\pi) > P_s(\pi')$.

Theorem 4 For the $n$ objects, if $s_1 > s_2 > ... > s_n$, then $P_s(\langle 1, 2, ..., n \rangle)$ is the highest permutation probability and $P_s(\langle n, n-1, ..., 1 \rangle)$ is the lowest permutation probability among the permutation probabilities of the $n$ objects.

It is easy to verify that Theorem 4 holds. The proofs for Lemma 2 and Theorem 3 can be found in Appendix.

Theorem 3 indicates that for any ranking list based on the given ranking function, if we exchange the position of an object with higher score and the position of an object with lower score, we obtain a ranking list with lower permutation probability. Theorem 4 indicates that, given a ranking function, the list of objects sorted based on the ranking function has the highest permutation probability, while the list of objects sorted in the inverse order has the lowest permutation probability. That is to say, although all the permutations are assumed to be possible, the permutation sorted by using the ranking function is most likely to occur.

Given two lists of scores, we can first calculate two permutation probability distributions from them, and then calculate the distance between the two distributions as the listwise loss function. Since the number of permutations is $n!$, however, the calculation might be intractable.[1] To cope with the problem, we consider the use of top one probability.

[1] It might be intractable to use "permutation probability" in practice due to its complexity. Permutation probability by itself, however, is a valuable notion for the studies on learning to rank and our approach.

4.2. Top One Probability

The top one probability of an object represents the probability of its being ranked on the top, given the scores of all the objects.

Definition 5 The top one probability of object $j$ is defined as

$$P_s(j) = \sum_{\pi(1) = j,\ \pi \in \Omega_n} P_s(\pi),$$

where $P_s(\pi)$ is the permutation probability of $\pi$ given $s$.

That is to say, the top one probability of object $j$ equals the sum of the permutation probabilities of permutations in which object $j$ is ranked on the top.

One may argue that in order to calculate $n$ top one probabilities, we still need to calculate $n!$ permutation probabilities. Theorem 6 shows that we can calculate top one probability in a different way, which is efficient.

Theorem 6 For top one probability $P_s(j)$, we have

$$P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^{n} \phi(s_k)},$$

where $s_j$ is the score of object $j$, $j = 1, 2, ..., n$.

Lemma 7 Top one probabilities $P_s(j)$, $j = 1, 2, ..., n$ form a probability distribution over the set of $n$ objects.

Theorem 8 Given any two objects $j$ and $k$, if $s_j > s_k$, $j \neq k$, $j, k = 1, 2, ..., n$, then $P_s(j) > P_s(k)$.

See Appendix for a proof of Theorem 6. It is easy to verify that Lemma 7 and Theorem 8 hold.

With the use of top one probability, given two lists of scores we can use any metric to represent the distance (listwise loss function) between the two score lists. For example, when we use Cross Entropy as metric, the listwise loss function becomes

$$L(y^{(i)}, z^{(i)}) = -\sum_{j=1}^{n} P_{y^{(i)}}(j) \log\big(P_{z^{(i)}}(j)\big)$$

5. Learning Method: ListNet

We employ a new learning method for optimizing the listwise loss function based on top one probability, with Neural Network as model and Gradient Descent as optimization algorithm. We refer to the method as ListNet.

Again, let us take document retrieval as example. We denote the ranking function based on the Neural Network model $\omega$ as $f_\omega$. Given a feature vector $x_j^{(i)}$, $f_\omega(x_j^{(i)})$ assigns a score to it. For simplicity, we define $\phi$ in Definition 1 as an exponential function. We then rewrite the top one probability in Theorem 6 as

$$P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^{n} \phi(s_k)} = \frac{\exp(s_j)}{\sum_{k=1}^{n} \exp(s_k)}$$

Given query $q^{(i)}$, the ranking function $f_\omega$ can generate a score list $z^{(i)}(f_\omega) = (f_\omega(x_1^{(i)}), f_\omega(x_2^{(i)}), \cdots, f_\omega(x_{n^{(i)}}^{(i)}))$. Then the top one probability of document $d_j^{(i)}$ is calculated as

$$P_{z^{(i)}(f_\omega)}(x_j^{(i)}) = \frac{\exp(f_\omega(x_j^{(i)}))}{\sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)}))}$$
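The Cross Entropy loss over top one probabilities introduced in Section 4.2 can be illustrated with a small Python sketch. It assumes φ = exp for both the judgments and the predicted scores; the function names and the example score lists are hypothetical.

```python
import math

def top_one(scores):
    """Top one probabilities exp(s_j) / sum_k exp(s_k), computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_cross_entropy(y, z):
    """L(y, z) = -sum_j P_y(j) * log(P_z(j)) for one query."""
    p_y, p_z = top_one(y), top_one(z)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))

y = [2.0, 1.0, 0.0]        # hypothetical ground-truth judgments
z_good = [1.5, 0.8, 0.1]   # predicted scores in the same order as y
z_bad = [0.1, 0.8, 1.5]    # predicted scores in the reverse order

loss_self = listwise_cross_entropy(y, y)
loss_good = listwise_cross_entropy(y, z_good)
loss_bad = listwise_cross_entropy(y, z_bad)
print(loss_self, loss_good, loss_bad)
```

The loss is smallest when the predicted distribution coincides with the ground-truth distribution, and a score list that reverses the ground-truth order is penalized more than one that preserves it.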
Learning to Rank: From Pairwise Approach to Listwise Approach

Algorithm 1 Learning Algorithm of ListNet Ranking SVM (Herbrich et al., 1999), and RankBoost (Fre-
Input:training data {(x(1) , y(1) ), (x(2) , y(2) ), ..., (x(m) , y(m) )} und et al., 1998) using three data sets. Note that ListNet is
Parameter: number of iterations T and learning rate η based on top one probability model.
Initialize parameter ω For simplicity, in our experiments we use linear Neural
for t = 1 to T do Network model and omit the constant b in the model:
for i = 1 to m do
Input x(i) of query q(i) to Neural Network and com- fω (x(i) (i)
j ) = #ω, x j $
pute score list z(i) ( fω ) with current ω
Compute gradient &ω using Eq. (3) where #·, ·$ denotes an inner product.2
Update ω = ω − η × &ω
end for 6.1. Data Collections
end for We used three data sets in the experiments: TREC, a data
Output Neural Network model ω set obtained from web track of TREC 2003 (Craswell et al.,
2003); OHSUMED, a benchmark data set for document re-
trieval (Hersh et al., 1994); and CSearch, a data set from a
With Cross Entropy as metric, the loss for query q(i) be- commercial search engine.
comes
TREC consists of web pages crawled from the .gov do-
n(i)
% main in early 2002. There are in total 1,053,110 pages
L(y(i) , z(i) ( fω )) = − Py(i) (x(i) (i)
j ) log(Pz(i) ( fω ) (x j )) (2) and 11,164,829 hyperlinks in the data set. It also contains
j=1 50 queries from the topic distillation task in Web Track of
TREC 2003. The relevance judgments (relevant or irrele-
With some derivation (please refer to Appendix), we can
vant) on the web pages with respect to the queries are given.
get the gradient of L(y(i) , z(i) ( fω )) with respect to the pa-
There are about 20 features extracted from each query doc-
rameter ω as follow
ument pair, including content features and hyperlink fea-
n (i)
∂ fω (x(i) tures.
j )
∂L(y(i) , z(i) ( fω )) %
(i)
&ω = =− Py(i) (x j ) OHSUMED (Hersh et al., 1994) is a collection of docu-
∂ω j=1
∂ω
(3) ments and queries on medicine, consisting of 348,566 doc-
1
n(i)
% ∂ fω (x(i)
j ) uments and 106 queries. There are in total 16,140 query-
+ 'n(i) exp( fω (x(i)
j )) document pairs upon which relevance judgments are made.
j=1 exp( fω (x(i)
j )) j=1
∂ω
The relevance judgments are either definitely relevant, pos-
sibly relevant, or not relevant. The standard features in
Eq.(3) is then used in Gradient Descent. Algorithm 1 shows document retrieval (Nallapati, 2004) are extracted for each
the learning algorithm of ListNet. query-document pair. There are 30 features in total.
Notice that ListNet is similar to RankNet. The only major CSearch is a data set from a commercial web search en-
difference lies in that the former uses document lists as in- gine. It contains about 25,000 queries, and each query has
stances while the latter uses document pairs as instances; one thousand associated documents. There are about 600
the former utilizes a listwise loss function while the latter utilizes a pairwise loss function. Interestingly, when there are only two documents for each query, i.e., $n^{(i)} = 2$, the listwise loss function in ListNet becomes equivalent to the pairwise loss function in RankNet.

The time complexity of RankNet is of order $O(m \cdot n_{\max}^2)$ (Burges et al., 2005), where $m$ denotes the number of training queries and $n_{\max}$ denotes the maximum number of documents per query. In contrast, the time complexity of ListNet is only of order $O(m \cdot n_{\max})$. Therefore, ListNet is more efficient than RankNet.

Footnote 2: Note that Eq. (3) and Algorithm 1 can be applied with any continuous ranking function.

6. Experiments

We compared the ranking accuracies of ListNet with those of three baseline methods: RankNet (Burges et al., 2005),

features in total for each query-document pair, including query dependent features and independent features. This data set provides five levels of relevance judgments, ranging from 4 ("perfect match") to 0 ("bad match").

To get a ground truth rank list for each query, we simply use the ranks of instances to create lists (i.e. discrete relevance judgments). (See footnote 3.)

Footnote 3: This is only one approach for such discrete relevance judgments. If pairwise data is available (such as click-through data as proposed by Joachims (2002)), then we need to employ a different approach, i.e., to create listwise data from pairwise data (for example, using the algorithm proposed by Cohen et al. (1998)). This will be our future work.

In ranking performance evaluation, we adopted two common IR evaluation measures: Normalized Discounted Cumulative Gain (NDCG) (Jarvelin & Kekanainen, 2000) and Mean Average Precision (MAP) (Baeza-Yates & Ribeiro-Neto, 1999). NDCG is designed to measure ranking accuracy when there are more than two levels of relevance judgments. For MAP it is assumed that there are two levels: relevant and irrelevant. In the calculation of MAP for OHSUMED, we treated 'definitely relevant' as relevant and the other two levels as irrelevant. For CSearch, we only used NDCG.

Table 1. Ranking accuracies in terms of MAP

Algorithm   ListNet   RankBoost   Ranking SVM   RankNet
TREC        0.216     0.174       0.193         0.197
OHSUMED     0.305     0.297       0.297         0.303

Figure 1. Ranking accuracies in terms of NDCG@n on TREC
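As a rough illustration of the two measures described above, the following sketch computes NDCG@k and average precision for a single ranked list. It assumes the commonly used gain $2^r - 1$ with a $\log_2(1 + \text{position})$ discount, which this section does not restate, so the exact form should be read as an assumption. The helper names `dcg_at_k`, `ndcg_at_k`, and `average_precision` are hypothetical and are not taken from any tool used in the experiments.

```python
import math

def dcg_at_k(relevances, k):
    """DCG of the top-k items; relevances are graded labels in ranked order.

    Assumed form: gain 2^r - 1, discount log2(1 + position), positions from 1.
    """
    return sum((2 ** r - 1) / math.log2(pos + 2)
               for pos, r in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def average_precision(binary_relevances):
    """Average precision for one query.

    Assumes the list contains every relevant document for the query,
    so the number of hits in the list is used as the normalizer.
    """
    hits = 0
    total = 0.0
    for pos, rel in enumerate(binary_relevances, start=1):
        if rel:
            hits += 1
            total += hits / pos  # precision at this relevant position
    return total / hits if hits else 0.0

def binarize(labels, relevant_level):
    """Map graded labels to 0/1, treating only `relevant_level` as relevant."""
    return [1 if r == relevant_level else 0 for r in labels]
```

MAP is then the mean of `average_precision` over queries. For a three-level collection such as OHSUMED, `binarize(labels, 2)` mirrors the rule of treating only the top level as relevant, assuming the levels are coded 2, 1, 0.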

6.2. Ranking Accuracy

For TREC and OHSUMED we divided each data set into five subsets, and conducted 5-fold cross-validation. In each trial, three folds were used for training, one fold for validation, and one fold for testing. For RankNet and ListNet the validation set in each trial was used to determine the number of iterations. For Ranking SVM it was used to tune the coefficient C and for RankBoost it was used for selection of the number of weak learners. The accuracies we report in this section are those averaged over five trials.

Figure 1 and Table 1 give the results for TREC. We can see that ListNet outperforms the three baseline methods of RankNet, Ranking SVM, and RankBoost in terms of all measures. In particular, for NDCG@1 and NDCG@2, ListNet achieves a gain of more than 4 points, which is about a 10% relative improvement.

Figure 2 and Table 1 show the results for OHSUMED. Again, ListNet outperforms RankNet and RankBoost in terms of all measures. Moreover, ListNet works better than Ranking SVM in terms of NDCG@1, NDCG@2, NDCG@4 and MAP, with the exceptions of NDCG@3 and NDCG@5.

Figure 2. Ranking accuracies in terms of NDCG@n on OHSUMED

CSearch is a large data set, and thus we did not conduct cross-validation. Instead, we randomly selected one third of the data for training, one third for validation, and the remaining one third for testing. Figure 3 shows the results of ListNet, RankNet and RankBoost. Again, ListNet outperforms RankNet and RankBoost in terms of all measures. Since the size of the training data is large, we were not able to run Ranking SVM with the SVMlight tool (Joachims, 1999).

Figure 3. Ranking accuracies in terms of NDCG@n on CSearch

6.3. Discussions

We investigated why the listwise method ListNet can outperform the pairwise methods of RankNet, Ranking SVM, and RankBoost.

As explained in Section 1, for the pairwise approach the number of document pairs varies largely from query to query. As a result, the trained model may be biased toward those queries with more document pairs. We observed this tendency in all data sets. As an example, Table 2 shows the distribution of the number of document pairs per query in OHSUMED. We can see that the distribution is skewed: most queries only have a small number of document pairs (e.g. fewer than 5,000), while a few queries have a large number of document pairs (e.g. more than 15,000). In the listwise approach the loss function is defined on each query, so the problem does not exist. This appears to be one of the reasons for the higher performance of ListNet.

Table 2. Document-pair number distribution

Number of pairs   Number of queries
<5000             61
<10000            29
<15000            8
<20000            6
>=20000           2

The pairwise approach actually employs a 'pairwise' loss function, which might be too loose as an approximation of the performance measures of NDCG and MAP. By contrast, the listwise loss function used in the listwise approach can more properly represent the performance measures. This appears to be another reason that ListNet outperforms RankNet, etc. To verify the correctness of the claim, we further examined the optimization processes of the two methods. We looked at the correlation between the loss functions used by ListNet and RankNet and the measure of NDCG during the learning phase. Note that the major difference between the two methods is the loss function. The results using the TREC data are shown in Figures 4 and 5. From the figures, we can see that the pairwise loss of RankNet does not inversely correlate with NDCG. From iteration 20 to iteration 50, NDCG@5 increases while the pairwise loss of RankNet decreases. However, after iteration 60, NDCG@5 starts to drop, although the pairwise loss is still decreasing. In contrast, the listwise loss of ListNet completely inversely correlates with NDCG. More specifically, from iteration 20 to iteration 50, as the listwise loss decreases, NDCG@5 increases accordingly. After iteration 50, the listwise loss reaches its limit, while NDCG@5 also converges. Moreover, the pairwise loss converges more slowly than the listwise loss, which means RankNet needs to run more iterations in training than ListNet. Similar trends were observed on the results evaluated in terms of MAP.

Figure 4. Pairwise loss v.s. NDCG@5 in RankNet (x-axis: epoch number)

Figure 5. Listwise loss v.s. NDCG@5 in ListNet (x-axis: epoch number)

We conclude that the listwise approach is more effective than the pairwise approach for learning to rank.

7. Conclusions

In this paper, we have proposed a new approach to learning to rank, referred to as the listwise approach. We argue that it is better to take this approach than the traditional pairwise approach in learning to rank. In the listwise approach, instead of using object pairs as instances, we use lists of objects as instances in learning.

The key issue for the listwise approach is to define a listwise loss function. In this paper, we have proposed employing a probabilistic method to solve it. Specifically, we make use of probability models, permutation probability and top one probability, to transform ranking scores into probability distributions. We can then view any metric between probability distributions (e.g., Cross Entropy) as the listwise loss function.

We have then developed a learning method based on the approach, using Neural Network and Gradient Descent. Experimental results with three data sets show that the method works better than the existing pairwise methods such as RankNet, Ranking SVM, and RankBoost, suggesting that it is better to take the listwise approach to learning to rank.

Future work includes exploring the performance of other objective functions besides cross entropy and the performance of other ranking models instead of the linear Neural Network model. We will also investigate the relationship between the listwise loss function and performance measures such as NDCG and MAP used in information retrieval.

Acknowledgments

Bin Gao has given many valuable suggestions for this work. We would also like to thank Kai Yi for his help in our
experiments.

References

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison Wesley.

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. Proceedings of ICML 2005 (pp. 89–96).

Cao, Y. B., Xu, J., Liu, T. Y., Li, H., Huang, Y. L., & Hon, H. W. (2006). Adapting ranking SVM to document retrieval. Proceedings of SIGIR 2006 (pp. 186–193).

Cohen, W. W., Schapire, R. E., & Singer, Y. (1998). Learning to order things. Advances in Neural Information Processing Systems. The MIT Press.

Crammer, K., & Singer, Y. (2001). Pranking with ranking. Proceedings of NIPS 2001.

Craswell, N., Hawking, D., Wilkinson, R., & Wu, M. (2003). Overview of the TREC 2003 web track. Proceedings of TREC 2003 (pp. 78–92).

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. Proceedings of ICML 1998 (pp. 170–178).

Herbrich, R., Graepel, T., & Obermayer, K. (1999). Support vector learning for ordinal regression. Proceedings of ICANN 1999 (pp. 97–102).

Hersh, W. R., Buckley, C., Leone, T. J., & Hickam, D. H. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. Proceedings of SIGIR 1994 (pp. 192–201).

Jarvelin, K., & Kekanainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. Proceedings of SIGIR 2000 (pp. 41–48).

Joachims, T. (1999). Making large-scale support vector machine learning practical. Advances in kernel methods: support vector learning, 169–184.

Joachims, T. (2002). Optimizing search engines using clickthrough data. Proceedings of KDD 2002 (pp. 133–142).

Lebanon, G., & Lafferty, J. (2002). Cranking: Combining rankings using conditional probability models on permutations. Proceedings of ICML 2002 (pp. 363–370).

Luce, R. D. (1959). Individual choice behavior. Wiley.

Matveeva, I., Burges, C., Burkard, T., Laucius, A., & Wong, L. (2006). High accuracy retrieval with multiple nested ranker. Proceedings of SIGIR 2006 (pp. 437–444).

Nallapati, R. (2004). Discriminative models for information retrieval. Proceedings of SIGIR 2004 (pp. 64–71).

Plackett, R. L. (1975). The analysis of permutations. Applied Statistics, 24(2), 193–202.

Shashua, A., & Levin, A. (2002). Taxonomy of large margin principle algorithms for ordinal regression problems. Proceedings of NIPS 2002.

Yu, H. (2005). SVM selective sampling for ranking with application to data retrieval. Proceedings of KDD 2005 (pp. 354–363).

Appendix

A: Proof of Lemma 2

Proof. According to the definition of $\phi(\cdot)$, we have $P_s(\pi) > 0$ for any $\pi \in \Omega_n$. Furthermore,

\[
\begin{aligned}
\sum_{\pi \in \Omega_n} P_s(\pi)
&= \sum_{\pi \in \Omega_n} \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})} \\
&= \sum_{\pi(1)=1}^{n} \; \sum_{\substack{\pi(2)=1 \\ \pi(2) \neq \pi(1)}}^{n} \cdots \sum_{\substack{\pi(q)=1 \\ \pi(q) \neq \pi(l),\, \forall l<q}}^{n} \cdots \sum_{\substack{\pi(n)=1 \\ \pi(n) \neq \pi(l),\, \forall l<n}}^{n} \; \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})} \\
&= \sum_{\pi(1)=1}^{n} \frac{\phi(s_{\pi(1)})}{\sum_{k=1}^{n} \phi(s_{\pi(k)})} \sum_{\substack{\pi(2)=1 \\ \pi(2) \neq \pi(1)}}^{n} \frac{\phi(s_{\pi(2)})}{\sum_{k=2}^{n} \phi(s_{\pi(k)})} \cdots \sum_{\substack{\pi(q)=1 \\ \pi(q) \neq \pi(l),\, \forall l<q}}^{n} \frac{\phi(s_{\pi(q)})}{\sum_{k=q}^{n} \phi(s_{\pi(k)})} \cdots \sum_{\substack{\pi(n)=1 \\ \pi(n) \neq \pi(l),\, \forall l<n}}^{n} \frac{\phi(s_{\pi(n)})}{\sum_{k=n}^{n} \phi(s_{\pi(k)})}.
\end{aligned}
\]

Since for any $1 \le q \le n$,

\[
\sum_{\substack{\pi(q)=1 \\ \pi(q) \neq \pi(l),\, \forall l<q}}^{n} \frac{\phi(s_{\pi(q)})}{\sum_{k=q}^{n} \phi(s_{\pi(k)})} = 1,
\]

we have

\[
\sum_{\pi \in \Omega_n} P_s(\pi) = 1.
\]

Given the two properties above, we conclude that $P_s(\pi)$, $\pi \in \Omega_n$, forms a probability distribution over the set $\Omega_n$.

B: Proof of Theorem 3

Proof. From Definition 1, we have

\[
P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}
\]

and

\[
P_s(\pi') = \prod_{j=1}^{n} \frac{\phi(s_{\pi'(j)})}{\sum_{k=j}^{n} \phi(s_{\pi'(k)})}.
\]
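In order to prove $P_s(\pi) > P_s(\pi')$, we need to prove

\[
\prod_{j=p}^{q} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})} > \prod_{j=p}^{q} \frac{\phi(s_{\pi'(j)})}{\sum_{k=j}^{n} \phi(s_{\pi'(k)})}.
\]

Notice that $\prod_{j=p}^{q} \phi(s_{\pi(j)}) = \prod_{j=p}^{q} \phi(s_{\pi'(j)})$. Thus, we need to prove

\[
\prod_{j=p}^{q} \frac{1}{\sum_{k=j}^{n} \phi(s_{\pi(k)})} > \prod_{j=p}^{q} \frac{1}{\sum_{k=j}^{n} \phi(s_{\pi'(k)})}. \tag{4}
\]

For any $p < j \le q$, because $s_{\pi(p)} > s_{\pi(q)}$ and $\phi(\cdot)$ is an increasing function, we have $\phi(s_{\pi(p)}) > \phi(s_{\pi(q)})$. Consequently, we have

\[
\frac{1}{\sum_{k=j}^{n} \phi(s_{\pi(k)})} > \frac{1}{\sum_{k=j}^{n} \phi(s_{\pi'(k)})}. \tag{5}
\]

With (4) and (5) we can validate that $P_s(\pi) > P_s(\pi')$ holds.

C: Proof of Theorem 6

Proof. From Definition 2, we have

\[
\begin{aligned}
P_s(j) &= \sum_{\pi \in \Omega_n,\, \pi(1)=j} P_s(\pi) \\
&= \sum_{\pi \in \Omega_n,\, \pi(1)=j} \; \prod_{p=1}^{n} \frac{\phi(s_{\pi(p)})}{\sum_{k=p}^{n} \phi(s_{\pi(k)})} \\
&= \sum_{\substack{\pi(1)=j,\ \pi(2)=1 \\ \pi(2) \neq \pi(1)}}^{n} \cdots \sum_{\substack{\pi(q)=1 \\ \pi(q) \neq \pi(l),\, \forall l<q}}^{n} \cdots \sum_{\substack{\pi(n)=1 \\ \pi(n) \neq \pi(l),\, \forall l<n}}^{n} \; \prod_{p=1}^{n} \frac{\phi(s_{\pi(p)})}{\sum_{k=p}^{n} \phi(s_{\pi(k)})} \\
&= \frac{\phi(s_{\pi(1)})}{\sum_{k=1}^{n} \phi(s_{\pi(k)})} \sum_{\substack{\pi(1)=j,\ \pi(2)=1 \\ \pi(2) \neq \pi(1)}}^{n} \frac{\phi(s_{\pi(2)})}{\sum_{k=2}^{n} \phi(s_{\pi(k)})} \cdots \sum_{\substack{\pi(q)=1 \\ \pi(q) \neq \pi(l),\, \forall l<q}}^{n} \frac{\phi(s_{\pi(q)})}{\sum_{k=q}^{n} \phi(s_{\pi(k)})} \cdots \sum_{\substack{\pi(n)=1 \\ \pi(n) \neq \pi(l),\, \forall l<n}}^{n} \frac{\phi(s_{\pi(n)})}{\sum_{k=n}^{n} \phi(s_{\pi(k)})} \\
&= \frac{\phi(s_j)}{\sum_{k=1}^{n} \phi(s_{\pi(k)})}.
\end{aligned}
\]

Since $\pi$ is a permutation of all $n$ objects, the denominator in the last line equals $\sum_{k=1}^{n} \phi(s_k)$.

D: Derivation of gradient

For Eq. (2),

\[
\Delta\omega = \frac{\partial L(y^{(i)}, z^{(i)}(f_\omega))}{\partial \omega}
= -\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}) \, \frac{\partial \log\!\big(P_{z^{(i)}(f_\omega)}(x_j^{(i)})\big)}{\partial \omega}. \tag{6}
\]

Furthermore, from

\[
\log\!\big(P_{z^{(i)}(f_\omega)}(x_j^{(i)})\big) = f_\omega(x_j^{(i)}) - \log\!\Big( \sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)})) \Big),
\]

we have

\[
\frac{\partial \log\!\big(P_{z^{(i)}(f_\omega)}(x_j^{(i)})\big)}{\partial \omega}
= \frac{\partial f_\omega(x_j^{(i)})}{\partial \omega}
- \frac{1}{\sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)}))} \sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)})) \, \frac{\partial f_\omega(x_k^{(i)})}{\partial \omega}. \tag{7}
\]

Substituting Eq. (7) into Eq. (6), we obtain

\[
\begin{aligned}
\Delta\omega &= \frac{\partial L(y^{(i)}, z^{(i)}(f_\omega))}{\partial \omega} \\
&= -\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}) \left[ \frac{\partial f_\omega(x_j^{(i)})}{\partial \omega} - \frac{1}{\sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)}))} \sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)})) \, \frac{\partial f_\omega(x_k^{(i)})}{\partial \omega} \right] \\
&= -\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}) \, \frac{\partial f_\omega(x_j^{(i)})}{\partial \omega}
+ \sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}) \left[ \frac{1}{\sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)}))} \sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)})) \, \frac{\partial f_\omega(x_k^{(i)})}{\partial \omega} \right] \\
&= -\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}) \, \frac{\partial f_\omega(x_j^{(i)})}{\partial \omega}
+ \left[ \frac{1}{\sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)}))} \sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)})) \, \frac{\partial f_\omega(x_k^{(i)})}{\partial \omega} \right] \sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}).
\end{aligned} \tag{8}
\]

Since

\[
\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}) = 1,
\]

we have

\[
\Delta\omega = \frac{\partial L(y^{(i)}, z^{(i)}(f_\omega))}{\partial \omega}
= -\sum_{j=1}^{n^{(i)}} P_{y^{(i)}}(x_j^{(i)}) \, \frac{\partial f_\omega(x_j^{(i)})}{\partial \omega}
+ \frac{1}{\sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)}))} \sum_{k=1}^{n^{(i)}} \exp(f_\omega(x_k^{(i)})) \, \frac{\partial f_\omega(x_k^{(i)})}{\partial \omega}, \tag{9}
\]

which is equivalent to Eq. (3).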
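The results in Appendices A, C and D can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it takes $\phi = \exp$, uses a linear scoring function $f_\omega(x) = \omega \cdot x$ (so that $\partial f_\omega(x) / \partial \omega = x$), and builds the ground-truth distribution $P_y$ from top one probabilities of the ground-truth scores. All function names are hypothetical. It brute-forces the permutation probabilities to check Lemma 2 and Theorem 6, and compares the gradient of Eq. (9) with a central finite difference of the cross entropy loss.

```python
import itertools
import math

def top_one(scores):
    """Top one probabilities with phi = exp (a softmax over the scores)."""
    shift = max(scores)  # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def permutation_probability(scores, perm):
    """P_s(pi) from Definition 1 with phi = exp; perm lists object indices by rank."""
    phis = [math.exp(scores[i]) for i in perm]
    prob = 1.0
    for j in range(len(phis)):
        prob *= phis[j] / sum(phis[j:])
    return prob

def top_one_by_enumeration(scores, j):
    """Definition 2: sum of P_s(pi) over all permutations that rank object j first."""
    return sum(permutation_probability(scores, perm)
               for perm in itertools.permutations(range(len(scores)))
               if perm[0] == j)

def linear_scores(w, xs):
    """Assumed ranking function f_w(x) = w . x for each feature vector x."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]

def listwise_loss(w, xs, ys):
    """Cross entropy between top one distributions of ground truth and model scores."""
    p_y = top_one(ys)
    p_z = top_one(linear_scores(w, xs))
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))

def analytic_gradient(w, xs, ys):
    """Eq. (9) for the linear model: sum_j (P_z(x_j) - P_y(x_j)) * x_j."""
    p_y = top_one(ys)
    p_z = top_one(linear_scores(w, xs))
    return [sum((pz - py) * x[d] for py, pz, x in zip(p_y, p_z, xs))
            for d in range(len(w))]

def numeric_gradient(w, xs, ys, eps=1e-6):
    """Central finite-difference approximation of the loss gradient."""
    grad = []
    for d in range(len(w)):
        up, down = list(w), list(w)
        up[d] += eps
        down[d] -= eps
        grad.append((listwise_loss(up, xs, ys) - listwise_loss(down, xs, ys)) / (2 * eps))
    return grad
```

The enumeration costs $n!$ evaluations, so it is only practical for a handful of objects, whereas `top_one` needs a single pass over the scores. That gap is the practical reason for working with top one probabilities rather than full permutation probabilities.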
