Modeling the Internet and the Web
Pierre Baldi
School of Information and Computer Science,
University of California, Irvine, USA
Paolo Frasconi
Department of Systems and Computer Science,
University of Florence, Italy
Padhraic Smyth
School of Information and Computer Science,
University of California, Irvine, USA
Copyright 2003 Pierre Baldi, Paolo Frasconi and Padhraic Smyth
Published by John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,
West Sussex PO19 8SQ, England
Phone (+44) 1243 779777
Email (for orders and customer service enquiries): [email protected]
Visit our Home Page on www.wileyeurope.com or www.wiley.com
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or
transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or
otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of
a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP,
UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed
to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West
Sussex PO19 8SQ, England, or emailed to [email protected], or faxed to (+44) 1243 770620.
This publication is designed to provide accurate and authoritative information in regard to the subject matter
covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services.
If professional advice or other expert assistance is required, the services of a competent professional should
be sought.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic books.
ISBN 0-470-84906-1
Preface xiii
1 Mathematical Background 1
1.1 Probability and Learning from a Bayesian Perspective 1
1.2 Parameter Estimation from Data 4
1.2.1 Basic principles 4
1.2.2 A simple die example 6
1.3 Mixture Models and the Expectation Maximization Algorithm 10
1.4 Graphical Models 13
1.4.1 Bayesian networks 13
1.4.2 Belief propagation 15
1.4.3 Learning directed graphical models from data 16
1.5 Classification 17
1.6 Clustering 20
1.7 Power-Law Distributions 22
1.7.1 Definition 22
1.7.2 Scale-free properties (80/20 rule) 24
1.7.3 Applications to Languages: Zipf's and Heaps' Laws 24
1.7.4 Origin of power-law distributions and Fermi's model 26
1.8 Exercises 27
3 Web Graphs 51
3.1 Internet and Web Graphs 51
3.1.1 Power-law size 53
3.1.2 Power-law connectivity 53
3.1.3 Small-world networks 56
3.1.4 Power law of PageRank 57
3.1.5 The bow-tie structure 58
3.2 Generative Models for the Web Graph and Other Networks 60
3.2.1 Web page growth 60
3.2.2 Lattice perturbation models: between order and disorder 61
3.2.3 Preferential attachment models, or the rich get richer 63
3.2.4 Copy models 66
3.2.5 PageRank models 67
3.3 Applications 68
3.3.1 Distributed search algorithms 68
3.3.2 Subgraph patterns and communities 70
3.3.3 Robustness and vulnerability 72
3.4 Notes and Additional Technical References 73
3.5 Exercises 74
4 Text Analysis 77
4.1 Indexing 77
4.1.1 Basic concepts 77
4.1.2 Compression techniques 79
4.2 Lexical Processing 80
4.2.1 Tokenization 80
4.2.2 Text conflation and vocabulary reduction 82
4.3 Content-Based Ranking 82
4.3.1 The vector-space model 82
4.3.2 Document similarity 83
4.3.3 Retrieval and evaluation measures 85
4.4 Probabilistic Retrieval 86
4.5 Latent Semantic Analysis 88
4.5.1 LSI and text documents 89
4.5.2 Probabilistic LSA 89
4.6 Text Categorization 93
References 257
Index 277
Preface
Since its early ARPANET inception during the Cold War, the Internet has grown
by a staggering nine orders of magnitude. Today, the Internet and the World Wide
Web pervade our lives, having fundamentally altered the way we seek, exchange,
distribute, and process information. The Internet has become a powerful social force,
transforming communication, entertainment, commerce, politics, medicine, science,
and more. It mediates an ever growing fraction of human knowledge, forming both
the largest library and the largest marketplace on planet Earth.
Unlike the invention of earlier media such as the press, photography, or even the
radio, which created specialized passive media, the Internet and the Web impact all
information, converting it to a uniform digital format of bits and packets. In addition,
the Internet and the Web form a dynamic medium, allowing software applications
to control, search, modify, and filter information without human intervention. For
example, email messages can carry programs that affect the behavior of the receiving
computer. This active medium also promotes human intervention in sharing, updating,
linking, embellishing, critiquing, corrupting, etc., information to a degree that far
exceeds what could be achieved with printed documents.
In common usage, the words Internet and Web (or World Wide Web or WWW)
are often used interchangeably. Although they are intimately related, there are of
course some nuances which we have tried to respect. Internet, in particular, is the
more general term and implicitly includes physical aspects of the underlying networks
as well as mechanisms such as email and peer-to-peer activities that are not directly
associated with the Web. The term Web, on the other hand, is associated with the
information stored and available on the Internet. It is also a term that points to other
complex networks of information, such as webs of scientific citations, social relations,
or even protein interactions. In this sense, it is fair to say that a predominant fraction
of our book is about the Web and the information aspects of the Internet. We use Web
every time we refer to the World Wide Web and web when we refer to a broader
class of networks or other kinds of networks, e.g. a web of citations.
As the Internet and the Web continue to expand at an exponential rate, they also
evolve in terms of the devices and processors connected to them, e.g. wireless devices
and appliances. Ever more human domains and activities are ensnared by the Web,
thus creating challenging problems of ownership, security, and privacy. For instance,
we are quite far from having solved the security, privacy, and authentication problems
that would allow us to hold national Internet elections.
For us as scientists, the Web has also become a tool we use on a daily basis for tasks
ranging from the mundane to the intractable: to search and disseminate information,
to exchange views and collaborate, to post job listings, to retrieve and quote (by
Uniform Resource Locator (URL)) bibliographic information, to build Web servers,
and even to compute. There is hardly a branch of computer science that is not affected
by the Internet: not only the most obvious areas such as networking and protocols,
but also security and cryptography; scientific computing; human interfaces, graphics,
and visualization; information retrieval, data mining, machine learning, language/text
modeling and artificial intelligence, to name just a few.
What is perhaps less obvious and central to this book is that not only have the
Web and the Internet become essential tools of scientific enterprise, but they have
also themselves become the objects of active scientific investigation. And not only for
computer scientists and engineers, but also for mathematicians, economists, social
scientists, and even biologists.
There are many reasons why the Internet and the Web are exciting, albeit young,
topics for scientific investigation. These reasons go beyond the need to improve the
underlying technology and to harness the Web for commercial applications. Because
the Internet and the Web can be viewed as dynamic constellations of interconnected
processors and Web pages, respectively, they can be monitored in many ways and
at many different levels of granularity, ranging from packet traffic, to user behavior,
to the graphical structure of Web pages and their hyperlinks. These measurements
provide new types of large-scale data sets that can be scientifically analyzed and
mined at different levels. Thus researchers enjoy unprecedented opportunities to,
for instance:
gather, communicate, and exchange ideas, documents, and information;
monitor a large dynamic network with billions of nodes and one order of magnitude more connections;
gather large training sets of textual or activity data, for the purposes of modeling
and predicting the behavior of millions of users;
analyze and understand interests and relationships within society.
The Web, for instance, can be viewed as an example of a very large distributed and
dynamic system with billions of pages resulting from the uncoordinated actions of
millions of individuals. After all, anyone can post a Web page on the Internet and link
it to any other page. In spite of this complete lack of central control, the graphical
structure of the Web is far from random and possesses emergent properties shared
with other complex graphs found in social, technological, and biological systems.
Examples of properties include the power-law distribution of vertex connectivities
and the small-world property: any two Web pages are usually only a few clicks
away from each other. Similarly, predictable patterns of congestion (e.g. traffic jams)
have also been observed in Internet traffic. While the exploitation of these regularities
may be beneficial to providers and consumers, their mere existence and discovery has
become a topic of basic research.
The emphasis of this book is not on the history of a rapidly evolving field, but rather on what we believe are
the primary relevant methods and algorithms, and a general way of thinking about
modeling of the Web that we hope will prove useful.
Chapter 1 covers in succinct form most of the mathematical background needed
for the following chapters and can be skipped by those with a good familiarity with its
material. It contains an introduction to basic concepts in probabilistic modeling and
machine learning from the Bayesian framework and the theory of graphical models
to mixtures, classification, and clustering; these are all used throughout various
chapters and form a substantial part of the glue of this book.
Chapter 2 provides an introduction to the Internet and the Web and the foundations
of the WWW technologies that are necessary to understand the rest of the book,
including the structure of Web documents, the basics of Internet protocols, Web server
log files, and so forth. Server log files, for instance, are important to thoroughly
understand the analysis of human behavior on the Web in Chapter 7. The chapter also
deals with the basic principles of Web crawlers. Web crawling is essential to gather
information about the Web and in this sense is a prerequisite for the study of the Web
graph in Chapter 3.
Chapter 3 studies the Internet and the Web as large graphs. It describes, models,
and analyzes the power-law distribution of Web sizes, connectivity, PageRank, and
the small-world properties of the underlying graphs. Applications of graphical prop-
erties, for instance to improve search engines, are also covered in this chapter and
further studied in later chapters.
Chapter 4 deals with text analysis in terms of indexing, content-based ranking,
latent semantic analysis, and text categorization, providing the basic components
(together with link analysis) for understanding how to efficiently retrieve information
over the Web.
Chapter 5 builds upon the graphical results of Chapter 3 and deals with link analysis,
inferring page relevance from patterns of connectivity, Web communities, and the
stability and evolution of these concepts with time.
Chapter 6 covers advanced crawling techniques (selective, focused, and distributed
crawling) and Web dynamics. It is essential material for understanding how to
build a new search engine, for instance.
Chapter 7 studies human behavior on the Web. In particular it builds and stud-
ies several probabilistic models of human browsing behavior and also analyzes the
statistical properties of search engine queries.
Finally, Chapter 8 covers various aspects of commerce on the Web, including
analysis of customer Web data, automated recommendation systems, and Web path
analysis for purchase prediction.
Appendix A contains a number of technical sections that are important for reference
and for a thorough understanding of the material, including an informal introduction
to basic concepts in graph theory, a list of standard probability densities, a short
section on Singular Value Decomposition, a short section on Markov chains, and a
brief, critical, overview of information theory.
Notation
In terms of notation, most of the symbols used are listed at the end of the book, in
Appendix B. A symbol such as D represents the data, regardless of the amount or
complexity. Boldface letters are usually reserved for matrices and vectors. Capital
letters are typically used for matrices and random variables, lowercase letters for
scalars and random variable realizations. Greek letters such as θ typically denote
the parameters of a model. Throughout the book P and E are used for probability
and expectation. If X is a random variable, we often write P (x) for P (X = x),
or sometimes just P (X) if no confusion is possible. E[X], var[X], and cov[X, Y ],
respectively, denote the expectation, variance, and covariance associated with the
random variables X and Y with respect to the probability distributions P (X) and
P (X, Y ).
We use the standard notation f(n) = o(g(n)) to denote a function f(n) that satisfies
f(n)/g(n) → 0 as n → ∞, and f(n) = O(g(n)) when there exists a constant C > 0
such that f(n) ≤ Cg(n) as n → ∞. Similarly, we use f(n) = Θ(g(n)) to denote
a function f(n) such that asymptotically there are two constants C1 and C2 with
C1 g(n) ≤ f(n) ≤ C2 g(n). Calligraphic style is reserved for particular functions,
such as error or energy (E ), entropy and relative entropy (H ). Finally, we often deal
with quantities characterized by many indices. Within a given context, only the most
relevant indices are indicated.
Acknowledgements
Over the years, this book has been supported directly or indirectly by grants and
awards from the US National Science Foundation, the National Institutes of Health,
NASA and the Jet Propulsion Laboratory, the Department of Energy and Lawrence
Livermore National Laboratory, IBM Research, Microsoft Research, Sun Microsys-
tems, HNC Software, the University of California MICRO Program, and a Laurel
Wilkening Faculty Innovation Award. Part of the book was written while P.F. was
visiting the School of Information and Computer Science (ICS) at UCI, with partial
funding provided by the University of California. We also would like to acknowledge
the general support we have received from the Institute for Genomics and Bioinfor-
matics (IGB) at UCI and the California Institute for Telecommunications and Infor-
mation Technology (Cal(IT)2 ). Within IGB, special thanks go to the staff, in particular
Suzanne Knight, Janet Ko, Michele McCrea, and Ann Marie Walker. We would like
to acknowledge feedback and general support from many members of Cal(IT)2 and
thank in particular its directors Bill Parker, Larry Smarr, Peter Rentzepis, and Ramesh
Rao, and staff members Catherine Hammond, Ashley Larsen, Doug Ramsey, Stuart
Ross, and Stephanie Sides. We thank a number of colleagues for discussions and
feedback on various aspects of the Web and probabilistic modeling: David Eppstein,
Chen Li, Sharad Mehrotra, and Mike Pazzani at UC Irvine, as well as Albert-László
Barabási, Nicolò Cesa-Bianchi, Monica Bianchini, C. Lee Giles, Marco Gori, David
Heckerman, David Madigan, Marco Maggini, Heikki Mannila, Chris Meek, Amnon
Meyers, Ion Muslea, Franco Scarselli, Giovanni Soda, and Steven Scott. We thank all
of the people who have helped with simulations or provided feedback on the various
versions of this manuscript, especially our students Gianluca Pollastri, Alessandro
Vullo, Igor Cadez, Jianlin Chen, Xianping Ge, Joshua O'Madadhain, Scott White,
Alessio Ceroni, Fabrizio Costa, Michelangelo Diligenti, Sauro Menchetti, and Andrea
Passerini. We also acknowledge Jean-Pierre Nadal, who brought to our attention the
Fermi model of power laws. We thank Xinglian Yie and David O'Hallaron for pro-
viding their data on search engine queries in Chapter 8. We also thank the staff from
John Wiley & Sons, Ltd, in particular Senior Editor Sian Jones and Robert Calver,
and Emma Dain at T&T Productions Ltd. Finally, we acknowledge our families and
friends for their support in the writing of this book.
Mathematical Background
A probability, P (e), can be viewed as a number that reflects our uncertainty about
whether e is true or false in the real world, given whatever information we have avail-
able. This is known as the degree of belief or Bayesian interpretation of probability
(see, for instance, Berger 1985; Box and Tiao 1992; Cox 1964; Gelman et al. 1995;
Jaynes 2003) and is the one that we will use by default throughout this text. In fact,
to be more precise, we should use a conditional probability P (e | I) in general to
represent degree of belief, where I is the background information on which our belief
is based. For simplicity of notation we will often omit this conditioning on I, but it
may be useful to keep in mind that everywhere you see a P (e) for some proposition
e, there is usually some implicit background information I that is known or assumed
to be true.
The Bayesian interpretation of a probability P (e) is a generalization of the more
classic interpretation of a probability as the relative frequency of successes to total
trials, estimated over an infinite number of hypothetical repeated trials (the so-called
frequentist interpretation). The Bayesian interpretation is more useful in general,
since it allows us to make statements about propositions such as the number of Web
pages in existence, where a repeated-trials interpretation would not necessarily apply.
It can be shown that, under a small set of reasonable axioms, degrees of belief can
be represented by real numbers and that when rescaled to the [0, 1] interval these
degrees of confidence must obey the rules of probability and, in particular, Bayes'
theorem (Cox 1964; Jaynes 1986, 2003; Savage 1972). This is reassuring, since it
means that the standard rules of probability still apply whether we are using the degree
of belief interpretation or the frequentist interpretation. In other words, the rules for
manipulating probabilities such as conditioning or the law of total probability remain
the same no matter what semantics we attach to the probabilities.
The Bayesian approach also allows us to think about probability as being a dynamic
entity that is updated as more data arrive: as we receive more data we may naturally
change our degree of belief in certain propositions given these new data. Thus, for
example, we will frequently refer to terms such as P (e | D) where D is some data.
In fact, by Bayes' theorem,
P(e | D) = P(D | e)P(e) / P(D). (1.1)
The interpretation of each of the terms in this equation is worth discussing. P (e) is your
belief in the event e before you see any data at all, referred to as your prior probability
for e or prior degree of belief in e. For example, letting e again be the statement that
the number of Web pages in existence on 1 January 2003 was greater than five billion,
P (e) reects your degree of belief that this statement is true. Suppose you now receive
some data D which is the number of pages indexed by various search engines as of
1 January 2003. To a reasonable approximation we can view these numbers as lower
bounds on the true number and let's say for the sake of argument that all the numbers
are considerably less than ve billion. P (e | D) now reects your updated posterior
belief in e given the observed data and it can be calculated by using Bayes' theorem via
Equation (1.1). The right-hand side of Equation (1.1) includes the prior, so naturally
enough the posterior is proportional to the prior.
The right-hand side also includes P (D | e), which is known as the likelihood of the
data, i.e. the probability of the data under the assumption that e is true. To calculate the
likelihood we must have a probabilistic model that connects the proposition e we are
interested in with the observed data D; this is the essence of probabilistic learning.
For our Web page example, this could be a model that puts a probability distribution
on the number of Web pages that each search engine may find if the conditioning event
is true, i.e. if there are in fact more than five billion Web pages in existence. This could
be a complex model of how the search engines actually work, taking into account all
the various reasons that many pages will not be found, or it might be a very simple
approximate model that says that each search engine has some conditional distribution
on the number of pages that will be indexed, as a function of the total number that
exist. Appendix A provides examples of several standard probability models; these
are in essence the building blocks for probabilistic modeling and can be used as
components either in likelihood models P (D | e) or as priors P (e).
Continuing with Equation (1.1), the likelihood expression reflects how likely the
observed data are, given e and given some model connecting e and the data. If P (D | e)
is very low, this means that the model is assigning a low probability to the observed
data. This might happen, for example, if the search engines hypothetically all reported
numbers of indexed pages in the range of a few million rather than in the billion range.
Of course we have to factor in the alternative hypothesis, ē, here and we must ensure
that both P(e) + P(ē) = 1 and P(e | D) + P(ē | D) = 1 to satisfy the basic axioms
of probability. The normalization constant in the denominator of Equation (1.1) can
be calculated by noting that P(D) = P(D | e)P(e) + P(D | ē)P(ē). It is easy to see
that P(e | D) depends both on the prior and the likelihood in terms of competing
with the alternative hypothesis ē: the larger they are relative to the prior for ē and
the likelihood for ē, the larger our posterior belief in e will be.
Because probabilities can be very small quantities and addition is often easier to
work with than multiplication, it is common to take logarithms of both sides, so that
log P(e | D) = log P(D | e) + log P(e) − log P(D). (1.2)
To apply Equation (1.1) or (1.2) to any class of models, we only need to specify the
prior P (e) and the data likelihood P (D | e).
Having updated our degree of belief in e, from P (e) to P (e | D), we can continue
this process and incorporate more data as they become available. For example, we
might later obtain more data on the size of the Web from a different study; call this
second data set D2. We can use Bayes' rule to write
P(e | D, D2) = P(D2 | e, D) P(e | D) / P(D2 | D). (1.3)
Comparing Equations (1.3) and (1.2) we see that the old posterior P (e | D) plays the
role of the new prior when data set D2 arrives.
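To make the sequential use of Bayes' theorem in Equations (1.1) and (1.3) concrete, here is a minimal numerical sketch in Python. The event e, the prior, and the likelihood values are invented for illustration (they are not estimates from the text), and the second update assumes that D2 is conditionally independent of D given e, so that P(D2 | e, D) = P(D2 | e).

def bayes_update(prior_e, lik_e, lik_not_e):
    """Return P(e | D) given the prior P(e) and the likelihoods P(D | e) and P(D | not e)."""
    evidence = lik_e * prior_e + lik_not_e * (1.0 - prior_e)   # P(D), the denominator in (1.1)
    return lik_e * prior_e / evidence

# Hypothetical numbers: e = 'the Web had more than five billion pages on 1 January 2003'.
p_e = 0.5                                             # prior degree of belief in e
p_e = bayes_update(p_e, lik_e=0.2, lik_not_e=0.6)     # incorporate the first data set D
p_e = bayes_update(p_e, lik_e=0.3, lik_not_e=0.5)     # incorporate D2: the old posterior is the new prior
print(round(p_e, 3))                                  # about 0.167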
The use of priors is a strength of the Bayesian approach, since it allows the incor-
poration of prior knowledge and constraints into the modeling process. In general,
the effects of priors diminish as the number of data points increases. Formally, this is
because the log-likelihood log P (D | e) typically increases linearly with the number
of data points in D, while the prior log P (e) remains constant. Finally, and most impor-
tantly, the effects of different priors, as well as different models and model classes,
can be assessed within the Bayesian framework by comparing the corresponding
probabilities.
The computation of the likelihood is of course model dependent and is not addressed
here in its full generality. Later in the chapter we will briefly look at a variety of
graphical model and mixture model techniques that act as components in a flexible
toolbox for the construction of different types of likelihood functions.
constraints such as smoothness. Note that the term P(D) in (1.4) plays the role of
a normalizing constant that does not depend on the parameters θ, and is therefore
irrelevant for this optimization. If the prior P(θ) is uniform over parameter space,
then the problem reduces to finding the maximum of P(D | θ), or log P(D | θ). This
is known as maximum-likelihood (ML) estimation.
A substantial portion of this book and of machine-learning practice in general is
based on MAP estimation, that is, the minimization of
E(θ) = −log P(D | θ) − log P(θ), (1.5)
or the simpler ML estimation procedure, that is, the minimization of
E(θ) = −log P(D | θ). (1.6)
In many useful and interesting models the function being optimized is complex and
its modes cannot be found analytically. Thus, one must resort to iterative and pos-
sibly stochastic methods such as gradient descent, expectation maximization (EM),
or simulated annealing. In addition, one may also have to settle for approximate or
sub-optimal solutions, since finding global optima of these functions may be computationally
infeasible. Finally it is worth mentioning that mean posterior (MP) estimation
is also used, and can be more robust than the mode (MAP) in certain respects. The
MP is found by estimating θ by its expectation E[θ] with respect to the posterior
P(θ | D), rather than the mode of this distribution.
Whereas finding the optimal model, i.e. the optimal set of parameters, is common
practice, it is essential to note that this is really useful only if the distribution P(θ | D)
is sharply peaked around a unique optimum. In situations characterized by a high
degree of uncertainty and relatively small amounts of available data, this is often not
the case. Thus, a full Bayesian approach focuses on the function P(θ | D) over the
entire parameter space rather than at a single point. Discussion of the full Bayesian
approach is somewhat beyond the scope of this book (see Gelman et al. (1995) for
a comprehensive introductory treatment). In most of the cases we will consider, the
simpler ML or MAP estimation approaches are sufficient to yield useful results; this
is particularly true in the case of Web data sets which are often large enough that the
posterior distribution can be reasonably expected to be concentrated relatively tightly
about the posterior mode.
The reader ought also to be aware that whatever criterion is used to measure the
discrepancy between a model and data (often described in terms of an error or energy
function), such a criterion can always be defined in terms of an underlying probabilistic
model that is amenable to Bayesian analysis. Indeed, if the fit of a model M = M(θ)
with parameters θ is measured by some error function f(θ, D) ≥ 0 to be minimized,
one can always define the associated likelihood as
P(D | M(θ)) = e^(−f(θ,D)) / Z, (1.7)
where Z = ∫ e^(−f(θ,D)) dθ is a normalizing factor (the partition function in statistical
mechanics) that ensures the probabilities integrate to unity. As a result, minimizing
the error function is equivalent to ML estimation or, more generally, MAP estimation,
since Z is a constant and maximizing log P(D | M(θ)) is the same as minimizing
f(θ, D) as a function of θ. For example, when the sum of squared differences is used
as an error function (a rather common practice), this implies an underlying Gaussian
model on the errors. Thus, the Bayesian point of view clarifies the probabilistic
assumptions that underlie any criteria for matching models with data.
Another way of looking at these results is to say that, except for a constant entropy
term, the negative log-likelihood is essentially the relative entropy between the fixed
die probabilities pi and the observed frequencies ni/n. In Chapter 7 we will see how
this idea can be used to quantify how well various sequential models can predict
which Web page a Web surfer will request next.
The observed frequency estimate θi^ML = ni/n is of course intuitive when n is
large. The strong law of large numbers tells us that for large enough values of n, the
observed frequency will almost surely be very close to the true value of θi. But what
happens if n is small relative to m and some of the symbols (or words) are never
observed in D? Do we want to set the corresponding probability to zero? In general
this is not a good idea, since it would lead us to assign probability zero to any new
data set containing one or more symbols that were not observed in D.
This is a problem that is most elegantly solved by introducing a Dirichlet prior
on the space of parameters (Berger 1985; MacKay and Peto 1995a). This approach
is used again in Chapter 7 for modeling Web surfing patterns with Markov chains
and the basic definitions are described in Appendix A. A Dirichlet distribution on a
probability vector θ = (θ1, . . . , θm) with parameters α and q = (q1, . . . , qm) has
the form
D_αq(θ) = [Γ(α) / ∏_i Γ(αq_i)] ∏_{i=1}^m θ_i^(αq_i − 1), (1.13)
with α, θi, qi ≥ 0 and Σi θi = Σi qi = 1. Alternatively, it can be parameterized by
a single vector α, with αi = αqi. When m = 2, it is also called a Beta distribution
(Figure 1.1). For a Dirichlet distribution, E(θi) = qi, var(θi) = qi(1 − qi)/(α + 1),
and cov(θi, θj) = −qi qj/(α + 1). Thus, q is the mean of the distribution, and α
determines how peaked the distribution is around its mean. Dirichlet priors are the
natural conjugate priors for multinomial distributions. In general, we say that a prior
is conjugate with respect to a likelihood when the functional form of the posterior is
identical to the functional form of the prior. Indeed, the likelihood in Equation (1.8) and
the Dirichlet prior have the same functional form. Therefore the posterior, which is
proportional to the product of the likelihood and the prior, also has the same functional form
and must be a Dirichlet distribution itself.
Figure 1.1 Beta distribution, i.e. Dirichlet distribution with m = 2, shown for four parameter
settings: (α1 = 0.5, α2 = 0.8), (α1 = 1, α2 = 1), (α1 = 8, α2 = 8), and (α1 = 20, α2 = 3).
Different shapes can be obtained by varying the parameters α1 and α2. For instance,
α1 = α2 = 1 corresponds to the uniform distribution, while α1 = α2 > 1 corresponds to a
bell-shaped distribution centered on 0.5, with height and width controlled by α1 + α2.
Z is the normalization constant of the Dirichlet distribution and does not depend
on the parameters θi. Thus the MAP optimization problem is very similar to the
one previously solved, except that the counts ni are replaced by ni + αqi − 1. We
immediately get the estimates
θi^MAP = (ni + αqi − 1) / (n + α − m) for all ai ∈ A, (1.15)
provided this estimate is positive. In particular, the effect of the Dirichlet prior is
equivalent to adding pseudocounts to the observed counts. When q is uniform, we
say that the Dirichlet prior is symmetric. Notice that the uniform distribution over θ
is a special case of a symmetric Dirichlet prior, with qi = 1/α = 1/m. It is also clear
from (1.14) that the posterior distribution P(θ | D) is a Dirichlet distribution D_γr
with γ = n + α and ri = (ni + αqi)/(n + α).
The expectation of the posterior is the vector with components ri, which is slightly different from the
MAP estimate (1.15). This suggests using an alternative estimate for θi, the predictive
[Figure omitted: prior, likelihood, and posterior curves for θ1 in the die example. First panel set:
prior with α = 2, q1 = 0.5; likelihood with n = 10, n1 = 1 and θ1^ML = 0.1; posterior with
γ = 12, r1 = 0.167, θ1^MAP = 0.1, and θ1^MP = 0.167. Second panel set: same likelihood
(n = 10, n1 = 1, θ1^ML = 0.1); posterior with γ = 20, r1 = 0.3, θ1^MAP = 0.278, and θ1^MP = 0.3.]
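The values quoted in the figure above can be reproduced from the estimators derived in this section, namely θi^ML = ni/n, θi^MAP = (ni + αqi − 1)/(n + α − m), and θi^MP = ri = (ni + αqi)/(n + α). The short sketch below does this for the two-sided (m = 2) case with the counts and prior settings shown in the figure; it is an illustration of the formulas, not code from the book.

def dirichlet_estimates(counts, alpha, q):
    """ML, MAP, and MP estimates for multinomial parameters under a Dirichlet prior
    with parameters alpha * q (interpreted as pseudocounts)."""
    n, m = sum(counts), len(counts)
    ml = [c / n for c in counts]
    mapest = [(c + alpha * qi - 1) / (n + alpha - m) for c, qi in zip(counts, q)]
    mp = [(c + alpha * qi) / (n + alpha) for c, qi in zip(counts, q)]
    return ml, mapest, mp

# n = 10 observations with the first symbol seen once (n1 = 1, n2 = 9).
print(dirichlet_estimates([1, 9], alpha=2, q=[0.5, 0.5]))    # ML 0.1, MAP 0.1, MP ~0.167
print(dirichlet_estimates([1, 9], alpha=10, q=[0.5, 0.5]))   # ML 0.1, MAP ~0.278, MP 0.3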
better model and make more accurate predictions on future data. A standard class of
models for sequence data is the finite-state Markov chain, described in Appendix A.
The application of Markov chains to the problem of inferring the importance of Web
pages from the Web graph structure is described in the context of the PageRank
algorithm in Chapter 5. A second application of Markov chains to the problem of
predicting the page that a Web surfer will request next is treated in detail in Chapter 7.
P = Σ_{l=1}^K λl Pl, (1.17)
where the λl ≥ 0 are called the mixture coefficients and satisfy Σl λl = 1. The distributions
Pl are called the components of the mixture and have their own parameters
(means, standard deviations, etc.). Mixture distributions provide a flexible way for
modeling complex distributions, combining together simple building-blocks, such as
Gaussian distributions. A review of mixture models can be found in Everitt (1984),
Titterington et al. (1985), McLachlan and Peel (2000), and Hand et al. (2001). Mix-
ture models are used, for instance, in clustering problems, where each component in
the mixture corresponds to a different cluster from which the data can be generated
and the mixture coefcients represent the frequency of each cluster.
To be more precise, imagine a data set D = (d1 , . . . , dn ) and an underlying mixture
model with K components of the form
P(di | M) = Σ_{l=1}^K P(Ml) P(di | Ml) = Σ_{l=1}^K λl P(di | Ml), (1.18)
where λl ≥ 0, Σl λl = 1, and Ml is the model for mixture l. Assuming that the data
points are conditionally independent given the model, we have
P(D | M) = ∏_{i=1}^n P(di | M).
The Lagrangian associated with the log-likelihood and with the normalization constraints
on the mixing coefficients is given by
L = Σ_{i=1}^n log ( Σ_{l=1}^K λl P(di | Ml) ) + μ (1 − Σ_{l=1}^K λl), (1.19)
where μ is a Lagrange multiplier. Setting the derivative with respect to each λl to zero gives the critical equations
∂L/∂λl = Σ_{i=1}^n P(di | Ml)/P(di) − μ = 0. (1.20)
Multiplying each critical equation by λl and summing over l immediately yields the
value of the Lagrange multiplier μ = n. Multiplying the critical equation again by
P(Ml) = λl, and using Bayes' theorem in the form
P(Ml | di) = P(di | Ml) P(Ml) / P(di) (1.21)
yields
λl^ML = (1/n) Σ_{i=1}^n P(Ml | di). (1.22)
Thus, the ML estimate of the mixing coefficients for class l is the sample mean of
the conditional probabilities that di comes from model l. Observe that we could have
estimated these mixing coefficients using the MAP framework and, for example, a
Dirichlet prior on the mixing coefficients.
Consider now that each model Ml has its own vector of parameters θl. Differentiating
the Lagrangian with respect to each parameter θlj of each component gives
∂L/∂θlj = Σ_{i=1}^n (λl / P(di)) ∂P(di | Ml)/∂θlj. (1.23)
Substituting Equation (1.21) into Equation (1.23) finally provides the critical equation
Σ_{i=1}^n P(Ml | di) ∂ log P(di | Ml)/∂θlj = 0 (1.24)
for each l and j. The ML equations for estimating the parameters are weighted averages
of the ML equations
∂ log P(di | Ml)/∂θlj = 0 (1.25)
arising from each point separately. As in Equation (1.22), the weights are the probabilities
of di being generated from model l.
The ML Equations (1.22) and (1.24) can be used iteratively to search for ML
estimates. This is a special case of a more general algorithm known as the Expectation
Maximization (EM) algorithm (Dempster et al. 1977). In its most general form the EM
algorithm is used for inference in the presence of missing data given a probabilistic
model for both the observed and missing data. For mixture models, the missing data
are considered to be a set of n labels for the n data points, where each label indicates
which of the K components generated the corresponding data point.
The EM algorithm proceeds in an iterative manner from a starting point. The starting
point could be, for example, a randomly chosen setting of the model parameters. Each
subsequent iteration consists of an E step and M step. In the E step, the membership
probabilities p(Ml | di ) of each data point are estimated for each mixture component.
The M step is equivalent to K separate estimation problems with each data point
contributing to the log-likelihood associated with each of the K components with
a weight given by the estimated membership probabilities. Variations of the M step
are possible depending, for instance, on whether the parameters θlj are estimated by
gradient descent or by solving Equation (1.24) exactly.
A different flavor of this basic EM algorithm can be derived depending on whether
the membership probabilities P (Ml | di ) are estimated in hard (binary) or soft (actual
posterior probabilities) fashion during the E step. In a clustering context, the hard
version of this algorithm is also known as K-means, which we discuss later in the
section on clustering.
Another generalization occurs when we can specify priors on the parameters of the
mixture model. In this case we can generalize EM to the MAP setting by using MAP
equations for parameter estimates in the M step of the EM algorithm, rather than ML
equations.
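As an illustration of the E and M steps, the following sketch implements soft EM for a one-dimensional mixture of two Gaussians. The synthetic data, the initialization, and the fixed number of iterations are arbitrary choices for the example; a production implementation would work in the log domain and test for convergence.

import math, random

def em_gaussian_mixture(x, K, iters=50):
    """Soft EM for a 1-D mixture of K Gaussians."""
    lam = [1.0 / K] * K                     # mixing coefficients lambda_l
    mu = random.sample(x, K)                # component means
    var = [1.0] * K                         # component variances
    for _ in range(iters):
        # E step: membership probabilities P(M_l | d_i), as in Equation (1.21).
        w = []
        for xi in x:
            p = [lam[l] * math.exp(-(xi - mu[l]) ** 2 / (2 * var[l]))
                 / math.sqrt(2 * math.pi * var[l]) for l in range(K)]
            s = sum(p)
            w.append([pl / s for pl in p])
        # M step: weighted ML updates, as in Equations (1.22) and (1.24).
        for l in range(K):
            wl = sum(wi[l] for wi in w)
            lam[l] = wl / len(x)
            mu[l] = sum(wi[l] * xi for wi, xi in zip(w, x)) / wl
            var[l] = sum(wi[l] * (xi - mu[l]) ** 2 for wi, xi in zip(w, x)) / wl + 1e-6
    return lam, mu, var

# Synthetic data from two well-separated clusters (for illustration only).
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(em_gaussian_mixture(data, K=2))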
Figure 1.4 A simple Bayesian network with five nodes and five random variables and
the global factorization property, P (X1 , X2 , X3 , X4 , X5 ) = P (X1 )P (X2 )P (X3 | X1 , X2 )
P (X4 | X3 )P (X5 | X3 , X4 ), associated with the Markov independence assumptions. For
instance, conditioned on X3 , X5 is independent of X1 and X2 .
Figure 1.5 The underlying Bayesian network structure for both hidden Markov models and
Kalman filter models. The two independence assumptions underlying both models are that
(1) the current state Xt only depends on the past state Xt−1 and (2) the current observation Yt
only depends on the current state Xt. In the hidden Markov model the state variables Xt are
discrete-valued, while in the Kalman filter model the state variables are continuous.
The local conditional probabilities can be specified in terms of lookup tables (for
categorical variables). This is often impractical, due to the size of the tables, requiring
in general O(k^(p+1)) values if all the variables take k values and have p parents. A
number of more compact but also less general representations are often used, such
as noisy OR models (Pearl 1988) or neural-network-style representations such as
sigmoidal belief networks (Neal 1992). In these neural-network representations, the
local conditional probabilities are defined by local connection weights and sigmoidal
functions for the binary case, or normalized exponentials for the general multivariate
case. Another useful representation is a decision tree approximation (Chickering et
al. 1997), which will be discussed in more detail in Chapter 8 in the context of
probabilistic models for recommender systems.
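As a concrete illustration of lookup-table conditional probabilities, the sketch below encodes binary versions of the five variables of Figure 1.4 and evaluates the joint distribution through the factorization P(X1)P(X2)P(X3 | X1, X2)P(X4 | X3)P(X5 | X3, X4). All numerical table entries are arbitrary values chosen for the example.

# Lookup tables giving P(X = 1 | parents) for binary variables; entries are illustrative.
p1 = 0.3                                                       # P(X1 = 1)
p2 = 0.6                                                       # P(X2 = 1)
p3 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}      # P(X3 = 1 | X1, X2)
p4 = {0: 0.2, 1: 0.7}                                          # P(X4 = 1 | X3)
p5 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.8}      # P(X5 = 1 | X3, X4)

def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    """Joint probability computed from the factorization of Figure 1.4."""
    return (bern(p1, x1) * bern(p2, x2) * bern(p3[(x1, x2)], x3)
            * bern(p4[x3], x4) * bern(p5[(x3, x4)], x5))

# Only 1 + 1 + 4 + 2 + 4 = 12 numbers are stored instead of the 2**5 - 1 = 31 needed
# for an unstructured joint table, yet the joint still sums to one.
total = sum(joint(a, b, c, d, e) for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1) for e in (0, 1))
print(joint(1, 0, 1, 1, 0), round(total, 10))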
It is easy to see why the graph must be acyclic. This is because in general it is not
possible to consistently define the joint probability of the variables in a cycle from
the product of the local conditioning probabilities. That is, in general, the product
P(X2 | X1)P(X3 | X2)P(X1 | X3) does not consistently define a distribution on
X1 , X2 , X3 .
The direction of the edges can represent causality or time course if this interpretation
is natural, or the direction can be chosen more for convenience if an obvious causal
ordering of the variables is not present.
The factorization in Equation (1.26) is a generalization of the factorization in sim-
ple Markov chain models, and it is equivalent to any one of a set of independence
properties which generalize the independence properties of first-order Markov chains,
where the future depends on the past only through the present (see Appendix A). For
instance, conditional on its parents, a variable Xi is independent of all other nodes,
except for its descendants. Another equivalent statement is that, conditional on a set
of nodes I , Xi is independent of Xj if and only if i and j are d-separated, that is,
if there is no d-connecting path from i to j with respect to I (Charniak 1991; Pearl
1988).
A variety of well-known probabilistic models can be represented as Bayesian net-
works, including finite-state Markov models, mixture models, hidden Markov models,
Kalman filter models, and so forth (Baldi and Brunak 2001; Bengio and Frasconi 1995;
Ghahramani and Jordan 1997; Jordan et al. 1997; Smyth et al. 1997). Representing
these models as Bayesian networks provides a unifying language and framework for
what would otherwise appear to be rather different models. For example, both hid-
den Markov and Kalman filter models share the same underlying Bayesian network
structure as depicted in Figure 1.5.
Table 1.1 Four basic Bayesian network learning problems, depending on whether the structure
of the network is known in advance or not, and whether there are hidden (unobserved) variables
or not.

                        No hidden                        Hidden
Known structure         parameter estimation             EM algorithm
                        (ML, MAP, MP)
Unknown structure       structure search                 very difficult
There are several levels of learning in graphical models in general and Bayesian
networks in particular. These range from learning the entire graph structure (the
edges in the model) to learning the local conditional distributions when the structure is
known. To a first approximation, four different situations can be considered, depending
on whether the structure of the network is known or not, and whether the network
contains unobserved data, such as hidden nodes (variables), which are completely
unobserved in the model (Table 1.1).
When the structure is known and there are no hidden variables, the problem is a rel-
atively straightforward statistical question of estimating probabilities from observed
frequencies. The die example in this chapter is an example of this situation, and we
discussed how ML, MAP, and MP ideas can be applied to this problem. At the other
end of the spectrum, learning both the structure and parameters of a network that con-
tains unobserved variables can be a very difficult task. Reasonable approaches exist
in the two remaining intermediary cases. When the structure is known but contains
hidden variables, algorithms such as EM can be applied, as we have described earlier
for mixture models. Considerably more details and other pointers can be found in
Buntine (1996), Heckerman (1997, 1998), and Jordan (1999).
When the structure is unknown, but no hidden variables are assumed, a variety
of search algorithms can be formulated to search for the structure (and parameters)
that optimize some particular objective function. Typically these algorithms operate
by greedily searching the space of possible structures, adding and deleting edges to
effect the greatest change in the objective function. When the complexity (number
of parameters) of a model M is allowed to vary, using the likelihood as the objective
function will inevitably lead to the model with the greatest number of parameters,
since it is this model that can fit the training data D the best. In the case of searching
through a space of Bayesian networks, this means that the highest likelihood network
will always be the one with the maximal number of edges. While such a network
may provide a good fit to the training data, it may in effect overfit the data. Thus,
it may not generalize well to new data and will often be outperformed by more
parsimonious (sparse) networks. Rather than selecting the model that maximizes
the likelihood P (D | M), the Bayesian approach is to select the model with the
maximum posterior probability given the data, P (M | D), where we average over
parameter uncertainty:
P(M | D) ∝ ∫ P(D | θ, M) P(θ | M) dθ,
1.5 Classification
Classification consists of learning a mapping that can classify a set of measurements
on an object, such as a d-dimensional vector of attributes, into one of a finite number
K of classes or categories. This mapping is referred to as a classifier and it is typically
learned from training data. Each training instance (or data point) consists of two parts:
an input part x and an associated output class target c, where c ∈ {1, 2, . . . , K}. The
classifier mapping can be thought of as a function of the inputs, g(x), which maps
any possible input x into an output class c.
For example, email filtering can be formulated as a classification problem. In this
case the input x is a set of attributes defined on an email message, and c = 1 if and
only if the message is spam. We can construct a training data set of pairs of email
messages and labels:
D = {[x1 , c1 ], . . . , [xn , cn ]}.
The class labels c in the training data can be obtained by having a human manually
label each email message as spam or non-spam.
The goal of classification learning is to take a training data set D and estimate
the parameters of a classification function g(x). Typically we seek the best function
g, from some set of candidate functions, that minimizes an empirical loss function,
namely
E = Σ_{i=1}^n l(ci, g(xi)),
where l(ci, g(xi)) is defined as the loss that is incurred when our predicted class label
is g(xi ), given input xi , and the true class label is ci . A widely used loss function for
classification is the so-called 0-1 loss function, where l(a, b) is zero if a = b and
one otherwise, or in other words we incur a loss of zero when our prediction matches
the true class label and a loss of one otherwise. Other loss functions may be more
To make a class label prediction for a new input x that is not in the training data, we
calculate P (c = k | x) for each of the K classes. If we are using the 0-1 loss function,
then the optimal decision is to choose the most likely class, i.e.
c = arg max_k {P(c = k | x)}.
where xj is the j th attribute and m is the total number of attributes in the input x.
A limitation of this general approach is that by modeling P (x | c = k) directly,
we may be doing much more work than is necessary to discriminate between the
classes. For example, say the number of attributes is m = 100 but only two of these
attributes carry any discriminatory information; the other 98 are irrelevant from the
point of view of making a classification decision. A good classifier would ignore these
98 features. Yet the full probabilistic approach we have prescribed here will build a
full 100-dimensional distribution model to solve this problem. Another way to state
this is that by using Bayes' rule, we are solving the problem somewhat indirectly: we
are using a generative model of the inputs and then inverting this via Bayes' rule to
get P (c = k | x).
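For reference, here is a minimal sketch of the generative (Naive Bayes) route just described, for binary attributes and two classes, with Bayes' rule applied at prediction time. The tiny data set, the attribute encoding, and the pseudocount smoothing are assumptions made for the example only.

def train_naive_bayes(X, y, num_classes=2, smooth=1.0):
    """Estimate P(c) and P(x_j = 1 | c) from binary attribute vectors X with labels y,
    using pseudocounts to avoid zero probabilities."""
    m = len(X[0])
    class_counts = [sum(1 for c in y if c == k) for k in range(num_classes)]
    prior = [(class_counts[k] + smooth) / (len(y) + smooth * num_classes)
             for k in range(num_classes)]
    cond = [[(sum(x[j] for x, c in zip(X, y) if c == k) + smooth)
             / (class_counts[k] + 2 * smooth) for j in range(m)]
            for k in range(num_classes)]
    return prior, cond

def predict(x, prior, cond):
    """Return arg max_k P(c = k | x); the denominator P(x) in Bayes' rule cancels."""
    scores = []
    for k, pk in enumerate(prior):
        s = pk
        for j, xj in enumerate(x):
            s *= cond[k][j] if xj == 1 else 1.0 - cond[k][j]
        scores.append(s)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: attributes record the presence of three words, class 1 means 'spam'.
X = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
prior, cond = train_naive_bayes(X, y)
print(predict([1, 1, 1], prior, cond))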
A probabilistic solution to this problem is to instead focus on learning the posterior
(conditional) probabilities P(c = k | x) directly, and to bypass Bayes' rule. Conceptually,
this is somewhat more difficult to do than the previous approach, since the
training data provide class labels but do not typically provide examples of values
of P (c = k | x) directly. One well-known approach in this category is to assume a
logistic functional form for P(c | x),
P(c = 1 | x) = 1 / (1 + e^(−(w^T x + w0))),
where for simplicity we assume a two-class problem and where w is a weight vector
of dimension d, w^T is the transpose of this vector, and w^T x is the scalar inner product
of w and x. Equivalently, we can represent this equation in log-odds form
log [ P(c = 1 | x) / (1 − P(c = 1 | x)) ] = w^T x + w0,
where now the role of the weights in the vector w is clearer: a large positive (negative)
weight wj for attribute xj means that, as xj gets larger, the probability of class c1
increases (decreases), assuming all other attribute values are fixed. Estimation of
the weights or parameters of this logistic model from labeled data D can be carried
out using iterative algorithms that maximize ML or Bayesian objective functions.
Multilayer perceptrons, or neural networks, can also be interpreted as logistic models
where multiple logistic functions are combined in various layers.
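A bare-bones version of ML estimation for the logistic model can be written as stochastic gradient ascent on the log-likelihood; the learning rate, iteration count, and toy data below are arbitrary illustrative choices, not a prescription from the text.

import math

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Estimate w and w0 in P(c = 1 | x) = 1/(1 + exp(-(w'x + w0))) by gradient ascent."""
    d = len(X[0])
    w, w0 = [0.0] * d, 0.0
    for _ in range(iters):
        for x, c in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + w0)))
            err = c - p                       # gradient of each log-likelihood term
            w = [wj + lr * err * xj for wj, xj in zip(w, x)]
            w0 += lr * err
    return w, w0

# Toy two-attribute data in which class 1 goes with a large first attribute.
X = [[2.0, 0.1], [1.5, 0.3], [0.2, 0.4], [0.1, 0.2]]
y = [1, 1, 0, 0]
w, w0 = fit_logistic(X, y)
p_new = 1.0 / (1.0 + math.exp(-(w[0] * 1.8 + w[1] * 0.2 + w0)))
print(round(p_new, 3))                        # P(c = 1 | x) for a new input x = (1.8, 0.2)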
The other alternative to the probabilistic approach is to simply seek a function
f that optimizes the relevant empirical loss function, with no direct reference to
probability models. If x is a d-dimensional vector where each attribute is numerical,
these models can often be interpreted as explicitly searching for (and defining) deci-
sion regions in the d-dimensional input space x. There are many non-probabilistic
classification methods available, including perceptrons, support vector machines and
kernel approaches, and classification trees. In Chapter 4, in the context of document
classification, we will discuss one such method, support vector machines, in detail.
The advantage of this approach to classification is that it seeks to directly optimize
the chosen loss function for the problem, and no more. In this sense, if the loss function
is well defined, the direct approach can in a certain sense be optimal. However, in many
cases the loss function is not known precisely ahead of time, or it may be desirable to
have posterior class probabilities available for various reasons, such as for ranking or
for passing probabilistic information on to another decision making algorithm. For
example, if we are classifying documents, we might not want the classier to make
any decision on documents whose maximum class probability is considered too low
(e.g. less than 0.9), but instead to pass such documents to a human decision-maker
for closer scrutiny and a nal decision.
Finally, it is important to note that our ultimate goal in building a classier is to
be able to do well on predicting the class labels of new items, not just the items in
the training data. For example, consider two classifiers where the first one is very
simple with only d parameters, one per attribute, and the second is very complex with
100 parameters per attribute. Further assume that the functional form of the second
model includes the rst one as a special case. Clearly the second model can in theory
always do as well, if not better than, the first model, in terms of fitting to the training
data. But the second model might be hopelessly overfitting the data and on new unseen
data it might produce less accurate predictions than the simpler model.
In this sense, minimizing the empirical loss function on the training data is only a
surrogate for what we would really like to minimize, namely the expected loss over all
future unseen data points. Of course this future loss is impossible to know. Nonetheless
there is a large body of work in machine learning and statistics on various methods
that try to estimate how well we will do on future data using only the available training
data D and that then use this information to construct classifiers that generalize more
accurately. A full discussion of these techniques is well beyond the scope of this book,
but an excellent treatment for the interested reader can be found in Hastie et al. (2001)
and Devroye et al. (1996).
1.6 Clustering
Clustering is very similar to classication, except that we are not provided with
class labels in the training data. For this reason classification is often referred to as
supervised learning and clustering as unsupervised learning. Clustering essentially
means that we are trying to find natural classes that are suggested by the data.
The problem is that this is a rather vague prescription and as a result there are many
different ways to define what precisely is meant by a cluster, the quality of a particular
clustering of the data, and algorithms to optimize a particular cluster quality function
or objective function given a set of data. As a consequence there are a vast number of
different clustering algorithms available. In this section we briefly introduce two of
the more well-known methodologies and refer the reader to other sources for more
complete discussions (Hand et al. 2001; Hastie et al. 2001).
One of the simplest and best known clustering algorithms is the K-means algorithm.
In a typical implementation the number of clusters is fixed a priori to some value K.
K representative points or centers are initially chosen for each cluster more or less at
random. These points are also called centroids or prototypes. Then at each step:
(i) each point in the data is assigned to the cluster associated with the closest
representative;
(ii) after the assignment, new representative points are computed for instance by
averaging or by taking the center of gravity of each computed cluster;
(iii) the two procedures above are repeated until the system converges or fluctuations
remain small.
Notice that the K-means algorithm requires choosing the number of clusters, being
able to compute a distance or similarity between points, and computing a representa-
tive for each cluster given its members.
From this general version of K-means one can define different algorithmic varia-
tions, depending on how the initial centroids are chosen, how symmetries are broken,
whether points are assigned to clusters in a hard or soft way, and so forth. A good
implementation ought to run the algorithm multiple times with different initial con-
ditions, since any individual run may converge to a local extremum of the objective
function.
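A minimal implementation of steps (i)-(iii), for points in the plane with Euclidean distance, might look as follows. The synthetic data, K, and the iteration cap are illustrative assumptions, and, as noted above, one would normally rerun the algorithm from several random initializations.

import random

def kmeans(points, K, iters=100):
    """Hard-assignment K-means: assign points to the closest centroid (i), recompute
    centroids as centers of gravity (ii), and repeat until convergence (iii)."""
    centroids = random.sample(points, K)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for p in points:                                                   # step (i)
            j = min(range(K),
                    key=lambda l: sum((a - b) ** 2 for a, b in zip(p, centroids[l])))
            clusters[j].append(p)
        new_centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[l]
                         for l, cl in enumerate(clusters)]                 # step (ii)
        if new_centroids == centroids:                                     # step (iii)
            break
        centroids = new_centroids
    return centroids, clusters

# Two illustrative clouds of two-dimensional points.
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)] + \
      [(random.gauss(6, 1), random.gauss(6, 1)) for _ in range(100)]
print(kmeans(pts, K=2)[0])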
We can interpret the K-means algorithm as a special case of the EM algorithm for
mixture models described earlier in this chapter. More specifically, the description
of K-means given above corresponds to a hard version of EM for mixture models,
where the membership probabilities are either zero or one, each point being assigned
to only one cluster. It is well known that the center of gravity of a set of points is the
fixed point that minimizes the average quadratic distance to those points. Therefore, in the case of a
mixture of spherical Gaussians, the M step of the K-means algorithm described above
maximizes the corresponding quadratic log-likelihood and provides an ML estimate
for the center of each Gaussian component.
When the objective function corresponds to an underlying probabilistic mixture
model (Everitt 1984; McLachlan and Peel 2000; Titterington et al. 1985), K-means
is an approximation to the classical EM algorithm (Baldi and Brunak 2001; Dempster
et al. 1977), and as such it typically converges toward a solution that is at least a local
ML or maximum posterior solution. A classical case is when Euclidean distances are
used in conjunction with a mixture of Gaussian models.
We can also use the EM algorithm for mixture models in its original form with
probabilistic membership weights to perform clustering. This is sometimes referred
to as model-based clustering. The general procedure is again to fix K in advance,
select specic parametric models for each of the K clusters, and then use EM to
learn the parameters of this mixture model. The resulting component models provide
a generative probabilistic model for the data in each cluster. For example, if the data
points are real-valued vectors in a d-dimensional space, then each component model
can be a d-dimensional Gaussian or any other appropriate multivariate distribution.
More generally, we can use mixtures of graphical models where each mixture com-
ponent is encoded as a particular graphical model. A particularly simple and widely
used graphical model is the so-called Naive Bayes model, described in the previ-
ous section on classification, where the attributes are assumed to be conditionally
independent given the component.
A useful feature of model-based clustering (and mixture models in general) is that
the data need not be in vector form to dene a clustering algorithm. For example,
we can dene mixtures of Markov chains to cluster sequences, where the underlying
model assumes that the observed data are being generated by K different Markov
chains and the goal is to learn these Markov models without knowing which sequence
came from which model. In Chapter 7 we will describe this idea in more detail in an
application involving clustering of Web users based on their Web page requests.
The main limitation of model-based clustering is the requirement to specify a
parametric functional form for the probability distribution of each component; in
some applications this may be difficult to do. However, if we are able to assume some
functional form for each cluster, the gains can be substantial. For example, we can
include a component in the model to act as a background cluster to absorb data points
that are outliers or that do not fit any of the other components very well. We can also
f (x) = Cx (1.29)
for x [1, +). In many real life situations associated with power-law distributions,
the distribution for small values of k or x may deviate from the expressions in Equa-
tions (1.28) and (1.29). Thus, a more exible denition is to say that Equations (1.28)
and (1.29) describe the behavior for sufciently large values of x or k.
In a log-log plot, the signature of a power-law distribution is a line with a slope
determined by the exponent γ. This provides a simple means for testing power-law
behavior and for estimating the exponent γ. An example of ML estimation of γ is
provided in Chapter 7.
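As a preview of that estimation problem, the following Python sketch implements the standard closed-form ML estimator for the exponent of a continuous power law with lower cut-off x_min. It is an illustration only, not the Chapter 7 code; the simulated sample and the choice γ = 2.5 are arbitrary.

# Sketch: ML estimate of gamma for a continuous power law f(x) = (gamma-1) x^(-gamma), x >= x_min.
import math
import random

def ml_exponent(sample, x_min=1.0):
    logs = [math.log(x / x_min) for x in sample if x >= x_min]
    return 1.0 + len(logs) / sum(logs)

if __name__ == "__main__":
    random.seed(0)
    gamma = 2.5
    # Inverse-transform sampling: if U ~ Uniform(0,1], then U^(-1/(gamma-1)) is power-law distributed.
    sample = [(1.0 - random.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(100000)]
    print(ml_exponent(sample))   # should be close to 2.5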
In both the discrete and continuous cases, moments of order m ≥ 0 are finite if and
only if γ > m + 1. In particular, the expectation is finite if and only if γ > 2, and the
variance is finite if and only if γ > 3. In the discrete case,

E[X^m] = C Σ_{k=1}^∞ k^(m−γ) = C ζ(γ − m),   (1.30)

where ζ is Riemann's zeta function (ζ(s) = Σ_k 1/k^s). In the continuous case,

E[X^m] = ∫_1^∞ C x^(m−γ) dx = C/(γ − m − 1).   (1.31)

In particular, C = γ − 1, E[X] = (γ − 1)/(γ − 2) and

var[X] = (γ − 1)/(γ − 3) − ((γ − 1)/(γ − 2))^2.
A simple consequence is that in a power-law distribution the average behavior is
not the most frequent or the most typical, in sharp contrast with what is observed
with, for instance, a Gaussian distribution. Furthermore, a large share of the total mass
lies in the tail to the right of the average, so that in a random sample most individual
points fall below the average while a few very large values dominate the totals.
Another interesting property of power-law distributions is that the cumulative tail,
that is, the area under the density to the right of x, also has power-law behavior, with
exponent γ − 1. This is obvious since ∫_x^∞ u^(−γ) du = x^(−γ+1)/(γ − 1).
There is also a natural connection between power-law distributions and log-normal
distributions (see Appendix A). X has a log-normal density if log X has a normal
density with mean μ and variance σ^2. This implies that the density of X is

f(x) = (1/(√(2π) σ x)) e^(−(log x − μ)^2/(2σ^2))   (1.32)

and thus log f(x) = C − log x − (log x − μ)^2/(2σ^2). In the range where σ is large
compared with |log x − μ|, we have log f(x) ≈ C − log x, which corresponds to a
straight line. Very basic generative models can lead to log-normal or to power-law
In other words, the ratio of the sums depends only on the ratio b/a and not on the
absolute scale of a or b: things look the same at all length scales. This scale-free
property is also known in more folkloristic terms as the 80/20 rule, which basically
states that 80% of your business comes from 20% of your clients. Of course, it is
not the particular 80/20 = 4/1 ratio that matters, since this may vary from one
phenomenon to the next; what matters is the existence of such a ratio across different
scales, or business sizes (and business types). This property is also referred to as self-scaling
or self-similar, since the ratio of the areas associated with a and b remains constant,
regardless of their actual values or frequencies. One important consequence of this
behavior is that observations made at one scale can be used to infer behavior at other
scales.
Figure 1.6 Zipf's Law for data from the Web KB project at Carnegie Mellon Univer-
sity consisting of 8282 Web pages collected from Computer Science departments of various
universities. Tokenization was carried out by removing all HTML tags and by replacing each
non-alphanumeric character by a space. Log-log plot of word ranks versus frequencies.
Figure 1.7 Heaps' Law for data from the Web KB project at Carnegie Mellon University,
consisting of 8282 Web pages collected from Computer Science departments of various uni-
versities. Tokenization was carried out by removing all HTML tags and by replacing each
non-alphanumeric character by a space. Plot of total number of words (text length) versus the
number of distinct words (vocabulary size). The characteristic exponent for this Web data set
is 0.76.
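Curves of the kind shown in Figures 1.6 and 1.7 are easy to reproduce. The Python sketch below is illustrative only (the corpus file name corpus.txt is a hypothetical placeholder): it tokenizes a plain-text corpus in the same spirit as the figures, replacing non-alphanumeric characters by spaces, and computes the rank-frequency pairs for Zipf's Law and the vocabulary growth for Heaps' Law.

# Sketch: Zipf (rank vs. frequency) and Heaps (text length vs. vocabulary size) statistics.
import re
from collections import Counter

def tokenize(text):
    return re.sub(r"[^0-9A-Za-z]+", " ", text).lower().split()

def zipf_and_heaps(tokens):
    counts = Counter(tokens)
    # Zipf: (rank, frequency) pairs, most frequent word first.
    zipf = [(rank + 1, freq) for rank, (_, freq) in enumerate(counts.most_common())]
    # Heaps: vocabulary size recorded after every 10,000 tokens.
    heaps, seen = [], set()
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % 10000 == 0:
            heaps.append((i, len(seen)))
    return zipf, heaps

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8", errors="ignore") as f:
        tokens = tokenize(f.read())
    zipf, heaps = zipf_and_heaps(tokens)
    print(zipf[:10])     # plot log(rank) vs. log(frequency) for the Zipf curve
    print(heaps[-5:])    # plot text length vs. vocabulary size for the Heaps curve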
This is equivalent to

P( (1/λ) log(w0/C) ≤ t ≤ (1/λ) log((w0 + dw)/C) ) = (w0/C)^(−α/λ) − ((w0 + dw)/C)^(−α/λ)   (1.35)

by using the exponential distribution of t. This finally yields the density

ρ(w) = (α/λ) (1/w) (w/C)^(−α/λ),   (1.36)

with power-law exponent γ = 1 + α/λ. Thus a power-law density results naturally
from the competition of two exponential phenomena: a positive exponential control-
ling the growth of the energy and a negative exponential controlling the age distribu-
tion.
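This competition between two exponentials is easy to check by simulation. In the Python sketch below (an illustration with assumed parameter names, lam for the growth rate and alpha for the rate of the exponential age distribution), each item grows exponentially for an exponentially distributed time; the estimated exponent of the resulting sizes is close to 1 + alpha/lam.

# Sketch of the Fermi-type generative model: exponential growth for an
# exponentially distributed time yields power-law distributed sizes.
import math
import random

def simulate(n, lam=1.0, alpha=1.5, C=1.0):
    ages = [random.expovariate(alpha) for _ in range(n)]
    return [C * math.exp(lam * t) for t in ages]

def ml_exponent(sample, w_min=1.0):
    logs = [math.log(w / w_min) for w in sample if w >= w_min]
    return 1.0 + len(logs) / sum(logs)

if __name__ == "__main__":
    random.seed(0)
    sizes = simulate(200000, lam=1.0, alpha=1.5)
    print(ml_exponent(sizes))   # close to 1 + alpha/lam = 2.5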
A different generative model of power-law behavior based on the notion of prefer-
ential attachment will be given in Chapter 3. A third and more indirect model based
on information theory is due to Mandelbrot (see Mandelbrot 1977 and references
therein; Carlson and Doyle 1999; Zhu et al. 2001) and is given in one of the exercises
below.
1.8 Exercises
Exercise 1.1. Prove that Equations (1.15) and (1.16) are correct.
Exercise 1.2. In the die model, compute the posterior when the prior is a mixture of
Dirichlet distributions. Can you infer a general principle?
Exercise 1.3. The die model with a single Dirichlet prior is simple enough that one
can proceed with higher levels of Bayesian inference, beyond the computation of the
posterior or its mode. For example, explicitly compute the normalizing factor (also
called the evidence) P (D). Use the evidence to derive a strategy for optimizing the
parameters of the Dirichlet prior.
Exercise 1.4. Consider a mixture model with a Dirichlet prior on the mixing coef-
ficients. Study the MAP and MP estimates for the mixing coefficients.
Exercise 1.5. Compute the differential entropy (see Appendix A) of the continuous
power-law distribution f(x) = Cx^(−γ), x ≥ 1.
Exercise 1.6. Simulate a first-order Markov language (by selecting an alphabet con-
taining the space symbol, defining a die probability vector over the alphabet, tossing
the die many times, and computing word statistics) and check whether Zipf's and Heaps'
Laws hold in this case. Experiment with the size of the alphabet, the letter proba-
bilities, the order of the Markov model, and the length of the generated sequence.
Exercise 1.7. Find a large text data set (for instance from Web KB) and compute
the corresponding Zipf and Heaps curves, e.g. word rank versus word frequency and
text size versus vocabulary, or the respective logarithms.
Exercise 1.8. Study the connection between Zipf's and Heaps' Laws. In particular,
is Heaps' Law a consequence of Zipf's Law?
Exercise 1.9. Collect statistics about the size of cities, or computer files, and deter-
mine whether there is an underlying power-law distribution. Provide an explanation
for your observation.
Exercise 1.10. Mandelbrots information theoretic generative model for power-law
distributions considers random text generated using a set of M different words
with associated probabilities p1 , . . . , pM , ranked in decreasing order, and associated
transmission costs C1 , . . . , CM . A possible cost function is to take Ci proportional
to log i. Why?
Now suppose that the goal is to design the probabilities pi in order to optimize
the average amount
of information per unit transmission cost. The average cost per
word is E[C] = Σ_i p_i C_i and the average information, or entropy (see Appendix A),
per word is H(p) = −Σ_i p_i log p_i. Compute the values of the p_i that maximize the
ratio H(p)/E[C] and show that, with logarithmic costs, this corresponds to a power law.
Exercise 1.11. Show that the product of two independent log-normal random vari-
ables is also log-normal. Use this simple fact to derive a generative model for log-
normal distributions.
2
Basic WWW Technologies
visit to the library, whether an electronic or a brick and mortar one. Hence, in turn,
under an extended set of protocols the Web encompasses more than just the Internet.
and its formal grammar. In other words, the DTD of HTML specifies the metalevel
of a generic HTML document. For example, the following declaration in the DTD
of HTML 4.01 specifies that an unordered list is a valid element and that it should
consist of one or more list items,
<!ELEMENT UL - - (LI)+ -- unordered list -->
where ELEMENT is a keyword that introduces a new element type, UL is the new
element type, which represents the unordered list being defined, and LI is the ele-
ment type that represents a list item. Here, the metasymbols '()', for grouping, and '+', for
repetition, are similar to the corresponding metasymbols in regular expressions. In
particular, + means that the preceding symbol (i.e. LI) should appear one or more
times. The declaration of an element in a document is generally obtained by enclosing
some content within a pair of matching tags. A start tag is realized by the element
name enclosed in angle brackets, as in <LI>. The end tag is distinguished by adding a
slash before the element name, as in </LI>. The semantics in this case says that any
text enclosed between <LI> and </LI> should be treated as a list item. Note that,
besides enriching the document with structure, markup may affect the appearance of
the document in the window of a visual browser (e.g. the text of a list item should be
rendered with a certain right indentation and preceded by a bullet).
Some elements may be enriched with attributes. For example, the following is a
portion of the declaration of the element IMG, used to embed inline images in HTML
documents:
<!ELEMENT IMG - O EMPTY -- Embedded image -->
<!ATTLIST IMG
src %URI; #REQUIRED -- URI of image to embed --
alt %Text; #REQUIRED -- short description --
...
The attribute src is a required resource identifier (see Section 2.2 below) that allows
us to access the file containing the image to be displayed. In the simplest case this
is just a local filename. The attribute alt is text that should be used in place of the
image, for example when the document is rendered in a terminal-based browser. An
actual image would then be inserted as
<img src="Web.png" alt="a Web">
This is an example of an element that must not use a closing tag (as implied by the
keywords O and EMPTY in the element declaration).
Version information. This is a declarative section that specifies which DTD is used
in the document (see lines 1-2 in Figure 2.1). The W3C recommends three possible
DTDs: strict, transitional, or frameset. Strict contains all the elements that are not
deprecated (i.e. outdated by new elements) and are not related to frames, transitional
allows deprecated elements, and frameset contains all of the above plus frames.
Header. This is a declarative section that contains metadata, i.e. data about the docu-
ment. This section is enclosed within <head> and </head> in Figure 2.1 (lines 4-8).
In our example, line 5 specifies the character set (ISO-8859-1, aka Latin-1) and
line 6 describes the authors. Metadata in this case are specified by setting attributes
of the element meta. The title is instead an element (line 7).
Body. This is the actual document content. This element is enclosed within <body>
and </body> in Figure 2.1 (lines 9-18).
In Figure 2.1 we see some other examples of HTML elements. <h1> and <h2> are
used for headings, <ul> consists of an unordered list, whose (bulleted) items are
tagged by <li>, and so on.
2.1.3 Links
The <A> element (see, for example, lines 15 and 16) is used in HTML to implement
links, an essential feature of hypertexts. As we will see later in this book, a link
is rather similar to an edge in a directed graph. It connects two objects referred to
as anchors. In our example, the source anchors are the elements of the document
delimited by <A> and </A>. In the World Wide Web, the target anchor is a resource
that may be physically stored in the same server as the source document, or may be
in a different server, possibly located in another country or continent (see Figure 2.2).
The target resource is identified by means of a special string called the Uniform
Resource Identier (URI). If the attribute href of the element <A> is set, its value
is the URI of the target anchor. In our example there are two URIs, namely the two
strings http://www.w3.org/ and toc.html.
Note that links can be implemented in several other ways. For HTML documents,
source anchors can be associated with images, forms, or active elements imple-
mented using scripting languages or applets. But linking is not limited to HTML
documents, since many document formats now support mechanisms for encoding
URIs.
In 2000, HTML was reformulated as an XML application. The resulting language
was called XHTML and released as a W3C recommendation. Details can be found at
http://www.w3.org/TR/html/.
Figure 2.2 Two linked resources on different servers: 'The W3 Consortium' page on
www.w3.org and the 'Table of Contents' page on ibook.ics.uci.edu, identified by the URIs
http://www.w3.org/ and http://ibook.ics.uci.edu/toc.html.
and a tilde by %7E. Similarly, the question mark ? is reserved in URIs that represent
queries to delimit the identifier of a queryable object and the query itself.
An absolute URI consists of two portions separated by a colon: a scheme specifier
followed by a second portion whose syntax and semantics depend on the particu-
lar scheme. In BNF (Backus-Naur Form), the general syntax of absolute URIs is
expressed as
absoluteURI ::= scheme : ( hier_part | opaque_part )
The authority can be either a server with an IP address, or a reference to a registry of a naming
authority:
authority ::= server | reg_name
For Internet URIs we are interested in the first case. While in the simplest case a
server is specified by its full hostname, in general we must allow the possibility of
password-protected sites and servers that do not listen to the default TCP/IP port for
a given protocol. Thus
server ::= [ [ userinfo @ ] hostport ]
hostport ::= host [ : port ]
By website we generally mean the collection of resources that share the same
authority in their URI. Anchors in actual HTML pages may also contain relative URIs.
For example, the absolute URI http://www.gnacz.org/foo/bar.html
may simply be referred to as /foo/bar.html in any source anchor contained
in a page belonging to the site www.gnacz.org.
URI-reference ::= [ absoluteURI | relativeURI ] [ # fragment ]
relativeURI ::= ( networkPath | absolutePath
| relativePath ) [ ? query ]
The full grammar specification for URIs can be found in Berners-Lee et al. (1998).
2.3 Protocols
Protocols describe how messages are encoded and specify how messages should be
exchanged. Protocols that regulate the functioning of computer networks are layered
in a hierarchical fashion. The lowest layer in the hierarchy is typically related to the
physical communication mechanisms (e.g. it may be concerned with optical fibers
or wireless communication). Higher levels are related to the functioning of specific
applications such as email or file transfer. A hierarchical organization of protocols
allows one to employ different levels of abstraction when describing and implement-
ing the software and the hardware components of a networked environment. For
example, a host in the Internet is characterized by a 32-bit address, independently of
whether the physical connection goes through a home DSL cable, through an office
Ethernet LAN, or through an airport wireless network. Similarly, email protocols are
a lingua franca spoken by clients and servers. Sending email just involves connect-
ing to a server and using this language. All of the details about how the information
is actually transmitted to the recipient are hidden by the hierarchical mechanism of
encapsulation by which higher level messages are embedded into a format understood
by the lower level protocols.
to an Internet host for which the name is known (e.g. ftp.uci.edu), it first must
determine the IP number. The mapping from names to numbers, or resolution, is
realized by the domain name service (DNS).
Names are organized hierarchically. The rightmost component of a name corre-
sponds to the highest level in the hierarchy and is known as the top-level domain
(TLD). Some historical TLDs include .com, .edu, .org, and .gov. Other TLDs
are, for example, associated with ISO 3166 country codes, such as .de, .fr, .it,
or .uk. The management of the domain name system and the root server system
(that handles resolution for TLDs), as well as the allocation of the IP address space, is
currently the responsibility of the Internet Corporation for Assigned Names and Num-
bers (ICANN). Within each TLD, several subdomains are registered (e.g. under the
management of a local naming authority) and the process can be repeated recursively.
Resolution is carried out by consulting a hierarchical distributed database. We
illustrate the mechanism by using an example. Suppose the user agent running
on lnx32.abcd.net needs to translate the name ftp.uni-gnat.edu. The
agent may first query a local database that holds the most commonly used hostnames
(this is normally stored in the file /etc/hosts in Unix-like systems). If
the search is unsuccessful, or no local database is maintained, a recursive query
is typically sent to a local DNS server, say dns.abcd.net. This server may be
able to provide an answer by consulting its cache. This happens if the same query
has been answered before and the result has been stored in the server's cache.
If not, then dns.abcd.net will need to consult other DNS servers, becoming
itself a client. For example, it might be configured to send a non-recursive query to
the DNS server of its Internet service provider, say dns.abcdprov.com. Hav-
ing received a non-recursive query, if dns.abcdprov.com cannot provide an
answer, it does not forward the query to another server but rather it returns to
dns.abcd.net a referral to another DNS server that might be able to answer.
Suppose it returns a referral to a root server. dns.abcd.net will now repeat
the query to the root server, which replies with a referral to the DNS server of the
.edu domain. The iteration continues and dns.abcd.net sends the query to the
server for the .edu, which suggests that dns.uni-gnat.edu is the authority for
the zone to which ftp.uni-gnat.edu belongs. Eventually, dns.abcd.net
receives the answer from dns.uni-gnat.edu and returns it to the original client
lnx32.abcd.net.
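From the point of view of a user agent, all of this machinery is hidden behind a single resolver call. A minimal Python sketch (the hostnames are examples only; the system resolver carries out the recursive lookup described above):

# Sketch: resolving hostnames to IP addresses via the system resolver.
import socket

for host in ["www.w3.org", "www.ics.uci.edu"]:
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror as e:
        print(host, "-> resolution failed:", e)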
Table 2.1 HTTP request methods (method name and description).
case the user agent directly connects to the server but, more generally, intermediary
agents like proxies, tunnels, or gateways can actually be present.
Messages are exchanged in ASCII format and pertain to one of several possible
methods, as detailed in Table 2.1. HTTP is essentially a request/response protocol.
A method corresponds to a request from the user agent to the server. The
server responds to the request with a response message.
The most common method is GET, which allows the user agent to fetch an HTML
Web page, or another document such as an image or a file, from the server. Its usage
is illustrated in Figure 2.4. In this example, the response of the server is simply
the HTML code of the Web page associated with the requested URI (http://
www.ics.uci.edu/). In HTTP 1.0 the connection would be released upon
completion of the request; since HTTP 1.1 (Fielding et al. 1999; Krishnamurthy
et al. 1999), connections are persistent, i.e. the link remains active after a request.
Persistence offers considerable savings in overhead when several requests are sent to
the server, for instance while downloading a page with several images.
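A minimal Python sketch of a GET request over a persistent HTTP/1.1 connection, using only the standard library (the host and paths are examples; error handling is omitted):

# Sketch: two GET requests reusing one persistent HTTP/1.1 connection.
from http.client import HTTPConnection

conn = HTTPConnection("www.ics.uci.edu", 80, timeout=10)
for path in ["/", "/index.html"]:
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()            # the body must be read before reusing the connection
    print(path, resp.status, resp.reason, len(body), "bytes")
conn.close()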
While some entities are simply documents stored in the server's file system, it is
also possible to have requests for entities that are actively generated by the server. This
is normally the case for pages served as a result of user queries: the server's software
could query a database management system to retrieve the requested information,
create on-the-fly HTML code for display, and return the resulting ASCII stream as
an HTTP response. The collection of documents behind query forms is generally
referred to as the hidden (or invisible) Web (Bergman 2000).
telnet www.ics.uci.edu 80
Trying 128.195.1.77...
Connected to lolth.ics.uci.edu.
Escape character is '^]'.

User agent's request:
GET http://www.ics.uci.edu/ HTTP/1.1
Host: www.ics.uci.edu

Server's response:
HTTP/1.1 200 OK
Date: Wed, 25 Sep 2002 19:43:12 GMT
Server: Apache/1.3.26 (Unix) PHP/4.1.2 mod_ssl/2.8.10 OpenSSL/0.9.6e
X-Powered-By: PHP/4.1.2
Transfer-Encoding: chunked
Content-Type: text/html

HTML code of returned webpage:
f00
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Information and Computer Science at the University of
California, Irvine</title>
...

Figure 2.4 Example of the use of the GET method in an HTTP 1.1 session.
The example shown in Figure 2.4 can be reproduced on almost any platform using a
terminal or a telnet application and is simply intended to demonstrate the mechanics of
the protocol. Figure 2.5, however, shows a simple programming example that requires
some basic knowledge of the Perl programming language (Wall et al. 1996). The
program fetches the header returned by the HTTP server and can be easily modified
to collect interesting pieces of information, such as the server software being run
(third line of output) or the timestamp (fourth line of output) that indicates the last
modification time of the page (in Chapter 6 we discuss how this information can
be useful to estimate the age distribution of Web pages). Lines 1-8 in Figure 2.5
are simply directives and variable declarations. A communication channel (socket)
is established in lines 10-12 by invoking the constructor of the INET class with
three arguments: the host (either specied by an IP address or a domain name), the
connection port (80) and a connection timeout in seconds. If a connection cannot be
established, the program is terminated in line 13. Line 15 prints a string to the socket
(that behaves like a character output device) issuing the HEAD request to the server,
using HTTP 1.0 (the commented line below shows how to use HTTP 1.1). Finally,
the loop in lines 18-20 is used to read and print (to the standard output), one line at
a time, the response returned by the server.
A similar program can be used to sample the IP address space. This technique,
explained in more detail in Chapter 3, can be used to estimate global properties of the
Web (e.g. the number of active sites). For example we may be interested in determining
the fraction of the addressable IPv4 space (2^32 = 4 294 967 296 hosts in total) with
running Web servers. Figure 2.6 shows a basic Perl script for sampling IPv4 addresses,
testing the presence of an HTTP server running on port 80. The program counts a
success if a socket can be established and, in addition, the server returns the status
1 #!/usr/bin/perl
2 # getHeader.pl
3 use strict;
4 use IO::Socket qw(:DEFAULT :crlf);
5 my $host = shift || "www.unifi.it";
6 my $path = shift || "/";
7 my $timeout= shift || 2;
8 my $socket;
9 # Create a socket for standard HTTP connection (port 80)
10 $socket = IO::Socket::INET->new(PeerAddr => $host,
11 PeerPort => 'http(80)',
12 Timeout => $timeout)
13 or die "Cannot connect\n";
14 # Sends a HEAD request by printing on the socket
15 print $socket "HEAD $path HTTP/1.0",CRLF,CRLF;
16 #print $socket "HEAD $path HTTP/1.1",CRLF,"host: $host",CRLF,CRLF;
17 my $head;
18 while ($head = <$socket>) {
19 print $head;
20 }
% ./getHeader.pl
HTTP/1.1 200 OK
Date: Tue, 03 Dec 2002 23:59:59 GMT
Server: Apache/1.3.27 (Unix) PHP/4.2.3
Last-Modified: Mon, 04 Nov 2002 08:16:25 GMT
ETag: "f209-f91-3dc62cd9"
Accept-Ranges: bytes
Content-Length: 3985
Connection: close
Content-Type: text/html
Figure 2.5 Sample Perl script demonstrating the usage of the HEAD method in HTTP 1.0
and 1.1. A sample terminal output obtained by running the script is shown at the bottom.
code 200, which indicates a successful connection (Fielding et al. 1999). Although
useful as a starting point, this basic program has several limitations (see Exercise 2.4).
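A rough Python analogue of the same sampling idea is sketched below (an illustration only, not the script of Figure 2.6): it draws random IPv4 addresses, attempts a connection on port 80, and counts a success when the server answers a HEAD request with status code 200. As in the basic version, special address ranges are not excluded and no multithreading is used (see Exercise 2.4).

# Sketch: sampling random IPv4 addresses for running Web servers on port 80.
import random
import socket

def has_web_server(ip, timeout=2):
    try:
        with socket.create_connection((ip, 80), timeout=timeout) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            reply = s.recv(64)
            return b" 200 " in reply[:32]    # success only on status code 200
    except OSError:
        return False

if __name__ == "__main__":
    trials, hits = 200, 0
    for _ in range(trials):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        if has_web_server(ip):
            hits += 1
    print(hits, "servers found out of", trials, "sampled addresses")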
Figure 2.6 Basic Perl script for sampling the IPv4 address space.
Table 2.2 An example of an entry in the www.ics.uci.edu Web server log file.
be found (code 404) or a server unavailable response (code 503) (see Fielding et
al. (1999) for details). Server-side logging software can be configured so that these
various log files are combined into a single server log file that records all of these
fields for each request; we will assume this to be the case in what follows.
Table 2.3 A second example of an entry in the www.ics.uci.edu Web server log file.
request came via the Google search engine (www.google.com) by issuing the
query maximum+length+of+a+url to Google and then following a hyperlink
in the displayed results to the www.ics.uci.edu website. The particular page
being requested here is a page from the mailing list archives of the Internet Engineer-
ing Task Force (IETF) HTTP Working Group and is being provided via ftp.
[Figure: general architecture of a search engine, showing crawling control, the document
repository, the indexer, the inverted index, and results returned to the end user.]
able pages (or URLs) that are up-to-date in the internal repository, i.e. the data pages stored
by the engine. One approach for maintaining a fresh repository consists of crawl-
ing the Web periodically in order to replace the contents of the index with updated
versions of documents and possibly to add newly created documents. It is important
to stress that fetching a large portion of the Web can take several days because of
bandwidth limitations and that during the elapsed time between consecutive crawls
the index is not up to date. As a result, achieving high performance in terms of both
coverage and recency is a major challenge for general-purpose search engines, as
earlier noted by Lawrence and Giles (1998b). This issue is explained in more detail
in the following.
2.5.2 Coverage
Although several companies claimed by the mid 1990s that their search engines were
able to offer an almost complete coverage of Web content, it soon became clear that
only a relatively small fraction of existing Web pages could be fetched by a single
search engine. Lawrence and Giles (1998b, 1999) describe two experiments aimed
at measuring the performance of search engines in terms of coverage and freshness.
They use the following approach, known as overlap analysis, for estimating the size
of the indexable Web (see also Bharat and Broder 1998). Let W denote the set of Web
pages and let Wa ⊆ W, Wb ⊆ W be the pages crawled by two independent engines, a
and b. How big are Wa and Wb with respect to W ? Suppose we can uniformly sample
documents and test for membership in the two sets. Denote by P (Wa ) and P (Wb ) the
extracts and returns one element out of the list. If Q is managed according to a first-
in first-out (FIFO) policy, then the crawler performs a breadth-first search (BFS).
Assuming that the purpose is to download all pages reachable from S0 , this crawling
strategy may be reasonable, at least as a simple strategy to start with. However, we
will see later that the URLs in Q may be sorted according to different criteria in order
to focus the crawl in particular directions so as to improve the quality of the collected
documents with respect to an assigned goal. In line 4, we invoke the function Fetch,
which downloads the document addressed by the URL u. The fetched document is
then stored in the repository D. In line 6, the function Parse extracts all of the URLs
contained in the document (the set of children of u) and puts them in a temporary list
L. All the elements of L are in turn inserted into the crawling queue Q, unless they
have already been crawled or they are already enqueued for crawling. The latter test
is necessary to avoid loops, as the Web is not acyclic. Compared to standard graph
visit algorithms, storing reached and fetched URLs replaces traditional node coloring.
Note that in line 8 we also store the edges of the Web graph. We will see later how
recent scoring algorithms can take advantage of this information.
Simple-Crawler(S0, D, E)
 1  Q ← S0
 2  while Q ≠ ∅
 3    do u ← Dequeue(Q)
 4       d(u) ← Fetch(u)
 5       Store(D, (d(u), u))
 6       L ← Parse(d(u))
 7       for each v in L
 8         do Store(E, (u, v))
 9            if (v ∉ D ∧ v ∉ Q)
10              then Enqueue(Q, v).
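A minimal Python rendering of Simple-Crawler is sketched below (an illustration under simplifying assumptions: only <a href> links are followed, politeness rules and robust error handling are omitted, and the seed URL is an arbitrary example).

# Sketch: breadth-first crawler with a FIFO queue Q, a repository D, and an edge store E.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def simple_crawler(seeds, max_pages=50):
    Q = deque(seeds)          # crawling queue (FIFO gives breadth-first search)
    D = {}                    # repository: URL -> document
    E = []                    # edges of the crawled Web graph
    seen = set(seeds)         # URLs already fetched or enqueued (line 9 of the pseudocode)
    while Q and len(D) < max_pages:
        u = Q.popleft()
        try:
            with urlopen(u, timeout=10) as resp:
                d = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        D[u] = d
        parser = LinkParser()
        parser.feed(d)
        for href in parser.links:
            v = urldefrag(urljoin(u, href))[0]   # absolute URL, fragment removed
            E.append((u, v))
            if v.startswith("http") and v not in seen:
                seen.add(v)
                Q.append(v)
    return D, E

if __name__ == "__main__":
    D, E = simple_crawler(["http://www.ics.uci.edu/"], max_pages=10)
    print(len(D), "pages,", len(E), "edges")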
several search engines can (and do) return links to documents they have never
downloaded.
The time required to complete the downloading of a document is unknown
because of many factors, including connection delays and network congestion.
Moreover, the bandwidth available at the crawling site(s) may be significantly
larger than at the site where the document to be downloaded resides.
As a result, a single crawling process that just waits for the completion of
each download before moving to the next one would be a poor design choice.
Running concurrent fetching threads is thus the normal solution to maximize the
exploitation of available bandwidth. Note that a multithreaded implementation
with a single queue Q is still a single crawling process. Parallelization of the
crawling process is discussed in Section 6.3.
Crawlers should be respectful of servers and should not abuse the resources at
the target site otherwise administrators may decide to block access (Koster
1995). The Robots Exclusion Protocol (http://www.robotstxt.org/)
allows webmasters to define which portions of a site are permissible for robots
to fetch from. Under this protocol, certain directories or dynamically generated
pages can be declared as off-limits for a crawler if they are listed in the file
robots.txt located in the root directory. A well-behaved crawler should
always avoid fetching excluded documents (a small example of checking these rules
is sketched after this list). Similarly, by inserting a special META tag in the HTML code,
a webmaster may indicate that the contents of certain pages should not be indexed.
Overloading sites or network links should also be avoided. For example, multi-
ple threads should not fetch from the same server simultaneously or too often.
Unfortunately, outlinks often point within the same site (this happens in par-
ticular for dot com sites), leading to a relatively high degree of locality in
a single queue. For example, Mercator (Heydon and Najork 2001) is a Web
crawler that implements a sophisticated strategy for broadening as much as
possible the crawling fringe and increasing the elapsed time between two con-
secutive requests to the same server. It maintains a data structure formed by
several FIFO queues, each containing only URLs pointing to a unique server,
and by an index that associates a timestamp to each queue. Fetching threads
must check in the index if enough time has passed since the last access to the
same queue, and once they download a document they insert a timestamp in
the corresponding queue.
URLs that have already been fetched or discovered must not be inserted again
into the crawling queue Q. The test in line 9, however, is not straightforward
to implement. In particular, since Q and D are stored on disk, special care is
needed in order to avoid any overheads due to external memory management.
Heydon and Najork (2001) propose to use two caches. The rst cache contains
recently added URLs, exploiting locality during the crawl. The second cache
contains popular URLs, as determined during the crawl. Heydon and Najork
(2001) report a high hit rate for the second cache; this can be explained by the
scale-free distribution for the number of incoming links of a given Web page
(see Chapter 3 for details on these graphical properties of the Web).
The real world is more complex than a mathematical model of the Web and in
practice several precautions must be enforced in a real crawler to avoid problems
like aliases and traps (Heydon and Najork 1999). Aliases occur if the same
document can be addressed by many distinct URLs, for example, if a site is
registered under multiple domain names. It is possible to cope with aliases by
using canonicalization, i.e. issuing a DNS request in order to get the canonical
name of the host (see Section 2.3.2). A related problem of multiple URLs
pointing to the same document occurs when the server embeds session IDs
into the URL. Traps can be generated by malicious common gateway interface
(CGI) scripts that keep generating on the fly fake documents pointing to other
fake documents.
DNS lookup is necessary to map hostnames in URLs to IP addresses before
downloading, or to canonicalize URLs before inserting them in the discov-
ered queue. A DNS lookup can often take considerable time due to slow DNS
server responses. Efficient crawlers may need to implement DNS resolution in a
multithreaded fashion (Heydon and Najork 1999).
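As a small illustration of the exclusion check mentioned in the list above, the following Python sketch uses the standard urllib.robotparser module; the site and the user-agent name are arbitrary examples.

# Sketch: checking the Robots Exclusion Protocol before fetching a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.ics.uci.edu/robots.txt")
rp.read()                     # fetch and parse the exclusion rules
for url in ["http://www.ics.uci.edu/", "http://www.ics.uci.edu/private/"]:
    print(url, "fetch allowed:", rp.can_fetch("MyCrawler", url))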
We should also bear in mind another important difference between graph visiting and
crawling. A graph is normally thought of as a static object during the execution of a
crawling algorithm. However, in reality, the Web is dynamic and changes continuously
in terms of both content and topology. Fetching a billion documents can take weeks.
Crawling is thus like taking a picture of a living scene with some objects remaining
fixed and others moving at a very high speed. At the end of the process, D and E will
unavoidably contain stale information. The problem can be somewhat addressed by
downloading sites that are expected to change more often (e.g. portals, press, etc.)
more frequently, and more static sites less frequently. Accounting for temporal
effects during a crawl is a relatively recent research topic; in Section 6.4 we will
discuss specific crawling approaches that address Web dynamics.
2.6 Exercises
Exercise 2.1. Explain the differences amongst URI, URL, and URN using Venn
diagrams and provide examples for each intersection amongst these sets.
Exercise 2.2. Write a program that extracts all the URIs from a given HTML docu-
ment.
Exercise 2.3. Write a program to automatically query a given search engine. The
program should accept a query string as input and build up a list of URIs extracted
from the HTML pages returned by the search engine.
Exercise 2.4. Improve the program of Figure 2.6 and perform some basic sampling
experiments. In particular:
write a multithreaded version so that waiting for a longer timeout in the case of
connection failure does not stall the sampling; and
correct the random generation of IP addresses to exclude special values that
are associated, for example, with default routers or unassigned portions of the
address space (see O'Neill et al. (1997) for details).
After completing your program collect some statistics from the Web, for example, the
distribution of usage of different HTTP server software.
3
Web Graphs
In short, we are dealing with a fuzzy graph that is constantly evolving on several
different time scales. For simplicity the main analysis focuses on a large and slowly
evolving subgraph of this graph. The dynamic aspect of course suggests other inter-
esting questions, related to the temporal evolution of relevant variables. For example,
how is the total number of nodes and links growing with time and what is the life
expectancy of a typical page? Some of these temporal issues will also be addressed
in this chapter and again later in Chapter 6.
There are many reasons, from mathematical to social to economic, to study the
Web graph. It is an example of a large, dynamic, and distributed graph where we have
the privilege of being able to make a large number of measurements. In addition,
as we are about to see, the Web graph shares many properties with several other
complex graphs found in a variety of systems, ranging from social organizations to
biological systems. The graph may reflect psychological and sociological properties
of how humans organize information and their societies. Perhaps on a more pragmatic
note, the structure of the graph may also provide insights regarding the robustness and
vulnerability of the Internet and help us develop more efficient search and navigation
algorithms. Indeed, one way to locate almost any kind of information today is to
navigate to the appropriate Web page by following a sequence of hyperlinks, i.e. a
path in the graph. Efficient navigation algorithms ought to be able to find short paths,
at least most of the time. And nally, the behavior of users as they traverse the Web
graph is also of great interest and is the focus of Chapter 7.
Having set our focus on the graph of relatively stable Web pages and their hyper-
links, several different questions can be asked regarding the size and connectivity of
the graph, the number of connected components, the distribution of pages per site,
the distribution of incoming and outgoing connections per site, and the average and
maximal length of the shortest path between any two vertices (diameter).
Several empirical studies have been conducted to study the properties of the Web
graph. These are based on random sampling methods or, more often, on some form of
crawling applied to subgraphs of various sizes, ranging from a few thousand to a few
hundred million pages. To a first-order approximation, these studies have revealed
consistent emerging properties of the Web graph, observable at different scales. One
particularly striking property is the fact that connectivity follows a power-law distri-
bution. Remarkably, these results have had a sizable impact in graph theory, with the
emergence of new problems and new classes of random graphs that are the focus of
active mathematical investigation (see Bollobás and Riordan (2003) and additional
references in the notes at the end of this chapter).
One preliminary but essential observation that holds for all the graphs to be con-
sidered in this chapter is that they are sparse, i.e. have a number of edges that is small
(|E| = O(n)) or at least o(n^2), compared to the number O(n^2) of edges in a complete
or dense graph. For the Web, this is intuitively obvious (at least in the foreseeable
future), because the average number of hyperlinks per page can be expected to be
roughly a constant, in part related to human information processing abilities. This
sparseness is already a departure from the uniform random graph model where the
probability of an edge p is a constant and the total number of edges |E| ≈ pn^2/2.
graph created by many agents, each of whom is completely free to create documents
and hyperlinks.
The emergence of a power-law distribution is by itself intriguing enough to require
an explanation. But in addition, the need for an explanation is exacerbated by the fact
that a similar distribution has been observed for many other networks: business net-
works, social networks, transportation networks, telecommunication networks, bio-
logical networks (both molecular and neural) and so forth (Table 3.1). The following
list of examples is not exhaustive.
The Internet at the router and inter-domain level has a connectivity distribution
that falls off as a power-law distribution with γ ≈ 2.48 (Faloutsos et al. 1999).
Power-law connectivity has also been reported at the level of peer-to-peer net-
works (Ripeanu et al. 2002).
The call graphs associated with the calls handled by some subset of telephony
carriers over a certain time period (Abello et al. 1998).
The power grid of the western United States (Albert et al. 1999; Phadke and
Thorp 1988) where, for instance, nodes represent generators, transformers, and
substations and edges represent high-voltage transmission lines between them.
The citation network where nodes represent papers and links are associated
with citations (Redner 1998). A similar graph, smaller but famous in the math-
ematics community, is the graph of collaborators of Paul Erdős (see
http://www.acs.oakland.edu/~grossman/erdoshp.html).
The collaboration graph of actors (http://us.imdb.com), where nodes
correspond to actors and links to pairs of actors that have costarred in a film
(Barabási and Albert 1999).
The networks associated with metabolic pathways (Jeong et al. 2000) where
the probability that a given substrate participates in k reactions decays as k^(−γ)
in a representative variety of organisms. These reactions have directions and
both the indegrees and outdegrees follow a power-law distribution with similar
exponents in general. In Escherichia coli, for instance, the exponent is γ = 2.2
for both the indegrees and the outdegrees. In contrast with what happens for the
other networks, Jeong et al. (2000) report that in metabolic networks the diam-
eter, measured by the average distance between substrates, does not seem to
grow logarithmically with the number of molecular species, but rather appears
to be constant for all organisms and independent of the number of molecular
species. (They also observe that the ranking of the most connected substrates
is essentially the same across all organisms). A constant diameter may confer
increased flexibility during evolution.
The networks formed by interacting genes and proteins as described in
Maslov and Sneppen (2002). These authors also report the existence of a
richer level of structure in the corresponding graphs. In particular, direct
Table 3.1 Power-law exponent, average degree, and average diameter (e.g. average minimal
distance between pairs of vertices). Data partly from Barabási and Albert (1999) and Jeong et
al. (2000). Ex., entries left as an exercise; NA, not applicable.
Second, humans do not traverse links from one randomly selected site to another.
Exploratory browsing aside, they go to a given page in response to an informational
need, and generally start from a page that they believe will lead them to their target
quickly; after all, this is what search engines are for.
One additional issue, related to the power-law distribution, small-world networks, and
sample size, is the behavior of the degree distribution for very large values of k and
the difficulty of assessing the behavior of the tail of the distribution with finite data.
Amaral et al. (2000) have analyzed several naturally occurring networks and reported
the existence of three different classes of small-world networks:
(a) scale-free networks characterized by a power-law distribution of connectivity
over the entire range of values;
(b) broad-scale networks, where the power-law distribution applies over a broad
range but is followed by a sharp cutoff; and
(c) single-scale networks with a connectivity distribution that decays exponentially,
as in the case of standard random graph models.
The authors argue that in single and broad-scale networks there are in general addi-
tional constraints that limit the addition of new links, related for instance to the aging
of vertices, their limited capacity, or the cost of adding links. It is not clear, however,
that a given nite network can always be classied in a clean fashion into one of these
three categories. Furthermore, it is also not clear whether the very large-scale tail of
the degree distribution of the Web graph exhibits a cutoff or not, and if so whether it
is due to additional operating constraints.
chain, representing the probability of being in any state, is obtained from the eigenvector
associated with the leading eigenvalue of the transition matrix. Specifically, the page rank r(v)
of page v is the steady-state distribution obtained by solving the system of n linear
equations given by

r(v) = (1 − ε)/n + ε Σ_{u ∈ pa[v]} r(u)/|ch[u]|,   (3.2)

where 0 < ε < 1 is the probability of following a hyperlink (rather than jumping to a
uniformly chosen page), pa[v] is the set of parent nodes, i.e. of pages that point to page v,
and |ch[u]| denotes the outdegree of u, i.e. the size of its set of children nodes.
Analysis of the distribution of PageRank values in Pandurangan et al. (2002) indi-
cates that PageRank also follows a power-law distribution with the same exponent
(namely γ ≈ 2.1) as the indegree distribution and, as in the case of degree distributions,
this distribution seems to be relatively insensitive to the particular snapshot of the
Web used for the measurements.
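A small power-iteration sketch in Python, following the form of Equation (3.2), may help make the definition concrete (the example graph and the value of ε are arbitrary choices; dangling pages receive no special treatment here):

# Sketch: PageRank by power iteration. eps is the probability of following a
# hyperlink; (1 - eps)/n is the uniform jump term, as in Equation (3.2).
def pagerank(children, eps=0.85, iters=100):
    nodes = list(children)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new_r = {v: (1.0 - eps) / n for v in nodes}
        for u in nodes:
            for v in children[u]:
                new_r[v] += eps * r[u] / len(children[u])   # eps * r(u)/|ch[u]|
        r = new_r
    return r

if __name__ == "__main__":
    # children[u] is the list ch[u] of pages that u points to.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))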
Figure 3.1 Internet bow-tie model, adapted from Broder et al. (2000), with four broad regions
of roughly equal size. A strongly connected component (SCC) within which each node can
reach any other node via a directed path. An IN component of nodes that can reach nodes in SCC
via a directed path but cannot be reached by nodes in SCC via a directed path. A similar OUT
component downstream of the SCC component. The fourth component is more heterogeneous
and contains (a) small disconnected components; and (b) tendrils associated with nodes that
are not in SCC and can be reached by a directed path originating in IN or terminating in OUT.
Tubes connect tendrils emanating from IN to tendrils projecting into OUT.
pages, although these proportions could vary in time and would need to be revisited
periodically.
Because the graph in this study contains disconnected components, it is clear that both the max-
imal and average diameter are infinite. Thus, the authors use the maximal minimal path
length and the average path length restricted to pairs of points that can be connected
to each other. With this caveat, they report in their study a maximal minimal diameter
of at least 28 for the central core, and over 500 for the entire graph. For any randomly
selected pair of points, the probability that a directed path exists from the rst to the
second is 0.24 and if a directed path exists, its average length is about 16, reasonably
close to the estimate of 19 obtained in the previous section. The average length of
an undirected path, when it exists, is close to 7. Thus in particular the core has a
small-world structure: the shortest directed path from any page in the core to any
other page in the core involves 16-20 links on average.
The study in Broder et al. (2000) also reports a power-law distribution in the sizes
of the connected components with undirected edges, with an exponent γ = 2.5 over
roughly five orders of magnitude. Bow-tie structures in subgraphs of the Web graph
associated, for instance, with a particular topic are described in Dill et al. (2001).
so that

log S(T) = T log λ + log S(0) + Σ_{t=0}^{T−1} log(1 + δ_t).   (3.5)

The last sum can be written as l log(1 + δ) + (T − l) log(1 − δ),
where l is the number of positive fluctuations. Assuming that the variables δ_t are
independent, by the central limit theorem it is clear that for large values of T the
variable log S(T ) is normally distributed. Alternatively, log S(T ) can be associated
with a binomial distribution counting the number of times δ_t is equal to +δ and the
binomial distribution can be approximated by a Gaussian. In any case, it follows
that S(T ) has a log-normal distribution which, as seen in Chapter 1, is related to
but different from the observed power-law distribution. However, Huberman and
Adamic (1999) report that, if this model is modified to include a wide distribution of
growth rates across different sites and/or the fact that sites have different ages (with
many more young sites than old ones), then simulations show that the distribution of
sizes across sites obeys a power-law distribution (see Mitzenmacher (2002) for more
general background about the relationship between the log-normal and power-law
distributions).
It must be pointed out, however, that the simple Fermi model of Chapter 1 pro-
vides a somewhat cleaner and analytically tractable model that seems to capture the
power-law distribution of website sizes, at least to a first degree of approximation.
To see this, instead of focusing on daily fluctuations, which are more of a second-
order phenomenon, it is sufficient to consider that websites are being continuously
created, that they grow at a constant rate during a growth period, after which their
size remains approximately constant, and that the periods of growth follow an expo-
nential distribution. Assuming, consistently with observations, a power-law exponent
γ ≈ 1.8 for the size, this would give a relationship α ≈ 0.8λ between the rate α of
the exponential distribution and the growth rate λ.
Two quantities can be used to monitor the evolution of the undirected graph structure
during the rewiring process. First, the average diameter, d = d(p), corresponding
to the average distance (number of edges) between any two vertices, which is a
global property of the graph. Second, a more local property measuring the average
density of local connections, or cliquishness, defined as follows. If a vertex v has k_v
neighbors, there are at most k_v(k_v − 1)/2 edges between the corresponding nodes. If
c_v is the corresponding fraction of allowable edges that actually occur, then the global
cliquishness is measured by the average c = c(p) = Σ_v c_v/|V|, where |V| is the total
number of vertices. In a social network, d is the average number of friendships in the
shortest path between two people, and c_v reflects the degree to which friends of v are
friends of each other.
To ensure that the graph is both sparse and connected, the value k satisfies

n ≫ k ≫ log n ≫ 1,

where k ≫ log n ensures that a corresponding random uniform graph remains con-
nected (Bollobás 1985). In this regime, Watts and Strogatz (1998) find that d ≈
n/2k ≫ 1 and c ≈ 3/4 as p → 0, while d ≈ d_random ≈ log n/log k and
c ≈ c_random ≈ k/n ≪ 1 as p → 1. Thus, the regular lattice with p = 0 is a highly
clustered network where the average diameter d grows linearly with n, whereas the
random uniform network (p = 1) is poorly clustered, with d growing logarithmically
with n. These extreme cases may lead one to conjecture that large c is associated with
large d, and small c with small d, but this is not the case.
Through simulations, in particular for the case of a ring lattice, Watts and Stro-
gatz find that for small values of p the graph is a small-world network, with a high
cliquishness like a regular lattice and a small characteristic path length like a traditional
random graph. As p is increased away from zero, there is a rapid drop in d = d(p)
associated with the small-world phenomenon, during which c = c(p) remains almost
constant and equal to the value assumed for the regular lattice over a wide range of
p. Thus, in this model, the transition to a small-world topology appears to be almost
undetectable at the local level. For small values of p, each new long-range connection
has a highly nonlinear effect on the average diameter d. In contrast, removing an edge
from a clustered neighborhood has at most a small linear effect on the cliquishness c.
In the case of a one-dimensional ring lattice, a mean-field solution for the average
path length and for the distribution of path lengths in the model is given in Newman
et al. (2000). The basic idea behind the mean-field approximation is to represent the
distribution of relevant variables over many realizations by their average values. The
authors apply the approximation to the continuous case first and use the fact that when
the density of shortcuts is low, the discrete and continuous models are equivalent (see
also Barthélémy and Amaral (1999) for an analysis of the transition from regular
local lattice to small-world behavior and Amaral et al. (2000) for a classification of
small-world networks).
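The rewiring experiment itself is straightforward to reproduce. The Python sketch below is an illustration only (n, k, and the values of p are arbitrary choices): it builds a ring lattice, rewires each edge with probability p, and reports the cliquishness c(p) and the average path length d(p), showing the rapid drop in d while c stays high for small p.

# Sketch: Watts-Strogatz rewiring of a ring lattice, with cliquishness and
# average path length computed by brute force (small graphs only).
import random
from collections import deque

def ws_graph(n, k, p):
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            if random.random() < p:              # rewire this lattice edge
                w = random.randrange(n)
                while w == v or w in adj[v]:
                    w = random.randrange(n)
            adj[v].add(w)
            adj[w].add(v)
    return adj

def cliquishness(adj):
    total = 0.0
    for v, nb in adj.items():
        nb = list(nb)
        possible = len(nb) * (len(nb) - 1) / 2
        actual = sum(1 for i in range(len(nb)) for j in range(i + 1, len(nb))
                     if nb[j] in adj[nb[i]])
        total += actual / possible if possible else 0.0
    return total / len(adj)

def avg_path_length(adj):
    total, pairs = 0, 0
    for s in adj:                                 # BFS from every vertex
        dist, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

if __name__ == "__main__":
    random.seed(0)
    for p in (0.0, 0.01, 0.1, 1.0):
        adj = ws_graph(400, 10, p)
        print(p, round(cliquishness(adj), 3), round(avg_path_length(adj), 2))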
Even as a very vague model of the Web graph, however, the lattice perturbation
model has several limitations. First, there is no clear concept of an underlying lattice
and the notion of short and long links on the Web is not clearly defined. While short
Table 3.2 Small-world networks according to Watts and Strogatz (1998). n, number of ver-
tices; k, average degree; d, average distance or diameter; c, cliquishness. Random graphs have
the same number of vertices and same average connectivity. The difference in average connec-
tivity in the actor networks with respect to Table 3.1 results apparently from the inclusion of
TV series in addition to films (A.-L. Barabási 2002, personal communication).
links could correspond to links within organizations and some correlation between
link density and geographical distance may be expected, the whole point of the Web
is to make a link between pages that are geographically distant as easy to create
as a link between pages that are geographically close. More importantly perhaps, it
can be shown that the edge rewiring procedure does not yield a graph with power-
law distributed connectivity (see also Barthélémy and Amaral 1999). The degree
distribution remains bounded and strongly concentrated around its mean value. Thus,
although the lattice rewiring procedure yields small-world networks, it does not yield
a good model for the connectivity of the Web graph. Other models are needed to try
to account for the scale-free connectivity distribution.
of the exponent. Using this precise model, Bollobás et al. (2001) prove that γ = 3,
first for the case of m = 1 and then for the general case for degrees up to O(√n), by
deriving the general case from the case of m = 1.
It has been shown, via simulations, that this model of preferential attachment leads
to small-world networks, i.e. graphs with small diameter, with an asymptotic diameter
of roughly log n, where n is the number of growth steps (or, equivalently, the number
of vertices). Bollobs and Riordan (2003) have shown rigorously that log n is the
right asymptotic value when m = 1 and that for m 2 the correct asymptotic
bound is in fact log n/ log log n. It is worth observing that for m = 1 the graphs
are essentially trees so that the bound in this case, as well as other results, can be
derived from the theory of random recursive trees (see Mahmoud and Smythe 1995;
Pittel 1994). (A recursive tree is a tree on vertices numbered {1, 2, . . . , n}, where
each vertex other than the first is attached to an earlier vertex. In a random recursive
tree this attachment occurs at random with a particular distribution such as uniform,
preferential attachment, and so forth.)
Of interest is the observation that a diameter growing like log n is in fact very com-
mon in a variety of random graphs, including random regular graphs (Bollobás and
de la Vega 1982). Thus, from a random graph standpoint, the small-world properties of
the Internet are perhaps not that surprising. In fact, rather than being surprised by our
naive six degrees of separation, the question becomes why the diameter would be so
large! The smaller diameter of log n/ log log n obtained when m > 1 is a step in that
direction.
The model above only yields a characteristic exponent value of γ = 3 and produces
only undirected edges. The graphs it produces are also too structured to be good
models of the Internet: for instance when m = 1 the graph consists of M0 trees, since
there are M0 initial components and m = 1 ensures that each component remains a
tree. It is clear, however, that the model can be modied to produce directed edges,
accommodate other exponents, and to produce more complex topologies.
To address the edge orientation problem, the orientation of each new edge could
be chosen at random with a probability of 0.5 and the preferential attachment rule
could take into consideration both indegrees and outdegrees. The fixed exponent
problem can be obviated, for instance, with a richer mixture model where not only
new nodes, but also new links are added (without adding nodes), and links are also
rewired (Albert and Barabási 2000) (see also Cooper and Frieze (2001) for a similar
model and an asymptotic mathematical analysis based on linear difference equations).
In one implementation, the model above is extended by incorporating at each step
the possibility of adding m new edges (without adding nodes) or rewiring m existing
edges guided, in both cases, by the preferential attachment distribution. The three basic
processes (adding nodes, rewiring, adding links) are assigned mixture probabilities
p, q and 1 − p − q. An extension of the mean-field treatment given above shows that
depending on the relative probability of each one of the three basic processes and m
(the number of new nodes or edges at each step), the system can lead to either an
exponential regime or a power-law regime, with different power-law exponents γ.
Clearly these variations also break the tree-like structure of the simple model.
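A basic simulation of preferential attachment is sketched below in Python (an illustration only; n and m are arbitrary choices). Growing the graph one node at a time and choosing attachment targets with probability proportional to degree produces a degree histogram whose log-log plot is roughly linear with slope close to −3, consistent with γ = 3.

# Sketch of the "rich get richer" growth process: each new node attaches m
# undirected edges to existing nodes with probability proportional to degree.
import random
from collections import Counter

def preferential_attachment(n, m=2):
    targets = list(range(m))     # seed nodes
    repeated = []                # each node appears here once per unit of degree
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:
            if repeated:
                chosen.add(random.choice(repeated))   # probability proportional to degree
            else:
                chosen.add(random.choice(targets))    # uniform among seed nodes at the start
        for t in chosen:
            repeated.extend([new, t])
    return Counter(repeated)     # node -> degree

if __name__ == "__main__":
    random.seed(0)
    deg = preferential_attachment(100000, m=2)
    hist = Counter(deg.values())
    for k in sorted(hist)[:10]:
        print(k, hist[k])        # log-log plot of this histogram is roughly a line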
The preferential attachment model is quite different from the standard random graph models, where the size is fixed. Even without the preferential attachment mechanism, when graphs grow older, nodes tend to have higher connectivity simply due to their age, and this fact alone tends to remove some of the randomness (Callaway et al. 2001). The simplicity of the preferential attachment model is its main virtue but also its main weakness, and it can be viewed at best as a first-order approximation. While the model does reproduce the scale-free and small-world properties associated with the Web graph, it is clear that more realistic models need to take into account other higher order effects, including deletion of pages or links, differences in attachment rates, weak correlations to a variety of variables such as Euclidean distance, and so forth. Deviation from power-law scaling has been observed, especially for nodes with low connectivity. In addition, the deviations seem to vary for different categories of pages (Pennock et al. 2002). For example, the distribution of hyperlinks to university home pages diverges strongly from a power-law distribution and seems to follow a far more uniform distribution. Additional work is needed to better understand the details of the distribution and its fluctuations, and to create more precise generative models that can capture such fluctuations and other higher order effects.
to an existing node with a probability that in the directed case strongly depends on
its indegree. If an existing page or node v has indegree |pa[v]| and each of its parents i = 1, . . . , |pa[v]| has outdegree |ch[i]|, then the probability of connecting to v through the copy mechanism is given by

\sum_{i=1}^{|pa[v]|} \frac{1}{n\,|ch[i]|}.
This is because once the copy mechanism has been selected, there is a 1/n chance of
choosing a given parent i of v, and a 1/|ch[i]| chance of choosing the edge running
from i to v. Thus, everything else being equal, the copying process favors creating a
link to nodes that already have a large indegree and therefore is a form of preferential
attachment based on vertex degrees (to be exact, the most favored attachment nodes
are those with very high indegree and whose parent pages have low outdegree).
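As an illustration, the following sketch grows a graph with the copy mechanism just described. The mixing probability beta for uniform attachment, the single out-link per new page, and the small seed cycle are assumptions introduced only to keep the example short.

# Sketch of the copy mechanism: each new page either links to a uniformly
# chosen page (probability beta) or copies the destination of a random
# out-link of a uniformly chosen existing page (probability 1 - beta).
import random

def copy_model(T, beta=0.2, seed_size=3):
    out_links = {v: [(v + 1) % seed_size] for v in range(seed_size)}  # seed cycle
    for _ in range(T):
        new = len(out_links)
        existing = list(out_links)
        if random.random() < beta:
            target = random.choice(existing)              # uniform attachment
        else:
            prototype = random.choice(existing)           # 1/n
            target = random.choice(out_links[prototype])  # 1/|ch[prototype]|
        out_links[new] = [target]
    return out_links

# usage: monitor the indegree distribution produced by the copying process
indegree = {}
for targets in copy_model(5000).values():
    for t in targets:
        indegree[t] = indegree.get(t, 0) + 1

Monitoring the indegree counts produced by this process shows the rich-get-richer effect of copying: pages with many parents whose parents have few out-links accumulate links fastest, as discussed above.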
3.3 Applications
The graphical properties of the Web and other networks are worth investigating for
their own sake. In addition, these properties may also shed light on some of the
functional properties of these networks, such as robustness. In the case of the Web,
they may also lead to better algorithms for exploiting the resources offered by the
underlying network. Systems with small-world properties may have enhanced signal-
propagation speed and synchronizability. In particular, on the Web, an intelligent
agent that can interpret the links and follow the most relevant ones should be able
to find information rapidly. An intelligent agent ought also to be able to exploit
the correlations that exist between content and connectivity. Furthermore, thematic
communities of various sorts ought to be detectable by patterns of denser connectivity
that stand above the general background of hyperlinks.
be forwarded at each step by the sender to someone they knew by their first name. The experiment revealed that the typical letter path had a length of six, hence the term "six degrees of separation". The success of this experiment suggests that it may be possible for intelligent agents to find short paths in complex networks using only local address information. Furthermore, the experiment also provides some clues about possible algorithms: "The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target areas as each new person is added to the chain" (Milgram 1967; see also Hunter and Shotland 1974; Killworth and Bernard 1978; White 1970).
This observation, as well as casual observation of how we seek information over the Web, suggests a trivial greedy algorithm, where at each step the agent locally chooses whichever link is available to move as close as possible to the target. The problem lies in the definition of closeness and in the kind of information available regarding closeness at each node. Clearly, in the extreme case where distances are measured in terms of path lengths in the underlying graph and where the distance to the target node is available at each node (but not necessarily the path), an agent can find the shortest path with the obvious greedy algorithm, requiring a number of steps equal to the distance (hence, on average, a number of steps polynomial in log n in a small-world network). But agents navigating the Web using local information would not have precise access to graph distances. Thus, more sophisticated or mathematically interesting models must rely on a weaker form of local knowledge and on weaker estimates or approximations of the distance to the target. At the other simple extreme, consider the case where, when the agent examines the links emanating from a node, instead of having access to precise information on how much closer to or further away from the target node each link would bring him, the only local information he is given is noisy binary information on whether each link brings him closer to or further away from the target.
In Kleinberg (2000a,b, 2001) local greedy navigation algorithms are discussed for lattices that have been perturbed, along the lines of the model in Watts and Strogatz (1998), with directed connections. These are square n = l × l lattices in two dimensions or, more generally, n = l^m cubic lattices in m dimensions. Each vertex is fully connected to all its neighbors within lattice distance l_0, with an additional q independent long-range connections to nodes at lattice distance r chosen randomly with probability proportional to r^{−α}. When α = 0, this corresponds to a uniform choice of long-range connections.
The greedy algorithm to navigate from u to v always chooses the connection that gets closest to the target. In two dimensions (m = 2), the greedy algorithm has a rapid delivery time, measured by the expected number of steps, which is bounded by a function proportional to (log n)^2 provided α = m = 2. In particular, in the case of uniform long-range connections (α = 0), no decentralized algorithm can achieve rapid delivery time. This uniform case corresponds to "a world in which short chains exist but individuals, faced with a disorienting array of social contacts, are unable to find them" (Kleinberg 2000b). Similar results hold in m dimensions: the greedy algorithm has a rapid (polynomial in log n) delivery time if and only if α = m, while for α ≠ m the delivery time is asymptotically much larger, growing like a polynomial in l.
This result is interesting but should be viewed only as a first step. Its significance is undermined by the fact that the graphs being considered are regular instead of power-law distributed. Furthermore, the behavior does not seem to be robust and requires a very specific value of the exponent α. Finally, the routing algorithm requires knowledge of lattice distances between nodes and of the global position of the target, which may not be trivial to replicate in a realistic Web setting. Further extensions and analysis are the object of ongoing research in the context, for instance, of peer-to-peer systems, such as Gnutella (Adamic et al. 2001) or Freenet (Zhang and Iyengar 2002). A model for searchability in social networks that leverages the list of attributes identifying each individual is described in Watts et al. (2002).
From a different perspective, given the problem of navigating from u to v, an intelligent agent starting at u can be modeled as an agent that somehow knows, with some error probability, whether any given edge brings it closer to its target v. This captures the idea that a human trying to find the date of birth of Albert Einstein is more likely to follow a link to a page on Nobel prizes in physics than a link to a page on Nobel prizes in economics. If, for the sake of the argument, the typical distance between u and v in a small-world network is log n, then an agent that can detect good hyperlinks with an error rate smaller than 1/log n will typically make no mistakes along a path of length log n and therefore will typically reach the target. However, even if the agent has an error rate of 0.5, then, by a simple diffusion (or random walk) argument, it will take on the order of log² n steps to reach the target, which is still polynomial in log n. Thus, in this sense, the problem of building an intelligent agent can be reduced to the problem of building an agent that can discern good links with a 0.5 error rate or better.
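A rough simulation of this argument is sketched below: for each neighbor of its current node, the agent receives a binary closer/farther signal that is flipped with probability p, and it greedily moves to a neighbor flagged as closer (or to a random neighbor if none is flagged). The adjacency-dictionary representation and the exact BFS distance oracle are assumptions made only for the purpose of the experiment.

# Noisy greedy navigation: how many steps does it take to reach the target
# when each "does this edge bring me closer?" bit is wrong with probability p?
# Assumes a connected undirected graph given as {node: [neighbors]}.
import random
from collections import deque

def bfs_distances(adj, target):
    dist = {target: 0}
    queue = deque([target])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def noisy_greedy_path_length(adj, source, target, p, max_steps=10_000):
    dist = bfs_distances(adj, target)
    current, steps = source, 0
    while current != target and steps < max_steps:
        flagged = []
        for w in adj[current]:
            closer = dist[w] < dist[current]
            if random.random() < p:       # the signal is flipped with probability p
                closer = not closer
            if closer:
                flagged.append(w)
        current = random.choice(flagged if flagged else list(adj[current]))
        steps += 1
    return steps

Averaging the returned path length over many source/target pairs, for p close to 0 and p close to 0.5, reproduces the log n versus log² n behaviour sketched above.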
density of the Web graph corresponds to a sparse graph where the number of edges scales linearly with the number of vertices. Deviations from this background can be defined in many ways. For example, a community can be defined as any cluster of nodes such that the density of links between members of the community (in either direction) is higher than the density of links between members of the community and the rest of the network. Variations on this definition are possible, to detect communities with varying levels of cohesiveness.
Another characteristic pattern in many communities is the hub-authority bipartite graph pattern (Kleinberg and Lawrence 2001), where a hub is an index or resource page that points to many authority or reference pages in a correlated fashion; likewise, an authority page is a page that is pointed to by many hub pages. This is studied further in Chapter 5. Other patterns discussed in the literature include the case where authorities are linked directly to other authorities (Brin and Page 1998).
Once a particular subgraph pattern is selected as a signature of a community, the Web graph can be searched to identify occurrences of such signatures. In many cases, the general graphical search problem is NP-complete. However, polynomial time algorithms can be derived under some additional assumptions. In Flake et al. (2002), for instance, communities are identified using maximum flow algorithms. Large-scale studies reported in the literature have identified hundreds of thousands of communities, including some unexpected ones (e.g. people concerned with oil spills off the coast of Japan). Although analyses that are purely structural can yield interesting results, it is clear that it is possible to integrate structure and content analysis (Chakrabarti et al. 2002). For instance, once a community is found based on hyperlink structure, words that are common to the pages in the community can be identified and used to further define and reliably extract the community from the rest of the network. Likewise, textual analysis can be used to seed a community search.
Discovering communities and their evolution may be useful for a number of appli-
cations including:
The importance of link analysis for search engines and information retrieval will be
explored further in Chapter 5. It must be pointed out, however, that the utilization of
link analysis in these applications opens the door for second-order applications. For
instance, search engines based on link analysis may tend to increase the popularity
of pages that are already popular and it may be important to develop methods to
attenuate this effect.
There is a clear sense that many of the networks studied in this chapter are in some sense robust. In spite of frequent router problems, website problems, and temporary unavailability of many Web pages, we have not yet suffered a global Web outage. Likewise, mutation or even removal of a few enzymes in E. coli is unlikely to significantly disrupt the function of the entire system. Mathematical support for such intuition is provided by studying how the diameter or connectivity of the graph is affected by deleting nodes randomly. When a few nodes are deleted at random, the graph remains by and large connected and the small-world property remains intact. In other words, the systems are robust with respect to random errors or mutations. However, as pointed out in Albert et al. (2000), the systems considered in this chapter are vulnerable to targeted attacks that focus on the nodes with the highest connectivity. These targeted attacks can significantly disrupt the properties of the systems.
To be more specific, with the same number of nodes and connections, a scale-free
graph has a smaller diameter in terms of average shortest path than a uniform random
graph. In a random uniform graph (sometimes also called random exponential graph,
when referring to the decay of the connectivity) all nodes are more or less equivalent.
In simulations reported in Albert et al. (2000) using random exponential and scale-
free graphs, when a small fraction f (up to a few percentage points) of nodes are
removed at random, the diameter increases monotonically (linearly) in an exponential
graph, whereas the diameter remains essentially constant in a scale-free graph with
the same number of nodes and edges. Thus, in scale-free graphs, the ability of nodes
to communicate is unaffected even by high rates of random failures. Intuitively, this
is not surprising and is rooted in the properties of the degree distribution of each kind
of graph. In an exponential network, all the nodes are essentially equivalent and have
similar degrees. As a result, each node contributes equally to the connectivity and to
the value of the diameter. In a scale-free graph with a skewed degree distribution,
randomly selected nodes are likely to have small degrees and their removal does not
disrupt the overall connectivity.
This situation, however, is somewhat reversed during an attack in which specific nodes are targeted. An attack is simulated in Albert and Barabási (2000) by removing nodes in decreasing order of connectivity. In an exponential graph, there is essentially no difference between random deletions and attack. In contrast, the diameter increases rapidly in the scale-free network, roughly doubling when 5% of the nodes are removed. A higher degree of robustness of the undirected Web graph to attacks, however, has been reported in other studies (Broder et al. 2000).
Similar effects are also observed when the connected components of both kinds of graphs are studied, under both random error and targeted attack scenarios. In particular, the scale-free network suffers a substantial change in its topology under attack, in the sense that it fragments and breaks down into many different connected components, whereas it basically remains connected during random removal of a few nodes.
To be more precise, two statistics that can be used to monitor the connected components during node deletion are the size S of the largest connected component relative to the size of the system and the average size s of the connected components other than the largest one. In the case of an exponential graph, when f is small both S ≈ 1 and s ≈ 1, and only isolated nodes fall off the network. As f increases, s increases, until a phase transition occurs around f = 0.28, where the connectivity breaks down, the network fragments into many components, and s reaches a maximum value of 2. Beyond the transition point, s decreases progressively toward unity and S rapidly decreases toward zero. This behavior is observed both under random node deletion and under attack.
In contrast, in a scale-free network under random deletion of nodes, as f is increased S decreases linearly and s remains close to unity. In the case of attack, however, the network breaks down in a manner similar to, but faster than, the breakdown of the exponential network under random deletion: S drops rapidly to zero with a critical threshold of 0.18 instead of 0.28, and s has a peak of two, reached at the critical point. When tested on a directed subgraph of the entire Web, containing 325 729 nodes and 1 469 680 links, this critical point is found to occur at f = 0.067 (Albert and Barabási 2000).
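The following sketch reproduces the spirit of these experiments on any undirected graph stored as an adjacency dictionary: a fraction f of the nodes is deleted either at random or in decreasing order of degree, and the statistics S and the average size of the smaller components are recomputed. Generating the exponential or scale-free graph itself is left to the models of Section 3.2.

# Error/attack experiment: remove a fraction f of nodes (random or by degree)
# and report S (relative size of the largest component) and the average size
# of the remaining components.
import random

def components(adj, removed):
    seen, sizes = set(removed), []
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)

def attack_statistics(adj, f, targeted=False):
    n = len(adj)
    k = int(f * n)
    if targeted:
        removed = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]
    else:
        removed = random.sample(list(adj), k)
    sizes = components(adj, set(removed))
    S = sizes[0] / n if sizes else 0.0
    rest = sizes[1:]
    avg_s = sum(rest) / len(rest) if rest else 0.0
    return S, avg_s

Sweeping f from 0 to about 0.3 with targeted=False and targeted=True, and plotting the two statistics, reproduces the qualitative behaviour described above (and is essentially Exercise 3.12).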
There are many other important aspects of Internet/Web networks (robustness, security, and vulnerability) that cannot be addressed here for lack of space. We encourage the interested reader to explore the literature further and limit ourselves to providing two additional pointers. On the vulnerability side, it is worth mentioning that the architecture and protocols of the Internet can be exploited to carry out parasitic computations in which servers unwittingly perform computations on behalf of a remote node (Barabási et al. 2001). On the robustness side, peer-to-peer networks such as Gnutella (Ripeanu et al. 2002) or Freenet (Clarke et al. 2000, 2002) are robust Internet network architectures that do not rely on centralized servers.
3.5 Exercises
Exercise 3.1. Provide at least two different back-of-the-envelope estimates of the size (in terms of the number of vertices and the number of edges) of the graph associated with the documents and hyperlinks of the Web. Refine such estimates by sampling IP addresses (using the ping command) or Web pages obtained by some kind of crawling. Discuss the limitations of these estimates and how they could be improved. For instance, the ping command samples only active machines that are authorized to reply to the ping query. Consider also the possibility of probing typical ports (e.g. 80, 8080).
Exercise 3.2. Crawl the entire website of your university or some other large organization and compute the statistics of the corresponding graph, including the distribution of the connectivity and the diameter, both in the directed and undirected cases. [Easier alternatives to the crawl consist in downloading a publicly available data set or using a program such as wget (http://www.gnu.org/software/wget/wget.html).]
Exercise 3.3. Show that, if the connectivity of a large graph with n vertices follows a power law with exponent γ > 2, then the graph is sparse in the sense that it has O(n) edges.
Exercise 3.4. In this chapter, we have listed the power-law exponents observed for several phenomena. In general, the values of the exponent are relatively small, say less than 10. How general is this and why?
Exercise 3.5. Fill in some of the entries in Table 3.1.
Exercise 3.6. Explore possible relationships between the Fermi model of Chapter 1
and the models introduced in this chapter to account for the power-law distribution
of the size of the sites, or the power-law distribution of the connectivity.
Exercise 3.7. Simulate directly random directed graphs where the indegrees and
outdegrees of the nodes follow a power-law distribution with the exponents given in
Table 3.1. Estimate the average and maximal shortest distance between vertices, and
how it scales with the number n of vertices. Study the distribution of this shortest
distance across all possible pairs of vertices.
Exercise 3.8. Run simulations of lattice perturbation models, along the lines of those
found in Watts and Strogatz (1998), on ring, square, and cubic lattices. Progressively
increase the amount of random long-ranged connections in these graphs and monitor
relevant quantities such as degree distribution, cliquishness, and average/maximal
distance between vertices.
Exercise 3.9. Simulate the rich get richer model and monitor relevant quantities
such as degree distribution, cliquishness, and average/maximal distance between ver-
tices. Examine modifications of the basic model incorporating any of the following:
edge orientation, edge rewiring, edge creation and monitor relevant quantities. Try to
analyze these modifications using the mean field approach associated with Equations (3.6) and (3.7). Examine a growth model with uniform, rather than preferential, attachment. Show through simulations or a mean field type of analysis that the distribution of degrees is geometric rather than power law.
Exercise 3.10. Simulate a version of the copy model and monitor relevant quantities such as degree distribution, cliquishness, and average/maximal distance between vertices.
Exercise 3.11. First simulate a random network with power-law connectivity and small-world properties using, for instance, the preferential attachment model of Barabási and Albert (1999). Then consider an agent that can navigate the network and is faced with the typical task of finding a short path between two vertices u and v using only local information. Suppose that each time the agent considers an edge, it has binary information on whether this edge brings it closer to the target or not, and that this information comes with some error rate p. Design a navigation strategy for the agent and estimate its time complexity by simulations and by a back-of-the-envelope calculation. Simulate this strategy on several of the networks discussed in this chapter.
Study how the typical travel time varies with p. Study the robustness of this strategy with respect to fluctuations in p. Modify the strategy to include a bias toward nodes with high degree, or high PageRank, and study its time complexity. How would you model the effect of a search engine in this framework?
Exercise 3.12. Run simulations to study the effect of random and selective node
deletion on the diameter and fragmentation of scale-free and exponential graphs.
4
Text Analysis
Having focused in earlier chapters on the general structure of the Web, in this chapter we will discuss in some detail techniques for analyzing the textual content of individual Web pages. The techniques presented here have been developed within the fields of information retrieval (IR) and machine learning and include indexing, scoring, and categorization of textual documents.
The focus of IR is accessing, as efficiently and as accurately as possible, a small subset of documents that is maximally related to some user interest. User interest can be expressed, for example, by a query specified by the user. Retrieval includes two separate subproblems: indexing the collection of documents in order to improve the computational efficiency of access, and ranking documents according to some importance criterion in order to improve accuracy. Categorization or classification of documents is another useful technique, somewhat related to information retrieval, that consists of assigning a document to one or more predefined categories. A classifier can be used, for example, to distinguish between relevant and irrelevant documents (where the relevance can be personalized for a particular user or group of users), or to help in the semiautomatic construction of large Web-based knowledge bases or hierarchical directories of topics like the Open Directory (http://dmoz.org/).
A vast portion of the Web consists of text documents; thus, methods for automatically analyzing text have great importance in the context of the Web. Of course, retrieval and classification methods for text, such as those reviewed in this chapter, can be specialized or modified for other types of Web documents such as images, audio or video (see, for example, Del Bimbo 1999), but our focus in this chapter will be on text.
4.1 Indexing
4.1.1 Basic concepts
In order to retrieve text documents efficiently it is necessary to enrich the collection with specialized data structures that facilitate access to documents in response to
Figure 4.1 Structure of a typical inverted index for information retrieval. Each entry in the occurrence lists (buckets) is a pair of indices that identify the document and the offset within the document where the term associated with the bucket appears. Document (1) is an excerpt from Sen. Fritz Hollings's Consumer Broadband and Digital Television Promotion Act, published at http://www.politechbot.com/docs/cbdtpa/hollings.s2048.032102.html. Document (2) is an excerpt from Stallman (1997) and (3) is an excerpt from the declaration submitted by ACM in the District Court of New Jersey in the case of Edward Felten versus the Recording Industry Association of America (http://www.acm.org/usacm/copyright/felten declaration.html).
user queries. A substring search, even when implemented using sophisticated algorithms like suffix trees or suffix arrays (Manber and Myers 1990), is not adequate for searching very large text collections. Many different methods of text retrieval have been proposed in the literature, including early attempts such as clustering (Salton 1971) and the use of signature files (Faloutsos and Christodoulakis 1984). In practice, inversion (Berry and Browne 1999; Witten et al. 1999) is the only effective technique for dealing with very large sets of documents. The method relies on the construction of a data structure, called an inverted index, which associates lexical items with their occurrences in the collection of documents. Lexical items in text retrieval are called terms and may consist of words as well as expressions. The set of terms of interest is called the vocabulary, denoted V. In its simplest form, an inverted index is a dictionary where each key is a term ω ∈ V and the associated value b(ω) is a bucket, i.e. the list of occurrences of ω, each stored as a pair identifying a document and an offset within that document (see Figure 4.1).
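A toy version of such an inverted index can be written in a few lines of Python; the lowercased whitespace tokenizer used here is a crude placeholder for the lexical processing discussed in Section 4.2.

# Minimal inverted index: each term maps to a list of (document id, offset) pairs.
from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict mapping doc id -> text. Returns term -> [(doc, pos), ...]."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index

def query(index, term):
    return index.get(term.lower(), [])

docs = {1: "standard security technologies", 2: "security midterm", 3: "computer systems"}
idx = build_inverted_index(docs)
print(query(idx, "security"))   # [(1, 1), (2, 0)]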
structural information. However, the reader should be aware that, although HTML has a clearly defined formal grammar, real-world browsers do not strictly enforce syntax correctness and, as a result, most Web pages fail to comply rigorously with the HTML syntax. 1 Hence, an HTML parser must tolerate errors and include recovery mechanisms to be of any practical usefulness. A public domain parser is distributed with Libwww, the W3C Protocol Library (http://www.w3.org/Library/). An HTML parser written in Perl (with methods for text extraction) is available at http://search.cpan.org/dist/HTML-Parser/.
Obviously, after plain text is extracted, punctuation and other special characters need to be stripped off. In addition, the character case may be folded (e.g. to all lowercase characters) to reduce the number of index terms.
Besides HTML, textual documents on the Web come in a large variety of formats. Some formats are proprietary and undisclosed, and extracting text from such file types is severely limited precisely because their specifications are not publicly available. Other formats are publicly documented, such as PostScript 2 or the Portable Document Format (PDF).
Tokenization of PostScript or PDF files can be difficult to handle because these are not data formats but algorithmic descriptions of how the document should be rendered. In particular, PostScript is an interpreted programming language and text rendering is controlled by a set of show commands. Arguments to show commands are text strings and two-dimensional coordinates. However, these strings are not necessarily entire words. For example, in order to perform typographical operations such as kerning, ligatures, or hyphenation, words are typically split into fragments and separate show commands are issued for each of the fragments. Show commands do not need to appear in reading order, so it is necessary to track the two-dimensional position of each shown string and use information about the font in order to correctly reconstruct word boundaries. PostScript is almost always generated automatically by other programs, such as typesetting systems and printer drivers, which further complicates matters because different generators follow different conventions and approaches. In fact, perfect conversion is not always possible. As an example of efforts in this area, the reader can consult Neville-Manning and Reed (1996) for details on PreScript, a PostScript-to-plain-text converter developed within the New Zealand Digital Library project (Witten et al. 1996). Another converter is Pstotext, developed within the Virtual Paper project (http://research.compaq.com/SRC/virtualpaper/home.html).
PDF is a binary format that is based on the same core imaging model as PostScript but can contain additional pieces of information, including descriptive and administrative metadata, as well as structural elements, hyperlinks, and even sound or video. In terms of delivered content, PDF files are therefore much closer in structure to Web pages than PostScript files are. PDF files can (and frequently do, in the case of digital libraries) embed raster images of scanned textual documents. In order to extract text
Figure 4.2 Vector-space representations x and x′ of two documents d and d′.
approach is based on the vector-space representation and the metric defined by the cosine coefficient (Salton and McGill 1983). This measure is simply the cosine of the angle formed by the vector-space representations x and x′ of the two documents (see Figure 4.2),

cos(x, x′) = \frac{x^T x′}{\|x\|\,\|x′\|} = \frac{x^T x′}{\sqrt{x^T x}\,\sqrt{x′^T x′}},   (4.1)

where the superscript T denotes the transpose operator and x^T y indicates the dot product, or inner product, between two vectors x, y ∈ R^m, defined as

x^T y = \sum_{i=1}^{m} x_i y_i.   (4.2)

Note that in the case of two sparse vectors x and y associated with two documents d and d′, the above sum can be computed efficiently in time O(|d| + |d′|) (see Exercise 4.2).
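For sparse documents stored as term-to-weight dictionaries, the cosine coefficient can be computed as sketched below; iterating over the smaller dictionary in the dot product is what gives the O(|d| + |d′|) behaviour mentioned above. The sample vectors are purely illustrative.

# Cosine coefficient of Equation (4.1) for sparse term -> weight dictionaries.
import math

def dot(x, y):
    if len(x) > len(y):         # iterate over the smaller dictionary
        x, y = y, x
    return sum(w * y.get(term, 0.0) for term, w in x.items())

def cosine(x, y):
    norm = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / norm if norm > 0 else 0.0

d1 = {"linux": 1.0, "software": 1.0, "released": 1.0}
d2 = {"linux": 1.0, "kernel": 1.0}
print(round(cosine(d1, d2), 3))   # 0.408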
Several refinements can be obtained by extending the Boolean vector model and introducing real-valued weights associated with the terms in a document. A more informative weighting scheme consists of counting the actual number of occurrences of each term in the document. In this case x_j ∈ N counts term occurrences in the corresponding document (see Figure 4.3). x may be multiplied by the constant 1/|d| to obtain a vector of term frequencies (TF) within the document.
An important family of weighting schemes combines term frequencies (which are relative to each document) with an absolute measure of term importance called the inverse document frequency (IDF). IDF decreases as the number of documents in which the term occurs increases in a given collection, so terms that are globally rare receive a higher weight.
Formally, let D = {d_1, . . . , d_n} be a collection of documents and, for each term ω_j, let n_ij denote the number of occurrences of ω_j in d_i and n_j the number of documents that contain ω_j at least once. Then we define

TF_ij = \frac{n_{ij}}{|d_i|},   (4.3)

IDF_j = \log \frac{n}{n_j}.   (4.4)

Here the logarithmic function is employed as a damping factor.
The TF-IDF weight (Salton et al. 1983) of ω_j in d_i can be computed as

x_{ij} = TF_{ij}\,IDF_{j}   (4.5)

or, alternatively, as

x_{ij} = \frac{TF_{ij}\,IDF_{j}}{\max_{k ∈ d_i} TF_{ik}\,\max_{k ∈ d_i} IDF_{k}}.   (4.6)
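The following sketch computes TF-IDF weights directly from Equations (4.3)-(4.5), using the reconstruction IDF_j = log(n/n_j) adopted above; documents are given as lists of tokens, and the max-normalized variant of Equation (4.6) is omitted.

# TF-IDF weights for a small collection of tokenized documents.
import math
from collections import Counter

def tfidf(documents):
    n = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))       # n_j: documents containing the term
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens)
        weights.append({
            term: (c / length) * math.log(n / doc_freq[term])
            for term, c in counts.items()
        })
    return weights

docs = [["apple", "dvd", "burning"], ["apple", "security", "act"], ["dvd", "act"]]
for w in tfidf(docs):
    print({t: round(v, 3) for t, v in w.items()})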
The IDF weighting is commonly used as an effective heuristic. A theoretical justification has recently been proposed by Papineni (2001), who proved that IDF is the optimal weight of a term with respect to the minimization of a distance function that generalizes the Kullback-Leibler divergence, or relative entropy (see Appendix A).

Figure 4.3 Vector-space representations. For each of the two documents, the left vector counts the number of occurrences of each term, while the right vector is based on TF-IDF weights (Equation 4.5).
the collection. Let q ∈ R^m denote the vector associated with a user query (terms that are present in the query but not in V will be stripped off). Each document is then assigned a score, relative to the query, by computing s(x_i, q), i = 1, . . . , n. The set R of retrieved documents that are presented to the user can be formed by collecting the top-ranking documents according to the similarity measure. The quality of the returned collection can be defined by comparing R to the set of documents R* that is actually relevant to the query. 3
Two common metrics for comparing R and R* are precision and recall. Precision is defined as the fraction of retrieved documents that are actually relevant. Recall is defined as the fraction of relevant documents that are retrieved by the system. More precisely,

π = \frac{|R ∩ R*|}{|R|},    ρ = \frac{|R ∩ R*|}{|R*|}.

Note that in this context the ratio between relevant and irrelevant documents is typically very small. For this reason, other common evaluation measures like accuracy or error rate (see Section 4.6.5), where the denominator consists of |D|, would be inadequate (it would suffice to retrieve nothing to get very high accuracy). Sometimes precision and recall are combined into a single number called the F measure, defined as

F_β = \frac{(β^2 + 1)\,π\,ρ}{β^2 π + ρ}.   (4.7)

Note that the F_1 measure is the harmonic mean of precision and recall. If β tends to zero (∞) the F measure tends to precision (recall).
3 Of course this is possible only on controlled collections, such as those prepared for the Text REtrieval
Conference (TREC) (data, papers describing methodologies and description of evaluation measures and
assessment criteria are available at the TREC website http://trec.nist.gov/).
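In code, precision, recall and the F measure of Equation (4.7) can be computed from the retrieved set R and the relevant set R* as follows; the sets of document identifiers used in the example are illustrative.

# Precision, recall and F measure from a retrieved set and a relevant set.
def precision_recall_f(retrieved, relevant, beta=1.0):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(precision_recall_f({1, 2, 3, 4}, {2, 4, 5}))   # (0.5, 0.666..., 0.571...)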
The PRP resembles the optimal Bayes decision rule for classification, a concept that is well known, for example, in the pattern recognition community (Duda and Hart 1973). The Bayes optimal separation is obtained by ensuring that the posterior probability of the correct class (given the observed pattern) is greater than the posterior probability of any other class. The PRP can be mathematically stated by introducing a Boolean variable R (relevance) and by defining a model for the conditional probability P(R | d, q) of relevance of a document d for a given user need (for example, expressed through a query q). Its justification follows from a decision-theoretic argument, as follows. Let c denote the cost of retrieving a relevant document and c′ the cost of retrieving an irrelevant document, with c < c′. Then, in order to minimize the overall cost, a document d should be retrieved next if

c\,P(R | d, q) + c′\,(1 − P(R | d, q)) ≤ c\,P(R | d′, q) + c′\,(1 − P(R | d′, q))   (4.8)

for every other document d′ that has not yet been retrieved. But, since c < c′, the above condition is equivalent to

P(R | d, q) ≥ P(R | d′, q),

that is, documents should be retrieved in order of decreasing probability of relevance.
In order to design a model for the probability of relevance some simplifications are needed. The simplest possible approach is called binary independence retrieval (BIR) (Robertson and Spärck Jones 1976); the same assumption is also used in the Bernoulli model for the Naive Bayes classifier, which we will discuss later, in Section 4.6.2. This model postulates a form of conditional independence amongst terms. Following Fuhr (1992), let us introduce the odds of relevance and apply Bayes' theorem:

O(R | d, q) = \frac{P(R = 1 | d, q)}{P(R = 0 | d, q)} = \frac{P(d | R = 1, q)}{P(d | R = 0, q)} \cdot \frac{P(R = 1 | q)}{P(R = 0 | q)}.   (4.9)

Note that the last fraction is the odds of R given q, a constant quantity across the collection of documents (it only depends on the query). The BIR assumption concerns the first fraction and was originally misidentified as a marginal form of term independence (Cooper 1991). We can state it as

\frac{P(d | R = 1, q)}{P(d | R = 0, q)} = \prod_{j=1}^{|V|} \frac{x_j P(ω_j | R = 1, q) + (1 − x_j)(1 − P(ω_j | R = 1, q))}{x_j P(ω_j | R = 0, q) + (1 − x_j)(1 − P(ω_j | R = 0, q))}.   (4.10)

The parameters to be estimated are therefore

θ_j = P(ω_j | R = 1, q)  and  σ_j = P(ω_j | R = 0, q).

If we further assume that θ_j = σ_j whenever ω_j does not appear in q, we finally have

O(R | d, q) = O(R | q) \prod_{j: ω_j ∈ q} \left(\frac{θ_j}{σ_j}\right)^{x_j} \left(\frac{1 − θ_j}{1 − σ_j}\right)^{1 − x_j},   (4.11)
where the product only extends to indices j whose associated terms ω_j appear in the query. This can also be rewritten as

O(R | d, q) = O(R | q) \prod_{j: ω_j ∈ q} \frac{1 − θ_j}{1 − σ_j} \prod_{j: ω_j ∈ q,\, x_j = 1} \frac{θ_j (1 − σ_j)}{σ_j (1 − θ_j)},   (4.12)

where the last factor is the only part that depends on the document. The retrieval status value (RSV) of a document is thus computed by taking the logarithm of the last factor:

RSV(d) = \sum_{j: ω_j ∈ d ∩ q} \log \frac{θ_j (1 − σ_j)}{σ_j (1 − θ_j)}.   (4.13)
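A direct transcription of Equation (4.13) is sketched below. The per-term estimates θ_j and σ_j (the symbols follow the reconstruction above) are simply passed in as dictionaries, since their estimation, e.g. from relevance feedback, is not covered in this fragment; the numerical values in the example are illustrative.

# Retrieval status value of Equation (4.13) for a document and a query,
# both given as sets of terms; theta and sigma map terms to probabilities.
import math

def rsv(doc_terms, query_terms, theta, sigma):
    score = 0.0
    for term in doc_terms & query_terms:
        t, s = theta[term], sigma[term]
        score += math.log((t * (1 - s)) / (s * (1 - t)))
    return score

theta = {"dvd": 0.8, "security": 0.6}
sigma = {"dvd": 0.2, "security": 0.4}
print(round(rsv({"dvd", "apple"}, {"dvd", "security"}, theta, sigma), 3))  # 2.773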
X = [x_1 \cdots x_n]^T.   (4.14)
Table 4.1 Example application of LSI. Top: a collection of documents; terms used in the analysis are underlined. Center: the term-document matrix X^T. Bottom: the reconstructed term-document matrix \hat{X}^T after projecting onto a subspace of dimension K = 2.
d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
open-source 1 0 0 0 1 0 0 0 0 0
software 1 0 0 1 0 0 0 0 0 0
Linux 1 0 0 1 0 0 0 0 0 0
released 0 1 1 1 0 0 0 0 0 0
Debian 0 1 1 0 0 0 0 0 0 0
Gentoo 0 0 1 0 1 0 0 0 0 0
database 0 0 0 0 1 0 0 1 0 0
Dolly 0 0 0 0 0 1 0 0 0 1
sheep 0 0 0 0 0 1 0 0 1 0
genome 0 0 0 0 0 0 1 1 1 0
DNA 0 0 0 0 0 0 2 0 0 1
d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
open-source 0.34 0.28 0.38 0.42 0.24 0.00 0.04 0.07 0.02 0.01
software 0.44 0.37 0.50 0.55 0.31 0.01 0.03 0.06 0.00 0.02
Linux 0.44 0.37 0.50 0.55 0.31 0.01 0.03 0.06 0.00 0.02
released 0.63 0.53 0.72 0.79 0.45 0.01 0.05 0.09 0.00 0.04
Debian 0.39 0.33 0.44 0.48 0.28 0.01 0.03 0.06 0.00 0.02
Gentoo 0.36 0.30 0.41 0.45 0.26 0.00 0.03 0.07 0.02 0.01
database 0.17 0.14 0.19 0.21 0.14 0.04 0.25 0.11 0.09 0.12
Dolly 0.01 0.01 0.01 0.02 0.03 0.08 0.45 0.13 0.14 0.21
sheep 0.00 0.00 0.00 0.01 0.03 0.06 0.34 0.10 0.11 0.16
genome 0.02 0.01 0.02 0.01 0.10 0.19 1.11 0.34 0.36 0.53
DNA 0.03 0.04 0.04 0.06 0.11 0.30 1.70 0.51 0.55 0.81
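The rank-K reconstruction shown at the bottom of Table 4.1 can be reproduced with a few lines of NumPy by truncating the singular value decomposition of the term-document matrix; the toy matrix below is illustrative and is not the one in the table.

# Rank-K LSI reconstruction via truncated SVD.
import numpy as np

def lsi_reconstruct(X, K=2):
    """X: term-document matrix (terms on rows). Returns its rank-K reconstruction."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# toy 4-term x 5-document count matrix (illustrative data only)
X = np.array([[1, 0, 0, 1, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1]], dtype=float)
print(np.round(lsi_reconstruct(X, K=2), 2))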
Figure 4.4 Bayesian networks describing the aspect model. Concepts (Z) are never observed in the data but their knowledge would render words (W) and documents (D) independent. The two networks are independence equivalent but use different parameterizations: (a) P(d_i) P(z_k | d_i) P(ω_j | z_k); (b) P(d_i | z_k) P(z_k) P(ω_j | z_k).
K). However, the L_2 matrix norm is algebraically well suited to Gaussian variables, and it is hard to justify its use in the case of the discrete space defined by vectors of counts of words.
We now discuss a probabilistic approach for describing a latent semantic space, called the aspect model (Hofmann et al. 1999) or aggregate Markov model (Saul and Pereira 1997). Let an event in this model be the occurrence of a term ω in a document d. Let z ∈ {z_1, . . . , z_K} denote a latent (hidden) variable associated with each event according to the following generative scheme. First, select a document from a density P(d). Second, select a latent concept z with probability P(z | d). Third, choose a term ω to describe the concept linguistically, sampling from P(ω | z); i.e. assume that once the latent concept has been specified, the document and the term are conditionally independent, as shown by the Bayesian network representation of Figure 4.4. The probability of each event (ω, d) is therefore

P(ω, d) = P(d) \sum_z P(ω | z) P(z | d).   (4.15)
P = U Σ V^T,   (4.16)

where U_{ik} = P(d_i | z_k), Σ = diag(P(z_1), . . . , P(z_K)) and V_{jk} = P(ω_j | z_k). This matrix can be directly compared to the SVD representation used in traditional LSI. Unlike \hat{X} in LSI, P is a properly normalized probability distribution and its entries cannot be negative. Moreover, the coordinates of terms in the latent space can be interpreted as the distribution of terms, conditional on the hidden z_k values.
Hofmann (2001) shows how to fit the parameters of this model from data. In the simplest case, they are estimated by maximum likelihood using the EM algorithm (see Chapter 1), because of the latent variable Z. We assume the parameterization of
Figure 4.4b. During the E step, the probability that the occurrence of ω_j in d_i is due to latent concept z_k is computed as

P(z_k | d_i, ω_j) = \frac{P(d_i | z_k) P(z_k) P(ω_j | z_k)}{\sum_{l=1}^{K} P(d_i | z_l) P(z_l) P(ω_j | z_l)}.   (4.17)

This can be seen as a special case of belief propagation (see Chapter 1) in the network of Figure 4.4b. The M step simply consists of updating parameters as follows:

P(ω_j | z_k) ∝ \sum_{i=1}^{n} n_{ij} P(z_k | d_i, ω_j),   (4.18)

P(z_k | d_i) ∝ \sum_{j=1}^{|V|} n_{ij} P(z_k | d_i, ω_j),   (4.19)

P(z_k) ∝ \sum_{i=1}^{n} \sum_{j=1}^{|V|} n_{ij} P(z_k | d_i, ω_j).   (4.20)
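A compact NumPy sketch of this EM procedure is given below, using the symmetric parameterization of Figure 4.4(b) throughout (the update of P(z_k | d_i) in Equation (4.19) refers to the other parameterization and is omitted here). The dense three-dimensional array of responsibilities restricts the sketch to small collections.

# EM for the aspect model: N is the term-document count matrix n_ij with
# documents on rows, K the number of latent concepts.
import numpy as np

def plsa(N, K, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    n_docs, n_terms = N.shape
    p_d_z = rng.random((n_docs, K)); p_d_z /= p_d_z.sum(axis=0)   # P(d | z)
    p_w_z = rng.random((n_terms, K)); p_w_z /= p_w_z.sum(axis=0)  # P(w | z)
    p_z = np.full(K, 1.0 / K)                                     # P(z)
    for _ in range(iterations):
        # E step: responsibilities P(z | d, w), shape (docs, terms, K)
        post = p_d_z[:, None, :] * p_w_z[None, :, :] * p_z[None, None, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M step: reweight by the counts n_ij and renormalize
        weighted = N[:, :, None] * post
        p_w_z = weighted.sum(axis=0); p_w_z /= p_w_z.sum(axis=0)
        p_d_z = weighted.sum(axis=1); p_d_z /= p_d_z.sum(axis=0)
        p_z = weighted.sum(axis=(0, 1)); p_z /= p_z.sum()
    return p_d_z, p_z, p_w_z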
is computed through vector-space similarities between the new document and the subset of the k neighbors that belong to class c (Yang 1999), where N_c(x, D, k) is the subset of N(x, D, k) containing only points of class c. Despite the simplicity of the method, the performance of k-NN in text categorization is quite often satisfactory in practice (Joachims 1998; Lam and Ho 1998; Yang 1999; Yang and Liu 1999). Han et al. (2001) have proposed a variant of k-NN where the weights associated with features are learned iteratively; other statistically motivated techniques that extend the basic k-NN classifier are discussed in Hastie et al. (2001).
Figure 4.5 A Bayesian network for the Naive Bayes classifier under the Bernoulli document-based event model (the class variable is the parent of the binary word indicators X_1, X_2, . . . , X_{|V|}). The example document is the sentence "burn DVDs and we will burn you".
independence retrieval model that was popular for probabilistic retrieval in the 1970s (Robertson and Spärck Jones 1976) and has even more ancient origins (Maron 1961). The conditional independence assumption in this model can be depicted graphically using a Bayesian network such as that in Figure 4.5, suggesting that the class is the only cause of the appearance of each word in a document. Under this model, generating a document is like tossing |V| independent coins and the occurrence of each word in the document is a Bernoulli event. Therefore we can rewrite the generative portion of Equation (4.23) as

P(d | c, θ) = \prod_{j=1}^{|V|} \left[ x_j P(ω_j | c) + (1 − x_j)(1 − P(ω_j | c)) \right],   (4.24)

where x_j = 1 [0] means that word ω_j does [does not] occur in d and P(ω_j | c) is the probability of observing word ω_j in documents of class c. Here θ represents the set of probabilities (or parameters) P(ω_j | c), which is the probability of the binary event that word ω_j is turned on within class c.
Alternatively, we may think of a document as a sequence of events W1 , . . . , W|d| .
Each observed Wt has a vocabulary entry (from 1 to |V |) as an admissible realization.
Note that the number of occurrences of each word, as well as the length of the
document, are taken into account under this interpretation. In addition, since the
document is a sequence, serial order among words should also be taken into account
when modeling P (W1 , . . . , W|d| | c). This could be done, for example, by using
a Markov chain. A simplifying assumption, however, is that word occurrences are
independent of their (relative) positions, given the class (Lewis 1992; Lewis and Gale
1994; Mitchell 1997). Equivalently, we assume that the bag-of-words representation
retains all the relevant information for assessing the probability of a document whose
class is known.
Under the word-based event model, generating a document is like throwing a die with |V| faces |d| times, and the occurrence of each word in the document is a multinomial event. Hence, the generative portion of the model is a multinomial distribution

P(d | c, θ) = G\,P(|d|) \prod_{j=1}^{|V|} P(ω_j | c)^{n_j},   (4.25)

where n_j is the number of occurrences of ω_j in d and G = |d|!/(n_1! \cdots n_{|V|}!) is a multinomial coefficient that does not depend on the class. Neither P(|d|) nor G is needed for classification, since |d|, the number of words or terms in a document, is assumed to be independent of the class. This last assumption can be removed and P(|d| | c) explicitly modeled. Models of document length (e.g. based on Poisson distributions) have been used, for example, in the context of probabilistic retrieval (Robertson and Walker 1994). Note that in the case of the Bernoulli model there are 2^{|V|} possible different documents, while in the case of the multinomial model there is an infinite (but countable) number of different documents.
An additional model that may be developed and that lies somewhat in between
the Bernoulli and the multinomial models consists of keeping the document-based
event model but extending Bernoulli distributions to integer distributions, such as
the Poisson (Lewis 1998). Finally, extensions of the basic Naive Bayes approach
that allow limited dependencies among features have been proposed (Friedman and
Goldszmidt 1996; Pazzani 1996). However, these models are characterized by a larger
set of parameters and may overfit the data (Koller and Sahami 1997).
Learning a Naive Bayes classifier consists of estimating the parameters θ from the available data. We will assume that the training data set consists of a collection of labeled documents {(d_i, c_i), i = 1, . . . , n}. In the Bernoulli model, the parameters include θ_{c,j} = P(ω_j | c), j = 1, . . . , |V|, c = 1, . . . , K. These are estimated as normalized sufficient statistics

\hat{θ}_{c,j} = \frac{1}{N_c} \sum_{i: c_i = c} x_{ij},   (4.26)

where N_c is the number of training documents of class c.
to play a role in determining the final class prediction for the document. We might also have more informative priors available, e.g. a Dirichlet prior on the distribution of words in each class that reflects word distributions as estimated from previous studies, or that reflects an expert's belief in what words are likely to appear in each class.
In the case of the multinomial model of Equation (4.25), the generative parameters are θ_{c,j} = P(ω_j | c). Note that these parameters must satisfy \sum_j θ_{c,j} = 1 for each class c. To estimate these parameters it is common practice to introduce Dirichlet priors (see Chapter 1 for details). The resulting estimation equations are derived as follows. In the case of the distributions of terms given the class, we introduce a Dirichlet prior with hyperparameters q_j and α, resulting in the estimation formula

\hat{θ}_{c,j} = \frac{α q_j + \sum_{i: c_i = c} n_{ij}}{α + \sum_{l=1}^{|V|} \sum_{i: c_i = c} n_{il}},   (4.28)
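For concreteness, the following sketch trains a multinomial Naive Bayes classifier with the smoothed estimate of Equation (4.28), assuming a symmetric prior q_j = 1/|V|; documents are lists of tokens, and tokens outside the training vocabulary are simply ignored at prediction time.

# Multinomial Naive Bayes with the Dirichlet-smoothed estimate of Eq. (4.28).
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    vocab = sorted({t for d in docs for t in d})
    q = 1.0 / len(vocab)                         # symmetric prior q_j
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        term_counts[c].update(d)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    theta = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        theta[c] = {t: (alpha * q + term_counts[c][t]) / (alpha + total) for t in vocab}
    return priors, theta

def predict(doc, priors, theta):
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in doc:
            if t in theta[c]:                    # unseen tokens are ignored
                score += math.log(theta[c][t])
        scores[c] = score
    return max(scores, key=scores.get)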
Learning in this class of models consists of determining w and w0 from data. The
training examples are said to be linearly separable if there exists a hyperplane whose
associated classification function is consistent with all the labels, i.e. if y_i f(x_i) > 0
for each i = 1, . . . , n. Under this hypothesis, Rosenblatt (1958) proved that the
following simple iterative algorithm terminates and returns a separating hyperplane:
Perceptron(D)
1   w ← 0
2   w_0 ← 0
3   repeat
4       e ← 0
5       for i ← 1, . . . , n
6           do s ← sgn(y_i (w^T x_i + w_0))
7              if s < 0
8                 then w ← w + y_i x_i
9                      w_0 ← w_0 + y_i
10                     e ← e + 1
11  until e = 0
12  return (w, w_0).
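A NumPy transcription of the pseudocode is given below; the test y_i(w^T x_i + w_0) ≤ 0 (rather than sgn(·) < 0) and the cap on the number of epochs are small practical departures introduced in this sketch.

# Perceptron training: X is an n x m matrix of inputs, y a vector of +1/-1 labels.
import numpy as np

def perceptron(X, y, max_epochs=1000):
    n, m = X.shape
    w, w0 = np.zeros(m), 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(n):
            if y[i] * (X[i] @ w + w0) <= 0:   # misclassified (or on the boundary)
                w += y[i] * X[i]
                w0 += y[i]
                errors += 1
        if errors == 0:                        # all points correctly classified
            break
    return w, w0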
It can be shown that a sufficient condition for D to be linearly separable is that the number of training examples n = |D| is less than or equal to m + 1 (see Exercise 4.4). This is particularly likely to be true in the case of text categorization, where the vocabulary typically includes several thousand terms and is often larger than the number of available training examples n (see Exercise 4.5 for details).
Unfortunately, learning with the Perceptron algorithm offers little defense against
overfitting. A thorough explanation of this problem requires concepts from statistical
learning theory that are beyond the scope of this book. To gain an intuition of why
this is the case, consider the scenario in Figure 4.6. Here we suppose that positive
and negative examples are generated by two Gaussian distributions (see Appendix A)
with the same covariance matrix and that positive and negative points are generated
with the same probability. In such a setting, the optimal (Bayes) decision boundary
is the one that minimizes the posterior probability that a new point is misclassified
and, as it turns out, this boundary is the hyperplane that is orthogonal to the segment
connecting the centers of mass of the two distributions (dotted line in Figure 4.6). Clearly, a random hyperplane that just happens to separate the training points (dashed line) can be substantially far away from the optimal separation boundary, leading to poor generalization on new data. The difficulty grows with the dimensionality m of the input space, since for a fixed n the set of separating hyperplanes grows exponentially with m (a problem known as the curse of dimensionality). Remember that in the case of text categorization m may be significantly large (several thousand).
The statistical learning theory developed by Vapnik (1998) shows that we can define an optimal separating hyperplane (relative to the training set) having two important properties: it is unique for each linearly separable data set, and its associated risk of overfitting is smaller than for any other separating hyperplane. We define the margin M of the classifier to be the distance between the separating hyperplane and the closest training examples. The optimal separating hyperplane is then the one having maximum margin (see Figure 4.7). Going back to Figure 4.6, the theory suggests that the risk of overfitting for the maximum margin hyperplane (solid line) is smaller than for the dashed hyperplane. Indeed, in our example the maximum margin hyperplane is significantly closer to the Bayes optimal decision boundary.
In order to compute the maximum margin hyperplane, we begin by observing that the distance of a point x from the separating hyperplane is

\frac{1}{\|w\|} (w^T x + w_0)
Figure 4.7 Illustration of the optimal separating hyperplane and margin. Circled points are support vectors. (The figure shows the separating hyperplane x^T w + w_0 = 0 together with the two margin hyperplanes x^T w + w_0 = ±M.)
(see Figure 4.7). Thus, the optimal hyperplane can be obtained by solving the constrained optimization problem

\max_{w, w_0} M   subject to   \frac{1}{\|w\|} y_i (w^T x_i + w_0) ≥ M,  i = 1, . . . , n,   (4.32)

where the constraints require that each training point should lie in the correct semispace and at a distance not less than M from the separating hyperplane. Note that although (w, w_0) comprise m + 1 real numbers, there are only m degrees of freedom, since multiplying w and w_0 by a scalar constant does not change the hyperplane. Thus we can arbitrarily set \|w\| = 1/M and rewrite the optimization problem (4.32) as

\min_{w, w_0} \tfrac{1}{2}\|w\|^2   subject to   y_i (w^T x_i + w_0) ≥ 1,  i = 1, . . . , n.   (4.33)

The above problem can be transformed to its dual by first introducing the vector of Lagrangian multipliers α ∈ R^n and writing the Lagrangian function

L(w, w_0, α) = \tfrac{1}{2}\|w\|^2 − \sum_{i=1}^{n} α_i [y_i (w^T x_i + w_0) − 1].   (4.34)

Setting the derivatives of L with respect to w and w_0 to zero and substituting back into Equation (4.34) eliminates the primal variables,
yielding another QP problem. The classifier obtained in this way is commonly referred to as a support vector machine (SVM).
Note that solving the QP problem using standard optimization packages would take time O(n³) (assuming that the number of support vectors grows linearly with the number of training examples). This time complexity is a practical drawback for SVMs. However, several approaches that can reduce the complexity substantially have been proposed in the literature (Joachims 1999a; Platt 1999).
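Rather than solving the dual QP or using the specialized methods cited above, the sketch below trains a linear soft-margin SVM with a simple Pegasos-style stochastic sub-gradient method; this is a swapped-in technique used only for illustration, and the regularization parameter lam and the unregularized bias term are choices of this sketch, not part of the formulation above.

# Linear soft-margin SVM trained by stochastic sub-gradient descent on the
# regularized hinge loss. X: n x m inputs, y: +1/-1 labels.
import numpy as np

def linear_svm_sgd(X, y, lam=0.01, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, w0 = np.zeros(m), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decreasing learning rate
            if y[i] * (X[i] @ w + w0) < 1:        # hinge-loss violation
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                w0 += eta * y[i]
            else:
                w = (1 - eta * lam) * w           # only shrink the weights
    return w, w0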
If the data are far from being linearly separable, then a linear SVM classifier will have low accuracy, i.e. even the best linear hyperplane may be quite inferior in terms
about the class that is provided by the observation of each term. More precisely, let us denote by W_j an indicator variable such that W_j = 1 means that ω_j appears in a certain document. The information gain of W_j is then the mutual information I(C, W_j) between the class C and W_j (see Appendix A for a review of the main concepts in information theory). This is also the difference between the marginal entropy of the class, H(C), and the conditional entropy H(C | W_j) of the class given W_j:

G(W_j) = H(C) − H(C | W_j) = \sum_{c=1}^{K} \sum_{w_j=0}^{1} P(c, w_j) \log \frac{P(c, w_j)}{P(c) P(w_j)}.   (4.40)

Note that if C and W_j are independent, the mutual information is zero. Filtering index terms simply consists of sorting terms by information gain and keeping only the k terms with highest gain. Information gain has been used extensively for text categorization (Craven et al. 2000; Joachims 1997; Lewis and Ringuette 1994; McCallum and Nigam 1998; Yang 1999) and has generally been found to improve classification performance compared to using all words.
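A small implementation of this filter is sketched below: for each term, the mutual information of Equation (4.40) is estimated from document counts and the k highest-scoring terms are retained. The binary document-level indicator (a term either occurs in a document or it does not) matches the definition of W_j above.

# Information-gain feature selection from tokenized, labeled documents.
import math
from collections import Counter

def information_gain_filter(docs, labels, k):
    n = len(docs)
    class_count = Counter(labels)
    term_class = Counter()          # (term, class) -> number of documents
    term_count = Counter()          # term -> number of documents containing it
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            term_class[(t, c)] += 1
            term_count[t] += 1

    def gain(t):
        g = 0.0
        for c, nc in class_count.items():
            for present in (0, 1):
                n_tc = term_class[(t, c)] if present else nc - term_class[(t, c)]
                n_t = term_count[t] if present else n - term_count[t]
                if n_tc == 0 or n_t == 0:
                    continue        # 0 log 0 contributes nothing
                p_joint = n_tc / n
                g += p_joint * math.log(p_joint / ((nc / n) * (n_t / n)))
        return g

    return sorted(term_count, key=gain, reverse=True)[:k]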
One limitation of information gain is that relevance assessment is done separately for each attribute, and the effect of co-occurrences is not taken into account. However, terms that individually bring little information about the class might bring significant information when considered together. In the framework proposed by Koller and Sahami (1996), whole sets of features are tested for relevance about the class. Let x denote a point in the complete feature space (i.e. the entire vocabulary V in the case of text documents) and x_G be the projection of x onto a subset G ⊆ V. In order to evaluate the quality of G as a representation of the class, we measure the distance between P(c | x) and P(c | x_G) using the average relative entropy

Δ_G = \sum_x P(x) \sum_c P(c | x) \log \frac{P(c | x)}{P(c | x_G)}.   (4.41)

The optimal set of features should yield a small Δ_G. Clearly this setup is only theoretical, since Equation (4.41) is computationally intractable and the distributions involved are hard to estimate accurately. Koller and Sahami (1996) use the notion of a Markov blanket to reduce complexity. A set of features M ⊆ V (with W_j not in M) is a Markov blanket for W_j if W_j is conditionally independent of all the features in V \ (M ∪ {W_j}) given M (see also the theory of Bayesian networks in Chapter 1). In other words, once the features in M are known, W_j and the remaining features are independent. Thus the class C is also conditionally independent of W_j given M and, as a result, if G contains a Markov blanket for W_j then Δ_{G_j} = Δ_G, where G_j = G \ {W_j}. The feature selection algorithm can then proceed by removing those features for which a Markov blanket can be found. Koller and Sahami (1996) prove that a greedy approach where features are removed iteratively is correct, in the sense that a feature deemed irrelevant and removed during a particular iteration cannot become relevant again after some other features are later removed. Moreover, since
finding a Markov blanket may be computationally very hard and an exact blanket may not exist, they suggest the following approximate algorithm:

ApproxMarkovBlanket(D, V, k, n′)
1   G ← V
2   repeat
3       for W_j ∈ G
4           do for W_i ∈ G \ {W_j}
5                  do ρ_ij ← cov[W_i, W_j] / \sqrt{var[W_i]\,var[W_j]}
6              M_j ← the k features having highest ρ_ij
7              p_j ← P(c | X_{M_j} = x_{M_j}, W_j = x_j)
8              p′_j ← P(c | X_{M_j} = x_{M_j})
9              D(x_{M_j}, x_j) ← H(p_j, p′_j)
10             Δ(W_j | M_j) ← \sum_{x_{M_j}, x_j} P(x_{M_j}, x_j) D(x_{M_j}, x_j)
11      j* ← arg min_j Δ(W_j | M_j)
12      G ← G \ {W_{j*}}
13  until |G| = n′
14  return G
At each step, the algorithm computes for each feature W_j the set M_j ⊆ G \ {W_j} containing the k features that have the highest correlation with W_j, where k is a parameter of the algorithm and where correlation is measured by the Pearson correlation coefficient in line 5. Then, in line 10, the quantity Δ(W_j | M_j) is computed as the average cross entropy between the conditional distributions of the class that result from the inclusion and the exclusion of feature W_j. This quantity is clearly zero if M_j is a Markov blanket. Thus, picking the j that minimizes it (line 11) selects a feature for which an approximate Markov blanket exists. Such a feature is removed and the process is iterated until n′ features remain in G.
Several other filter approaches have been proposed in the context of text categorization, including the use of minimum description length (Lang 1995) and symbolic rule learning (Raskinis and Ganascia 1996). A comparison of alternative techniques is reported in Yang (1999).
where TP, TN, FP, and FN mean true positives, true negatives, false positives, and false negatives, respectively. In the case of balanced domains (i.e. where the unconditional probabilities of the classes are roughly the same) accuracy A is often used to characterize performance. Under the 0-1 loss (see Section 1.5 for a discussion), accuracy is defined as

A = \frac{TN + TP}{|D_t|}.   (4.42)

Classification error is simply E = 1 − A. If the domain is unbalanced, measures such as precision and recall are more appropriate. Assuming (without loss of generality) that the number of positive documents is much smaller than the number of negative ones, precision is defined as

π = \frac{TP}{TP + FP}   (4.43)

and recall is defined as

ρ = \frac{TP}{TP + FN}.   (4.44)

A complementary measure that is sometimes used is specificity,

\frac{TN}{TN + FP}.   (4.45)
As in retrieval, there is clearly a trade-off between false positives and false negatives.
For example, when using a probabilistic classifier like Naive Bayes we might
introduce a decision function that assigns a document the class + if and only if
P(c | d) > t. In so doing, small values of the threshold t yield higher recall and
larger values yield higher precision. Something similar can be constructed for an
SVM classifier using a threshold on the distance between the points and the
separating hyperplane. Often this trade-off is visualized on a parametric plot where
precision and recall values, π(t) and ρ(t), are evaluated for different values of
the threshold (see Figure 4.9 for some examples). Sometimes these plots are also
called ROC curves (from Receiver Operating Characteristic), neglecting the caveat
that the original name was coined in clinical research for diagrams plotting specificity
versus sensitivity (an alias for recall). A very common measure of performance
that synthesizes a precision-recall curve is the breakeven point, defined as
the best⁴ point where π(t*) = ρ(t*); it can be seen as an alternative to the F
measure (discussed earlier in Section 4.3.3) for reducing performance to a single
number.
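The trade-off just described can be made concrete with a short sketch that sweeps the decision threshold t over classifier scores (for example, estimates of P(c | d) or SVM margins) and reports precision, recall, and an approximate breakeven point. The code below is only an illustration; the names scores and labels are hypothetical.

import numpy as np

def precision_recall_curve(scores, labels):
    # scores: classifier outputs (higher means "more positive"); labels: 1/0.
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)              # true positives when thresholding at each rank
    fp = np.cumsum(1 - labels)          # false positives at the same thresholds
    precision = tp / (tp + fp)          # pi(t)
    recall = tp / labels.sum()          # rho(t)
    return precision, recall

def breakeven(precision, recall):
    # Approximate breakeven: the point where precision and recall are closest.
    i = int(np.argmin(np.abs(precision - recall)))
    return (precision[i] + recall[i]) / 2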
In the case of multiple categories we may define precision and recall separately
for each category c, treating the remaining classes as a single negative class (see
Table 4.2 for an example). Interestingly, this approach also makes sense in domains
where the same document may belong to more than one category. In the case of multiple
categories, a single estimate for precision and a single estimate for recall can
be obtained by averaging the per-category measures; the macroaveraged precision
and recall, for example, are defined as
4 Since in general there may be more than one value of the threshold at which π(t) = ρ(t), we take
the value t* that maximizes π(t*) = ρ(t*).
   π^M = (1/K) Σ_{c=1}^{K} π_c,    (4.48)

   ρ^M = (1/K) Σ_{c=1}^{K} ρ_c.    (4.49)
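As an illustration of the two ways of averaging, the sketch below contrasts the macroaveraging of Equations (4.48) and (4.49) with microaveraging, which pools the per-category contingency counts before taking the ratios; the arrays tp, fp, and fn are hypothetical per-category counts.

import numpy as np

def macro_micro(tp, fp, fn):
    # tp, fp, fn: one entry per category (K categories in total).
    tp, fp, fn = (np.asarray(a, dtype=float) for a in (tp, fp, fn))
    pi_c = tp / (tp + fp)                        # per-category precision
    rho_c = tp / (tp + fn)                       # per-category recall
    macro = (pi_c.mean(), rho_c.mean())          # Equations (4.48) and (4.49)
    micro = (tp.sum() / (tp.sum() + fp.sum()),   # pooled contingency counts
             tp.sum() / (tp.sum() + fn.sum()))
    return macro, micro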
4.6.6 Applications
Up to this point we have largely discussed the classification of generic text documents
as represented by a bag-of-words vector representation. From this viewpoint it did
not really matter whether the bag of words represented a technical article, an email, or
a Web page. In practice, however, when applying text classification to Web documents
there are several domain-specific aspects of the Web that can be leveraged to improve
classification performance. In this section we review some typical examples that
demonstrate how ideas from text classification can be applied and extended in a Web
context.
Figure 4.8 A sample ontology for representing knowledge about academic websites. The
diagram indicates the class hierarchy and the main attributes of each class. Figure adapted
from Craven et al. (2000).
Table 4.2 Experimental results obtained in Craven et al. (2000) using the Naive Bayes
classifier on the Web KB domain (predicted versus actual category for the classes
Cou, Stu, Fac, Sta, Pro, Dep, and Oth, with per-category precision).
Table 4.3 Results reported by Joachims (1998) and Weiss et al. (1999) (last row) on 90 classes
of the Reuters-21578 collection. Performance is measured by the microaveraged breakeven
point (%), one row per prediction method.
The data set has been split into training and test data according to several
alternative conventions (see Lewis (1997) and Sebastiani (2002) for discussion). In
the so-called ModApte split, 9603 documents are reserved for training and 3299 for
testing (the rest being discarded). Ninety categories possess at least one training and
one test example. In this setting, Joachims (1998) experimented with several alternative
classifiers, including those described in this chapter (see Table 4.3). Note that
the support vector classifier used in Joachims (1998) applies a radial basis function
(RBF) kernel to learn nonlinear separation surfaces (see Schoelkopf and Smola
(2002) for a thorough discussion of kernel methods). It is worth noting that the simple
k-NN classifier achieves relatively good performance and outperforms Naive
Bayes in this problem. On the same data set, Weiss et al. (1999) have reported better
results using multiple decision trees trained with AdaBoost (Freund and Schapire
1996; Schapire and Freund 2000); specific results are provided in the last row of
Table 4.3.
Figure 4.9 Precision-recall curves obtained by training an SVM classifier on the
Reuters-21578 data set (ModApte split, 10 most frequent classes), where each plot is for one
of the 10 classes.
A simplified problem in the Reuters data set is obtained by removing all but the 10
most populated categories. In this setting, Dumais et al. (1998) report a comparison
of several alternative learners, obtaining their best performance (92% microaveraged
breakeven point) with support vector machines. Figure 4.9 shows precision-recall
curves we have obtained on this data set using an SVM classifier.
systems (the task being to predict which newsgroup a message was posted to). The
data set of 20 newsgroups is available at http://www.ai.mit.edu/people/
jrennie/20Newsgroups/.
example, by training a mixture of two Gaussians. Then, just a few labeled points may
be sufficient to decide which Gaussian is associated with the positive and which with
the negative class (see Castelli and Cover (1995) for a theoretical discussion). A different intuition,
in terms of discriminant classifiers, can be gained by considering Figure 4.10, where
unlabeled data points are shown as dots. The thicker hyperplane is clearly a better
solution than the thin maximum margin hyperplane computed from labeled points
only. Finally, as noted by Joachims (1999b), learning from unlabeled data is reason-
able in text domains because categories can be guessed using term co-occurrences.
For example, consider the 10 documents of Table 4.1 and suppose category labels
are only known for documents d1 and d10 (Linux and DNA, respectively). Then it
should be clear that term co-occurrences allow us to infer categories for the remain-
ing eight unlabeled documents. This observation links text categorization to LSI, a
connection that has been exploited for example in Zelikovitz and Hirsh (2001).
None of the classification algorithms we have studied so far can deal directly with
unlabeled data. We now present two approaches that develop the intuition above. The
use of unlabeled data is further discussed in the context of co-training in Section 4.7.
Transductive SVM
Support vector machines can also be extended to handle unlabeled data in the
transductive learning framework of Vapnik (1998). In this setting, the optimization
problem of Equation (4.33) (that leads to computing the optimal separating hyperplane
for linearly separable data) becomes:

   min_{y'_1,...,y'_{n'}, w, w_0} ||w||
   subject to   y_i(w^T x_i + w_0) ≥ 1,    i = 1, . . . , n,          (4.52)
                y'_j(w^T x'_j + w_0) ≥ 1,  j = 1, . . . , n'.

The solution is found by determining a label assignment (y'_1, . . . , y'_{n'}) of the unlabeled
examples so that the separating hyperplane maximizes the margin of both the data in
D and D'. In other words, missing values (y'_1, . . . , y'_{n'}) are filled in using maximum
margin separation as a guiding criterion. A similar extension of Equation (4.38) for
nonlinearly separable data is
   min_{y'_1,...,y'_{n'}, w, w_0}  ||w|| + C Σ_{i=1}^{n} ξ_i + C' Σ_{j=1}^{n'} ξ'_j

   subject to   y_i(w^T x_i + w_0) ≥ 1 − ξ_i,     i = 1, . . . , n,
                y'_j(w^T x'_j + w_0) ≥ 1 − ξ'_j,  j = 1, . . . , n',      (4.53)
                ξ_i ≥ 0,    i = 1, . . . , n,
                ξ'_j ≥ 0,   j = 1, . . . , n'.
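Solving Equation (4.53) exactly is a hard combinatorial problem; practical implementations such as Joachims (1999b) rely on heuristic label-switching schemes. The sketch below is a much simplified self-labeling approximation of that idea, not the actual algorithm: it alternates between fitting a linear SVM and re-guessing the labels of D', gradually increasing the cost given to the unlabeled part. It assumes scikit-learn is available, dense feature matrices, and two-class labels; all names are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def self_labeling_svm(X_lab, y_lab, X_unlab, rounds=10, C_lab=1.0, C_unlab=0.01):
    # Crude approximation of transductive training: guess labels for X_unlab,
    # refit on labeled plus self-labeled data, and anneal the unlabeled cost C'.
    clf = LinearSVC(C=C_lab).fit(X_lab, y_lab)
    y_guess = clf.predict(X_unlab)
    for _ in range(rounds):
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, y_guess])
        w = np.concatenate([np.full(len(y_lab), C_lab),       # labeled examples
                            np.full(len(y_guess), C_unlab)])  # unlabeled examples
        clf = LinearSVC(C=1.0).fit(X, y, sample_weight=w)
        new_guess = clf.predict(X_unlab)
        if np.array_equal(new_guess, y_guess):
            break
        y_guess = new_guess
        C_unlab = min(2 * C_unlab, C_lab)   # gradually trust the unlabeled part more
    return clf, y_guess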
Figure 4.11 Classification accuracy as a function of the number of labeled documents using
EM and 10 000 unlabeled documents (solid line) and using no unlabeled documents and the
Naive Bayes classifier (dotted line). Experimental results reported in Nigam et al. (2000) for
the 20 Newsgroups data set.
found that the transductive SVM actually maximizes the 'wrong' margin, i.e. it pushes
toward a large separation of the unlabeled data but achieves this by actually mislabeling
the unlabeled data. In the setting of Zhang and Oles (2000) the performance of the
transductive SVM was found to be worse than that of the inductive SVM.
4.7.1 Co-training
Links in hypertexts offer additional information that can also be exploited for improv-
ing categorization and information extraction. The co-training framework introduced
by Blum and Mitchell (1998) especially addresses this specific property of the Web,
although it has more general applicability. In co-training, each instance is observed
through two alternative sets of attributes or 'views', and it is assumed that each view
is sufficient to determine the class of the instance. More precisely, the instance space
X is factorized into two subspaces, X = X1 × X2, and each instance x is given as a pair
(x1, x2). For example, x1 could be the bag of words of a document and x2 the bag
of words obtained by collecting all the text in the anchors pointing to that document.
Blum and Mitchell (1998) assume that

(1) the labeling function that classifies examples is the same whether applied to x1 or to
x2, and

(2) x1 and x2 are conditionally independent given the class.
Under these assumptions, unlabeled documents can be exploited in a special way for
learning. Specifically, suppose two sets of labeled and unlabeled documents D and
D' are given. The iterative algorithm described in Blum and Mitchell (1998) then
proceeds as follows. Labeled data are used to infer two Naive Bayes classifiers, one
that only considers the x1 portion of x and one that only considers the portion x2. Then
these two classifiers are used to guess class labels of documents in a subset of D'. A
fixed amount of such instances that are classified as positive with highest confidence
is then added as positive examples to D. A similar procedure is used to add self-labeled
negative examples to D, and the process is iterated, retraining the two Naive
Bayes classifiers on both labeled and self-labeled data. Blum and Mitchell (1998)
report a significant error reduction in an empirical test on Web KB documents.
Nigam and Ghani (2000) have shown that co-training offers significant advantages
over EM if there is a true independence between the two feature sets.
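A minimal sketch of the co-training loop, assuming scikit-learn's Multinomial Naive Bayes, binary 0/1 labels, and dense term-count matrices for the two views, might look as follows (the parameters p and n control how many self-labeled positives and negatives are added per round; all names are illustrative, not from Blum and Mitchell's code):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=30, p=1, n=3):
    # X1_*, X2_*: the two views (e.g. page words and anchor words); labels are 0/1.
    L1, L2, y = X1_lab.copy(), X2_lab.copy(), list(y_lab)
    pool = list(range(X1_unlab.shape[0]))        # indices of still-unlabeled documents
    c1 = c2 = None
    for _ in range(rounds):
        if not pool:
            break
        c1 = MultinomialNB().fit(L1, y)          # classifier for view 1
        c2 = MultinomialNB().fit(L2, y)          # classifier for view 2
        for clf, view in ((c1, X1_unlab), (c2, X2_unlab)):
            if not pool:
                break
            proba = clf.predict_proba(view[pool])
            pos = proba[:, list(clf.classes_).index(1)]
            picks = [(i, 1) for i in np.argsort(-pos)[:p]] + \
                    [(i, 0) for i in np.argsort(pos)[:n]]
            chosen = set()
            for i, label in picks:               # most confident positives/negatives
                if i in chosen:
                    continue
                chosen.add(i)
                j = pool[i]
                L1 = np.vstack([L1, X1_unlab[j]])
                L2 = np.vstack([L2, X2_unlab[j]])
                y.append(label)
            pool = [pool[i] for i in range(len(pool)) if i not in chosen]
    return c1, c2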
Table 4.5 Examples of first-order logic classification rules learned from Web KB data
(Craven et al. 2000). Note that terms such as jame and faculti result from text conflation.
Two sample rules are reported in Table 4.5. For example, a Web page is classified as
a student page if it does not contain the terms data and comment and it is linked by
another page that contains the terms jame and paul but not the term mail. The method
was later refined by characterizing documents and links in a statistical way, using the
Naive Bayes model in combination with FOIL (Craven and Slattery 2001).
Graphical models such as Bayesian networks that are traditionally conceived for
propositional data (i.e. each instance or case is a tuple of attributes) have been more
recently generalized to deal with relational data as well. Relational Bayesian networks
have been introduced in Jaeger (1997) and learning algorithms for probabilistic
relational models are presented in Friedman et al. (1999). Taskar et al. (2002) propose
a probabilistic relational model for classifying entities, such as Web documents,
that have known relationships. This leads to the notion of collective classification,
where classification of a set of documents is viewed as a global classification problem
rather than classifying each page individually. Taskar et al. (2002) use a general
probabilistic Markov network model for this class of problems and demonstrate that
taking relational information (such as hyperlinks) into account clearly improves classification
accuracy. Cohn and Hofmann (2001) describe a probabilistic model that
jointly accounts for contents and connectivity, integrating ideas from PLSA (see Section 4.5.2)
and PHITS (see Section 5.6.2).
Classification and probabilistic modeling of relational data in general is still a relatively
unexplored research area at this time, but is clearly a promising methodology
for classification and modeling of Web documents.
Hierarchical clustering
Hierarchical clustering algorithms do not presume a fixed number of clusters K in
advance but instead produce a binary tree where each node is a subcluster of the
parent node. Assume that we are clustering n objects. The root node consists of a
cluster containing all objects and the n leaf nodes correspond to clusters where
each contains one of the n original objects. This tree structure is known as a dendro-
gram.
Agglomerative hierarchical clustering algorithms start with a pairwise matrix of
distances or similarities between all pairs of objects. At the start of the algorithm all
objects are considered to be in their own cluster. The algorithm operates by iteratively
and greedily merging the two closest clusters into a new cluster. The resulting merging
process results in the gradual bottom-up construction of a dendrogram. The definition
of 'closest' depends on how we define distance between two sets (clusters) of objects
in terms of their individual distances. For example, if we define 'closest' as being the
smallest pairwise distance between any pair of objects (where the first is in the first
cluster, and the second is in the second cluster) we will tend to get clusters where
some objects can be relatively far away from each other. This leads to the well-known
'chaining effect' if, for example, the distances correspond to Euclidean distance in
some d-dimensional space, where the cluster shapes in d-space can become quite
elongated and chain-like. This minimum-distance algorithm is also known as single-link
clustering.
In contrast we could define 'closest' as being the maximum distance between all
pairs of objects. This leads naturally to very compact clusters if viewed in a Euclidean
space, since we are ensuring at each step that the maximum distance between any
pair of objects in the cluster is minimized. This maximum-distance method is known
as complete-link clustering. Other definitions for 'closest' are possible, such as computing
averages of pair-wise distances, providing algorithms that can be thought of
as between the single-link and complete-link methods.
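Assuming SciPy is available, single-link and complete-link clustering of a TFIDF document matrix can be sketched in a few lines; the function below is only an illustration of the method, not code from the systems discussed in this chapter.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_documents(X, method='single', k=5):
    # X: (n_docs, n_terms) TFIDF matrix. method='single' is single-link
    # (minimum distance), method='complete' is complete-link (maximum distance).
    d = pdist(X, metric='cosine')                  # O(n^2) pairwise distance matrix
    Z = linkage(d, method=method)                  # bottom-up dendrogram construction
    return fcluster(Z, t=k, criterion='maxclust')  # cut the dendrogram into k clusters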
A useful feature of hierarchical clustering for documents is that we can build
domain-specific knowledge into the definition of the pairwise distance measure between
objects. For example, the cosine distance in Equation (4.1) or weighting schemes
such as TFIDF (Equation (4.5)) could be used to define distance measures. Alternatively,
for HTML documents where we believe that some documents are structurally
similar to others, we could define an edit distance to reflect structural differences.
However, a significant disadvantage of hierarchical clustering is that the agglomerative
algorithms have a time complexity between O(n^2) and O(n^3), depending on
the particular algorithm being implemented. All agglomerative algorithms are at least
O(n^2) due to the requirement of starting with a pairwise distance matrix between all
n objects. For small values of n, such as the clustering of a few hundred Web pages
that are returned by a search engine, an O(n^2) algorithm may be feasible. However,
for large values of n, such as 1000 or more, hierarchical clustering is somewhat
impractical from a computational viewpoint.
are both very large. The authors describe speed-ups of up to an order of magnitude for
greedy agglomerative clustering, K-means clustering, and probabilistic model-based
clustering, using their approach.
A number of other representations (besides hierarchical clustering, K-means, or
mixture models) have also been proposed for clustering documents. Taskar et al.
(2002) describe how their Markov network model for relational data can be used
to incorporate both text on the page as well as hyperlink structure for clustering of
Web pages. Slonim and Tishby (2000) propose a very general information-theoretic
technique for clustering called the information bottleneck. The technique appears to
work particularly well for document clustering (Slonim et al. 2002).
Zamir and Etzioni (1998) describe a specific algorithm for the problem, discussed
earlier, of clustering the results of search engines. They use the 'snippets' returned
by Web search engines as the basis for clustering and propose a clustering algorithm
that uses suffix tree data structures to identify phrases shared between documents. In the
data sets used in their experiments they show that the resulting algorithm is both
computationally efficient (linear in the number of documents) and finds clusters that
are approximately as good as those obtained from clustering the full text of the Web
pages.
Figure 4.12 A toy example of states and transitions in an HMM for extracting various fields
(TITLE, AUTHOR, EMAIL, AFFILIATION, ADDRESS, ABSTRACT, END) from the beginning
of research papers. Not shown are various possible self-transition probabilities (e.g. multiple
occurrences of the state author), transition probabilities on the edges, and the probability
distributions of words associated with each state. See Figure 7 in McCallum et al. (2000b)
for a more detailed and realistic example of an HMM for this problem.
described earlier in this chapter, has focused largely on the use of machine-learning
and statistical techniques that leverage human-labelled data to automate the process
of constructing information extractors. In effect these systems use the human-labelled
text to learn how to extract the relevant information (Cardie 1997; Kushmerick et al.
1997).
One general approach is to define information extraction as a classification problem.
For example, we might want to classify short segments of text in terms of whether
they correspond to the title, author names, author addresses and affiliations, and so
forth, from research papers that are found by crawling the Web (for example, this is one
component of the functionality in the CiteSeer digital library system (Lawrence et al.
1999)). A classification approach to this problem would be to represent each document
as a sequence of words and then use a sliding window of fixed width k as input to a
classifier; each of the k inputs to the classifier is a word in a specific position. The
system can be trained on positive and negative examples that are typically manually
labeled. A variety of different classifiers such as Naive Bayes, decision trees, and
relational rule representations have been used for sliding-window based classification
(Baluja et al. 2000; Califf and Mooney 1998; Freitag 1998; Soderland 1999).
A limitation of the sliding window is that it does not take into account sequential
constraints that are naturally present in the data, e.g. the fact that the author field
almost always precedes the address field in the header for a research paper. To
take this type of structure into account, one approach is to train stochastic finite-
state machines that can incorporate sequential dependence. One popular choice has
been hidden Markov models (HMMs), which were mentioned briefly in Chapter 1
during our discussion of graphical models. An HMM contains a finite-state Markov
model, where each state in the model corresponds to one of the fields that we wish
to extract from the text. For example, in parsing headers of research papers we could
have states corresponding to the title of the paper, author name, etc. An example of
a simple Markov state diagram is shown in Figure 4.12. The key idea in HMMs is
that the true Markov state sequence is unknown at parse time; instead we see noisy
observations from each state, in this case the sequence of words from the document.
Each state has a characteristic probability distribution over the set of all possible
words, e.g. the distribution of words from the state title will be quite different from
the distribution of words for the state author names. Thus, given a sequence of
words and an HMM, we can parse the observed sequence into a corresponding set
of inferred states; the Viterbi algorithm provides an efficient method (linear in the
number of observed words) for producing the most likely state sequence, given the
observations.
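A minimal Viterbi decoder for such an HMM can be written in a few lines of numpy. The sketch below assumes that the initial, transition, and emission probabilities (pi, A, B) have already been estimated (for example from labeled headers) and that words are encoded as integer indices; it is an illustration, not the implementation used in the systems cited here.

import numpy as np

def viterbi(obs, pi, A, B):
    # obs: list of word indices; pi: initial state probabilities (S,);
    # A: transition matrix (S, S); B: emission probabilities (S, vocabulary size).
    # (Smoothing of pi, A, B is assumed so that no entry is exactly zero.)
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))                 # best log probability ending in each state
    psi = np.zeros((T, S), dtype=int)        # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: from state i to state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta[-1].argmax())]       # most likely final state
    for t in range(T - 1, 0, -1):            # follow back-pointers
        states.append(int(psi[t][states[-1]]))
    return states[::-1]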
For information extraction, the HMM can be trained in a supervised manner with
manually labeled data, or bootstrapped using a combination of labeled and unlabeled
data (where the EM algorithm is used for training HMMs on unlabeled data). An early
application of this approach to the problem of named entity extraction (automatically
finding names of people, organizations, and places in free text) is described in Bikel
et al. (1997). A variety of related ideas and extensions are described in Leek (1997),
Freitag and McCallum (2000), McCallum et al. (2000a) and Lafferty et al. (2001).
For a detailed description of the use of machine-learning methods (and HMMs in
particular) to automatically create a large-scale digital library see McCallum et al.
(2000c).
Information extraction is a broad and growing area of research at the intersec-
tion of Web research, language modeling, machine learning, information retrieval,
and database theory. In this section we have barely scratched the surface of the many
research problems, techniques, and applications that have been proposed, but nonethe-
less, we hope that the reader has gained a general high-level understanding of some
of the basic concepts involved.
4.10 Exercises
Exercise 4.1. Given a collection of m documents to be indexed, compare the memory
required to store pointers as integer document identifiers and as the difference between
consecutive document identifiers using Elias's γ coding. (Hint: use Zipf's Law for
term frequency and note that the most frequent term occurs in n documents, the second
most frequent in n/2 documents, the third in n/4 documents, and so on.)
Exercise 4.2. A vector x ∈ R^m is said to be sparse if
s = |{j = 1, . . . , m : xj ≠ 0}| ≪ m.
The dot product of two vectors x and x' normally requires Θ(m) time. In the case of
sparse vectors, with s and s' nonzero components respectively, this can be reduced to
Θ(s + s'). Describe an efficient algorithm for computing the dot product of sparse
vectors. (Hint: use linear lists to store the vectors in memory.)
Exercise 4.3. Consider the vector-space representation of documents and compare
the cosine distance (Equation (4.1)) to the ordinary Euclidean distance. Show that for
vectors of unit length the ranking induced by the two distances is the same.
Exercise 4.4. Show that a data set of n vectors in R^m is linearly separable if n ≤ m + 1.
Exercise 4.5. Consider a data set of n text documents. Suppose L is the average
number of words in each document. How many documents must be collected, on
average, before the sufficient condition of Exercise 4.4 fails? (Hint: use Heaps' Law
to estimate vocabulary growth.)
Exercise 4.6. Collect the subjects of your email mailbox and build a term-document
matrix X using a relatively small set of frequent terms. Compute the SVD decomposition
of the matrix using a software package capable of linear algebra computations
(e.g. Octave) and examine the resulting reconstructed matrix. Try querying your
messages and compare the results obtained with a simple Boolean model (where you
simply retrieve matching subjects) to those obtained with the LSI model. Experiment
with different values of k (the dimension of the latent semantic space).
Exercise 4.7. Build your own version of the Naive Bayes classifier and evaluate its
performance on the Reuters-21578 data set. In a first setting, keep all those classes
having at least one document in the training set and one in the test set. Repeat the
experiment using only the top 10 most frequent categories. Beware that categories
in the collection are not mutually exclusive (some documents belong to several cate-
gories).
Exercise 4.8. Write a program that selects the most informative features using mutual
information. Test your program on the 20 Newsgroups corpus. Train your Naive Bayes
classifier using the k most informative terms for different values of k and plot your
generalization accuracy. What is your best value of k?
Repeat the exercise by using Zipf's Law on text, removing words that are either too
frequent or too rare. Try different cutoff values for the k1 most frequent words and the
k2 most rare words. Compare your results to those obtained with mutual information.
Finally combine the two methods above and compare your results.
Exercise 4.9. Obtain a public domain implementation of Support Vector Machines
(e.g. Joachims's SVMlight, http://svmlight.joachims.org/) and set up
an experiment to classify documents in the Web KB corpus. Since SVMs are
binary classifiers you need to devise some coding strategy to tackle the multiclass
problem. Compare results obtained using the two easiest strategies: one-against-all
and all-pairs.
5
Link Analysis
by a given journal during [t − t1, t], and N is the total number of articles published
by that journal in [t − t2, t], the impact factor is defined as C/N (typically t1 = 1
year and t2 = 2 years). Thus, the impact factor is a very simple measure, since it
basically corresponds to the normalized indegree of a journal in a subgraph of the
citation network. Graph-based link analysis for the scientific literature goes back to
the 1960s (Garner 1967). However, these ideas were not exploited in the development
of first-generation Web search tools.
The paper by Bray (1996) reports an early attempt to apply social networks concepts
to the Web. He suggested a Web visualization approach where the . . . appearance of
a site should reect its visibility, as measured by the number of other sites that have
pointers to it . . . and . . . its luminosity, as measured by the number of pointers with
which it casts navigational light off-site. . . . Visibility and luminosity defined in this
way are directly related to the indegree and the outdegree of websites, respectively.
More recently, toward the end of the 1990s, link analysis methods became more
widely known and used in a search engine context, leading to what is sometimes
called the second generation of Web searching tools.
This chapter reviews the most common approaches to link analysis and how these
techniques can be applied to compute the popularity of a document or a site. The algorithms
presented in this chapter extract emergent properties from a complex network
of interconnections, attempting to model (indirectly) subjective human judgments.
It remains debatable as to whether popularity (as implied by the mechanism of cita-
tions) captures well the notions of relevance and quality as they are subjectively
perceived by humans, and whether link analysis algorithms can successfully model
human judgments.
The use of hypertext information in information retrieval is older than the Web.
Frisse (1988), for example, was concerned with retrieving hypertext cards in a medical
domain and noted that . . . often cards do not even mention what they are about, but
assume that the reader understands the context because he or she has read earlier
cards. He then proposed a simple algorithm for scoring documents where relevance
information was transmitted from documents to their parents in the hypertext graph
G. More precisely, the global score of v given the query and the topology of G was
computed as

   S(v) = s(v) + (1/|ch[v]|) Σ_{w∈ch[v]} S(w).    (5.1)
This simple algorithm somewhat resembles message passing schemes that are very
common in connectionism (McClelland and Rumelhart 1986) or in graphical model-
ing (Pearl 1988). As such, it requires G to be a DAG so that a topological sort 1 can be
chosen for updating the global scores S. The DAG assumption is reasonable in small
hypertexts with a root document and a relatively strong hierarchical structure (in this
case, even if G is not acyclic, not much information would be lost by replacing it with
its spanning tree). The Web, however, is a large and complex graph. This may explain
why search engines largely ignored its topology for several years.
The paper by Marchiori (1997) was probably the first one to discuss the quantitative
concept of hyper information to complement textual information in order to obtain
the overall information contained in a Web document. The idea somewhat resembles
Frisse's approach. Indeed, if we rewrite Equation (5.1) as

   S(v) = s(v) + h(v),    (5.2)

then s(v) can be thought of as the textual information (that only depends on the
document and the query), h(v) corresponds to the hyper information that depends on
the link structure where v is embedded, and S(v) is the overall information. Marchiori
(1997) did not cite Frisse (1988), but nonetheless he identified a fundamental problem
with Equation (5.1). If an irrelevant page v has a single link to a relevant page w,
Equation (5.1) implies that S(v) ≥ S(w). The scenario would be even worse in a
chain of documents v0, v1, . . . , vk. Here if s(vk) is very high but s(v0), . . . , s(vk−1)
are almost zero, then v0 would receive a global score higher than vk, even though a
user would need k clicks to reach the important document.
As a remedy, Marchiori suggested that in this case the hyper information of v
should be computed as

   h(v) = Σ_{w∈ch[v]} F^{r(v,w)} S(w),    (5.3)
where F ∈ (0, 1) is a fading constant and r(v, w) ∈ {1, . . . , |ch[v]|} is the rank
of w after sorting (in ascending order) the children of v according to the value of
1 A topological sort is an ordering < of the vertices such that v < v' whenever there is a directed
path from v to v' (Cormen et al. 2001).
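Equation (5.1) can be evaluated on a DAG by visiting vertices in reverse topological order, so that every child is scored before its parents. The minimal sketch below uses Python's standard graphlib module for the ordering; children, s, and hypertext_scores are illustrative names, and the fading refinement of Equation (5.3) is not included.

from graphlib import TopologicalSorter   # Python 3.9 and later

def hypertext_scores(children, s):
    # children: dict mapping each vertex to the list of its children in the DAG G;
    # s: dict mapping each vertex to its textual score s(v).
    # Vertices are visited so that all children are scored before their parents.
    order = TopologicalSorter(children).static_order()
    S = {}
    for v in order:                          # raises CycleError if G is not a DAG
        ch = children.get(v, [])
        S[v] = s.get(v, 0.0) + (sum(S[w] for w in ch) / len(ch) if ch else 0.0)
    return S

# Example: hypertext_scores({'home': ['a', 'b'], 'a': [], 'b': []},
#                           {'home': 0.1, 'a': 0.5, 'b': 0.3}) gives S('home') = 0.5.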
Figure 5.1 Graphs with different types of incidence matrices. (a) is primitive,
(b) is irreducible (with period 4) but not primitive, (c) and (d) are reducible.
In this case, λ is called the dominant eigenvalue of A and the associated eigenvector
is called the dominant eigenvector. Denoting by (λ1, . . . , λn) the eigenvalues of A,
in the following we will assume that the dominant eigenvalue is always λ1.
Note, however, that although there cannot be multiple roots, there may be some
other eigenvalue λj ≠ λ1 such that |λj| = |λ1|. It can be shown that if there are k
eigenvalues having the same magnitude as the dominant eigenvalue, then they are
equally spaced on the complex circle of radius λ1. Moreover, if A is the adjacency
matrix of a graph, k is the gcd of the lengths of all the cycles in the graph. In order
to get a dominant eigenvalue that is strictly greater than all other eigenvalues further
conditions are necessary.
A matrix A is said to be primitive if there exists a positive integer t such that
A^t > 0 (note the strict inequality). A primitive matrix is also irreducible, but the
converse is not true in general. For a primitive matrix, condition 1 of the Perron-Frobenius
theorem holds with strict inequality. This means that all the remaining
eigenvalues are smaller in modulus than the dominating eigenvalue. Moreover, if the
adjacency matrix of a graph is primitive, then the gcd of the lengths of all cycles is
unity. Figure 5.1 illustrates some examples.
Figure 5.2 Example of a base subgraph obtained by starting from the vertex set {1, 2, 3, 4}.
Note that the children of a given node (line 3) are forward links and can be
obtained directly from each page v. Parents (line 4) correspond to backlinks and
can be obtained from a representation of the Web graph obtained, for example,
through a crawl. Several commercial search engines currently support the spe-
cial query link:url that returns the set of documents containing url as a link.
In the case of small scale applications, this approach can be used to obtain the
set of parents in line 4. Parameter d is the maximum number of parents of a
node in the root set that can be added. As we know (see Chapter 3), some pages
may have a very large indegree. Thus, bounding the number of parents is cru-
cial in practical applications. Algorithm BaseSubgraph returns a set of nodes S.
In what follows, HITS is assumed to work on the subgraph of the Web induced
by S.
Let G = (V, E) denote the subgraph of interest, where V = S. For each vertex
v ∈ V, let us introduce two positive real numbers a(v) and h(v). These quantities
are called the authority and the hubness weights of v, respectively. Intuitively, a
document should be very authoritative if it has received many citations. As discussed
above, citations from important documents should be weighted more than citations
from less-important documents. In the case of HITS, the importance of a document as
a source of citations is measured by its hubness. Intuitively, a good hub is a document
that allows us to reach many authoritative documents through its links. The result is
that the hubness of a document depends on the authority of the cited documents, and
the authority of a document depends on the hubness of the citing documents. We are
apparently stuck in a loop, but let us observe that this recursive form of dependency
between hubs and authority weights naturally leads to the definition of the following
operations:
   a(v) ← Σ_{w∈pa[v]} h(w),    (5.6)

   h(v) ← Σ_{w∈ch[v]} a(w).    (5.7)
The two operations above can be carried out to update authority and hubness weights
starting from initial values. This approach is meaningful because Kleinberg (1999)
showed that iterating Equations (5.6) and (5.7), intermixed with a proper normalization
step, yields a convergent algorithm. The output is a set of weights that can
therefore be considered to be globally consistent. Kleinberg's algorithm is listed below.
For convenience, weights are collected in two n-dimensional vectors a and h.
HubsAuthorities(G)
1   1 ← [1, . . . , 1]^T ∈ R^|V|
2   a_0 ← h_0 ← 1
3   t ← 1
4   repeat
5       for each v ∈ V
6           do a_t(v) ← Σ_{w∈pa[v]} h_{t−1}(w)
7              h_t(v) ← Σ_{w∈ch[v]} a_{t−1}(w)
8       a_t ← a_t / ||a_t||
9       h_t ← h_t / ||h_t||
10      t ← t + 1
11  until ||a_t − a_{t−1}|| + ||h_t − h_{t−1}|| < τ
12  return (a_t, h_t)
To show that HubsAuthorities terminates, we need to prove that for each τ > 0
the condition controlling the outer loop will be met for t large enough. Formally,
this means that the sequences {a_t}_{t∈N} and {h_t}_{t∈N} converge to limits a* and h*,
respectively. The proof of this result is based on rewriting HITS using linear algebra.
In particular, if we denote by A the incidence matrix of G, it can be easily verified that
the updating operations can be written compactly in vector notation as a_t = A^T h_{t−1}
and h_t = A a_{t−1}.
Since 1 cannot be orthogonal to a nonnegative vector, the sequences {a_t} and {h_t}
converge to the dominant eigenvectors of A^T A and AA^T, respectively.
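The HubsAuthorities iteration translates almost directly into numpy. The sketch below is only an illustration (the names A, tol, and hits are not from the text); it follows the update and normalization steps of the pseudocode above.

import numpy as np

def hits(A, tol=1e-8, max_iter=1000):
    # A: adjacency (incidence) matrix of the base subgraph, A[u, v] = 1 if u links to v.
    n = A.shape[0]
    a = h = np.ones(n)
    for _ in range(max_iter):
        a_new = A.T @ h                      # authority: sum of parents' hub weights
        h_new = A @ a                        # hubness: sum of children's authority weights
        a_new = a_new / np.linalg.norm(a_new)
        h_new = h_new / np.linalg.norm(h_new)
        if np.linalg.norm(a_new - a) + np.linalg.norm(h_new - h) < tol:
            return a_new, h_new
        a, h = a_new, h_new
    return a, h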
As an example, in Figure 5.4 we show the authority and hubness weights computed
by HubsAuthorities on the graph of Figure 5.2. We can note some unobvious
weight assignments. For example, vertex 3 has the largest indegree in the graph but
nonetheless its authority is rather small because of the low hubness weight of its
parents.
Bharat and Henzinger (1998) have suggested an improved version of HITS that
addresses some specic problems that are encountered in practice. For example, a
mutual reinforcement effect occurs when the same host (or document) contains many
identical links to the same document in another host. To solve this problem, Bharat
and Henzinger (1998) modified HITS by assigning weights to these multiple edges that
Figure 5.4 Authority and hubness weights in the example graph of Figure 5.2.
are inversely proportional to their multiplicity. The method presented by Bharat and
Henzinger (1998) also addresses the problem of links that are generated automatically,
for example, by converting messages posted to Usenet news groups into Web pages.
Finally, they address the so-called topic drift problem, i.e. some nodes in the base
subgraph may be irrelevant with respect to the user query and documents with highest
authority or hubness weights could be about different topics. This problem can be
addressed either by pruning irrelevant nodes or by regulating the influence of a node
with a relevance weight.
5.4 PageRank
The theory developed in this section was introduced by Page et al. (1998) and resem-
bles in many ways the recursive propagation idea we have seen in HITS. However,
unlike HITS, only one kind of weight is assigned to Web documents. Intuitively, the
rank of a document should be high if the sum of its parents' ranks is high. To a first
approximation, this intuition might be embodied in the equation

   r(v) = γ Σ_{w∈pa[v]} r(w) / |ch[w]|,    (5.11)

where r(v) is the rank assigned to page v and γ is a normalization constant. Note
that each parent w contributes by a quantity that is proportional to its rank r(w) but
inversely proportional to its outdegree. This is a fundamental difference with respect
to authority in HITS. The endorsement signal that ows from a given page w to each
of its children decreases as the number of outgoing links (and, therefore, the potential
of being a good hub) increases. Equation (5.11) can also be written in matrix notation,
as
   r = γ B^T r = M^T r.    (5.12)
The matrix B is obtained from the adjacency matrix A of the graph by dividing each
element by the corresponding row sum, i.e.

   b_uv = a_uv / Σ_w a_uw,   if ch[u] ≠ ∅,
   b_uv = 0,                 otherwise.          (5.13)
Interpreting r(v) as the probability that a random surfer is visiting page v, and m_wv
as the probability of moving from page w to page v, the rank update at each step is

   r_t(v) = Σ_{w∈pa[v]} m_wv r_{t−1}(w).    (5.14)

This equation updates the probability that our random surfer will browse page v at
time t, given the vector of probabilities at time t − 1 and the transition probabilities
m_wv. In matrix notation this can be written as

   r_t = M^T r_{t−1}.    (5.15)

To satisfy probability axioms, M must be a stochastic matrix, i.e. its rows should sum
to one. Since the rows of B are normalized, the probability axioms are satisfied if
γ = 1. This simply means that the random surfer picks one of the outlinks in the page
being visited according to the uniform distribution.
A fundamental question is whether iterating Equation (5.14) converges to some
sensible solution regardless of the initial ranks r0 . To answer this we need to inspect
different cases in the light of the theory of nonnegative matrices developed in Sec-
tion 5.2.
Four interesting cases are illustrated in Figure 5.5. The first graph has a primitive
adjacency matrix. Therefore Equation (5.5) holds, and values of r(v) corresponding
to the steady state are indicated inside each node (below the node index). In the
same figure, arcs are labeled by the amount of rank that is passed from a node to its
children. As expected, Equation (5.11) holds everywhere. The second graph is more
problematic, since its adjacency matrix is irreducible but not primitive. In this case,
passing ranks from nodes to their children results in a cyclic updating. The random
walk recursion of Equation (5.14) converges in this case to a limit cycle rather than to
a steady state, and the periodicity of the limit cycle is the period of the matrix, or, as
Figure 5.5 Rank propagation on graphs with different types of incidence matrices.
Equation (5.14) converges to a nontrivial steady state in case (a) and (c), to a limit cycle in case (b),
and to zero in case (d).
we know, the gcd of the lengths of the cycles (4 in the example). This is indicated in
Figure 5.5b by four values a, b, c, d of rank that cyclically bounce along the nodes. The
third graph of Figure 5.5 has a reducible adjacency matrix. In this case M^t converges
to a matrix whose last column is all zero, reflecting the fact that the corresponding node
should have zero rank as it has no parents. Finally, the fourth graph also has a reducible adjacency
matrix, but this time the maximum eigenvalue is less than one, so M^t converges to the
zero matrix. This is due to the existence of a node (1) with no children that effectively
acts as a rank sink.
The situation in Figure 5.5d is of course undesirable but is very common in the
actual Web. Many pages have no outlinks at all. Furthermore, pages that remain on the
crawling frontier and are never fetched will likely produce dangling edges in the graph
that is obtained from crawling. To solve this difficulty, observe that the connectivity
of node 1 should be defined as illegal, since it violates the basic hypothesis underlying
the random walk model: the sum of the probabilities of the available actions should
be one in each node, but once in the sink nodes our random surfer would be left with
no choices. A sensible correction consists of giving the random surfer a method of
escape by adding allowable actions. One possibility is to assume that the surfer, who
cannot possibly follow any link, will restart browsing by picking a new Web page
at random. This is the same as adding a link from each sink to each other vertex,
i.e. introducing an escape matrix E defined as e_vw = 0 if |ch[v]| > 0 and e_vw = 1/n
otherwise, for each w. Then the transition matrix becomes

   M = B + E.
M is now a stochastic matrix and the Markov chain model for a Web surfer is sound. In
general, however, there is no guarantee that M is also primitive (if there are cycles with
zero outdegree as in Figure 5.1b, these bring irreducible but periodic components).
This difficulty will be addressed shortly and for now let us assume that M is primitive.
The following iterative algorithm was suggested in the original paper on PageRank
(Page et al. 1998). It takes as input a nonnegative square matrix M, its size n, and a
tolerance parameter τ.
PageRank(M, n, τ)
1   1 ← [1, . . . , 1]^T ∈ R^n
2   z ← (1/n) 1
3   x_0 ← z
4   t ← 0
5   repeat
6       t ← t + 1
7       x_t ← M^T x_{t−1}
8       d_t ← ||x_{t−1}||_1 − ||x_t||_1
9       x_t ← x_t + d_t z
10      δ ← ||x_t − x_{t−1}||_1
11  until δ < τ
12  return x_t
The quantity d_t is the total rank being lost in sinks. Adding d_t z to M^T x_{t−1} is basically
a normalization step. As it turns out, if M is a stochastic primitive matrix, then d_t = 0
in each iteration (no normalization is necessary) and PageRank converges to the
dominant eigenvector of M^T.
A more general formulation combines the link-based rank vector x with a fixed
distribution e over pages,

   r = εe + (1 − ε)x.

The simplest choice for e is a uniform distribution, i.e. e = (1/n)1. Intuitively, this
approach can be motivated by the metaphor that browsing consists of following existing
links with some probability 1 − ε, or selecting a nonlinked page with probability ε.
When the latter choice is made, each page in the entire Web is sampled according to
the probability distribution e. In the case of the uniform distribution, Equation (5.15)
will be rewritten as

   r_t = [εH + (1 − ε)M]^T r_{t−1},    (5.16)

where H is a square matrix with h_uv = 1/n for each u, v. In this way we have obtained
an ergodic Markov chain whose underlying transition graph is fully connected. The
associated transition matrix εH + (1 − ε)M is primitive and therefore the sequence r_t
converges to the dominant eigenvector. The stationary distribution r associated with
the Markov chain described by Equation (5.16) is known as PageRank. In practice,
ε is typically chosen to be between 0.1 and 0.2 (Brin and Page 1998).
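Putting the pieces together, the power iteration for Equation (5.16), with a uniform teleportation distribution and the sink correction folded into the transition matrix, can be sketched as follows; the function and parameter names are illustrative, not taken from the original papers.

import numpy as np

def pagerank(A, eps=0.15, tol=1e-10, max_iter=1000):
    # A: adjacency matrix (A[u, v] = 1 if page u links to page v); eps is the
    # teleportation probability of Equation (5.16), tol the stopping tolerance.
    n = A.shape[0]
    outdeg = A.sum(axis=1)
    B = np.zeros((n, n))
    B[outdeg > 0] = A[outdeg > 0] / outdeg[outdeg > 0, None]  # row-normalize (Eq. 5.13)
    B[outdeg == 0] = 1.0 / n                                  # escape matrix for sink pages
    M = eps / n + (1 - eps) * B                               # eps*H + (1 - eps)*(B + E)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = M.T @ r                                       # Equation (5.16)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r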
5.5 Stability
An important question is whether the link analysis algorithms based on eigenvectors
(such as HITS and PageRank) are stable in the sense that results do not change
significantly as a function of modest variations in the structure of the Web graph. More
precisely, suppose the connectivity of a portion of the graph is changed arbitrarily,
i.e. let G = (V, E) be the graph of interest and let us replace it by a new graph
G' = (V, E'), where some edges have been added or deleted. How will this affect the
results of algorithms such as HITS and PageRank?
Ng et al. (2001) proved two interesting results about the stability of algorithms
based on the computation of dominant eigenvectors.
Figure 5.6 Forming a bipartite graph in SALSA.
small rank, the overall change will also be small. Bianchini et al. (2001) later proved
the tighter bound

   ||r − r'|| ≤ (2/(1 − ε)) Σ_{j∈V'} r(j),    (5.20)

where V' ⊆ V denotes the set of modified pages.
5.6.1 SALSA
Lempel and Moran (2001) have proposed a probabilistic extension of HITS called the
Stochastic Approach for Link Structure Analysis (SALSA). Similar extensions have
been proposed independently by Rafiei and Mendelzon (2000) and Ng et al. (2001).
In all of these proposals, the random walk is carried out by following hyperlinks both
in the forward and in the backward direction.
SALSA starts from a graph G = (V, E) of topically related pages (like the base
subgraph of HITS) and constructs a bipartite undirected graph U = (V', E') (see
Figure 5.6), where

   V' = V_h ∪ V_a,
   V_h = {v_h : v ∈ V, ch[v] ≠ ∅},
   V_a = {v_a : v ∈ V, pa[v] ≠ ∅},
   E' = {(u_h, v_a) : (u, v) ∈ E}.
The sets Vh and Va are called the hub side and the authority side of U , respectively.
Two separate random walks are then introduced. In the hub walk, each step consists of

(1) following a Web link from a page u_h to a page w_a, and

(2) immediately afterward following a backlink going back from w_a to v_h, where
we have assumed that (u, w) ∈ E and (v, w) ∈ E.

For example, jumping from 1_h to 5_a and then back from 5_a to 2_h in Figure 5.6. In
the authority walk, a step consists of following a backlink first and a forward link
next. In both cases, a step translates into following a path of length exactly two in
U. Note that, by construction, each walk starts on one side of U, either the hub side
or the authority side, and will remain confined to the same side. The Markov chains
associated with the two random walks have transition matrices H and T, respectively,
defined as follows:
   h_uv = Σ_{w : (u,w)∈E, (v,w)∈E}  (1 / deg(u_h)) (1 / deg(w_a)),

   t_uv = Σ_{w : (w,u)∈E, (w,v)∈E}  (1 / deg(u_a)) (1 / deg(w_h)).
The hub and authority weights are then obtained as the principal eigenvectors of the
matrices H and T. Note that these two matrices could also be defined in an alternative
way. Let A be the adjacency matrix of G, A_r the row-normalized adjacency matrix
(as in Equation (5.13)) and let A_c be the column-normalized adjacency matrix of G
(i.e. dividing each nonzero entry by its column sum). Then H consists of the nonzero
rows and columns of A_r A_c^T, while T consists of the nonzero rows and columns of
A_c^T A_r.
Note that h_uv > 0 implies that there exists at least one page w that is linked to by both
u and v. This is known as bibliographic coupling in bibliometrics (Kessler 1963) (see
Figure 5.9 and Exercise 5.2). Similarly, t_uv > 0 implies there exists at least one page
that links to both u and v, a co-citation (Small 1973).
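Using the matrix formulation above, SALSA can be sketched with a few numpy operations. The code below approximates the principal (stationary) eigenvectors of H and T by power iteration and, for brevity, glosses over the restriction to nonzero rows and columns; all names are illustrative.

import numpy as np

def salsa(A, iters=1000):
    # A: adjacency matrix of the base graph G (A[u, v] = 1 if u links to v).
    A = np.asarray(A, dtype=float)
    outdeg = A.sum(axis=1, keepdims=True)
    indeg = A.sum(axis=0, keepdims=True)
    Ar = np.divide(A, outdeg, out=np.zeros_like(A), where=outdeg > 0)  # row-normalized
    Ac = np.divide(A, indeg, out=np.zeros_like(A), where=indeg > 0)    # column-normalized
    H = Ar @ Ac.T            # hub-walk matrix (related to bibliographic coupling)
    T = Ac.T @ Ar            # authority-walk matrix (related to co-citation)
    def stationary(P):
        # power iteration approximating the stationary distribution of the walk
        x = np.full(P.shape[0], 1.0 / P.shape[0])
        for _ in range(iters):
            y = P.T @ x
            x = y / y.sum() if y.sum() > 0 else x
        return x
    return stationary(H), stationary(T)   # hub weights, authority weights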
Lempel and Moran (2001) showed theoretically that SALSA weights are more
robust than HITS weights in the presence of the Tightly Knit Community (TKC)
effect. This effect occurs when a small collection of pages (related to a given topic)
is connected so that every hub links to every authority, and it includes as a special
case the mutual reinforcement effect identified by Bharat and Henzinger (1998) (see
Section 5.3). It can be shown that the pages in a community connected in this way
can be ranked highly by HITS, higher than pages in a much larger collection where
only some hubs link to some authorities. Clearly the TKC effect could be deliberately
created by spammers interested in pushing the rank of certain websites. Lempel and
Moran (2001) constructed examples of pairs of communities, C_s connected in a TKC
fashion and C_l sparsely connected, and proved that the authorities of C_s are ranked
above the authorities of C_l by HITS but not by SALSA.
In a similar vein, Rafiei and Mendelzon (2000) and Ng et al. (2001) have proposed
variants of the HITS algorithm based on a random walk model with reset, similar to
the one used by PageRank. More precisely, a random surfer starts at time t = 0 at a
random page and subsequently follows links from the current page with probability
1 − ε, or (s)he jumps to a new random page with probability ε. Unlike PageRank,
in this model the surfer will follow a forward link on odd steps but a backward link
on even steps. For large t, two stationary distributions result from this random walk,
one for odd values of t, which corresponds to an authority distribution, and one for
even values of t, which corresponds to a hubness distribution. In vector notation the two
distributions are proportional to
The stability properties of these ranking distributions are similar to those of PageRank
(Ng et al. 2001).
Some further improvements of HITS and SALSA, as well as theoretical analyses
on the properties of these algorithms can be found in Borodin et al. (2001).
5.6.2 PHITS
Cohn and Chang (2000) point out a different problem with HITS. Since only the
principal eigenvector is extracted, the authority along the remaining eigenvectors is
completely neglected, despite the fact that it could be significant. An obvious approach
to address this limitation consists of taking into account several eigenvectors of the
co-citation matrix, in the same spirit as PCA is used to extract several factors that
are responsible for variations in multivariate data. As we have discussed in Section 4.5.2,
however, the statistical assumptions underlying PCA are not sound for
multinomial data such as term-document occurrences or bibliographical citations.
PHITS can be viewed as probabilistic LSA (see Section 4.5.2) applied to co-citation
and bibliographic coupling matrices. In this case citations replace terms. As in PLSA,
a document d is generated according to a probability distribution P (d) and a latent
variable z is then attached to d with probability P (z | d). Here z could represent
research areas (in the case of bibliographic data) or a (topical) community in the
case of Web documents. Citations (links) are then chosen according to a probability
distribution P(d' | z).
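A minimal EM sketch for such a factorization, applied to a small, dense document-citation count matrix N and using the symmetric parameterization P(d, c) = Σ_z P(z)P(d | z)P(c | z) (which is equivalent to the asymmetric one just described), might look as follows; this is only an illustration, not Cohn and Chang's implementation, and all names are hypothetical.

import numpy as np

def phits_em(N, K=8, iters=200, seed=0):
    # N: (n_docs, n_cited) matrix of citation counts N[d, c]; K latent factors z.
    # Note: this materializes an (n_docs, n_cited, K) array, so it is only meant
    # for small matrices.
    rng = np.random.default_rng(seed)
    n_d, n_c = N.shape
    Pz = np.full(K, 1.0 / K)
    Pd_z = rng.random((n_d, K)); Pd_z /= Pd_z.sum(axis=0)
    Pc_z = rng.random((n_c, K)); Pc_z /= Pc_z.sum(axis=0)
    for _ in range(iters):
        # E-step: responsibilities P(z | d, c), one K-vector per (d, c) pair
        post = np.einsum('k,dk,ck->dck', Pz, Pd_z, Pc_z)
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate the multinomials from expected citation counts
        counts = N[:, :, None] * post
        Pz = counts.sum(axis=(0, 1)); Pz /= Pz.sum()
        Pd_z = counts.sum(axis=1); Pd_z /= Pd_z.sum(axis=0) + 1e-12
        Pc_z = counts.sum(axis=0); Pc_z /= Pc_z.sum(axis=0) + 1e-12
    return Pz, Pd_z, Pc_z    # P(c | z) acts as the authority profile of factor z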
Figure 5.7 A link farm. Shaded nodes are all copies of the same page.
Figure 5.8 A community G within the Web graph W, with its subgraphs G_core, G_out,
and G_sink, and the rest of the Web W \ G.
decides to join the farm agrees to store a copy of a hub page on her server and to
link it from the root of her site. In return, the main URL of her site is added to the
hub page, which is in turn redistributed to the sites participating in the link exchange.
The result is a densely connected subgraph like the one shown in Figure 5.7.
It may appear that, since structures of this kind are highly regular, they should
be relatively easy to detect (see Exercise 5.8) and thus link farming should not be
a serious concern for search engines. However, it is possible to build farms that are
more tightly entangled in the Web and are therefore more difficult to detect by simple
topological analyses. This problem has been recently pointed out by Bianchini et al.
(2001), who have shown that every community, defined as an arbitrary subgraph G
of the Web, must satisfy a special form of energy balance. The overall PageRank
assigned to pages in G grows with the energy that flows in from pages linking to
the community and decreases with the energy dispersed in sinks and passed to pages
outside the community. With reference to Figure 5.8, let G_out denote the subgraph
of G induced by pages that contain hyperlinks pointing outside of G and let G_sink
denote the sink subgraph of G. Also, let W_in be the subgraph induced by the pages
outside G that link to pages in G. Then the equation

   r_G = |G| + E_G^in − E_G^sink − E_G^out    (5.23)
holds, where |G| is the default energy that is assigned to the community,

   E_G^in = ((1 − ε)/ε) Σ_{w∈W_in} f_G(w) r(w)

is the energy flowing in from outside, where f_G(w) is the fraction of links in w that
point to pages in G, and

   E_G^out = ((1 − ε)/ε) Σ_{w∈G_out} (1 − f_G(w)) r(w)

is the energy passed to pages outside the community.
Exercises
Exercise 5.1. Draw a graph of reasonable size, connecting vertices at random, and
compute the principal eigenvectors of the matrices A^T A and AA^T to get authority and
hubness weights. A very rapid way of doing this is by using linear algebra software
such as Octave. Now select a vertex having nonzero indegree but small authority and
try to modify the graph to increase its authority without increasing either its indegree
or the indegree of its parents.
Exercise 5.2. The two matrices involved in Equations (5.8) and (5.9) were introduced
several years before in the field of bibliometrics. In particular, C = A^T A is known as
the co-citation matrix (Small 1973) and B = AA^T is known as the bibliographic
coupling matrix (Kessler 1963). Show that c_uv is the number of documents that cite
both documents u and v, while b_uv is the number of pages that are cited by both u
and v (see Figure 5.9).
2 See http://www.operatingthetan.com/google/ for details.
Exercise 5.3. Consider the Markov chain in Figure 5.10 (where arcs are labeled by
transition probabilities). Is it ergodic? What is the steady-state distribution?
   ||r − r'|| < τ,

τ being an assigned tolerance. Assume for simplicity that a constant number of pages
are changed in a given unit of time.
Exercise 5.7. Implement the PageRank computation and simulate the results on a
relatively large artificial graph (build the graph using ideas from Chapter 3). Then
introduce link farms in your graph and study the effect they have on the PageRank
vector as a function of the number and the size of the farms.
Exercise 5.8. Propose an efficient algorithm to detect link farms structured as in
Figure 5.7 in a large graph.
6
Advanced Crawling Techniques
In this chapter we refine the basic design of Web crawlers, building on the material
presented in Section 2.5.
Figure 6.1 Efficiency curves obtained by Cho et al. (1998) for a selective crawler. Both were
obtained on a set of 784 592 Stanford University URLs. Dotted lines correspond to a BFS
crawler, solid lines to a crawler guided by the estimated number of backlinks, and dashed lines
to a crawler guided by estimated PageRank. The diagonal lines are the reference performance
of a random crawler. The target measure of relevance is the actual number of backlinks. In the
right-hand plot, the importance target G is 100 backlinks.
having higher score are fetched first. If s(u) provides a good model of the relevance
of the document pointed to by the URL, then we expect that the search process will
be guided toward the most relevant regions of the Web. From this perspective the
problem is to define interesting scoring functions and to devise good algorithms to
compute them.
In the following we consider in detail some specific examples of scoring functions.
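The main loop of such a selective crawler is essentially a best-first search driven by a priority queue keyed on s(u). The sketch below assumes hypothetical helper functions fetch and extract_urls for the networking and parsing steps, which are outside the scope of this illustration.

import heapq

def selective_crawl(seeds, score, fetch, extract_urls, max_pages=10000):
    # score(url): the scoring function s(u); fetch(url) and extract_urls(doc) are
    # placeholders for the HTTP and HTML-parsing components of the crawler.
    queue = [(-score(u), u) for u in seeds]    # negate scores: heapq is a min-heap
    heapq.heapify(queue)
    seen, fetched = set(seeds), []
    while queue and len(fetched) < max_pages:
        _, url = heapq.heappop(queue)          # URL with the highest score s(u)
        doc = fetch(url)
        fetched.append(url)
        for v in extract_urls(doc):
            if v not in seen:
                seen.add(v)
                heapq.heappush(queue, (-score(v), v))
    return fetched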
where root(u) is the root of the site containing u. The rationale behind this approach
is that by maximizing breadth it will be easy for the end-user to eventually find the
desired information. For example, a page that is not indexed by the search engine
may be easily reachable from other pages in the same site.
Popularity. It is often the case that we can define criteria according to which certain
pages are more important than others. For example, search engine queries are
answered by proposing to the user a sorted list of documents. Typically, users tend
to inspect only the first few documents in the list (in Chapter 8 we will discuss
empirical data that support this statement). Thus, if a document rarely appears near
the top of any list, it may be worthless to download it and its importance should be
small. A simple way of assigning importance to documents that are more popular
is to introduce a relevance function based on the number of backlinks,

   s^(backlinks)(u) = 1, if indegree(u) > τ,
                      0, otherwise.            (6.2)
Figure 6.2 Results of the study carried out by Davison in 2000. (a) Three indicators that
measure text similarity (see Section 4.3) are compared in five different linkage contexts. In
this data set, about 56% of the linked documents belong to the same domain. It can be seen
that the similarity between linked documents is significantly higher than the similarity between
random documents. (b) Similarities measured between anchor text and text of documents in
five different contexts.
Figure 6.3 Efficiency of a BFS and a focused crawler compared in the study by Chakrabarti
et al. (1999b). The plots show the average relevance of fetched documents versus the number
of fetched documents. Results for two topics are shown (cycling on the left, AIDS/HIV on the
right).
Parent based. In this case we compute the score for a fetched document and extend
the computed value to all the URLs in that document. More precisely, if v is a
parent of u, we approximate the score of u as

   s^(topic)(u) = P(c | d(v), θ).    (6.4)
The rationale is a general principle of topic locality. If a page deals with, say,
music, it may be reasonable to believe that most of the outlinks of that page will deal
with music as well. In a systematic study based on the analysis of about 200 000 doc-
uments, Davison (2000b) found that topic locality is measurably present in the Web,
under different text similarity measures (see Figure 6.2a).
Chakrabarti et al. (1999b) use a hierarchical classifier and suggest two implementations
of the parent-based scoring approach. In hard focusing, they check if at least
one node in the category path of d(v) is associated with a topic of interest; if not,
Figure 6.4 Example of a two-layered context graph. The central white node
is a target document. Adapted from Diligenti et al. (2000).
the outlinks of v are simply discarded (not even inserted in the crawling queue Q).
In soft focusing, relevance is computed according to Equation (6.4), but if the same
URL is discovered from multiple parents, a revision strategy is needed in order to
update s^(topic)(u) when new evidence is collected. Chakrabarti et al. (1999b) found
no significant difference between the two approaches.
Anchor based. Instead of the entire parent document, d(v), we can just use the text
d(v, u) in the anchor(s) where the link to u appears, as the anchor text is often
very informative about the contents of the document pointed to by the corresponding
URL. This 'semantic linkage' was also quantified by Davison (2000b), who showed
that the anchor text is most similar to the page it references (see Figure 6.2b).
To better illustrate the behavior of a focused crawler on real data, consider the
efficiency diagrams in Figure 6.3 that summarize some results obtained by Chakrabarti
et al. (1999b). An unfocused crawler starting from a seed set of pages that are relevant
to a given topic will soon begin to explore irrelevant regions of the Web. As shown in
the top diagrams, the average relevance of the fetched pages dramatically decreases
as crawling goes on. Using a focused crawler (in this case, relevance is predicted
by a Naive Bayes classifier trained on examples of relevant documents) allows us to
maintain an almost steady level of average relevance.
There are several alternatives to a focused crawler based on a single best-first queue,
as detailed in the following.
Context graphs. Diligenti et al. (2000) trained classifiers to estimate the link distance at which
relevant information can be expected to be found starting from a given page. Intuitively,
suppose the crawler is programmed to gather homepages of academic courses
in artificial intelligence. The backlinks of these pages are likely to lead to professors'
home pages or to the teaching sections of the department site. Going one step further,
backlinks of backlinks are likely to lead into higher level sections of department sites
(such as those containing lists of the faculty). More precisely, the context graph of a
node u (see Figure 6.4) is the layered graph formed inductively as follows. Layer 0
contains node u. Layer i contains all the parents of all the nodes in layer i - 1. No
edges jump across layers. Starting from a given set of relevant pages, Diligenti et al.
(2000) used context graphs to construct a data set of documents whose distance from
the relevant target was known (backlinks were obtained by querying general purpose
search engines). After training, the machine-learning system predicts the layer a new
document belongs to, which indicates how many links need to be followed before relevant
information will be reached, or it returns 'other' to indicate that the document
and its near descendants are all irrelevant. Denoting by n the depth of the considered
context graph, the crawler uses n best-first queues, one for each layer, plus one extra
queue for documents of class 'other'. This latter queue is initialized with the seeds.
In the main loop, the crawler extracts URLs from the first nonempty queue and in this
manner favors those that are more likely to rapidly lead to relevant information.
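As an illustration, the following minimal sketch (not taken from the systems above) outlines the layered-queue logic of such a context-focused crawler; fetch, extract_urls, and predict_layer are hypothetical placeholders for an HTTP client, a link extractor, and the trained layer classifier.

from collections import deque

def context_focused_crawl(seeds, fetch, extract_urls, predict_layer,
                          n_layers, max_pages=10000):
    """Crawl using one queue per context-graph layer plus an 'other' queue.

    predict_layer(doc) is assumed to return an integer layer in
    {0, ..., n_layers - 1}, or None for documents classified as 'other'.
    """
    queues = [deque() for _ in range(n_layers)]   # queue i holds URLs found on layer-i pages
    other = deque(seeds)                          # the 'other' queue is initialized with the seeds
    seen = set(seeds)
    fetched = 0
    while fetched < max_pages:
        # Extract from the first nonempty queue, favoring URLs that are
        # expected to lead quickly to relevant documents.
        for q in queues + [other]:
            if q:
                url = q.popleft()
                break
        else:
            break                                 # all queues are empty
        doc = fetch(url)
        fetched += 1
        if doc is None:
            continue
        layer = predict_layer(doc)
        target = other if layer is None else queues[layer]
        for out_url in extract_urls(doc):
            if out_url not in seen:
                seen.add(out_url)
                target.append(out_url)            # outlinks are filed under the parent's layer
    return fetched

For simplicity the queues here are FIFO rather than best-first, and the outlinks of a page are filed under the layer predicted for that page; real systems additionally rank URLs within each queue by classifier confidence.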
An optimal policy π* maximizes the value function over all the states: V^π*(s) ≥
V^π(s) for all s ∈ S. According to the Bellman optimality principle, underlying
the foundations of dynamic programming (Bellman 1957), a sequence of optimal
decisions has the property that, regardless of the action taken at the initial time, the
subsequent sequence of decisions must be optimal with respect to the outcome of the
first action. This translates into

    V*(s) = max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) [r(s, a, s') + γ V*(s')],

which allows us to determine the optimal policy once V* is known for all s. An
optimal policy also maximizes Q*(s, a) for all s ∈ S, a ∈ A(s):

    Q*(s, a) = Σ_{s'} P(s' | s, a) [r(s, a, s') + γ max_{a' ∈ A(s')} Q*(s', a')].

The advantage of the state-action representation is that once Q* is known for all s
and a, all the agent needs to do in order to maximize the expected future reward is to
choose the action that maximizes Q*(s, a).
In the context of focused Web search, immediate rewards are obtained (downloading
relevant documents) and the policy learned by reinforcement can be used to guide
the agent (the crawler) toward high long-term cumulative rewards. As we have seen
in Section 2.5.3, the internal state of a crawler is basically described by the sets of
fetched and discovered URLs. Actions correspond to fetching a particular URL that
belongs to the queue of discovered URLs. Even if we simplify the representation by
removing from the Web all the off-topic documents, it is clear that the sets of states and
actions are overwhelming for a standard implementation of reinforcement learning.
LASER (Boyan et al. 1996) was one of the first proposals to combine reinforcement
learning ideas with Web search engines. The aim in LASER is to answer queries
rather than to crawl the Web. The system begins by computing a relevance score
r_0(u) = TFIDF(d(u), q) and then propagates it in the Web graph using the recurrence

    r_{t+1}(u) = r_0(u) + γ Σ_{v ∈ pa[u]} r_t(v) / |pa[u]|^η,      (6.9)

which is iterated until convergence for each document in the collection, where γ and
η are free parameters. After convergence, documents at distance k from u provide a
contribution proportional to γ^k times their relevance to the relevance of u.
McCallum et al. (2000c) used reinforcement learning to search for computer science papers
in academic websites. In order to simplify the problem of learning Q*(s, a) for an
enormous number of states and actions they propose the following two assumptions:

the state is independent of the relevant documents that have been fetched so
far;

actions can only be distinguished by means of the words in the neighborhood
of the hyperlink that corresponds to each action (e.g. the anchor text).

In this way, learning Q* reduces to learning a mapping from text (e.g. bag of words
representing anchors) to a real number (the expected future reward).
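A minimal sketch of this idea is to regress a discounted future-reward target onto a bag-of-words representation of the anchor text. The training-set construction below (assigning reward γ^k to an action that led to a relevant document k steps later) and the use of scikit-learn are illustrative assumptions, not the exact scheme of McCallum et al. (2000c).

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical training data: anchor texts of links that were followed, and
# the number of steps after which a relevant document was reached
# (None if none was reached within the horizon).
anchors = ["cs department faculty", "publications and papers",
           "campus parking map", "research reports archive"]
steps_to_relevant = [2, 1, None, 1]

gamma = 0.5   # discount factor (an assumption)
y = np.array([0.0 if k is None else gamma ** k for k in steps_to_relevant])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(anchors)

# Learn a mapping from anchor-text words to expected discounted reward.
model = LinearRegression()
model.fit(X, y)

# At crawl time, score the anchors of newly discovered links and fetch the
# URL whose anchor text promises the highest future reward.
new_anchors = ["list of faculty members", "directions to campus"]
scores = model.predict(vectorizer.transform(new_anchors))
best = new_anchors[int(np.argmax(scores))]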
Interestingly, since CiteSeer creates a Web page for each online article, it effectively
maps the online subset of the computer science literature web to a subset of the World
Wide Web. A recent study has shown that papers that are available online tend to
receive a significantly higher number of citations (Lawrence 2001).
The DEADLINER system described in Kruger et al. (2000) is a search engine
specialized in conference and workshop announcements. One of the input components
of the system is a context-graph focused crawler (see Section 6.2.2) that gathers
potentially related Web documents. An SVM text classifier is subsequently used to
refine the set of documents retrieved by the focused crawler.
Figure 6.5 Two crawlers statically coordinated: crawler i starts from the seeds S_i and crawler j
from the seeds S_j; A_ij denotes the set of documents in partition i that can be reached from S_j.
Coordination refers to the way different processes agree about the subset of pages
each of them should be responsible for. If two crawling processes i and j are
completely independent (not coordinated), then the degree of overlap can only
be controlled by having different seeds S_i and S_j. If we assume the validity of
topological models such as those presented in Chapter 3, then we can expect that
the overlap will eventually become significant unless a partition of the seed set is
properly chosen. On the other hand, making a good choice is a challenge, since
the partition that minimizes overlap may be difficult to compute, for example,
because current models of the Web are not accurate enough. In addition, it may be
suboptimal with respect to other desiderata that motivated distributed crawling in
the first place, such as distributing the load and scaling up.
A pool of crawlers can be coordinated by partitioning the Web into a number of
subgraphs and letting each crawler be mainly responsible for fetching documents
from its own subgraph. If the partition is decided before crawling begins and not
changed thereafter, we refer to this as static coordination. This option has the great
advantage of simplicity and implies little communication overhead. Alternatively,
if the partition is modied during the crawling process, the coordination is said to
be dynamic. In the static approach the crawling processes, once started, can be seen
as agents that operate in a relatively autonomous way. In contrast, in the dynamic
case each process is subject to a reassignment policy that must be controlled by an
external supervisor.
Confinement specifies, assuming statically coordinated crawlers, how strictly each
crawler should operate within its own partition. Consider two processes, i and
j, and let A_ij denote the set of documents belonging to partition i that can be
reached from the seeds S_j (see Figure 6.5). The question is what should happen
when crawler i pops from its queue 'foreign' URLs pointing to nodes in a different
partition. Cho and Garcia-Molina (2002) suggest three possible modes: firewall,
crossover, and exchange. In firewall mode, each process remains strictly within its
partition and never follows interpartition links. In crossover mode, a process can
follow interpartition links when its queue does not contain any more URLs in its
own partition. In exchange mode, a process never follows interpartition links, but
it can periodically open a communication channel to dispatch the foreign URLs
it has encountered to the processes that operate in those partitions. To see how
these modes affect performance measures, consider Figure 6.5 again. The firewall
mode has, by definition, zero overlap but can be expected to have poor coverage,
since documents in A_ij \ A_ii are never fetched (for all i and j). Crossover mode
may achieve good coverage but can have potentially high overlap. For example,
documents in A_ii ∩ A_ij can be fetched by both process i and j. The exchange mode
has no overlap and can achieve perfect coverage. However, while the first two modes
do not require extra bandwidth, in exchange mode there will be a communication
overhead.
Partitioning defines the strategy employed to split URLs into non-overlapping subsets
that are then assigned to each process. A straightforward approach is to compute
a hash function of the IP address in the URL, i.e. if n ∈ {0, . . . , 2^32 − 1} is the integer
corresponding to the IP address and m the number of processes, documents such
that n mod m = i are assigned to process i. In practice, a more sophisticated solution
would take into account the geographical location of networks, which can
be inferred from the IP address by querying Whois databases such as the Réseaux
IP Européens (RIPE) or the American Registry for Internet Numbers (ARIN).
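A minimal sketch of the hash-based assignment just described is shown below; the use of Python's socket and hashlib modules is simply one possible implementation choice, and the helper names are made up.

import hashlib
import socket
from urllib.parse import urlparse

def assign_process(url, m):
    """Map a URL to one of m crawling processes via its resolved IP address."""
    host = urlparse(url).hostname
    ip = socket.gethostbyname(host)                   # resolve the host to an IPv4 address
    n = int.from_bytes(socket.inet_aton(ip), "big")   # integer in {0, ..., 2**32 - 1}
    return n % m

# In exchange mode, process i would buffer foreign URLs (those for which
# assign_process(url, m) != i) and periodically dispatch them to their owners.

# A simpler variant hashes the host name directly, avoiding DNS lookups, at the
# cost of splitting virtual hosts that share an IP address across processes:
def assign_process_by_host(url, m):
    host = urlparse(url).hostname
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % m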
Note that G(t) is the expected fraction of components that are still operating at
time t. The age probability density function (pdf) g(t) is thus proportional to the
survivorship function. Returning from the reliability metaphor to Web documents,
S(t) is the probability that a document that was last changed at time zero will remain
unmodified at time t, while G(t) is the expected fraction of documents that are older
than t. The probability that the document will be modified before an additional time
h has passed is expressed by the conditional probability P(t < T ≤ t + h | T > t).
The change rate λ(t) (also known as the hazard rate in reliability theory, or force of
mortality in demography) is then obtained by dividing by h and taking the limit for
small h,

    λ(t) = lim_{h→0} (1/h) P(t < T ≤ t + h | T > t)
         = lim_{h→0} (1/S(t)) (1/h) ∫_t^{t+h} f(τ) dτ = f(t)/S(t),      (6.12)

where f(t) denotes the lifetime pdf. Combining (6.10) and (6.12) we have the ordinary
differential equation (see, for example, Apostol 1969)

    F'(t) = λ(t)(1 − F(t)),      (6.13)
with F(0) = 0. We will assume that changes happen randomly and independently
according to a Poisson process, so that the probability of a change event at any given
time is independent of when the last change happened (see Appendix A). For a constant
change rate λ(t) = λ, the solution of Equation (6.13) is

    F(t) = 1 − e^{−λt},    f(t) = λ e^{−λt}.

Figure 6.6 Sampling lifetimes can be problematic as changes can be missed for two reasons.
Top, two consecutive changes are missed and the observed lifetime is overestimated. Bottom,
the observation timespan must be large enough to catch changes that occur over a long range.
Sampling instants are marked by double arrowheads.
Brewington and Cybenko (2000) observed that the model could be particularly
valuable for analyzing Web documents. In practice, however, the estimation of f (t)
is problematic for any method based on sampling. If a document is observed at two
different instants t1 and t2, we can check for differences in the document, but we cannot
know how many times the document was changed in the interval [t1, t2], a phenomenon
known as aliasing (see Figure 6.6).

Figure 6.7 Empirical distributions of page age (top) and page lifetime (bottom) on a set of
7 million Web pages. Adapted from Brewington and Cybenko (2000).

On the other hand, the age of a document
may be readily available, if the Web server correctly returns the Last-Modified
timestamp in the HTTP header (see Section 2.3.4 for a sample script that obtains this
information from a Web server). Sampling document ages is not subject to aliasing
and lifetime can be obtained indirectly via Equation (6.11). In particular, if the change
rate is constant, it is easy to see that the denominator in Equation (6.11) is 1/λ and
thus from Equation (6.12) we obtain

    g(t) = f(t) = λ e^{−λt}.      (6.14)
In other words, assuming a constant change rate, it is possible to estimate lifetime
from observed age.
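Under this constant-change-rate assumption the estimation is straightforward: since g(t) is exponential, the maximum-likelihood estimate of λ from a sample of observed ages is the reciprocal of the mean age. The snippet below is a minimal sketch of this calculation on hypothetical age data.

import numpy as np

# Hypothetical observed ages (in days), e.g. derived from Last-Modified headers.
ages = np.array([3.0, 10.0, 45.0, 1.0, 120.0, 30.0, 7.0, 60.0])

# For g(t) = lambda * exp(-lambda * t), the ML estimate is 1 / mean(age).
lam_hat = 1.0 / ages.mean()
mean_lifetime = 1.0 / lam_hat            # expected time between changes, in days

# Probability that a page remains unmodified over a horizon of, say, 7 days.
p_unchanged_week = np.exp(-lam_hat * 7.0)
print(f"lambda = {lam_hat:.4f} per day, mean lifetime = {mean_lifetime:.1f} days, "
      f"P(no change in a week) = {p_unchanged_week:.2f}")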
This simple model, however, does not capture the essential property that the Web
is growing with time. Brewington and Cybenko (2000) collected a large data set of
roughly 7 million Web pages, observed between 1999 and 2000 in a time period
of seven months, while operating a service called The Informant. 1 The resulting
distributions of age and lifetime are reported in Figure 6.7.
1 Originally http://informant.dartmouth.edu, now http://www.tracerlock.com.
Figure 6.8 Assuming an exponentially growing Web whose pages are never changed,
it follows that the age distribution is also exponential at any time τ.
What is immediately evident is that most of the collected Web documents are
young, but what is interesting is why this is the case. Suppose that growth is modeled
by an exponential distribution, namely that the size at time τ is P(τ) = P_0 e^{βτ}, where
P_0 is the size of the initial population and β the growth rate. If documents were
created and never edited, their age would be simply the time since their creation. The
probability that a random document at time τ has an age less than t is

    G_g(t, τ) = (new docs in (τ − t, τ)) / (all docs) = (e^{βτ} − e^{β(τ−t)}) / (e^{βτ} − 1)      (6.15)

(see Figure 6.8) and thus the resulting age density at time τ is

    g_g(t, τ) = [β e^{β(τ−t)} / (e^{βτ} − 1)] [H(t) − H(t − τ)],      (6.16)

where H(t) is the Heaviside step function, i.e. H(t) = 0 for t < 0 and H(t) = 1
otherwise.
In other words, this trivial growth model yields an exponential age distribution,
like the model in Section 6.4.1 that assumed a static Web with documents refreshed
at a constant rate. Clearly, the effects of both Web growth and document refreshing
should be taken into account in order to obtain a realistic model. Brewington and
Cybenko (2000) experimented with a hybrid model (not reported here) that combines
an exponential growth model and exponential change rate. Fitting this model with
ages obtained from timestamps, they estimated β = 0.00176 (in units of days^-1).
This estimate corresponds to the Web size doubling in about 394 days. In contrast,
if we estimated β from the lower bounds of 320 million pages in December 1997
(Lawrence and Giles 1998b) and 800 million pages in February 1999 (Lawrence and
Giles 1999), roughly 426 days later, we would obtain β = 0.0022, or a doubling time
of 315 days. Despite the differences in these two estimates, they are of the same order
of magnitude and give us some idea of the growth rate of the Web.
Another important problem when using empirically measured ages is that servers
often do not return meaningful timestamps. This is particularly true in the case of
highly dynamic Web pages. For example, it is not always possible to assess from the
timestamp whether a document was just edited or was generated on the fly by the
server. These considerations suggest that the estimation of change rates should be
carried out using lifetimes rather than ages. The main difficulty is how to deal with
the problem of potentially poorly chosen timespans, as exemplified in Figure 6.6.
Brewington and Cybenko (2000) suggest a model that explicitly takes into account
the probability of observing a change, given the change rate and the timespan. Their
model is based on the following assumptions.
Document changes are events controlled by an underlying Poisson process,
where the probability of observing a change at any given time does not depend
on previous changes. Given the timespan τ and the change rate λ, the probability
that we observe a change (i.e. that the document is modified at least once during
the timespan) is therefore

    P(c | λ, τ) = 1 − e^{−λτ}.      (6.17)

It should be observed that, in their study, Brewington and Cybenko (2000)
found that pages are changed with different probabilities at different hours of
the day or during different days of the week (most changes being concentrated
during office hours). Nonetheless, the validity of a memoryless Poisson model
is assumed.

Mean lifetimes are Weibull distributed (see Appendix A), i.e. denoting the mean
lifetime by t̄ = 1/λ, the pdf of t̄ is

    w(t̄) = (σ/δ)(t̄/δ)^{σ−1} e^{−(t̄/δ)^σ},      (6.18)

where δ is a scale parameter and σ is a shape parameter.

Change rates and timespans are independent and thus

    P(c | λ) = ∫_0^∞ P(c, τ | λ) dτ = ∫_0^∞ P(τ) P(c | λ, τ) dτ.
where w(1/λ) is the estimated density of the mean lifetime t̄ = 1/λ. Brewington and Cybenko (2000)
used the data shown in Figure 6.7 to estimate the Weibull distribution parameters,
obtaining σ = 1.4 and δ = 152.2. The estimated mean lifetime distribution is plotted
in Figure 6.9.
Figure 6.9 Estimated probability density function (left) and cumulative distribution function
(right) of mean page lifetime resulting from the study of Brewington and Cybenko (2000).
Figure 6.10 A document is β-current at time t if no changes have occurred before
the grace period that extends backward in time from t until t − β.
These results allow us to estimate how often a crawler should refresh the index
of a search engine to guarantee that it will remain (α, β)-current. Let us consider
first a single document and, for simplicity, let t = 0 be the time when the crawler
last fetched the document. Also, let I be the time interval between two consecutive
visits (see Figure 6.10). The probability that for a particular time t the document
is unmodified in [0, t − β] is e^{−λ(t−β)} for t ∈ [β, I) and 1 for t ∈ (0, β). Thus, the
probability that a specific document is β-current at a random time t in (0, I) is

    ∫_0^β (1/I) dt + ∫_β^I (1/I) e^{−λ(t−β)} dt = β/I + (1 − e^{−λ(I−β)})/(λI),      (6.20)

but since each document has a change rate λ whose reciprocal is Weibull distributed,
the probability α that the collection of documents is β-current is

    α = ∫_0^∞ w(t̄) [β/I + (t̄/I)(1 − e^{−(I−β)/t̄})] dt̄.      (6.21)

This allows us to determine the minimum refresh interval I to guarantee (α, β)-currency
once the parameters of the Weibull distribution for the mean lifetime are
known. Assuming a Web size of 800 million pages, Brewington and Cybenko (2000)
determined that a reindexing period of about 18 days was required to guarantee that
95% of the repository was current up to one week ago.
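For illustration, the sketch below numerically evaluates Equation (6.21) for the Weibull parameters quoted above and searches for the smallest refresh interval I achieving a target α; the use of scipy.integrate.quad and the day-by-day search are assumptions of this sketch, not the procedure of the original study.

import numpy as np
from scipy.integrate import quad

def weibull_pdf(tbar, shape=1.4, scale=152.2):
    """Weibull density of the mean page lifetime (in days)."""
    return (shape / scale) * (tbar / scale) ** (shape - 1) * np.exp(-(tbar / scale) ** shape)

def beta_currency(I, beta, shape=1.4, scale=152.2):
    """Probability alpha that a random document is beta-current, Equation (6.21)."""
    def integrand(tbar):
        return weibull_pdf(tbar, shape, scale) * (
            beta / I + (tbar / I) * (1.0 - np.exp(-(I - beta) / tbar)))
    value, _ = quad(integrand, 0.0, np.inf)
    return value

def min_refresh_interval(alpha_target=0.95, beta=7.0):
    """Smallest I (in days) such that the repository stays (alpha, beta)-current."""
    for I in np.arange(beta + 1.0, 365.0, 1.0):
        if beta_currency(I, beta) < alpha_target:
            return I - 1.0
    return None

print(min_refresh_interval())   # roughly of the order of a couple of weeks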
Figure 6.11 Average change interval found in a study conducted by Cho and Garcia-Molina
(2000a). 270 popular sites were monitored for changes from 17 February to 24 June 1999. A
sample of 3000 pages was collected from each site by visiting in breadth-first order from the
homepage.
The data show that commercial sites are much more dynamic than educational or governmental
sites. While this is not surprising, it suggests that a resource allocation policy that does not take into
account site (or even document) specific dynamics may waste bandwidth re-fetching
documents whose copies in the index are still up to date. From a different viewpoint, this
nonuniformity suggests that, for a given bandwidth, the recency of the index can be
improved if the refresh rate is differentiated for each document.
To understand how to design an optimal synchronization policy we will make
several simplifying assumptions. Suppose there are N documents of interest and
suppose we can estimate the change rate λ_i, i = 1, . . . , N, of each document. Suppose
also that it is practical to program a crawler that regularly fetches each document
i with a refresh interval I_i. Suppose also that the time required to fetch each document
is constant. The fact that we have limited bandwidth should be reflected in a constraint
involving the I_i. If B is the available bandwidth, expressed as the number of documents
that can be fetched in a time unit, this constraint is

    Σ_{i=1}^{N} 1/I_i ≤ B.      (6.22)
The problem of optimal resource allocation consists of selecting the refresh intervals
I_i so that a recency measure of the resulting index (e.g. freshness or index age) will be
maximized (Cho and Garcia-Molina 2000b). For example, we may want to maximize
freshness,

    (I_1*, . . . , I_N*) = arg max_{I_1, ..., I_N} Σ_{i=1}^{N} F(λ_i, I_i),      (6.23)

subject to (6.22), where F(λ_i, I_i) denotes the expected freshness of document i. We
might be tempted to arrange a policy that assigns to each document a refresh interval
I_i that is inversely proportional to the change rate λ_i. However, this intuitive approach
is suboptimal, and can be proven to be even worse than assigning the same interval
to each document. The optimal intervals can be derived analytically (Cho and
Garcia-Molina 2000b).
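To see why the proportional policy can underperform even a uniform one, the following sketch compares the two under a Poisson change model, using the standard expected-freshness expression (1 − e^{−λI})/(λI) for a page with change rate λ refreshed every I time units; the example change rates and bandwidth are made up for illustration.

import numpy as np

def expected_freshness(lam, I):
    # Time-averaged probability that a page with Poisson change rate lam
    # is unmodified since its last fetch, when it is re-fetched every I days.
    return (1.0 - np.exp(-lam * I)) / (lam * I)

# Hypothetical change rates (changes per day) for five documents.
lam = np.array([2.0, 1.0, 0.5, 0.1, 0.01])
B = 5.0   # bandwidth: total fetches per day, so sum(1 / I) must not exceed B

# Uniform policy: every document gets the same refresh interval.
I_uniform = np.full_like(lam, len(lam) / B)

# 'Intuitive' policy: refresh frequency proportional to the change rate,
# i.e. interval inversely proportional to lam, scaled to the same bandwidth.
freq = B * lam / lam.sum()
I_proportional = 1.0 / freq

for name, I in [("uniform", I_uniform), ("proportional", I_proportional)]:
    print(name, expected_freshness(lam, I).mean())
# On examples like this one the uniform policy achieves higher average
# freshness, in line with the discussion above.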
Exercises
Exercise 6.1. Write a simplified crawling program that organizes the list of URLs
to be fetched as a priority queue. Order the priority queue according to the expected
indegree of the page pointed to by each URL and compare your results to a best-first
search algorithm, using the actual indegree as a target measure of relevance.

Exercise 6.2. Extend the crawler developed in Exercise 6.1 to search for documents
related to a specific topic of interest. Collect a set of documents of interest to be used as
training examples and use one of the text categorization tools studied in Chapter 4 to guide
the crawler. Compare the results obtained using the parent-based and the anchor-based
strategies.

Exercise 6.3. Write a program that recursively scans your hard disk and estimates
the lifetime of your files.

Exercise 6.4. Suppose the page mean lifetime is Weibull distributed with parameters
σ and δ. What are the average mean lifetime, the most likely mean lifetime, and the
median mean lifetime? What would be reasonable values for these quantities starting
from an estimated distribution like the one shown in Figure 6.9?

Exercise 6.5. Explain what we mean when we say that a collection of stored documents
is (α, β)-current.

Exercise 6.6. Suppose you want to build a (0.8, 1-week)-current search engine for
a collection of 2 billion documents whose average size is 10 KB. Suppose the mean
lifetime is Weibull distributed with σ = 1 and δ = 100 days. What is the required
bandwidth (suppose the time necessary to build the index is negligible)?
7

Modeling and Understanding Human Behavior on the Web

7.1 Introduction
Up to this point in the book we have focused on how the Web works and its general
characteristics, such as the inner workings of search engines and the basic properties
of the connectivity of the Web. This provides us with a rich source of information
on how the Web is constructed and interconnected, illustrating it
from an engineering viewpoint. However, there is a critically important aspect of the
picture that we have not yet discussed, namely how humans interact with the Web. In
this chapter we will examine in some detail this aspect of human interaction across
a broad spectrum of activities, such as how we use the Web to browse, navigate, and
issue search queries.
The Web can be viewed as an enormous distributed laboratory for studying human
behavior in a virtual digital environment. From a data analysis viewpoint, the Web
provides rich opportunities to gather observational data on a large scale and to use
such data to construct, test, refute, and adapt models of how humans behave in the
Web environment. For example, if we were to record all search queries issued over a
12-month period at a large search-engine website, we could then use this data for
(1) exploration, e.g. we can generate summary statistics on how site-visitors are
issuing queries, such as how many queries per session, or the distribution of
the number of terms across queries;
(2) modeling, e.g. we can investigate whether there is any dependence between the
content of a query and the time of day or day of week when the query is issued;
(3) prediction, e.g. we can construct a model to predict which Web pages are likely
to be most relevant for a query issued by a site-visitor for whom we have past
historical query and navigation data.
In this chapter we will explore how we can go from exploratory summary statistics
(such as raw counts, percentages, and histograms) to more sophisticated predictive
models. These types of models are of broad interest across a wide variety of fields:
from the social scientist who wishes to better understand the social implications of
Web usage, to the human factors specialist who wishes to design better software
tools for information access on the Web, to the network engineer who seeks a better
understanding of the human mechanisms that contribute to aggregate network traffic
patterns, to the e-commerce sales manager who wants to better predict which factors
influence the Web shopping behavior of a site-visitor.
We begin with an overview of Web data measurement issues. While the Web allows
us to collect vast amounts of data, there are nonetheless some important factors that
limit the quality of this data, depending, for instance, on how and where the data are
collected. In Section 7.2 we discuss in particular those factors that can have a direct
impact on any inferences we might make from the available data. We follow this in
Section 7.3 by looking at a variety of empirical studies that investigate how we navigate
and browse Web pages. Section 7.4 then discusses a number of different probabilistic
mechanisms (such as Markov models) that have been proposed for modeling Web
browsing. Section 7.5 concludes the chapter by looking at how we issue queries to
search engines, discussing data from various empirical studies as well as a number of
probabilistic models that capture various aspects of how we use Web search engines.
Figure 7.1 Number of page requests per hour as a function of time from page requests
recorded in the www.ics.uci.edu Web server logs during the first week of April 2002. The
solid vertical lines indicate boundaries between the different days (at midnight). The top plot
is total traffic, the center plot is human-generated traffic, and the lower plot is robot-generated
traffic.
Robot page requests were identified by classifying page requests using a variety of
heuristics; for example, many robots identify themselves as such in the server logs. The
top plot is the total number of page requests per hour, the center plot is the estimated
number generated by humans per hour, and the lower plot is the estimated number of
page requests generated by robots per hour. The top plot is the sum of the center and
lower plots.
We can clearly see that the human-generated traffic is quite different from the
robot-generated traffic. The robot traffic consists of two components: periodic spikes
in traffic combined with a lower-level, relatively constant stream of requests. The more
constant component of the robot traffic can be assumed to be from robots from 'good
citizen' search engines and cache sites, which attempt to distribute their crawler requests
to any single site over time so as not to cause any significant bursts of activity
(and potential deterioration in quality of service) at the Web server. The spiky traffic,
on the other hand, can be assumed to be coming from a less well-intentioned source
(or sources), whose crawler periodically floods websites with page requests and can
potentially overload a server. In Figure 7.1 the dominant burst of page requests is
arriving at the same time each night, just after midnight.
The human-generated traffic has quite a different characteristic. There is a noisy,
but quite discernible, daily pattern of traffic, particularly during Monday to Friday.
Human-generated traffic tends to peak around midday, and tends to be at its lowest
between 2 am and 6 am (not surprisingly). This suggests that much of the traffic to the
site is being generated in synchrony with the daily Monday to Friday work schedule
of university staff and students.
The main point here is to illustrate that Web server log data need careful interpretation.
The presence of robot data in this example is a significant contributor to overall
traffic patterns on the website. For example, if we were to mistakenly assume that the
data in the top plot in Figure 7.1 were all human-generated, this assumption might
lead us to rather different (and false) hypotheses about time-dependent patterns in
human traffic to the site.
Figure 7.2 A graphical summary of how page requests from an individual user can be masked
at various stages between the user's local computer and the Web server, via page caching at
the browser level, multiple different users on the local computer, dynamic network address
assignment at the local network level, and more page caching by proxy servers.
it can simply reply with a message that it is too busy to process the request at that
time.
Thus, in theory, the Web server sees every individual request for pages on the site. In
practice the situation is a little more complex, as illustrated graphically in Figure 7.2.
Browser software, for example, often stores in local memory the HTML content of
pages that were requested earlier by a user. This is known as caching and is a generally
useful technique for improving response time for pages that are requested more than
once over a period of time. For example, if the content of a particular Web page does
not change frequently, and a user uses the back button on their browser to revisit the
page during a session, then it makes sense for the browser to redisplay a stored or
cached copy of the page rather than requesting it again from the Web server. Thus,
although the user viewed the page a second time, this page viewing is not recorded
in the Web server log because of local caching.
Caching can also occur at intermediate points between the user's local computer
and the Web server. For example, an Internet service provider (ISP) might cache Web
pages that are particularly popular and store them locally. In this manner, requests
from users for these pages will be intercepted by the ISP cache and the cached version
of the page is displayed on the user's browser. Once again the page request by the
user for the cached page will not be recorded in the Web server logs.
Proxy servers can be thought of as a generalization of caching. A proxy server is
typically designed to serve the Web-surfing needs of a large set of users, such as the
customers of an ISP or the employees of a large organization. The proxy exists on
the network at a point between each individual's computer and the Web at large. It
acts as a type of intermediary for providing Web content to users. For example, the
contents of entire websites that are frequently accessed might be downloaded by the
proxy server during low-traffic hours, in order to reduce network traffic at peak hours.
Various security mechanisms and filtering of content can also be implemented on the
proxy server.
From a data analysis viewpoint the overall effect of caching and proxy servers is
to render Web server logs somewhat less than ideal in terms of being a complete and
faithful representation of individual page views.
In practice, in analyzing Web data, one can use various heuristics to try to infer the
true actions of the user given the relatively noisy and partial information available
in the Web server logs at the server. For example, one can try to combine referrer and
requested-page pairs with knowledge of the link structure of the website, to try
to infer the actual path taken by an individual through a website. For example, if a
user were to request the sequence page A, page B, page C, page B, page F, the Web
server log would likely only record the sequence ABCF, since page B would likely be
cached and not requested from the server the second time it was viewed. The logged
referrer page for F would be page B, and using the timing information for the requested
pages (say, for example that page F was requested relatively quickly after page C)
one could then infer that the actual sequence of requests was ABCBF. In simple cases
these heuristics can be quite useful (see, for example, Anderson et al. (2001) for a
definition of Web trails in this context). However, in the general case, one can rarely be
certain about what the user actually viewed. Statistical methods are likely to be quite
useful for handling such uncertainty. At the present time, however, relatively simple
heuristics, such as only using actual logged page requests in subsequent data analysis,
and not making any assumptions about missed requests, are common practice in Web
usage analysis.
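The path-completion heuristic described above can be sketched as follows; the example reconstructs the hypothetical A, B, C, B, F sequence by walking back through earlier pages whenever the referrer of a request is not the previously logged page. The data structures are illustrative assumptions.

def complete_path(logged, referrers):
    """Infer the pages actually viewed from logged requests and their referrers.

    logged:    list of requested pages in log order, e.g. ['A', 'B', 'C', 'F']
    referrers: dict mapping each requested page to its logged referrer (or None)
    """
    path = []
    for page in logged:
        ref = referrers.get(page)
        if path and ref is not None and ref != path[-1] and ref in path:
            # The referrer is not the last viewed page: assume the user pressed
            # 'back' (viewing cached copies) until the referrer was shown again.
            i = len(path) - 1 - path[::-1].index(ref)   # last occurrence of ref
            path.extend(reversed(path[i:len(path) - 1]))
        path.append(page)
    return path

# Logged sequence A, B, C, F with F's referrer being B suggests that the
# true sequence of page views was A, B, C, B, F.
print(complete_path(['A', 'B', 'C', 'F'], {'B': 'A', 'C': 'B', 'F': 'B'}))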
A cookie set during one visit is returned by the browser on later visits, even days later. Consequently, actions by the same user, such as page requests and
product purchases, during different sessions can be linked together. Even if a users IP
address is dynamically re-assigned between sessions, the cookie information remains
the same.
Thus, we can replace the IP address in the server log file with a more reliable
identifier as provided by the cookie file. Commercial websites use cookies extensively
and they can be quite effective in improving the quality of matching page requests to
individual users, or at least matching to specific login accounts on specific machines.
Although browser software allows individual users to disable the storing of cookies
on their machines, many individual users either choose not to do so voluntarily (or
are unaware of the fact that cookies can be disabled). Many commercial websites
require that cookies are enabled in order for a user to access their site. It is estimated
that in general well over 90% of Web surfers have the cookie feature permanently
enabled on their browsers, and cookies generally tend to be quite effective for this
reason. Of course, cookies in effect rely on the implicit cooperation of the user and,
thus, raise the issue of privacy. For this reason some non-commercial sites (such as
those at university campuses) often choose not to activate cookie creation on their
Web servers in the belief that cookies can be viewed as somewhat of an invasion of
an individual's right to privacy.
Another option for user identication is to require users to voluntarily identify
themselves via a login or registration process, e.g. for websites such as newspapers
that charge fees for accessing content. As with cookies, this can provide much higher
reliability in terms of being able to accurately associate particular requests with par-
ticular individuals, but again comes with the cost of imposing additional requirements
on the user, which in turn may discourage potential visitors to the site.
Table 7.1 Estimates of monthly average user activity as reported from client-side data
at www.netratings.com for September 2002.
Such revisitation rates are comparable, for example, to estimates of how many phone
numbers dialed are repeated (57%) and how many Unix command lines are
repetitions of earlier commands (75%).
Figure 7.3 The number of unique pages visited versus the total number of pages visited for each of
17 users in the Cockburn and McKenzie (2002) study, on both (a) standard and (b) log-log
scales.
Figure 7.4 Bar chart of the ratio of the number of page requests for the most frequent page
divided by the total number of page requests, for 17 users (sorted by ratio) in the Cockburn
and McKenzie (2002) study.
Cockburn and McKenzie also found several other interesting and rather pronounced
characteristics of client-side browsing. For example, the most frequently requested
page for each user (e.g. their home page) can account for a relatively large fraction of
all page requests for a given user. Figure 7.4 shows a bar chart of this fraction for each
of the 17 users, ranging from about 50% of all requests for one user down to about
5%. This data also indicates that, while there is often considerable regularity in Web-
browsing behavior for a particular individual, there can be considerable heterogeneity
in behavior across different individuals.
In addition to the studies mentioned above, other types of research studies have been
used to improve our understanding of client-side Web behavior. For example, Byrne et
al. (1999) analyzed video-taped recordings of eight different users as they used their
Web browsers over a period of 15 min to about 1 h. The resulting audio descriptions by
the users of what they were doing and the video recordings of their screens were then
analyzed and categorized into a taxonomy of tasks. Among the empirical observations
were the fact that users spent a considerable amount of time scrolling up and down
Web pages (40 min out of 5 h). Another observation was that a considerable amount
of time was spent by users waiting for pages to load, approximately 50 min in total,
or 15% of the total time. This figure is of course highly dependent on bandwidth and
other factors related to the network infrastructure between the user and the website
being visited.
Under a first-order Markov assumption, we assume that the probability of each state
s_t given the preceding states s_{t-1}, . . . , s_1 only depends on state s_{t-1}, i.e.

    P(s) = P(s_1) Π_{t=2}^{L} P(s_t | s_{t-1}).      (7.1)

This provides a simple generative model for producing sequential data, in effect a
stochastic finite-state machine: we select an initial state s_1 from a distribution P(s_1);
then, given s_1, we choose state s_2 from the distribution P(s_2 | s_1), and so on. Denoting
by T_ij = P(s_t = j | s_{t-1} = i) the probability of transitioning from state i to state j,
where Σ_{j=1}^{M} P(s_t = j | s_{t-1} = i) = 1, we define T as an M × M transition matrix
with entries T_ij, assuming a stationary Markov chain (see Appendix A).
This type of first-order Markov model makes a rather strong assumption about
the nature of the data-generating process, namely that the next state to be visited is
only a function of the current state and is independent of any previous states visited.
The virtue of such a model is that it provides a relatively simple way to capture
sequential dependence. Thus, although the true data-generating process for a particular
problem may not necessarily be rst-order Markov, such models can provide useful
and parsimonious frameworks for analyzing and learning about sequential processes
and have been applied to a very wide variety of problems ranging from modeling
of English text to reliability analysis of complex systems (see, for example, Ross
2002).
Returning to the issue of modeling how a user navigates a website, consider a
website with W Web pages, where we wish to describe the navigation patterns of
users at this website. Since W may often be quite large (e.g. on the order of 10^5 or
10^6 for a large computer science department at a university), it may not be practical
to represent each Web page with its own state, since a W-state model requires
O(W^2) transition probability parameters to specify the Markov transition matrix T.
To alleviate this problem we can cluster or group the original W Web pages into a
much smaller number M of clusters, each of which is assigned a state in the Markov
model. The clustering into M states could be carried out in several different ways,
e.g. by manual categorization into different categories based on content, by grouping
pages based on directory structure on the Web server, or by automatically clustering
Web pages using any of the clustering techniques described earlier in Chapters 1
and 4.
Transition probabilities P(s_t = j | s_{t-1} = i) in such a model represent the
probability that an individual user's next page request will be from category j, given
that their most recent page request was in category i. We can add a special end-state to
our model (call it E), which indicates the end of a session. For example, a commonly
used heuristic is to declare the end of a session once a certain time duration, such as
20 min, has elapsed since the last page request from that user.
As an example, if we have three categories of pages plus an end state E, the transition
matrix for such a model can be written as

    T = | P(1|1)  P(2|1)  P(3|1)  P(E|1) |
        | P(1|2)  P(2|2)  P(3|2)  P(E|2) |      (7.2)
        | P(1|3)  P(2|3)  P(3|3)  P(E|3) |
        | P(1|E)  P(2|E)  P(3|E)    0    |
This model can simulate a set of finite-length sequences, where the boundaries between
sequences are denoted by the invisible symbol E. The state E has a self-transition
probability of zero, to denote that after we get a single occurrence of E the system
restarts and begins generating a new sequence, with probabilities P(1 | E), P(2 | E),
and P(3 | E) of starting in each of the three states. We could, if we wish, also define
a fixed starting state S of a sequence, which is then followed by one of the other
states; however, in effect we only need the end-state E here, since E is sufficient to
denote the boundaries between the end of one sequence and the start of another.
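As an illustration of the two ingredients just described, the sketch below first segments a user's time-stamped page requests into sessions using the 20-minute inactivity heuristic and then simulates category sequences from a transition matrix of the form (7.2). The category labels and the example matrix entries are made up for illustration.

import numpy as np

def sessionize(requests, timeout=20 * 60):
    """Split (timestamp, category) pairs into sessions after 'timeout' seconds of inactivity."""
    sessions, current = [], []
    last_t = None
    for t, cat in sorted(requests):
        if last_t is not None and t - last_t > timeout:
            sessions.append(current)
            current = []
        current.append(cat)
        last_t = t
    if current:
        sessions.append(current)
    return sessions

# States 0, 1, 2 are page categories; state 3 is the end state E.
T = np.array([[0.60, 0.20, 0.10, 0.10],
              [0.10, 0.50, 0.20, 0.20],
              [0.30, 0.10, 0.30, 0.30],
              [0.40, 0.35, 0.25, 0.00]])   # row E: starting distribution, no self-loop

rng = np.random.default_rng(0)

def simulate_session(T, end_state=3):
    """Generate one session: start from the end state's row and stop on returning to E."""
    state = rng.choice(len(T), p=T[end_state])
    session = [state]
    while True:
        state = rng.choice(len(T), p=T[state])
        if state == end_state:
            return session
        session.append(state)

print(simulate_session(T))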
Markov models have their limitations. The assumption of the first-order Markov
model that we can predict what comes next given only the current state does not take
into account long-term memory aspects of Web browsing. For example, users can
use the back button to return to the page from which they just came, imparting at
least second-order memory to the process. For example, see Fagin et al. (2000) for
an interesting proposal for a stochastic model that uses back buttons.
We can try to capture more memory in the process by using a kth-order Markov
chain, defined in general by the assumption that the next state depends on the previous
k states, P(s_t | s_{t-1}, . . . , s_1) = P(s_t | s_{t-1}, . . . , s_{t-k}).

Given a data set D of N observed session sequences s_1, . . . , s_N and model parameters θ,
the likelihood function is

    L(θ) = P(D | θ) = Π_{i=1}^{N} P(s_i | θ).      (7.4)
We assume on the right-hand side that page requests from different sessions are
conditionally independent of each other given the model (a reasonable assumption in
general, particularly if the sessions are being generated by different individuals). The
statistical principle of maximum likelihood can now be used to find the parameters θ_ML
that maximize the likelihood function L(θ), i.e. the parameter values that maximize
the probability of the observed data D conditioned on the model (see Chapter 1 for a
general discussion of maximum likelihood).
To find the ML parameters requires that we determine the maximum of L(θ) as
a function of θ. Under our Markov model, it is straightforward to show that the
likelihood above can be rewritten in the form

    L(θ) = Π_{i,j} θ_ij^{n_ij},    1 ≤ i, j ≤ M,      (7.5)

where n_ij is the number of times that we see a transition from state i to state j in the
observed data D. Note that the initial starting states for each sequence are counted as
transitions out of state E in our model.
We use the log-likelihood l(θ) = log L(θ) for convenience:

    l(θ) = Σ_{ij} n_ij log θ_ij.
Maximizing l(θ) subject to the constraints Σ_j θ_ij = 1 yields the ML estimates
θ_ij^ML = n_ij / n_i, where n_i = Σ_j n_ij is the total number of transitions out of state i.
Although Web server logs may be quite large and, thus, D may contain potentially millions or even
hundreds of millions of sequences, there is still a good chance that some of the
observed transition counts n_ij may be zero in our data D, i.e. we will never have
observed such a transition. For example, if we have a new e-commerce site we might
wish to fit a Markov model to the data for prediction purposes but in the first few
weeks of operation of the site the amount of data available for estimating the model
might be quite limited.
As discussed in Chapter 1, a useful approach to get around this problem of limited
data is to incorporate prior knowledge into the estimation process. Specifically,
we can specify a prior probability distribution P(θ) on our set of parameters and
then maximize P(θ | D), the posterior distribution on θ given the data, rather than
P(D | θ).
The prior distribution P(θ) is supposed to reflect our prior belief about θ; it is
a subjective probability. The posterior P(θ | D) reflects our posterior belief in the
location of θ, but now informed by the data D. For the particular case of Markov
transition matrices, it is common to put a distribution on each row of the transition
matrix and to further assume that each of these priors is independent,

    P(θ) = Π_i P({θ_i1, . . . , θ_iM}),

where Σ_j θ_ij = 1. Consider the set of parameters {θ_i1, . . . , θ_iM} for the ith row in the
transition matrix T. A useful general form for a prior distribution on these parameters
is the Dirichlet distribution, defined in Appendix A as

    P({θ_i1, . . . , θ_iM}) = D_{αq_i} = C Π_{j=1}^{M} θ_ij^{αq_ij − 1},      (7.6)

where α, q_ij > 0, Σ_j q_ij = 1, and C is a normalizing constant to ensure that the
distribution integrates to unity (see Appendix A and Chapter 1 for details). The MP
posterior parameter estimates are

    θ_ij^MP = (n_ij + αq_ij) / (n_i + α).      (7.7)

If there is a zero count n_ij = 0 for some particular transition (i, j), then rather than
having a parameter estimate of 0 for this transition (which is what ML would yield),
the MP estimate will be αq_ij/(n_i + α), allowing prior knowledge to be incorporated.
If n_ij > 0, then we get a smooth combination of the data-driven information (n_ij)
and the prior information represented by α and q_ij.
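A minimal sketch of these two estimators, computed from a toy set of sessionized category sequences, might look as follows; the session data, the state labels, and the uniform choice of q_ij are illustrative assumptions.

import numpy as np

M = 4          # three page categories plus the end state E (index 3)
E = 3
alpha = 2.0    # prior weight
q = np.full((M, M), 1.0 / M)   # uniform prior proportions q_ij (an assumption)

# Toy sessions over categories 0, 1, 2; E marks session boundaries.
sessions = [[0, 1, 1, 2], [0, 0, 2], [1, 2, 2, 0]]

# Count transitions, treating each session start as a transition out of E
# and each session end as a transition into E.
n = np.zeros((M, M))
for s in sessions:
    states = [E] + s + [E]
    for i, j in zip(states[:-1], states[1:]):
        n[i, j] += 1

n_i = n.sum(axis=1, keepdims=True)
theta_ml = n / n_i                              # zero for unseen transitions
theta_mp = (n + alpha * q) / (n_i + alpha)      # smoothed by the Dirichlet prior
# (In the model of (7.2) the E-to-E entry would be fixed at zero rather than smoothed.)

print(np.round(theta_ml, 2))
print(np.round(theta_mp, 2))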
For estimating transitions among Web pages the following is a relatively easy way
to set the relative sizes of the priors.
(1) Let α be the weight of the prior. One intuitive way to think of this is as
the effective sample size required before the counts for a particular state n_i
balance the prior component in the denominator in Equation (7.7).

(2) Given a value for α, partition the states into two sets: set 1 contains all states
that are directly linked to state i via a hyperlink, and set 2 contains all states
that have no direct link from state i.

(3) Assign a total prior probability mass of ε, 0 ≤ ε ≤ 1, on transitions out of any
state i into the states in set 2. We could further assume that transitions to all
states in set 2 are equally likely a priori, so that q_ij = ε/K for states j in set 2
(those not linked by a direct hyperlink to any page in state i), where K is the
number of states in set 2. This assumes that
we have no specific information about which states are more likely in set 2.
The total probability ε reflects our prior belief that a site visitor will transition
to a Web page by a non-hyperlink mechanism such as via a bookmark, by use
of the history information in a browser, or by directly typing in the URL of the
page. Observational data (e.g. Tauscher and Greenberg 1997) suggest that ε is
typically quite small, i.e. users tend to follow direct hyperlinks when surfing.

(4) For the remaining transitions, those to states in set 1 (for which hyperlinks
exist from state i), we have a total probability mass of 1 − ε to assign. We
can either assign this uniformly, or we could weight it nonuniformly in various
ways, such as (for example) weighting by the number of hyperlinks that exist
between all of the Web pages in state i and those in set 1 (if the states represent
sets of Web pages).
(5) The prior probabilities both into and out of the end-state E can be set based on
our prior knowledge of how likely we think a user is to exit the site from any
particular state, and our prior belief over which states users are likely to use
when entering the site.
This assignment of a prior on transition probabilities is intended as a general guide of
how Bayesian ideas can be utilized in building models for Web traffic, rather than a
definitive prescription. For example, the model would need to be extended to account
for other frequently used navigational mechanisms such as the back button. Nor does
it account for dynamically generated Web pages (such as those at e-commerce sites),
where the hyperlinks on a page vary over time dynamically in a non-predictable
fashion.
The general idea of using a small probability for nonlinked Web pages is quite
reminiscent of the methodology used to derive the PageRank algorithm earlier in
Chapter 5 (we might call this the 'Google prior'). Sen and Hansen (2003) describe
the use of Dirichlet priors for the problem of estimating Markov model parameters,
with applications to characterizing and predicting Web navigation behavior.
Being able to make such predictions provides a basis for numerous different applications,
such as pre-fetching and caching of Web pages and personalizing the content of a
Web page by dynamically adding hyperlink short-cuts to other pages.
As mentioned earlier, for a typical website the number of pages on the site can be
quite large. For example, for the UC Irvine website (Figure 7.1) it is estimated that
there are on the order of 50 000 different pages on the site. Thus, it is typical to group
pages into M categories and build predictive models for these groups with an M-state
Markov chain. The groups can be dened, for example, in terms of page functionality
(Li et al. 2002), page contents (Cadez et al. 2003), or the directory structure of the
site (Anderson et al. 2001).
Simple first-order Markov models have generally been found to be inferior to
other types of Markov models in terms of predictive performance on test data sets
(Anderson et al. 2001; Cadez et al. 2003; Deshpande and Karypis 2001; Sen and
Hansen 2003). The most obvious generalization of a first-order Markov model is the
kth-order Markov model. A kth-order model is defined so that the prediction of state
s_t depends on the previous k states in the model:

    P(s_t | s_{t-1}, . . . , s_1) = P(s_t | s_{t-1}, . . . , s_{t-k}),    k ∈ {1, 2, 3, . . .}.      (7.8)

This model requires the specification of O(M^{k+1}) transition probabilities, one for
each of the possible combinations of M^k histories multiplied by M possible current
states. An obvious problem with this model is that the number of parameters in the
model increases combinatorially as a function of both k and M. Thus, for example,
if we use simple frequency-based (maximum-likelihood) estimates of the transition
probabilities we face the problem of having many subsequences of length k + 1 that
will not have been seen in the historical data. For example, if M = 20 and k = 3,
then we need to have at least M^{k+1} = 20^4 = 160 000 symbols in the training data set.
Even this number is optimistically low, since some subsequences will be more frequent
than others and in addition we will need to see multiple instances of each subsequence
to get reliable estimates of the corresponding transition probabilities.
There are a number of ways to get around this problem. For example, Deshpande
and Karypis (2001) describe a number of different schemes for pruning the state
space of a set of kth-order Markov models such that histories with little predictive
power are pruned from the model. Empirical results with page-request data from two
different e-commerce sites showed that these pruning techniques provided systematic
but modest improvements in predictive accuracy compared to using the full model.
Techniques from language modeling are also worth mentioning in this context, since
language modeling (e.g. at the word level) can involve both very large alphabets
and dependencies beyond near-neighbors in the data. Of particular relevance here
are empirical smoothing techniques that combine different models from order one to
order k, with weighting coefficients among the k models that are (generally speaking)
estimated to maximize the predictive accuracy of the combined model. See Chen
and Goodman (1996) for an extensive discussion and empirical evaluation of various
techniques and MacKay and Peto (1995a) for a description of a general Bayesian
framework for this problem.
Cadez et al. (2003) and Sen and Hansen (2003) propose mixtures of Markov chains,
where the first-order Markov chain of Equation (7.11) is replaced by a mixture with K
components:

    P(s_{t+1} | s_[1,t]) = Σ_{k=1}^{K} P(s_{t+1}, c = k | s_[1,t])
                        = Σ_{k=1}^{K} P(s_{t+1} | s_[1,t], c = k) P(c = k | s_[1,t])
                        = Σ_{k=1}^{K} P(s_{t+1} | s_t, c = k) P(c = k | s_[1,t]).      (7.12)
The last line follows from the fact that, conditioned on component c = k, the next
state s_{t+1} only depends on s_t. Thus, the Markov mixture model defines the probability
of the next symbol as a weighted convex combination of the transition probabilities
P(s_{t+1} | s_t, c = k) from each of the component models. The weights are the membership
probabilities P(c = k | s_[1,t]) based on the observed history s_[1,t]:

    P(c = k | s_[1,t]) = P(s_[1,t] | c = k) P(c = k) / Σ_j P(s_[1,t] | c = j) P(c = j),    1 ≤ k ≤ K,      (7.13)

where

    P(s_[1,t] | c = k) = P(s_1 | c = k) Π_{t'=2}^{t} P(s_{t'} | s_{t'-1}, c = k).      (7.14)
Intuitively, these membership weights evolve as we see more data from the user,
i.e. as t increases. In fact, it is easy to show that, as t → ∞, if the model is
correct, then one of the weights will converge to one and the others to zero. In other
words, if a data sequence becomes long enough it will always be possible to perfectly
identify which of the K components is generating the sequence. In practice, sequences
representing page requests are often quite short. Furthermore, it is not realistic to
assume that the observed data are really being generated by a mixture of K first-order
Markov chains; the mixture model is a useful approximation to what is likely to
be a more complex data-generating process. Thus, in practice, several of the weights
P(c = k | s_[1,t]) will be nonzero and the corresponding models will all participate
in making predictions for s_{t+1} via Equation (7.12).
Comparing Equations (7.11) and (7.12) is informative: it is clear from the functional
form of the equations that the mixture model provides a richer representation
for sequence modeling than any single first-order Markov model. The mixture can in
effect represent higher-order information for predictive purposes because the weights
encode (in a constrained fashion) information from symbols that precede the current
symbol.
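The following sketch implements the one-step-ahead prediction of Equations (7.12)-(7.14) for a toy mixture; the component transition matrices, initial distributions, and mixing weights are invented for illustration.

import numpy as np

# Toy mixture of K = 2 first-order Markov chains over M = 3 states.
T = np.array([[[0.8, 0.1, 0.1],       # component 1 transition matrix
               [0.1, 0.8, 0.1],
               [0.1, 0.1, 0.8]],
              [[0.2, 0.4, 0.4],       # component 2 transition matrix
               [0.4, 0.2, 0.4],
               [0.4, 0.4, 0.2]]])
init = np.array([[0.6, 0.2, 0.2],     # P(s_1 | c = k) for each component
                 [0.2, 0.2, 0.6]])
prior = np.array([0.5, 0.5])          # P(c = k)

def predict_next(history):
    """Return P(s_{t+1} | s_[1,t]) under the mixture, Equations (7.12)-(7.14)."""
    # Per-component likelihood of the observed history, Equation (7.14).
    lik = init[:, history[0]].copy()
    for prev, cur in zip(history[:-1], history[1:]):
        lik *= T[:, prev, cur]
    # Membership probabilities P(c = k | s_[1,t]), Equation (7.13).
    weights = prior * lik
    weights /= weights.sum()
    # Convex combination of the component transition rows, Equation (7.12).
    return weights @ T[:, history[-1], :]

print(predict_next([0, 0, 0]))   # a history dominated by component 1's sticky behavior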
A value for K in this type of mixture model can be chosen by evaluating the out-
of-sample predictive performance of a number of models on unseen test data. For
example, the one-step ahead prediction accuracy could be evaluated on a validation
data set for different values of K and the most accurate model chosen.
Alternatively, the log probability score, Σ_{t=1}^{L-1} log P(s_{t+1} | s_[1,t]), provides a somewhat
different scoring metric, where a model is rewarded for assigning high predictive
probability to the state that occurs next and, conversely, is penalized for assigning
probability to the other states. This is suggestive of a 'prediction game', where predictive
models compete against each other by spreading a probability distribution
over the M possible values of the unseen state s_{t+1}. If the model is too confident in
its predictions and assigns a probability near unity to one of the states, then it must
assign a very low probability to other states, and may be penalized substantially (via
log P ) if one of these other states actually occurs at t + 1. On the other hand, if the
model is very conservative and always hedges its bets by assigning probability 1/M
to each of the states, then it will never gain a high log P score. Good models provide
a trade-off between these two extremes.
If we take the negative of the log probability score and normalize by the number of predictions made, -\frac{1}{L-1}\sum_{t=1}^{L-1} \log P(s_{t+1} | s_{[1,t]}), we get a form of entropy that is bounded below by zero. The model that always makes perfect predictions gets a score of \log P = \log 1 = 0 for each prediction and a predictive entropy of zero, while the model that always assigns probability 1/M to each possible outcome will have a predictive entropy of \log M. Cadez et al. (2003) describe the evaluation of predictive entropy for a variety of Markov and non-Markov models using a test data
set of roughly 100 000 page request sequences logged at www.msnbc.com with
M = 17 page categories. The entropy ranged from about four bits (using log to the
base two) for a single histogram model down to about 2.2 bits for the best-performing
mixture of Markov chains with about K = 60 components.
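A sketch of the predictive-entropy evaluation just described: given any model's one-step-ahead predictive distributions, it averages the negative log probability (base 2, so the result is in bits) of the states that actually occurred. The function name and the uniform baseline are ours.

```python
import numpy as np

def predictive_entropy(sequences, predict_fn, base=2.0):
    """Average negative log probability of the observed next states, i.e. the
    predictive entropy described in the text (in bits when base = 2)."""
    total, count = 0.0, 0
    for seq in sequences:
        for t in range(1, len(seq)):
            p = predict_fn(seq[:t])      # predicted distribution over the M states
            total += -np.log(p[seq[t]]) / np.log(base)
            count += 1
    return total / count

# A model that always predicts the uniform distribution over M states attains
# the upper bound of log2(M) bits (about 4.09 bits for M = 17).
M = 17
uniform = lambda history: np.full(M, 1.0 / M)
test_seqs = [[0, 3, 3, 5], [2, 2, 1]]
print(predictive_entropy(test_seqs, uniform))
```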
A number of other variations of Markov models have been proposed for page-
request predictions, including variable-order Markov models (Sen and Hansen 2003)
and position-dependent Markov models where the transition probabilities depend on
the position of the symbol in the sequence (Anderson et al. 2001). This latter model
can be viewed as a nonstationary Markov chain. The position-dependent model makes
sense for websites that have a pronounced hierarchical structure, where (for example)
a user descends through different levels of a tree-structured directory of Web pages.
Several other variations of Markov models for prediction are explored in Zukerman
et al. (1999), Anderson et al. (2001, 2002), and Sen and Hansen (2003).
since, after one enters state i, the probability of exiting to another state after exactly r subsequent state transitions requires r − 1 self-transitions with probability T_{ii} followed by a transition out of state i (with probability 1 − T_{ii}). The mean of this geometric distribution is 1/(1 − T_{ii}), so that the mean runlength is inversely proportional to the exit probability.
This geometric form alerts us to a potential limitation of first-order Markov models. In practice the geometric model for run-lengths might be somewhat restrictive, particularly for certain types of states and websites. For example, for a news website or an e-commerce website, we might find that observed runlengths in a state such as weather or business news have a mode at r = 2 or more page requests (since users may be more likely to request multiple pages rather than just a single page) rather than the monotonically decreasing shape of a geometric distribution with a mode at r = 1.
We can relax the geometric distribution on runlengths implied by the Markov
model by using a semi-Markov model. A semi-Markov model operates in a generative
manner as follows: after we enter state i we draw a runlength r from a state-dependent
distribution Pi (r), and then after r time-steps stochastically transition to a different
state according to a set of transition probabilities from state i to the other states. (The
more usual variant of semi-Markov models uses continuous time, where on entering a state i a time duration τ > 0 is drawn from a distribution on time for that state.)
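A minimal generative sketch of the discrete-time semi-Markov model just described. The runlength distributions, the transition matrix with zero self-transitions, and all parameter values are illustrative assumptions of ours.

```python
import numpy as np

def simulate_semi_markov(n_steps, start, runlength_fns, trans_no_self, rng):
    """On entering state i, draw a runlength r from runlength_fns[i], emit the
    state r times, then jump to a different state using trans_no_self[i]
    (a distribution over states with zero probability of a self-transition)."""
    states, s = [], start
    while len(states) < n_steps:
        r = runlength_fns[s](rng)
        states.extend([s] * r)
        s = rng.choice(len(trans_no_self), p=trans_no_self[s])
    return states[:n_steps]

rng = np.random.default_rng(1)
# shifted-Poisson runlengths (r = 1 + Poisson), one rate per state; see Equation (7.15)
runlength_fns = [lambda g, lam=lam: 1 + g.poisson(lam) for lam in (0.5, 2.0, 5.0)]
trans_no_self = np.array([[0.0, 0.7, 0.3],
                          [0.5, 0.0, 0.5],
                          [0.6, 0.4, 0.0]])
print(simulate_semi_markov(30, start=0, runlength_fns=runlength_fns,
                           trans_no_self=trans_no_self, rng=rng))
```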
Figure 7.5 Runlength distributions P(r), for r = 1, . . . , 12, under the modified Poisson model with parameter values λ = 0.5, 1, 2, and 5.
For example, we could have P_i(r) be a modified Poisson distributed random variable,

P_i(r) = e^{-λ} \frac{λ^{r-1}}{(r-1)!},    (7.15)

where r is a positive integer, λ > 0 is a parameter of the Poisson model, and the distribution has mean λ + 1 and variance λ. The standard Poisson distribution (see Appendix A) has r = 0, 1, 2, . . ., whereas here we use a slightly modified definition that has r starting at one, so that the mean is also shifted by one (in the standard model the mean and the variance are both λ (see Appendix A)).
The most likely runlength (the mode) is the largest integer k such that k ≤ λ + 1. Thus, for example, if we choose λ > 1 for this model, the most likely runlength will be two or greater, providing a qualitatively different shape for our runlength distribution compared to the geometric model, which has its mode at one (see Figure 7.5 for examples). For example, a Poisson distribution with λ = 0.5 might be appropriate for websites where visits tend to be relatively short and the most likely number of page requests is one, while a Poisson distribution with λ = 5 might be a better model for websites where most visitors request multiple pages.
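The modified Poisson runlength probabilities of Equation (7.15) can be computed directly; the short sketch below also contrasts them with the geometric runlength model, whose mode is always at r = 1. The λ values used here are arbitrary illustrative choices (non-integer, to avoid ties in the mode).

```python
import math

def modified_poisson_pmf(r, lam):
    """P(r) = exp(-lam) * lam**(r - 1) / (r - 1)!  for r = 1, 2, ...  (Equation (7.15))."""
    return math.exp(-lam) * lam ** (r - 1) / math.factorial(r - 1)

def geometric_pmf(r, p_exit):
    """Runlength distribution of a first-order Markov chain with exit probability p_exit."""
    return (1 - p_exit) ** (r - 1) * p_exit

for lam in (0.5, 2.5, 5.5):
    probs = [modified_poisson_pmf(r, lam) for r in range(1, 13)]
    mode = 1 + probs.index(max(probs))
    print(f"lambda = {lam}: mode at r = {mode}")   # largest integer <= lambda + 1
print([round(geometric_pmf(r, 0.4), 3) for r in range(1, 6)])  # monotone, mode at r = 1
```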
Figure 7.6 Histogram on a log-scale of session length for 200 000 different sessions at the
www.ics.uci.edu website in early 2002. Session boundaries were determined by 20 min
or longer gaps between page requests, and site visitors that were clearly identifiable as robots
were removed.
Figure 7.7 Log-likelihood, under the power-law model, as a function of the parameter θ (for values of θ between 1.9 and 2.1), for the session-length data of Figure 7.6.
given an observed set of session lengths {L_1, . . . , L_N} that are assumed to be conditionally independent given the model, where P(L_i | θ) is defined as in Equation (7.16). Differentiating this expression (or the log-likelihood) with respect to θ, and setting it to zero to find the maximum, does not yield a closed-form solution for the ML value of θ, since the normalizing constant C = 1/\sum_L L^{-θ} cannot be reduced to a closed-form function of θ.
The solution can, however, be obtained by standard numerical methods or, since we only have a single parameter to fit, we can just visually examine the log-likelihood as a function of θ and read off an approximate value for θ_{ML}. Figure 7.7 shows the log-likelihood function, under the power-law distribution model, as a function of θ, for the empirical data shown in Figure 7.6, where we see that the maximum value occurs at θ ≈ 2 (θ = 1.98 to be precise). Under the power-law model, maximum likelihood suggests that P(L) ∝ L^{-1.98}. Figure 7.8 shows this model superposed on the empirical data.
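A minimal sketch of the numerical fit described above, assuming a discrete, truncated power-law form P(L) = C·L^{-θ} with C = 1/\sum_L L^{-θ} over the observed range (the form suggested by the normalizing constant discussed above). The data here are synthetic; the grid search simply mimics reading the maximum off a plot such as Figure 7.7.

```python
import numpy as np

def powerlaw_loglik(theta, lengths, L_max):
    """Log-likelihood of session lengths under P(L) = C * L**(-theta), with
    C = 1 / sum_{L=1}^{L_max} L**(-theta) (discrete, truncated support)."""
    support = np.arange(1, L_max + 1)
    log_C = -np.log(np.sum(support ** (-theta)))
    return np.sum(log_C - theta * np.log(lengths))

rng = np.random.default_rng(0)
L_max = 500
support = np.arange(1, L_max + 1)
true_p = support ** -2.0
true_p /= true_p.sum()
lengths = rng.choice(support, size=5000, p=true_p)    # synthetic session lengths

thetas = np.linspace(1.5, 2.5, 101)
lls = [powerlaw_loglik(t, lengths, L_max) for t in thetas]
print("theta_ML approximately", thetas[int(np.argmax(lls))])
```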
Note that an alternative technique for fitting the parameter of a power-law distribution to empirical data is to use linear regression directly, i.e. to estimate the best-fitting slope for the line in Figure 7.6 with a least-squares penalty function. While this may seem like a reasonable approach, it can provide somewhat different results than maximum likelihood, since it in effect places equal weight on the error in predicting the count for each value of L, even though (for example) small values of L with large counts may account for a much larger fraction of the data than large values of L with very small counts. Maximum likelihood on the other hand seeks parameter values that maximize the probability of the data as a whole, rather than equally weighting the fit for different values of L. In many respects this is a more suitable criterion than least squares for this problem.
Figure 7.8 Comparing the fitted power-law model (dotted line), using an ML parameter estimate of θ = 1.98, with the observed empirical session lengths (dots), on both (a) standard scales and (b) log-log scales.
For the inverse Gaussian model, the ML estimates of the two parameters are

\hat{μ}_{ML} = \frac{1}{n}\sum_{i=1}^{n} L_i,    (7.18)

i.e. the ML estimate of the mean under this model is the average session length, and

\frac{1}{\hat{λ}_{ML}} = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{1}{L_i} − \frac{1}{\hat{μ}_{ML}}\Big).    (7.19)
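Equations (7.18) and (7.19) translate directly into code; the function below returns the two ML estimates given a list of session lengths (variable names are ours).

```python
import numpy as np

def inverse_gaussian_ml(lengths):
    """ML estimates for the inverse Gaussian model, Equations (7.18) and (7.19):
    the mean estimate is the sample average, and 1/lambda_hat is the average of
    (1/L_i - 1/mu_hat)."""
    L = np.asarray(lengths, dtype=float)
    mu_hat = L.mean()                                   # Equation (7.18)
    lam_hat = 1.0 / np.mean(1.0 / L - 1.0 / mu_hat)     # Equation (7.19)
    return mu_hat, lam_hat

print(inverse_gaussian_ml([1, 2, 2, 3, 5, 8, 21]))
```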
In Figure 7.9 we show the fit of the inverse Gaussian distribution and three of the other models we have discussed earlier to the session-length data from Figure 7.6. From this plot we see that the power-law fit matches the general shape of the empirical data distribution more closely than any of the other models.
Figure 7.9 Comparing a variety of simple parametric session-length models with the observed empirical session lengths (dots), on a log-log scale, for the session-length data in Figure 7.6.
The Poisson model has the worst fit, with a mode at about r = 3. The geometric and inverse-Gaussian models both fail to capture the tail properties of the distribution, underestimating the probabilities of long session lengths. The power-law model is far from a perfect fit, but nonetheless it does capture the general characteristics of the session-length distribution with only a single parameter. Sen and Hansen (2002) discuss other approaches to modeling session lengths, such as higher-order Markov models.
(b) various demographic characteristics of users such as age, education level; and
Bucklin and Sismeiro (2003) propose a covariate model of this form for site nav-
igation behavior. As with the Huberman et al. model, they also consider a decision-
theoretic framework for modeling the utility of continuing to a new page. However, in
the Bucklin and Sismeiro model, this utility is modeled as a stochastic function of var-
ious covariates rather than just being a stochastic function of the utility of current and
previous pages (the Huberman et al. approach). The covariates in the model include
the number of previous site visits, the visit depth (number of pages viewed on the site
during this session up to this point), as well as various system variables that take into
account the complexity of a page, how long it takes to download, whether it contains
dynamic content, and so forth. In the proposed model, both the utility of requesting an
additional page and the logarithm of the page-view duration (conditioned on being at
a particular page in a particular session), are modeled as linear functions of individual
and page-specific covariates, plus an additive random noise term.
A useful general concept in the work of Bucklin and Sismeiro is heterogeneity of
behavior across users. Rather than estimating a single set of model parameters that
describe all site-users, a more general approach is to allow each user to have their own
parameters. If we had a large amount of site-visit data for each individual user, then
in principle we could estimate the parameters for each user independently using their
own data. However, in practice we are usually in the opposite situation in terms of
browsing behavior, i.e. we typically have very few data for many of the site visitors. In
this situation, estimating model parameters independently for each user (using only
their own data) would result in highly noisy parameter estimates for each individual.
A general statistical approach to this problem is to couple together the parameter
estimates in a stochastic manner, based on the simple observation that the behavior of
different individuals will be dependent although different. In classical statistics this is
known as a random effects model: individual behavior is modeled as a combination
of a common systematic effect with an individual-specic random effect.
In Bayesian statistics a somewhat more general approach is taken by specifying
what is called a hierarchical model. In the hierarchical model we can imagine for
simplicity that there are just two levels to the hierarchy: at the upper level we have
a distribution that characterizes our belief in how the parameters of the model are
distributed across the population as a whole; at the lower level we have a set of n
distributions describing our belief in the distributions of parameters for each of n
individuals. Both the population distribution and the n individual distributions can be
learned from the observed data in a Bayesian manner. The upper level distribution
in effect acts as a population prior (learned from the data) that constrains the infer-
ences we make on specic individuals. This population distribution typically acts to
constrain the parameter estimates that we would get from looking at individual is
data alone, so that they are closer to those of the population as a whole. A proper
description of Bayesian hierarchical models is beyond the scope of this text. A good
introductory treatment can be found for example in Gelman et al. (1995).
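Schematically, the two-level hierarchy described above can be written as follows (the symbols φ for the population-level parameters, θ_i for individual i's parameters, and D_i for individual i's data are generic placeholders of ours, not notation from the text).

```latex
\begin{align*}
\phi &\sim P(\phi)                                && \text{population-level distribution}\\
\theta_i \mid \phi &\sim P(\theta_i \mid \phi),   && i = 1, \dots, n \quad \text{(individual-level parameters)}\\
D_i \mid \theta_i &\sim P(D_i \mid \theta_i)      && \text{observed data for individual } i.
\end{align*}
```

Inference about θ_i then combines individual i's own data with what the data from the other individuals imply about φ, which is precisely the shrinkage toward the population described above.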
For Web data analysis the use of hierarchical models (or random effects) is quite
natural given that Web browsing behavior from populations of users can have both
common patterns as well as significant individual differences. Bucklin and Sismeiro
Lau and Horvitz (1999) analyze a sample of 4690 queries from a set of approx-
imately one million queries collected by the Excite search engine in September
1997.
Silverstein et al. (1998) analyze approximately one billion entries in the query
log for the AltaVista search engine collected over six weeks in August and
September 1998.
Spink et al. (2002) describe a series of studies over multiple years involving
the Excite search engine, with query logs collected at different time-points in
September 1997, December 1999, and May 2001. The query logs analyzed at
each time-point typically each contain about one million queries.
Xie and O'Hallaron (2002) analyze 110 000 queries from the Vivisimo search
query logs collected over a 35-day period in January and February of 2001, and
1.9 million queries collected over 8 h in a single day in December 1999 from
the Excite query logs.
We will discuss below the main results from these four studies. Other empirical studies have also been reported; generally speaking, the inferred characteristics of query patterns across all of these studies are in broad agreement.
Figure 7.10 Comparison of empirical query length distributions (bars) and fitted Poisson models (dots and lines) for both Vivisimo query data (top) and Excite query data (bottom). The data were provided by Yinglian Xie and David O'Hallaron.
only a single query in a particular session, and Lau and Horvitz (1999) found
similar results with only about 25% of the users in their study modifying queries.
Spink et al. (2002) report that the number of users who viewed only a single page
of Excite results (the top 10 websites as ranked by the search engine) increased
from 29% in 1997 to 51% in 2001. In the AltaVista study of Silverstein et al.
(1998) fully 85% of users were found to view only the first page of search results.
The difference in numbers between the two studies could reflect two different types of user populations (Excite users versus AltaVista users). It could also reflect a difference in methodology in terms of how the number of pages viewed was defined and measured in each study.
There appears to have been a shift in the distribution of query topics over time. Both Lau and Horvitz (1999) and Spink et al. (2002) manually classified samples of several thousand queries into a predefined set of 15 information goals and 11 general categories, respectively. Both studies found that about one in six queries in the 1997 samples concerned adult content. In contrast, in the 2001 sample in the Spink study this figure had dropped to 1 in 12. The top two topics, 'commerce, travel, employment, or economy' and 'people, places or things', in the 2001 sample accounted for 45% of all queries. This is a considerable increase from the 20% share for the same two query topics in 1997, reflecting perhaps the increased use of the Web for business and commercial purposes. Conversely the top-ranked topic in 1997 with 20% of the queries, 'entertainment or recreation', had dropped to a 7% share by 2001. It should be noted that these topic-related figures are based on a particular sample of users (about 2500 per sample) at a particular search engine (Excite), and relied on a subjective process of manual assignment of query terms to categories.
Despite differences in sample populations, search engines, and dates of the studies,
these four studies produced a generally consistent set of findings about user behavior
in a search-engine context. For example, it is clear from the data that most users view
relatively few pages per query and most users do not use advanced search features.
Figure 7.11 Frequency of different queries issued as a function of rank, for both Vivisimo query data (top) and Excite query data (bottom). The data were provided by Yinglian Xie and David O'Hallaron.
data analysis. Furthermore, although the two curves are from two different search engines, they are quite similar to each other. This again suggests that there are strong regularities in terms of patterns of behavior in how we search the Web, and these patterns appear to be relatively independent of the particular search engine being used.
value pair, given the variable-value combinations of all of its parents, were estimated
by frequency counts (maximum likelihood) directly from the sample of 4690 queries in
the study. This particular model imposes a specific (and somewhat natural) ordering on the variables (e.g. informational goals precedes all others) but otherwise does not make any conditional independence assumptions. More generally, if there were more variables in a model like this, a sparser graph (reflecting various conditional independence assumptions) would likely be used.
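A small sketch of frequency-count (maximum likelihood) estimation of one conditional probability table of such a network. The record format, variable names, and the toy log entries are all invented for illustration; they are not the Lau and Horvitz data.

```python
from collections import Counter, defaultdict

def estimate_cpt(records, child, parents):
    """ML estimate of P(child | parents) by frequency counts over records,
    where each record is a dict mapping variable names to observed values."""
    joint, parent_counts = Counter(), Counter()
    for r in records:
        key = tuple(r[p] for p in parents)
        joint[(key, r[child])] += 1
        parent_counts[key] += 1
    cpt = defaultdict(dict)
    for (key, value), n in joint.items():
        cpt[key][value] = n / parent_counts[key]
    return dict(cpt)

log = [  # hypothetical hand-labelled query-log records
    {"action": "specialization", "interval": "10-20s", "goal": "entertainment"},
    {"action": "specialization", "interval": "10-20s", "goal": "entertainment"},
    {"action": "specialization", "interval": "10-20s", "goal": "shopping"},
    {"action": "new_query", "interval": ">60s", "goal": "research"},
]
print(estimate_cpt(log, child="goal", parents=("action", "interval")))
```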
Although quite a simple model, Lau and Horvitz found that this network was able to produce interesting and potentially useful predictions. For example, according to the model, if a query specialization is followed by another action 10 to 20 s later, then the user is most likely to be searching for entertainment-related information. The conditional probability for entertainment given the user action and time-interval information is significantly higher than the marginal probability of entertainment. Such correlations between sequences of actions, time between actions, and information goals could in principle allow the search engine to make inferences about the likely search trajectory of individual users based on their action history. In turn, these inferences could then be used to provide more relevant feedback to the user in real time and could also be used for marketing purposes such as real-time targeted advertising.
7.6 Exercises
Exercise 7.1. Obtain Web server logs from your organization's Web server (if you can) for one day's worth of data and carry out the following analyses:
estimate the fraction of page requests that are coming from robots;
segment the data into sessions, using a 20 minute time-out rule for declaring
the end of sessions, and plot the empirical distribution of session lengths;
fit at least two of the following models to the session-length data: geometric, Poisson, inverse Gaussian, and power-law. Plot and comment on the results.
Exercise 7.2. Derive the mean posterior estimate in Equation (7.7) from first principles.
Exercise 7.3. Prove that the ML estimate of the parameter λ for the Poisson distribution is the empirical mean (sample average) of the data.
Exercise 7.4. Construct three different session-length distributions by designing a
transition matrix T for a four-state Markov chain (with one of the states being an
end state), where each of the distributions should be quite different from the other
two. Write a program to simulate data from each of the Markov models and construct
histograms of the session lengths for each model from your simulations.
Exercise 7.5. Discuss how the Lau and Horvitz (1999) Bayesian network model for
search queries and the Bucklin and Sismeiro (2003) model for site navigation might
be combined to provide a more general model that encompasses both navigation and
searching.
Exercise 7.6. We mentioned that for large M (e.g. a large number of pages M) esti-
mating the O(M^2) parameters of a first-order Markov chain may be quite impractical.
Specify in detail two different algorithms for clustering M pages into K states, for
the purposes of estimating a reduced category K-state Markov chain for modeling
navigation behavior.
Exercise 7.7. Discuss whether the first-order Markov models discussed in Section 7.4
can adequately model the process of users using the back button in their browsers
when browsing. If you believe that the model is not adequate, discuss how one might
develop a stochastic model that can handle this aspect of user behavior.
Exercise 7.8. Derive from first principles the ML estimates for the inverse Gaussian
distribution in Equations (7.18) and (7.19).
Exercise 7.9. None of the models in Figure 7.9 fit the data perfectly well, and indeed to the right of the data (for session lengths greater than 10^2) there may be some outliers
present. Comment on what aspects of real data sets these simple models may not be
accounting for.
Exercise 7.10. In the techniques described in this chapter we did not discuss the issue of an individual's right to privacy. Research the issue of privacy in the context of Web
data analysis and write an essay on the current state of affairs in your country in terms
of privacy and the individual surfer. For example, if a surfer visits your website what
does the law say about who owns the logged data? What are the laws (if any) that
govern how Web servers can issue cookies? And so forth.
Exercise 7.11. Imagine that you are conducting a study on Web navigation behavior
10 years from now, with client-side data and demographic information for 100 000
individuals. Predict how you think both navigation and search behavior will have
changed by then, compared to the data in the studies described in this chapter.
8
Commerce on the Web
8.1 Introduction
In the late 1990s the Web quickly changed from being primarily an academic pursuit
to an increasingly commercial one. This rapid commercialization of the Web, in
areas such as e-commerce, subscription news services, and targeted advertising, has
led to the infusion of the Web into modern daily life. While commercialization can
have its negative aspects, on the plus side it presents significant new challenges and
opportunities for academic researchers. Some examples of these are given in the
following list.
Can we design algorithms that can help recommend new products to site-
visitors, based on their browsing behavior?
Can we better understand what factors influence how customers make purchases
on a website?
Can we predict in real time who will make purchases given their observed
navigation behavior?
In this chapter we will investigate these and related questions. Once again, as in
Chapter 7, what is driving this research is the availability of vast amounts of raw data.
For a standard bricks-and-mortar retail store, the only information about a customer's
behavior within a store is usually in the form of scanner data or market basket data,
namely, the list of items and prices that were scanned when the customer paid for
their purchases. In contrast, for a store on the Web, a sequence of clickstream data
is typically available for each customer prior to making a purchase, in addition to
the purchase data (if the customer makes a purchase). This provides a much richer
description of a customer's behavior over time. Naturally this clickstream data can
itself be quite noisy, but nonetheless the volume of data available is often large enough
(e.g. millions of customers per day visiting very large e-commerce sites such as
www.amazon.com) that certain patterns clearly emerge even from noisy data.
Table 8.1 A toy matrix R of binary votes, where items correspond to columns and users
correspond to rows. If entry (i,j ) is 1 it means that the ith user voted positively for the j th
item. A blank entry means that no positive vote was recorded. Note the sparsity of the data.
User 1 1 1
User 2 1 1 1
User 3 1 1
User 4 1 1
User 5 1
User 6 1
User 7 1
User 8 1
User 9 1 1
User 10 1 1
User 11
User 12 1 1 1 1
User 13 1
User 14 1
User 15 1 1
In the second variation of the problem the items are not products that are purchased,
but instead are items that are rated by the user, either explicitly (e.g. movies rated on
a scale of one to ve) or implicitly (e.g. documents in a digital library rated by how
long the user is estimated to have spent viewing the document).
In either case the matrix V is usually very sparse, since the typical user will only
have purchased or rated a very small fraction of the overall number m of items. For
example, Popescul et al. (2001) report a study of recommender systems applied to
the ResearchIndex online scientic literature database (Lawrence et al. 1999), with
n = 33 050 users accessing m = 177 232 documents. Each user accessed on average only 0.01% of the documents in the database (about 18 documents), so that 99.99% of the possible user-item pairs are in effect not present.
An interesting issue is the treatment of missing data in the matrix V for items j that were not voted on by user i, where 'missing' is considered to be all the zeros.
These missing data are not missing completely at random but instead are likely to
be affected by a form of negative selection bias, where users may be more likely to
not vote on items that they do not like (Breese et al. 1998). In much of the work
on recommender systems this bias is not explicitly dealt with. We also choose to
conveniently ignore this issue in the discussion below, under the assumption that it
may be somewhat of a second-order factor in modeling the data compared to the issue
of how to handle the positive votes.
In our discussion we will generically refer to the user-item pairs v_{i,j} as votes, keeping in mind that 'voting' might represent purchasing an item on a website, accessing a document on a website, or rating an item such as a piece of music or a movie on a website. All of these actions can be loosely interpreted as positive votes for the item in question, whether implicit or explicit.
Many different ideas have been proposed as the basis for recommender algorithms,
the main ones being nearest-neighbor and model-based collaborative filtering. We
discuss both of these approaches as well as some related techniques in the next few
sections.
Let \bar{v}_i = \frac{1}{|I_i|}\sum_{j \in I_i} v_{i,j} be the mean vote for user i, 1 ≤ i ≤ n, where I_i is the set of items that user i has voted on (i.e. v_{i,j} > 0 for j ∈ I_i, and v_{i,j} = 0 otherwise). It is customary to calibrate user i's votes by subtracting the mean vote, to create an adjusted vote matrix with entries

v'_{i,j} = v_{i,j} − \bar{v}_i,

so that if (for example) one user tends to vote high and another tends to vote low, this systematic mean difference is removed.
We can then make predictions for the vote of a new user a on items j that he or she has not voted on by using

\hat{v}_{a,j} = \bar{v}_a + \frac{1}{C}\sum_{i=1}^{n} w_{a,i} v'_{i,j},

where C = \sum_{i=1}^{n} |w_{a,i}| is a normalizing constant. The weights w_{a,i} are defined by a function that estimates how similar user i is to user a. These weights are usually calculated based on the similarity of the votes that a has already made on items in I_a and the votes of other users on this same set of items. The overall effect of the prediction equation is to generate a predicted vote for items j, where the prediction is more heavily weighted toward votes of individuals who have similar past voting histories to a. Once the predicted vote for each individual item in I \ I_a is calculated, these predictions can then be ranked and the highest ranked items can be presented to user a.
Defining weights
One way to differentiate between different collaborative filtering algorithms is in terms of how the similarity weights are defined in the prediction equation above. One widely used weighting scheme (Resnick et al. 1994) is the correlation coefficient between users i and a, defined as

w_{a,i} = \frac{1}{C_2}\sum_{j} (v_{a,j} − \bar{v}_a)(v_{i,j} − \bar{v}_i),

where

C_2 = \Big[\sum_{j} (v_{a,j} − \bar{v}_a)^2 \sum_{j} (v_{i,j} − \bar{v}_i)^2\Big]^{1/2}.
The denominator has the effect of normalizing the contribution of user i relative to
the total number of votes that user i has made. Thus, in an extreme example, if a has
voted on only two items, but there is a user i who has voted on all items with the same
value as user a on the two items in common, then the correlation weight as defined
earlier would have value 1, even though user a and user i may really have very little
in common. In contrast, the vector similarity function would have a weight wa,i that
is much closer to zero than to one, since the contribution of user i to as items would
be downweighted by the fact that i has so many votes in general.
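A compact sketch of the mean-adjusted, correlation-weighted prediction described above. Representing missing votes as NaN, restricting the correlation to co-rated items, and treating missing adjusted votes as zero are implementation choices of ours, not prescriptions from the text.

```python
import numpy as np

def predict_votes(V, a):
    """Predict user a's votes on unrated items from the other rows of the
    (n x m) vote matrix V, which contains np.nan where no vote was recorded."""
    means = np.array([np.nanmean(r) if np.any(~np.isnan(r)) else 0.0 for r in V])
    adj = np.where(np.isnan(V), 0.0, V - means[:, None])   # adjusted votes

    def weight(i):
        both = ~np.isnan(V[a]) & ~np.isnan(V[i])            # items rated by both users
        if not both.any():
            return 0.0
        da, di = V[a, both] - means[a], V[i, both] - means[i]
        denom = np.sqrt(np.sum(da ** 2) * np.sum(di ** 2))
        return float(np.dot(da, di) / denom) if denom > 0 else 0.0

    w = np.array([0.0 if i == a else weight(i) for i in range(V.shape[0])])
    C = np.sum(np.abs(w)) or 1.0
    pred = means[a] + (w @ adj) / C
    pred[~np.isnan(V[a])] = np.nan     # only items that a has not yet voted on
    return pred

V = np.array([[5, 4, np.nan, 1],
              [4, 5, 2, np.nan],
              [1, np.nan, 5, 4]], dtype=float)
print(predict_votes(V, a=0))
```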
In addition to the weighting scheme, there are a significant number of other factors
that can be tweaked in terms of constructing an automated algorithm for predictions.
For example, for a user a with many votes, there may be a very large number of users i
that will match with a but that have relatively small weights. Including these matches
in the prediction equation may actually lead to worse predictions than simply leaving
them out, and thus, one can impose a threshold on wa,i such that only users with
similarity weights above the threshold are used in the prediction, or only select the
top k most similar users, and so forth. The paper by Sarwar et al. (2000) discusses a
variety of such extensions.
where v_j ∈ {0, 1} indicates whether the vote in the j th column is zero or one and θ_{jk} = P(v_j | c = k). This is known as a conditional independence or Naive Bayes model in statistics and machine learning, discussed earlier in Chapters 1 and 4. More generally, one could use various forms of Bayesian networks or log-linear models for each component.
The number of parameters of the Naive Bayes model per component is linear in the dimensionality m, so overall we only have O(Km) parameters compared to O(2^m) for a full joint distribution (for binary votes). The parameters of the model can be estimated from training data using a simple application of the expectation maximization algorithm that iteratively adjusts the parameters of the hidden-variable model to maximize the likelihood of the observed data (see, for example, the earlier discussion in Chapter 1 and more generally in Hand et al. (2001)).
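A compact EM sketch for the mixture of conditional-independence (Bernoulli) components described above, for a binary vote matrix. The initialization, the smoothing constant, and all variable names are our own choices.

```python
import numpy as np

def fit_bernoulli_mixture(V, K, n_iter=50, eps=1e-3, seed=0):
    """EM for a K-component mixture in which, within a component, the m binary
    votes are independent:
    P(v | c = k) = prod_j theta[k, j]**v_j * (1 - theta[k, j])**(1 - v_j)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    theta = rng.uniform(0.25, 0.75, size=(K, m))   # theta[k, j] = P(v_j = 1 | c = k)
    pi = np.full(K, 1.0 / K)                       # mixing weights P(c = k)
    for _ in range(n_iter):
        # E-step: responsibilities P(c = k | v_i), computed in log space
        log_p = V @ np.log(theta).T + (1 - V) @ np.log(1 - theta).T + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        R = np.exp(log_p)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted ML estimates, lightly smoothed to stay inside (0, 1)
        Nk = R.sum(axis=0)
        theta = (R.T @ V + eps) / (Nk[:, None] + 2 * eps)
        pi = Nk / n
    return pi, theta

V = (np.random.default_rng(1).random((200, 30)) < 0.1).astype(float)   # sparse toy votes
pi, theta = fit_bernoulli_mixture(V, K=3)
print(pi)
```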
The mixture of conditional independence models may be too simple to fully
capture many aspects of the real data. For example, it ignores any dependencies
between items within a cluster, i.e. all pairs P (vj , vl | c = k) are modelled as
P (vj | c = k)P (vl | c = k). Nonetheless, it does capture unconditional marginal
dependence of items, in that the model does allow P (vj , vl ) to be different from
P (vj )P (vl ). Intuitively, we can imagine that for each component there are sets of
items that are relatively likely to have 'on' votes, where P(v_j = 1 | c = k) is much
greater than P (vj = 1), and the rest of the items are more likely to have votes vj = 0.
A limitation of this type of model for collaborative filtering is that it implies that each user can be described by a single component model: we are assuming that each user was generated by one (and only one) of the K components. This is the same assumption made in the clustering approach discussed earlier, and indeed our mixture model is essentially a form of clustering with probabilistic semantics. Thus, as before with clustering, if a user's interests span multiple different types of topics (e.g. books on kayaking, Italian history, and blues music) there may be no single cluster that represents the combination of these three topics. However, there may well be mixture components that represent each of the individual topics, e.g. clusters of books for each of water sports, European history, and music.
Hofmann and Puzicha (1999) proposed an interesting extension of the conditional mixture model above that directly addresses the problem of multiple interests, based on a more general model proposed earlier by Hofmann (1999). Their model can be interpreted as allowing each user's interests to be generated by a superposition of K different underlying simpler component models. Thus, instead of assuming that each user's row vector of votes is generated by a single component model P(v_1, . . . , v_m | c = k), in the Hofmann and Puzicha model each row vector of votes can be generated by a combination of up to K of the component models. This is a potentially powerful idea in modeling high-dimensional data. To represent arbitrary combinations of K different interests we do not need 2^K different models but can instead combine K models appropriately.
Parameters of the Hofmann and Puzicha model can be fit to the observed data matrix V using the EM algorithm. In the original formulation of the model each user has his/her own set of parameters, leading to potential overfitting when using
the probability of each of the m different items in the data conditioned on the other items, using the vote matrix V. To make predictions for user a, if the votes are implicit (e.g. product purchases or Web page requests) we can use all of the other m − 1 implicit votes as inputs to predict the vote of interest. Predictions are made for each item that has an implicit vote of zero, i.e. that has not been purchased or visited, and then the resulting set of probabilities is ranked to provide a recommendation to the user a.
Heckerman et al. (2000) applied this technique to recommender data sets involving website visits (a visit to a particular page by a user is considered an implicit binary vote for that page, and items correspond to pages) and TV viewing records consisting of data collected by companies that estimate TV viewership; here the items correspond to programs watched by viewers. The tree models were compared to a method proposed earlier by Breese et al. (1998) that constructs a full Bayesian network as a model for the joint distribution P(v_1, . . . , v_m) and then uses the network to make predictions on each v_j that does not have a positive vote, using all the others. Breese et al. found that the Bayesian network model empirically outperformed memory-based and cluster-based collaborative filtering on different ratings data sets, and thus it provided a useful comparative benchmark for the tree models.
The models were trained on training subsets and evaluated on disjoint out-of-sample test subsets. In the test data sets, each user's positive votes were randomly partitioned into two subsets of items: an input set, where the votes were assumed known and used as inputs to each model, and a measurement set, where the votes were assumed unknown and used to test the model's predictive power. Predictions were made for a variety of scenarios using this train/test arrangement. Under one scenario all but one of the positive votes for user a were placed in the input set of items and then used to predict the other item (this corresponds to knowing as much as possible about the user). Under the other scenarios only a fixed number k ∈ {2, 5, 10} of positive votes were put in the input set for prediction, so that smaller k values corresponded to less information about the user.
A particular type of objective function was constructed to estimate the utility of each ranked list of ratings to a user; a full description of this function is provided in Heckerman et al. (2000). For the Web and TV viewing data the probabilistic trees were generally slightly less accurate than Bayesian networks in terms of this objective function, but the differences were quite small.
Table 8.2 summarizes a variety of other aspects of the experiments, across the three data sets, for the specific scenario of 'all but one' prediction; results for the other scenarios were quite similar. The trees are significantly faster in terms of making predictions (e.g. 23.5 versus 3.9 predictions per second on the second Web data set), which is an important feature for real-time prediction on websites. Presumably both algorithms were implemented efficiently so that these numbers provide fair comparisons between the two. The probabilistic trees also have advantages offline: they require less time and an order of magnitude less memory to train than the Bayesian network models. For both techniques it is quite impressive that models for high-dimensional data can be constructed so quickly, e.g. in about 100 s for the first Web data set based on 1000 dimensions (items) and 10 000 users.
Table 8.2 Summary of data sets and experimental results for the 'all but one' prediction experiments from Heckerman et al. (2000). BN is the belief network model and PT is the probability tree model.
In practical settings, e.g. for a real-world e-commerce application, both the number
of users and the number of items will often be significantly larger than those in the data
sets described above (Schafer et al. 2001). Nonetheless, these experiments provide
useful guidelines on some of the tradeoffs and options available when using model-
based techniques for recommender systems.
items. On the other hand, systems that are only based on content are clearly ignoring
potentially valuable information that lurks in historical rating and purchasing patterns
of the community.
An example of a model-based approach that combines both content and collabora-
tive filtering is that proposed in Popescul et al. (2001). Their approach is an extension of the hidden variable model of Hofmann and Puzicha (1999) discussed earlier in Section 8.3.3, where now the features of the items are included in the model as well as votes from users. A specific model was developed for the case of users browsing documents at an online digital library (the NEC ResearchIndex document database), where items correspond to documents, content consists of the words in each document, and a positive vote corresponds to user u requesting a particular document d. In this model, a joint density is constructed by assuming the existence of a hidden latent variable z that renders users u, documents d, and words w conditionally independent, i.e.

P(u, d, w) = \sum_{z} P(u | z) P(d | z) P(w | z) P(z).    (8.1)
As in the Hofmann and Puzicha (1999) approach, the hidden variable z represents
different (hidden) topics for the documents, and multiple topics can be active within
a single document d or for a single user u. The inclusion of the term P (w | z) allows
the inclusion of content information in a natural manner. The EM algorithm can be
used to estimate the conditional probability parameters that relate the hidden variable
z to the observed data. Popescul et al. (2001) found that this particular model had problems with overfitting due to the sparsity of the data, even based on a relatively active set of 1000 users accessing 5000 documents, where the density of ones in the data matrix was 0.38% versus 0.01% for randomly selected users. To combat this overfitting, they also proposed a simpler model P(u, w) based on content alone that in effect infers preferences in word-space, which is much more dense than document-space since there are far more words, and found empirically that this model produced better predictions than the original model. Thus, while relatively sophisticated models can be proposed for combining content and votes, fitting these models to sparse high-dimensional data remains a significant challenge.
Thus, there is clearly no network effect at play here but rather a brute force attempt
to use the Web for direct advertising.
Figure 8.1 Fitted Bass distribution model for the first 52 weeks of adoption of Hotmail, where the curve represents the fitted model and the dots represent five of the 52 data points used to estimate the parameters of the model. (Adapted from Montgomery (2001).)
noted that these particular data could also be well approximated by a simple quadratic or exponential for this initial growth period.
If we extrapolate the model to a later time-period we see that the diffusion model has a characteristic S-shaped curve (Figure 8.2) that asymptotes at N, the estimated ultimate number of adopters. In Figure 8.2 we have added the number of Hotmail subscribers after 72 weeks (12 million) and after six years (110 million). We can see that the coefficients estimated using only the first 52 weeks of data do not extrapolate well. It is perhaps not surprising that a model fitted to the first year's worth of data does not extrapolate well to predicting what will happen five years later. Figure 8.2 also shows a different set of hand-chosen parameters for the diffusion model, where the asymptotic value of N = 110 million is assumed known and the values of the two remaining parameters are each reduced, resulting in a much better overall fit; of course, in hindsight it is always somewhat easy to explain the data!
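The chapter's parameterization of the diffusion model is given earlier in the book and is not repeated here; purely as an illustration, the sketch below iterates the standard Bass (1969) recursion, with an innovation coefficient p, an imitation coefficient q, and an ultimate number of adopters N_bar, all set to hypothetical values. It reproduces the characteristic S-shaped curve discussed above.

```python
import numpy as np

def bass_curve(weeks, p, q, N_bar, N0=0.0):
    """Discrete-time simulation of the standard Bass diffusion model:
    new adopters per step = (p + q * N / N_bar) * (N_bar - N)."""
    N = [N0]
    for _ in range(weeks):
        n = N[-1]
        N.append(n + (p + q * n / N_bar) * (N_bar - n))
    return np.array(N)

# hypothetical parameters only, not the fitted values discussed in the text
curve = bass_curve(weeks=300, p=0.002, q=0.08, N_bar=110.0)
print(curve[[52, 72, 300]])    # millions of subscribers after 52, 72, and 300 weeks
```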
The diffusion model as presented above is certainly too simple to fully explain the success of Hotmail or other similar Web phenomena. Such models nonetheless serve the useful purpose of providing a starting point for generative modeling of network-based recommendations on the Web. In related work on this topic, Domingos and Richardson (2001) proposed a Markov random field model based on social networks for estimating an individual's network value, which depends on how much a particular individual can influence other individuals in the network. In a broader context, Daley
Figure 8.2 Bass distribution models over six years of Hotmail adoption (millions of subscribers versus weeks), comparing the model estimated from the first 52 weeks of data (N = 9.76 million) with a model using hand-chosen parameters.
Table 8.3 Estimated transition probabilities among eight states from sessions where items
were purchased. The value for row i, column j indicates the probability of going from a page
in category i to a page in category j . The marginal probabilities (the probability of a page being
in a particular category, as estimated from the data) are in the rightmost column.
H A L P I S O E Marginal
to have different navigation patterns through a site than the consumers who are just
browsing in an exploratory manner.
We next examine this hypothesis further using a data set collected by Li et al.
(2002), consisting of the browsing patterns of 1160 different individuals who visited
the online bookstore www.barnesandnoble.com between 1 April and 30 April
2002. The data were collected at the client side using the Comscore Media Metrix
system that records the page requests and page-viewings of a randomly selected set
of computer users. While relatively small in terms of the total number of individuals,
this data set is nonetheless quite useful in terms of providing a general idea of the
characteristics of online shopping behavior.
The 1160 individuals generated 1659 different sessions at the Barnes & Noble
website. The sessions in total consisted of 9180 page requests and 14 512 page view-
ings, where page viewings count all pages viewed by the user and page requests
only record requests that went to the server website (thus, pages that were viewed
twice in a session and redisplayed by the caching software would be counted as page
viewings, but not as page requests). The end of a session was declared if no page had
been viewed for 20 min.
The mean session length (in terms of page viewings) was 8.75, the median was five, the standard deviation was 16.4, the minimum length was one, and the maximum was 570. Seven percent of sessions ended in purchases, a rate that is relatively high; purchase rates of between 1% and 2% have been reported in the media as being more typical for e-commerce sites in general (Tedeschi 2000).
Demographic data were also available for the 1160 individuals. They had an average
age of 46, 53% were female, 77% were white, 40% had children, 29% were married,
82% had some college education, and 32% had an annual income in excess of $50 000.
This suggests a relatively well-educated and affluent set of consumers.
One of the difficulties in analyzing Web navigation data is the very large number
of possible pages that can be presented to a site visitor. In essence the number is
unbounded, since in addition to there being a page for each product, the pages them-
230 WEB PATH ANALYSIS FOR PURCHASE PREDICTION
Table 8.4 Estimated transition probabilities among eight states from sessions where items
were not purchased. The value for row i, column j indicates the probability of going from a
page in category i to a page in category j . The marginal probabilities (the probability of a page
being in a particular category, as estimated from the data) are in the rightmost column.
H A L P I S O E Marginal
the Shopping Cart (S) and Account (A) states are much higher in the purchase group
than in the non-purchase group, and of course the transition probabilities into the
Order (O) state are all zero for the non-purchase group.
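Transition probabilities such as those in Tables 8.3 and 8.4 can be estimated by simple transition counting within sessions, as in the sketch below. The eight category codes follow the table headers; apart from Account (A), Shopping Cart (S), and Order (O), which are named in the text, the expansions are not specified here, and the toy sessions are invented.

```python
import numpy as np

def estimate_transition_matrix(sessions, categories):
    """ML estimate of P(next category | current category) from a list of sessions,
    each given as a list of category labels; rows of the result sum to one."""
    idx = {c: i for i, c in enumerate(categories)}
    counts = np.zeros((len(categories), len(categories)))
    for s in sessions:
        for cur, nxt in zip(s[:-1], s[1:]):
            counts[idx[cur], idx[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

cats = ["H", "A", "L", "P", "I", "S", "O", "E"]       # category codes from Tables 8.3/8.4
sessions = [["H", "L", "P", "P", "S", "O", "E"],      # invented purchase session
            ["H", "L", "L", "E"]]                     # invented browsing session
print(estimate_transition_matrix(sessions, cats).round(2))
```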
Li et al. fitted a series of relatively sophisticated statistical models to this data,
the details of which are somewhat beyond the scope of this text. The most complex
models included several components:
(1) latent variables that modelled the utility of individual i selecting a page from
category c at time t;
(2) time-dependence via a hidden Markov model that allows switching between
states over time;
(3) covariates based on both properties of the pages and individual demographics;
and
(4) heterogeneity across different individuals using a hierarchical Bayes approach,
as briey discussed in Chapter 7.
The models were fitted using a widely used stochastic sampling technique for Bayesian estimation known as Markov chain Monte Carlo (MCMC). The different models
were then evaluated in terms of their ability to provide out-of-sample predictions of
both
(a) the next page category requested by an individual given the sequence of pages
requested by that individual up to that time-point and given the individuals
demographic data; and
(b) whether or not that person would make a purchase (given the same information).
In terms of predicting the next category, the results indicated that Markov models
with memory were significantly more accurate than models that did not have memory,
with the best memory-based models achieving 64% accuracy in predicting the next
category (out of eight) versus only about 20% for non-memory models. It was also
found that hidden Markov models with two states consistently had slightly better
accuracy out-of-sample than models with only a single state, although it should be
cautioned that this was based on only a single train/test split of the available data.
The authors interpreted one of the states as being an exploratory browsing state and
the other as being purchase-oriented, lending some support to the hypothesis earlier
about a simple two-state model for consumer behavior in an e-commerce context.
Demographics did not appear to have significant predictive power in the models, and
browsing behavior seemed to be relatively independent of the duration of page views
according to the model. Interestingly, the composition of the page appeared to have an
influence on browsing behavior. Having prices on a page seemed to discourage visitors
in the browsing-oriented state from continuing to browse, while the same information
had a positive effect on visitors in the purchase-oriented state. Conversely, promotional
items such as advertised discounts had the reverse effect, encouraging visitors in
the browsing state to continue browsing, but discouraging users in the purchasing
state from continuing in that state. Naturally, we need to keep in mind that all such
interpretations of the data are being filtered through a particular model and as such may reflect artifacts of the model-fitting process and the random sample at hand rather than necessarily reflecting a true property of the processes generating the data. Nonetheless,
these types of inferences are quite suggestive and lead naturally to hypotheses about
consumer behavior that could be properly tested in further experimental studies.
In terms of predicting whether a visitor makes a purchase or not, the models fitted by Li et al. were somewhat less accurate than they were in predicting the next category. Recall that 7% of all sessions in this data set result in a purchase. The two-state model was used to predict whether a visitor would make a purchase or not, based on the first k page views, for site visitors whose sessions were not used in building the model
(but whose demographic data were assumed known). After two page views, the model
predicted that 12% of the true purchasers would purchase a product and only 5.3%
of the true non-purchasers. After six page views, the model predicted that 13.1% of
the true purchasers would purchase a product and made the same prediction for 2.9%
of the true non-purchasers. Note that a perfect forecasting system would predict a
purchase for 100% of the purchasers and 0% of the non-purchasers. More accurate forecasts can be made as we receive more information (six page views are more accurate than two page views), but overall there is significant room for improvement,
since some 87% of true purchasers went undetected by the model after six page
views.
8.6 Exercises
Exercise 8.1. Consider a simple toy model that we could use to simulate a large
sparse binary data set with n rows (users) and m columns (items). The model operates
in the following way.
(1) Each user acts independently of all other users, using the same model (below).
(2) For a given user i, he or she considers each of the m items in turn, and indepen-
dently makes a decision with probability p as to whether to purchase the item
or not. If they decide to purchase the item then they enter a 1 for that item,
otherwise a 0.
(3) We set p = k/m, where k is a parameter of the model and represents the
expected number of items purchased by each user. Typically k is much smaller
than m, e.g. k = 4, m = 1000.
Calculate the probability, as a function of k, m, and n, that a new user acting under
the same model will have no items at all in common with any of the n users in an
n m data matrix generated under the model above. Plot this probability for different
values of k, m, and n, and comment on the results.
Exercise 8.2. An extension of the model in Exercise 8.1 allows each of the items to have a different probability p_j, 1 ≤ j ≤ m, of being purchased. A further extension allows limited dependencies among items, e.g. the probability of a certain item j being purchased depends on whether another item i was purchased or not, represented by conditional probabilities P(j | i). Clearly the most complex model we could consider in terms of item dependencies would be a full joint density on all m items.
How many parameters are needed to specify this full joint density? How many are required for the independence model in Exercise 8.1? Clearly the independence model is too simple, but the full joint density is impractical once m becomes large. Describe as many types of probabilistic models as you can think of that are in between the independence model and the full joint density model (e.g. a Markov model where each item depends on one other item, in a specified order, would be one such model).
Exercise 8.3. A different probabilistic model than the independence model in Exer-
cise 8.1 for generating sparse binary data operates as follows. Each user still generates
data independently. To generate a row of data, the ith user now tosses an m-sided die
some number of times. The number of tosses could also be drawn from a random dis-
tribution, or could be fixed for each user. The probability of the m-sided die coming up on side j is represented by p_j, 1 ≤ j ≤ m, where \sum_{j} p_j = 1, and for simplicity we could assume that all p_j are equal, i.e. p_j = 1/m. If the die comes up on side j,
the user puts a 1 in column j , otherwise a 0. This type of data generation model is
sometimes referred to as a multinomial model as discussed in Chapter 4.
Compare this model with the independence model in Exercise 8.1, in terms of the
type of data that will be generated. For example, assume in our model here that
the die is tossed exactly k times for each user. How does the probability distribution
of number of ones per row (we can think of this as the distribution of basket sizes)
under the die model compare to the probability distribution of basket sizes under
the independence model in Exercise 8.1? (Hint: reviewing the material on Markov
chains in Chapter 1 may help.)
You can also consider the problem of trying to model two items that have an
exclusive-OR relationship: the user is likely to purchase one or the other, but not both
Appendix A
Mathematical Complements
In this appendix, we begin with a short and rather informal introduction to some of the basic concepts and definitions of graph theory used primarily in Chapter 3 but also in the definition of Bayesian networks in Chapter 1. We then review a number of useful distributions that are used in various sections of this book, and provide a short introduction to Singular Value Decomposition (SVD), Markov chains, and the basic concepts of information theory. The list of distributions that are routinely encountered in probabilistic models and analysis of real-world phenomena is not meant to be exhaustive in any way. In fact, a more detailed list can be found, for instance, in Gelman et al. (1995).
A.1.2 Connectivity
A path of length k in a graph is an alternating sequence of edges and vertices
v0 , e1 , v1 , . . . , ek , vk with each edge adjacent to the vertices immediately preced-
ing and following it. A directed path is defined in the obvious way by adding the
requirement that all edges be oriented in the same direction. The shortest directed
or undirected path between any two vertices in a graph can be found using standard
dynamic programming/breadth-first methods (Cormen et al. 2001; Dijkstra 1959;
Viterbi 1967). The distance between two vertices can be measured by the length of
the shortest path. A loop is a path where v0 = vk and a cycle is a directed loop. A
directed acyclic graph, or DAG, is a directed graph which contains no directed cycles.
A (strongly) connected component of a (directed) graph is a maximal set of vertices
such that there is a (directed) path joining any two nodes. A graph is connected if it has
only one connected component. The diameter of a component is the maximal length
of the shortest path between any vertices in the component. The diameter of a graph is
the largest diameter of its connected components. A clique of G is a fully connected
subgraph of G (in some definitions, cliques are also required to be maximal complete subgraphs). The mathematical definition of diameter corresponds to a worst-case definition. For practical purposes, it is often more informative to know the average distance between any two vertices rather than the maximal distance. This is also called the average diameter or average distance in the literature and throughout the book. Finally, it is easy to see how these connectivity notions can be modified and
adapted as needed in the context of directed graphs.
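A small sketch of the breadth-first-search computation behind these definitions: shortest-path distances from a source vertex, and the average distance of a connected undirected graph. The adjacency-list representation is our choice.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable vertex of
    an undirected graph given as an adjacency list {vertex: set_of_neighbours}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_distance(adj):
    """Average distance between distinct vertex pairs (assumes a connected graph)."""
    total, pairs = 0, 0
    for s in adj:
        d = bfs_distances(adj, s)
        total += sum(d.values())
        pairs += len(d) - 1
    return total / pairs

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(bfs_distances(adj, 0))       # {0: 0, 1: 1, 2: 1, 3: 2}
print(average_distance(adj))
```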
phase transition around p = log n/n, so that the graph tends to be disconnected below
the threshold value and has a giant connected component above the threshold (see,
for instance, Bollobás 1985). For reasons described in Chapter 3, the random uniform
graphs sometimes are also called random exponential graphs, due to the exponentially
decaying tail of their degree distribution.
A.2 Distributions
A.2.1 Expectation, variance, and covariance
For completeness, we provide a brief reminder on the definition of expectation, variance, and covariance. Let P(X) be a probability density function for a real-valued variable X. The expected value of X is defined as

E[X] = \int x P(x) \, dx.
It should be clear how these expressions can also be applied to discrete random
variables, or to vector random variables, on a component-by-component basis.
is given by

P(X_1 = k_1, . . . , X_m = k_m) = \frac{n!}{k_1! \cdots k_m!} p_1^{k_1} \cdots p_m^{k_m}.    (A.2)

The mean of each component is E[X_i] = np_i, the variance var[X_i] = np_i(1 − p_i), and the covariance cov[X_i, X_j] = −np_i p_j for i ≠ j.
Poisson distribution
The Poisson distribution corresponds to rare events. The Poisson distribution with parameter λ is given by

P(X = k \mid \lambda) = e^{-\lambda} \frac{\lambda^k}{k!}.  (A.3)

The mean and the variance are both equal to λ. When n is large and p is small, the binomial distribution can be approximated by the Poisson distribution with λ = np.
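The following sketch (ours, with an arbitrary choice of n and p) compares the binomial probabilities with their Poisson approximation with λ = np.

    from math import comb, exp, factorial

    def binomial_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def poisson_pmf(k, lam):
        return exp(-lam) * lam**k / factorial(k)

    # For large n and small p, Binomial(n, p) is close to Poisson(lambda = n*p).
    n, p = 1000, 0.003
    lam = n * p
    for k in range(7):
        print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))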
Geometric distribution
The geometric density with parameter p (0 < p < 1) is described by
Exponential distribution
The exponential density with parameter λ is given by

f(x | λ) = λ e^{-λx}  (A.7)

for x ≥ 0. The mean is E[X] = 1/λ and the variance var[X] = 1/λ². The exponential distribution is a special case of the Gamma distribution.
Gamma distribution
The gamma density (Feller 1971) with parameters α and λ is given by

γ(x | α, λ) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}  (A.8)

for x > 0, and zero otherwise. Γ(α) is the Gamma function Γ(α) = \int_0^{\infty} e^{-x} x^{\alpha - 1} dx. The mean is E[X] = α/λ and the variance var[X] = α/λ². If X has a Gaussian distribution with mean zero and variance σ², then X² has a Gamma density with parameters α = 1/2 and λ = 1/(2σ²). The exponential density with parameter λ is the Gamma density with parameters α = 1 and λ. The Chi-square density with k degrees of freedom, for instance, is the Gamma density with parameters α = k/2 and λ = 1/2.
Dirichlet distribution
Finally, in the context of multinomial distributions, which play an important role in this book, an important class of distributions is the Dirichlet distributions (Berger 1985). By definition, a Dirichlet density on the probability vector p = (p_1, . . . , p_m), with parameters α and q = (q_1, . . . , q_m), has the form

D_{\alpha q}(p) = \frac{\Gamma(\alpha)}{\prod_i \Gamma(\alpha q_i)} \prod_{i=1}^{m} p_i^{\alpha q_i - 1} = \frac{1}{Z(\alpha q)} \prod_{i=1}^{m} p_i^{\alpha q_i - 1},  (A.9)

with α, p_i, q_i ≥ 0 and Σ_i p_i = Σ_i q_i = 1. For such a Dirichlet distribution, E(p_i) = q_i, var[p_i] = q_i(1 - q_i)/(α + 1), and cov[p_i, p_j] = -q_i q_j/(α + 1). Thus q is the mean of the distribution, and α determines how peaked the distribution is around its mean.
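A short NumPy sketch (ours) checks these Dirichlet moments by sampling; note that NumPy's dirichlet routine is parameterized directly by the vector αq, and the parameter values are arbitrary.

    import numpy as np

    # The Dirichlet D_{alpha*q} above corresponds to NumPy's parameter vector alpha*q.
    rng = np.random.default_rng(0)
    alpha, q = 10.0, np.array([0.5, 0.3, 0.2])
    samples = rng.dirichlet(alpha * q, size=200_000)

    print("mean:", samples.mean(axis=0), "theory:", q)
    print("var :", samples.var(axis=0), "theory:", q * (1 - q) / (alpha + 1))
    print("cov(p1,p2):", np.cov(samples[:, 0], samples[:, 1])[0, 1],
          "theory:", -q[0] * q[1] / (alpha + 1))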
F(x | α, β) = 1 - e^{-(x/\beta)^{\alpha}}.  (A.11)
A density belongs to the exponential family if it can be written in the form

p(x | θ) = h(x) c(θ) \exp\Bigl( \sum_{i=1}^{k} w_i(θ) t_i(x) \Bigr),

where h(x) ≥ 0, the t_i(x) are real-valued functions of the observation x that do not depend on θ, and c(θ) and the w_i(θ) are real-valued functions of the parameter vector θ that do not depend on x.
Most common distributions belong to the exponential family, including the normal (with either mean or variance fixed), gamma (e.g. Chi square and exponential), and Dirichlet (e.g. Beta) in the continuous case, and the binomial and multinomial, geometric, negative binomial, and Poisson distributions in the discrete case. A characteristic
of all exponential distributions is the exponential decay to zero for large values of x.
This is in contrast with the polynomial decay of power-law distributions studied in
Chapter 1 and often encountered in the other chapters of this book. Among the impor-
tant general properties of the exponential family is the fact that a random sample
from a distribution in the one-parameter exponential family always has a sufficient statistic S. Furthermore, the sufficient statistic itself has a distribution that belongs to
the exponential family.
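As a worked illustration (ours, not part of the original text), the Poisson distribution of Equation (A.3) can be written in this form with a single term:

P(x \mid \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}
= \underbrace{\tfrac{1}{x!}}_{h(x)} \; \underbrace{e^{-\lambda}}_{c(\lambda)} \; \exp\bigl( \underbrace{(\log \lambda)}_{w_1(\lambda)} \, \underbrace{x}_{t_1(x)} \bigr),

so that for a sample x_1, . . . , x_N the sufficient statistic is S = \sum_{i=1}^{N} t_1(x_i) = \sum_{i=1}^{N} x_i.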
i.e. it is the projection along the direction having maximum variance, which equals λ_1²/n. Recursively, the kth principal component maximizes the projection variance
subject to the constraint that it must be orthogonal to the previous k - 1 principal components. PCA is often used in statistics and machine learning for reducing the dimensionality of data. In this case, only the coordinates of the data points with respect to the first K < n principal components are retained.
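A minimal Python sketch of this use of PCA, assuming the principal components are obtained from the SVD of the centered data matrix (the helper name pca_project and the synthetic data are ours):

    import numpy as np

    def pca_project(X, K):
        """Project the rows of X onto the first K principal components via SVD."""
        Xc = X - X.mean(axis=0)                 # center each column (feature)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:K]                     # top-K principal directions
        variances = s[:K] ** 2 / X.shape[0]     # variance of the projections
        return Xc @ components.T, variances

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data
    coords, variances = pca_project(X, K=1)
    print("variance along first principal component:", variances[0])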
This is the standard first-order Markov assumption: the probability of the future is independent of the past given the current state.
If the probabilities P(X_{t+1} = j | X_t = i) are independent of time t, then we say that the Markov chain is stationary and denote the corresponding set of probabilities as P_ij, 1 ≤ i, j ≤ M. The set of probabilities P_ij can be conveniently described by an M × M transition matrix T, whose rows sum to unity, i.e.

\sum_j P_{ij} = 1.
For example, a three-state stationary Markov chain could have the transition matrix

T = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}.  (A.20)
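A small Python sketch (ours) iterates this chain to approximate its stationary distribution, i.e. the vector π with πT = π:

    import numpy as np

    # Transition matrix of Equation (A.20); rows sum to one.
    T = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])

    # The stationary distribution pi satisfies pi T = pi;
    # repeated multiplication by T converges to it.
    pi = np.array([1.0, 0.0, 0.0])
    for _ in range(1000):
        pi = pi @ T
    print("stationary distribution:", pi)
    print("check pi @ T           :", pi @ T)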
The units used to measure entropy depend on the base used for the logarithms. When
the base is two, the entropy is measured in bits. The entropy measures the prior
uncertainty in the outcome of a random experiment described by p, or the information
gained when the outcome is observed. It is also the minimum average number of bits
(when the logarithms are taken base 2) needed to transmit the outcome in the absence
of noise. The corresponding concept in the case of a continuous random variable X
with density p(x) is called the differential entropy
H(X) = -\int_{-\infty}^{+\infty} p(x) \log p(x) \, dx.  (A.22)
H(p) = H(q) + \sum_{i=1}^{k} q_i \, H\!\left( \frac{p^{i}}{q_i} \right),  (A.24)

where p^i denotes the set of probabilities p_j for j ∈ A_i. Thus, for example, the composition law states that, by grouping the first two events into one,

H(1/3, 1/6, 1/2) = H(1/2, 1/2) + (1/2) H(2/3, 1/3) + (1/2) H(1).  (A.25)
From the first condition, it is sufficient to determine H for all rational cases where p_i = n_i/n, i = 1, . . . , n. But from the second and third conditions,

H\!\left( \sum_{i=1}^{n} n_i \right) = H(p_1, . . . , p_n) + \sum_{i=1}^{n} p_i H(n_i).  (A.26)
For example,
The constant C determines the base of the logarithm. Base 2 logarithms lead to
a measure of entropy and information in bits. For most mathematical calculations,
however, we use natural logarithms so that C = 1.
It is not very difficult to verify that the entropy has the following properties:
H(p) ≥ 0;
H(p | q) ≤ H(p), with equality if and only if p and q are independent;
H(p_1, . . . , p_n) ≤ Σ_{i=1}^n H(p_i), with equality if and only if the p_i are independent;
H(p) is convex (∩) in p;
H(p_1, . . . , p_n) = Σ_{i=1}^n H(p_i | p_{i-1}, . . . , p_1);
H(p) ≤ H(n), with equality if and only if p is uniform.
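A short Python sketch (ours) illustrates some of these quantities numerically, including the composition law of Equation (A.25) and the uniform upper bound; entropies are computed in bits.

    import numpy as np

    def entropy(p, base=2.0):
        """Shannon entropy of a probability vector p (0 log 0 taken as 0)."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return float(-(nz * np.log(nz)).sum() / np.log(base))

    p = [1/3, 1/6, 1/2]
    print("H(p)        =", entropy(p))                   # about 1.459 bits
    # Composition law (A.25): group the first two events into one of probability 1/2.
    print("composition =", entropy([0.5, 0.5]) + 0.5 * entropy([2/3, 1/3])
                           + 0.5 * entropy([1.0]))       # same value
    print("H(uniform)  =", entropy([1/3, 1/3, 1/3]))     # log2(3) ~ 1.585 bits, the maximum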
Relative entropy
The relative entropy between two density vectors p = (p_1, . . . , p_n) and q = (q_1, . . . , q_n) is defined as

H(p, q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}.
Mutual information
The third concept for measuring information is the mutual information. Consider two
density vectors p and q associated with a joint distribution r over the product space.
The mutual information I(p, q) is the relative entropy between the joint distribution
r and the product of the marginals p and q:
As such, it is always nonnegative. When r is factorial, i.e. equal to the product of the marginals, the mutual information is zero. The mutual information is a special case of relative entropy. Likewise, the entropy (or self-entropy) is a special case of mutual information because H(p) = I(p, p). Furthermore, the mutual information satisfies
the following properties:
Therefore the difference between the entropy and the conditional entropy measures
the average information that an observation of Y brings about X. It is straightforward
to check that
I(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
= H(X) + H(Y) - H(Z) = I(Y, X),  (A.34)
where H (Z) is the entropy of the joint variable Z = (X, Y ). We leave the reader to
draw the classical Venn diagram associated with these relations.
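The identities in Equation (A.34) are easy to check numerically; the following sketch (ours, with an arbitrary 2 × 2 joint distribution) computes I(X, Y) directly and from the entropies.

    import numpy as np

    def H(p):
        """Entropy (in bits) of an array of probabilities."""
        p = np.asarray(p, dtype=float).ravel()
        nz = p[p > 0]
        return float(-(nz * np.log2(nz)).sum())

    # A small joint distribution r(x, y); rows index X, columns index Y.
    r = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
    px, py = r.sum(axis=1), r.sum(axis=0)

    mi = sum(r[i, j] * np.log2(r[i, j] / (px[i] * py[j]))
             for i in range(2) for j in range(2) if r[i, j] > 0)
    print("I(X,Y) directly      :", mi)
    print("H(X) + H(Y) - H(X,Y) :", H(px) + H(py) - H(r))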
Jensen's inequality
A key theorem for the reader interested in proving some of the results in the previous
sections is Jensen's inequality. If a function f is convex (∩), and X is a random variable, then

E(f(X)) ≤ f(E(X)).  (A.35)

Furthermore, if f is strictly convex, equality implies that X is constant. This inequality is intuitively obvious by thinking in terms of the center of gravity of a set of points on the curve f. The center of gravity of f(x_1), . . . , f(x_n) is below f(x̄), where x̄ is the center of gravity of x_1, . . . , x_n. As a special important case, E(log X) ≤ log(E(X)).
This, for instance, immediately yields the properties of the relative entropy.
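As a small numerical illustration (ours), the special case E(log X) ≤ log(E(X)) can be checked by simulation for a positive random variable:

    import numpy as np

    # Since log is convex (cap), Jensen's inequality gives E[log X] <= log E[X].
    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=100_000)   # a positive random variable
    print("E[log X] =", np.log(x).mean())          # about 0.12
    print("log E[X] =", np.log(x.mean()))          # about log(2) ~ 0.69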
a century ago (Blahut 1987; Cover and Thomas 1991; McEliece 1977; Shannon
1948a,b). According to Shannon, the information contained in a data set D is given by -log P(D), and the average information over the family D of possible data sets is the entropy H(P(D)).
Shannon's theory of information, although eminently successful for the develop-
ment of modern computer and telecommunication technologies, does not capture
subjective and semantic aspects of information that are not directly related to its
transmission. As pointed out in the title of Shannon's seminal article, it is a theory of
communication, in the sense of transmission rather than information. It concentrates
on the problem of reproducing at one point either exactly or approximately a mes-
sage selected at another point regardless of the relevance of the message. But there is
clearly more to information than data reproducibility and somehow information ought
to depend also on the observer. Consider for instance the genomic DNA sequence of
the AIDS virus. It is a string of about 10 000 letters over the four-letter DNA alphabet,
of great significance to researchers in the biological or medical sciences, but utterly uninspiring to a layman. The limitations of Shannon's definition may in part explain
why the theory has not been as successful as one would have hoped in other areas of
science such as biology, psychology, economics, or the Web.
Shannon's theory fails to account for how data can have different significance for different observers. This is rooted in the origin of the probabilities used in the definition of information.
These probabilities are defined according to an observer or a model M (the 'Bell Labs engineer' (Jaynes 2003)) which Shannon does not describe explicitly, so that the information in a data set is rather the negative log-likelihood -log P(D | M), and the corresponding entropy is the average over all data sets

I(D, M) = H(P(D \mid M)) = -\int_{\mathcal{D}} P(D \mid M) \log P(D \mid M) \, dD.  (A.37)
There are situations, however, characterized by the presence of multiple models and/or observers and where the subjective/semantic dimensions of the data are more important than their transmission.
In a Web context, imagine surfing the Web in search of a car and stumbling on a picture of Marilyn Monroe. The Shannon information contained in the picture depends on the picture resolution, whether it is color or black and white, etc. In this situation, it is probably a secondary consideration. More important are the facts that the picture is unexpected, i.e. surprising, and irrelevant to the goal of finding a car. Thus there are at least three different aspects of information contained in data: the transmission or Shannon's information, the surprise, and the relevance. We now provide a precise definition of surprise.
Surprise
The effect of the information contained in D is clearly to change the belief of the
observer from P (M) to P (M | D). Thus, a complementary way of measuring infor-
mation carried by the data D is to measure the distance between the prior and the
posterior. To distinguish it from Shannon's communication information, we call this notion of information the surprise information or surprise (Baldi 2002).
Alternatively, we can define the single model surprise by the log-odds ratio

S(D, M) = \log \frac{P(M)}{P(M \mid D)}  (A.40)

and the surprise by its average

S(D, \mathcal{M}) = \int_{\mathcal{M}} S(D, M) \, P(M) \, dM,  (A.41)
taken with respect to the prior distribution over the model class. Unlike Shannon's entropy, which requires integration over the space of data, surprise is a dual notion that requires integration over the space of models.
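As a hedged illustration (ours, not from the text), the sketch below evaluates this average numerically for N binary observations under a single-parameter model class with a conjugate Beta prior; the choice of a Beta(1, 1) prior, the helper names, and the simple numerical integration are assumptions made for the example.

    import numpy as np
    from math import gamma

    def beta_pdf(x, a, b):
        """Density of the Beta(a, b) distribution."""
        return x**(a - 1) * (1 - x)**(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

    def surprise(n_ones, n_zeros, a=1.0, b=1.0):
        """Average of Eq. (A.40) over a Beta(a, b) prior on the single parameter x:
        the relative entropy between prior and posterior (numerical integration)."""
        xs = np.linspace(1e-6, 1 - 1e-6, 20_000)
        dx = xs[1] - xs[0]
        p = beta_pdf(xs, a, b)                          # prior
        q = beta_pdf(xs, a + n_ones, b + n_zeros)       # posterior
        return float(np.sum(p * np.log(p / q)) * dx)

    print(surprise(8, 2))   # skewed outcomes: prior and posterior differ more (~3.8 nats)
    print(surprise(5, 5))   # balanced outcomes: smaller, but nonzero, surprise (~2.1 nats)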
Note that this definition addresses the 'TV snow' paradox: snow, the most boring of all television programs, carries the largest amount of Shannon information in terms of exact reproducibility. At the time of snow onset, the image distribution we expect and the image we perceive are very different and therefore the snow carries a great deal of both surprise and Shannon's information. Indeed snow may be a sign of storm, earthquake, toddler's curiosity, or military putsch. But after a few seconds, once our model of the image shifts toward a 'snow' model of random pixels, television snow perfectly fits the prior and hence becomes boring. Since the prior and the posterior are virtually identical, snow frames carry zero surprise although carrying megabytes of Shannon's information.
Computing surprise
To measure surprise in the simplest setting, consider a data set consisting of N binary points. The simplest class M(x) of models contains a single parameter x, the
Relevance
Surprise is a measure of dissimilarity between the prior and posterior distributions
and as such it lies close to the axiomatic foundation of Bayesian probability. Surprise
is different from other definitions of information that have been proposed (Aczel and Daroczy 1975) as alternatives to Shannon's entropy. Most alternative definitions, such as Rényi's entropies, are actually algebraic variations on Shannon's definition rather than conceptually different approaches.
Scoring items by surprise provides a general principle for the rapid detection and
ranking of unusual events and the construction of saliency maps, in any feature space,
that can guide the deployment of attention (Itti and Koch 2001; Nothdurft 2000;
Olshausen et al. 1993) and other rapid filtering mechanisms in natural or synthetic
information processing systems.
The notion of surprise, however, has its own limitations. In particular, it does not
capture all the semantics/relevance aspects of data. When the degree of surprise of the
data with respect to the model class becomes low, the data are no longer informative
for the given model class. This, however, does not necessarily imply that one has
a good model, since the model class itself could be unsatisfactory and in need of a
complete overhaul. Conversely, highly surprising data could be a sign that learning is
required, or that the data are irrelevant, as in the case of the TV snow or the Marilyn
Monroe picture.
Thus, relevance, surprise, and Shannon's entropy are three different facets of information that can be present in different combinations. The notion of relevance in
particular seems to be the least understood although there have been several attempts
(Jumarie 1990; Tishby et al. 1999). A possible direction is to consider, in addition to
the space of data and models, a third space A of actions or interpretations and define
relevance as the relative entropy between the prior P (A) and the posterior P (A | D)
distributions over A. Whether this approach simply shifts the problem into the definition of the set A remains to be seen. In any event, the quest to understand the nature
of information, and in particular of semantic relevance, goes well beyond the domain
of Web applications and is far from being over.
Appendix B
General Guidelines
While we use the following guidelines, occasional exceptions are possible and clearly
indicated in the text. In general, vectors and matrices are in bold face, with matrices
represented by capital letters and vectors by lowercase letters. Unless otherwise specified, all vectors are column vectors by default. In general, Greek letters represent parameters.
X = (x_{ij})   matrix
x = (x_i)   vector
X^T   transpose of X
tr X   trace of X
|X| = det X   determinant of X
D data. In a typical unsupervised case the data are a matrix
D = (X) with one example per row. In a typical
supervised case, the data are a double matrix
D = (X, Y ), where Y denotes the targets, with one
example per row.
M (M(θ))   model (model with parameter θ)
θ   parameters for some model M
θ̂ (θ̂_ML, θ̂_MP, θ̂_MAP)   parameter estimates (using maximum likelihood, maximum a posteriori and mean posterior estimates)
M the universe of models under consideration
D the universe of possible data sets
n number of training examples
m number of dimensions
Probabilities
P(·) (Q, R, . . .)   probability (probability density functions)
E[·] (E_Q[·])   expectation (expectation with respect to Q)
var[·]   variance
cov[·]   covariance
P (x1 , . . . , xn ) probability that X1 = x1 , . . . , Xn = xn .
P (X | Y )(E[X | Y ]) conditional probability (conditional expectation)
I background information
N(μ, σ), N(μ, C), N(μ, σ²), N(x; μ, σ²)   Normal (or Gaussian) density with mean μ and variance σ², or covariance matrix C
γ(w | α, λ)   Gamma density with parameters α and λ
B(n, p)   Binomial distribution with n independent Bernoulli trials, each with probability p of success
P(λ)   Poisson distribution with parameter λ
D_{αq}, D_u   Dirichlet distribution with parameters α and q, or u (u_i = αq_i, q_i ≥ 0, and Σ_i q_i = 1)
Functions
E energy, error, negative log-likelihood or log-posterior
(depending on context)
L Lagrangian
H(p), H (X) entropy of the probability vector p, or the random
variable X/differential entropy in continuous case
H (p, q), H (X, Y ) relative entropy between the probability vectors p and q,
or the random variables X and Y
Abbreviations
DNS domain name service
EM expectation maximization
HMM hidden Markov model
References
Abello, J., Buchsbaum, A. and Westbrook, J. 1998 A functional approach to external graph
algorithms. In Proc. 6th Eur. Symp. on Algorithms, pp. 332–343.
Achacoso, T. B. and Yamamoto, W. S. 1992 AY's Neuroanatomy of C. elegans for Computation.
Boca Raton, FL: CRC Press.
Aczel, J. and Daroczy, Z. 1975 On measures of information and their characterizations. New
York: Academic Press.
Adamic, L., Lukose, R. M., Puniyani, A. R. and Huberman, B. A. 2001 Search in power-law
networks. Phys. Rev. E 64, 046135.
Aggarwal, C. C., Al-Garawi, F. and Yu, P. S. 2001 Intelligent crawling on the World Wide Web
with arbitrary predicates. In Proc. 10th Int. World Wide Web Conf., pp. 96105.
Aiello, W., Chung, F. and Lu, L. 2001 A random graph model for power law graphs. Experimental Math. 10, 53–66.
Aji, S. M. and McEliece, R. J. 2000 The generalized distributive law. IEEE Trans. Inform.
Theory 46, 325343.
Albert, R. and Barabási, A.-L. 2000 Topology of evolving networks: local events and universality. Phys. Rev. Lett. 85, 5234–5237.
Albert, R., Jeong, H. and Barabási, A.-L. 1999 Diameter of the World-Wide Web. Nature 401,
130.
Albert, R., Jeong, H. and Barabási, A.-L. 2000 Error and attack tolerance of complex networks. Nature 406, 378–382.
Allwein, E. L., Schapire, R. E. and Singer, Y. 2000 Reducing multiclass to binary: a unifying
approach for margin classifiers. In Proc. 17th Int. Conf. on Machine Learning, pp. 9–16.
San Francisco, CA: Morgan Kaufmann.
Amaral, L. A. N., Scala, A., Barthélémy, M. and Stanley, H. E. 2000 Classes of small-world networks. Proc. Natl Acad. Sci. 97, 11 149–11 152.
Amento, B., Terveen, L. and Hill, W. 2000 Does authority mean quality? Predicting expert
quality ratings of Web documents. In Proc. 23rd Ann. Int. ACM SIGIR Conf. on Research
and Development in Information Retrieval, pp. 296303. New York: ACM Press.
Anderson, C. R., Domingos, P. and Weld, D. 2001 Adaptive Web navigation for wireless devices.
In Proc. 17th Int. Joint Conf. on Articial Intelligence, pp. 879884. San Francisco, CA:
Morgan Kaufmann.
Anderson, C. R., Domingos, P. and Weld, D. 2002 Relational markov models and their applica-
tion to adaptive Web navigation. In Proc. 8th Int. Conf. on Knowledge Discovery and Data
Mining, pp. 143152. New York: ACM Press.
Berry, M. W. 1992 Large scale singular value computations. J. Supercomput. Applic. 6, 13–49.
Berry, M. W. and Browne, M. 1999 Understanding Search Engines: Mathematical Modeling
and Text Retrieval. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Bharat, K. and Broder, A. 1998 A technique for measuring the relative size and overlap of
public Web search engines. In Proc. 7th Int. World Wide Web Conf., Brisbane, Australia,
pp. 379388.
Bharat, K. and Henzinger, M. R. 1998 Improved algorithms for topic distillation in a hyperlinked
environment. In Proc. 21st Ann Int. ACM SIGIR Conf. on Research and Development in
Information Retrieval, pp. 104111. New York: ACM Press.
Bianchini, M., Gori, M. and Scarselli, F. 2001 Inside Google's Web page scoring system. Technical report, Dipartimento di Ingegneria dell'Informazione, Università di Siena.
Bikel, D. M., Miller, S., Schwartz, R. and Weischedel, R. 1997 Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, pp. 194–201. (Available at http://
citeseer.nj.nec.com/bikel97nymble.html.)
Billsus, D. and Pazzani, M. 1998 Learning collaborative information filters. In Proc. Int. Conf. on Machine Learning, pp. 46–54. San Francisco, CA: Morgan Kaufmann.
Blahut, R. E. 1987 Principles and Practice of Information Theory. Reading, MA: Addison-
Wesley.
Blei, D., Ng, A. Y. and Jordan, M. I. 2002a Hierarchical Bayesian models for applications in
information retrieval. In Bayesian Statistics 7 (ed. J. M. Bernardo, M. Bayarri, J. O. Berger,
A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West). Oxford University Press.
Blei, D., Ng, A. Y. and Jordan, M. I. 2002b Latent Dirichlet allocation. In Advances in Neural
Information Processing Systems 14 (ed. T. Dietterich, S. Becker and Z. Ghahramani). San
Francisco, CA: Morgan Kaufmann.
Blum, A. and Mitchell, T. 1998 Combining labeled and unlabeled data with co-training. In
Proc. 11th Ann. Conf. on Computational Learning Theory (COLT-98), pp. 92100. New
York: ACM Press.
Bollacker, K. D., Lawrence, S. and Giles, C. L. 1998 CiteSeer: an autonomous Web agent for
automatic retrieval and identification of interesting publications. In Proc. 2nd Int. Conf. on
Autonomous Agents (Agents98) (ed. K. P. Sycara and M. Wooldridge), pp. 116123. New
York: ACM Press.
Bollobás, B. 1985 Random Graphs. London: Academic Press.
Bollobás, B. and de la Vega, W. F. 1982 The diameter of random regular graphs. Combinatorica 2, 125–134.
Bollobás, B. and Riordan, O. 2003 The diameter of a scale-free random graph. Combinatorica. (In the press.)
Bollobás, B., Riordan, O., Spencer, J. and Tusnády, G. 2001 The degree sequence of a scale-free random graph process. Random Struct. Alg. 18, 279–290.
Borodin, A., Roberts, G. O., Rosenthal, J. S. and Tsaparas, P. 2001 Finding authorities and
hubs from link structures on the World Wide Web. In Proc. 10th Int. Conf. on World Wide
Web, pp. 415429.
Box, G. E. P. and Tiao, G. C. 1992 Bayesian Inference In Statistical Analysis. John Wiley &
Sons, Ltd/Inc.
Boyan, J., Freitag, D. and Joachims, T. 1996 A machine learning architecture for optimizing
Web search engines. In Proc. AAAI Workshop on Internet-Based Information Systems.
Brand, M. 2002 Incremental singular value decomposition of uncertain data with missing
values. In Proc. Eur. Conf. on Computer Vision (ECCV): Lecture Notes in Computer Science,
pp. 707720. Springer.
Bray, T. 1996 Measuring the Web. In Proc. 5th Int. Conf. on the World Wide Web, 6–10 May 1996, Paris, France. Comp. Networks 28, 993–1005.
Breese, J. S., Heckerman, D. and Kadie, C. 1998 Empirical analysis of predictive algorithms for
collaborative filtering. In Proc. 14th Conf. on Uncertainty in Artificial Intelligence, pp. 43–52. San Francisco, CA: Morgan Kaufmann.
Brewington, B. and Cybenko, G. 2000 How dynamic is the Web? Proc. 9th Int. World Wide
Web Conf. Geneva: International World Wide Web Conference Committee (IW3C2).
Brin, S. and Page, L. 1998 The anatomy of a large-scale hypertextual (Web) search engine. In
Proc. 7th Int. World Wide Web Conf. (WWW7). Comp. Networks 30, 107117.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and
Wiener, J. 2000 Graph structure in the Web. In Proc. 9th Int. World Wide Web Conf. (WWW9).
Comp. Networks 33, 309320.
Brown, L. D. 1986 Fundamentals of Statistical Exponential Families. Hayward, CA: Institute
of Mathematical Statistics.
Bucklin, R. E. and Sismeiro, C. 2003 A model of Web site browsing behavior estimated on
clickstream data. (In the press.)
Buntine, W. 1992 Learning classication trees. Statist. Comp. 2, 6373.
Buntine, W. 1996 A guide to the literature on learning probabilistic networks from data. IEEE
Trans. Knowl. Data Engng 8, 195210.
Byrne, M. D., John, B. E., Wehrle, N. S. and Crow, D. C. 1999 The tangled Web we wove: a
taskonomy of WWW use. In Proc. CHI99: Human Factors in Computing Systems, pp. 544
551. New York: ACM Press.
Cadez, I. V., Heckerman, D., Smyth, P., Meek, C. and White, S. 2003 Model-based clustering
and visualization of navigation patterns on a Web site. Data Mining Knowl. Discov. (In the
press.)
Califf, M. E. and Mooney, R. J. 1998 Relational learning of pattern-match rules for information
extraction. Working Notes of AAAI Spring Symp. on Applying Machine Learning to Discourse
Processing, pp. 611. Menlo Park, CA: AAAI Press.
Callaway, D. S., Hopcroft, J. E., Kleinberg, J., Newman, M. E. J. and Strogatz, S. H. 2001 Are
randomly grown graphs really random? Phys. Rev. E 64, 041902.
Cardie, C. 1997 Empirical methods in information extraction. AI Mag. 18, 6580.
Carlson, J. M. and Doyle, J. 1999 Highly optimized tolerance: a mechanism for power laws in
designed systems. Phys. Rev. E 60, 14121427.
Castelli, V. and Cover, T. 1995 On the exponential value of labeled samples. Pattern Recog.
Lett. 16, 105111.
Catledge, L. D. and Pitkow, J. 1995 Characterizing browsing strategies in the World-Wide
Web. Comp. Networks ISDN Syst. 27, 10651073.
Chaitin, G. J. 1987 Algorithmic Information Theory. Cambridge University Press.
Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Kumar, S. R., Raghavan, P., Rajagopalan
S. and Tomkins, A. 1999a Mining the link structure of the World Wide Web. IEEE Computer
32, 6067.
Chakrabarti, S., Joshi, M. M., Punera, K. and Pennock, D. M. 2002 The structure of broad
topics on the Web. In Proc. 11th Int. Conf. on World Wide Web, pp. 251262. New York:
ACM Press.
Chakrabarti, S., van den Berg, M. and Dom, B. 1999b Focused crawling: a new approach to
topic-specic Web resource discovery. In Proc. 8th Int. World Wide Web Conf., Toronto.
Comp. Networks 31, 1116.
Charniak, E. 1991 Bayesian networks without tears. AI Mag. 12, 5063.
Chen, S. F. and Goodman, J. 1996 An empirical study of smoothing techniques for language
modeling. In Proc. 34th Ann. Meeting of the Association for Computational Linguistics (ed.
A. Joshi and M. Palmer), pp. 310318. San Francisco, CA: Morgan Kaufmann.
Dhillon, I. S. and Modha, D. S. 2001 Concept decompositions for large sparse text data using
clustering. Machine Learning 42, 143175.
Dietterich, T. G. and Bakiri, G. 1995 Solving multiclass learning problems via error-correcting
output codes. J. Articial Intelligence Research 2, 263286.
Dijkstra, E. W. 1959 A note on two problems in connexion with graphs. Numerische Mathematik 1, 269–271.
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L. and Gori, M. 2000 Focused crawling
using context graphs. In VLDB 2000, Proc. 26th Int. Conf. on Very Large Data Bases,
1014 September 2000, Cairo, Egypt (ed. A. El Abbadi, M. L. Brodie, S. Chakravarthy,
U. Dayal, N. Kamel, G. Schlageter and K. Y. Whang), pp. 527534. Los Altos, CA: Morgan
Kaufmann.
Dill, S., Kumar, S. R., McCurley, K. S., Rajagopalan, S., Sivakumar, D. and Tomkins, A. 2001
Self-similarity in the Web. In Proc. 27th Very Large Databases Conf., pp. 6978.
Domingos, P. and Pazzani, M. 1997 On the optimality of the simple Bayesian classier under
zero-one loss. Machine Learning 29, 103130.
Domingos, P. and Richardson, M. 2001 Mining the network value of customers. In Proc. ACM
7th Int. Conf. on Knowledge Discovery and Data Mining, pp. 5766. New York: ACM Press.
Dreilinger, D. and Howe, A. E. 1997 Experiences with selecting search engines using
metasearch. ACM Trans. Informat. Syst. 15, 195222.
Drucker, H., Vapnik, V. N. and Wu, D. 1999 Support vector machines for spam categorization.
IEEE Trans. Neural Networks 10, 10481054.
Duda, R. O. and Hart, P. E. 1973 Pattern Classication and Scene Analysis. John Wiley &
Sons, Ltd/Inc.
Dumais, S., Platt, J., Heckerman, D. and Sahami, M. 1998 Inductive learning algorithms and
representations for text categorization. In Proc. 7th Int. Conf. on Information and Knowledge
Management, pp. 148155. New York: ACM Press.
Jones, K. S. and Willett, P. (eds) 1997 Readings in information retrieval. San Mateo, CA:
Morgan Kaufmann.
Edwards, J., McCurley, K. and Tomlin, J. 2001 An adaptive model for optimizing performance
of an incremental Web crawler. In Proc. 10th Int. World Wide Web Conf., pp. 106113.
Elias, P. 1975 Universal codeword sets and representations of the integers. IEEE Trans. Inform.
Theory 21, 194203.
Erdős, P. and Rényi, A. 1959 On random graphs. Publ. Math. Debrecen 6, 290–291.
Erdős, P. and Rényi, A. 1960 On the evolution of random graphs. Magy. Tud. Akad. Mat. Kut. Intez. Kozl. 5, 17–61.
Everitt B. S. 1984 An Introduction to Latent Variable Models. London: Chapman & Hall.
Evgeniou, T., Pontil, M. and Poggio, T. 2000 Regularization networks and support vector
machines. Adv. Comput. Math. 13, 150.
Fagin, R., Karlin, A., Kleinberg, J., Raghavan, P., Rajagopalan, S., Rubinfeld, R., Sudan., M.
and Tomkins, A. 2000 Random walks with back buttons. In Proc. ACM Symp. on Theory
of Computing, pp. 484493. New York: ACM Press.
Faloutsos, C. and Christodoulakis, S. 1984 Signature les: an access method for documents
and its analytical performance evaluation. ACM Trans. Informat. Syst. 2, 267288.
Faloutsos, M., Faloutsos, P. and Faloutsos, C. 1999 On power-law relationships of the Internet
topology. In Proc. ACM SIGCOMM Conf., Cambridge, MA, pp. 251262.
Feller, W. 1971 An Introduction to Probability Theory and Its Applications, 2nd edn, vol. 2.
John Wiley & Sons, Ltd/Inc.
Fermi, E. 1949 On the origin of the cosmic radiation. Phys. Rev. 75, 11691174.
Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T.
1999 Hypertext Transfer Protocol: HTTP/1.1. RFC 2616. (Available at http://www.ietf.org/
rfc/rfc2616.txt.)
Fienberg, S. E., Johnson, M. A. and Junker, B. J. 1999 Classical multilevel and Bayesian
approaches to population size estimation using multiple lists. J. R. Statist. Soc. A 162, 383
406.
Flake, G. W., Lawrence, S. and Giles, C. L. 2000 Efficient identification of Web communities. In Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 150–160.
New York: ACM Press.
Flake, G. W., Lawrence, S., Giles, C. L. and Coetzee, F. 2002 Self-organization and identification of communities. IEEE Computer 35, 66–71.
Fox, C. 1992 Lexical analysis and stoplists. In Information Retrieval: Data Structures and
Algorithms (ed. W. B. Frakes and R. Baeza-Yates), ch. 7. Englewood Cliffs, NJ: Prentice
Hall.
Fraley, C. and Raftery, A. E. 2002 Model-based clustering, discriminant analysis, and density
estimation. J. Am. Statist. Assoc. 97, 611631.
Freitag, D. 1998 Information extraction from HTML: Application of a general machine learning
approach. In Proc. AAAI-98, pp. 517523. Menlo Park, CA: AAAI Press.
Freitag, D. and McCallum, A. 2000 Information extraction with HMM structures learned by
stochastic optimization. AAAI/IAAI, pp. 584589.
Freund, Y. and Schapire, R. E. 1996 Experiments with a new boosting algorithm . In Proc. 13th
Int. Conf. on Machine Learning, pp. 148146. San Francisco, CA: Morgan Kaufmann.
Frey, B. J. 1998 Graphical Models for Machine Learning and Digital Communication. MIT
Press.
Friedman, N. and Goldszmidt, M. 1996 Learning Bayesian networks with local structure. In
Proc. 12th Conf. on Uncertainty in Articial Intelligence, Portland, Oregon (ed. E. Horwitz
and F. Jensen), pp. 274282. San Francisco, CA: Morgan Kaufmann.
Friedman, N., Getoor, L., Koller, D. and Pfeffer, A. 1999 Learning probabilistic relational
models. In Proc. 16th Int. Joint Conf. on Articial Intelligence (IJCAI-99) (ed. D. Thomas),
vol. 2 , pp. 13001309. San Francisco, CA: Morgan Kaufmann.
Fuhr, N. 1992 Probabilistic models in information retrieval. Comp. J. 35, 243255.
Galambos, J. 1987 The Asymptotic Theory of Extreme Order Statistics, 2nd edn. Malabar, FL:
Robert E. Krieger.
Garfield, E. 1955 Citation indexes for science: a new dimension in documentation through association of ideas. Science 122, 108–111.
Garfield, E. 1972 Citation analysis as a tool in journal evaluation. Science 178, 471–479.
Garner, R. 1967 A Computer Oriented, Graph Theoretic Analysis of Citation Index Structures.
Philadelphia, PA: Drexel University Press.
Gelbukh, A. and Sidorov, G. 2001 Zipf and Heaps Laws' coefficients depend on language. In Proc. 2001 Conf. on Intelligent Text Processing and Computational Linguistics, pp. 332–335.
Springer.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. 1995 Bayesian Data Analysis. London:
Chapman & Hall.
Ghahramani, Z. 1998 Learning dynamic Bayesian networks. In Adaptive Processing of
Sequences and Data Structures. Lecture Notes in Artical Intelligence (ed. M. Gori and
C. L. Giles), pp. 168197. Springer.
Ghahramani, Z. and Jordan, M. I. 1997 Factorial hidden Markov models. Machine Learning
29, 245273.
Ghani, R. 2000 Using error-correcting codes for text classication. In Proc. 17th Int. Conf. on
Machine Learning, pp. 303310. San Francisco, CA: Morgan Kaufmann.
Gibson, D., Kleinberg, J. and Raghavan, P. 1998 Inferring Web communities from link topol-
ogy. In Proc. 9th ACM Conf. on Hypertext and Hypermedia : Links, Objects, Time and
Spacestructure in Hypermedia Systems, pp. 225234. New York: ACM Press.
Gilbert, E. N. 1959 Random graphs. Ann. Math. Statist. 30, 11411144.
Gilbert, N. 1997 A simulation of the structure of academic science. Sociological Research
Online 2. (Available at http://www.socresonline.org.uk/socresonline/2/2/3.html.)
Gilks, W. R., Thomas, A. and Spiegelhalter, D. J. 1994 A language and program for complex
Bayesian modelling. The Statistician 43, 6978.
Greenberg, S. 1993 The Computer User as Toolsmith: The Use, Reuse, and Organization or
Computer-Based Tools. Cambridge University Press.
Guermeur, Y., Elisseeff, A. and Paugam-Mousy, H. 2000 A new multi-class SVM based on a
uniform convergence result. In Proc. IJCNN: Int. Joint Conf. on Neural Networks, vol. 4,
pp 41834188. Piscataway, NJ: IEEE Press.
Han, E. H., Karypis, G. and Kumar, V. 2001 Text categorization using weight-adjusted k-nearest
neighbor classification. In Proc. PAKDD-01, 5th Pacific-Asia Conference on Knowledge
Discovery and Data Mining (ed. D. Cheung, Q. Li and G. Williams). Lecture Notes in
Computer Science Series, vol. 2035, pp. 5365. Springer.
Han, J. and Kamber, M. 2001 Data Mining: Concepts and Techniques. San Francisco, CA:
Morgan Kaufmann.
Hand, D., Mannila, H. and Smyth, P. 2001 Principles of Data Mining. Cambridge, MA: MIT
Press.
Harman, D., Baeza-Yates, R., Fox, E. and Lee, W. 1992 Inverted les. In Information Retrieval,
Data Structures and Algorithms (ed. W. B. Frakes and R. A. Baeza-Yates), pp. 2843.
Englewood Cliffs, NJ: Prentice Hall.
Hastie, T., Tibshirani, R. and Friedman, J. 2001 Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer.
Heckerman, D. 1997 Bayesian networks for data mining. Data Mining Knowl. Discov. 1, 79
119.
Heckerman, D. 1998 A tutorial on learning with Bayesian networks. In Learning in Graphical
Models (ed. M. Jordan). Kluwer.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R. and Kadie, C. 2000 Dependency
networks for inference, collaborative ltering, and data visualization. J. Mach. Learn. Res.
1, 4975.
Hersovici, M., Jacovi, M., Maarek, Y. S., Pelleg, D., Shtalhaim, M. and Ur, S. 1998 The shark-
search algorithm an application: tailored Web site mapping. In Proc. 7th Int. World-Wide
Web Conf. Comp. Networks 30, 317326.
Heydon, A. and Najork, M. 1999 Mercator: a scalable, extensible Web crawler. World Wide
Web 2, 219229. (Available at http://research.compaq.com/SRC/mercator/research.html.)
Heydon, A. and Najork, M. 2001 High-performance Web crawling. Technical Report SRC 173.
Compaq Systems Research Center.
Hoffman, T. 1999 Probabilistic latent semantic indexing. In Proc. 22nd Ann. Int. ACM SIGIR
Conf. on Research and Development in Information Retrieval, pp. 5057. New York: ACM
Press.
Hofmann, T. 2001 Unsupervised learning by probabilistic latent semantic analysis. Machine
Learning 42, 177196.
Hofmann, T. and Puzicha, J. 1999 Latent class models for collaborative ltering. In Proc. 16th
Int. Joint Conf. on Articial Intelligence, pp. 688693.
Hofmann, T., Puzicha, J. and Jordan, M. I. 1999 Learning from dyadic data. In Advances in
Neural Information Processing Systems 11: Proc. 1998 Conf. (ed. M. S. Kearns, S. A. Solla
and D. Cohen), pp. 466472. Cambridge, MA: MIT Press.
Huberman, B. A. and Adamic, L. A. 1999 Growth dynamics of the World Wide Web. Nature
401, 131.
Huberman, B. A., Pirolli, P. L. T., Pitkow, J. E. and Lukose, R. M. 1998 Strong regularities in
World Wide Web surng. Science 280, 9597.
Hunter, J. and Shotland, R. 1974 Treating data collected by the small world method as a Markov
process. Social Forces 52, 321.
ISO 1986 Information Processing, Text and Office Systems, Standard Generalized Markup
Language (SGML), ISO 8879, 1st edn. Geneva, Switzerland: International Organization for
Standardization.
Itti, L. and Koch, C. 2001 Computational modelling of visual attention. Nature Rev. Neurosci.
2, 194–203.
Jaakkola, T. S. and Jordan, I. 1997 Recursive algorithms for approximating probabilities in
graphical models. In Advances in Neural Information Processing Systems (ed. M. C. Mozer,
M. I. Jordan and T. Petsche), vol. 9, pp. 487493. Cambridge, MA: MIT Press.
Jaeger, M. 1997 Relational Bayesian networks In Proc. 13th Conf. on Uncertainty in Articial
Intelligence (UAI-97) (ed. D. Geiger and P. P. Shenoy), pp. 266273. San Francisco, CA:
Morgan Kaufmann.
Janiszewski, C. 1998 The inuence of display characteristics on visual exploratory behavior.
J. Consumer Res. 25, 290301.
Jansen, B. J., Spink, A., Bateman, J. and Saracevic, T. 1998 Real-life information retrieval: a
study of user queries on the Web. SIGIR Forum 32, 517.
Jaynes, E. T. 1986 Bayesian methods: general background. In Maximum Entropy and Bayesian
Methods in Statistics (ed. J. H. Justice), pp. 125. Cambridge University Press.
Jaynes, E. T. 2003 Probability Theory: The Logic of Science. Cambridge University Press.
Jensen, F. V. 1996 An Introduction to Bayesian Networks. Springer.
Jensen, F. V., Lauritzen, S. L. and Olesen, K. G. 1990 Bayesian updating in causal probabilistic
networks by local computations. Comput. Statist. Q. 4, 269282.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. and Barabási, A.-L. 2000 The large-scale organization of metabolic networks. Nature 407, 651–654.
Joachims, T. 1997 A probabilistic analysis of the Rocchio algorithm with TFIDF for text
categorization. In Proc. 14th Int. Conf. on Machine Learning, pp. 143151. San Francisco,
CA: Morgan Kaufmann.
Joachims, T. 1998 Text categorization with support vector machines: learning with many rele-
vant features. In Proc. 10th Eur. Conf. on Machine Learning, pp. 137142. Springer.
Joachims, T. 1999a Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning (ed. B. Schölkopf, C. J. C. Burges and A. J. Smola), pp. 169–184. Cambridge, MA: MIT Press.
Joachims, T. 1999b Transductive inference for text classication using support vector machines.
In Proc. 16th Int. Conf. on Machine Learning (ICML), pp. 200209. San Francisco, CA:
Morgan Kaufmann.
Joachims, T. 2002 Learning to Classify Text using Support Vector Machines. Kluwer.
Jordan, M. I. (ed.) 1999 Learning in Graphical Models. Cambridge, MA: MIT Press.
Jordan, M. I., Ghahramani, Z. and Saul, L. K. 1997 Hidden Markov decision trees. In Advances
in Neural Information Processing Systems (ed. M. C. Mozer, M. I. Jordan and T. Petsche),
vol. 9, pp. 501507. Cambridge, MA: MIT Press.
Jumarie, G. 1990 Relative information. Springer.
Kask, K. and Dechter, R. 1999 Branch and bound with mini-bucket heuristics. In Proc. Int.
Joint Conf. on Articial Intelligence (IJCAI99), pp. 426433.
Kessler, M. 1963 Bibliographic coupling between scientic papers. Am. Documentat. 14, 10
25.
Killworth, P. and Bernard, H. 1978 Reverse small world experiment. Social Networks 1, 159.
Kira, K. and Rendell, L. A. 1992 A practical approach to feature selection. In Proc. 9th Int.
Conf. on Machine Learning, pp. 249256. San Francisco, CA: Morgan Kaufmann.
Kittler, J. 1986 Feature selection and extraction. In Handbook of Pattern Recognition and Image
Processing (ed. T. Y. Young and K. S. Fu), ch. 3. Academic.
Kleinberg, J. 1998 Authoritative sources in a hyperlinked environment. In Proc. 9th Ann. ACM
SIAM Symp. on Discrete Algorithms, pp. 668677. New York: ACM Press. (A preliminary
version of this paper appeared as IBM Research Report RJ 10076, May 1997.)
Kleinberg, J. 1999 Hubs, authorities, and communities. ACM Comput. Surv. 31, 5.
Kleinberg, J. 2000a Navigation in a small world. Nature 406, 845.
Kleinberg, J. 2000b The small-world phenomenon: an algorithmic perspective. In Proc. 32nd
ACM Symp. on the Theory of Computing.
Kleinberg, J. 2001 Small-world phenomena and the dynamic of information. Advances in
Neural Information Processing Systems (NIPS), vol. 14. Cambridge, MA: MIT Press.
Kleinberg, J. and Lawrence, S. 2001 The structure of the Web. Science 294, 18491850.
Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A. 1999 The Web as
a graph: measurements, models, and methods. In Proc. Int. Conf. on Combinatorics and
Computing. Lecture notes in Computer Science, vol. 1627. Springer.
Kohavi, R. and John, G. 1997 Wrappers for feature subset selection. Artif. Intel. 97, 273324.
Koller, D. and Sahami, M. 1997 Hierarchically classifying documents using very few words.
In Proc. 14th Int. Conf. on Machine Learning (ICML-97), pp. 170178. San Francisco, CA:
Morgan Kaufmann.
Koller, D. and Sahami, N. 1996 Toward optimal feature selection. In Proc. 13th Int. Conf. on
Machine Learning, pp. 284292.
Korte, C. and Milgram, S. 1978 Acquaintance networks between racial groups: application of
the small world method. J. Pers. Social Psych. 15, 101.
Koster, M. 1995 Robots in the Web: threat or treat? ConneXions 9(4).
Krishnamurthy, B., Mogul, J. C. and Kristol, D. M. 1999 Key differences between HTTP/1.0
and HTTP/1.1. In Proc. 8th Int. World-Wide Web Conf. Elsevier.
Kruger, A., Giles, C. L., Coetzee, F., Glover, E. J., Flake, G. W., Lawrence, S. and Omlin,
C. W. 2000 DEADLINER: building a new niche search engine. In Proc. 2000 ACMCIKM
International Conf. on Information and Knowledge Management (CIKM-00) (ed. A. Agah,
J. Callan and E. Rundensteiner), pp. 272281. New York: ACM Press.
Kullback, S. 1968 Information theory and statistics. New York: Dover.
Kumar, S. R., Raghavan, P., Rajagopalan, S. and Tomkins, A. 1999a Extracting large-scale
knowledge bases from the Web. Proc. 25th VLDB Conf. VLDB J., pp. 639650.
Kumar, S. R., Raghavan, P., Rajagopalan, S. and Tomkins, A. 1999b Trawling the Web for
emerging cyber communities. In Proc. 8th World Wide Web Conf. Comp. Networks 31,
1116.
Kumar, S. R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A. and Upfal, E. 2000
Stochastic models for the Web graph. In Proc. 41st IEEE Ann. Symp. on the Foundations of
Computer Science, pp. 5765.
Kushmerick, N., Weld, D. S. and Doorenbos, R. B. 1997 Wrapper induction for information
extraction. In Proc. Int. Joint Conf. on Articial Intelligence (IJCAI), pp. 729737.
Lafferty, J., McCallum, A. and Pereira, F. 2001 Conditional random elds: probabilistic models
for segmenting and labeling sequence data. In Proc. 18th Int. Conf. on Machine Learning,
pp. 282289. San Francisco, CA: Morgan Kaufmann.
Lam, W. and Ho, C.Y. 1998 Using a generalized instance set for automatic text categorization. In
Proc. SIGIR-98, 21st ACM Int. Conf. on Research and Development in Information Retrieval
(ed. W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson and J. Zobel), pp. 8189.
New York: ACM Press.
Lang, K. 1995 Newsweeder: learning to filter news. In Proc. 12th Int. Conf. on Machine Learning (ed. A. Prieditis and S. J. Russell), pp. 331–339. San Francisco, CA: Morgan Kaufmann.
Langley, P. 1994 Selection of relevant features in machine learning. In Proc. AAAI Fall Symp.
on Relevance, pp. 140144.
Lau, T. and Horvitz, E. 1999 Patterns of search: analyzing and modeling Web query refinement. In Proc. 7th Int. Conf. on User Modeling, pp. 119–128. Springer.
Lauritzen, S. L. 1996 Graphical Models. Oxford University Press.
Lauritzen, S. L. and Spiegelhalter, D. J. 1988 Local computations with probabilities on graphical
structures and their application to expert systems. J. R. Statist. Soc. B 50, 157224.
Lawrence, S. 2001 Online or invisible? Nature 411, 521.
Lawrence, S. and Giles, C. L. 1998a Context and page analysis for improved Web search. IEEE
Internet Computing 2, 3846.
Lawrence, S. and Giles, C. L. 1998b Searching the World Wide Web. Science 280, 98100.
Lawrence, S. and Giles, C. L. 1999a Accessibility of information on the Web. Nature 400, 107–109.
Lawrence, S., Giles, C. L. and Bollacker, K. 1999 Digital libraries and autonomous citation
indexing. IEEE Computer 32, 6771.
Leek, T. R. 1997 Information extraction using hidden Markov models. Master's thesis, Uni-
versity of California, San Diego.
Lempel, R. and Moran, S. 2001 SALSA: the stochastic approach for link-structure analysis.
ACM Trans. Informat. Syst. 19, 131160.
Letsche, T. A. and Berry, M. W. 1997 Large-scale information retrieval with latent semantic
indexing. Information Sciences 100, 105137.
Lewis, D. D. 1992An evaluation of phrasal and clustered representations on a text categorization
task. In Proc. 15th Ann. Int. ACM SIGIR Conf. on Research and Development in Information
Retrieval, pp. 3750. New York: ACM Press.
Lewis, D. D. 1997 Reuters-21578 text categorization test collection. (Documentation and data
available at http://www.daviddlewis.com/resources/testcollections/reuters21578/.)
Lewis, D. D. 1998 Naive Bayes at forty: the independence assumption in information retrieval.
In Proc. 10th Eur. Conf. on Machine Learning, pp. 415. Springer.
Lewis, D. D. and Catlett, J. 1994 Heterogeneous uncertainty sampling for supervised learning.
In Proc. ICML-94, 11th Int. Conf. on Machine Learning (ed. W. W. Cohen and H. Hirsh),
pp. 148156. San Francisco, CA: Morgan Kaufmann.
Lewis, D. D. and Gale, W. A. 1994 A sequential algorithm for training text classiers. In Proc.
17th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval,
pp. 312. Springer.
Lewis, D. D. and Ringuette, M. 1994 Comparison of two learning algorithms for text categoriza-
tion. In Proc. 3rd Ann. Symp. on Document Analysis and Information Retreval, pp. 8193.
Li, S., Montgomery, A., Srinivasan, K. and Liechty, J. L. 2002 Predicting online purchase
conversion using Web path analysis. Graduate School of Industrial Administration, Carnegie
Mellon University, Pittsburgh, PA. (Available at http://www.andrew.cmu.edu/alm3/papers/
purchase%20conversion.pdf.)
Li, W. 1992 Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inform. Theory 38, 1842–1845.
Lieberman, H. 1995 Letizia: an agent that assists Web browsing. In Proc. 14th Int. Joint Conf. on
Articial Intelligence (IJCAI-95) (ed. C. S. Mellish), pp. 924929. San Mateo, CA: Morgan
Kaufmann.
Little, R. J. A. and Rubin, D. B. 1987 Statistical Analysis with Missing Data. John Wiley &
Sons, Ltd/Inc.
Liu, H. and Motoda, H. 1998 Feature Selection for Knowledge Discovery and Data Mining.
Kluwer Academic.
Lovins, J. B. 1968 Development of a stemming algorithm. Mech. Transl. Comput. Linguistics
11, 2231.
McCallum, A. and Nigam, K. 1998 A comparison of event models for naive Bayes text classi-
cation. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 4148. Menlo
Park, CA: AAAI Press.
McCallum A., Freitag, D. and Pereira, F. 2000a Maximum entropy Markov models for informa-
tion extraction and segmentation. In Proc. 17th Int. Conf. on Machine Learning, pp. 591598.
San Francisco, CA: Morgan Kaufmann.
McCallum, A., Nigam, K. and Ungar, L. H. 2000b Efcient clustering of high-dimensional
data sets with application to reference matching. In Proc. 6th ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining, pp. 169178. New York: ACM Press.
McCallum, A. K., Nigam, K., Rennie, J. and Seymore, K. 2000c Automating the construction
of Internet portals with machine learning. Information Retrieval 3, 127163.
McCann, K., Hastings, A. and Huxel, G. R. 1998 Weak trophic interactions and the balance of
nature. Nature 395, 794798.
McClelland, J. L. and Rumelhart, D. E. 1986 Parallel Distributed Processing: Explorations in
the Microstructure of Cognition. Cambridge, MA: MIT Press.
McEliece, R. J. 1977 The Theory of Information and Coding. Reading, MA: Addison-Wesley.
McEliece, R. J. and Yildirim, M. 2002 Belief propagation on partially ordered sets. In Math-
ematical Systems Theory in Biology, Communications, and Finance (ed. D. Gilliam and J.
Rosenthal). Institute for Mathematics and its Applications, University of Minnesota.
McEliece, R. J., MacKay, D. J. C. and Cheng, J. F. 1997 Turbo decoding as an instance of
Pearl's belief propagation algorithm. IEEE J. Select. Areas Commun. 16, 140–152.
MacKay, D. J. C. and Peto, L. C. B. 1995a A hierarchical Dirichlet language model. Natural
Language Engng 1, 119.
McLachlan, G. and Peel, D. 2000 Finite Mixture Models. John Wiley & Sons, Ltd/Inc.
Mahmoud, H. M. and Smythe, R. T. 1995 A survey of recursive trees. Theory Prob. Math.
Statist. 51, 127.
Manber, U. and Myers, G. 1990 Sufx arrays: a new method for on-line string searches. In
Proc. 1st Ann. ACMSIAM Symp. on Discrete Algorithms, pp. 319327. Philadelphia, PA:
Society for Industrial and Applied Mathematics.
Mandelbrot, B. 1977 Fractals: Form, Chance, and Dimension. New York: Freeman.
Marchiori, M. 1997 The quest for correct information on the Web: hyper search engines. In
Proc. 6th Int. World-Wide Web Conf., Santa Clara, CA. Comp. Networks 29, 12251235.
Mark, E. F. 1988 Searching for information in a hypertext medical handbook. Commun ACM
31, 880886.
Maron, M. E. 1961 Automatic indexing: an experimental inquiry. J. ACM 8, 404417.
Maslov, S. and Sneppen, K. 2002 Specicity and stability in topology of protein networks.
Science 296, 910913.
Melnik, S., Raghavan, S., Yang, B. and Garcia-Molina, H. 2001 Building a distributed full-text
index for the Web. ACM Trans. Informat. Syst. 19, 217241.
Mena, J. 1999 Data Mining your Website. Boston, MA: Digital Press.
Menczer, F. 1997 ARACHNID: adaptive retrieval agents choosing heuristic neighborhoods for
information discovery. In Proc. 14th Int. Conf. on Machine Learning, pp. 227235. San
Francisco, CA: Morgan Kaufmann.
Menczer, F. and Belew, R. K. 2000 Adaptive retrieval agents: internalizing local context and
scaling up to the Web. Machine Learning 39, 203242.
Mihail, M. and Papadimitriou, C. H. 2002 On the eigenvalue power law. In Randomization and
Approximation Techniques, Proc. 6th Int. Workshop, RANDOM 2002, Cambridge, MA, USA,
1315 September 2002 (ed. J. D. P. Rolim and S. P. Vadhan). Lecture Notes in Computer
Science, vol. 2483, pp. 254262. Springer.
Milgram, S. 1967 The small world problem. Psychology Today 1, 61.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. 2002 Network
motifs: simple building blocks of complex networks. Science 298, 824827.
Mitchell, T. 1997 Machine Learning. McGraw-Hill.
Mitzenmacher, M. 2002 A brief history of generative models for power law and lognormal
distributions. Technical Report, Harvard University, Cambridge, MA.
Moffat, A. and Zobel, J. 1996 Self-indexing inverted les for fast text retrieval. ACM Trans.
Informat. Syst. 14, 349379.
Montgomery, A. L. 2001 Applying quantitative marketing techniques to the Internet. Interfaces
30, 90108.
Mooney, R. J. and Roy, L. 2000 Content-based book recommending using learning for text
categorization. In Proc. 5th ACM Conf. on Digital Libraries, pp. 195204. New York: ACM
Press.
Mori, S., Suen, C. and Yamamoto, K. 1992 Historical review of OCR research and development. Proc. IEEE 80, 1029–1058.
Moura, E. S., Navarro, G. and Ziviani, N. 1997 Indexing compressed text. In Proc. 4th South
American Workshop on String Processing (ed. R. Baeza-Yates), International Informatics
Series, pp. 95111. Ottawa: Carleton University Press.
Najork, M. and Wiener, J. 2001 Breadth-rst search crawling yields high-quality pages. In
Proc. 10th Int. World Wide Web Conf., pp. 114118. Elsevier.
Neal, R. M. 1992 Connectionist learning of belief networks. Artif. Intel. 56, 71113.
Neville-Manning, C. and Reed, T. 1996 A PostScript to plain text converter. Technical report.
(Available at http://www.nzdl.org/html/prescript.html.)
Newman, M. E. J., Moore, C. and Watts, D. J. 2000 Mean-eld solution of the small-world
network model. Phys. Rev. Lett. 84, 32013204.
Ng, A. Y. and Jordan, M. I. 2002 On discriminative vs generative classiers: a comparison of
logistic regression and Naive Bayes. Advances in Neural Information Processing Systems
14. Proc. 2001 Neural Information Processing Systems (NIPS) Conference. MIT Press.
Ng, A. Y., Zheng, A. X. and Jordan, M. I. 2001 Stable algorithms for link analysis. In Proc.
24th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval,
pp. 258266. New York: ACM Press.
Nigam, K. and Ghani, R. 2000 Analyzing the effectiveness and applicability of co-training. In
Proc. 2000 ACMCIKM Int. Conf. on Information and Knowledge Management (CIKM-00)
(ed. A. Agah, J. Callan and E. Rundensteiner), pp. 8693. New York: ACM Press.
Nigam, K., McCallum A., Thrun, S. and Mitchell, T. 2000 Text classication from labeled and
unlabeled documents using EM. Machine Learning 39, 103134.
Nothdurft, H. 2000 Salience from feature contrast: additivity across dimensions. Vision Res.
40, 1183–1201.
Olshausen, B. A., Anderson, C. H. and Essen, D. C. V. 1993 A neurobiological model of
visual attention and invariant pattern recognition based on dynamic routing of information.
J. Neurosci. 13, 4700–4719.
Oltvai, Z. N. and Barabási, A.-L. 2002 Life's complexity pyramid. Science 298, 763–764.
O'Neill, E. T., McClain, P. D. and Lavoie, B. F. 1997 A methodology for sampling the World Wide Web. Annual Review of OCLC Research. (Available at http://www.oclc.org/research/
publications/arr/ 1997/oneill/o%27neillar98%0213.htm.)
Page, L., Brin, S., Motwani, R. and Winograd, T. 1998 The PageRank citation ranking: bring-
ing order to the Web. Technical report, Stanford University. (Available at http://www-
db.stanford.edu/backrub/pageranksub.ps.)
Paine, R. T. 1992 Food-web analysis through eld measurements of per capita interaction
strength. Nature 355, 7375.
Pandurangan, G., Raghavan, P. and Upfal, E. 2002 Using PageRank to characterize Web struc-
ture. In Proc. 8th Ann. Int. Computing and Combinatorics Conf. (COCOON). Lecture Notes
in Computer Science, vol. 2387, p. 330. Springer.
Papineni, K. 2001 Why inverse document frequency? Proc. North American Association for
Computational Linguistics, pp. 2532.
Passerini, A., Pontil, M. and Frasconi, P. 2002 From margins to probabilities in multiclass
learning problems. In Proc. 15th Eur. Conf. on Articial Intelligence (ed. F. van Harmelen).
Frontiers in Articial Intelligence and Applications Series. Amsterdam: IOS Press.
Pazzani, M. 1996 Searching for dependencies in Bayesian classiers. In Proc. 5th Int. Workshop
on Articial Intelligence and Statistics, pp. 239248. Springer.
Pearl, J. 1988 Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kauf-
mann.
Pennock, D. M., Flake, G. W., Lawrence, S., Glover, E. J. and Giles, C. L. 2002 Winners
don't take all: characterizing the competition for links on the Web. Proc. Natl Acad. Sci. 99, 5207–5211.
Perline, R. 1996 Zipf's law, the central limit theorem, and the random division of the unit interval. Phys. Rev. E 54, 220–223.
Pew Internet Project Report 2002 Search engines. (Available at http://www.pewinternet.org/
reports/toc.asp?Report=64.)
Phadke, A. G. and Thorp, J. S. 1988 Computer Relaying for Power Systems. John Wiley &
Sons, Ltd/Inc.
Philips, T. K., Towsley, D. F. and Wolf, J. K. 1990 On the diameter of a class of random graphs.
IEEE Trans. Inform. Theory 36, 285288.
Pimm, S. L., Lawton, J. H. and Cohen, J. E. 1991 Food web patterns and their consequences.
Nature 350, 669674.
Pittel, B. 1994 Note on the heights of random recursive trees and random m-ary search trees.
Random Struct. Algorithms 5, 337347.
Platt, J. 1999 Fast training of support vector machines using sequential minimal optimization.
In Advances in Kernel Methods: Support Vector Learning (ed. B. Schölkopf, C. J. C. Burges and A. J. Smola), pp. 185–208. Cambridge, MA: MIT Press.
Popescul, A., Ungar, L. H., Pennock, D. M. and Lawrence, S. 2001 Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proc. 17th Int. Conf. on Uncertainty in Artificial Intelligence, pp. 437–444. San Francisco, CA: Morgan Kaufmann.
Porter, M. 1980 An algorithm for suffix stripping. Program 14, 130–137.
Quinlan, J. R. 1986 Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J. R. 1990 Learning logical definitions from relations. Machine Learning 5, 239–266.
Rafiei, D. and Mendelzon, A. 2000 What is this page known for? Computing Web page reputations. In Proc. 9th World Wide Web Conf.
Raggett, D., Hors, A. L. and Jacobs, I. (eds) 1999 HTML 4.01 Specication. W3 Consortium
Recommendation. (Available at http://www.w3.org/TR/html4/.)
Sen, R. and Hansen, M. H. 2003 Predicting a Web user's next request based on log data. J. Computat. Graph. Stat. (In the press.)
Seneta, E. 1981 Nonnegative Matrices and Markov Chains. Springer.
Shachter, R. D. 1988 Probabilistic inference and influence diagrams. Oper. Res. 36, 589–604.
Shachter, R. D., Anderson, S. K. and Szolovits, P. 1994 Global conditioning for probabilistic inference in belief networks. In Proc. Conf. on Uncertainty in AI, pp. 514–522. San Francisco, CA: Morgan Kaufmann.
Shahabi, C., Banaei-Kashani, F. and Faruque, J. 2001 A framework for efficient and anonymous Web usage mining based on client-side tracking. In Proceedings of WEBKDD 2001. Lecture Notes in Artificial Intelligence, vol. 2356, pp. 113–144. Springer.
Shannon, C. E. 1948a A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423.
Shannon, C. E. 1948b A mathematical theory of communication. Bell Syst. Tech. J. 27, 623–656.
Shardanand, U. and Maes, P. 1995 Social information filtering: algorithms for automating word of mouth. In Proc. Conf. on Human Factors in Computing Systems, pp. 210–217.
Shore, J. E. and Johnson, R. W. 1980 Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inform. Theory 26, 26–37.
Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. 1998 Analysis of a very large
AltaVista query log. Technical Note 1998-14, Digital System Research Center, Palo Alto,
CA.
Slonim, N. and Tishby, N. 2000 Document clustering using word clusters via the information bottleneck method. In Proc. 23rd Int. Conf. on Research and Development in Information Retrieval, pp. 208–215. New York: ACM Press.
Slonim, N., Friedman, N. and Tishby, N. 2002 Unsupervised document classification using sequential information maximization. In Proc. 25th Int. Conf. on Research and Development in Information Retrieval, pp. 208–215. New York: ACM Press.
Small, H. 1973 Co-citation in the scientific literature: a new measure of the relationship between two documents. J. Am. Soc. Inf. Sci. 24, 265–269.
Smyth, P., Heckerman, D. and Jordan, M. I. 1997 Probabilistic independence networks for hidden Markov probability models. Neural Comp. 9, 227–267.
Soderland, S. 1999 Learning information extraction rules for semi-structured and free text. Machine Learning 34, 233–272.
Sperberg-McQueen, C. and Burnard, L. (eds) 2002 TEI P4: Guidelines for Electronic
Text Encoding and Interchange. Text Encoding Initiative Consortium. (Available at
http://www.tei-c.org/.)
Spink, A., Jansen, B. J., Wolfram, D. and Saracevic, T. 2002 From e-sex to e-commerce: Web search changes. IEEE Computer 35, 107–109.
Stallman, R. 1997 The right to read. Commun. ACM 40, 85–87.
Sutton, R. S. and Barto, A. G. 1998 Reinforcement Learning: An Introduction. Cambridge,
MA: MIT Press.
Tan, P. and Kumar, V. 2002 Discovery of Web robot sessions based on their navigational patterns. Data Mining Knowl. Discov. 6, 9–35.
Tantrum, J., Murua, A. and Stuetzle, W. 2002 Hierarchical model-based clustering of large
datasets through fractionation and refractionation. In Proc. 8th ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining. New York: ACM Press.
Taskar, B., Abbeel, P. and Koller, D. 2002 Discriminative probabilistic models for relational data. In Proc. 18th Conf. on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan Kaufmann.
Tauscher, L. and Greenberg, S. 1997 Revisitation patterns in World Wide Web navigation. In Proc. Conf. on Human Factors in Computing Systems CHI '97, pp. 97–137. New York: ACM Press.
Tedeschi, B. 2000 Easier to use sites would help e-tailers close more sales. New York Times,
12 June 2000.
Tishby, N., Pereira, F. and Bialek, W. 1999 The information bottleneck method. In Proc. 37th Ann. Allerton Conf. on Communication, Control, and Computing (ed. B. Hajek and R. S. Sreenivas), pp. 368–377.
Titterington, D. M., Smith, A. F. M. and Makov, U. E. 1985 Statistical Analysis of Finite Mixture
Distributions. John Wiley & Sons, Ltd/Inc.
Travers, J. and Milgram, S. 1969 An experimental study of the small world problem. Sociometry 32, 425.
Ungar, L. H. and Foster, D. P. 1998 Clustering methods for collaborative filtering. In Proc. Workshop on Recommendation Systems at the 15th National Conf. on Artificial Intelligence. Menlo Park, CA: AAAI Press.
Vapnik, V. N. 1982 Estimation of Dependences Based on Empirical Data. Springer.
Vapnik, V. N. 1995 The Nature of Statistical Learning Theory. Springer.
Vapnik, V. N. 1998 Statistical Learning Theory. John Wiley & Sons, Ltd/Inc.
Viterbi, A. J. 1967 Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inform. Theory 13, 260–269.
Walker, J. 2002 Links and power: the political economy of linking on the Web. In Proc. 13th Conf. on Hypertext and Hypermedia, pp. 72–73. New York: ACM Press.
Wall, L., Christiansen, T. and Schwartz, R. L. 1996 Programming Perl, 2nd edn. Cambridge, MA: O'Reilly & Associates.
Wasserman, S. and Faust, K. 1994 Social Network Analysis. Cambridge University Press.
Watts, D. J. and Strogatz, S. H. 1998 Collective dynamics of small-world networks. Nature 393, 440–442.
Watts, D. J., Dodds, P. S. and Newman, M. E. J. 2002 Identity and search in social networks. Science 296, 1302–1305.
Weiss, S. M., Apte, C., Damerau, F. J., Johnson, D. E., Oles, F. J., Goetz, T. and Hampp, T. 1999 Maximizing text-mining performance. IEEE Intell. Syst. 14, 63–69.
Weiss, Y. 2000 Correctness of local probability propagation in graphical models with loops. Neural Comp. 12, 1–41.
White, H. 1970 Search parameters for the small world problem. Social Forces 49, 259.
Whittaker, J. 1990 Graphical Models in Applied Multivariate Statistics. John Wiley & Sons,
Ltd/Inc.
Wiener, E. D., Pedersen, J. O. and Weigend, A. S. 1995 A neural network approach to topic spotting. In Proc. SDAIR-95, 4th Ann. Symp. on Document Analysis and Information Retrieval, Las Vegas, NV, pp. 317–332.
Witten, I. H., Moffat, A. and Bell, T. C. 1999 Managing Gigabytes: Compressing and Indexing
Documents and Images, 2nd edn. San Francisco, CA: Morgan Kaufmann.
Witten, I. H., Neville-Manning, C. and Cunningham, S. J. 1996 Building a digital library for
computer science research: technical issues. In Proc. Australasian Computer Science Conf.,
Melbourne, Australia.
Wolf, J., Squillante, M., Yu, P., Sethuraman, J. and Ozsen, L. 2002 Optimal crawling strategies for Web search engines. In Proc. 11th Int. World Wide Web Conf., pp. 136–147.
Xie, Y. and O'Hallaron, D. 2002 Locality in search engine queries and its implications for caching. In Proc. IEEE Infocom 2002, pp. 1238–1247. Piscataway, NJ: IEEE Press.
Yang, Y. 1999 An evaluation of statistical approaches to text categorization. Information Retrieval 1, 69–90.
Yang, Y. and Liu, X. 1999 A re-examination of text categorization methods. In Proc. SIGIR-99, 22nd ACM Int. Conf. on Research and Development in Information Retrieval (ed. M. A. Hearst, F. Gey and R. Tong), pp. 42–49. New York: ACM Press.
Yedidia, J., Freeman, W. T. and Weiss, Y. 2000 Generalized belief propagation. Neural Comp.
12, 141.
York, J. 1992 Use of the Gibbs sampler in expert systems. Artif. Intell. 56, 115–130.
Zamir, O. and Etzioni, O. 1998 Web document clustering: a feasibility demonstration. In Proc. 21st Int. Conf. on Research and Development in Information Retrieval (SIGIR), pp. 46–54. New York: ACM Press.
Zelikovitz, S. and Hirsh, H. 2001 Using LSI for text classification in the presence of background text. In Proc. 10th Int. ACM Conf. on Information and Knowledge Management, pp. 113–118. New York: ACM Press.
Zhang, T. and Iyengar, V. S. 2002 Recommender systems using linear classifiers. J. Machine Learn. Res. 2, 313–334.
Zhang, T. and Oles, F. J. 2000 A probability analysis on the value of unlabeled data for classification problems. In Proc. 17th Int. Conf. on Machine Learning, Stanford, CA, pp. 1191–1198.
Zhu, X., Yu, J. and Doyle, J. 2001 Heavy tails, generalized coding, and optimal Web layout. In Proc. 2001 IEEE INFOCOM Conf., vol. 3, pp. 1617–1626. Piscataway, NJ: IEEE Press.
Zukerman, I., Albrecht, D. W. and Nicholson, A. E. 1999 Predicting users' requests on the WWW. In Proc. UM99: 7th Int. Conf. on User Modeling, pp. 275–284. Springer.
Index
Dirichlet distribution 7, 188, 239–40, 250
Dirichlet prior 7, 12, 96, 97, 189
discrete distributions 237–8
discrete power-law (continuous) distributions 22, 23, 238–40
discriminative (decision-boundary) approach to classification 18
disordered lattice approach 63
distributed search algorithms 68–70
document clustering 116–20
  algorithm 117–9
  background and examples 116–7
  related approaches 119–20
document identifier (DID) 78–80
document similarity 83–4
Document Type Definition (DTD) 30
  frameset 32
  of HTML 31
  strict 32
  transitional 32
domain name service (DNS) 38, 49
domain name system 37–9
dominant eigenvalue 129
dominant eigenvector 129
dynamic coordination 159
e-commerce logs 212
eigengap 138
80/20 rule 20, 24
element name 31
element types 30
email 110–11
  filtering 17–18
  product recommendations 224–6
embarrassment level 169
energy function 5
entropy 243, 244–5
error function 5
error log 41
error rate 86
escape matrix 137
exchange 160
Excite search engine 202, 203, 204, 205
expectation 237
expectation maximization (EM) 5, 16
  algorithm 10–13, 21
  Naive Bayes and 112
exploratory browsing state 232
exponential distribution 22, 239, 240
exponential family 240–1
extensibility 33
eXtensible Markup Language (XML) 30
extreme value distribution 241
FASTUS system 120
feature selection 102–4
feature space 102
Fermi's model 25–7, 61
filters 102
finite-state Markov models 10, 14
firewall 160
first-in first-out (FIFO) policy 47
Fish algorithm 157
FOIL 115–6
frameset DTD 32
Freenet 73
frequentist interpretation of probability 2
freshness 167
Gamma distribution 22, 239, 240
Gaussian distribution 11, 22, 23, 238–9
Gentoo 89
geometric distribution 238–240
GET method 39, 43, 174
Gibbs sampling 15
Gnutella 70, 73
good citizen search engines 173
Google search engine 44, 53, 57, 225
grace period 160
gradient descent 5
graph theory 235–7
graph visiting vs crawling 49
Naive Bayes classifier 18, 94–7, 102, 105, 106
Naive Bayes model 21, 93, 96, 119, 220
nearest-neighbor collaborative filtering 215–8
  basic principles 215
  computational issues and clustering 218
  curse of dimensionality 217–8
  defining weights 216–7
negative binomial distribution 240
network value 227
networks and recommendations 224–8
  diffusion model 226–8
  email-based product recommendations 224–6
neural networks 14, 19, 93
news filtering 110–11
news stories, classification of 107–10
NewsWeeder 110–11
next search action 207
Nielsen 177, 212
noisy OR models 14
nonnegative matrices 128–30
nonstationary Markov chain 193
normal distribution 240
one-shot coding strategy 102
one-step ahead prediction 191
one vs all coding strategy 102
ontology 106
Open Directory Project (ODP) 44
Open System Interconnect (OSI) 36
optical character recognition (OCR) 81
optimal resource allocation 168
page request prediction 199–201
PageRank 134–8, 151
  models 67–8
  power law of 57–8
  stability of 139
panel data 212
parent based prediction 153–4
Pareto–Zipf distribution (Zipf) 22, 80
partitioning 160
path 35
pattern recognition 102
Pearson correlation coefficient 104
peer-to-peer networks 70, 73
perceptrons 19, 98
period 128
Perl programming language 40, 81
Perron–Frobenius theorem 128–30
Pew Internet Project Report 44
PHITS 116, 142
PLSA 116
Poisson distribution 22, 96, 194, 238, 240
polynomial time algorithms 71
polysemy 88
popularity 126, 150–1
Portable Document Format (PDF) 81
  tokenization 81
posterior distribution 247
PostScript 81
  tokenization of 81
power-law connectivity 53–5
power-law distribution 22–7, 52, 53
  origin of 25–7
power-law size 53
precision 86, 105
predicates 157
preferential attachment models 63–6
PreScript 81
primitive matrix 129