Search quality in 
practice 
Alexander Sibiryakov, ex-Yandex engineer, data scientist at Avast! 
sixty-one@yandex.ru 
1
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
2
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
3
• Search quality is an abstract term: it covers relevance and user experience, and reflects the overall effectiveness of search as perceived by humans. 
• Relevance, in search, is the measure of how well a retrieved document matches the user's information need. 
4
Relevance is subjective 
Users perceive relevance very subjectively; their judgement depends on: 
• the context of the problem they are trying to solve, 
• their awareness of the problem, 
• the user interface, 
• document annotations, 
• presentation form, 
• result order, 
• previous experience with the search system. 
5
Seznam.cz, new search UI with big screenshots 
6
images.yandex.ru - image search from yandex.ru 
7
Search system behavior 
can be learned by users 
• Seznam.cz has a very good document base for the Czech internet, bigger than Google's, but its ranking is less powerful and very sensitive to query formulation. 
• Yandex is very weak on software development queries, because of a lack of documents or poor ranking. 
8
Problems 
• No definitive formulation. Considerable uncertainty. Complex interdependencies. 
• We, developers, aren’t prepared to tackle search. We can’t manage a high-tech, step-changing, cross-functional, user-centered challenge. 
• The role of search in user experience is underestimated. Therefore, nobody measures it or knows how good it is. 
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
9
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
10
Poor search is bad for business and sad 
for society 
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
11
Search can be a source of information 
and inspiration 
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
12
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
13
Examples of search quality problems 
• Search for model numbers or article codes 
[6167 8362823] [61 67 8 362 823] 
(telescopic nozzle): requires proper tokenization 
• Detection and correction of typing errors 
[drzak myla] → [drzak mydla] 
(soap holder): lexical ambiguity 
• Question search 
[how to buy a used xperia] → [… smartphone] 
[how to buy a used fiat] → [… car] 
wrong weighting of important words. 
14
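One way to handle the model-number case above is to normalize digit runs at indexing and query time so that differently spaced forms match. A minimal sketch; the function name and regex are illustrative, not the tokenizer of any particular engine:

```python
import re

def normalize_digit_runs(text: str) -> str:
    """Remove whitespace between adjacent digit groups, so that
    '61 67 8 362 823' and '6167 8362823' produce the same token."""
    return re.sub(r'(?<=\d)\s+(?=\d)', '', text)

print(normalize_digit_runs("61 67 8 362 823"))   # 61678362823
print(normalize_digit_runs("6167 8362823"))      # 61678362823
```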
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
15
Evaluation of search 
• The foundation for improving a search system, 
• as usual, there is no ideal measure, 
• use multiple measures, 
• keep in mind the properties of each measure when making a decision. 
16
Evaluation of search: 
methods 
• Query-by-query comparison of two systems, 
• classic Cleverdon’s Cranfield evaluation, 
• Pairwise evaluation with Swiss system. 
17
Query-by-query comparison 
• Take random queries from the query stream, for example 100. 
• Query each system and evaluate the whole SERP of top-N results on a scale: 
++ (very good) 
+ (good) 
- (bad) 
-- (very bad) 
• Count the judgements of each type. 
18
Query-by-query 
comparison: example 
• Comparing Google and Bing 
[berlin buzzwords]: G ++, B + 
[java byteoutputstream]: G +, B - 
Totals: Google ++ ×1, + ×1; Bing + ×1, - ×1 
19
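A minimal sketch of the bookkeeping behind such a comparison; the judgement records below are the two example queries above and are otherwise hypothetical:

```python
from collections import Counter

# (query, system, grade) with grades '++', '+', '-', '--'
judgements = [
    ("berlin buzzwords",      "Google", "++"),
    ("berlin buzzwords",      "Bing",   "+"),
    ("java byteoutputstream", "Google", "+"),
    ("java byteoutputstream", "Bing",   "-"),
]

tallies = {}
for _query, system, grade in judgements:
    tallies.setdefault(system, Counter())[grade] += 1

for system, counts in tallies.items():
    print(system, dict(counts))
# Google {'++': 1, '+': 1}
# Bing {'+': 1, '-': 1}
```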
Cyril Cleverdon, born in Bristol, UK, 
1914-1997 
British librarian, best known for his work on the 
evaluation of information retrieval systems 
20
Cleverdon’s Cranfield 
evaluation 
• Components: 
• Document collection, 
• set of queries, 
• set of relevance judgements. 
• Measures (per query): 
• Precision - fraction of retrieved documents that are 
relevant. 
• Recall - fraction of all relevant documents that are returned by the search system. 
21
Cleverdon’s Cranfield 
evaluation: example 
• [berlin buzzwords] 
No. URL Judgement 
1 berlinbuzzwords.de/ R 
2 https://www.facebook.com/berlinbuzzwords R 
3 https://twitter.com/berlinbuzzwords R 
4 www.youtube.com/playlist?list=PLq-odUc2x7i8Qg4j2fix-QN6bjup NR 
5 https://developers.soundcloud.com/blog/buzzwords-contest R 
6 www.retresco.de/the-berlin-buzzwords-over-and-out/ NR 
7 planetcassandra.org/events/berlin-de-berlin-buzzwords-2014/ R 
Pr = CRel / C = 5 / 7 = 0.71 
Re = CRel / CRelOverall 
22
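A minimal sketch of the two per-query measures, using the judged SERP above; note that recall needs CRelOverall, the total number of relevant documents, which is usually unknown and has to be estimated (e.g. by pooling):

```python
def precision(judgements):
    """Fraction of retrieved documents judged relevant ('R')."""
    return sum(1 for j in judgements if j == "R") / len(judgements)

def recall(judgements, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return sum(1 for j in judgements if j == "R") / total_relevant

serp = ["R", "R", "R", "NR", "R", "NR", "R"]   # the [berlin buzzwords] example
print(round(precision(serp), 2))               # 0.71
```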
Cleverdon’s Cranfield 
evaluation: averaging 
• Macro-average: 
PRMaA= (Pr1 + Pr2 + … + PrN) / N 
• Micro-average: 
PRMiA = (CRel1 + CRel2 + … + CRelN) / (C1 + C2 + … + CN) 
N - count of judged SERPs 
• Variations: 
Pr@1, Pr@5, Pr@10 - counting only the top 1, 5, 10 results. 
23
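The two averaging schemes differ in whether each query or each retrieved document carries equal weight. A small self-contained sketch:

```python
def precision(judgements):
    return sum(1 for j in judgements if j == "R") / len(judgements)

def macro_precision(serps):
    """Average of the per-query precisions: every query counts equally."""
    return sum(precision(s) for s in serps) / len(serps)

def micro_precision(serps):
    """Relevant results over all results, pooled across the judged SERPs."""
    relevant = sum(sum(1 for j in s if j == "R") for s in serps)
    total = sum(len(s) for s in serps)
    return relevant / total

serps = [["R", "R", "NR"], ["R", "NR", "NR", "NR"]]
print(round(macro_precision(serps), 3))   # (0.667 + 0.25) / 2 ≈ 0.458
print(round(micro_precision(serps), 3))   # 3 / 7 ≈ 0.429
```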
Normalized Discounted Cumulative Gain (NDCG) 
• Measures usefulness, or gain, of document based 
on its position in the result list. 
• The gain is accumulated from the top of the result 
list to the bottom with the gain of each result 
discounted at lower ranks. 
DCGp = Σ_{i=1..p} (2^reli − 1) / log2(i + 1) 
NDCGp = DCGp / IDCGp 
reli - graded relevance of the result at position i, 
DCGp - discounted cumulative gain for p positions. 
From http://en.wikipedia.org/wiki/Discounted_cumulative_gain 
24
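A direct transcription of the formula above; the graded relevance values in the example are made up:

```python
import math

def dcg(relevances):
    """DCGp = sum over i of (2^rel_i - 1) / log2(i + 1), positions starting at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))   # ≈ 0.949
```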
Pairwise evaluation with Swiss system 
(experimental) 
• Judgement of document pairs: 
«Which document is more relevant to the query X?» 
• possible answers: left, right, equal. 
• The chosen document gets one point; in case of «equal», both documents get one point. 
• Pairs are prepared with the Swiss tournament system: 
• First pass: all documents are ordered randomly or by the default ranking, then the first document of the first half is paired with the first of the second half (1st with 5th, 2nd with 6th, and so on). 
• In the following passes, only the winners of the previous pass are judged; pairs are created the same way, from the first and second halves, top to bottom (see the code sketch below). 
25
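A minimal sketch of the pairing and scoring described above, assuming an even number of documents (byes for odd halves are not handled) and a `judge` callback that stands in for the human assessor:

```python
def pass_pairs(docs):
    """Pair the first half against the second half: 1st vs 5th, 2nd vs 6th, ..."""
    half = len(docs) // 2
    return list(zip(docs[:half], docs[half:half * 2]))

def run_pass(pairs, judge, points):
    """judge(left, right) returns 'left', 'right' or 'equal'; updates points, returns winners."""
    winners = []
    for left, right in pairs:
        verdict = judge(left, right)
        if verdict == "equal":
            points[left] += 1
            points[right] += 1
            winners.append(left)               # simplification: left advances on a tie
        else:
            winner = left if verdict == "left" else right
            points[winner] += 1
            winners.append(winner)
    return winners

docs = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
points = {d: 0 for d in docs}
judge = lambda left, right: "left"             # placeholder for the human judgement
winners = run_pass(pass_pairs(docs), judge, points)
while len(winners) > 1:                        # later passes judge only previous winners
    winners = run_pass(pass_pairs(winners), judge, points)
print(sorted(points.items(), key=lambda kv: -kv[1]))
```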
Which document is more relevant to the 
query 
[berlin buzzwords] ? 
26
Pairwise evaluation with 
Swiss system 
• About 19 judgements are needed for 10 documents retrieved for 1 query. 
• After judgement is finished, a ranking is built from the gathered points. 
• Weights are assigned to the documents according to their positions. 
• Using these weights, a machine-learned model can be trained. 
27
Pairwise evaluation with Swiss 
system: weights assignment 
• For example, we can use an exponentially decaying weight: 
W = P * exp(1/pos) 
1. 8.13 (3) 
2. 1.64 (1) 
3. 1.39 (1) 
4. 0 (0) 
5. 0 (0) 
28 
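A sketch of the weight assignment; the points (3, 1, 1, 0, 0) are the ones gathered in the example above, and the printed values differ from the slide only in rounding:

```python
import math

def position_weights(points_by_position):
    """W = P * exp(1 / pos): points scaled by an exponentially decaying position factor."""
    return [p * math.exp(1.0 / pos)
            for pos, p in enumerate(points_by_position, start=1)]

print([round(w, 2) for w in position_weights([3, 1, 1, 0, 0])])
# [8.15, 1.65, 1.4, 0.0, 0.0]
```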
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
29
Signals are the key: agenda 
• Production system: what data is available? 
• Text relevance: approaches, no silver bullet. 
• Social signals. 
• How to mix signals: manual linear model, gradient 
boosted decision trees. 
30
Production system: what 
data is available? 
• Documents: 
• CTR of the document, 
• absolute number of clicks, 
• count of times the document was clicked first on the SERP, 
• the same, but clicked last, 
• count of clicks on the same SERP before/after the document was clicked. 
• Displays (shows): 
• Count of times the document was displayed on a SERP, 
• count of unique queries for which the document was displayed, 
• document position: max, min, average, median, etc. 
31
Production system: what 
data is available? 
• Queries: 
• Absolute click count on query, 
• Abandonment rate, 
• CTR of the query, 
• Time spent on SERP, 
• Time spent till first/last click, 
• Query frequency, 
• Count of words in query, 
• IDF of query words: min/max/average/median, etc., 
• Count of query reformulations: min/max/average/median, 
• CTR of reformulations. 
32
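A sketch of deriving two of the query-level signals above (CTR and abandonment rate) from a SERP log; the log format is hypothetical:

```python
from collections import defaultdict

# One record per SERP shown: the query and how many results were clicked.
serp_log = [
    {"query": "berlin buzzwords", "clicks": 2},
    {"query": "berlin buzzwords", "clicks": 0},
    {"query": "drzak mydla",      "clicks": 1},
]

shows = defaultdict(int)
clicks = defaultdict(int)
abandoned = defaultdict(int)
for record in serp_log:
    q = record["query"]
    shows[q] += 1
    clicks[q] += record["clicks"]
    if record["clicks"] == 0:
        abandoned[q] += 1

for q in shows:
    print(q, "CTR:", clicks[q] / shows[q], "abandonment:", abandoned[q] / shows[q])
```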
Text relevance: use cases 
• Phrase search, 
• search of named entities (cities, names, etc.) 
• search of codes, articles, telephone numbers, 
• search of questions, 
• search of set expressions (e.g. «to get cold») 
• … 
33
Text relevance: signals 
• BM25F zoned version: meta-description, meta-keywords, title, body of the 
document, 
• calculate BM25 on query expansions: word forms, thesaurus based, 
abbreviations, translit, fragments, 
• min/max/average/median of count of subsequent query words found in the 
document, 
• the same, but in query order, 
• the same, but allowing a distance of +/- 1, 2, 3 words, 
• min/max of IDF of the query words found, 
• build a language model of the document and use it for ranking, 
• build language models of queries of different word counts and use the probabilities as signals. 
34
Text relevance: example 
model 
ScoreTR = 
a * BM25 + 
b * BM25FTitle + 
c * BM25FDescr + 
MAX(SubseqQWords)^d, 
a, b, c, d - can be estimated manually, or using 
stochastic gradient descent. 
35
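A sketch of the model as a plain function; the feature values and coefficients are placeholders, and the BM25 scores are assumed to come from the retrieval engine:

```python
def score_tr(features, a=1.0, b=2.0, c=0.5, d=1.5):
    """ScoreTR = a*BM25 + b*BM25F_title + c*BM25F_descr + MAX(SubseqQWords)^d."""
    return (a * features["bm25"]
            + b * features["bm25f_title"]
            + c * features["bm25f_descr"]
            + features["max_subseq_qwords"] ** d)

doc = {"bm25": 7.2, "bm25f_title": 3.1, "bm25f_descr": 1.4, "max_subseq_qwords": 2}
print(round(score_tr(doc), 2))   # ≈ 16.93
```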
Social signals 
• Count of readers/commenters of content, 
• count of comments published during some time 
period (velocity), 
• time since last comment, 
• growth rate of likes, 
• time since last like, 
• absolute count of likes, 
• etc. 
36
How to mix signals: learning-to-rank 
Learning to rank or machine-learned ranking 
(MLR) is the application of machine learning, 
typically supervised, semi-supervised or 
reinforcement learning, in the construction of ranking 
models for information retrieval systems. 
From Wikipedia, M. Mohri, et al. Foundations of Machine Learning, The MIT 
Press, 2012 
37
How to mix signals: full-scale process 
• The training set preparation: 
• Documents, 
• Queries, 
• Relevance judgements. 
• Framework: 
• Query the search system and dump feature vectors (incl. assigned relevance judgements), 
• learn the model, 
• evaluate the model, 
• deploy the model to the production system, 
• repeat after some time. 
38
How to mix signals: DIY way 
• Manually choose a set of features that you think are good predictors, 
• create a simple linear model from these predictors, 
• fit the coefficients manually on a few (~10) representative queries (see the fitting sketch below). 
ScoreTR = 
a * BM25 + 
MAX(SubseqQWords)^b + 
c * CTR + 
d * Likes + 
e * QLength; 
a, b, c, d, e - need to be fit. 
39
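The manual fitting step can be mechanized as a coarse grid search over the five coefficients; `evaluate` is an assumed callback that ranks the representative queries with the given coefficients and returns an offline measure such as average precision@10:

```python
import itertools

def fit_coefficients(evaluate, grid=(0.0, 0.5, 1.0, 2.0)):
    """Try every combination of coefficients (a, b, c, d, e) from the grid
    and keep the one with the best offline score."""
    best_coeffs, best_score = None, float("-inf")
    for coeffs in itertools.product(grid, repeat=5):
        score = evaluate(coeffs)
        if score > best_score:
            best_coeffs, best_score = coeffs, score
    return best_coeffs, best_score
```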
How to mix signals: more work 
• Get some relevance judgements: 
• pairwise evaluation, 
• classic Cranfield way, 
• using some good signal as the target, sacrificing it as a feature.* 
• Learn a more complex model: Ranking-SVM, or 
Gradient Boosted Decision Trees (GBDT). 
* make sure the sacrificed signal does not correlate strongly with the other signals. 
40
Decision tree 
[Diagram: a decision tree with splits F5 > 0.5, F11 > 0.21, F7 > 0.001, F2 > 0.72 and leaf scores 0.7, 0.27, 0.9, 0.1, 0.3] 
41
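One possible reading of the tree in the diagram as code; the exact branch layout is an assumption, the point is only that a learned tree maps a feature vector to a leaf score:

```python
def tree_score(f):
    """Hypothetical layout of the diagrammed tree (splits on F5, F11, F2, F7)."""
    if f["F5"] > 0.5:
        if f["F11"] > 0.21:
            return 0.9
        return 0.7 if f["F2"] > 0.72 else 0.27
    return 0.1 if f["F7"] > 0.001 else 0.3

print(tree_score({"F5": 0.6, "F11": 0.1, "F2": 0.8, "F7": 0.0}))   # 0.7
```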
Gradient boosted decision trees 
S = ⍺·D1 + ⍺·D2 + ⍺·D3 + ⍺·D4 + … + ⍺·DN 
⍺ - step (learning rate), 
Di - result of each weak predictor (tree), 
N - count of weak predictors 
Each weak predictor is learned on subsample from the 
whole training set. 
42
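A sketch of training such an ensemble on dumped feature vectors with graded relevance labels, assuming scikit-learn is available; real ranking setups usually prefer pairwise/listwise objectives (e.g. LambdaMART), but a regression on judgements illustrates the mechanics:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: one feature vector per (query, document) pair,
# labels are graded relevance judgements 0..4.
X = np.random.rand(500, 20)
y = np.random.randint(0, 5, size=500).astype(float)

model = GradientBoostingRegressor(
    n_estimators=200,    # N weak predictors
    learning_rate=0.1,   # the step ⍺
    subsample=0.7,       # each tree is fit on a subsample of the training set
    max_depth=3,
)
model.fit(X, y)

# At query time: score the candidate documents and sort by predicted relevance.
candidates = np.random.rand(10, 20)
ranking = np.argsort(-model.predict(candidates))
print(ranking)
```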
Yahoo! Learning to rank challenge, 2011
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
44
Producing good snippets: 
text summarization 
The problem is to generate a summary of the original document, taking into account: 
1. query words, 
2. length, 
3. style. 
(example query: [mardi gras fat tuesday]) 
45
Producing good snippets: types 
1. Static - generated once; their content does not change when the query changes, and they may not contain query words at all. 
2. Dynamic - generated individually for each query; usually contain query words. 
Almost all modern search systems use dynamic snippet generation or a combination of both. 
46
Producing good snippets: algorithm 
1. Generate a representation of the document as a set of paragraphs, sentences and words. 
2. Generate snippet candidates for the given query. 
3. For each candidate, compute signals and rank the candidates with a machine-learned model. 
4. Select the most suitable candidate(s) that fit the requirements (a minimal sketch follows below). 
47
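A minimal sketch of steps 2-4 with naive stand-ins: regex sentence splitting instead of a real document model, and a hand-weighted score instead of a machine-learned one:

```python
import re

def snippet(document: str, query: str, max_len: int = 160) -> str:
    query_words = set(query.lower().split())
    # crude sentence segmentation (placeholder for a real paragraph/sentence model)
    sentences = re.split(r'(?<=[.!?])\s+', document)
    # candidates: sentences containing at least one query word
    candidates = [s for s in sentences if query_words & set(s.lower().split())]
    if not candidates:
        return sentences[0][:max_len]          # fall back to a static snippet
    def score(s):                              # simple signals: coverage + length fit
        coverage = len(query_words & set(s.lower().split())) / len(query_words)
        length_fit = 1.0 - abs(len(s) - max_len) / max(len(s), max_len)
        return 0.7 * coverage + 0.3 * length_fit
    return max(candidates, key=score)[:max_len]

print(snippet("Mardi Gras is also known as Fat Tuesday. It is the final day "
              "of Carnival. Many cities hold parades.", "mardi gras fat tuesday"))
```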
Producing good snippets: 
example signals 
• Length of the candidate text, 
• number of query words in the candidate text, 
• BM25, 
• IDF of the query words in the candidate text, 
• does the candidate start/end at a sentence boundary? 
• conformity of query word order, 
• conformity of word forms between query and text, 
• etc. 
48
Thank you. 
Alexander Sibiryakov, sixty-one@yandex.ru 
49
