Search quality in 
practice 
Alexander Sibiryakov, ex-Yandex engineer, data scientist at Avast! 
sixty-one@yandex.ru 
1
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
2
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
3
• Search quality is an abstract term: it covers relevance and user experience, and reflects the overall effectiveness of search as perceived by humans. 
• Relevance, in search, is the measure of how well a retrieved document matches the user's information need. 
4
Relevance is subjective 
Users perceive relevance very subjectively; their judgement depends on: 
• the context of the problem they are trying to solve, 
• their awareness of the problem, 
• the user interface, 
• document annotations, 
• presentation form, 
• result order, 
• previous experience with the search system. 
5
Seznam.cz, new search UI with big screenshots 
6
images.yandex.ru - image search from yandex.ru 
7
Search system behavior 
can be learned by users 
• Seznam.cz has a very good document base for the Czech internet, bigger than Google's, but its ranking is less powerful and very sensitive to query formulation. 
• Yandex is very weak on software development queries, because of a lack of documents or poor ranking. 
8
Problems 
• No definitive formulation. Considerable uncertainty. Complex interdependencies. 
• We, developers, aren’t prepared to tackle search. We can’t manage a high-tech, step-changing, cross-functional, user-centered challenge. 
• The role of search in user experience is underestimated. Therefore, nobody measures it or knows how good it is. 
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
9
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
10
Poor search is bad for business and sad 
for society 
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
11
Search can be a source of information 
and inspiration 
From «Search Patterns» P. Morville & J. Callender, O’Reilly, 2010 
12
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
13
Examples of search quality problems 
• Search for model numbers or article codes 
[6167 8362823] [61 67 8 362 823] 
(telescopic nozzle): requires proper tokenization 
• Detection and correction of typing errors 
[drzak myla] → [drzak mydla] 
(soap holder): lexical ambiguity 
• Question search 
[how to buy a used xperia] → [… smartphone] 
[how to buy a used fiat] → [… car] 
wrong weighting of important words. 
14
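One way to handle the model-number case above is to normalize digit runs at indexing and query time so that differently spaced forms match. A minimal sketch; the function name and regex are illustrative, not the tokenizer of any particular engine:

```python
import re

def normalize_digit_runs(text: str) -> str:
    """Remove whitespace between adjacent digit groups, so that
    '61 67 8 362 823' and '6167 8362823' produce the same token."""
    return re.sub(r'(?<=\d)\s+(?=\d)', '', text)

print(normalize_digit_runs("61 67 8 362 823"))   # 61678362823
print(normalize_digit_runs("6167 8362823"))      # 61678362823
```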
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
15
Evaluation of search 
• The foundation for improving a search system, 
• as usual, there is no ideal measure, 
• use multiple measures, 
• keep in mind the properties of each measure when making a decision. 
16
Evaluation of search: 
methods 
• Query-by-query comparison of two systems, 
• classic Cleverdon’s Cranfield evaluation, 
• Pairwise evaluation with Swiss system. 
17
Query-by-query comparison 
• Take random queries from the query stream, for example 100. 
• Query each system and evaluate the whole SERP of top-N results on a scale: 
++ (very good) 
+ (good) 
- (bad) 
-- (very bad) 
• Count the judgements of each type. 
18
Query-by-query 
comparison: example 
• Comparing Google and Bing 
[berlin buzzwords]: G ++, B + 
[java byteoutputstream]: G +, B - 
Totals: Google ++ ×1, + ×1; Bing + ×1, - ×1 
19
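A minimal sketch of the bookkeeping behind such a comparison; the judgement records below are the two example queries above and are otherwise hypothetical:

```python
from collections import Counter

# (query, system, grade) with grades '++', '+', '-', '--'
judgements = [
    ("berlin buzzwords",      "Google", "++"),
    ("berlin buzzwords",      "Bing",   "+"),
    ("java byteoutputstream", "Google", "+"),
    ("java byteoutputstream", "Bing",   "-"),
]

tallies = {}
for _query, system, grade in judgements:
    tallies.setdefault(system, Counter())[grade] += 1

for system, counts in tallies.items():
    print(system, dict(counts))
# Google {'++': 1, '+': 1}
# Bing {'+': 1, '-': 1}
```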
Cyril Cleverdon, born in Bristol, UK, 
1914-1997 
British librarian, best known for his work on the 
evaluation of information retrieval systems 
20
Cleverdon’s Cranfield 
evaluation 
• Components: 
• Document collection, 
• set of queries, 
• set of relevance judgements. 
• Measures (per query): 
• Precision - fraction of retrieved documents that are 
relevant. 
• Recall - fraction of all relevant documents that are returned by the search system. 
21
Cleverdon’s Cranfield 
evaluation: example 
• [berlin buzzwords] 
No. URL Judgement 
1 berlinbuzzwords.de/ R 
2 https://www.facebook.com/berlinbuzzwords R 
3 https://twitter.com/berlinbuzzwords R 
4 www.youtube.com/playlist?list=PLq-odUc2x7i8Qg4j2fix-QN6bjup NR 
5 https://developers.soundcloud.com/blog/buzzwords-contest R 
6 www.retresco.de/the-berlin-buzzwords-over-and-out/ NR 
7 planetcassandra.org/events/berlin-de-berlin-buzzwords-2014/ R 
Pr = CRel / C = 5 / 7 = 0.71 
Re = CRel / CRelOverall 
22
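A minimal sketch of the two per-query measures, using the judged SERP above; note that recall needs CRelOverall, the total number of relevant documents, which is usually unknown and has to be estimated (e.g. by pooling):

```python
def precision(judgements):
    """Fraction of retrieved documents judged relevant ('R')."""
    return sum(1 for j in judgements if j == "R") / len(judgements)

def recall(judgements, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return sum(1 for j in judgements if j == "R") / total_relevant

serp = ["R", "R", "R", "NR", "R", "NR", "R"]   # the [berlin buzzwords] example
print(round(precision(serp), 2))               # 0.71
```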
Cleverdon’s Cranfield 
evaluation: averaging 
• Macro-average: 
PRMaA= (Pr1 + Pr2 + … + PrN) / N 
• Micro-average: 
PRMiA = (CRel1 + CRel2 + … + CRelN) / (C1 + C2 + … + CN) 
N - count of judged SERPs 
• Variations: 
Pr@1, Pr@5, Pr@10 - counting only the top 1, 5, 10 results. 
23
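The two averaging schemes differ in whether each query or each retrieved document carries equal weight. A small self-contained sketch:

```python
def precision(judgements):
    return sum(1 for j in judgements if j == "R") / len(judgements)

def macro_precision(serps):
    """Average of the per-query precisions: every query counts equally."""
    return sum(precision(s) for s in serps) / len(serps)

def micro_precision(serps):
    """Relevant results over all results, pooled across the judged SERPs."""
    relevant = sum(sum(1 for j in s if j == "R") for s in serps)
    total = sum(len(s) for s in serps)
    return relevant / total

serps = [["R", "R", "NR"], ["R", "NR", "NR", "NR"]]
print(round(macro_precision(serps), 3))   # (0.667 + 0.25) / 2 ≈ 0.458
print(round(micro_precision(serps), 3))   # 3 / 7 ≈ 0.429
```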
Normalized Discounted Cumulative Gain (NDCG) 
• Measures usefulness, or gain, of document based 
on its position in the result list. 
• The gain is accumulated from the top of the result 
list to the bottom with the gain of each result 
discounted at lower ranks. 
DCGp = Σ_{i=1..p} (2^reli − 1) / log2(i + 1) 
NDCGp = DCGp / IDCGp 
reli - graded relevance of the result at position i, 
DCGp - discounted cumulative gain for p positions. 
From http://en.wikipedia.org/wiki/Discounted_cumulative_gain 
24
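A direct transcription of the formula above; the graded relevance values in the example are made up:

```python
import math

def dcg(relevances):
    """DCGp = sum over i of (2^rel_i - 1) / log2(i + 1), positions starting at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))   # ≈ 0.949
```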
Pairwise evaluation with Swiss system 
(experimental) 
• Judgement of document pairs: 
«Which document is more relevant to the query X?» 
• possible answers: left, right, equal. 
• The chosen document gets one point; in case of «equal», both documents get one point. 
• Pairs are prepared with the Swiss tournament system: 
• First pass: all documents are ordered randomly or by the default ranking, then the first document of the first half is paired with the first of the second half (1st with 5th, 2nd with 6th, and so on). 
• In the following passes, only the winners of the previous pass are judged; pairs are created the same way, from the first and second halves, top to bottom (see the code sketch below). 
25
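A minimal sketch of the pairing and scoring described above, assuming an even number of documents (byes for odd halves are not handled) and a `judge` callback that stands in for the human assessor:

```python
def pass_pairs(docs):
    """Pair the first half against the second half: 1st vs 5th, 2nd vs 6th, ..."""
    half = len(docs) // 2
    return list(zip(docs[:half], docs[half:half * 2]))

def run_pass(pairs, judge, points):
    """judge(left, right) returns 'left', 'right' or 'equal'; updates points, returns winners."""
    winners = []
    for left, right in pairs:
        verdict = judge(left, right)
        if verdict == "equal":
            points[left] += 1
            points[right] += 1
            winners.append(left)               # simplification: left advances on a tie
        else:
            winner = left if verdict == "left" else right
            points[winner] += 1
            winners.append(winner)
    return winners

docs = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
points = {d: 0 for d in docs}
judge = lambda left, right: "left"             # placeholder for the human judgement
winners = run_pass(pass_pairs(docs), judge, points)
while len(winners) > 1:                        # later passes judge only previous winners
    winners = run_pass(pass_pairs(winners), judge, points)
print(sorted(points.items(), key=lambda kv: -kv[1]))
```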
Which document is more relevant to the 
query 
[berlin buzzwords] ? 
26
Pairwise evaluation with 
Swiss system 
• About 19 judgements are needed for 10 documents retrieved for 1 query. 
• After judgement is finished, a ranking is built from the gathered points. 
• Weights are assigned to the documents according to their positions. 
• Using these weights, a machine-learned model can be trained. 
27
Pairwise evaluation with Swiss 
system: weights assignment 
• For example, we can use an exponentially decaying weight: 
W = P * exp(1/pos) 
1. 8.13 (3) 
2. 1.64 (1) 
3. 1.39 (1) 
4. 0 (0) 
5. 0 (0) 
28 
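A sketch of the weight assignment; the points (3, 1, 1, 0, 0) are the ones gathered in the example above, and the printed values differ from the slide only in rounding:

```python
import math

def position_weights(points_by_position):
    """W = P * exp(1 / pos): points scaled by an exponentially decaying position factor."""
    return [p * math.exp(1.0 / pos)
            for pos, p in enumerate(points_by_position, start=1)]

print([round(w, 2) for w in position_weights([3, 1, 1, 0, 0])])
# [8.15, 1.65, 1.4, 0.0, 0.0]
```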
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
29
Signals are the key: agenda 
• Production system: what data is available? 
• Text relevance: approaches, no silver bullet. 
• Social signals. 
• How to mix signals: manual linear model, gradient 
boosted decision trees. 
30
Production system: what 
data is available? 
• Documents: 
• CTR of the document, 
• absolute number of clicks, 
• count of times the document was clicked first on the SERP, 
• the same, but clicked last, 
• count of clicks on the same SERP before/after the document was clicked. 
• Displays (shows): 
• Count of times the document was displayed on a SERP, 
• count of unique queries for which the document was displayed, 
• document position: max, min, average, median, etc. 
31
Production system: what 
data is available? 
• Queries: 
• Absolute click count on query, 
• Abandonment rate, 
• CTR of the query, 
• Time spent on SERP, 
• Time spent till first/last click, 
• Query frequency, 
• Count of words in query, 
• IDF of query words: min/max/average/median, etc., 
• Count of query reformulations: min/max/average/median, 
• CTR of reformulations. 
32
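A sketch of deriving two of the query-level signals above (CTR and abandonment rate) from a SERP log; the log format is hypothetical:

```python
from collections import defaultdict

# One record per SERP shown: the query and how many results were clicked.
serp_log = [
    {"query": "berlin buzzwords", "clicks": 2},
    {"query": "berlin buzzwords", "clicks": 0},
    {"query": "drzak mydla",      "clicks": 1},
]

shows = defaultdict(int)
clicks = defaultdict(int)
abandoned = defaultdict(int)
for record in serp_log:
    q = record["query"]
    shows[q] += 1
    clicks[q] += record["clicks"]
    if record["clicks"] == 0:
        abandoned[q] += 1

for q in shows:
    print(q, "CTR:", clicks[q] / shows[q], "abandonment:", abandoned[q] / shows[q])
```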
Text relevance: use cases 
• Phrase search, 
• search of named entities (cities, names, etc.) 
• search of codes, articles, telephone numbers, 
• search of questions, 
• search of set expressions (e.g. «to get cold») 
• … 
33
Text relevance: signals 
• BM25F zoned version: meta-description, meta-keywords, title, body of the 
document, 
• calculate BM25 on query expansions: word forms, thesaurus based, 
abbreviations, translit, fragments, 
• min/max/average/median of count of subsequent query words found in the 
document, 
• the same, but in query order, 
• the same, but allowing a distance of +/- 1, 2, 3 words, 
• min/max of IDF of the query words found, 
• build a language model of the document and use it for ranking, 
• build language models of queries of different word counts and use the probabilities as signals. 
34
Text relevance: example 
model 
ScoreTR = 
a * BM25 + 
b * BM25FTitle + 
c * BM25FDescr + 
MAX(SubseqQWords)^d, 
a, b, c, d - can be estimated manually, or using 
stochastic gradient descent. 
35
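A sketch of the model as a plain function; the feature values and coefficients are placeholders, and the BM25 scores are assumed to come from the retrieval engine:

```python
def score_tr(features, a=1.0, b=2.0, c=0.5, d=1.5):
    """ScoreTR = a*BM25 + b*BM25F_title + c*BM25F_descr + MAX(SubseqQWords)^d."""
    return (a * features["bm25"]
            + b * features["bm25f_title"]
            + c * features["bm25f_descr"]
            + features["max_subseq_qwords"] ** d)

doc = {"bm25": 7.2, "bm25f_title": 3.1, "bm25f_descr": 1.4, "max_subseq_qwords": 2}
print(round(score_tr(doc), 2))   # ≈ 16.93
```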
Social signals 
• Count of readers/commenters of content, 
• count of comments published during some time 
period (velocity), 
• time since last comment, 
• growth rate of likes, 
• time since last like, 
• absolute count of likes, 
• etc. 
36
How to mix signals: learning-to-rank 
Learning to rank or machine-learned ranking 
(MLR) is the application of machine learning, 
typically supervised, semi-supervised or 
reinforcement learning, in the construction of ranking 
models for information retrieval systems. 
From Wikipedia, M. Mohri, et al. Foundations of Machine Learning, The MIT 
Press, 2012 
37
How to mix signals: full-scale process 
• The training set preparation: 
• Documents, 
• Queries, 
• Relevance judgements. 
• Framework: 
• Query the search system and dump feature vectors (incl. assigned relevance judgements), 
• learn the model, 
• evaluate the model, 
• deploy the model to the production system, 
• repeat after some time. 
38
How to mix signals: DIY way 
• Manually choose a set of features that you think are good predictors, 
• create a simple linear model from these predictors, 
• fit the coefficients manually on a few (~10) representative queries (see the fitting sketch below). 
ScoreTR = 
a * BM25 + 
MAX(SubseqQWords)^b + 
c * CTR + 
d * Likes + 
e * QLength; 
a, b, c, d, e - need to be fit. 
39
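The manual fitting step can be mechanized as a coarse grid search over the five coefficients; `evaluate` is an assumed callback that ranks the representative queries with the given coefficients and returns an offline measure such as average precision@10:

```python
import itertools

def fit_coefficients(evaluate, grid=(0.0, 0.5, 1.0, 2.0)):
    """Try every combination of coefficients (a, b, c, d, e) from the grid
    and keep the one with the best offline score."""
    best_coeffs, best_score = None, float("-inf")
    for coeffs in itertools.product(grid, repeat=5):
        score = evaluate(coeffs)
        if score > best_score:
            best_coeffs, best_score = coeffs, score
    return best_coeffs, best_score
```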
How to mix signals: more work 
• Get some relevance judgements: 
• pairwise evaluation, 
• classic Cranfield way, 
• using some good signal as the target, sacrificing it as a feature.* 
• Learn a more complex model: Ranking-SVM, or 
Gradient Boosted Decision Trees (GBDT). 
* make sure the sacrificed signal does not correlate strongly with the other signals. 
40
Decision tree 
[Diagram: a decision tree with splits F5 > 0.5, F11 > 0.21, F7 > 0.001, F2 > 0.72 and leaf scores 0.7, 0.27, 0.9, 0.1, 0.3] 
41
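One possible reading of the tree in the diagram as code; the exact branch layout is an assumption, the point is only that a learned tree maps a feature vector to a leaf score:

```python
def tree_score(f):
    """Hypothetical layout of the diagrammed tree (splits on F5, F11, F2, F7)."""
    if f["F5"] > 0.5:
        if f["F11"] > 0.21:
            return 0.9
        return 0.7 if f["F2"] > 0.72 else 0.27
    return 0.1 if f["F7"] > 0.001 else 0.3

print(tree_score({"F5": 0.6, "F11": 0.1, "F2": 0.8, "F7": 0.0}))   # 0.7
```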
Gradient boosted decision trees 
S = ⍺·D1 + ⍺·D2 + ⍺·D3 + ⍺·D4 + … + ⍺·DN 
⍺ - step (learning rate), 
Di - result of each weak predictor (tree), 
N - count of weak predictors 
Each weak predictor is learned on subsample from the 
whole training set. 
42
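A sketch of training such an ensemble on dumped feature vectors with graded relevance labels, assuming scikit-learn is available; real ranking setups usually prefer pairwise/listwise objectives (e.g. LambdaMART), but a regression on judgements illustrates the mechanics:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: one feature vector per (query, document) pair,
# labels are graded relevance judgements 0..4.
X = np.random.rand(500, 20)
y = np.random.randint(0, 5, size=500).astype(float)

model = GradientBoostingRegressor(
    n_estimators=200,    # N weak predictors
    learning_rate=0.1,   # the step ⍺
    subsample=0.7,       # each tree is fit on a subsample of the training set
    max_depth=3,
)
model.fit(X, y)

# At query time: score the candidate documents and sort by predicted relevance.
candidates = np.random.rand(10, 20)
ranking = np.argsort(-model.predict(candidates))
print(ranking)
```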
Yahoo! Learning to rank challenge, 2011
Agenda 
• What is search quality? 
• Examples of search quality problems. 
• Evaluating search quality. Methods. 
• Signals are the key. 
• Producing good snippets. 
44
Producing good snippets: 
text summarization 
The problem is to generate a summary of the original document, taking into account: 
1. query words, 
2. length, 
3. style. 
(example query: [mardi gras fat tuesday]) 
45
Producing good snippets: types 
1. Static - generated once; their content does not change when the query changes, and they may not contain query words at all. 
2. Dynamic - generated individually for each query; usually contain query words. 
Almost all modern search systems use dynamic snippet generation or a combination of both. 
46
Producing good snippets: algorithm 
1. Generate a representation of the document as a set of paragraphs, sentences and words. 
2. Generate snippet candidates for the given query. 
3. For each candidate, compute signals and rank the candidates with a machine-learned model. 
4. Select the most suitable candidate(s) that fit the requirements (a minimal sketch follows below). 
47
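A minimal sketch of steps 2-4 with naive stand-ins: regex sentence splitting instead of a real document model, and a hand-weighted score instead of a machine-learned one:

```python
import re

def snippet(document: str, query: str, max_len: int = 160) -> str:
    query_words = set(query.lower().split())
    # crude sentence segmentation (placeholder for a real paragraph/sentence model)
    sentences = re.split(r'(?<=[.!?])\s+', document)
    # candidates: sentences containing at least one query word
    candidates = [s for s in sentences if query_words & set(s.lower().split())]
    if not candidates:
        return sentences[0][:max_len]          # fall back to a static snippet
    def score(s):                              # simple signals: coverage + length fit
        coverage = len(query_words & set(s.lower().split())) / len(query_words)
        length_fit = 1.0 - abs(len(s) - max_len) / max(len(s), max_len)
        return 0.7 * coverage + 0.3 * length_fit
    return max(candidates, key=score)[:max_len]

print(snippet("Mardi Gras is also known as Fat Tuesday. It is the final day "
              "of Carnival. Many cities hold parades.", "mardi gras fat tuesday"))
```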
Producing good snippets: 
example signals 
• Length of the candidate text, 
• number of query words in the candidate text, 
• BM25, 
• IDF of the query words in the candidate text, 
• does the candidate start/end at a sentence boundary? 
• conformity of query word order, 
• conformity of word forms between query and text, 
• etc. 
48
Thank you. 
Alexander Sibiryakov, sixty-one@yandex.ru 
49
