Advances in Information Retrieval
38th European Conference on IR Research, ECIR 2016
Padua, Italy, March 20–23, 2016
Proceedings
Lecture Notes in Computer Science 9626
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zürich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7409
Nicola Ferro • Fabio Crestani (Eds.)

Editors

Nicola Ferro
Department of Information Engineering
University of Padua
Padova, Italy

Fabio Crestani
Faculty of Informatics
University of Lugano (USI)
Lugano, Switzerland

Marie-Francine Moens
Department of Computer Science
Katholieke Universiteit Leuven
Heverlee, Belgium

Josiane Mothe
Systèmes d’informations, Big Data et Recherche d’Information
Institut de Recherche en Informatique de Toulouse, IRIT/équipe SIG
Toulouse Cedex 04, France

Fabrizio Silvestri
Yahoo! Labs London
London, UK

Giorgio Maria Di Nunzio
Department of Information Engineering
University of Padua
Padova, Italy

Claudia Hauff
TU Delft - EWI/ST/WIS
Delft, The Netherlands

Gianmaria Silvello
Department of Information Engineering
University of Padua
Padova, Italy
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI
Preface

These proceedings contain the full papers, short papers, and demonstrations selected
for presentation at the 38th European Conference on Information Retrieval (ECIR
2016). The event was organized by the Information Management Systems (IMS)
research group1 of the Department of Information Engineering2 of the University of
Padua3, Italy. The conference was held during March 20–23 2016, in Padua, Italy.
ECIR 2016 received a total of 284 submissions in three categories: 201 full papers (seven of which were submitted to the reproducibility track), 66 short papers, and 17 demonstrations.
The geographical distribution of the submissions was as follows: 51 % were from
Europe, 21 % from Asia, 19 % from North and South America, 7 % from North Africa
and the Middle East, and 2 % from Australasia.
All submissions were reviewed by at least three members of an international two-tier
Program Committee. Of the full papers submitted to the conference, 42 were accepted
for oral presentation (22 % of the submitted ones) and eight as posters (4 % of the
submitted ones). Of the short papers submitted to the conference, 20 were accepted for
poster presentation (30 % of the submitted ones). In addition, six demonstrations (35 %
of the submitted ones) were accepted. The accepted contributions represent the state
of the art in information retrieval, cover a diverse range of topics, propose novel
applications, and indicate promising directions for future research.
We thank all Program Committee members for their time and effort in ensuring the
high quality of the ECIR 2016 program.
ECIR 2016 continued the reproducibility track introduced at ECIR 2015, which
specifically invited the submission of papers reproducing a single paper or a group of
papers from a third party, where the authors were not directly involved in the original
paper. Authors were requested to emphasize the motivation for selecting the papers to
be reproduced, the process of how results were attempted to be reproduced (success-
fully or not), the communication that was necessary to gather all information, the
potential difficulties encountered, and the result of the process. Of the seven papers
submitted to this track, four were accepted (57 % of the submitted ones).
A panel on “Data-Driven Information Retrieval” was organized at ECIR by
Maristella Agosti. The panel stems from the fact that information retrieval has always been concerned with finding the “needle in a haystack”: retrieving from huge amounts of data the most relevant information that best addresses user information needs.
Nevertheless, nowadays we are facing a radical paradigm shift, common also to many
other research fields, and information retrieval is becoming an increasingly data-driven
science due, for example, to recent developments in machine learning, crowdsourcing,
1 http://ims.dei.unipd.it/
2 http://www.dei.unipd.it/
3 http://www.unipd.it/
user interaction analysis, and so on. The goal of the panel is to discuss the emergent
trends in this area, their advantages, their pitfalls, and their implications for the future
of the field.
Additionally, ECIR 2016 hosted four tutorials and four workshops covering a range
of information retrieval topics. These were selected by workshop and tutorial
committees.
The workshops were:
– Third International Workshop on Bibliometric-Enhanced Information Retrieval
(BIR2016)
– First International Workshop on Modeling, Learning and Mining for Cross/
Multilinguality (MultiLingMine 2016)
– ProActive Information Retrieval: Anticipating Users’ Information Needs (ProAct IR)
– First International Workshop on Recent Trends in News Information Retrieval
(NewsIR 2016)
The following ECIR 2016 tutorials were selected:
– Collaborative Information Retrieval: Concepts, Models and Evaluation
– Group Recommender Systems: State of the Art, Emerging Aspects and Techniques,
and Research Challenges
– Living Labs for Online Evaluation: From Theory to Practice (LiLa2016)
– Real-Time Bidding Based Display Advertising: Mechanisms and Algorithms
(RTBMA 2016)
Short descriptions of these workshops and tutorials are included in the proceedings.
We would like to thank our invited speakers for their contributions to the program:
Jordan Boyd-Graber (University of Colorado, USA), Emine Yilmaz (University
College London, UK), and Domonkos Tikk (Gravity R&D, Hungary). Short descrip-
tions of these talks are included in the proceedings.
We are grateful to the panel led by Stefan Rüger for selecting the recipients of the
2015 Microsoft BCS/BCS IRSG Karen Spärck Jones Award, and we congratulate
Jordan Boyd-Graber and Emine Yilmaz on receiving this award (uniquely for 2015, the panel decided to make two full awards).
Considering the long history of ECIR, which is now at its 38th edition, ECIR 2016
introduced a new award, the Test of Time (ToT) Award, to recognize research that has
had long-lasting influence, including impact on a subarea of information retrieval
research, across subareas of information retrieval research, and outside of the infor-
mation retrieval research community (e.g., non-information retrieval research or
industry).
On the final day of the conference, the Industry Day ran in parallel with the conference sessions, with the goal of offering an exciting program containing a mix of invited talks by industry leaders and presentations of novel and innovative ideas from the search industry. A short description of the Industry Day is included in these proceedings.
ECIR 2016 was held under the patronage of: Regione del Veneto (Veneto Region),
Comune di Padova (Municipality of Padua), University of Padua, Department of
Information Engineering, and Department of Mathematics.
Finally, ECIR 2016 would not have been possible without the generous financial
support from our sponsors: Google (gold level); Elsevier, Spotify, and Yahoo! Labs
(palladium level); Springer (silver level); and Yandex (bronze level). The conference
was supported by the ELIAS Research Network Program of the European Science
Foundation, University of Padua, Department of Information Engineering, and
Department of Mathematics.
Organization

General Chair
Nicola Ferro University of Padua, Italy
Program Chairs
Fabio Crestani University of Lugano (USI), Switzerland
Marie-Francine Moens KU Leuven, Belgium
Workshop Chairs
Paul Clough University of Sheffield, UK
Gabriella Pasi University of Milano Bicocca, Italy
Tutorial Chairs
Christina Lioma University of Copenhagen, Denmark
Stefano Mizzaro University of Udine, Italy
Demo Chairs
Giorgio Maria Di Nunzio University of Padua, Italy
Claudia Hauff TU Delft, The Netherlands
Sponsorship Chair
Emanuele Di Buccio University of Padua, Italy
Program Committee
Full-Paper Meta-Reviewers
Giambattista Amati Fondazione Ugo Bordoni, Italy
Leif Azzopardi University of Glasgow, UK
Roberto Basili University of Rome Tor Vergata, Italy
Mohand Boughanem IRIT, Paul Sabatier University, France
Paul Clough University of Sheffield, UK
Bruce Croft University of Massachusetts Amherst, USA
Arjen de Vries Radboud University, The Netherlands
Norbert Fuhr University of Duisburg-Essen, Germany
Eric Gaussier Université Joseph Fourier, France
Cathal Gurrin Dublin City University, Ireland
Gareth Jones Dublin City University, Ireland
Additional Reviewers
Aggarwal, Nitish
Agun, Daniel
Balaneshin-Kordan, Saeid
Basile, Pierpaolo
Biancalana, Claudio
Boididou, Christina
Bordea, Georgeta
Caputo, Annalina
Chen, Yi-Ling
de Gemmis, Marco
Fafalios, Pavlos
Farnadi, Golnoosh
Freund, Luanne
Fu, Tao-Yang
Gialampoukidis, Ilias
Gossen, Tatiana
Grachev, Artem
Grossman, David
Hasibi, Faegheh
Herrera, Jose
Hung, Hui-Ju
Jin, Xin
Kaliciak, Leszek
Kamateri, Eleni
Kotzyba, Michael
Lin, Yu-San
Lipani, Aldo
Loni, Babak
Low, Thomas
Ludwig, Philipp
Luo, Rui
Mota, Pedro
Narducci, Fedelucio
Nikolaev, Fedor
Onal, K. Dilek
Palomino, Marco
Student Mentors
Paavo Arvola University of Tampere, Finland
Rafael E. Banchs I2R Singapore
Rafael Berlanga Llavori Universitat Jaume I, Spain
Pia Borlund University of Copenhagen, Denmark
Davide Buscaldi Université Paris XIII, France
Fidel Cacheda University of A Coruña, Spain
Marta Costa-Jussà Instituto Politécnico Nacional México, Mexico
Walter Daelemans University of Antwerp, Belgium
Kareem M. Darwish Qatar Computing Research Institute, Qatar
Maarten de Rijke University of Amsterdam, The Netherlands
Marcelo Luis Errecalde Universidad Nacional de San Luís, Argentina
Julio Gonzalo UNED, Spain
Hugo Jair Escalante INAOE Puebla, Mexico
Jaap Kamps University of Amsterdam, The Netherlands
Heikki Keskustalo University of Tampere, Finland
Greg Kondrak University of Alberta, Canada
Zornitsa Kozareva Yahoo! Labs, USA
Mandar Mitra Indian Statistical Institute, India
Manuel Montes y Gómez INAOE Puebla, Mexico
Alessandro Moschitti Qatar Computing Research Institute, Qatar
Preslav Nakov Qatar Computing Research Institute, Qatar
Doug Oard University of Maryland, USA
Iadh Ounis University of Glasgow, UK
Karen Pinel-Sauvagnat IRIT, Université de Toulouse, France
Ian Ruthven University of Strathclyde, UK
Grigori Sidorov Instituto Politécnico Nacional México, Mexico
Thamar Solorio University of Houston, USA
Elaine Toms University of Sheffield, UK
Christa Womser-Hacker University of Hildesheim, Germany
Patronage
Platinum Sponsors
Gold Sponsors
Palladium Sponsors
Silver Sponsors
Bronze Sponsors
Keynote Talks
Machine Learning Shouldn’t be a Black Box
Jordan Boyd-Graber
marshaling armies for world conquest. Alliances are fluid: friends are betrayed and
enemies embraced as the game develops. However, users’ conversations let us predict
when friendships break: betrayers writing ostensibly friendly messages before a
betrayal become more polite, stop talking about the future, and change how much they
write [13]. Diplomacy may be a nerdy game, but it is a fruitful testbed to teach
computers to understand messy, emotional human interactions.
A game with higher stakes is politics. However, just like Diplomacy, the words that
people use reveal their underlying goals; computational methods can help expose the
“moves” political players can use. With collaborators in political science, we’ve built
models that: show when politicians in debates strategically change the topic to influ-
ence others [9, 11]; frame topics to reflect political leanings [10]; use subtle linguistic
phrasing to express their political leaning [7]; or create political subgroups within larger
political movements [12].
Conversely, games also teach humans how computers think. Our trivia-playing
robot [1, 6, 8] faced off against four former Jeopardy champions in front of 600 high
school students.1 The computer claimed an early lead, but we foolishly projected the
computer’s thought process for all to see. The humans learned to read the algorithm’s
ranked dot products and schemed to answer just before the computer. In five years of
teaching machine learning, I’ve never had students catch on so quickly to how linear
classifiers work. The probing questions from high school students in the audience
showed they caught on too. (Later, when we played again against Ken Jennings,2 he sat
in front of the dot products and our system did much better.)
Advancing machine learning requires closer, more natural interactions. However,
we still require much of the user—reading distributions or dot products—rather than
natural language interactions. Document exploration tools should describe in words
what a cluster is, not just provide inscrutable word clouds; deception detection systems
should say why a betrayal is imminent; and question answers should explain how it
knows Aaron Burr shot Alexander Hamilton. My work will complement machine
learning’s ubiquity with transparent, empathetic, and useful interactions with users.
Bibliography
1. Boyd-Graber, J., Satinoff, B., He, H., Daumé III, H.: Besting the quiz master: crowdsourcing incremental classification games. In: Empirical Methods in Natural Language Processing (2012). http://www.cs.colorado.edu/~jbg/docs/qb_emnlp_2012.pdf
2. Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., Blei, D.M.: Reading tea leaves: how humans interpret topic models. In: Proceedings of Advances in Neural Information Processing Systems (2009). http://www.cs.colorado.edu/~jbg/docs/nips2009-rtl.pdf
1 https://www.youtube.com/watch?v=LqsUaprYMOw
2 https://www.youtube.com/watch?v=kTXJCEvCDYk
3. Grissom II, A., He, H., Boyd-Graber, J., Morgan, J., Daumé III, H.: Don’t until the final verb wait: reinforcement learning for simultaneous machine translation. In: Proceedings of Empirical Methods in Natural Language Processing (2014). http://www.cs.colorado.edu/~jbg/docs/2014_emnlp_simtrans.pdf
4. He, H., Grissom II, A., Boyd-Graber, J., Daumé III, H.: Syntax-based rewriting for simultaneous machine translation. In: Empirical Methods in Natural Language Processing (2015). http://www.cs.colorado.edu/~jbg/docs/2015_emnlp_rewrite.pdf
5. Hu, Y., Boyd-Graber, J., Satinoff, B., Smith, A.: Interactive topic modeling. Mach. Learn. 95(3), 423–469 (2014). http://dx.doi.org/10.1007/s10994-013-5413-0
6. Iyyer, M., Boyd-Graber, J., Claudino, L., Socher, R., Daumé III, H.: A neural network for factoid question answering over paragraphs. In: Proceedings of Empirical Methods in Natural Language Processing (2014). http://www.cs.colorado.edu/~jbg/docs/2014_emnlp_qb_rnn.pdf
7. Iyyer, M., Enns, P., Boyd-Graber, J., Resnik, P.: Political ideology detection using recursive neural networks. In: Proceedings of the Association for Computational Linguistics (2014). http://www.cs.colorado.edu/~jbg/docs/2014_acl_rnn_ideology.pdf
8. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Association for Computational Linguistics (2015). http://www.cs.colorado.edu/~jbg/docs/2015_acl_dan.pdf
9. Nguyen, V.A., Boyd-Graber, J., Resnik, P.: SITS: a hierarchical non-parametric model using speaker identity for topic segmentation in multiparty conversations. In: Proceedings of the Association for Computational Linguistics (2012). http://www.cs.colorado.edu/~jbg/docs/acl_2012_sits.pdf
10. Nguyen, V.A., Boyd-Graber, J., Resnik, P.: Lexical and hierarchical topic regression. In: Proceedings of Advances in Neural Information Processing Systems (2013). http://www.cs.colorado.edu/~jbg/docs/2013_shlda.pdf
11. Nguyen, V.A., Boyd-Graber, J., Resnik, P., Cai, D., Midberry, J., Wang, Y.: Modeling topic control to detect influence in conversations using nonparametric topic models. Mach. Learn. 95, 381–421 (2014). http://www.cs.colorado.edu/~jbg/docs/mlj_2013_influencer.pdf
12. Nguyen, V.A., Boyd-Graber, J., Resnik, P., Miler, K.: Tea party in the house: a hierarchical ideal point topic model and its application to Republican legislators in the 112th Congress. In: Association for Computational Linguistics (2015). http://www.cs.colorado.edu/~jbg/docs/2015_acl_teaparty.pdf
13. Niculae, V., Kumar, S., Boyd-Graber, J., Danescu-Niculescu-Mizil, C.: Linguistic harbingers of betrayal: a case study on an online strategy game. In: Association for Computational Linguistics (2015). http://www.cs.colorado.edu/~jbg/docs/2015_acl_diplomacy.pdf
14. Talley, E.M., Newman, D., Mimno, D., Herr, B.W., Wallach, H.M., Burns, G.A.P.C., Leenders, A.G.M., McCallum, A.: Database of NIH grants using machine-learned categories and graphical clustering. Nat. Methods 8(6), 443–444 (2011)
A Task-Based Perspective to Information
Retrieval
Emine Yilmaz
The need for search often arises from a person's need to achieve a goal or a task, such as booking travel, organizing a wedding, buying a house, etc. [1]. Contemporary search
engines focus on retrieving documents relevant to the query submitted as opposed to
understanding and supporting the underlying information needs (or tasks) that have led
a person to submit the query. Therefore, search engine users often have to submit
multiple queries to the current search engines to achieve a single information need [2].
For example, booking travel to a location such as London would require the user to submit several different queries, such as flights to London, hotels in London, and points of interest around London, as all of these queries relate to possible subtasks the user might have to perform in order to arrange the trip.
Ideally, an information retrieval (IR) system should be able to understand the
reason that caused the user to submit a query and it should help the user achieve the
actual task by guiding her through the steps (or subtasks) that need to be completed.
Even though designing such systems that can characterize/identify tasks, and can
respond to them efficiently is listed as one of the grand challenges in IR [1], very little
progress has been made in this direction [3].
Having identified that users often have to reformulate their queries in order to
achieve their final goal, most current search engines attempt to assist users towards a
better expression of their needs by suggesting queries to them, other than the currently
issued query. However, query suggestions mainly focus on helping the user refine their
current query, as opposed to helping them identify and explore aspects related to their
current complex tasks. For example, when a user issues the query “flights to Barce-
lona”, it is clear that the user is planning to travel to Barcelona and it is very likely that
the user will also need to search for hotels in Barcelona or for shuttles from Barcelona
airport. Since query suggestions mainly focus on refining the current query, the suggestions provided by commonly used search engines are mostly of the form “flights to
Barcelona from <LOCATION>”, or “<FLIGHT CARRIER NAME> flights to Bar-
celona” and the result pages provided by these systems do not contain any information
that could help users book hotels or shuttles from the airport.
For very common tasks such as arranging travel, it may be possible to manually identify and guide the user through the list of (sub)tasks that need to be completed to achieve the overall task (booking a flight, finding a hotel, looking for points of interest, etc., when the user is trying to arrange her travel). However, given the variety of tasks
search engines are used for, this would only be possible for a very small subset of them.
Furthermore, quite often search engines are used to achieve such complex tasks that
often the searcher herself lacks the task knowledge necessary to decide which step to
tackle next [2]. For example, a searcher looking for information about how to maintain
a car with no prior knowledge would first need to use the search engine to identify the
parts of the car that need maintenance and issue separate queries to learn about
maintaining each part. Hence, retrieval systems that can automatically detect the task
the user is trying to achieve and guide her through the process are needed, where a search
task has been previously defined as an atomic information need that consists of a set of
related (sub)tasks [2].
With the introduction of new types of devices into our everyday lives, search systems are now being used on very different kinds of devices. These devices are becoming increasingly small (e.g. mobile phones, smart watches, smart glasses), which limits the types of interactions users may have with
the systems. Searching over devices with such small interfaces is not easy as it requires
more effort to type and interact with the system. Hence, building IR systems that can
reduce the interactions needed with the device is highly critical for such devices.
Therefore, task based information retrieval systems will be even more valuable for such
small interfaces, which are increasingly being introduced/used.
Devising task-based information retrieval systems poses several challenges that have to be tackled. In this talk, I will start by describing the problems that need to be solved when designing such systems, comparing and contrasting them with the traditional way of building IR systems. In particular, devising such task-based systems would involve tackling several challenges, such as (1) devising methodologies for accurately extracting and representing tasks, (2) building and designing new interfaces for task-based IR systems, (3) devising methodologies for evaluating the quality of task-based IR systems, and (4) task-based personalization of IR systems. I will talk about the initial attempts made at tackling these challenges, as well as the initial methodologies we have built in order to tackle each of them.
References
1. Belkin, N.: Some(what) grand challenges for IR. ACM SIGIR Forum 42(1), 47–54 (2008)
2. Jones, R., Klinkner, K.L.: Beyond the session timeout: automatic hierarchical segmentation of
search topics in query logs. In: Proceedings of ACM CIKM 2008 Conference on Information
and Knowledge Management, pp. 699–708 (2008)
3. Kelly, D., Arguello, J., Capra, R.: NSF workshop on task-based information search systems. In: ACM SIGIR Forum, vol. 47, no. 2, December 2013
Lessons Learnt at Building Recommendation
Services in Industry Scale
Domonkos Tikk
Gravity R&D experienced many challenges while scaling up their services. The sheer
quantity of data handled on a daily basis increased exponentially. This presentation will
cover how overcoming these challenges permanently shaped our algorithms and system
architecture used to generate these recommendations. Serving personalized recom-
mendations requires real-time computation and data access for every single request. To
generate responses in real-time, current user inputs have to be compared against their
history in order to deliver accurate recommendations.
We then combine this user information with specific details about available items as
the next step in the recommendation process. It becomes more difficult to provide
accurate recommendations as the number of transactions and items increase. It also
becomes difficult because this type of analysis requires the combination of multiple
complex algorithms that all may require heterogeneous inputs.
Initially, the architecture was designed for matrix factorization based models [4]
and serving huge numbers of requests but with a limited number of items. Now, Gravity uses MF, neighborhood-based models [5], context-aware recommenders [2, 3], and metadata-based models to generate recommendations for millions of items within their databases, and is experimenting with applying deep learning technology for recommendations [1]. This required a shift from a monolithic archi-
tecture with in-process caching to a more service oriented architecture with multi-layer
caching. As a result of an increase in the number of components and number of clients,
managing the infrastructure can be quite difficult.
Even with these challenges, we do not believe that it is worthwhile to use a fully
distributed system. It adds unneeded complexity, resources, and overhead to the sys-
tem. We prefer an approach of firstly optimizing current algorithms and architecture
and only moving to a distributed system when no other options are left.
References
1. Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based recommendations with
recurrent neural networks. CoRR (Arxiv) abs/1511.06939 (2015). http://arxiv.org/abs/1511.
06939
2. Hidasi, B., Tikk, D.: Fast ALS-based tensor factorization for context-aware recommendation
from implicit feedback. In: Flach, P., et al. (eds.) ECML PKDD 2012. LNCS vol. 7524,
pp. 67–82. Springer, Berlin (2012)
3. Hidasi, B., Tikk, D.: General factorization framework for context-aware recommendations.
Data Mining and Knowledge Discovery, pp. 1–30 (2015). http://dx.doi.org/10.1007/s10618-
015-0417-y
4. Takács, G., Pilászy, I., Németh, B., Tikk, D.: Scalable collaborative filtering approaches for
large recommender systems. J. Mach. Learn. Res. 10, 623–656 (2009)
5. Takács, G., Pilászy, I., Németh, B., Tikk, D.: Matrix factorization and neighbor based
algorithms for the Netflix Prize problem. In: 2nd ACM Conference on Recommendation
Systems, pp. 267–274. Lausanne, Switzerland, 21–24 October 2008
Contents
Machine Learning
Question Answering
Ranking
Evaluation Methodology
Probabilistic Modelling
Evaluation Issues
Multimedia
Fusing Web and Audio Predictors to Localize the Origin of Music Pieces
for Geospatial Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Markus Schedl and Fang Zhou
Summarization
Reproducibility
Retrieval Models
Applications
Collaborative Filtering
Short Papers
Two Scrolls or One Click: A Cost Model for Browsing Search Results . . . . . 696
Leif Azzopardi and Guido Zuccon
Demos
Industry Day
Workshops
Tutorials
1 Introduction
Thanks to the growth of social networks, e.g., Twitter1, users can freely express their opinions on many topics in the form of tweets - short messages of at most 140 characters. For example, after reading a web document which mentions a special event, e.g., the Boston bombing, readers can write tweets about the event on their timeline. These tweets, called social information [18], not only reveal readers' opinions but also reflect the content of the document and describe facts about the event. From this observation, an interesting idea is that social information can be utilized as mutual reinforcement for web document summarization.
1 http://twitter.com - a microblogging system.
2 Summarization by Ranking
This section presents our proposal for social context summarization by ranking in three steps: basic idea, feature selection, and summarization.
Table 1. The features; italics in the second column indicate the distance features; S is a sentence, T is a tweet; LCS is the longest common substring
2.3 Summarization
The goal of our approach is to select important sentences and representative tweets
as summaries because they provide more information regarding the content of a
document rather than only providing sentences. In our method, tweets are utilized
to enrich the summary when calculating the score of a sentence, and sentences are
also considered as mutual reinforcement information in computing the score of a
tweet. More precisely, the weight of each instance is computed from the entailment relation using the features, and the top K instances with the highest scores are selected as the summaries.
[Fig. 1 shows a document wing with sentences s_1 ... s_n and a social wing with tweets t_1 ... t_m, connected through relation generation between the document and the tweet collection from Twitter, and feeding summarization by ranking.]
Fig. 1. The overview of summarization using DWEG; s_i and t_j denote a sentence and a tweet in the document and social wings; red lines are inter-relations and blue lines are intra-relations; the weight of each node (e.g., 3.25 at s_1) is its entailment value.
score(s_i) = \frac{1}{m} \sum_{j=1}^{m} rteScore(s_i, t_j) \qquad (1)

rteScore(s_i, t_j) = \frac{1}{F} \sum_{k=1}^{F} f_k(s_i, t_j) \qquad (2)
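To make the averaging scheme of Eqs. (1)-(2) concrete, a minimal Python sketch follows. The two feature functions are illustrative placeholders standing in for the RTE features of Table 1, which are not reproduced in this excerpt, so this illustrates only the scoring mechanism, not the authors' actual feature set.

```python
# Sketch of SoRTESum inter-wing scoring (Eqs. 1-2).
# The feature functions are placeholders for the RTE features of Table 1.

def jaccard(a, b):
    """Word-overlap placeholder for one of the f_k features."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def length_ratio(a, b):
    """Another placeholder feature: ratio of the shorter to the longer text."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb, 1)

FEATURES = [jaccard, length_ratio]  # stand-ins for the F features

def rte_score(s, t):
    """Eq. (2): average of the F feature values between a sentence and a tweet."""
    return sum(f(s, t) for f in FEATURES) / len(FEATURES)

def inter_wing_score(sentence, tweets):
    """Eq. (1): average rteScore of a sentence against all m tweets."""
    return sum(rte_score(sentence, t) for t in tweets) / len(tweets)
```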
RTE-Sum Dual Wing: In this method, the RTE score of a sentence was calculated by using the remaining sentences as the main part (intra-relation) and the tweets as auxiliary information (inter-relation) in an accumulative mechanism. For example, the score of s_i was calculated from s_1 to s_n; at the same time, the score was also computed from the relevant tweets t_1 to t_m. Finally, the RTE value of a sentence was the average of all entailment values. The calculation is shown in Eq. (3).
score(s_i) = \delta \sum_{k=1}^{n} rteScore(s_i, s_k) + (1 - \delta) \sum_{j=1}^{m} rteScore(s_i, t_j) \qquad (3)
The RTE value of a tweet was computed by the same mechanism, shown in Eq. (4).
score(t_j) = \delta \sum_{k=1}^{m} rteScore(t_j, t_k) + (1 - \delta) \sum_{i=1}^{n} rteScore(t_j, s_i) \qquad (4)
where δ is the damping factor; n and m are the number of sentences and tweets.
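Continuing the sketch above (and reusing its rte_score), the dual-wing scores of Eqs. (3)-(4) and the top-K selection from Sect. 2.3 might look as follows; the default value of the damping factor here is an arbitrary placeholder, since the paper explores values between 0.1 and 0.9.

```python
def dual_wing_sentence_score(i, sentences, tweets, delta=0.5):
    """Eq. (3): delta-weighted sum of intra-relation scores (other sentences)
    and inter-relation scores (tweets) for sentence s_i."""
    intra = sum(rte_score(sentences[i], s) for k, s in enumerate(sentences) if k != i)
    inter = sum(rte_score(sentences[i], t) for t in tweets)
    return delta * intra + (1 - delta) * inter

def dual_wing_tweet_score(j, sentences, tweets, delta=0.5):
    """Eq. (4): the symmetric computation for tweet t_j."""
    intra = sum(rte_score(tweets[j], t) for k, t in enumerate(tweets) if k != j)
    inter = sum(rte_score(tweets[j], s) for s in sentences)
    return delta * intra + (1 - delta) * inter

def summarize(sentences, tweets, top_k=4):
    """Select the top-K highest-scoring sentences and tweets as the summary."""
    ranked_s = sorted(range(len(sentences)), reverse=True,
                      key=lambda i: dual_wing_sentence_score(i, sentences, tweets))
    ranked_t = sorted(range(len(tweets)), reverse=True,
                      key=lambda j: dual_wing_tweet_score(j, sentences, tweets))
    return ([sentences[i] for i in ranked_s[:top_k]],
            [tweets[j] for j in ranked_t[:top_k]])
```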
3.2 Baselines
The following systems were used to compare to SoRTESum:
– Random Method: selects sentences and comments randomly.
– SentenceLead: chooses the first x sentences as the summarization [11].
– LexRank: summarizes a given news article using LexRank algorithm7 [2].
– L2R: uses RankBoost with local and cross features [16], using RankLib8 .
– Interwing-sent2vec: uses Cosine similarity, by Eq. (1). A sentence-to-vector tool was utilized to generate vectors9 (size = 100 and window = 5) with 10 million sentences from Wikipedia10.
– Dualwing-sent2vec: uses Cosine similarity; by Eq. (3) and (4).
– RTE One Wing: uses one wing (document/tweet) to calculate RTE score.
| System | ROUGE-1 Avg-P | ROUGE-1 Avg-R | ROUGE-1 Avg-F | ROUGE-2 Avg-P | ROUGE-2 Avg-R | ROUGE-2 Avg-F |
| Random | 0.140 | 0.205 | 0.167 | 0.031 | 0.044 | 0.037 |
| Sentence Lead | 0.196 | 0.341 | 0.249 | 0.075 | 0.136 | 0.096 |
| LexRank | 0.127 | 0.333 | 0.183 | 0.030 | 0.088 | 0.045 |
| Interwing-sent2vec | 0.208 | 0.315 | 0.250 | 0.069 | 0.116 | 0.086 |
| Dualwing-sent2vec | 0.148 | 0.194 | 0.168 | 0.044 | 0.058 | 0.050 |
| RTE-Sum one wing | 0.137 | 0.385 | 0.202 | 0.048 | 0.143 | 0.072 |
| L2R* [16] | 0.202 | 0.320 | 0.248 | 0.067 | 0.120 | 0.086 |
| CrossL2R* [16] | 0.215 | 0.366 | 0.270 | 0.086 | 0.158 | 0.111 |
| SoRTESum inter wing | 0.189 | 0.389 | 0.255 | 0.071 | 0.158 | 0.098 |
| SoRTESum dual wing | 0.186 | 0.400 | 0.254 | 0.068 | 0.162 | 0.096 |
| System | ROUGE-1 Avg-P | ROUGE-1 Avg-R | ROUGE-1 Avg-F | ROUGE-2 Avg-P | ROUGE-2 Avg-R | ROUGE-2 Avg-F |
| Random | 0.138 | 0.179 | 0.156 | 0.049 | 0.072 | 0.059 |
| LexRank | 0.100 | 0.336 | 0.154 | 0.035 | 0.131 | 0.056 |
| Interwing-sent2vec | 0.177 | 0.222 | 0.197 | 0.055 | 0.071 | 0.062 |
| Dualwing-sent2vec | 0.153 | 0.195 | 0.171 | 0.039 | 0.055 | 0.046 |
| RTE-Sum one wing | 0.145 | 0.277 | 0.191 | 0.054 | 0.089 | 0.067 |
| L2R* [16] | 0.155 | 0.276 | 0.199 | 0.049 | 0.089 | 0.064 |
| CrossL2R* [16] | 0.165 | 0.287 | 0.209 | 0.053 | 0.099 | 0.069 |
| SoRTESum inter wing | 0.154 | 0.289 | 0.201 | 0.051 | 0.104 | 0.068 |
| SoRTESum dual wing | 0.161 | 0.296 | 0.209 | 0.056 | 0.111 | 0.074 |
In addition, the performance on the document side is better than that on the tweet side because comments are usually generated from document content (similarly to [18]), supporting the Reflection hypothesis stated in Sect. 1.
SoRTESum outperforms L2R [16] in both ROUGE-1 and ROUGE-2, on both the document and tweet sides, even though L2R is a supervised method. This shows the efficiency of our approach as well as of the features. On the other hand, our method performs comparably to CrossL2R [16] in both ROUGE-1 and ROUGE-2. This is because (1) CrossL2R is also a supervised method; and (2) the salience score of an instance in [16] was computed as the maximal ROUGE-1 between the instance and the corresponding ground-truth highlight sentences. As a result, this model tends to select sentences and tweets that are highly similar to the highlights, improving the overall performance of the
model. However, even so, our models still obtain a comparable result of 0.255 vs. 0.270 on the document side and the same result of 0.209 on the tweet side for ROUGE-1. For ROUGE-2, although CrossL2R slightly dominates SoRTESum on the document side (0.111 vs. 0.098), on the tweet side SoRTESum conversely outperforms it by 0.5 % (0.074 vs. 0.069). This shows that our approach is also appropriate for tweet summarization and supports our hypothesis stated in Sect. 1.
We discuss some important points of difference with [17] (which uses the same ranking method and the same dataset), owing to experimental settings and re-running the experiments. Firstly, [17] uses IDF-modified-cosine similarity, so the noise of tweets may badly affect the summarization (the same conclusion as [2]). The performance of CrossL2R-T and HGRW-T supports this conclusion (decreasing from 0.295 to 0.293, see [17]). On the other hand, our method combines a set of RTE features that help to avoid the tweet noise; hence, the performance increases from 0.201 to 0.209. In addition, the IDF-modified-cosine similarity needs a large corpus to calculate TF and IDF over a bag of words [2], whereas our approach only requires a single document and its social information to extract important sentences. This shows that our method is insensitive to the number of documents as well as tweets. In addition, new features, e.g., word2vec similarity, can be easily integrated into our model, while adding new features to the IDF-modified-cosine similarity is still an open question. Finally, their work considers the impact of tweet volume and latency for sentence extraction. It is difficult to obtain these values for news comments as well as forum comments. In this sense, our method can be flexibly adapted to other domains. Of course, tweets that did not come from the news sources challenge both methods because there is no content consistency between sentences and tweets. However, we expect that even in this case our method may still be effective, because SoRTESum captures word/token overlap based on a set of features, whereas relying only on IDF-modified-cosine similarity may limit HGRW. On the other hand, both methods are ineffective in dealing with informal tweets, e.g., very short, abbreviated, or ungrammatical tweets. This could possibly be solved by integrating a sophisticated preprocessing step.
The performance of SoRTESum inter wing is the same as that of SoRTESum dual wing on the document side for ROUGE-1 (0.255 vs. 0.254); on the tweet side, however, SoRTESum dual wing dominates SoRTESum inter wing (0.209 vs. 0.201). This is because the score of a tweet in SoRTESum dual wing was calculated by accumulating over the corresponding sentences and the remaining tweets; therefore, long instances obtain higher scores. As a result, the model tends to select longer instances. However, the difference in performance is small.
SoRTESum achieves only a slight improvement of 0.51 % in comparison to Sentence Lead [11] because the highlights were generated by the same mechanism as Sentence Lead, taking the first few sentences and changing some keywords. We expect that the results will change when SoRTESum is evaluated on other datasets where the highlights are selected from the original document instead of being generated as abstracts.
SoRTESum one wing obtains acceptable results (outperforming Random and LexRank), showing the efficiency of our features. Interwing-sent2vec yields comparable ROUGE-1 results on both sides, indicating that vector representation can
[Figure: the role of the two feature groups (d-features vs. s-features), measured by ROUGE-1 and ROUGE-2, on the two sides.]
[Figure: F-score of RTE-Sum inter wing as the d value varies from 0.1 to 0.9.]
In Table 5 (the web interface can be seen at the SoRTESum system12), both models yield the same results in document summarization, in which S1, S2, and S3 are important sentences. Clearly, the content of these sentences relates closely to the highlights, which mention the death of Tamerlan Tsarnaev in the Boston bombing event or information about the college he attended. In contrast, S4, which mentions information about his father, has only light relevance.
In tweet summarization, the two methods generate three identical tweets, and the remaining one differs. The summaries share the same tweet (T1 in SoRTESum inter wing and T3 in SoRTESum dual wing); the other ones are different, which accounts for the difference in summarization performance between the two models. They are quite relevant to this event, but do not directly mention the death of Tamerlan Tsarnaev, e.g., T2. This leads to lower performance for both models.
Finally, although social information can help to improve summary perfor-
mance, other irrelevant data can badly affect the generation. This is because a
score of an instance was calculated by an accumulative mechanism; therefore, common information (sentences or tweets) can achieve a high score. For example,
12 http://150.65.242.101:9293.
Table 5. Summary example of our methods; bold style marks important instances; [+] indicates a strong relevance and [-] a light relevance.

Highlights
+ HL1: Police identified Tamerlan Tsarnaev, 26, as the dead Boston bombing suspect
+ HL2: Tamerlan studied engineering at Bunker Hill Community College in Boston
+ HL3: He was a competitive boxer for a club named Team Lowell

Summary sentences (both methods)
+ S1: Tamerlan Tsarnaev, the 26-year-old identified by police as the dead Boston bombing suspect, called his uncle Thursday night and asked for forgiveness, the uncle said
+ S2: Police have identified Tamerlan Tsarnaev as the dead Boston bombing suspect
+ S3: Tamerlan attended Bunker Hill Community College as a part-time student for three semesters, Fall 2006, Spring 2007, and Fall 2008
- S4: He said Tamerlan has relatives in the United States and his father is in Russia

Summary tweets: RTE-Sum inter wing
+ T1: Before his death Tamerlan Tsarnaev called an uncle and asked for his forgiveness. Said he is married and has a baby
- T2: I proudly say I was the 1st 1 to write this on twitter. Uncle, Tamerlan Tsarnaev called, asked for forgiveness
- T3: So apparently the dead suspect has a wife & baby? And beat his girlfriend enough to be arrested? (same woman?)
+ T4: Tamerlan Tsarnaev ID'd as dead Boston blast suspect - USA Today - USA TODAY, Tamerlan Tsarnaev ID'd as dead

Summary tweets: RTE-Sum dual wing
- T1: I proudly say I was the 1st 1 to write this on twitter. Uncle, Tamerlan Tsarnaev called, asked for forgiveness
- T2: So apparently the dead suspect has a wife & baby? And beat his girlfriend enough to be arrested? (same woman?)
+ T3: Before his death Tamerlan Tsarnaev called an uncle and asked for his forgiveness. Said he is married and has a baby
+ T4: #BostonMarathon bomber Tamerlan called uncle couple of hours before he was shot dead said 'I love you and forgive me
some tweets mention the forgiveness Tamerlan Tsarnaev asked of his uncle, e.g., T2. This obviously does not directly convey the information of Tamerlan Tsarnaev's death, but such tweets received a lot of attention from readers following this event. More importantly, all sentences and tweets in Table 5 contain keywords that relate to the Tamerlan Tsarnaev event. This illustrates the efficiency of our method and suggests that the performance of the models can be improved based on informative phrases, as stated in the Generation hypothesis in Sect. 1.
4 Conclusion
Acknowledgment. We would like to thank Preslav Nakov and Wei Gao for useful discussions and insightful comments on earlier drafts, and Chien-Xuan Tran for building the web interface. We also thank the anonymous reviewers for their detailed comments for improving our paper. This work was partly supported by JSPS KAKENHI Grant number 3050941.
References
1. Dagan, I., Dolan, B., Magnini, B., Roth, D.: Recognizing textual entailment: ratio-
nal, evaluation and approaches - erratum. Nat. Lang. Eng. 16(1), 105–105 (2010)
2. Erkan, G., Radev, D.R.: Lexrank: graph-based lexical centrality as salience in text
summarization. J. Artif. Intell. Res. 22, 457–479 (2004)
3. Gao, W., Li, P., Darwish, K.: Joint topic modeling for event summarization across
news, social media streams. In: CIKM, pp. 1173–1182 (2012)
4. Meishan, H., Sun, A., Lim, E.-P.: Comments-oriented blog summarization by sen-
tence extraction. In: CIKM, pp. 901–904 (2007)
5. Meishan, H., Sun, A., Lim, E.-P.: Comments-oriented document summarization:
understanding document with readers’ feedback. In: SIGIR, pp. 291–298 (2008)
6. Po, H., Sun, C., Longfei, W., Ji, D.-H., Teng, C.: Social summarization via auto-
matically discovered social context. In: IJCNLP pp. 483–490 (2011)
7. Huang, L., Li, H., Huang, L.: Comments-oriented document summarization based
on multi-aspect co-feedback ranking. In: Wang, J., Xiong, H., Ishikawa, Y., Xu, J.,
Zhou, J. (eds.) WAIM 2013. LNCS, vol. 7923, pp. 363–374. Springer, Heidelberg
(2013)
8. Lin, C.-Y., Hovy, E.H.: Automatic evaluation of summaries using n-gram co-
occurrence statistics. In: HLT-NAACL, pp. 71–78 (2003)
9. Yue, L., Zhai, C.X., Sundaresan, N.: Rated aspect summarization of short com-
ments. In: WWW, pp. 131–140 (2009)
10. Luhn, H.P.: The automatic creation of literature abstracts. IBM J. Res. Dev. 2(2),
159–165 (1958)
11. Nenkova, A.: Automatic text summarization of newswire: lessons learned from the
document understanding conference. In: AAAI pp. 1436–1441 (2005)
12. Nguyen, M.-T., Ha, Q.-T., Nguyen, T.-D., Nguyen, T.-T., Nguyen, L.-M.: Recog-
nizing textual entailment in vietnamese text: an experimental study. In: KSE
(2015). doi:10.1109/KSE.2015.23
13. Nguyen, M.-T., Kitamoto, A., Nguyen, T.-T.: TSum4act: a framework for retriev-
ing and summarizing actionable tweets during a disaster for reaction. In: Cao, T.,
Lim, E.-P., Zhou, Z.-H., Ho, T.-B., Cheung, D., Motoda, H. (eds.) PAKDD 2015.
LNCS, vol. 9078, pp. 64–75. Springer, Heidelberg (2015)
14. Porter, M.F.: Snowball: a language for stemming algorithms (2011)
15. Wan, X., Yang, J.: Multi-document summarization using cluster-based link analy-
sis. In: SIGIR, pp. 299–306 (2008)
16. Wei, Z., Gao, W.: Utilizing microblogs for automatic news highlights extraction.
In: COLING, pp. 872–883 (2014)
17. Wei, Z., Gao, W.: Gibberish, assistant, or master? Using tweets linking to news
for extractive single-document summarization. In: SIGIR, pp. 1003–1006 (2015)
18. Yang, Z., Cai, K., Tang, J., Zhang, L., Zhong, S., Li, J.: Social context summa-
rization. In: SIGIR, pp. 255–264 (2011)
A Graph-Based Approach to Topic Clustering
for Online Comments to News
1 Introduction
Online news outlets attract large volumes of comments every day. The Huffington
Post, for example, received an estimated 140,000 comments in a 3 day period1 ,
while The Guardian has reported receiving 25,000 to 40,000 comments per day2 .
These figures suggest that online commenting forums are important for readers
as a means to share their opinions on recent news. The resulting vast number of
comments and information they contain makes them relevant to multiple stake-
holders in the media business. All user groups involved in online commenting on
news would profit from easier access to the multiple topics discussed within a
large set of comments. For example, comment posters would be able to gain
1 http://goo.gl/3f8Hqu.
2 http://www.theguardian.com/commentisfree/2014/aug/10/readers-editor-online-abuse-women-issues.
abstract labels are created for topics, which in turn are projected onto comment
clusters.
The paper is organised as follows. Section 2 reviews relevant previous work.
In Sect. 3 we describe the dataset we work with, which we downloaded from
The Guardian online news portal. Section 4 discusses our clustering and cluster
labelling approaches. The experimental setup for the evaluation of the proposed
methods on Guardian data is reported in Sect. 5. The results are reported in
Sect. 6 and discussed in Sect. 7. In Sect. 8 we conclude the paper and outline
directions for future work.
2 Related Work
2.1 Comment Clustering
4 Soft clustering methods allow one data item to be assigned to multiple clusters.
5 I.e. one comment can be assigned to only one cluster.
6 http://tagme.di.unipi.it/.
3 Data
metric may be biased by the exact match as found in the quotes and may not be
sensitive enough to capture similarity in comments that do not contain exactly
matching quotes. For this reason, we expect that clustering results will be better
if quotes are removed from the comments before computing similarity. To test
this assumption we created two sets of training data. In the first set positive
instances are comment pairs where the quote is left in the comments. In the sec-
ond set positive instances are pairs of comments where we removed the shared
quotes from the comments. For both training sets we set the topical similarity
measure for each positive instance to be
quoteScore = \frac{len(quote_{C_1}) + len(quote_{C_2})}{2 \cdot len(sentence)} \qquad (1)
as the outcome. len(X) returns the length of X in words and quoteCi is the
segment of comment Ci quoted from sentence in the original article. When
computing the quoteScore we make sure that the quoted sentence has at least
10 words. We add comment pairs to the positive training data whose quoteScore
values are >= 0.5 – a value we obtained empirically.
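Reading Eq. (1) as the combined quote length normalized by twice the length of the quoted article sentence, positive-pair selection can be sketched as below; the inputs are plain strings, which is a simplification of the actual pipeline.

```python
def word_len(text):
    """Length of a text in words."""
    return len(text.split())

def quote_score(quote_c1, quote_c2, sentence):
    """Eq. (1): combined length of the two quoted segments, normalized by
    twice the length of the quoted article sentence."""
    return (word_len(quote_c1) + word_len(quote_c2)) / (2.0 * word_len(sentence))

def is_positive_pair(quote_c1, quote_c2, sentence,
                     min_sentence_words=10, threshold=0.5):
    """Keep a comment pair as a positive instance only if the quoted sentence
    has at least 10 words and the quoteScore reaches the empirical 0.5 cut-off."""
    if word_len(sentence) < min_sentence_words:
        return False
    return quote_score(quote_c1, quote_c2, sentence) >= threshold
```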
The negative instances are created by pairing randomly selected comments
from two different articles from The Guardian. They are used to present the
linear regression algorithm with the instances of comment pairs that are not
on the same topic or are only weakly topically related. The topical similarity
measure for each such pair was set to 0. We have in total 14,700 positive pairs
and the same number of negative instances.
4 Methods
4.1 Graph-based Clustering
Our graph-based clustering approach is based on the Markov Cluster Algorithm
(MCL) [20] shown in Algorithm 1. The nodes (V ) in the graph G(V, E, W ) are
the comments. Edges (E) are created between the nodes and have associated
weights (W ). Each comment is potentially connected to every other comment
using an undirected edge. An edge is present if the associated weight is greater
than 0. Such a graph may be represented as a square matrix M of order |V |,
whose rows and columns correspond to nodes in the graph and whose cell values
mi,j , where mi,j > 0, indicate the presence of an edge of weight mi,j between
nodes Vi and Vj . Following the recommendation in [20] we link all nodes to
themselves with mi,i = 1. Other edge weights are computed based on comment-
comment similarity features described in the next section below.
Once such a graph is constructed, MCL repeats steps 11–13 in the Algorithm until the maximum number of iterations iter is reached7. First, in step 11, the matrix is normalized and transformed into a column-stochastic matrix; it is next expanded (step 12) and finally inflated (step 13). The expansion operator is responsible for allowing flow to connect different regions of the graph. The inflation operator is responsible for both strengthening and weakening this flow. These two operations are controlled by two parameters: the power p (values above 2 result in too few clusters) and the inflation parameter r (values of 2 or more result in too many clusters). After some experimentation we set p to 2 and r to 1.5, as these resulted in a good balance between too many and too few clusters.
Once MCL terminates, the clusters are read off the rows of the final matrix
(step 15 in Algorithm 1). For each row i in the matrix the comments in columns
j are added to cluster i if the cell value Mi,j > 0 (the rows for items that belong
7 MCL runs a predefined number of iterations. We ran MCL with 5000 iterations.
to the same cluster will each redundantly specify that cluster). In this setting the
MCL algorithm performs hard clustering, i.e. assigns each comment to exactly
one cluster.
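A compact NumPy sketch of the MCL loop just described (normalize to a column-stochastic matrix, expand with the power p, inflate with r, then read clusters off the non-zero rows). This is a generic illustration of the algorithm rather than the authors' implementation, and it omits the convergence checks and pruning a production version would add.

```python
import numpy as np

def mcl(M, p=2, r=1.5, iterations=100):
    """Markov Cluster Algorithm on a weighted adjacency matrix M
    (self-loops already set to 1, as recommended in the text)."""
    M = np.array(M, dtype=float)
    for _ in range(iterations):
        M = M / M.sum(axis=0, keepdims=True)   # step 11: column-stochastic
        M = np.linalg.matrix_power(M, p)       # step 12: expansion
        M = M ** r                             # step 13: inflation
    clusters = []                              # step 15: read off the rows
    for row in M:
        members = frozenset(np.nonzero(row > 1e-9)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters
```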
– Same Thread: Returns 1 if both C1 and C2 are within the same thread, otherwise 0.
– Reply Relationship: If C1 replies to C2 (or vice versa) this feature returns 1, otherwise 0. The reply relationship is transitive, so that the reply is not necessarily direct; instead it holds: reply(Cx, Cy) ∧ reply(Cy, Cz) ⇒ reply(Cx, Cz)
To obtain the weights we train a linear regression9 model using training data
derived from news articles and comments as described in Sect. 3.1 above. The
target value for positive instances is the value of quoteScore from Eq. 1 and for
negative instances is 0.
We create an edge within the graph between comments Ci and Cj with
weight wi,j = Sim Score(Ci , Cj ) if Sim Score is above 0.3, a minimum similar-
ity threshold value set experimentally.
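Putting the pieces together, the graph construction could look roughly like the sketch below: a regression model trained on quoteScore-labelled pairs predicts Sim_Score for each comment pair, and an edge is created only when the prediction clears the 0.3 threshold. The comment fields (thread, id, reply_chain) and the word_overlap feature are assumptions standing in for the full feature set, which is not listed in full in this excerpt.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def word_overlap(a, b):
    """Simple stand-in for the text-similarity features of the full model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def pair_features(c1, c2):
    """Feature vector for a comment pair; only Same Thread and Reply
    Relationship come from the text, the rest is a placeholder."""
    same_thread = 1.0 if c1["thread"] == c2["thread"] else 0.0
    replies = 1.0 if (c2["id"] in c1.get("reply_chain", set())
                      or c1["id"] in c2.get("reply_chain", set())) else 0.0
    return [same_thread, replies, word_overlap(c1["text"], c2["text"])]

def build_graph(comments, model, threshold=0.3):
    """Adjacency matrix for MCL: self-loops of 1, plus an edge wherever the
    predicted Sim_Score exceeds the minimum similarity threshold."""
    n = len(comments)
    M = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            score = model.predict([pair_features(comments[i], comments[j])])[0]
            if score > threshold:
                M[i, j] = M[j, i] = score
    return M

# The regression model is trained on pairs labelled with quoteScore (positive)
# or 0 (negative), e.g. model = LinearRegression().fit(X_train, y_train)
```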
We aim to create abstractive cluster labels since abstractive labels can be more
meaningful and can capture a more holistic view of a comment cluster than words
or phrases extracted from it. We adopt the graph-based topic labelling algorithm
of Hulpus et al. [8], which uses DBPedia [3], and modify it for comment cluster
labelling.
Our use of the Hulpus et al. method proceeds as follows. An LDA model,
trained on a large collection of Guardian news articles, plus their associated
comments, was used to assign 5 (most-probable) topics to each cluster.10 A
separate label is created for each such topic, by using the top 10 words of the topic
(according to the LDA model) to look up corresponding DBPedia concepts.11
The individual concept graphs so-identified are then expanded using a restricted
set of DBPedia relations,12 and the resulting graphs merged, using the DBPedia
merge operation. Finally, the central node of the merged graph is identified,
9 We used the Weka (http://www.cs.waikato.ac.nz/ml/weka/) implementation of linear regression.
10 The number of topics (k) to assign was determined empirically, i.e. we varied 2 < k < 10, and chose k = 5 based on the clarity of the labels generated.
11 We take the most-common sense. The 10-word limit is to reduce noise. Fewer than 10 DBPedia concepts may be identified, as not all topic words have an identically-titled DBPedia concept.
12 To limit noise, we reduce the relation set c.f. Hulpus et al. to include only skos:broader, skos:broaderOf, rdfs:subClassOf, rdfs. Graph expansion is limited to two hops.
providing the label for the topic.13 The intuition is that the label thus obtained should encompass all the abstract concepts that the topic represents.14 Thus,
for example, a DBPedia concept set such as {Atom, Energy, Electron, Quantum,
Orbit, Particle} might yield a label such as Theoretical Physics.
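The final step of the labelling pipeline, merging the per-word concept graphs and picking the most central node, can be sketched with networkx as follows. Retrieving the two-hop DBPedia neighbourhoods (restricted to the relations in footnote 12) is assumed to have been done already; concept_graphs is that hypothetical pre-fetched input, and closeness centrality is used as in footnote 13.

```python
import networkx as nx

def label_topic(concept_graphs):
    """Merge the DBPedia concept graphs obtained for a topic's top-10 words and
    return the closeness-central node as the topic label."""
    merged = nx.Graph()
    for g in concept_graphs:            # one graph per matched topic word
        merged = nx.compose(merged, g)
    if merged.number_of_nodes() == 0:
        return None
    centrality = nx.closeness_centrality(merged)
    return max(centrality, key=centrality.get)

def label_cluster(per_topic_concept_graphs):
    """One label term per LDA topic; the overall cluster label is the set of
    (up to) 5 terms, one for each of the cluster's topics."""
    return [label_topic(graphs) for graphs in per_topic_concept_graphs]
```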
5 Experiments
We compare our graph-based clustering approach against LDA which has been
established as a successful method for comment clustering when compared to
alternative methods (see Sect. 2). We use two different LDA models: LDA1 and
LDA2 15 . The LDA1 model is trained on the entire data set described in Sect. 3.1.
In this model we treat the news article and its associated comments as a single
document. This training data set is large and contains a variety of topics. When
we require the clustering method to identify a small number of topics, we expect
these to be very general, so that the resulting comment clusters are less homo-
geneous than they would be if only comments of a single article are considered
when training the LDA model, as do Llewellyn et al. [14].
Therefore we also train a second LDA model (LDA2), which replicates the
setting reported in Llewellyn et al. [14]. For each test article we train a separate
LDA2 model on its comments. In training we include the entire comment set for
each article in the training data, i.e. both the first 100 comments that are clus-
tered and summarised by human annotators, as well as the remaining comments
not included in the gold standard. In building LDA2 we treated each comment
in the set as separate document.
LDA requires a predetermined number of topics. We set the number of topics
to 9 since the average number of clusters within the gold standard data is 8.97.
We use 9 topics within both LDA1 and LDA2. Similar to Llewellyn et al. [14]
we also set the α and β parameters to 5 and 0.01 respectively for both models.
Once the models are generated they are applied to the test comments for
which we have gold standard clusters. LDA distributes the comments over the
pre-determined number of topics using probability scores. Each topic score is the
probability that the given comment was generated by that topic. Like [14] we
select the most probable topic/cluster for each comment. Implemented in this
way, the LDA model performs hard clustering.
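A sketch of this hard-clustering LDA baseline, written with gensim rather than the JGibbLDA implementation used in the paper (footnote 15), so the parameterisation is only approximately equivalent; α = 5 and β = 0.01 are passed as symmetric priors and each comment is assigned to its single most probable topic.

```python
from gensim import corpora, models

def lda_hard_clusters(tokenized_docs, num_topics=9, alpha=5.0, beta=0.01):
    """Train LDA and assign each document (comment) to its most probable topic."""
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                          alpha=[alpha] * num_topics, eta=beta, passes=10)
    clusters = {}
    for i, bow in enumerate(corpus):
        topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
        best_topic = max(topic_dist, key=lambda pair: pair[1])[0]
        clusters.setdefault(best_topic, []).append(i)
    return clusters
```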
For evaluation the automatic clusters are compared to the gold standard
clusters described in Sect. 3.2. Amigo et al. [2] discuss several metrics to evalu-
ate automatic clusters against the gold standard data. However, these metrics
13 Several graph-centrality metrics were explored: betweenness centrality, load centrality, degree centrality, and closeness centrality, of which the last was used for the results reported here.
14 Hulpus et al. [8] merge together the graphs of multiple topics, so as to derive a single label to encompass them. We have found it preferable to provide a separate label for each topic, i.e. so the overall label for a cluster comprises 5 label terms for the individual topics.
15 We use the LDA implementation from http://jgibblda.sourceforge.net/.
are tailored for hard clustering. Although our graph-based approach and base-
line LDA models perform hard clustering, the gold standard data contains soft
clusters. Therefore, the evaluation metric needs to be suitable for soft-clustering.
In this setting hard clusters are regarded as a special case of possible soft clus-
ters and will likely be punished by the soft-clustering evaluation method. We use
fuzzy BCubed Precision, Recall and F-Measure metrics reported in [7,9]. Accord-
ing to the analysis of formal constraints that a cluster evaluation metric needs to
fulfill [2], fuzzy BCubed metrics are superior to Purity, Inverse Purity, Mutual
Information, Rand Index, etc., as they fulfill all the formal cluster constraints: cluster homogeneity, completeness, rag bag, and cluster size versus quantity. The
fuzzy metrics are also applicable to hard clustering.
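As a point of reference, the sketch below computes extended BCubed precision and recall for overlapping clusterings in the spirit of Amigo et al. [2]; the exact fuzzy variant of [7, 9] used in the paper may differ in its details, so this is meant only to convey the idea of the metric. Hard clustering is the special case where every item maps to a single-element set of labels.

```python
def extended_bcubed(system, gold):
    """Extended BCubed precision/recall/F1 for overlapping clusterings.
    `system` and `gold` map each item to the set of cluster/category labels
    it belongs to."""
    items = list(system)

    def directed_score(a, b):
        # Average multiplicity score min(|shared a|, |shared b|) / |shared a|
        # over the items e2 sharing at least one a-label with each item e.
        per_item = []
        for e in items:
            pair_scores = []
            for e2 in items:
                shared_a = a[e] & a[e2]
                if not shared_a:
                    continue
                shared_b = b[e] & b[e2]
                pair_scores.append(min(len(shared_a), len(shared_b)) / len(shared_a))
            per_item.append(sum(pair_scores) / len(pair_scores))
        return sum(per_item) / len(per_item)

    precision = directed_score(system, gold)
    recall = directed_score(gold, system)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```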
To evaluate the association of comment clusters with labels created by the
cluster labelling algorithm, we create an annotation task by randomly selecting
22 comment clusters along with their system generated labels. In the annotation
bench for each comment cluster label, three random clusters are chosen along
with the comment cluster for which the system generated the label. Three anno-
tators (A, B, C) are chosen for this task. Annotators are provided with a cluster
label and asked to choose the comment cluster that best describes the label from
a list of four comment clusters. As the comment clusters are chosen at random,
the label can correspond to more than one comment cluster. The annotators
are free to choose more than one instance for the label, provided it abstracts the
semantics of the cluster in some form.
In some instances, the comment label can be too generic or even very abstract.
It can happen that a label does not correspond to any of the comment clusters.
In such cases, the annotators are asked not to select any clusters. These instances
are marked NA (not assigned) by the annotation bench. Inter-annotator agree-
ment is measured using Fleiss' Kappa [6]. We report overall agreement
as well as agreement between all pairs of annotators. The output of the clus-
ter labelling algorithm is then evaluated with the annotated set using standard
classification metrics.
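A minimal sketch of the agreement computation is given below, assuming the annotation decisions are collected in a simple labels-by-annotators matrix and using the Fleiss' Kappa implementation from statsmodels; the annotation bench itself is not described at this level of detail in the paper.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def annotation_agreement(decisions):
    """decisions: (n_labels, n_annotators) array of categorical choices
    (chosen cluster id, or 'NA' when no cluster was selected)."""
    table, _ = aggregate_raters(np.asarray(decisions))  # labels x categories counts
    return fleiss_kappa(table)

# Pairwise agreement (e.g. between annotators B and C) can be obtained by
# passing only the corresponding two columns of `decisions`.
```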
6 Results
Clustering results are shown in Table 1. A two-tailed paired t-test was performed
for a pairwise comparison of the fuzzy BCubed metrics across all four automatic
systems and the human-to-human setting.
Firstly, we observe that human-to-human clusters are significantly better
than each of the automatic approaches in all evaluation metrics.16 Furthermore,
we cannot confirm our hypothesis that the graph-based approach trained on the
training data with quotes removed performs better than the one that is trained
on data with quotes intact.17 Although the results in the quotes removed condi-
tion are better for all metrics, none of the differences is statistically significant.
We use the better performing model (graph without quotes) for comparisons
with other automatic methods.
16 The difference in these results is significant at the Bonferroni-corrected level of
significance of p < 0.0125, adjusted for the 4-way comparison between the human-to-
human and all automatic conditions.
Table 1. Cluster evaluation results. The scores shown are macro averaged. For all
systems the metrics are computed relative to the average scores over Human1 and
Human2. graphHuman indicates the setting where the similarity model for the
graph-based approach is trained with quotes included in the comments (see Sect. 4.1).
Table 2. Annotator agreement (Fleiss' Kappa) for comment labelling over 22 comment
clusters.
Secondly, the LDA1 baseline performs significantly better than the re-
implementation of previous work, LDA2, in all metrics. This indicates that train-
ing the LDA model on the larger data set is superior to training it on a small set
of articles and their comments, despite the generality of topics that arises from
compressing the topics of all articles into 9 topic clusters for LDA1.
Finally, the quotes removed graph-based approach (column 4 in Table 1) sig-
nificantly outperforms the better performing LDA1 baseline in all metrics. This
indicates that the graph-based method is superior to LDA, which has been iden-
tified as the best performing method in several previous studies (cf. Sect. 2). In
addition, clustering comments using graph-based methods removes the need for
prior knowledge about the number of topics – a property of the news comment
domain that cannot be accommodated by LDA topic modelling.
Tables 2 and 3 present results from the evaluation of the automatically gen-
erated comment cluster labels. Table 2 shows the agreement between pairs of
annotators and overall, as measured by Fleiss’ Kappa on the decision: given the
label, which cluster does it describe best. Overall there is a substantial agreement
of κ = 0.61 between the three annotators. The annotator pair B-C, however,
achieves only moderate agreement of κ = 0.45, suggesting that some annota-
tors make idiosyncratic choices when assigning more generic abstractive labels
to clusters.
17 We apply both models to comments regardless of whether they contain quotes or
not. However, in the case of graph-Human-quotesRemoved, we make sure that the
comments containing quotes are made quote-free before the model is applied to the
test data.
Table 3. Evaluation results of the cluster labeling system for each of the 3 annotators.
NA corresponds to the number of labels not assigned.
Table 3 shows the evaluation scores for the automatically generated labels,
given as precision, recall and F scores results, along with the percentage of labels
not assigned (NA) to any cluster. Overall, annotators failed to assign labels to
any cluster in 40.9 % of cases. In the remaining cases, where annotators did
assign labels to clusters, this was done with fairly high precision (0.8) and an
overall average recall of 0.5, suggesting that meaningful labels had been created.
7 Discussion
The comment clustering results demonstrate that graph-based clustering is able
to outperform the current state-of-the-art method LDA as implemented in previ-
ous work at the task of clustering readers’ comments on online news into topics.
In addition to the quantitative study reported above we also performed a
qualitative analysis of the results of the graph-based clustering approach. That
analysis reveals that disagreements in human and automatic assignment of com-
ments to clusters are frequently due to the current approach largely ignoring
conversational structure and treating each comment as an independent docu-
ment. Commenting forums, however, are conversations and as such they exhibit
internal structuring where two comments are functionally related to each other,
so that the first pair part (FPP) makes relevant the second pair part (SPP).
In our automatic clusters we frequently found answers, questions, responses to
compliments and other stand-alone FPPs or SPPs that were unrelated to the
rest of an otherwise homogeneous cluster. For example, the comment “No, just
describing another right wing asshole.” is found as the only odd comment in an
otherwise homogeneous cluster of comments about journalistic standards in political
reporting. Its FPP “Wait, are you describing Hillary Clinton?” is assigned to a
different cluster about the careers of US politicians. We assume that our reply-
relationship feature was not weighted strongly enough to account for this, so we
need to consider alternative ways of training that can help identify conversational
functional pairs.
A further source of clustering disagreements is the fact that humans cluster
both according to content and to the conversational action a comment performs,
while the current system only clusters according to a comment’s content. There-
fore, humans have clusters labelled jokes, personal attacks on commenters, or empty
sarcasm, support, etc., in addition to the clusters with content labels. A few com-
ments have been clustered by the annotators along both dimensions, content and
action, and can be found in multiple clusters (soft clustering). Our graph-based
method reported in this work produces hard clusters and is as such compara-
ble with the relevant previous work. However, we have not addressed the soft-
clustering requirement of the domain and gold standard data, which is most
likely partly reflected in the difference between human and automatic clus-
tering results. When implementing soft clustering in future work, one way to
proceed would be to add automatic recognition of a comment’s conversational
action, which would make graph-based clustering more human-like and therefore
more directly comparable to the gold standard data we have.
Our evaluation of cluster labelling reveals that even though the labelling
system has acceptable precision, recall is rather low, due, in large part, to the
high number of NA labels. We qualitatively analysed those instances that were
NA for more than one annotator. Barring three instances, where the system
generated labels like concepts in Metaphysics, Chemical elements, Water with
no obvious connection to the underlying cluster content, labels generated by
the system describe the cluster in a meaningful way. However, in some cases
annotators failed to observe the connection between the comment cluster and
the label. This may be due to the fact that users expect a different level of
granularity – either more general or more specific – for labeling. For instance, a
comment talking about a dry, arid room can have a label like laconium but users
may prefer having a label that corresponds to dryness. This is very subjective
and poses a problem for abstractive labelling techniques in general.
The expansion of a graph using DBPedia relations encompasses related con-
cepts. However, this expansion can also include abstract labels like Construction,
organs, monetary economics, Articles containing video clips, etc. This happens
due to the merging of sub-graphs representing concepts that are too close to the
abstract concepts. In these cases, the most common abstract node may get selected
as the label. These nodes can be detrimental to the quality of the labels. This can
be prevented by a controlled expansion using a more restrictive set of DBPedia
relations and by controlled merging.
8 Conclusion
We have presented graph-based approaches for the task of assigning reader com-
ments on online news into labeled topic clusters. Our graph-based method is a
novel approach to comment clustering, and we demonstrate that it is superior
to LDA topic modeling – currently the best performing approach as reported
in previous work. We model the similarity between graph nodes (comments) as
a linear combination of different similarity features and train the linear regres-
sion model on an automatically generated training set consisting of comments
containing article quotations.
For cluster labeling we implement a graph-based algorithm that uses DBPe-
dia concepts to produce abstractive labels that generalise over the content of clus-
ters in a meaningful way, thus enhancing readability and relevance of the labels.
User evaluation results indicate that there is scope for improvement, although
in general the automatic approach produces meaningful labels as judged by
human annotators.
Our future work will address soft-clustering, improve feature weighting and
investigate new features to better model conversational structure and dialogue
pragmatics of the comments. Furthermore, we aim to create better training data.
Currently, the quote-based approach for obtaining positive training instances
yields few comment pairs that stand in a reply relationship – a comment replying
to a previous comment is unlikely to quote the same sentence in the article, and
thus comment pairs where one comment replies to the other rarely make it into
the training data. As a result, our regression model does not give much weight
to the reply feature, even though this feature is very likely to indicate that comments
belong to the same topical structure. Finally, we also aim to improve the current
DBPedia-based labeling approach, as well as explore alternative approaches to
abstractive labeling to make cluster labels more appropriate.
References
1. Aker, A., Kurtic, E., Hepple, M., Gaizauskas, R., Di Fabbrizio, G.: Comment-to-
article linking in the online news domain. In: Proceedings of MultiLing, SigDial
2015 (2015)
2. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering
evaluation metrics based on formal constraints. Inf. Retrieval 12(4), 461–486 (2009)
3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia:
a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC 2007 and ISWC
2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
5. Carpineto, C., Osiński, S., Romano, G., Weiss, D.: A survey of web clustering
engines. ACM Comput. Surv. (CSUR) 41(3), 17 (2009)
6. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull.
76(5), 378 (1971)
7. Hüllermeier, E., Rifqi, M., Henzgen, S., Senge, R.: Comparing fuzzy partitions: a
generalization of the rand index and related measures. IEEE Trans. Fuzzy Syst.
20(3), 546–556 (2012)
8. Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic
labelling using DBpedia. In: Proceedings of the Sixth ACM International Conference
on Web Search and Data Mining, WSDM 2013, pp. 465–474, NY, USA (2013).
http://doi.acm.org/10.1145/2433396.2433454
9. Jurgens, D., Klapaftis, I.: SemEval-2013 task 13: word sense induction for graded
and non-graded senses. In: Second Joint Conference on Lexical and Computational
Semantics (* SEM), vol. 2, pp. 290–299 (2013)
10. Khabiri, E., Caverlee, J., Hsu, C.F.: Summarizing user-contributed comments. In:
ICWSM (2011)
11. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic
models. In: Proceedings of the 49th Annual Meeting of the Association for Com-
putational Linguistics: Human Language Technologies, vol. 1, pp. 1536–1545. Asso-
ciation for Computational Linguistics (2011)
12. Lau, J.H., Newman, D., Karimi, S., Baldwin, T.: Best topic word selection for topic
labelling. In: Proceedings of the 23rd International Conference on Computational
Linguistics: Posters, pp. 605–613. Association for Computational Linguistics (2010)
13. Liu, C., Tseng, C., Chen, M.: IncreSTS: towards real-time incremental short text
summarization on comment streams from social network services. IEEE Trans.
Knowl. Data Eng. 27, 2986–3000 (2015)
14. Llewellyn, C., Grover, C., Oberlander, J.: Summarizing newspaper comments. In:
Eighth International AAAI Conference on Weblogs and Social Media (2014)
15. Ma, Z., Sun, A., Yuan, Q., Cong, G.: Topic-driven reader comments summarization.
In: Proceedings of the 21st ACM International Conference on Information and
Knowledge Management, pp. 265–274. ACM (2012)
16. Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadina, I., Tadić, M., Gornostay, T.:
Term extraction, tagging, and mapping tools for under-resourced languages. In:
Proceedings of the 10th Conference on Terminology and Knowledge Engineering
(TKE 2012), pp. 20–21 (2012)
17. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence mea-
sures. In: Proceedings of the Eighth ACM International Conference on Web Search
and Data Mining, pp. 399–408. ACM (2015)
18. Salton, G., Lesk, E.M.: Computer evaluation of indexing and text processing. J.
ACM 15, 8–36 (1968)
19. Scaiella, U., Ferragina, P., Marino, A., Ciaramita, M.: Topical clustering of search
results. In: Proceedings of the Fifth ACM International Conference on Web Search
and Data Mining, pp. 223–232. ACM (2012)
20. Van Dongen, S.M.: Graph clustering by flow simulation (2001)
Leveraging Semantic Annotations to Link
Wikipedia and News Archives
1 Introduction
In today’s digital age, the global news industry is going through a drastic shift
with a substantial increase in online news consumption. With new affordable
devices available, general users can easily and instantly access online digital
news archives using broadband networks. As a side effect, this ease of access
to overwhelming amounts of information makes it difficult to obtain a holistic
view on past events. There is thus an increasing need for more meaningful and
effective representations of online news data (typically collections of digitally
published news articles).
The free encyclopedia Wikipedia has emerged as a prominent source of infor-
mation on past events. Wikipedia articles tend to summarize past events by
abstracting from fine-grained details that mattered when the event happened.
Entity profiles in Wikipedia contain excerpts that describe events that are sem-
inal to the entity. As a whole, they give contextual information and can help to
build a good understanding of the causes and consequences of the events.
Online news articles are published contemporarily to the events and report
fine-grained details by covering all angles. These articles have been preserved for
a long time as part of our cultural heritage through initiatives taken by media
houses, national libraries, or efforts such as the Internet Archive. The archives
of The New York Times, as a concrete example, go back to 1851.
Table 1. Example Wikiexcerpts.
No. Wikiexcerpt
1 Jaber Al-Ahmad Al-Sabah: After much discussion of a border dispute between Kuwait
and Iraq, Iraq invaded its smaller neighbor on August 2, 1990 with the stated intent of
annexing it. Apparently, task of the invading Iraqi army was to capture or kill Sheikh Jaber.
2 Guam: The United States returned and fought the Battle of Guam on July 21, 1944, to
recapture the island from Japanese military occupation. More than 18,000 Japanese were
killed as only 485 surrendered. Sergeant Shoichi Yokoi, who surrendered in January 1972,
appears to have been the last confirmed Japanese holdout in Guam.
2 Model
Each document d in our document collection C consists of a textual part dtext , a
temporal part dtime , a geospatial part dspace , and an entity part dentity . As a bag
of words, dtext is drawn from a fixed vocabulary V derived from C. Similarly,
dtime , dspace and dentity are bags of temporal expressions, geolocations, and
named-entity mentions respectively. We sometimes treat the entire collection
C as a single coalesced document and refer to its corresponding parts as Ctext ,
Ctime , Cspace , and Centity . In our approach, we use the Wikipedia Current Events
Portal1 to distinguish event-specific terms by coalescing its event descriptions into
a single document devent . A time unit or chronon τ indicates the time passed since
(or still to pass until) a reference date such as the UNIX epoch. A temporal expression
t is an interval [tb, te] ∈ T × T , in time domain T , with begin time tb and end time te.
Moreover, a temporal expression t is described as a quadruple [tbl , tbu , tel , teu ] [5]
where tbl and tbu give the plausible bounds for begin time tb, and tel and teu give the
bounds for end time te. A geospatial unit l refers to a geographic point that is
represented in the geodetic system in terms of latitude (lat) and longitude (long).
A geolocation s is represented by its minimum bounding rectangle (MBR) and
is described as a quadruple [tp, lt, bt, rt]. The first point (tp, lt) specifies the top-
left corner, and the second point (bt, rt) specifies the bottom-right corner of the
MBR. We fix the smallest MBR by setting the resolution [resollat × resollong ]
of space. A named entity e refers to a location, person, or organization from the
YAGO [15] knowledge base. We use YAGO URIs to uniquely identify each entity
in our approach. A query q is derived from a given Wikiexcerpt in the following
way: the text part qtext is the full text, the temporal part qtime contains explicit
temporal expressions that are normalized to time intervals, the geospatial part
qspace contains the geolocations, and the entity part qentity contains the named
entities mentioned. To distinguish contextual terms, we use the textual content
of the source Wikipedia article of a given Wikiexcerpt and refer to it as dwiki .
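A minimal sketch of this four-part representation is given below; the class and attribute names are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List

@dataclass
class TemporalExpression:       # quadruple [tb_l, tb_u, te_l, te_u] of Sect. 2
    tb_lower: int
    tb_upper: int
    te_lower: int
    te_upper: int

@dataclass
class Geolocation:              # minimum bounding rectangle [tp, lt, bt, rt]
    top: float
    left: float
    bottom: float
    right: float

@dataclass
class AnnotatedDocument:        # text, time, space, and entity parts of a document or query
    text: Counter = field(default_factory=Counter)               # bag of words
    time: List[TemporalExpression] = field(default_factory=list)
    space: List[Geolocation] = field(default_factory=list)
    entity: Counter = field(default_factory=Counter)             # YAGO URIs -> counts
```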
3 Approach
In our approach, we design a two-stage cascade retrieval model. In the first stage,
our approach performs an initial round of retrieval with the text part of the query
to retrieve top-K documents. It then treats these documents as pseudo-relevant
and expands the temporal, geospatial, and entity parts of the query. Then, in the
second stage, our approach builds independent query models using the expanded
query parts, and re-ranks the initially retrieved K documents based on their
divergence from the final integrated query model. As output it then returns
top-k documents (k < K). Intuitively, by using pseudo-relevance feedback to
expand query parts, we cope with overly specific (and sparse) annotations in the
original query and instead consider those that are salient to the query event for
estimating the query models.
1 http://en.wikipedia.org/wiki/Portal:Currentevents.
For our linking task, we extend the KL-divergence framework [27] to the text,
time, geolocation, and entity dimensions of the query and compute an overall
divergence score. This is done in two steps: First, we independently estimate a
query model for each of the dimensions. Let Qtext be the unigram query-text
model, Qtime be the query-time model, Qspace be the query-space model, and
Qentity be the query-entity model. Second, we represent the overall query model
Q as a joint distribution over the dimensions and exploit the additive property
of the KL-divergence to combine divergence scores for the query models as
KL(Q || D) = KL(Qtext || Dtext ) + KL(Qtime || Dtime ) + KL(Qspace || Dspace ) + KL(Qentity || Dentity ). (1)
In the above equation, analogous to the query, the overall document model D is
also represented as the joint distribution over Dtext , Dtime , Dspace , and Dentity
which are the independent document models for the dimensions.
The KL-divergence framework with the independence assumption gives us the
flexibility of treating each dimension in isolation while estimating query models.
This would include using different background models, expansion techniques
with pseudo-relevance feedback, and smoothing. The problem thus reduces to
estimating query models for each of the dimensions which we describe next.
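A minimal sketch of this additive scoring is shown below, under the assumption that each estimated model is a dictionary from vocabulary units (terms, time units, space units, entities) to probabilities and that the document models are already smoothed (cf. Eq. 9) so that no query unit has zero document probability.

```python
import math

def kl_divergence(query_model, doc_model):
    """KL(Q_dim || D_dim) for one dimension; both models are probability dictionaries."""
    return sum(p * math.log(p / doc_model[unit])
               for unit, p in query_model.items() if p > 0)

def overall_divergence(query_models, doc_models,
                       dims=("text", "time", "space", "entity")):
    """Additive combination over the four dimensions (lower means more relevant)."""
    return sum(kl_divergence(query_models[d], doc_models[d]) for d in dims)
```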
Query-Text Model. Standard likelihood-based query modeling methods that
rely on the empirical terms become ineffective for our task. As an illustration,
consider the first example in Table 1. A likelihood-based model would put more
stress on {Iraq}, which has the highest frequency, and suffer from topical drift
due to terms like {discussion, border, dispute, Iraq}. It is hence necessary to
make use of a background model that emphasizes event-specific terms.
We observe that a given qtext contains two factors: first, terms that give
background information, and second, terms that describe the event. To stress
the latter, we combine a query-text model with a background model estimated
from: (1) the textual content of the source Wikipedia article dwiki ; and (2) the
textual descriptions of events listed in the Wikipedia Current Events portal,
devent . The dwiki background model puts emphasis on the contextual terms that
are discriminative for the event, like {Kuwait, Iraq, Sheikh, Jaber }. On the other
hand, the background model devent puts emphasis on event-specific terms like
{capture, kill, invading}. Similar approaches that combine multiple contextual
models have shown significant improvement in result quality [24,25].
We combine the query model with a background model by linear interpola-
tion [28]. The probability of a word w from Qtext is estimated as
P (w | Qtext ) = (1 − λ) · P (w | qtext ) + λ · [β · P (w | devent ) + (1 − β) · P (w | dwiki )]. (2)
A term w is generated from the background model with probability λ and from
the original query with probability 1 − λ. Since we use a subset of the available
terms, we finally re-normalize the query model as in [20]. The new generative
probability P̂ (w | Qtext ) is computed as,
P̂ (w | Qtext ) = P (w | Qtext ) / Σ_{w′∈V} P (w′ | Qtext ). (3)
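A minimal sketch of Eqs. 2 and 3 follows, assuming the maximum-likelihood models of qtext, devent, and dwiki are given as word-probability dictionaries; the default λ and β values mirror the parameter settings reported in the experiments section.

```python
def query_text_model(p_query, p_event, p_wiki, lam=0.85, beta=0.5):
    """Interpolated query-text model (Eq. 2), re-normalised over the vocabulary (Eq. 3)."""
    vocab = set(p_query) | set(p_event) | set(p_wiki)
    background = {w: beta * p_event.get(w, 0.0) + (1 - beta) * p_wiki.get(w, 0.0)
                  for w in vocab}
    mixed = {w: (1 - lam) * p_query.get(w, 0.0) + lam * background[w] for w in vocab}
    total = sum(mixed.values())
    return {w: p / total for w, p in mixed.items()}
```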
Query-Time Model. The query-time model Qtime (Eq. 4) distributes probability
over time units based on the temporal expressions in qtime ; the 1(·) function in Eq. 4
returns 1 if there is an overlap between a time unit τ and an interval [tbl , tbu , tel , teu ],
and the denominator computes the area of the temporal expression in T × T . For any
given temporal expression, we can compute its area and its intersection with other
expressions as described in [5]. Intuitively, Eq. 4 assigns higher probability to time
units that overlap with a larger number of specific (smaller area) intervals in qtime .
The query-time model estimated so far has hard temporal boundaries and
suffers from the issue of near-misses. For example, if the end boundary of the
query-time model is “10 January 2014” then the expression “11 January 2014” in
a document will be disregarded. To address this issue, we perform an additional
Gaussian smoothing. The new probability is estimated as,
P̂ (τ | Qtime ) = Σ_{t∈T ×T} Gσ (t) · P (τ | Qtime ). (5)
Query-Space Model. Analogous to Eq. 4, the query-space model Qspace (Eq. 7)
distributes probability over space units; the 1(·) function returns 1 if there is an
overlap between a space unit l and an MBR [tp, lt, bt, rt]. Intuitively, the query-space
model assigns higher probability to l if it overlaps with a larger number of more
specific (MBR with smaller area) geolocations in qspace . For the denominator, it is
easy to compute | [tp, lt, bt, rt] | as |s| = (rt − lt + resollat ) ∗ (tp − bt + resollong ).
The addition of the small constant ensures that for all s, |s| > 0.
Similar to the query-time model, to address the issue of near misses we estimate
P̂ (l | Qspace ), which additionally smooths P (l | Qspace ) using a Gaussian kernel as
described in Eq. 5, and re-normalize as per Eq. 3.
Query-Entity Model. The query-entity model Qentity captures the entities
that are salient to an event and builds a probability distribution over an entity
space. To estimate Qentity we make use of the initially retrieved pseudo-relevant
documents to construct a background model that assigns higher probability to
entities that are often associated with an event. Let DR be the set of pseudo-
relevant documents. The generative probability of entity e is estimated as,
P (e | Qentity ) = (1 − λ) · P (e | qentity ) + λ · Σ_{d∈DR} P (e | dentity ) (8)
where P (e | qentity ) and P (e | dentity ) are the likelihoods of generating the entity
from the original query and a document d ∈ DR respectively.
Document Model. To estimate the document models for each dimension, we
follow the same methodology as for the query with an additional step of Dirichlet
smoothing [28]. This has two effects: First, it prevents undefined KL-Divergence
scores. Second, it achieves an IDF-like effect by smoothing the probabilities of
expressions that occur frequently in the collection C. The generative probability of a term
w from document-text model Dtext is estimated as,
P (w | Dtext ) = (P̂ (w | Dtext ) + μ P (w | Ctext )) / (|Dtext | + μ) (9)
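A minimal sketch of Eq. 9 follows, assuming the unsmoothed document estimate and the collection model are given as dictionaries; the default value of μ here is only a placeholder, since the paper sets its parameters following [27].

```python
def document_text_model(p_doc_hat, doc_length, p_collection, mu=2000.0):
    """Dirichlet-smoothed document-text model following Eq. 9.
    p_doc_hat: unsmoothed estimate of P(w | D_text); doc_length: |D_text|;
    p_collection: collection model P(w | C_text); mu: Dirichlet prior
    (placeholder value -- the actual setting follows [27])."""
    return {w: (p_doc_hat.get(w, 0.0) + mu * p_c) / (doc_length + mu)
            for w, p_c in p_collection.items()}
```

The document models for the time, space, and entity dimensions are smoothed in the same fashion against their collection-level counterparts.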
4 Experiments
Next, we describe our experiments to study the impact of the different compo-
nents of our approach. We make our experimental data publicly available2 .
2 http://resources.mpi-inf.mpg.de/d5/linkingWiki2News/.
Document Collection. For the first set of experiments, we use The New York
Times3 Annotated Corpus (NYT) which contains about two million news articles
published between 1987 and 2007. For the second set, we use the ClueWeb12-B13
(CW12) corpus4 with 50 million web pages crawled in 2012.
Test Queries. We use the English Wikipedia dump released on February 3rd
2014 to generate two independent sets of test queries: (1) NYT-Queries contains
150 randomly sampled Wikiexcerpts targeting documents from the NYT corpus;
(2) CW-Queries contains 150 randomly sampled Wikiexcerpts targeting web
pages from the CW12 corpus. NYT-Queries contains 104 queries, out of 150, that come
with at least one temporal expression, geolocation, and named-entity mention. In
the remaining 46 test queries, 17 do not have any temporal expressions, 28 do not
have any geolocations, and 27 do not have any entity mentions. We have 4 test
queries where our taggers fail to identify any additional semantics. CW-Queries
contains 119 queries, out of 150, that come with at least one temporal expression,
geolocation and entity mention. 19 queries do not mention any geolocation, and
26 do not have entity mentions.
Relevance Assessments were collected using the CrowdFlower platform5 . We
pooled top-10 results for the methods under comparison, and asked assessors to
judge a document as (0) irrelevant, (1) somewhat relevant, or (2) highly relevant
to a query. Our instructions said that a document can only be considered highly
relevant if its main topic was the event given as the query. Each query-document
pair was judged by three assessors. The two experiments resulted in 1778 and 1961
unique query-document pairs, respectively. We paid $0.03 per batch of five query-
document pairs for a single assessor.
Effectiveness Measures. As a strict effectiveness measure, we compare our
methods based on mean reciprocal rank (MRR). We also compare our methods
using normalized discounted cumulative gain (NDCG) and precision (P) at cutoff
levels 5 and 10. We also report the mean average precision (MAP) across all
queries. For MAP and P we consider a document relevant to a query if the
majority of assessors judged it with label (1) or (2). For NDCG we plug in the
mean label assigned by assessors.
Methods. We compare the following methods: (1) txt considers only the query-
text model that uses the background models estimated from the current events
portal and the source Wikipedia article (Eq. 2); (2) txtT uses the query-text and
query-time model (Eq. 4); (3) txtS uses the query-text and query-space model
(Eq. 7); (4) txtE uses the query-text and query-entity model (Eq. 8); (5) txtST
uses the query-text, query-time and query-space model; (6) txtSTE uses all four
query models to rank documents.
Parameters. We set the values for the different parameters in query and docu-
ment models for all the methods by following [27]. For the NYT corpus, we treat
3 http://corpus.nytimes.com.
4 http://www.lemurproject.org/clueweb12.php/.
5 http://www.crowdflower.com/.
top-100 documents retrieved in the first stage as pseudo-relevant. For CW12 cor-
pus with general web pages, we set this to top-500. The larger number of top-K
documents for the CW12 corpus is due to the fact that web pages come with
fewer annotations than news articles. In Eq. 2 for estimating the Qtext , we set
β = 0.5 thus giving equal weights to the background models. For the interpola-
tion parameters, we set λ = 0.85 in Eqs. 2 and 8. For the Gaussian smoothing
in Eq. 6 we set σ = 1. The smallest possible MBR in Eq. 7 is empirically set to
resollat × resollong = 0.1 × 0.1.
Implementation. All methods have been implemented in Java. To annotate
named entities in the test queries and documents from the NYT corpus, we use
the AIDA [16] system. For the CW12 corpus, we use the annotations released
as Freebase Annotations of the ClueWeb Corpora6 . To annotate geolocations
in the query and NYT corpus, we use an open-source gazetteer-based tool7
that extracts locations and maps them to GeoNames8 knowledge base. To get
geolocations for CW12 corpus we filter entities by mapping them from Freebase
to GeoNames ids. Finally, we run Stanford Core NLP9 on the test queries, NYT
corpus and CW12 corpus to get the temporal annotations.
Results. Tables 2 and 3 compare the different methods on our two datasets.
Both tables have two parts: (a) results on the entire query set; and (b) results
on a subset of queries with at least one temporal expression, geolocation, and
entity mention. To denote the significance of the observed improvements to the
txt method, we perform a one-sided paired Student’s t-test at two alpha levels:
0.05 (‡) and 0.10 (†), on the MAP, P@5, and P@10 scores [8]. We find that the
txtSTE method is most effective for the linking task.
In Table 2 we report results for the NYT-Queries. We find that the txtSTE
method that combines information in all the dimensions achieves the best result
across all metrics except P@5. The txt method that uses only the text already
gets a high MRR score. The txtS method that adds geolocations to text is able to
add minor improvements in NDCG@10 over the txt method. The txtT method
achieves a considerable improvement over txt. This is consistent for both NYT-
Queries (a) and NYT-Queries (b). The txtE method that uses named-entities
along with text shows significant improvement in P@5 and marginal improve-
ments across other metrics. The txtST method that combines time and geolo-
cations achieves significant improvements over txt. Finally, the txtSTE method
proves to be the best and shows significant improvements over the txt.
In Table 3, we report results for the CW-Queries. We find that the txtSTE
method outperforms other methods across all the metrics. Similar to previous
results, we find that the txt method already achieves high MRR score. How-
ever, in contrast, the txtT approach shows improvements in terms of P@5 and
NDCG@5, with a marginal drop in P@10 and MAP. The geolocations improve
6 http://lemurproject.org/clueweb12/FACC1/.
7 https://github.com/geoparser/geolocator.
8 http://www.geonames.org/.
9 http://nlp.stanford.edu/software/corenlp.shtml.
Table 2. Results for NYT-Queries: (a) entire query set (left block); (b) queries with at
least one temporal expression, geolocation, and entity mention (right block).
Measures  txt    txtT   txtS   txtE   txtST  txtSTE  |  txt    txtT   txtS   txtE   txtST  txtSTE
MRR       0.898  0.897  0.898  0.898  0.898  0.902   |  0.921  0.936  0.921  0.921  0.936  0.942
P@5       0.711  0.716  0.709  0.716‡ 0.719  0.717   |  0.715  0.740‡ 0.715  0.723‡ 0.742‡ 0.740‡
P@10      0.670  0.679  0.669  0.671  0.679  0.682†  |  0.682  0.692  0.681  0.684  0.692  0.696†
MAP       0.687  0.700  0.687  0.688  0.701  0.704†  |  0.679  0.702† 0.679  0.682  0.703† 0.708‡
NDCG@5    0.683  0.696  0.682  0.685  0.697  0.697   |  0.686  0.721  0.686  0.689  0.721  0.723
NDCG@10   0.797  0.813  0.798  0.796  0.814  0.815   |  0.794  0.823  0.795  0.795  0.823  0.825
Table 3. Results for CW-Queries: (a) entire query set (left block); (b) queries with at
least one temporal expression, geolocation, and entity mention (right block).
Measures  txt    txtT   txtS   txtE   txtST  txtSTE  |  txt    txtT   txtS   txtE   txtST  txtSTE
MRR       0.824  0.834  0.831  0.827  0.833  0.836   |  0.837  0.855  0.846  0.842  0.854  0.855
P@5       0.448  0.460  0.451  0.456† 0.467‡ 0.475‡  |  0.456  0.468  0.459† 0.466‡ 0.478‡ 0.488‡
P@10      0.366  0.349  0.366  0.375‡ 0.367  0.375†  |  0.377  0.358  0.378  0.390‡ 0.377  0.386‡
MAP       0.622  0.616  0.628  0.640‡ 0.640† 0.653‡  |  0.623  0.616  0.631  0.647‡ 0.642† 0.661‡
NDCG@5    0.644  0.657  0.651  0.654  0.666  0.673   |  0.655  0.675  0.664  0.669  0.684  0.695
NDCG@10   0.729  0.723  0.734  0.746  0.744  0.755   |  0.736  0.736  0.744  0.759  0.755  0.769
the quality of the results in terms of MAP and significantly improve P@10.
Though individually time and geolocations show only marginal improvements,
their combination in the txtST method shows a significant increase in MAP. We
find that the txtE method performs better than other dimensions with a sig-
nificant improvement over txt across all metrics. Finally, the best performing
method is txtSTE as it shows the highest improvement in the result quality.
Discussion. As a general conclusion of our experiments we find that leveraging
semantic annotations like time, geolocations, and named entities along with text
improves the effectiveness of the linking task, since all our methods that utilize
semantic annotations (txtS, txtT, txtE, txtST, and txtSTE ) perform better than
the text-only (txt) method. However, the simple txt method already achieves a
decent MRR score in both experiments. This highlights the effectiveness of the
event-specific background model in tackling the verbosity of the Wikiexcerpts.
Time becomes an important indicator to identify relevant news articles but it is
not very helpful when it comes to general web pages. This is because the temporal
expressions in news articles often describe the event time period accurately,
thus giving a good match to the queries, whereas this is not the case for web pages. We
find that geolocations and time together can better identify relevant documents
when combined with text. Named entities in the queries are not always salient
to the event but may represent the context of the event. For complex queries,
it is hard to distinguish salient entities which reduces the overall performance
due to topical drifts on a news corpus. However, they prove to be effective to
identify relevant web pages which can contain more general information thus
also mentioning the contextual entities. The improvement of our method over a
simple text-based method is more pronounced for the ClueWeb corpus than the
news corpus because of mainly two reasons: firstly, the news corpus is too narrow
with a much smaller number of articles; and secondly, it is slightly easier to retrieve
relatively short, focused, and high quality news articles. This is supported by the
fact that all methods achieve much higher MRR scores for the NYT-Queries.
Gain/Loss Analysis. To get some insights into where the improvements of the
txtSTE method come from, we perform a gain/loss analysis based on NDCG@5.
The txtSTE method shows its biggest gain (+0.13) in NDCG@5 for the following
query in NYT-Queries:
West Windsor Township, New Jersey: The West Windsor post office was found
to be infected with anthrax during the anthrax terrorism scare back in 2001-2002.
The single temporal expression 2001-2002 refers to a time period when there
were multiple anthrax attacks in New Jersey through the postal facilities. Due to
the ambiguity, the txtT and txtS methods become ineffective for this query. Their
combination, however, as the txtST method becomes the second best method
achieving an NDCG@5 of 0.7227. The txtSTE method additionally leverages the
entity Anthrax and becomes the best method, achieving an NDCG@5 of 0.8539. This
method suffers its worst loss in terms of NDCG@5 (−0.464) for the following query
in CW-Queries:
Human Rights Party Malaysia: The Human Rights Party Malaysia is a Malaysian
human rights-based political party founded on 19 July 2009, led by human rights
activist P. Uthayakumar.
The two entities, Human Rights Party Malaysia and P. Uthayakumar, and the one
geolocation, Malaysia, do not prove to be discriminative for the event. Time becomes
an important indicator to identify relevant documents as txtT becomes most
effective by achieving NDCG@5 of 0.9003. However, a combination of text,
time, geolocations, and named entities, as leveraged by txtSTE, achieves a lower
NDCG@5 of 0.4704.
Easy and Hard Query Events. Finally, we identify the easiest and the hardest
query events across both our testbeds. We find that the following query, in the
CW-Queries, gets the highest minimum P@10 across all methods:
Primal Therapy: In 1989, Arthur Janov established the Janov Primal Center in
Venice (later relocated to Santa Monica) with his second wife, France.
For this query even the simple txt method gets a perfect P@10 score of 1.0.
Terms Janov, Primal, and Center retrieve documents that are pages from the
center’s website, and are marked relevant by the assessors. Likewise, we identify
the hardest query as the following one from the NYT-Queries set:
Police aviation in the United Kingdom: In 1921, the British airship R33 was able
to help the police in traffic control around the Epsom and Ascot horse-racing events.
For this query none of the methods were able to identify any relevant documents
thus all getting a P@10 score equal to 0. This is simply because this relatively
old event is not covered in the NYT corpus.
5 Related Work
In this section, we put our work in context with existing prior research. We
review five lines of prior research related to our work.
First, we look into efforts to link different document collections. As the earli-
est work, Henzinger et al. [14] automatically suggested news article links for an
ongoing TV news broadcast. Later works have looked into linking related text
across multiple archives to improve their exploration [6]. Linking efforts also go
towards enriching social media posts by connecting them to news articles [26].
Recently, Arapakis et al. [1] propose an automatic linking system between news
articles describing similar events.
Next, we identify works that use time to improve document retrieval qual-
ity [23]. To leverage time, prior works have proposed methods that are moti-
vated from cognitive psychology [21]. Time has also been considered as a feature
for query profiling and classification [17]. In the realm of document retrieval,
Berberich et al. [5] exploit explicit temporal expressions contained in queries to
improve result quality. As some of the latest work, Peetz et al. [20] detect tem-
poral burstiness of query terms, and Mishra et al. [19] leverage explicit temporal
expressions to estimate temporal query models. Efron et al. [11] present a kernel
density estimation method to temporally match relevant tweets.
There have been many prior initiatives [7,13] to investigate geographical
information retrieval. The GeoCLEF search task examined geographic search
in text corpus [18]. More recent initiatives like the NTCIR-GeoTime task [12]
evaluated adhoc retrieval with geographic and temporal constraints.
We look into prior research works that use entities for information retrieval.
Earlier initiatives like INEX entity ranking track [10] and TREC entity track
[3] focus on retrieving relevant entities for a given topic. More recently, INEX
Linked Data track [4] aimed at evaluating approaches that additionally use text
for entity ranking. As the most recent work, Dalton et al. [9] show significant
improvement for document retrieval.
Divergence-based retrieval models for text have been well-studied in the past.
In their study, Zhai et al. [27,28] compare techniques of combining backgrounds
models to query and documents. To further improve the query model estima-
tion, Shen et al. [24] exploit contextual information like query history and click
through history. Bai et al. [2] combine query models estimated from multiple
contextual factors.
6 Conclusion
We have addressed a novel linking problem with the goal of establishing connec-
tions between excerpts from Wikipedia, coined Wikiexcerpts, and news articles.
For this, we cast the linking problem into an information retrieval task and
present approaches that leverage additional semantics that come with a Wikiex-
cerpt. Comprehensive experiments on two large datasets with independent test
query sets show that our approach that leverages time, geolocations, named
entities, and text is most effective for the linking problem.
References
1. Arapakis, I., et al.: Automatically embedding newsworthy links to articles: From
implementation to evaluation. JASIST 65(1), 129–145 (2014)
2. Bai, J., et al.: Using query contexts in information retrieval. In: SIGIR (2007)
3. Balog, K., et al.: Overview of the TREC 2010 entity track. In: DTIC (2010)
4. Bellot, P., et al.: Report on INEX 2013. ACM SIGIR Forum 47(2), 21–32 (2013)
5. Berberich, K., Bedathur, S., Alonso, O., Weikum, G.: A language modeling approach
for temporal information needs. In: Gurrin, C., et al. (eds.) ECIR 2010. LNCS,
vol. 5993, pp. 13–25. Springer, Heidelberg (2010)
6. Bron, M., Huurnink, B., de Rijke, M.: Linking archives using document enrichment
and term selection. In: Gradmann, S., Borri, F., Meghini, C., Schuldt, H. (eds.)
TPDL 2011. LNCS, vol. 6966, pp. 360–371. Springer, Heidelberg (2011)
7. Cozza, V., Messina, A., Montesi, D., Arietta, L., Magnani, M.: Spatio-temporal
keyword queries in social networks. In: Catania, B., Guerrini, G., Pokorný, J. (eds.)
ADBIS 2013. LNCS, vol. 8133, pp. 70–83. Springer, Heidelberg (2013)
8. Croft, B., et al.: Search Engines: Information Retrieval in Practice. Addison-
Wesley, Reading (2010)
9. Dalton, J., et al.: Entity query feature expansion using knowledge base links. In:
SIGIR (2014)
10. Demartini, G., Iofciu, T., de Vries, A.P.: Overview of the INEX 2009 entity ranking
track. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2009. LNCS, vol. 6203,
pp. 254–264. Springer, Heidelberg (2010)
11. Efron, M., et al.: Temporal feedback for tweet search with non-parametric density
estimation. In: SIGIR (2014)
12. Gey, F., et al.: NTCIR-GeoTime overview: evaluating geographic and temporal
search. In: NTCIR (2010)
13. Hariharan, R., et al.: Processing spatial-keyword (SK) queries in geographic infor-
mation retrieval (GIR) systems. In: SSDBM (2007)
14. Henzinger, M.R., et al.: Query-free news search. World Wide Web 8, 101–126
(2005)
15. Hoffart, J., et al.: YAGO2: a spatially and temporally enhanced knowledge base
from Wikipedia. In: IJCAI (2013)
16. Hoffart, J., et al.: Robust disambiguation of named entities in text. In: EMNLP
(2011)
17. Kulkarni, A., et al.: Understanding temporal query dynamics. In: WSDM (2011)
18. Mandl, T., Gey, F.C., Di Nunzio, G.M., Ferro, N., Larson, R.R., Sanderson, M.,
Santos, D., Womser-Hacker, C., Xie, X.: GeoCLEF 2007: the CLEF 2007 cross-
language geographic information retrieval track overview. In: Peters, C., et al. (eds.)
CLEF 2007. LNCS, vol. 5152, pp. 745–772. Springer, Heidelberg (2008)
19. Mishra, A., et al.: Linking Wikipedia events to past news. In: TAIA (2014)
20. Peetz, M., et al.: Using temporal bursts for query modeling. Inf. Retrieval 17(1),
74–108 (2014)
Deep Learning over Multi-field Categorical Data
1 Introduction
User response (e.g., click-through or conversion) prediction plays a critical part
in many web applications including web search, recommender systems, sponsored
search, and display advertising. In online advertising, for instance, the ability of
targeting individual users is the key advantage compared to traditional offline
advertising. All these targeting techniques, essentially, rely on the system func-
tion of predicting whether a specific user will think the potential ad is “relevant”,
i.e., the probability that the user in a certain context will click a given ad [6].
Sponsored search, contextual advertising, and the recently emerged real-time
bidding (RTB) display advertising all heavily rely on the ability of learned mod-
els to predict ad click-through rates (CTR) [32,41]. The applied CTR estimation
models today are mostly linear, ranging from logistic regression [32] and naive
Bayes [14] to FTRL logistic regression [28] and Bayesian probit regression [12],
all of which are based on a huge number of sparse features with one-hot encod-
ing [1]. Linear models have the advantages of easy implementation and efficient
learning, but relatively low performance because they fail to learn the non-trivial
patterns that capture the interactions between the assumed (conditionally) inde-
pendent raw features [12]. Non-linear models, on the other hand, are able to
utilise different feature combinations and thus could potentially improve esti-
mation performance. For example, factorisation machines (FMs) [29] map the
user and item binary features into a low-dimensional continuous space, and the
feature interactions are automatically explored via vector inner products. Gradient
boosting trees [38] automatically learn feature combinations while growing each
decision/regression tree. However, these models cannot make use of all possible
combinations of different features [20]. In addition, many models require feature
engineering that manually designs what the inputs should be. Another problem
of the mainstream ad CTR estimation models is that most prediction models
have shallow structures and limited expressiveness for modelling the underlying
patterns of complex and massive data [15]. As a result, their data modelling
and generalisation ability is still restricted.
Deep learning [25] has become successful in computer vision [22], speech
recognition [13], and natural language processing (NLP) [19,33] during the past
five years. As visual, aural, and textual signals are known to be spatially and/or
temporally correlated, the newly introduced unsupervised training on deep struc-
tures [18] would be able to explore such local dependency and establish a dense
representation of the feature space, making neural network models effective in
learning high-order features directly from the raw feature input. With such
learning ability, deep learning would be a good candidate to estimate online
user response rates such as ad CTR. However, most input features in CTR esti-
mation are multi-field discrete categorical features, e.g., the user
location city (London, Paris), device type (PC, Mobile), ad category (Sports,
Electronics) etc., and their local dependencies (thus the sparsity in the feature
space) are unknown. Therefore, it is of great interest to see how deep learning
improves the CTR estimation via learning feature representation on such large-
scale multi-field discrete categorical features. To the best of our knowledge, there is
no previous literature of ad CTR estimation using deep learning methods thus
far1 . In addition, training deep neural networks (DNNs) on a large input feature
space requires tuning a huge number of parameters, which is computationally
expensive. For instance, unlike image and audio cases, we have about 1 million
binary input features and 100 hidden units in the first layer; then it requires 100
million links to build the first layer neural network.
In this paper, we take ad CTR estimation as a working example to study deep
learning over a large multi-field categorical feature space by using embedding
methods in both supervised and unsupervised fashions. We introduce two types
of deep learning models, called Factorisation Machine supported Neural Net-
work (FNN) and Sampling-based Neural Network (SNN). Specifically, FNN with
1 Although the leverage of deep learning models on ad CTR estimation has been
claimed in industry (e.g., [42]), there is no detail of the models or implementation.
2 Related Work
Click-through rate, defined as the probability that a specific user clicks
on a displayed ad, is essential in online advertising [39]. In order to max-
imise revenue and user satisfaction, online advertising platforms must predict
the expected user behaviour for each displayed ad and maximise the expecta-
tion that users will click. The majority of current models use logistic regression
based on a set of sparse binary features converted from the original categorical
features via one-hot encoding [26,32]. Heavy engineering efforts are needed to
design features such as locations, top unigrams, combination features, etc. [15].
Embedding very large feature vectors into low-dimensional vector spaces is
useful for prediction tasks as it reduces the data and model complexity and
improves both the effectiveness and the efficiency of the training and predic-
tion. Various methods of embedding architectures have been proposed [23,37].
Factorisation machine (FM) [31], originally proposed for collaborative filtering
recommendation, is regarded as one of the most successful embedding models.
FM naturally has the capability of estimating interactions between any two fea-
tures via mapping them into vectors in a low-rank latent space.
Deep Learning [2] is a branch of artificial intelligence research that attempts
to develop the techniques that will allow computers to handle complex tasks such
as recognition and prediction at high performance. Deep neural networks (DNNs)
are able to extract the hidden structures and intrinsic patterns at different lev-
els of abstractions from training data. DNNs have been successfully applied in
computer vision [40], speech recognition [8] and natural language processing
(NLP) [7,19,33]. Furthermore, with the help of unsupervised pre-training, we
can get good feature representation which guides the learning towards basins of
attraction of minima that support better generalisation from the training data
[10]. Usually, these deep models have two stages in learning [18]: the first stage
performs model initialisation via unsupervised learning (i.e., the restricted Boltz-
mann machine or stacked denoising auto-encoders) to make the model catch the
input data distribution; the second stage involves a fine tuning of the initialised
model via supervised learning with back-propagation. The novelty of our deep
learning models lies in the first-layer initialisation, where the raw input features
are high-dimensional, sparse binary features converted from the raw cate-
gorical features, which makes it hard to train traditional DNNs at large scale.
Compared with the word-embedding techniques used in NLP [19,33], our mod-
els deal with more general multi-field categorical features without any assumed
data structures such as word alignment or letter n-grams.
Fig. 1. The FNN architecture: fully connected hidden layers stacked over a field-wise
bottom layer, producing the CTR as output.
l1 = tanh(W1 z + b1 ), (3)
where W1 ∈ RM×J , b1 ∈ RM , and z ∈ RJ .
Each zi of the bottom layer is computed field-wise as zi = W i0 · x[starti : endi ],
where starti and endi are the starting and ending feature indexes of the i-th field,
W i0 ∈ R(K+1)×(endi −starti +1) , and x is the input vector as described at the beginning.
All weights W i0 are initialised with the bias term wi and the vector v i respectively
(e.g., W i0 [0] is initialised by wi , W i0 [1] is initialised by vi1 , W i0 [2] is initialised
by vi2 , etc.). In this way, the z vector of the first layer is initialised as shown in Fig. 1
via training a factorisation machine (FM) [31]:
yFM (x) := sigmoid( w0 + Σ_{i=1}^{N} wi xi + Σ_{i=1}^{N} Σ_{j=i+1}^{N} ⟨v i , v j ⟩ xi xj ), (6)
where each feature i is assigned a bias weight wi and a K-dimensional vector
v i , and the feature interaction is modelled as their vectors’ inner product ⟨v i , v j ⟩.
In this way, the above neural nets can learn more efficiently from the factorisation
machine representation, so that the computational complexity problem of the
high-dimensional binary inputs is naturally bypassed. Different hidden
layers can be regarded as different internal functions capturing different forms
of representations of the data instance. For this reason, this model is better able
to capture intrinsic data patterns, which leads to better performance.
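A minimal NumPy sketch of the FM score in Eq. (6) and of building the field-wise bottom-layer blocks W i0 from the trained FM parameters is given below; the dense array layout and function names are illustrative assumptions, and field ranges use Python's exclusive-end convention rather than the inclusive indexes of the text.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """FM prediction of Eq. (6). x: (N,) binary input vector,
    w0: global bias, w: (N,) per-feature biases, V: (N, K) latent vectors."""
    linear = w0 + w @ x
    # pairwise term sum_{i<j} <v_i, v_j> x_i x_j, computed in O(N*K)
    interactions = 0.5 * (np.square(V.T @ x).sum()
                          - (np.square(V) * np.square(x)[:, None]).sum())
    return 1.0 / (1.0 + np.exp(-(linear + interactions)))

def init_bottom_layer(field_ranges, w, V):
    """Build one (K+1) x |field_i| block W_0^i per field: row 0 holds the FM
    biases w_j and rows 1..K the latent vectors v_j of the field's features."""
    return [np.vstack([w[start:end], V[start:end].T]) for start, end in field_ranges]
```

A FM trained in this form supplies both the FM baseline scores and the initial values of the FNN bottom layer.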
The idea of using FM in the bottom layer is inspired by Convolutional Neural
Networks (CNNs) [11], which exploit spatially local correlation by enforcing
a local connectivity pattern between neurons of adjacent layers. Similarly, the
inputs of hidden layer 1 are connected to the input units of a specific field. Also,
the bottom layer is not fully connected, as FM performs a field-wise training
for the one-hot sparse encoded input, allowing local sparsity, illustrated as the dashed
lines in Fig. 1. FM learns a good structural data representation in the latent space,
helpful for any further model to build on. A subtle difference, though, appears
between the product rule of FM and the sum rule of DNN for combination.
However, according to [21], if the observational discriminatory information is
highly ambiguous (which is true in our case for ad click behaviour), the posterior
weights (from DNN) will not deviate dramatically from the prior (FM).
Furthermore, the weights in the hidden layers (except the FM layer) are ini-
tialised by layer-wise RBM pre-training [3] using contrastive divergence [17],
and the whole network is then fine-tuned in a supervised fashion by minimising
a loss L(y, ŷ) between prediction and observation,
where ŷ is the predicted CTR in Eq. (1) and y is the binary click ground-truth
label. Using the chain rule of back propagation, the FNN weights including FM
weights can be efficiently updated. For example, we update FM layer weights
via
∂L(y, ŷ)/∂W i0 = (∂L(y, ŷ)/∂z i ) · (∂z i /∂W i0 ) = (∂L(y, ŷ)/∂z i ) · x[starti : endi ] (8)
W i0 ← W i0 − η · (∂L(y, ŷ)/∂z i ) · x[starti : endi ]. (9)
Due to the fact that the majority of entries of x[starti : endi ] are 0, we can accel-
erate fine-tuning by updating only the weights linking to positive units.
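A minimal sketch of this sparse update for one field, following Eqs. (8) and (9), is shown below; the upstream gradient ∂L/∂zi is assumed to be supplied by the surrounding back-propagation code, and the variable names are illustrative.

```python
import numpy as np

def update_field_weights(W0_i, x_field, grad_z_i, eta=0.01):
    """One SGD step for the weight block of field i (Eqs. 8-9).
    W0_i: (K+1, F_i) block, x_field: (F_i,) binary slice x[start_i:end_i],
    grad_z_i: (K+1,) upstream gradient dL/dz_i, eta: learning rate."""
    active = np.flatnonzero(x_field)          # usually a single active unit per field
    # dL/dW0_i[:, j] = (dL/dz_i) * x_field[j]; columns with x_field[j] = 0 are untouched
    W0_i[:, active] -= eta * np.outer(grad_z_i, x_field[active])
    return W0_i
```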
Fig. 2. (a) SNN architecture; (b) SNN first layer pre-trained with a sampling-based
RBM; (c) SNN first layer pre-trained with a sampling-based DAE.
The structure of the second model SNN is shown in Fig. 2(a). The difference
between SNN and FNN lies in the structure and training method in the bottom
layer. SNN’s bottom layer is fully connected with sigmoid activation function:
z = sigmoid(W 0 x + b0 ). (10)
To initialise the weights of the bottom layer, we tried both a restricted Boltz-
mann machine (RBM) [16] and a denoising auto-encoder (DAE) [4] in the pre-
training stage. In order to deal with the computational problem of training large
sparse one-hot encoding data, we propose a sampling-based RBM (Fig. 2(b),
denoted as SNN-RBM) and a sampling-based DAE in (Fig. 2(c), denoted as
SNN-DAE) to efficiently calculate the initial weights of the bottom layer.
Instead of modelling the whole feature set for each training instance, we exploit
the field structure: each feature field, e.g., city, has only one positive-value
feature per training instance, e.g., city=London, and we randomly sample m
negative units of that field with value 0, e.g., city=Paris when m = 1. Black units in Fig. 2(b) and (c)
are unsampled and thus ignored when pre-training the data instance. With the
sampled units, we can train an RBM via contrastive divergence [17] and a DAE
via SGD with unsupervised approaches to largely reduce the data dimension
with high recovery performance. The real-value dense vector is used as the input
of the further layers in SNN.
In this way, computational complexity can be dramatically reduced and, in
turn, initial weights can be calculated quickly and back-propagation is then
performed to fine-tune SNN model.
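A minimal sketch of the field-wise sampling idea: for each field the single positive unit is kept and m negative units are drawn at random, while all other units are ignored during pre-training. The field description format and helper names are our assumptions, and the actual RBM/DAE pre-training step is omitted.

```python
import numpy as np

def sample_units(fields, m=1, rng=np.random.default_rng(0)):
    """For each field keep its positive unit and m random negative units.

    fields : list of (start, length, pos_idx), where pos_idx is the global index
             of the active one-hot unit, e.g. city=London.
    Returns the global indices of the sampled units; all other (black) units
    are ignored when pre-training this data instance.
    """
    sampled = []
    for start, length, pos_idx in fields:
        sampled.append(pos_idx)
        candidates = [start + j for j in range(length) if start + j != pos_idx]
        negatives = rng.choice(candidates, size=min(m, len(candidates)), replace=False)
        sampled.extend(int(i) for i in negatives)
    return np.array(sampled)

# Example: two fields (city with 5 values, weekday with 7 values).
print(sample_units([(0, 5, 2), (5, 7, 9)], m=1))
```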
3.3 Regularisation
On the other hand, dropout [35] has become a popular and effective regularisation technique for deep learning in recent years. We also implement this regularisation and compare it with L2 regularisation in our experiments.
4 Experiment
Data. We evaluate our models based on iPinYou dataset [27], a public real-
world display ad dataset with each ad display information and corresponding
user click feedback. The data logs are organised by different advertisers and in a
row-per-record format. There are 19.50 M data instances with 14.79 K positive labels (clicks) in total. The features of each data instance are all categorical. Fea-
ture examples in the ad log data are user agent, partially masked IP, region,
city, ad exchange, domain, URL, ad slot ID, ad slot visibility, ad slot
size, ad slot format, creative ID, user tags, etc. After one-hot encoding,
the number of binary features is 937.67 K in the whole dataset. We feed each
compared model with these binary-feature data instances and the user click (1)
LR: Logistic Regression [32] is a linear model with simple implementation and
fast training speed, which is widely used in online advertising estimation.
FM: Factorisation Machine [31] is a non-linear model able to estimate feature
interactions even in problems with huge sparsity.
FNN: Factorisation-machine supported Neural Network is our proposed model
as described in Sect. 3.1.
SNN: Sampling-based Neural Network is also our proposed model with sampling-
based RBM and DAE pre-training methods for the first layer in Sect. 3.2,
denoted as SNN-RBM and SNN-DAE respectively.
Our experiment code2 of both FNN and SNN is implemented with Theano3 .
Metric. To measure the CTR estimation performance of each model, we employ
the area under ROC curve (AUC)4 . The AUC [12] metric is a widely used mea-
sure for evaluating the CTR performance.
Table 1 shows the results that compare LR, FM, FNN and SNN with RBM and
DAE on 5 different advertisers and the whole dataset. We observe that FM
is not significantly better than LR, which means 2-order combination features
might not be good enough to catch the underlying data patterns. The AUC
performance of the proposed FNN and SNN is better than the performance of
2 The source code with demo data: https://github.com/wnzhang/deep-ctr.
3 Theano: http://deeplearning.net/software/theano/.
4 Besides AUC, root mean square error (RMSE) was also tested. However, positive/negative examples are largely unbalanced in the ad click scenario, and the empirically best regression model usually predicts CTRs close to 0, which results in very small RMSE values, so the improvement is not well captured.
Neural network training algorithms are very sensitive to the overfitting prob-
lem since deep networks have multiple non-linear layers, which makes them very
expressive models that can learn very complicated functions. For DNN models,
we compared L2 regularisation (Eq. (11)) and dropout [35] for preventing com-
plex co-adaptations on the training data. The dropout rate implemented in this
experiment refers to the probability of each unit being active.
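Since the dropout rate reported here denotes the probability of a unit being active, the following tiny sketch shows the corresponding mask; the inverted-dropout rescaling is our assumption rather than a detail given in the paper.

```python
import numpy as np

def dropout(h, active_rate=0.8, rng=np.random.default_rng(0), train=True):
    """Keep each hidden unit with probability `active_rate` (the 'dropout rate'
    as used in this experiment) and rescale so expectations match at test time."""
    if not train:
        return h
    mask = rng.random(h.shape) < active_rate
    return h * mask / active_rate

print(dropout(np.ones(5), active_rate=0.8))
```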
Figure 4(a) shows the compared AUC performance of SNN-RBM regularised
by L2 norm and dropout. It is obvious that dropout outperforms L2 in all com-
pared settings. The reason why dropout is more effective is that when feeding
each training case, each hidden unit is stochastically excluded from the network
with a probability of dropout rate, i.e., each training case can be regarded as a
new model and these models are averaged as a special case of bagging [5], which
effectively improves the generalisation ability of DNN models.
As a summary of Sects. 4.4 and 4.5, for both FNN and SNN there are two important kinds of parameters to tune to make the model more effective: (i) the layer sizes, which decide the architecture of the neural network, and (ii) the dropout rate, which changes the generalisation ability on all datasets compared to neural networks with L2 regularisation only.
5 Some advanced Bayesian methods for hyperparameter tuning [34] are not considered in this paper and may be investigated in future work.
Fig. 4. (a) Dropout vs. L2; (b) FNN on the 2997 dataset; (c) SNN on the 2997 dataset.
Figure 4(b) and (c) show how the AUC performance changes as dropout is varied in both FNN and SNN. In both models performance first trends upward and then drops sharply as the dropout rate keeps decreasing. The distinction between the two models lies in their different sensitivities to dropout. From Fig. 4(c), we can see that SNN is sensitive to the dropout rate. This might be caused by the connectivity of the bottom layer: the bottom layer of SNN is fully connected with the input vector, while the bottom layer of FNN is partially connected, so FNN is more robust when some hidden units are dropped out. Furthermore, the sigmoid activation function tends to be more effective than the linear activation function under dropout. Therefore, the dropout rates at the best performance of FNN and SNN are quite different: for FNN the optimal dropout rate is around 0.8, while for SNN it is about 0.99.
5 Conclusion
References
1. Beck, J.E., Park Woolf, B.: High-level student modeling with machine learning.
In: Gauthier, G., VanLehn, K., Frasson, C. (eds.) ITS 2000. LNCS, vol. 1839, pp.
584–593. Springer, Heidelberg (2000)
2. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1),
1–127 (2009)
3. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al.: Greedy layer-wise
training of deep networks. In: NIPS, vol. 19, p. 153 (2007)
4. Bengio, Y., Yao, L., Alain, G., Vincent, P.: Generalized denoising auto-encoders
as generative models. In: NIPS, pp. 899–907 (2013)
5. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
6. Broder, A.Z.: Computational advertising. In: SODA, vol. 8, pp. 992–992 (2008)
7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. JMLR 12, 2493–2537 (2011)
8. Deng, L., Abdel-Hamid, O., Yu, D.: A deep convolutional neural network using
heterogeneous pooling for trading acoustic invariance with phonetic confusion. In:
ICASSP, pp. 6669–6673. IEEE (2013)
9. Elizondo, D., Fiesler, E.: A survey of partially connected neural networks. Int. J.
Neural Syst. 8(05n06), 535–558 (1997)
10. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.A., Vincent, P., Bengio, S.: Why
does unsupervised pre-training help deep learning? JMLR 11, 625–660 (2010)
11. Fukushima, K.: Neocognitron: a self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biol. Cybern. 36(4),
193–202 (1980)
12. Graepel, T., Candela, J.Q., Borchert, T., Herbrich, R.: Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In: ICML, pp. 13–20 (2010)
13. Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent
neural networks. In: ICASSP, pp. 6645–6649. IEEE (2013)
14. Hand, D.J., Yu, K.: Idiot’s bayes not so stupid after all? Int. Statist. Rev. 69(3),
385–398 (2001)
15. He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R.,
Bowers, S., et al.: Practical lessons from predicting clicks on ads at facebook. In:
ADKDD, pp. 1–9. ACM (2014)
16. Hinton, G.: A practical guide to training restricted boltzmann machines. Momen-
tum 9(1), 926 (2010)
17. Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
Neural comput. 14(8), 1771–1800 (2002)
18. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with
neural networks. Science 313(5786), 504–507 (2006)
19. Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep struc-
tured semantic models for web search using clickthrough data. In: CIKM, pp.
2333–2338 (2013)
20. Juan, Y.C., Zhuang, Y., Chin, W.S.: 3 idiots approach for display advertising
challenge. In: Internet and Network Economics, pp. 254–265. Springer, Heidelberg
(2011)
21. Kittler, J., Hatef, M., Duin, R.P., Matas, J.: On combining classifiers. PAMI 20(3),
226–239 (1998)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: NIPS (2012)
23. Kurashima, T., Iwata, T., Takaya, N., Sawada, H.: Probabilistic latent network
visualization: inferring and embedding diffusion networks. In: KDD, pp. 1236–1245.
ACM (2014)
24. Larochelle, H., Bengio, Y., Louradour, J., Lamblin, P.: Exploring strategies for
training deep neural networks. JMLR 10, 1–40 (2009)
25. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553) (2015)
26. Lee, K., Orten, B., Dasdan, A., Li, W.: Estimating conversion rate in display
advertising from past performance data. In: KDD, pp. 768–776. ACM (2012)
27. Liao, H., Peng, L., Liu, Z., Shen, X.: iPinYou global RTB bidding algorithm competition dataset. In: ADKDD, pp. 1–6. ACM (2014)
28. McMahan, H.B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L.,
Phillips, T., Davydov, E., Golovin, D., et al.: Ad click prediction: a view from the
trenches. In: KDD, pp. 1222–1230. ACM (2013)
29. Oentaryo, R.J., Lim, E.P., Low, D.J.W., Lo, D., Finegold, M.: Predicting response
in mobile advertising with hierarchical importance-aware factorization machine.
In: WSDM (2014)
30. Prechelt, L.: Automatic early stopping using cross validation: quantifying the cri-
teria. Neural Netw. 11(4), 761–767 (1998)
31. Rendle, S.: Factorization machines with libfm. ACM TIST 3(3), 57 (2012)
32. Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-
through rate for new ads. In: WWW, pp. 521–530. ACM (2007)
33. Shen, Y., He, X., Gao, J., Deng, L., Mesnil, G.: A latent semantic model with
convolutional-pooling structure for information retrieval. In: CIKM (2014)
34. Snoek, J., Larochelle, H., Adams, R.P.: Practical bayesian optimization of machine
learning algorithms. In: NIPS, pp. 2951–2959 (2012)
35. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1),
1929–1958 (2014)
36. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization
and momentum in deep learning. In: ICML, pp. 1139–1147 (2013)
37. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: Large-scale infor-
mation network embedding. In: WWW, pp. 1067–1077 (2015)
38. Trofimov, I., Kornetova, A., Topinskiy, V.: Using boosted trees for click-through
rate prediction for sponsored search. In: WINE, p. 2. ACM (2012)
39. Wang, X., Li, W., Cui, Y., Zhang, R., Mao, J.: Click-through rate estimation for
rare events in online advertising. In: Online Multimedia Advertising: Techniques
and Technologies, pp. 1–12 (2010)
40. Zeiler, M.D., Taylor, G.W., Fergus, R.: Adaptive deconvolutional networks for mid
and high level feature learning. In: ICCV, pp. 2018–2025. IEEE (2011)
41. Zhang, W., Yuan, S., Wang, J.: Optimal real-time bidding for display advertising.
In: KDD, pp. 1077–1086. ACM (2014)
42. Zou, Y., Jin, X., Li, Y., Guo, Z., Wang, E., Xiao, B.: Mariana: Tencent deep
learning platform and its applications. VLDB 7(13), 1772–1777 (2014)
Supervised Local Contexts Aggregation
for Effective Session Search
1 Introduction
For the majority of existing studies on web search, queries are mainly optimized
and evaluated independently [19,21]. However, this single query scenario is not
the whole story in web search. When users have some complex search tasks
(e.g. literature survey, product comparison), a single-query-search is probably
insufficient [22]. Often, users iteratively interact with the search engine multiple
times, until the task is accomplished. We call such a process a session search [9].
Formally, a session search $S$ is defined as $S = \{I^t = \{q^t, D^t, C^t\}\ |\ t = 1:T\}$, where $S$ is composed of a sequence of interactions $I^t$ between user and search engine. In each interaction $I^t$, three steps are involved: (1) the user issues
a query $q^t$; (2) the search engine retrieves the top-ranked documents $D^t$; and (3) the user clicks a subset of documents $C^t$ that he/she finds attractive. The user then gradually refines the next query and iterates the above three steps until the search task is done. See Table 1 for an example.
Table 1. Example Session Search
Problem Analysis. The goal of session search is to provide search results for the current query (i.e. $q^T$, which has not been searched, namely $D^T = \emptyset$, $C^T = \emptyset$)
so that the entire search task can be satisfied, based on preceding interactions
within the same session, which we call session context. How to effectively utilize
such session contexts is the key to session search, which can be challenging in
real world applications. Consider the following two examples:
Example 1 (Recency Variation). TREC11 session-60 has ten interactions before the current query. It is reasonable to believe that the context from $I^1$ is less important than the context from $I^{10}$.
Example 2 (Satisfaction Variation). TREC12 session-85 has two interactions before the current query: $I^1$ has one SAT click, while $I^2$ has none. As a SAT click is a strong indicator of user satisfaction, the context from $I^1$ should be more important than that from $I^2$.
of IR. These algorithms usually treat session contexts as side information besides the current query, combined within unsupervised ad hoc search algorithms such as language modeling [6,9,10,20,24]. The unsupervised nature, however, may not be optimal from the perspective of performance. The second category consists mainly of algorithms developed in industry (e.g. Bing, Yandex) that focus on personalized web search. These works mainly utilize learning-to-rank algorithms with abundant session context features [2,16,26,27].
However, one major disadvantage shared by the above algorithms is that they either ignore session context variations by mixing all interactions indistinguishably [10,16,24,26,27], or only consider limited possibilities (primarily recency variation [2,6,20]) that are not powerful enough to maximize search accuracy.
These observations motivate us to propose a better algorithm for session search.
Our Proposal. Seeing the limitations of existing algorithms, in this paper we
propose a principled framework, named Supervised Local Context Aggregation
(SLCA), which can model sophisticated session context variations. Global session
context is decomposed into local contexts between consecutive interactions. We
further propose multiple weighting hypotheses, each of which corresponds to
a specific variation pattern in session context. Then the global session context
can be modeled as the weighted combination of local contexts. A supervised
ranking aggregation approach is adopted for effective and efficient optimization.
Extensive experiments on TREC11/12 Session Tracks show that our proposed
SLCA achieves the state-of-the-art results.
The rest of the paper is organized as follows: Sect. 2 reviews related litera-
ture; Sect. 3 elaborates all the details of SLCA; Sect. 4 gives all the experimental
results; finally in Sect. 5, we conclude the paper.
2 Related Works
This section presents the three main components of SLCA: local context modeling, local context weighting hypotheses and supervised ranking aggregation. At the end, we also present the detailed search procedure using SLCA.
The entire session context, which has multiple interactions, might be complicated
to model as a whole. Instead, we break it down into smaller units called local
contexts, which makes the problem easier. Specifically, a local context $X^t$ is the context contained within two consecutive interactions $I^{t-1}$ and $I^t$. For ease of use, such a local context is described as a feature vector jointly extracted from $I^{t-1}$ and $I^t$, which we call the Local Context Feature (LCF). As $I^{t-1}$ always happens before $I^t$, the LCF is actually used as the representation of $I^t$.
The following four kinds of LCF are explored in this paper.
(1) Added Information (Add). Query terms that appear in $q^t$ but not in $q^{t-1}$ represent information that the user pursues in $q^t$. We let those terms form a pseudo query $Q_{Add}$.
(2) Deleted Information (Del). Query terms that appear in $q^{t-1}$ but not in $q^t$ represent information that the user no longer desires in $q^t$. We let such terms form a pseudo query $Q_{Del}$.
(3) Implicit Information 1 (Imp1). If there are SAT clicks in $I^{t-1}$, the user is satisfied by those clicked documents. We assume those SAT documents have some impact on the formation of $q^t$. Therefore, we utilize Lavrenko's Relevance Model (RM) [15] to extract m (empirically set to 10) terms from the SAT documents, hoping that these extra terms will capture the user's implicit information need. Those terms form a pseudo query $Q_{Imp1}$.
(4) Implicit Information 2 (Imp2). In case the SAT documents are too few to provide accurate Imp1 modeling, we further select the top n Wikipedia documents from $D^{t-1}$ in the last interaction $I^{t-1}$. Again we extract m terms using RM (m, n empirically set to 10). This strategy is inspired by the successful application of RM + Wikipedia in recent TREC competitions [1]. We let those terms form a pseudo query $Q_{Imp2}$. Notice that we use the Wikipedia contained in the ClueWeb09 corpus for the TREC11/12 session track; therefore we are not using external resources beyond TREC, thus assuring a fair comparison with previous works.
For a document $d$, we calculate its BM25 scores $c$ w.r.t. the above four pseudo queries. We then calculate the single-query ranking score $h(q^t, d)$ between $q^t$ and $d$ (details will be given in Sect. 4). The overall local context feature for document $d$ in $I^t$ is then:
$$f(d|I^t) = [\,c_{Add}(d)\;\; c_{Del}(d)\;\; c_{Imp1}(d)\;\; c_{Imp2}(d)\;\; h(q^t, d)\,] \qquad (1)$$
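A minimal sketch of assembling the local context feature vector of Eq. (1); the BM25 scorer and the single-query ranking function h are assumed to be supplied by an index (they are only placeholders here), and the function name is ours.

```python
def local_context_feature(doc, pseudo_queries, bm25_score, h_score, q_t):
    """Build f(d | I^t) = [c_Add, c_Del, c_Imp1, c_Imp2, h(q^t, d)]  (Eq. 1).

    pseudo_queries : dict with keys 'Add', 'Del', 'Imp1', 'Imp2' built from
                     consecutive interactions I^{t-1} and I^t
    bm25_score     : callable (query, doc) -> BM25 score
    h_score        : callable (query, doc) -> single-query ranking score
    """
    return [
        bm25_score(pseudo_queries['Add'], doc),
        bm25_score(pseudo_queries['Del'], doc),
        bm25_score(pseudo_queries['Imp1'], doc),
        bm25_score(pseudo_queries['Imp2'], doc),
        h_score(q_t, doc),
    ]

# Dummy usage with constant scorers (real scores would come from the index).
dummy = lambda q, d: 1.0
print(local_context_feature("doc-1", {k: k for k in ['Add', 'Del', 'Imp1', 'Imp2']}, dummy, dummy, "current query"))
```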
Recall in Sect. 1 we analyzed that the key challenge is how to model the session context variations, i.e. the importance of each interaction. Based on local context modeling, this problem can be naturally formulated as follows:
$$X = \sum_{t=1}^{T} w^t X^t \;\;\rightarrow\;\; F_d = \sum_{t=1}^{T} w^t f(d|I^t) \qquad (2)$$
$$(\mathrm{OP1}):\quad \mathbf{w}, \theta = \arg\min_{\mathbf{w},\theta} \sum_{S} L\big(\{F_{d_S^i}\}, \{Y_{d_S^i}\}; \theta\big), \qquad F_{d_S^i} = \sum_{t=1}^{T} w_S^t\, f(d|I_S^t) \qquad (3)$$
where $\mathbf{w}$ is the set of local context weights $w_S^t$ for all interactions in all training sessions, and $\{\cdot\}$ indicates the feature/label set of documents for each $S$.
If $\mathbf{w}$ is known, OP1 reduces to a standard learning-to-rank problem, where for each document $d$ we have a single feature vector $F_d^S$ given by Eq. 2 and a relevance label $Y_d^S$. Any learning-to-rank algorithm can be utilized, such as RankSVM [11] and GBDT [5]. Unfortunately, this is not the case. Even if we could derive some $\mathbf{w}$ (probably suboptimal due to the large number of parameters) for the training sessions by optimizing Eq. 3, there is no way to predict $\mathbf{w}$ for unseen test sessions. To solve this problem, we propose an effective alternative below.
Fig. 1. Local context weights by applying the six WHs to the example session in
Table 1.
obtain very good results in our preliminary experiments) to get two shrinking WHs: $\tilde{w}_S^{0.6}$, $\tilde{w}_S^{0.8}$. The closer an interaction $I^t$ is to $q^T$, the more important its local context is, thus reflecting the recency variation of each local context.
(3) Current Query WH for Presence Variation. Similarly, we denote $\tilde{w}_{Cur} = [\tilde{w}_{cur}^t \mid t = 1:T]$ as the WH. For this WH, only the interaction at present (i.e. the current query) is considered, i.e. $\tilde{w}_{cur}^T = 1$ and $\tilde{w}_{cur}^t = 0$ for $t < T$.
(4) Average WH for Equality Variation. We denote $\tilde{w}_{Ave} = [\tilde{w}_{ave}^t \mid t = 1:T]$, where $\tilde{w}_{ave}^t = \frac{1}{T}$. That is, all local contexts are treated equally, which is equivalent to previous works [16,27] that mixed all interactions indistinguishably.
(5) Novelty Variation. We denote $\tilde{w}_{Nov} = [\tilde{w}_{nov}^t \mid t = 1:T]$ as the weights. We believe that if a preceding query $q^t$ is dissimilar to the current query $q^T$, then $q^t$ provides some novelty to the whole session and should be emphasized. Otherwise, if $q^t$ is very similar to $q^T$, or even identical, its local context weight should be decreased. Specifically, we define $\tilde{w}_{nov}^t = 1 - \mathrm{Jaccard}(q^T, q^t)$ for $t < T$ and $\tilde{w}_{nov}^T = 1$, where $\mathrm{Jaccard}(q^T, q^t) = \frac{|\mathrm{terms} \in q^T \,\cap\, \mathrm{terms} \in q^t|}{|\mathrm{terms} \in q^T \,\cup\, \mathrm{terms} \in q^t|}$.
(6) SAT WH for Satisfaction Variation. If one interaction has many SAT clicks, it means this interaction satisfies the user well, and its local context should be more important. Denote $\tilde{w}_{SAT} = [\tilde{w}_{sat}^t \mid t = 1:T]$ as the importance; then we have $\tilde{w}_{sat}^t = |C^t|$, where $C^t$ are the SAT clicks in interaction $I^t$, so $\tilde{w}_{sat}^t$ equals the number of SAT clicks in $I^t$. For the current query, we let $\tilde{w}_{sat}^T = \max_{t=1:T-1} \tilde{w}_{sat}^t$.
For each of the above six WHs, we normalize $\tilde{w}$ so that $\sum_{t=1}^{T} \tilde{w}^t = 1$.
Each WH represents one specific pattern of session context variation. With various α, the combination of the six WHs produces Multiple Variations Modeling, which can be much more expressive than Global Context Modeling and Single Variation Modeling (see Table 2). Designing additional WHs is quite straightforward, but in this paper we stick to the above six as they have already shown excellent performance. As an example, Fig. 1 shows the six WHs for the example session in Table 1; a small sketch of how they can be computed follows below.
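The sketch below (our own illustration) computes the six normalised weighting hypotheses for one session, assuming the session is given as a list of query-term sets plus SAT-click counts; the exponential-decay form of the two shrinking WHs is an assumption consistent with the description above.

```python
import numpy as np

def weighting_hypotheses(queries, sat_clicks):
    """Return the six normalised WHs for a session with T interactions.

    queries    : list of T query-term sets, the last one being the current query q^T
    sat_clicks : list of T SAT-click counts |C^t|
    """
    T = len(queries)
    t = np.arange(1, T + 1)
    whs = {}
    whs['exp0.6'] = 0.6 ** (T - t)                        # shrinking (recency)
    whs['exp0.8'] = 0.8 ** (T - t)
    whs['cur'] = (t == T).astype(float)                   # current query only
    whs['ave'] = np.ones(T) / T                           # all equal
    jac = np.array([len(q & queries[-1]) / len(q | queries[-1]) for q in queries])
    whs['nov'] = np.where(t < T, 1.0 - jac, 1.0)          # novelty
    sat = np.array(sat_clicks, dtype=float)
    if T > 1:
        sat[-1] = sat[:-1].max()                          # SAT weight for q^T
    whs['sat'] = sat
    # Normalise each WH to sum to 1 (guard against all-zero vectors).
    return {k: v / v.sum() if v.sum() > 0 else np.ones(T) / T for k, v in whs.items()}

# Example session with three interactions.
qs = [{'hawaii', 'volcano'}, {'hawaii', 'volcano', 'eruption'}, {'kilauea', 'eruption'}]
print(weighting_hypotheses(qs, sat_clicks=[1, 0, 0]))
```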
With the above local context weighting hypotheses, OP1 can now be transformed into the following problem, with $\tilde{w}_{k,S}^t$ being the $k$-th WH for session $S$:
$$(\mathrm{OP2}):\quad \alpha, \theta = \arg\min_{\alpha,\theta} \sum_{S} L\big(\{F_{d_S^i}\}, \{Y_{d_S^i}\}; \theta\big), \qquad F_{d_S^i} = \sum_{t=1}^{T} \Big(\sum_{k=1}^{K} \alpha_k\, \tilde{w}_{k,S}^t\Big) f(d|I_S^t) \qquad (5)$$
As mentioned earlier, compared with OP1, OP2 has two advantages: (1) besides θ, OP2 has only K extra parameters (i.e. α) to optimize, which is far fewer than OP1; (2) OP2 generalizes much more easily to unseen test sessions than OP1, as α is shared between training and test sessions and the WHs $\tilde{w}$ adapt to each individual session.
Nonetheless, OP 2 can still be difficult to solve, as the loss function L in Eq. 5
will contain α and θ tightly coupled together, which makes the existing solvers
for learning to rank problems inapplicable. To solve this problem efficiently and
effectively, we further relax OP 2 into the following formulation:
$$(\mathrm{OP3}):\quad \alpha, \theta_{1\sim K} = \arg\min \sum_{S} \sum_{k=1}^{K} \alpha_k\, L\big(\{\tilde{F}_{d_S^i,k}\}, \{Y_{d_S^i}\}; \theta_k\big), \qquad \tilde{F}_{d_S^i,k} = \sum_{t=1}^{T} \tilde{w}_{k,S}^t\, f(d|I_S^t) \qquad (6)$$
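As a rough illustration of the combination in OP3, the sketch below scores a document by aggregating the K per-WH feature vectors with their respective rankers and the weights α; the linear rankers and the example inputs are placeholders, and the supervised procedure for learning θ_k and α is omitted.

```python
import numpy as np

def slca_score(lcf_matrix, whs, rankers, alpha):
    """Score one document for the current session (sketch of OP3's combination).

    lcf_matrix : (T, F) local context features f(d | I^t) for t = 1..T
    whs        : (K, T) the K normalised weighting hypotheses for this session
    rankers    : list of K weight vectors, one linear ranker theta_k per WH
    alpha      : (K,) combination weights
    """
    # Per-WH aggregated feature vector: F_tilde_{d,k} = sum_t w_k^t f(d | I^t)
    F_tilde = whs @ lcf_matrix                       # (K, F)
    per_wh_scores = np.array([theta @ F_tilde[k] for k, theta in enumerate(rankers)])
    return float(alpha @ per_wh_scores)              # weighted ranking aggregation

# Toy usage with random placeholders.
rng = np.random.default_rng(0)
T, F, K = 3, 5, 6
print(slca_score(rng.random((T, F)), rng.random((K, T)),
                 list(rng.random((K, F))), np.full(K, 1.0 / K)))
```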
1 http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.
4 Experiments
4.1 General Settings
Except for (3) and (4), which are taken directly from the original papers, we implement all the other methods, which rerank the top-500 documents returned by BasicRet (see Algorithm 2).
() indicates main evaluation metric. Bold numbers are the highest in each column.
‡, , †, means SLCA is significantly better than SinVar-Exp0.6/0.8 and CxtRank-
V1/V2, where t-test is used with p < 0.05. - represents unreported performance in
the papers of TRECBest and MDP.
as CurrentQuery WH indeed utilizes some context while SQRank does not. Ave,
Nov and SAT IPs also perform reasonably, and are much better than Current-
Query WH. Overall, these results verify the proposed SLCA algorithm.
5 Conclusion
The key to effective session search relies on how to effectively utilize preceding
contexts of interactions to improve search for current query. Previous related
research either ignored session context variations, or formulated single, simple
modeling that is not powerful enough. In this paper, we proposed Supervised
Local Context Aggregation (SLCA) algorithm, which learns a supervised ranking
model based on aggregating multiple weighting hypotheses of local contexts. On
TREC11/12 session track our algorithm has achieved the state-of-the-art results.
References
1. Bendersky, M., Fisher, D., Croft, W.B.: Umass at trec 2010 web track: term depen-
dence, spam filtering and quality bias. In: TREC (2010)
2. Bennett, P.N., White, R.W., Chu, W., Dumais, S.T., Bailey, P., Borisyuk, F., Cui,
X.: Modeling the impact of short- and long-term behavior on search personaliza-
tion. In: SIGIR (2012)
3. Cao, H., Jiang, D., Pei, J., Chen, E., Li, H.: Towards context-aware search by
learning a very large variable length hidden markov model from search logs. In:
WWW (2009)
4. Collins-Thompson, K., Bennett, P.N., White, R.W., de la Chica, S., Sontag, D.:
Personalizing web search results by reading level. In: CIKM (2011)
5. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. In:
Annals of Statistics (2001)
6. Guan, D., Zhang, S., Yang, H.: Utilizing query change for session search. In: SIGIR
(2013)
7. Guo, Q., White, R.W., Dumais, S.T., Wang, J., Anderson, B.: Predicting query
performance using query, result, and user interaction features. In: RIAO (2010)
8. Jiang, D., Leung, K.W.T., Ng, W.: Context-aware search personalization with
concept preference. In: CIKM (2011)
9. Jiang, J., He, D., Allan, J.: Searching, browsing, and clicking in a search session:
changes in user behavior by task and over time. In: SIGIR (2014)
10. Jiang, J., He, D., Han, S.: On duplicate results in a search session. In: TREC (2012)
11. Joachims, T.: Training linear svms in linear time. In: KDD (2006)
12. Kanoulas, E., Carterette, B., Hall, M., Clough, P., Sanderson, M.: Overview of the
trec 2011 session track. In: TREC (2011)
13. Kanoulas, E., Carterette, B., Hall, M., Clough, P., Sanderson, M.: Overview of the
trec 2012 session track. In: TREC (2012)
14. Kharitonov, E., Macdonald, C., Serdyukov, P., Ounis, I.: Intent models for contex-
tualising and diversifying query suggestions. In: CIKM (2013)
15. Lavrenko, V., Croft, W.B.: Relevance-based language models. In: SIGIR (2001)
16. Li, X., Guo, C., Chu, W., Wang, Y.Y.: Deep learning powered in-session contextual
ranking using clickthrough data. In: NIPS Workshop on Personalization: Methods
and Applications (2014)
17. Liu, C., Gwizdka, J., Liu, J.: Helping identify when users find useful documents:
examination of query reformulation intervals. In: IIiX (2010)
18. Liu, T., Zhang, C., Gao, Y., Xiao, W., Huang, H.: Bupt wildcat at trec 2011 session
track. In: TREC (2011)
19. Liu, T.Y.: Learning to Rank for Information Retrieval. Springer, Heidelberg (2011)
20. Luo, J., Zhang, S., Dong, X., Yang, H.: Designing states, actions, and rewards for
using POMDP in session search. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N.
(eds.) ECIR 2015. LNCS, vol. 9022, pp. 526–537. Springer, Heidelberg (2015)
21. Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information
Retrieval. Springer, Heidelberg (2011)
22. Raman, K., Bennett, P.N., Collins-Thompson, K.: Toward whole-session relevance:
exploring intrinsic diversity in web search. In: SIGIR (2013)
23. Rendle, S., Gantner, Z., Freudenthaler, C., Schmidt-Thieme, L.: Fast context-aware
recommendations with factorization machines. In: SIGIR (2011)
24. Shen, X., Tan, B., Zhai, C.: Context-sensitive information retrieval using implicit
feedback. In: SIGIR (2005)
25. Shokouhi, M., White, R.W., Bennett, P., Radlinski, F.: Fighting search engine
amnesia: reranking repeated results. In: SIGIR (2013)
26. Ustinovskiy, Y., Serdyukov, P.: Personalization of web-search using short-term
browsing context. In: CIKM (2013)
27. Xiang, B., Jiang, D., Pei, J., Sun, X., Chen, E., Li, H.: Context-aware ranking in
web search. In: SIGIR (2010)
An Empirical Study of Skip-Gram Features
and Regularization for Learning
on Sentiment Analysis
1 Introduction
The performance of sentiment analysis systems depends heavily on the underly-
ing text representation quality. Unlike in traditional topical classification, simply
applying standard machine learning algorithms such as Naive Bayes or SVM to
unigram (“bag-of-words”) features no longer provides satisfactory accuracy [19].
In sentiment analysis, unigrams cannot capture all relevant and informative fea-
tures, resulting in information loss and suboptimal classification performance.
1 On the IMDB dataset, skip-grams perform worse than word vectors on the predefined test set, but better on randomly sampled test sets, as discussed in Sect. 3.
the power of skip-grams: since manual assessment is time consuming, only a very
small number (300 in their experiments) of skip-gram candidates are generated
and presented to the human assessors, and an even smaller number of features
is kept. For a small number of selected skip-grams to work well, it is essential
for them to be orthogonal so that different aspects of the data can be covered
and explained. But skip-grams are judged independently of each other in both
the automatic generating procedure and the human assessment procedure; as
a result, individually informative features, when put together, could be highly
correlated and redundant for prediction purposes. Also, the skip-grams selected
are not used in conjunction with n-grams to train classifiers. In our proposed
method, the feature selection is done by regularized learning algorithms, which is
a much cheaper solution compared with manual selection. This reduction in cost
makes it possible to generate and evaluate a large number of skip-gram can-
didates. The feature selection algorithm considers all features simultaneously,
making the selected feature set less redundant.
Word Vectors. Another related line of research performs sentiment analysis
based on word vectors (or paragraph vectors) [14,15,18]. Typical word vectors
have only hundreds of dimensions, and thus represent documents more concisely
than skip-grams do. One common way of building word vectors is to train them
on top of skip-grams. After this training step, skip-grams are discarded and only
word vectors are used to train the final classifier. Classifiers trained this way are
smaller compared with those trained on skip-grams. One should note, however,
that training classifiers on a low-dimensional dense word vector representation is
not necessarily faster than training classifiers on a high-dimensional sparse skip-
gram representation, for two reasons: first, low-dimensional dense features often
work best with non-linear classifiers while high-dimensional sparse features often
work best with linear classifiers, and linear classifiers are much faster to train.
Second, sparsity in the feature matrix can be explored in the latter case to further
speed up training. Although the idea of building word vector representations on
top of skip-grams is very promising, current methods have some limitations.
Documents with word vector representations are compressed or decoded in a
highly complicated way, and the learned models based on word vectors are much
more difficult to interpret than those based directly on skip-grams. For example,
to understand what Amazon customers care about in baby products, it is hard
to infer any latent meaning from a word vector feature. On the other hand, it is
very easy to interpret a high-weight skip-gram feature such as “no smell”, which
includes potential variants like “no bad smell”, “no medicine-like smell” and “no
annoying smell”. Another limitation is that, while word vectors are trained on
skip-grams, they do not necessarily capture all the information in skip-grams. In
our method, the classifiers are trained directly on skip-grams, and thus can fully
utilize the information provided by skip-grams. We exploit the sparsity in the
feature matrix to speed up training, and feature selection is employed to shrink
the size of the classifier. Experiments show that our method generally achieves
both better performance and better interpretability.
In the last preprocessing step, we determine the matched documents for each of
the skip-gram candidates and their matching scores. There are several slightly
different ways of computing the matching score, but the basic idea is the same:
a phrase that matches the given n-gram tightly contributes more to the score
than a phrase that matches the n-gram loosely, and if two documents have the
same skip-gram frequency, the shorter document will receive a higher score.
An indexing service is needed for storage and matching, i.e., a service such
as Lemur [3], Terrier [4], Lucene [5] or ElasticSearch [2]. Any such platform can
be used for this purpose. For this study, we adopt the “Span Near Query”
scoring function implemented in the open source search engine ElasticSearch,
which matches the above criteria. For a given (n-gram g, slop s, document d)
triple,
$$\mathrm{score}(g, s, d) = \frac{\mathrm{skipGramFreq}(g, s, d)}{\mathrm{length}(d)}$$
where
$$\mathrm{skipGramFreq}(g, s, d) = \sum_{k=0}^{s} \frac{\mathrm{phraseFreq}(g, k, d)}{\mathrm{length}(g) + 1 + k}$$
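The matching score can also be reproduced outside a search engine; the sketch below follows the two formulas above, with phrase_freq implemented as a naive approximation of span-based matching over a token list (the anchoring of the first and last gram terms is our simplification).

```python
def phrase_freq(gram, k, tokens):
    """Count in-order matches of `gram` whose first and last terms are exactly
    len(gram) + k tokens apart, i.e. with k extra words interleaved."""
    span, count = len(gram) + k, 0
    for start in range(len(tokens) - span + 1):
        window = tokens[start:start + span]
        if window[0] != gram[0] or window[-1] != gram[-1]:
            continue
        it = iter(window)
        if all(term in it for term in gram):   # in-order subsequence check
            count += 1
    return count

def skip_gram_freq(gram, max_slop, tokens):
    """skipGramFreq(g, s, d) = sum_{k=0..s} phraseFreq(g, k, d) / (len(g) + 1 + k)."""
    return sum(phrase_freq(gram, k, tokens) / (len(gram) + 1 + k)
               for k in range(max_slop + 1))

def score(gram, max_slop, tokens):
    """score(g, s, d) = skipGramFreq(g, s, d) / length(d)."""
    return skip_gram_freq(gram, max_slop, tokens) / len(tokens)

doc = "the only minor problem is the zipper".split()
print(score(("only", "problem"), 1, doc))
```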
Regularized SVM minimizes the sum of hinge loss and a penalty term [7]. Specifically, for L2-regularized SVM, the objective is
$$\min_{w} \sum_{i=1}^{N} \big(\max(0,\, 1 - y_i w^T x_i)\big)^2 + \lambda\,\frac{1}{2}\|w\|_2^2,$$
and for L1-regularized SVM, the objective is
$$\min_{w} \sum_{i=1}^{N} \big(\max(0,\, 1 - y_i w^T x_i)\big)^2 + \lambda\,\|w\|_1,$$
where λ controls the strength of the regularization2 . We use the LibLinear pack-
age [7] for regularized SVM. Using both L1 and L2 terms to regularize SVM has
been proposed [22], but is not commonly seen in practice, possibly due to the
difficult learning procedure; hence we do not consider it in this study.
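For reference, a hedged scikit-learn sketch of L1- and L2-regularized linear SVMs with squared hinge loss (both backed by LIBLINEAR, with C = 1/λ as noted in the footnote) on a sparse feature matrix; the toy documents and C values are illustrative only and do not reproduce the paper's setup.

```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer

docs = ["no bad smell at all", "only minor problem", "would not recommend this"]
labels = [1, 1, 0]

# Sparse n-gram-style feature matrix (here plain unigrams + bigrams).
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# L2-regularized SVM with squared hinge loss (LIBLINEAR backend), C = 1/lambda.
l2_svm = LinearSVC(penalty="l2", loss="squared_hinge", C=0.0625).fit(X, labels)

# L1-regularized SVM: drives most weights to exactly zero (feature selection).
l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.25).fit(X, labels)

print((l1_svm.coef_ != 0).sum(), "features kept by L1 out of", X.shape[1])
```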
Regularized LR [12] minimizes the sum of logistic loss and some penalty term. Specifically, for L2-regularized LR, the objective is
$$\min_{w} \frac{1}{N}\sum_{i=1}^{N} \Big(-y_i w^T x_i + \log\big(1 + e^{w^T x_i}\big)\Big) + \lambda\,\frac{1}{2}\|w\|_2^2,$$
and for L1-regularized LR, the objective is
$$\min_{w} \frac{1}{N}\sum_{i=1}^{N} \Big(-y_i w^T x_i + \log\big(1 + e^{w^T x_i}\big)\Big) + \lambda\,\|w\|_1,$$
3 Experiments
Datasets and setup. To examine the effectiveness of skip-grams, we extract
skip-gram features from three large sentiment analysis datasets and train several
2 In the LibLinear package that we use, a different notation is used; there C = 1/λ.
3 Our code is publicly available at https://github.com/cheng-li/pyramid.
machine learning algorithms on these extracted features. The datasets used are
IMDB, Amazon Baby, and Amazon Phone. IMDB [15] contains 50,000 labeled
movie reviews. Reviews with ratings from 1 to 4 are considered negative; and 7 to
10 are considered positive. Reviews with neutral ratings are ignored. The overall
label distribution is well balanced (25,000 positive and 25,000 negative). IMDB
comes with a predefined train/test split, which we adopt in our experiments.
There are also another 50,000 unlabeled reviews available for unsupervised train-
ing or semi-supervised training, which we do not use. Amazon Baby (containing
Amazon baby product reviews) and Amazon Phone (containing cell phone and
accessory reviews) are both subsets of a larger Amazon review collection [17].
Here we use them for binary sentiment analysis in the same manner as the IMDB dataset. By convention, reviews with ratings 1–2 are considered negative and
4–5 are positive. The neutral ones are ignored. Amazon Baby contains 136,461
positive and 32,950 negative reviews. Amazon Phone contains 47,970 positive
and 22,241 negative reviews. Amazon Baby and Amazon Phone do not have a
predefined train/test partitioning. We perform stratified sampling to choose a
random 20 % of the data as the test set. All results reported below on these two
datasets are averaged across five runs.
For each dataset, we extract skip-gram features with max size n varying
from 1 (unigram) to 5 (5-gram) and max slop varying from 0 (no extra words
can be added) to 2 (maximal 2 words can be added). For example, when max
n=2 and max slop=1, we will consider unigrams, bigrams, and skip-bigrams
with slop=1. As a result, for each dataset, 13 different feature sets are created.
The combinations (max n=1, max slop=1) and (max n=1, max slop=2) are
essentially the same as (max n=1, max slop=0), and thus not considered. For
each feature set, we run five learning algorithms on it and measure the accuracies
on the test set. The algorithms considered are L1 SVM, L2 SVM, L1 LR, L2
LR and L1+L2 LR. In order to make the feature set the only varying factor,
we use fixed hyper parameters for all algorithms across all feature sets. The
hyper parameters are chosen by cross-validation on training sets with unigram
features. For L2 SVM, C = 1/λ = 0.0625; for L1 SVM, C = 1/λ = 0.25; for
LR, λ = 0.00001; and for L1+L2 LR, α = 0.1. Performing all experiments took
about five days using a cluster with six 2.80 GHz Xeon CPUs.
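To illustrate what "max n" and "max slop" mean for candidate generation, the following simplified sketch enumerates the skip-gram features present in a tokenized review together with the minimal slop each occurrence needs; it is our own approximation and ignores the matching-score computation described earlier.

```python
from itertools import combinations

def skip_gram_candidates(tokens, max_n=3, max_slop=2):
    """Enumerate skip-gram features (gram, slop) occurring in a token list.

    A skip-gram of size n with slop s matches n tokens in order with at most
    s extra tokens interleaved, so its span is at most n + s tokens.
    """
    feats = set()
    for n in range(1, max_n + 1):
        max_s = 0 if n == 1 else max_slop          # slop has no effect on unigrams
        for start in range(len(tokens) - n + 1):
            rest = range(start + 1, min(start + n + max_s, len(tokens)))
            for idx in combinations(rest, n - 1):
                gram = (tokens[start],) + tuple(tokens[i] for i in idx)
                used_slop = (idx[-1] - start - (n - 1)) if idx else 0
                feats.add((gram, used_slop))       # minimal slop needed for this match
    return feats

print(sorted(skip_gram_candidates("no annoying smell at all".split(), max_n=2, max_slop=1)))
```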
Figure 1 shows how increasing max n and max slop of the skip-grams affects the
logistic regression performance on Amazon Baby. In each sub-figure, the bottom
line is the performance with standard n-gram (max slop=0) features. Along
each bottom line, moving from unigrams (max n=1) to bigrams (max n=2)
gives substantial improvement. Bigrams such as “not recommend” are effective
at capturing short distance negations, which cannot be captured by unigrams.
Moving beyond bigrams (max n=2) to higher order n-grams, we can see some
further improvement, but not as big as before. This observation is consistent with
the common practice in sentiment analysis, where trigrams are not commonly
used compared with bigrams, and n-grams beyond trigrams are rarely used.
Table 1. The performance of our method on Amazon Baby. For each algorithm on each feature set, the table shows its test accuracy and the number and fraction of features selected. The accuracies which are significantly better (at 0.05 level under t-test) than those by a corresponding slop 0 baseline are marked with *.
When we fix the max n and increase the max slop, we see the performance
further improves. For n ≥ 4, increasing max slop often brings more improvement
than increasing max n. Similar observations can be made for SVM and for the
other two datasets.
Table 2. The performance of our method on IMDB. For each algorithm on each feature
set, the table shows its test accuracy and the number and fraction of features selected.
Tables 1, 2, and 3 show more detailed results on these datasets. For each fixed
n, we use a paired t-test (0.05 level) to check whether increasing max slop from
0 to 1 or 2 leads to significant improvement. On Amazon Baby, all improvements
due to the increase of max slop are significant. On Amazon Phone, about half
are significant. The significance test is not done on the IMDB dataset since only
the predefined test set is used.
Tables 1, 2, and 3 also show how many features are selected by each learning
algorithm. L2 regularized algorithms do best in terms of accuracy but at the cost
of using all features. If that is acceptable in certain use cases, then L2 regulariza-
tion is recommended. On the other hand, L1 regularization can greatly reduce
the number of features used to below 1 %, sacrificing test accuracy by 1–2 %; if
this drop in performance is acceptable, then L1 regularization is recommended
for the extremely compact feature sets produced. Finally L1+L2 regularization
is a good middle choice for reducing the number of features to about 5–20 %
while at the same time maintaining test accuracy on par with L2 regularization.
Table 3. The performance of our method on Amazon Phone. For each algorithm on
each feature set, the table shows its test accuracy and the number and fraction of
features selected. The accuracies which are significantly better (at 0.05 level under
t-test) than those by a corresponding slop 0 baseline are marked with *.
kernel.
4 Analysis of Skip-Grams
When designing a feature set, the primary concern is often generalizability, since
good generalizability implies good prediction performance. In sentiment analysis
6 The training parameters are the same as in IMDB.
data, people often express the same idea in many slightly different ways, which
makes the prediction task harder as the algorithm has to learn many expressions
with small variations. Skip-grams alleviate this problem by letting the algorithm
focus on the important terms in the phrase and tolerate small changes in unim-
portant terms. Thus skip-grams perform feature grouping on top of n-grams
without requiring any external domain knowledge. This not only improves gen-
eralizability but also interpretability. Several such skip-gram examples are shown
in Table 5. They are selected by an L1+L2 regularized logistic regression model
with high weights. For each skip-gram, we show its count in the entire collection
and several n-gram instances that it matches. For each matched n-gram, the
count in the collection is also listed in the table. We can see, for example, the
skip-gram “only problem” (slop=1) could match bigram “only problem” and
trigrams “only minor problem” and “only tiny problem”. Although the bigram
“only problem” is frequent enough in the collection, the trigram “only tiny prob-
lem” only occurs in four out of 169,411 reviews. It is hard for the algorithm to
treat the trigram “only tiny problem” confidently as a positive sentiment indi-
cator. After grouping all such n-gram variants into the same skip-gram, the
algorithm can assign a large positive weight to the skip-gram as a whole, thus
also handling the rare cases properly. This also provides more concise rules and
facilitates user interpretation.
This is the worst kind of noise because the gap matches negation words and dif-
ferent instances of the skip-gram have opposite sentiments. Detecting and mod-
eling the scope of negations is very challenging in general [24]. We do not deal
with negations at skip-gram generation time; at learning time, we rely on feature
selection to eliminate such noisy skip-grams. In this particular example, the noise
is relatively low as the mismatched n-grams “I have never had to return” and “I
do not have to return” are very rare in the document collection. Therefore logistic
regression still assigns a large weight to this skip-gram. Some other skip-grams are
more likely to include negations and are thus more noisy. For example, the skip-
gram “I recommend” (slop=2) can match many occurrences of both “I highly
recommend” and “I do not recommend”. Our feature selection mechanism infers
that this skip-gram does more harm than good and assigns a small weight to it. In
practice, we find the denoising effect of feature selection to be satisfactory. Most of
the classification mistakes are not caused by skip-gram mismatch but due to the
inability to identify the subjects of the sentiment expressions: many reviews com-
pare several movies/products and thus the algorithm gets confused as to which
subject the sentiment expression should apply. Resolving this issue requires other
NLP techniques and is beyond the scope of this study.
Fig. 2. L1+L2 LR selected features for Amazon Phone feature contribution analysis, max n = 3. LEFT: feature count distribution in the dataset; MIDDLE: feature count distribution of selected features; RIGHT: feature LR-weighted distribution of selected features.
5 Conclusion
We demonstrate that skip-grams can be used to improve large scale sentiment
analysis performance in a model-efficient and scalable manner via regularized
logistic regression. We show that although n-grams beyond trigrams are often
very specific and sparse, many similar n-grams can be grouped into a single
skip-gram which benefits both model-efficiency and classification performance.
References
1. http://nlp.stanford.edu/software/
2. https://lucene.apache.org/
3. http://www.lemurproject.org/
4. http://terrier.org/
5. http://www.elasticsearch.org/
6. Dahl, G.E., Adams, R.P., Larochelle, H.: Training restricted Boltzmann machines
on word observations. arXiv preprint (2012). arxiv:1202.5695
7. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J.: Liblinear: a library
for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
8. Fernández, J., Gutiérrez, Y., Gómez, J.M., Martınez-Barco, P.: Gplsi: supervised
sentiment analysis in twitter using skipgrams. In: SemEval 2014, pp. 294–299
(2014)
9. Friedman, J., Hastie, T., Tibshirani, R.: glmnet: Lasso and elastic-net regularized
generalized linear models. R package version, 1 (2009)
10. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear
models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
11. Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram
modelling. In: LREC-2006, pp. 1–4 (2006)
12. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, vol.
2. Springer, New York (2009)
13. König, A.C., Brill, E.: Reducing the human overhead in text categorization. In:
KDD, pp. 598–603. ACM (2006)
14. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents.
arXiv preprint (2014). arxiv:1405.4053
15. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning
word vectors for sentiment analysis. In: ACL 2011, pp. 142–150. Association for
Computational Linguistics (2011)
16. Massung, S., Zhai, C., Hockenmaier, J.: Structural parse tree features for text
representation. In: ICSC, pp. 9–16. IEEE (2013)
17. McAuley, J., Leskovec, J.: Hidden factors and hidden topics: understanding rat-
ing dimensions with review text. In: Proceedings of the 7th ACM Conference on
Recommender Systems, pp. 165–172. ACM (2013)
18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint (2013). arxiv:1301.3781
19. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using
machine learning techniques. In: Proceedings of the ACL-02 Conference on Empir-
ical Methods in Natural Language Processing, vol. 10, pp. 79–86. Association for
Computational Linguistics (2002)
20. Paskov, H.S., West, R., Mitchell, J.C., Hastie, T.: Compressive feature learning.
In: NIPS, pp. 2931–2939 (2013)
21. Wager, S., Wang, S., Liang, P.S.: Dropout training as adaptive regularization. In:
NIPS, pp. 351–359 (2013)
22. Wang, L., Zhu, J., Zou, H.: The doubly regularized support vector machine.
Statistica Sinica 16(2), 589 (2006)
23. Wang, S.I., Manning, C.D.: Baselines and bigrams: simple, good sentiment and
topic classification. In: Proceedings of the ACL, pp. 90–94 (2012)
24. Wiegand, M., Balahur, A., Roth, B., Klakow, D., Montoyo, A.: A survey on the
role of negation in sentiment analysis. In: Proceedings of the Workshop on Nega-
tion and Speculation in Natural Language Processing, pp. 60–68. Association for
Computational Linguistics (2010)
Multi-task Representation Learning
for Demographic Prediction
Pengfei Wang, Jiafeng Guo(B) , Yanyan Lan, Jun Xu, and Xueqi Cheng
1 Introduction
This is particularly true for traditional offline retailers1, who collect users' demographic information mostly in a manual way (e.g. requiring customers to provide demographic information when registering for shopping cards).
In this paper, we try to infer users' demographic attributes based on their purchase history. Although some recent studies suggest that demographic attributes are predictable from different kinds of behavioral data, such as writing style [5], web browsing [16], electronic communications [8,12] and social media [14,23], to the best of our knowledge little work has been conducted on purchase behaviors in the retail scenario.
Previous work on demographic prediction usually predicts demographic attributes separately based on manually defined features [3,17,19,22,23]. For example, Zhong et al. [23] predicted six demographic attributes (i.e., gender, age, education background, sexual orientation, marital status, blood type and zodiac sign) separately by merging spatial, temporal and location knowledge features into a continuous space. Obviously, manually defined features usually require professional knowledge and often suffer from under-specification. Meanwhile, by treating each attribute as an independent prediction task, some attributes may be difficult to predict due to insufficient training data.
studies proposed to take the relations between multiple attributes into account
[3,22]. For example, Dong et al. [3] employed a Double Dependent-Variable Fac-
tor Graph model to predict gender and age simultaneously. Zhong et al. [22]
attempted to capture pairwise relations between different tasks when predicting
six demographic attributes from mobile data. However, these methods still rely
on various human-defined features which are often costly to obtain.
To tackle the above problems, in this paper we propose a Multi-task Representation Learning (MTRL) model to predict users' gender, marital status, and education background based on their purchase history. MTRL learns shared semantic representations across multiple tasks, which provides a more general representation for prediction. Specifically, we characterize each user
by his/her purchase history using the bag-of-item representations. We then map
all users’ representations into semantic space learned by a multi-task approach.
Thus we can obtain a more general shared representation to guide the prediction
task separately. Compared with previous methods, the major contributions of
our work are as follows:
– We make the first attempt to investigate the prediction power of users’ pur-
chase data for demographic prediction in the retail scenario.
– We apply a multi-task learning framework (MTRL) for our problem, which
can learn a shared robust representation across tasks and alleviate the data
sparse problem.
– We conduct extensive experiments on a real-world retail dataset to demon-
strate the effectiveness of the proposed MTRL model as compared with dif-
ferent baseline methods.
1 In our work, we mainly focus on traditional retailers in offline business rather than those in online e-commerce, where no behavioral data other than transactions are available for analysis. Hereafter we will use retail/retailer for simplicity when there is no ambiguity.
The rest of the paper is organized as follows. After a summary of related work in Sect. 2, we describe the problem formalization of demographic prediction in the retail scenario and present our proposed model in detail in Sect. 3. Section 4 reports the experiments. Section 5 concludes this paper and discusses future work.
2 Related Work
In this section we briefly review three research areas related to our work: demo-
graphic attribute prediction, multi-task approach, and representation learning.
3 Our Approach
In this section, we first give the motivation of our work, then we introduce the
formalization of demographic prediction problem in the retail scenario. After
that, we describe the proposed MTRL in detail. Finally, we present the learning
procedure of MTRL.
3.1 Motivation
Obviously, a fundamental problem in demographic prediction based on users' behavioral data is how to represent users. Much existing work investigated different types of human-defined features [3,17,22]. However, defining features manually is time-consuming since expert knowledge is required and the process has to be repeated for every new setting. Moreover, human-defined features often suffer from under-specification since it is difficult to identify the hidden, complicated factors relevant to the prediction tasks. Recent work mainly employs unsupervised feature learning methods [8,12,23], like Singular Value Decomposition (SVD), to automatically extract low-dimensional features from the raw data. However, the features learned
in an unsupervised manner may not be suitable for the prediction tasks. Therefore, given the weaknesses of manual feature engineering, in this paper we propose to automatically learn user representations for demographic prediction through a supervised method. Furthermore, some attributes are difficult to obtain (for example, only 8.96 % of users provide their education background in the BeiRen dataset we used). The sparseness of the data thus aggravates the difficulty of modeling each task separately [2,12,23]. In addition, modeling the tasks independently may ignore the correlations among these attributes.
Motivated by all these issues and inspired by [11], in this paper we propose a multi-task approach to learn a general representation for predicting users' demographic attributes.
Fig. 1. The structure of Multi-task Representation Learning (MTRL) model. The lower
two layers are shared across all the tasks, while top layers are task-specific. The input
is represented as a bag of items. Then a non-linear projection W is used to gener-
ate a shared representation. Finally, for each task, additional non-linear projection V
generates task-specific representations.
by his/her purchase history, i.e., a set of items. In MTRL, we take the bag-of-item representation as the user input $x_{(i)}$; the shared layer is then fully connected to the input layer with weight matrix $W = [w_{h,s}]$:
$$Y_{(i),s} = f\Big(\sum_{h} w_{h,s} \cdot x_{(i),h}\Big)$$
On top of the shared layer, each task has an additional non-linear projection $V^t = [v_{s,j}^t]$ that generates its task-specific representation:
$$Y_{(i),j}^t = f\Big(\sum_{s} v_{s,j}^t \cdot Y_{(i),s}\Big)$$
where $t$ denotes the different tasks (gender, marital status, and education background), and $Y_{(i),j}^t$ is the value of the $j$-th node of the task-specific representation layer of task $t$.
After these, we use a softmax activation function to calculate the value of the $k$-th node in the output layer:
$$Y_{(i),k}^t = \frac{\exp\big(\sum_{j} h_{j,k}^t \cdot Y_{(i),j}^t\big)}{\sum_{k'} \exp\big(\sum_{j} h_{j,k'}^t \cdot Y_{(i),j}^t\big)}$$
where $H^t = [h_{j,k}^t]$ is the matrix that maps the task-specific representation to the output layer for task $t$; the $k$-th node in the output layer corresponds to the $k$-th label in task $t$.
where $d_{(i),k}^t$ is the ground-truth value of the $k$-th node for user $i$ under task $t$; for example, if user $i$ has the $k$-th label for task $t$, then $d_{(i),k}^t = 1$, otherwise $d_{(i),k}^t = 0$. $\lambda$ is the regularization constant and $\Theta$ are the model parameters (i.e. $\Theta = \{W, V^t, H^t\}$).
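To make the architecture concrete, a NumPy sketch of one MTRL forward pass (shared layer, task-specific layers, softmax outputs) is given below; the layer sizes, label counts, tanh non-linearity and random initialisation are illustrative assumptions, and training with the regularised loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, shared_dim, task_dim = 1000, 128, 64
tasks = {"gender": 2, "marital": 2, "education": 4}     # label counts are assumptions

# Shared projection W and per-task projections V^t and output matrices H^t.
W = rng.normal(scale=0.01, size=(n_items, shared_dim))
V = {t: rng.normal(scale=0.01, size=(shared_dim, task_dim)) for t in tasks}
H = {t: rng.normal(scale=0.01, size=(task_dim, k)) for t, k in tasks.items()}

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mtrl_forward(x):
    """x is the bag-of-item vector of one user; returns per-task label distributions."""
    shared = np.tanh(x @ W)                              # shared representation
    return {t: softmax(np.tanh(shared @ V[t]) @ H[t]) for t in tasks}

# Toy user who bought items 3, 17 and 256.
x = np.zeros(n_items); x[[3, 17, 256]] = 1.0
print({t: p.round(3) for t, p in mtrl_forward(x).items()})
```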
4 Experiments
In this section, we conduct empirical experiments to demonstrate the effective-
ness of our proposed MTRL model on demographic attribute prediction in the
retail scenario. We first introduce the experimental settings. Then we compare
our MTRL model with the baseline methods to demonstrate the effectiveness of
predicting users’ demographic attributes in the retail scenario.
Dataset. We conduct our empirical experiments over a real world large scale
retail dataset, named BeiRen dataset. This dataset comes from a large retailer2
in China, which records its supermarket purchase histories during the period
from 2012 to 2013. For research purpose, the dataset has been anonymized with
all the users and items denoted by randomly assigned IDs for the privacy issue.
We first pre-process the BeiRen dataset. We randomly sample 100,000 users and extract all the transactions related to these users to form their purchase histories; we then remove all items bought fewer than 10 times and all users with no labels. After pre-processing, the dataset contains 64,097 distinct items and 80,540 distinct users with at least one demographic attribute. On average, each user has bought about 225.5 distinct items.
– BoI-Single: Each user is represented by the items he/she has purchased with the Bag-of-Item representation, and a logistic model3 is learned to predict each demographic attribute separately.
– SVD-single: A singular value decomposition (SVD)4 is first conducted over the user-item matrix to obtain low-dimensional representations of users. Then a logistic model is learned over the low-dimensional representation to predict each demographic attribute separately. This method has been widely used in the demographic attribute prediction task [8,16,23].
– SL: The Single Representation Learning model, which is a special case of MTRL when there is only one single task to learn. The SL model has the same neural structure as MTRL but does not consider the relationships among tasks.
2 http://www.brjt.cn/.
3 http://www.csie.ntu.edu.tw/∼cjlin/liblinear/.
4 http://tedlab.mit.edu/∼dr/SVDLIBC/.
$$\text{wRecall} = \frac{1}{|U|}\sum_i I\big(Y^t_{(i)} = y^t_{(i)}\big)$$
$$\text{wF1} = \frac{2 \times \text{wPrecision} \times \text{wRecall}}{\text{wPrecision} + \text{wRecall}}$$
where $Y^t_{(i)}$ represents the predicted label of user i under task t and $I(\cdot)$ is an indicator function. Note that we use weighted evaluation metrics because every class in the gender, marital status, and education tasks is equally important. As can be seen, the weighted recall is the prediction accuracy from the user's point of view.
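For reference, the weighted metrics can be computed as in the sketch below, where per-class precision and recall are weighted by class frequency and wF1 combines the two weighted scores exactly as in the formula above; reading "weighted" as class-frequency weighting is an assumption consistent with wRecall reducing to accuracy.

```python
import numpy as np

def weighted_prf(y_true, y_pred):
    """Class-frequency-weighted precision, recall and F1 (a sketch)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, counts = np.unique(y_true, return_counts=True)
    weights = counts / counts.sum()
    precisions, recalls = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        predicted_c = np.sum(y_pred == c)
        precisions.append(tp / predicted_c if predicted_c else 0.0)
        recalls.append(tp / np.sum(y_true == c))
    w_prec = float(np.dot(weights, precisions))
    w_rec = float(np.dot(weights, recalls))   # equals the overall accuracy
    denom = w_prec + w_rec
    w_f1 = 2 * w_prec * w_rec / denom if denom else 0.0
    return w_prec, w_rec, w_f1
```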
The performance of different methods is shown in Fig. 2. We have the follow-
ing observations:
Fig. 2. wPrecision, wRecall, and wF1 of BoI-single, SVD-single, SL, and MTRL; panel (a) shows the gender prediction task.
– (2) Both of the deep models, SL and MTRL, perform better than SVD-single. This result demonstrates that the deep models learn a better representation than the shallow one (here we regard SVD-single as a shallow model).
– (3) MTRL can improve the performance of each demographic prediction task significantly, especially for the education prediction task, which has limited data.
– (4) By using a multi-task approach to learn a shared representation layer across tasks, we can obtain better performance than SL, which shows that the correlations among demographic attributes are helpful. MTRL achieves the best performance in terms of all the evaluation measures; for example, when compared with the second best method (SL), the improvement in weighted F1-Measure on gender, marital status, and education background is 2.6 %, 1.6 %, and 6.4 %, respectively. By conducting significance tests, we find that the improvement of MTRL over the SL method is significant (p-value < 0.01) in terms of all the evaluation metrics.
5 Conclusion
In this paper, we predict users’ demographic attributes given their purchase behaviors. We propose a robust and practical representation learning algorithm, MTRL, based on multi-task objectives. MTRL learns a shared representation across tasks, so the sparseness problem can be alleviated, especially for tasks with limited data. Experiments on a real-world purchase dataset demonstrate that our model consistently outperforms the state-of-the-art baselines under different evaluation metrics.
Although the MTRL model is proposed in this retail scenario, it is in fact a general model which can be applied to other multi-task, multi-class problems. In the future, we would like to extend our MTRL model to more demographic attributes to verify its effectiveness. Moreover, in this paper we represent each user by a simple bag of items as the raw input. It would be interesting to further explore the natural transaction structures in users’ purchase data for better demographic prediction.
Acknowledgment. This research was funded by the 863 Program of China under Grant 2014AA015204, the 973 Program of China under Grants 2014CB340401 and 2012CB316303, the National Natural Science Foundation of China under Grants 61472401, 61433014, 61203298, and 61425016, the Key Research Program of the Chinese Academy of Sciences under Grant KGZD-EW-T03-2, the Youth Innovation Promotion Association CAS under Grant 20144310, and the Technology Innovation and Transformation Program of Shandong under Grant 2014CGZH1103.
References
1. Bi, B., Shokouhi, M., Kosinski, M., Graepel, T.: Inferring the demographics of
search users: Social data meets search queries. In: Proceedings of the 22nd Inter-
national Conference on World Wide Web, pp. 131–140 (2013)
2. Cutler, J., Culotta, A., Ravi, N.K.: Predicting the demographics of twitter users
from website traffic data. In: ICWSM, in press. AAAI Press, Menlo Park, California
(2015)
3. Dong, Y., Yang, Y., Tang, J., Yang, Y., Chawla, N.V.: Inferring user demographics
and social strategies in mobile social networks. In: Proceedings of the 20th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 15–24 (2014)
4. Torres, S.D., Weber, I.: What and how children search on the web. In: Proceed-
ings of the 20th ACM International Conference on Information and Knowledge
Management, CIKM 2011, pp. 393–402. ACM, New York (2011)
5. Eckert, P.: Gender and sociolinguistic variation. In: Readings in Language and
Gender (1997)
6. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of the
Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 109–117. ACM (2004)
7. Hinton, G., Graves, A., Mohamed, A.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)
8. Hu, J., Zeng, H.-J., Li, H., Niu, C., Chen, Z.: Demographic prediction based on
user’s browsing behavior. In: Proceedings of the 16th International Conference on
World Wide Web, pp. 151–160. ACM (2007)
9. Putler, D.S., Kalyanam, K.: Incorporating demographic variables in brand choice
models: An indivisible alternatives framework. Mark. Sci. 16(2), 166–181 (1997)
10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Systems
(2012)
11. Liu, X., Gao, J., He, X., Deng, L., Duh, K., Wang, Y.-Y.: Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2015)
12. Stillwell, D., Kosinski, M., Graepel, T.: Private traits and attributes are predictable
from digital records of human behavior. In: Proceedings of the National Academy
of Sciences (2013)
13. Micchelli, C., Pontil, M.: Kernels for multi-task learning. In: NIPS (2005)
14. Mislove, A., Viswanath, B., Gummadi, K.P., Druschel, P.: You are who you know:
Inferring user profiles in online social networks. In: WSDM, pp. 251–260 (2010)
15. Mnih, A., Hinton, G.: Three new graphical models for statistical language mod-
elling. In: Proceedings of the 24th International Conference on Machine Learning,
pp. 641–648 (2007)
16. Murray, D., Durrell, K.: Inferring demographic attributes of anonymous internet
users. In: Masand, B., Spiliopoulou, M. (eds.) WebKDD 1999. LNCS (LNAI), vol.
1836, pp. 7–20. Springer, Heidelberg (2000)
17. Otterbacher, J.: Inferring gender of movie reviewers: exploiting writing style, con-
tent and metadata. In: Proceedings of the 19th ACM International Conference
on Information and Knowledge Management, CIKM 2010, pp. 369–378. ACM,
New York (2010)
18. Currim, I.S., Andrews, R.L.: Identifying segments with identical choice behaviors
across product categories: an intercategory logit mixture model. Int. J. Res. Mark.
19(1), 65–79 (2002)
19. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age, gender on
blogging. In: AAAI Spring Symposium: Computational Approaches to Analyzing
Weblogs, pp. 199–205. AAAI (2006)
20. Sedhain, S., Sanner, S., Braziunas, D., Xie, L., Christensen, J.: Social collabo-
rative filtering for cold-start recommendations. In: Proceedings of the 8th ACM
Conference on Recommender Systems, pp. 345–348 (2014)
21. Sun, S., Ji, Y.: Multitask multiclass support vector machines. In: Data Mining
Workshops (ICDMW) (2011)
22. Zhong, E., Tan, B., Mo, K., Yang, Q.: User demographics prediction based on
mobile data. Pervasive Mob. Comput. 9(6), 823–837 (2013)
23. Zhong, Y., Yuan, N.J., Zhong, W., Zhang, F., Xie, X.: You are where you
go: Inferring demographic attributes from location check-ins. In: WSDM 2015,
pp. 295–304. ACM, New York (2015)
Large-Scale Kernel-Based Language Learning
Through the Ensemble Nyström Methods
1 Introduction
Kernel methods [24] have been employed in many Machine Learning algorithms [5,25], achieving state-of-the-art performance in many tasks. One drawback of expressive but complex kernel functions, such as Sequence [2] or Tree kernels [4], is the time and space complexity of the learning and classification phases, which may prevent their adoption when large data volumes are involved. While important steps forward have been made in defining linear algorithms [10,16,22,23], the adoption of complex kernels is still limited. Some approaches have been defined to scale up kernel-based methods, such as [11,13,26,28], but they are still specific to particular kernel formulations and learning algorithms.
A viable solution to scalability issues is the Nyström methodology [29], which maps the original data into low-dimensional spaces and can be applied to the implicit space determined by the kernel function. These linear representations thus enable the application of scalable and performant learning methods, capitalizing on the large existing literature on linear methods. The idea is to use kernels to decouple the representation of complex problems from the learning, and to make use of the Nyström dimensionality reduction method to derive a linear mapping effectively. To the best of our knowledge, this is the first time this perspective has been pursued in the area of language learning acting on discrete linguistic structures whose kernels
have been largely discussed [6]. In [30] a different solution, namely the Distributed Tree Kernel (DTK), has been proposed to approximate standard tree kernels [4] by defining an explicit function mapping trees to vectors. However, DTKs are designed to approximate specific tree kernel functions, while the proposed Nyström method can be applied to any kernel function.
In a nutshell, the Nyström method allows mapping a linguistic instance into
a low-dimensional dense vector with up to l dimensions. Here, the representation
of an instance o is obtained by selecting a set of l training instances, so-called
landmarks, and the cost of projecting o is essentially O(lk), where k is the cost
of a single kernel computation over linguistic objects such as o. This cost has
been theoretically bounded [9] and the linguistic quality of the resulting space
depends on the number of selected landmarks characterizing that space. The
overall approach is highly applicable, without a bias toward input data, adopted
kernels or learning algorithms. Moreover, the overall computational cost can be
easily distributed across several machines, by adopting the Ensemble Nyström
Method, presented in [17] as a possible learning scheme. In this variant, several
representations of an example are created by selecting p subsets of landmarks.
An approximation of the target kernel space is obtained by a linear combination
of different spaces, acquired separately. A crucial factor influencing the scalabil-
ity of our method is the cost of creating linear representations for complex (i.e.
tree) structures. When no caching scheme is adopted, linear mappings should be
invoked several times. Among the algorithms that bound the number of times
a single instance is re-used during training, we investigated the Dual Coordinate Descent algorithm [15]: it is a batch learning algorithm in which the achievable accuracy is tied to the number of iterations over the training dataset. Online schemes are also very appealing, as they avoid keeping the entire dataset in memory; we also investigated Soft Confidence-Weighted Learning [27], an online learning algorithm that shows state-of-the-art accuracy compared with most online algorithms. An experimental investigation of the impact of
our learning methodology has been carried out: we adopted different robust and
scalable algorithms over two kernel-based language learning tasks, i.e. Question
Classification (QC) and Argument Boundary Detection (ABD) in Semantic Role
Labeling. The compact linear approximations produced by our method achieve
results comparable with their full kernel-based counterparts, by requiring a negli-
gible fraction of the kernel computations w.r.t. standard methods. Moreover, we
trained a kernel-based ABD classifier over about 1.4 millions of examples, a size
that was hardly tractable before. In the rest of the paper, the adopted method-
ology is discussed in Sect. 2, the large-scale learning algorithms are presented in
Sect. 3, while the empirical evaluation is discussed in Sect. 4.
The Nyström method [29] allows reducing the computational cost of kernel-
based learning algorithms by providing an approximation of the Gram Matrix
underlying the used kernel function.
$$G \approx \tilde{G} = C W^{\dagger} C^{\top} \qquad (2)$$
$$G \approx \tilde{G} = C U S^{-\frac{1}{2}} S^{-\frac{1}{2}} U^{\top} C^{\top} = \big(C U S^{-\frac{1}{2}}\big)\big(C U S^{-\frac{1}{2}}\big)^{\top} = \tilde{X} \tilde{X}^{\top} \qquad (3)$$
$$\tilde{x}_i = c_i\, U S^{-\frac{1}{2}} \qquad (4)$$
for future work. Several policies have been defined to determine the best selection
of landmarks to reduce the Gram Matrix approximation error. In this work
the uniform sampling without replacement is adopted, as suggested by [18],
where this policy has been theoretically and empirically shown to achieve results
comparable with other (and more complex) selection policies.
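As a concrete illustration of Eqs. 2–4, the following minimal sketch computes the Nyström mapping for a set of instances; kernel, data and landmarks are placeholders, and the standard convention is assumed in which W is the kernel matrix among the l landmarks and c_i collects the kernel values between instance i and the landmarks.

```python
import numpy as np

def nystrom_projection(data, landmarks, kernel):
    """Map every instance into the l-dimensional Nystrom space (Eq. 4).

    data      : list of instances (e.g. trees, sequences or vectors)
    landmarks : list of l selected training instances
    kernel    : function kernel(a, b) -> float
    Returns an (n, l) matrix whose rows are the vectors x_tilde_i.
    """
    W = np.array([[kernel(a, b) for b in landmarks] for a in landmarks])
    C = np.array([[kernel(o, b) for b in landmarks] for o in data])
    U, s, _ = np.linalg.svd(W)                       # W is symmetric PSD
    proj = U @ np.diag(1.0 / np.sqrt(np.maximum(s, 1e-12)))
    return C @ proj                                  # x_tilde_i = c_i U S^{-1/2}
```

The rows of the returned matrix can then be fed to any linear learner, which is the point of the whole construction.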
The Ensemble Nyström Method. In order to minimize the bias introduced
by the policy of landmark selection on the approximation quality, we apply a
redundant approach, called Ensemble Nyström Method presented in [17]: the
main idea is to treat each approximation generated by the Nyström method
through a sample of l columns as an “expert” and to combine p ≥ 1 such experts
to derive an improved hypothesis, typically more accurate than any of the origi-
nal experts. The Ensemble Nyström Method presented in [17] selects a collection
of p samples, each sample containing l columns of W . The ensemble method
combines the samples to construct an approximation of the form
$$G^{ens}_{m,p} = \sum_{i=1}^{p} \lambda^{(i)} C^{(i)} {W^{(i)}}^{\dagger} {C^{(i)}}^{\top} \qquad (5)$$
where the $\lambda^{(i)}$ reflect the confidence of each expert, with $\sum_{i=1}^{p} \lambda^{(i)} = 1$. Typically, the ensemble Nyström method seeks the weights by minimizing $\|G - G^{ens}\|^2$. A simple but effective strategy is to set the weights as $\lambda^{(1)} = \cdots = \lambda^{(p)} = \frac{1}{p}$, as shown in [17]. More details about the upper bound on the norm-2 error of the Nyström approximation, $\|G - \tilde{G}\|_2 / \|G\|_2$, are reported in [9,17].
In practice, each expert is developed through a projection function that takes an instance o and its vector $\phi(o) = x$ and, through Eq. 4, maps it into an l-dimensional vector $\tilde{x}^{(i)}$ according to the i-th independent sample, i.e. the choice of landmarks $L^{(i)}$: as a result we have p such vectors $\tilde{x}^{(i)}$, for $i = 1, \ldots, p$. A unified representation $\tilde{x}$ for the source instance o is thus derived through the simple concatenation of the different $\tilde{x}^{(i)}$; since there are exactly p of them, the overall dimensionality of $\tilde{x}$ is $pl$.
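Under the same assumptions, the ensemble projection can be sketched as the concatenation of p independent Nyström mappings; scaling each block by the square root of the uniform weight 1/p makes inner products in the concatenated space reproduce the weighted sum of Eq. 5, and nystrom_projection refers to the sketch given earlier.

```python
import random
import numpy as np

def ensemble_nystrom_projection(data, train_pool, kernel, l=100, p=3, seed=0):
    """Concatenate p independent Nystrom projections (dimensionality p * l)."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(p):
        landmarks = rng.sample(train_pool, l)        # uniform sampling w/o replacement
        block = nystrom_projection(data, landmarks, kernel)
        blocks.append(np.sqrt(1.0 / p) * block)      # uniform expert weights lambda = 1/p
    return np.hstack(blocks)
```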
Complexity. The runtime of the Nyström method is $O(l^3 + nl^2)$, as it depends on the SVD evaluation on W (i.e. $O(l^3)$) and on the projection of the entire dataset through the multiplication by C (i.e. $O(nl^2)$). The complexity of the Ensemble method is $O\big(p(l^3 + nl^2)\big)$. This analysis supposes that the kernel function has
a cost comparable to the other operations. For several classes of kernels, such
as Tree or Sequence Kernels [4], the above cost can be assumed negligible with
respect to the cost of building vectors ci . Under this assumption, the computa-
tion cost is O(kl) with k the cost of a single kernel computation. Regarding the
Ensemble method, it is worth noting that the construction of each projection can
be distributed. The space complexity to derive W † is O(l2 ) while the projection
of a dataset of size d is O(ld). In the Ensemble setting, the space complexity is
O(pl2 ) while the projection of a dataset of size d is O(pld).
$$\min_{w \in \mathbb{R}^l}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{d} \max\{0,\; 1 - y_i\, w^{\top} \tilde{x}_i\} \qquad (6)$$
Here, Q is a $d \times d$ matrix whose entries are given by $Q_{ij} = y_i y_j \tilde{x}_i^{\top} \tilde{x}_j$, and $\mathbf{1}$ is the vector of all ones. The minimizer $w^*$ of Eq. 6 and the minimizer $\alpha^*$ of Eq. 7 are related by the primal/dual connection: $w^* = \sum_{i=1}^{d} \alpha^*_i y_i \tilde{x}_i$. The dual problem in Eq. 7 is a Quadratic Program (QP) with box constraints, and the i-th coordinate $\alpha_i$ corresponds to the i-th instance $(\tilde{x}_i, y_i)$.
According to [15], the following coordinate descent scheme can be used to
minimize Eq. 7:
– Initialize $\alpha^1 = (0, \ldots, 0)$
– At iteration t select coordinate $i_t$
– Update $\alpha^t$ to $\alpha^{t+1}$ via
$$\alpha^{t+1}_{i_t} = \operatorname*{argmin}_{0 \le \alpha_{i_t} \le C} D\big(\alpha^t + (\alpha_{i_t} - \alpha^t_{i_t})\, e_{i_t}\big)$$
Here, $e_i$ denotes the i-th standard basis vector. Since $D(\alpha)$ is a QP, the above problem can be solved exactly:
$$\alpha^{t+1}_{i_t} = \min\Big(\max\Big\{0,\; \alpha^t_{i_t} - \frac{\nabla_{i_t} D(\alpha^t)}{Q_{i_t i_t}}\Big\},\; C\Big). \qquad (9)$$
Here, $\nabla_{i_t} D(\alpha)$ denotes the $i_t$-th coordinate of the gradient. The above updates are also closely related to implicit updates. If we maintain $w^t := \sum_i \alpha^t_i y_i \tilde{x}_i$, then the gradient $\nabla_{i_t} D(\alpha)$ can be computed efficiently using
$$\nabla_{i_t} D(\alpha) = e_{i_t}^{\top}(Q\alpha - \mathbf{1}) = y_{i_t}\, {w^t}^{\top} \tilde{x}_{i_t} - 1 \qquad (10)$$
and kept consistent with $\alpha^{t+1}$ by computing $w^{t+1} = w^t + (\alpha^{t+1}_{i_t} - \alpha^t_{i_t})\, y_{i_t} \tilde{x}_{i_t}$. In each iteration, the entire dataset is used to optimize Eq. 6, and a practical choice is to access the examples in random order. In [15], the proposed method is shown to reach an $\epsilon$-accurate solution in $O(\log(1/\epsilon))$ iterations, so we can bound the number of iterations in order to fix a priori the computational cost of training.
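A compact sketch of these updates is given below; the fixed number of epochs, the per-epoch random permutation and the absence of shrinking heuristics are simplifications of the full algorithm of [15].

```python
import numpy as np

def dcd_train(X, y, C=1.0, n_epochs=30, seed=0):
    """Dual Coordinate Descent for the linear SVM of Eq. 6 (a sketch).

    X : (d, l) matrix of linearized examples x_tilde
    y : (d,) labels in {-1, +1}
    """
    rng = np.random.default_rng(seed)
    d, l = X.shape
    alpha = np.zeros(d)
    w = np.zeros(l)                                  # w = sum_i alpha_i y_i x_i
    Qii = np.einsum('ij,ij->i', X, X)                # Q_ii = ||x_i||^2 (since y_i^2 = 1)
    for _ in range(n_epochs):
        for i in rng.permutation(d):                 # random access to the examples
            grad = y[i] * w.dot(X[i]) - 1.0          # Eq. 10
            new_alpha = min(max(alpha[i] - grad / max(Qii[i], 1e-12), 0.0), C)  # Eq. 9
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return w
```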
Passive Aggressive. The Passive Aggressive (PA) learning algorithm [5] is one
of the most popular online approaches. When an example is misclassified, the
model is updated with the hypothesis most similar to the current one, among
the set of classification hypotheses that correctly classify the example.
More formally, let $(\tilde{x}_t, y_t)$ be the t-th example, where $\tilde{x}_t \in \mathbb{R}^l$ is a feature vector in an l-dimensional space and $y_t \in \{\pm 1\}$ is the corresponding label. Let $w_t \in \mathbb{R}^l$ be the current classification hypothesis. As for the DCD, the PA classification function is linear, i.e. $f(\tilde{x}) = w^{\top}\tilde{x}$. The learning procedure starts by setting $w_1 = (0, \ldots, 0)$, and after receiving $\tilde{x}_t$, the new classification function $w_{t+1}$ is the one that minimizes the following objective function:
$$Q(w) = \frac{1}{2}\|w - w_t\|^2 + C \cdot \ell\big(w; (\tilde{x}_t, y_t)\big)^2 \qquad (11)$$
where the first term $\|w - w_t\|$ is a measure of how much the new hypothesis differs from the old one, while the second term $\ell\big(w; (\tilde{x}_t, y_t)\big)$ is a proper loss function assigning a penalty cost to an incorrect classification. C is the aggressiveness parameter that balances the two competing terms in Eq. 11. Minimizing $Q(w)$ corresponds to solving a constrained optimization problem, whose closed-form solution is the following:
$$w_{t+1} = w_t + \alpha_t \tilde{x}_t, \qquad \alpha_t = y_t \cdot \frac{H\big(w_t; (\tilde{x}_t, y_t)\big)}{\|\tilde{x}_t\|^2 + \frac{1}{2C}} \qquad (12)$$
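The closed-form PA-II update of Eq. 12 can be sketched as follows; treating H as the hinge loss and updating only when the loss is positive are assumptions consistent with the usual PA-II formulation.

```python
import numpy as np

def pa2_train(stream, dim, C=1.0):
    """Online PA-II updates over a stream of (x_tilde, y) pairs (a sketch)."""
    w = np.zeros(dim)
    for x, y in stream:
        loss = max(0.0, 1.0 - y * w.dot(x))          # hinge loss H(w_t; (x_t, y_t))
        if loss > 0.0:
            alpha = y * loss / (x.dot(x) + 1.0 / (2.0 * C))   # Eq. 12
            w = w + alpha * x
    return w
```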
where $\Sigma_1$ is initially set to the identity matrix and $\eta \in (0.5, 1]$ is the probability of correct classification required of the updated distribution on the current instance.
Unfortunately, the CW method may adopt an overly aggressive updating strategy, in which the distribution changes too much in order to satisfy the constraint imposed by an individual input example. Although this speeds up the learning process, it could force wrong model updates (i.e. undesirable changes in the parameters of the distribution) caused by mislabeled instances. This makes the original CW algorithm perform poorly in many noisy real-world applications. To overcome this limitation, the Soft Confidence-Weighted extension of standard CW learning has been proposed [27], with a more flexible handling of non-separable cases. After the introduction of the following loss function:
$$\ell^{\phi}\big(N(\mu, \Sigma); (\tilde{x}_t, y_t)\big) = \max\big\{0,\; \phi\sqrt{\tilde{x}_t^{\top} \Sigma\, \tilde{x}_t} - y_t\, \mu \cdot \tilde{x}_t\big\},$$
where $\phi = \Phi^{-1}(\eta)$ is the inverse cumulative function of the normal distribution, the optimization problem of the original CW can be re-written as follows [27]:
$$(\mu_{t+1}, \Sigma_{t+1}) = \operatorname*{arg\,min}_{\mu, \Sigma}\; D_{KL}\big(N(\mu, \Sigma)\,\|\,N(\mu_t, \Sigma_t)\big) + C\, \ell^{\phi}\big(N(\mu, \Sigma); (\tilde{x}_t, y_t)\big)^2$$
4 Experimental Evaluations
In the following experimental evaluations, we applied the proposed Ensemble Nyström methodology to two language learning tasks, i.e. Question Classification (QC) and Argument Boundary Detection (ABD) in Semantic Role Labeling. All the
kernel functions and learning algorithms used in these experiments have been
implemented and released in the KeLP framework3 [12].
Question Classification. Question Classification (QC) is usually applied in Question Answering systems to map a question into one of k classes of answers, thus constraining the search. In these experiments, we used the UIUC dataset [19]. It is composed of a training set of 5,452 questions and a test set of 500 questions4, organized into 6 classes (like ENTITY or HUMAN). The contribution of (structured) kernel-based learning within batch algorithms has already been shown for this task, as in [31]. In these experiments the Smoothed Partial Tree Kernel (SPTK) is applied, as it obtains state-of-the-art results on this task by directly acting over tree structures derived from the syntactic analysis of the questions [6]. The SPTK measures the similarity between two trees proportionally to the number of shared syntactic substructures, whose lexical nodes contribute according to a Distributional Lexical Semantic Similarity metric between word vectors. In particular, lexical vectors are obtained through the distributional analysis of the UkWaC corpus, comprising 2 billion words, as discussed in [6]. While the learning algorithms discussed in Sect. 3 acquire binary classifiers, QC is a multi-class task, so a One-vs-All scheme is adopted to combine the binary outcomes [21].
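A One-vs-All wrapper over any of the binary learners discussed in Sect. 3 can be sketched as follows; train_binary is a placeholder for, e.g., the DCD or PA-II routines sketched earlier, and the learned models are assumed to be plain weight vectors.

```python
import numpy as np

def one_vs_all_train(X, labels, train_binary):
    """Train one binary classifier per class (One-vs-All), a sketch."""
    models = {}
    for c in sorted(set(labels)):
        y = np.where(np.asarray(labels) == c, 1, -1)
        models[c] = train_binary(X, y)
    return models

def one_vs_all_predict(x, models):
    """Assign the class whose binary model gives the highest score."""
    return max(models, key=lambda c: models[c].dot(x))
```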
We acquired the linear approximations of the trees composing the dataset with from 100 up to 500 dimensions. Landmarks have been selected by applying random selection without replacement, as suggested in [18]; as the selection is random, the evaluations reported here are the mean results over ten different selections of landmarks. Moreover, we applied several ensembles using p = 1, 2, 3 experts. These linear approximations are used within the PA-II, SCW-II and DCD implementations of the SVM. Different numbers of iterations have been adopted for each algorithm. All the parameters of the kernel functions and the algorithms have been estimated over a development set.
In Table 1, results in terms of Accuracy, i.e. the percentage of correctly labeled
examples over the test set, are reported: rows reflect the choice of p and l used in
the kernel-based Nyström approximation of the tree structures. Columns reflect
the learning algorithms and the number of iterations applied in the tests. The
last row reports results obtained by standard kernel-based algorithms. This cor-
responds to a sort of upper bound of the quality achievable by the reference
kernel function. In particular, the kernel-based C-SVM formulation [3] and the
Kernel-based PA-II [5] are adopted. The SCW is not applied, as the kernel
counterpart does not exist. The kernel-based C-SVM achieves the best result
(93.8 %), comparable with the PA-II when five iterations are applied (93.4 %).
3 http://sag.art.uniroma2.it/demo-software/kelp/.
4 http://cogcomp.cs.illinois.edu/Data/QA/QC/.
Linear counterparts are better when a higher number of dimensions and experts are used in the approximation. The PA-II learning algorithm seems weaker than the SCW-II, while the SVM formulation achieves the best results. However, the number of iterations required by the DCD is higher than for the SCW-II: the former requires 30 iterations (reaching up to 92.4 %), a result that the SCW-II approximates with only 2 iterations (92.0 %). With a lower number of iterations, the DCD performs worse than the SCW-II. The results reported here are evaluated on the same test set used in [11,13], where the best results are 91.1 % and 91.4 %, respectively.
These results are remarkable, as our method requires far fewer kernel computations, as shown in Table 2. We measured the total number of kernel operations required by all the above kernel-based settings5, as reported in the last row of Table 2. The percentage of computation saved by the SCW-II is impressive: kernel-based methods, which achieve a comparable accuracy, require a considerably higher computational cost; the adoption of the Nyström linearization allows avoiding from 80 % to more than 95 % of the kernel computations.
Argument Boundary Detection in Semantic Role Labeling. Semantic Role Labeling is a natural language processing task that can be defined over the frame-based semantic interpretation of sentences. Frames are linguistic predicates providing a semantic description of real-world situations.
5 C-SVM [3] proposes a caching policy, here ignored for comparative purposes. Large-scale applications may impose prohibitive requirements on the required space.
Table 2. Saving of kernel operations obtained by the SCW-II compared with the
kernel-based learning algorithms
according to the 90/10 proportion. This size makes the straightforward application of a traditional kernel-based method infeasible. We preserved the application of the Smoothed Partial Tree Kernel and investigated the same dimensions and sampling applied in the previous experimental settings. Given the size of the dataset, we adopted (only) an online learning scheme, by applying the SCW-II learning algorithm that achieved the best result in the previous QC task (see Table 1). For this binary task, we report results in Table 3 using the standard F1 metric. This is the first time a kernel-based method has been applied to this dataset with the entire training set used for training. For comparison, we refer to [14], where the Budgeted Passive Aggressive learning algorithm and the Distributed Tree Kernel have been applied to a subset of up to 100,000 examples. In [14] the authors approximate the Syntactic Tree Kernel (STK) proposed in [4], so we also approximated this kernel. Table 3 shows in the first two columns the results where the STK is approximated, while the SPTK is used in the last columns. The SPTK is more robust than the STK and, given the size of the training material, we are able to outperform the solution proposed in [14], which achieved 0.645 with an approximation derived by applying the Distributed Tree Kernel proposed in [30]. However, the Nyström ensemble approximation derived through the STK achieves an F1 of 0.585, which is higher than all the results reported in [14] at a similar dimensionality. In conclusion, the combination of the Nyström method with the SCW-II achieves an F1 of 0.724 (a relative improvement of 17 %). These outcomes suggest that applying structured learning to datasets of this size is both effective and viable.
5 Conclusions
In this paper the Nyström methodology has been discussed as a viable solution to the scalability issues in kernel-based language learning. It allows deriving low-dimensional linear representations of training examples, regardless of their original representations (e.g. vectors or discrete structures), by approximating the implicit space underlying a kernel function. These linear representations enable the application of scalable and performant linear learning methods. Large-scale experimental results on two language learning tasks suggest that a classification quality comparable with the original kernel-based methods can be obtained, even when a reduction of the computational cost of up to 99 % is observed. We showed a successful application of these methods to a FrameNet dataset of about 1.4 million training instances. To the best of our knowledge, this is the first application of this class of methods to language learning, and it opens several lines of research. Other language learning problems where kernel-based learning has previously been applied can be investigated at a larger scale, such as Relation Extraction [7]. Moreover, other learning tasks can be investigated, such as linear regression, clustering or re-ranking. Further and more efficient learning algorithms, such as [22,23], or more complex learning schemes, such as the stratified approach proposed in [13], can also be investigated.
References
1. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: Pro-
ceedings of COLING-ACL. Montreal, Canada (1998)
2. Cancedda, N., Gaussier, É., Goutte, C., Renders, J.M.: Word-sequence kernels. J.
Mach. Learn. Res. 3, 1059–1082 (2003)
3. Chang, C.C., Lin, C.J.: Libsvm: a library for support vector machines. ACM Trans.
Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)
4. Collins, M., Duffy, N.: Convolution kernels for natural language. In: Proceedings
of Neural Information Processing Systems (NIPS 2001), pp. 625–632 (2001)
5. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-
aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
6. Croce, D., Moschitti, A., Basili, R.: Structured lexical similarity via convolution
kernels on dependency trees. In: Proceedings of EMNLP (2011)
7. Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Pro-
ceedings of ACL 2004. Stroudsburg, PA, USA (2004)
8. Dredze, M., Crammer, K., Pereira, F.: Confidence-weighted linear classification.
In: Proceedings of ICML 2008. ACM, New York (2008)
9. Drineas, P., Mahoney, M.W.: On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6, 2153–2175 (2005)
10. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: a library
for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
11. Filice, S., Castellucci, G., Croce, D., Basili, R.: Effective kernelized online learning
in language processing tasks. In: de Rijke, M., Kenter, T., de Vries, A.P., Zhai,
C.X., de Jong, F., Radinsky, K., Hofmann, K. (eds.) ECIR 2014. LNCS, vol. 8416,
pp. 347–358. Springer, Heidelberg (2014)
12. Filice, S., Castellucci, G., Croce, D., Basili, R.: Kelp: a kernel-based learning plat-
form for natural language processing. In: Proceedings of ACL: System Demonstra-
tions. Beijing, China, July 2015
13. Filice, S., Croce, D., Basili, R.: A stratified strategy for efficient kernel-based learn-
ing. In: AAAI Conference on Artificial Intelligence (2015)
14. Filice, S., Croce, D., Basili, R., Zanzotto, F.M.: Linear online learning over struc-
tured data with distributed tree kernels. In: Proceedings of ICMLA 2013 (2013)
15. Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S.: A dual coor-
dinate descent method for large-scale linear svm. In: Proceedings of the ICML
2008, pp. 408–415. ACM, New York (2008)
16. Joachims, T., Finley, T., Yu, C.N.: Cutting-plane training of structural SVMs.
Mach. Learn. 77(1), 27–59 (2009)
17. Kumar, S., Mohri, M., Talwalkar, A.: Ensemble Nyström method. In: NIPS, pp. 1060–1068. Curran Associates, Inc. (2009)
18. Kumar, S., Mohri, M., Talwalkar, A.: Sampling methods for the Nyström method.
J. Mach. Learn. Res. 13, 981–1006 (2012)
19. Li, X., Roth, D.: Learning question classifiers: the role of semantic information.
Nat. Lang. Eng. 12(3), 229–249 (2006)
20. Moschitti, A., Pighin, D., Basili, R.: Tree kernels for semantic role labeling. Com-
put. Linguist. 34, 193–224 (2008)
21. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res.
5, 101–141 (2004)
22. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: primal estimated sub-gradient
solver for SVM. In: Proceedings of ICML. ACM, New York (2007)
23. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for reg-
ularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)
24. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press, New York (2004)
25. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience, New York (1998)
26. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps.
IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 480–492 (2012)
27. Wang, J., Zhao, P., Hoi, S.C.: Exact soft confidence-weighted learning. In: Pro-
ceedings of the ICML 2012. ACM, New York (2012)
28. Wang, Z., Vucetic, S.: Online passive-aggressive algorithms on a budget. J. Mach.
Learn. Res. Proc. Track 9, 908–915 (2010)
29. Williams, C.K.I., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Proceedings of NIPS 2000 (2001)
30. Zanzotto, F.M., Dell’Arciprete, L.: Distributed tree kernels. In: Proceedings of
ICML 2012 (2012)
31. Zhang, D., Lee, W.S.: Question classification using support vector machines. In:
Proceedings of SIGIR 2003, pp. 26–32. ACM, New York (2003)
Question Answering
Beyond Factoid QA: Effective Methods for
Non-factoid Answer Sentence Retrieval
1 Introduction
be difficult due to both the vague definition of a passage and evaluation meth-
ods [10]. A more natural and direct approach is to focus on retrieving sentences
that are part of answers. Sentences are basic expression units in most if not all
natural languages, and are easier to define and evaluate compared with passages.
Therefore, in this paper, we introduce answer sentence retrieval for non-factoid
Web queries as a practical WebQA task. To facilitate research on this task, we
have created a benchmark data set referred to as WebAP using a Web collection (TREC GOV2). To investigate the problem, we pose the following research questions:
RQ1. Could we directly apply existing methods like factoid QA methods and
sentence selection methods to solve this task?
RQ2. How could we design more effective methods for answer sentence retrieval
for non-factoid Web queries?
Previous results show that retrieving sentences that are part of answers is
a more challenging task, requiring more powerful features than traditional rele-
vance retrieval. By analyzing the task, we make two key observations from the
WebAP data:
1. Due to the shorter length of sentences compared with documents, the prob-
lem of vocabulary-mismatch may be even more severe in answer sentence
retrieval for non-factoid Web queries. Thus in addition to text matching fea-
tures, we need more features to capture the semantic relations between query
and answer sentences.
2. Non-factoid questions usually require multiple sentences as answers, and these answer sentences are not scattered across documents but often form small clusters. Thus the context of a sentence may be an important clue for identifying answer sentences.
Based on these observations, we design and extract two new types of features
for answer sentence retrieval, namely semantic features and context features. We
adopt learning to rank (L2R) models for sentence ranking, which have been successfully applied to document retrieval, collaborative filtering and many other tasks.
2 Related Work
Our work is related to several research areas, including answer passage retrieval,
answer retrieval with translation models, answer ranking in community question
answering (CQA) sites and answer retrieval for factoid questions.
Answer Passage Retrieval. Keikha et al. [9,10] developed an annotated data
set for non-factoid answer finding using TREC GOV2 collections and topics.
They annotated passage-level answers, revisited several passage retrieval models
with this data, and came to the conclusion that the current methods are not
effective for this task. Our research work departs from Keikha et al. [9,10] by
developing methods for answer sentence retrieval.
Answer Retrieval with Translation Models. Some previous research on answer retrieval has been based on statistical translation models to find semantically similar answers [1,15,20]. Xue et al. [20] proposed a retrieval model that
combines a translation-based language model for the question part with a query
likelihood approach for the answer part. Riezler et al. [15] presented an approach
to query expansion in answer retrieval that uses machine translation techniques
to bridge the lexical gap between questions and answers. Berger et al. [1] studied
multiple statistical methods such as query expansion, statistical translation, and
latent variable models for answer finding.
Answer Ranking in CQA. Surdeanu et al. [16] investigated a wide range of feature types, including similarity features, translation features, density/frequency features, and Web correlation features, for ranking answers to non-factoid questions in Yahoo! Answers. Jansen et al. [8] presented an answer re-ranking model for non-factoid questions that integrates lexical semantics with discourse information driven by two representations of discourse. Answer ranking in CQA
sites is a somewhat different task than answer retrieval for non-factoid questions:
answer sentences could come from multiple documents for general non-factoid
question answering, and the candidate ranked answer set is much smaller for a
typical question in CQA sites.
Answer Retrieval for Factoid Questions. There has also been substan-
tial research on answer sentence selection with data from TREC QA track
[17,18,21,22]. Yu et al. [22] proposed an approach to solve this task by means of distributed representations, learning to match questions with answers by considering their semantic encoding. Although the target answers are also at the sentence level, this research is focused on factoid questions. Our task is different in
that we investigate answer sentence retrieval for non-factoid questions. We also
compare our proposed methods with a state-of-the-art factoid QA method to show
the advantages of developing techniques specifically for non-factoid answer data.
4 Baseline Experiments
Our first task on the WebAP data set is to seek a solution to RQ1, in which we ask whether a new set of techniques is needed. We address this question using a
baseline experiment, in which we use some techniques that should be reasonable
for retrieving non-factoid answers, and compare these techniques with a factoid
question answering method.
We set up this experiment by including the following three classes of tech-
niques:
1. Retrieval Functions. In this experiment, we considered the query likelihood language model with Dirichlet smoothing (LM).
2. Factoid Question Answering Method. In this experiment, we use a more recent approach based on a convolutional neural network (CNN) [22], whose performance is currently on par with the best results on TREC QA Track data. This is a supervised method and needs to be trained with pre-defined word embeddings. Two variants are tested here: CNN with word count features and CNN without word count features.

Table 1. Comparison of sample questions and answers in TREC QA Track data and WebAP data.
3. Summary Sentence Selection Method. In this experiment, we test an L2R approach proposed by Metzler and Kanungo [13], which uses 6 simple features, referred to as MK features and described in Sect. 5, to address the lexical matching between the query and sentences. As suggested in the original paper, we use
the MART ranking algorithm to combine these features.
The results are given in Table 2. LM and MK perform better than the CNN-based methods. LM achieves the best results under all three metrics. The CNN-based methods perform poorly on this task. Using the word count features of [22], CNN obtains slightly better results. However, LM alone achieves a 170.73 % gain in MRR over CNN with word count features, and the difference is statistically significant as measured by the Student’s paired t-test (p < 0.01). MK achieves better performance than the CNN-based methods, but it performs worse than LM. This result shows that automatically learned word features (as in CNN) and simple combined text matching features (as in MK) may not be sufficient for our task, suggesting that a new set of techniques is needed for non-factoid answer sentence retrieval.

Table 2. Baseline experiment results for non-factoid answer sentence retrieval of different methods. The best performance is highlighted in boldface. ‡ means significant difference over CNN with word count features with p < 0.01 measured by the Student’s paired t-test.
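For completeness, the query likelihood language model with Dirichlet smoothing used as the LM baseline can be sketched as follows; passing the collection statistics as a plain dictionary and the default value of mu are illustrative choices rather than the paper's exact configuration.

```python
import math
from collections import Counter

def ql_dirichlet_score(query_tokens, sentence_tokens, collection_prob, mu=2500):
    """Dirichlet-smoothed query likelihood score of a sentence (a sketch).

    collection_prob : dict token -> P(w | collection)
    """
    tf_sentence = Counter(sentence_tokens)
    s_len = len(sentence_tokens)
    score = 0.0
    for w, tf_q in Counter(query_tokens).items():
        p_w = (tf_sentence.get(w, 0) + mu * collection_prob.get(w, 1e-9)) / (s_len + mu)
        score += tf_q * math.log(p_w)
    return score
```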
where $tf_{w,Q}$ is the number of times that w occurs in the query, $tf_{w,S}$ is the number of times that w occurs in the sentence, and |S| is the length of the sentence,
2 http://wordnet.princeton.edu/.
Context features are features specific to the context of the candidate sentence.
We define the context of a sentence as the adjacent sentences immediately before and after it. The intuition is that answer sentences are very likely to be surrounded
by other answer sentences. Context features could be generated based on any
sentence features. They include features in the following two types:
With the computed features, we carry out sentence re-ranking with L2R methods
using the following models:
6 Experiments
6.1 Experimental Settings
Table 3 shows the evaluation results for non-factoid answer sentence retrieval
using different feature sets and learning models. We summarize our observations
as follows: (1) For the feature set comparison, the results are quite consistent across the three learning models: the combination of MK, semantic features, and context features achieves the best results under all three learning model settings. For MRR, this combination achieves a 5.4 % gain using CA, a 68.55 % gain using MART, and a 10.05 % gain using LambdaMART over the MK feature set.
Similar gains are also observed under other metrics. In terms of relative feature
importance, context features achieve larger gain compared to semantic features
although adding both of them can improve the performance of the basic MK
feature set. (2) For the learning model comparison, MART achieves the best
performance with MK + Semantics + Context(All) features, with statistically significant differences with respect to both LM and MK (MART), which shows
the effectiveness of MART for combining different features to learn a ranking
model for non-factoid answer sentence retrieval.
4 http://rmit-ir.github.io/SummaryRank.
5 http://www.lemurproject.org/ranklib.php.
6 https://code.google.com/p/jforests/.
7 Due to space limits, we only report the hyperparameters used for the MK + Semantics + Context(All) feature set. The hyperparameters for the other feature sets can be obtained by standard 5-fold cross validation. For the parameter values in MART, we set the number of trees to 100, the number of leaves of each tree to 20, the learning rate to 0.05, the number of threshold candidates for tree splitting to 256, the min leaf support to 1, and the early stop parameter to 100. For the parameter values in CA, we set the number of random restarts to 3, the number of iterations to search in each dimension to 25, and the performance tolerance between two solutions to 0.001. For the parameter values in LambdaMART, we set the number of trees to 1000, the number of leaves of each tree to 15, the learning rate to 0.1, the minimum instance percentage per leaf to 0.25, and the feature sampling ratio to 0.5. We empirically set the parameter µ = 10 in the computation of the LanguageModelScore feature.
Table 3. Evaluation results for non-factoid answer sentence retrieval of different feature
sets and learning models. “Context(All)” denotes context features for both MK and
semantics features. The best performance is highlighted in boldface. † and ‡ mean significant difference over LM with p < 0.05 and p < 0.01, respectively, measured by the
Student’s paired t-test. ∗∗ means significant difference over MK with the same learning
model with p < 0.01 measured by the Student’s paired t-test.
We further show some examples and analysis of top-1 ranked sentences by dif-
ferent methods in Table 4. In general, the top-1 ranked sentences by MK +
Semantics + Context(All) are better than the other two methods. Although
there are lexical matches for the top-1 sentences retrieved by all methods, sentences with lexical matches may not be answer sentences of high quality. For example, for query 808, MK features can be confused by “Korean”, “North Korean” and “South Korean”, which appear frequently in sentences and share a common term. Semantic features play an important role here in alleviating this problem. Semantic features such as EntityLinking can differentiate “Korean” from “North Korean” by linking them to different Wikipedia pages.
Context features have the potential to guide the ranker in the right direction, since correct non-factoid answer sentences are usually surrounded by other similar non-factoid answer or relevant sentences. Query 816 in Table 4 gives one example of their effect. Without context features, LambdaMART with MK + Semantics features retrieved a non-relevant sentence as the top-1 result. The retrieved sentence may seem relevant in itself (as it indeed mentions USAID’s efforts to support biodiversity), but if we pull out its context:
The sentences around the SentenceRetrieved are not relevant to the query and are labeled as “None”. From its context, we can see that the SentenceRetrieved is actually not talking about the Galapagos Islands. In comparison, the context of the top-1 sentence retrieved by LambdaMART with MK + Semantics + Context(All) features is:
In this paper, we formally introduced the answer sentence retrieval task for non-
factoid Web queries and investigated a framework based on learning to rank
methods. We compared learning to rank methods with baseline methods including language models and a CNN-based method. We found that both semantic and context features are useful for non-factoid answer sentence retrieval. In par-
ticular, the results show that MART with appropriate features outperforms all
the baseline methods significantly under multiple metrics and provides a good
basis for non-factoid answer sentence retrieval.
For future work, we would like to investigate more features such as syntactic
features and readability features to further improve non-factoid answer sentence
retrieval. Learning an effective representation of answer sentences for information
retrieval is also an interesting direction to explore.
Acknowledgments. This work was partially supported by the Center for Intelligent
Information Retrieval, by NSF grant #IIS-1160894, by NSF grant #IIS-1419693, by
ARC Discovery grant DP140102655 and by ARC Project LP130100563. Any opinions,
findings and conclusions expressed in this material are those of the authors and do
not necessarily reflect those of the sponsor. We thank Mark Sanderson for the valuable
comments on this work.
References
1. Berger, A., Caruana, R., Cohn, D., Freitag, D., Mittal, V.: Bridging the lexical
chasm: statistical approaches to answer-finding. In: Proceedings of the SIGIR 2000
(2000)
2. Chen, R.-C., Spina, D., Croft, W.B., Sanderson, M., Scholer, F.: Harnessing seman-
tics for answer sentence retrieval. In: Proceedings of ESAIR 2015 (2015)
3. Croft, W.B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in
Practice, 1st edn. Addison-Wesley Publishing Company, Lebanon (2009)
4. Ferragina, P., Scaiella, U.: Fast and accurate annotation of short texts with
wikipedia pages. IEEE Softw. 29(1), 70–75 (2012)
5. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann.
Stat. 29, 1189–1232 (2000)
6. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-
based explicit semantic analysis. In: Proceedings of IJCAI 2007 (2007)
7. Huston, S., Croft, W.B.: A comparison of retrieval models using term dependencies.
In: Proceedings of CIKM 2014 (2014)
8. Jansen, P., Surdeanu, M., Clark, P.: Discourse complements lexical semantics for
non-factoid answer reranking. In: Proceedings of ACL 2014, pp. 977–986 (2014)
9. Keikha, M., Park, J.H., Croft, W.B.: Evaluating answer passages using summa-
rization measures. In: Proceedings of SIGIR 2014 (2014)
10. Keikha, M., Park, J.H., Croft, W.B., Sanderson, M.: Retrieving passages and find-
ing answers. In: Proceedings of ADCS 2014, pp. 81–84 (2014)
11. Metzler, D., Croft, W.B.: A markov random field model for term dependencies. In:
Proceedings of SIGIR 2005 (2005)
12. Metzler, D., Croft, W.B.: Linear feature-based models for information retrieval.
Inf. Retr. 10(3), 257–274 (2007)
13. Metzler, D., Kanungo, T.: Machine learned sentence selection strategies for query-
biased summarization. In: Proceedings of SIGIR Learning to Rank Workshop
(2008)
14. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. CoRR arxiv.org/abs/1301.3781 (2013)
15. Riezler, S., Vasserman, A., Tsochantaridis, I., Mittal, V., Liu, Y.: Statistical
machine translation for query expansion in answer retrieval. In: Proceedings of
ACL 2007 (2007)
16. Surdeanu, M., Ciaramita, M., Zaragoza, H.: Learning to rank answers on large
online QA collections. In: Proceedings of ACL 2008, pp. 719–727 (2008)
17. Yih, W.-T., Chang, M.-W., Meek, C., Pastusiak, A.: Question answering using
enhanced lexical semantic models. In: Proceedings of ACL 2013 (2013)
18. Wang, M., Smith, N.A., Mitamura, T.: What is the jeopardy model? A quasi-
synchronous grammar for QA. In: EMNLP-CoNLL 2007, pp. 22–32 (2007)
19. Wu, Q., Burges, C.J., Svore, K.M., Gao, J.: Adapting boosting for information
retrieval measures. Inf. Retr. 13(3), 254–270 (2010)
20. Xue, X., Jeon, J., Croft, W.B.: Retrieval models for question and answer archives.
In: Proceedings of SIGIR 2008, pp. 475–482 (2008)
21. Yao, X., Durme, B.V., Callison-burch, C., Clark, P.: Answer extraction as sequence
tagging with tree edit distance. In: Proceedings of NAACL 2013 (2013)
22. Yu, L., Hermann, K.M., Blunsom, P., Pulman, S.: Deep Learning for Answer Sen-
tence Selection. arXiv: 1412.1632, December 2014
Supporting Human Answers for Advice-Seeking
Questions in CQA Sites
1 Introduction
Most questions posted on Community-based Question Answering (CQA) web-
sites, such as Yahoo Answers, Answers.com and StackExchange, do not target
simple facts such as “what is Brad Pitt’s height? ” or “how far is the moon from
earth? ”. Instead, askers expect some human touch in the answers to their ques-
tions. In particular, many questions look for recommendations, suggestions and opinions, e.g. “what are some good horror movies for Halloween? ”, “should you
wear a jockstrap under swimsuit? ” or “how can I start to learn web develop-
ment? ”. According to our analysis, based on editorial judgments of 12,000 Yahoo
Answers questions, 70 % of all questions are advice- or opinion-seeking questions.
Examining answers for such advice-seeking questions, we found that quite
often answerers do not provide supportive evidence for their recommendation,
and that answers usually represent diverse perspectives of the different answerers
for the question at hand. For example, answerers may recommend different hor-
ror movies. Still, the asker would like to choose only one or two movies to watch,
and without additional supportive evidence her decision may be non-trivial.
In this paper we assume that askers would be happy to receive additional
information that will help them in choosing the best fit for their need from the
various suggestions or opinions provided in the CQA answers. More formally, we
propose the novel task of retrieving sentences from the Web that provide support
to a given recommendation or opinion that is part of an answer in a CQA site.
We refer to the part of the answer (e.g., a sentence) that contains a recom-
mendation as a subjective claim about the need expressed in the question (e.g., a
call for advice). For a sentence to be considered as supporting the claim, it should
be relevant to the content of the claim and provide some supporting informa-
tion; e.g., examples, statistics, or testimony [1]. More specifically, a supporting
sentence is one whose acceptance is likely to raise the confidence in the claim.
While supporting sentences may be part of the same answer containing the
claim, or found in other answers given for the same question, in this paper we are
interested in retrieving sentences from other sources which may provide differ-
ent perspectives on the claim compared to content on CQA sites. For example,
for the question “what are some good horror movies? ”, a typical CQA answer
could be “The Shining is a great movie; I love watching it every year ”. On
the other hand, a supporting sentence from external sites may contain infor-
mation such as “...in 2006, the Shining made it into Ebert’s series of “Great
Movie” reviews...”. Specifically, we focus on retrieving supporting sentences from
Wikipedia, although our methods can be largely applied to other Web sites.
We present a general scheme of Learning to Rank for Support, in which the
retrieval algorithm is directly optimized for ranking sentences by presumed sup-
port. Our feature set includes both relevance-oriented features, such as textual
similarity, and support-oriented features, such as sentiment matching and simi-
larity with language-model-based support priors.
We experimented with a new dataset containing 40 subjective claims from
the Movies category of Yahoo Answers. For each claim, sentences retrieved from
Wikipedia using relevance estimates were manually evaluated for relevance and
support. The evaluated benchmark was then used to train and test our model.
The results demonstrate the merits of integrating relevance-based and support-
based features for the support ranking task. Furthermore, our model substan-
tially outperforms a state-of-the-art Textual Entailment system used for support
ranking. This result emphasizes the difference between prior work on supporting
objective claims and our task of supporting subjective recommendations.
Our first step is to rank sentences by their presumed relevance to claim c. Since
these sentences are part of documents in a corpus D, we follow common practice
in work on sentence retrieval [2] and first apply document retrieval with respect
to c. Then, the sentences in the top ranked documents are ranked for relevance.
We assume that each document d ∈ D is composed of a title, dt , and a body,
db . This is the case for Wikipedia, which is used in our experiment, as well as for
most Web pages. The initial document retrieval, henceforth InitDoc, is based on
the document score SSDM (c; db ). This score is assigned to the body of document
d with respect to the claim c by the state-of-the-art sequential dependence model
(SDM) from the Markov Random Field framework [3]. For texts x and y,
def
SSDM (x; y) = λT ST (x; y) + λO SO (x; y) + λU SU (x; y); (1)
ST (x; y), SO (x; y) and SU (x; y) are the (smoothed) log likelihood values of the
appearances of unigrams, ordered bigrams and unordered bigrams, respectively,
of tokens from x in y; λT , λU , and λO are free parameters whose values sum
to 1. We further bias the initial document ranking in favor of documents whose
titles contain ce — the entity the claim is about. Specifically, d is ranked by:
S_{InitDoc}(c; d) \overset{\mathrm{def}}{=} \alpha S(c_e; d_t) + (1 - \alpha) S_{SDM}(c; d_b);    (2)
S(ce ; dt ) is the log of the Dirichlet smoothed maximum likelihood estimate, with
respect to d’s title, of the n-gram which constitutes the entity ce [4]; smoothing
is based on n-gram counts in the corpus1 ; α is a free parameter.
To estimate the relevance of sentence s to the claim c, we can measure their
similarity using, again, the SDM model. We follow common practice in work on
passage retrieval [2], and interpolate, using a parameter β, the claim-sentence
similarity score with the retrieval score of document d which s is part of:
S_{InitSent}(c; s) \overset{\mathrm{def}}{=} \beta S_{SDM}(c; s) + (1 - \beta) S_{InitDoc}(c; d).    (3)
1 All SDM scoring function components in Eq. 1 also use the logs of Dirichlet smoothed estimates [3]. The smoothing parameter, µ, is set to the same value for all estimates.
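To make the retrieval scheme concrete, the following minimal Python sketch combines the scores as in Eqs. (1)-(3). It assumes the component scores (the SDM unigram/bigram log-likelihood estimates, the entity-title score, and the document score) are computed elsewhere; the function names and the default values of the free parameters are illustrative, not the tuned values used in the experiments.

```python
# A minimal sketch of the score interpolation in Eqs. (1)-(3); the default
# values of the free parameters (lambdas, alpha, beta) are placeholders.

def s_sdm(s_t: float, s_o: float, s_u: float,
          lam_t: float = 0.8, lam_o: float = 0.1, lam_u: float = 0.1) -> float:
    """Eq. (1): weighted combination of the unigram, ordered-bigram and
    unordered-bigram (smoothed) log-likelihood scores; the weights sum to 1."""
    return lam_t * s_t + lam_o * s_o + lam_u * s_u


def s_init_doc(title_entity_score: float, body_sdm_score: float,
               alpha: float = 0.3) -> float:
    """Eq. (2): bias the document ranking toward documents whose title
    matches the claim entity c_e."""
    return alpha * title_entity_score + (1 - alpha) * body_sdm_score


def s_init_sent(claim_sentence_sdm: float, doc_score: float,
                beta: float = 0.5) -> float:
    """Eq. (3): interpolate the claim-sentence SDM score with the retrieval
    score of the document containing the sentence."""
    return beta * claim_sentence_sdm + (1 - beta) * doc_score
```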
Semantic Similarities. Both the claim c and the candidate support sentence s
can be short. Thus, to address potential vocabulary mismatch issues in textual
similarity estimation, we also use semantic-based similarity measures that uti-
lize word embedding [6]. Specifically, we use the word vectors, of dimension 300,
trained over a Google news dataset with Word2Vec3 . Let w denote the embed-
ding vector representing term w. We measure the extent to which the terms
in s “cover” the terms in c by MaxSemSim: \sum_{w \in c} \max_{w' \in s} \cos(w, w'). Additionally,
we measure the similarity between the centroids of the claim and the sentence
(cf. [7]), CentSemSim: \cos\big(\frac{1}{|c|} \sum_{w \in c} w, \frac{1}{|s|} \sum_{w' \in s} w'\big); |c| and |s| are
the number of terms in the claim and sentence, respectively.
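A possible implementation of these two embedding-based features is sketched below; it assumes a dictionary `emb` mapping each known term to its (for instance, 300-dimensional) vector, and simply skips out-of-vocabulary terms. The names are illustrative.

```python
# Sketch of MaxSemSim and CentSemSim over word embeddings; `emb` is an
# assumed term -> vector dictionary (e.g., Word2Vec vectors).
import numpy as np


def _cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def max_sem_sim(claim_terms, sentence_terms, emb) -> float:
    """Sum, over claim terms, of the maximal cosine similarity with any sentence term."""
    total = 0.0
    for w in claim_terms:
        if w not in emb:
            continue
        sims = [_cos(emb[w], emb[w2]) for w2 in sentence_terms if w2 in emb]
        if sims:
            total += max(sims)
    return total


def cent_sem_sim(claim_terms, sentence_terms, emb) -> float:
    """Cosine similarity between the claim centroid and the sentence centroid."""
    c_vecs = [emb[w] for w in claim_terms if w in emb]
    s_vecs = [emb[w] for w in sentence_terms if w in emb]
    if not c_vecs or not s_vecs:
        return 0.0
    return _cos(np.mean(c_vecs, axis=0), np.mean(s_vecs, axis=0))
```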
Sentence Style. The StopWords feature is the fraction of terms in the sentence
that are stop words. High occurrence of stop words potentially attests to rich
use of language [10], and consequently, to sentence quality. Stop words are deter-
mined using the Stanford parser7 . We also use the sentence length, SentLength,
as a prior signal for sentence quality.
3 Empirical Evaluation
3.1 Dataset
There is no publicly available dataset for evaluating sentence ranking for support
of subjective claims that originate from advice-seeking questions and correspond-
ing answers. Hence, we created a novel dataset8 as follows. Fifty subjective claims
4 http://nlp.stanford.edu/software/corenlp.shtml.
5 IMDB snapshot from 08/01/2014.
6 The order of concatenation has no effect since unigram language models are used.
7 http://nlp.stanford.edu/software/lex-parser.shtml.
8 Available at http://iew3.technion.ac.il/∼kurland/supportRanking.
about movies, which serve as the entities ce , were collected from Yahoo Answers9
by scanning its movies category. We looked for advice-seeking questions, which
are common in the movies category, and selected answers that contain at least
one movie title. Each pair of a question and a movie title appearing in an answer
for the question was transformed to a claim by manually reformulating the ques-
tion into an affirmative form and inserting the entity (movie title) as the subject.
For example, the question “any good science fiction movies?” and the movie title
“Tron” were transformed into the claim “Tron is a good science fiction movie”.
The corpus used for sentence retrieval is a dump of the movies category of
Wikipedia from March 2015, which contains 111,164 Wikipedia pages. For each
claim, 100 sentences were retrieved using the initial sentence retrieval approach,
InitSent (Sect. 2.1). Each of these 100 sentences was categorized by five anno-
tators from CrowdFlower10 into: (1) not relevant to the claim, (2) strong non-
support, (3) medium non-support, (4) neutral, (5) medium support, (6) strong
support. The final label was determined by a majority vote.
We used the following induced scales: (a) binary relevance: not relevant
(category 1) vs. relevant (categories 2–6); (b) binary support: non-support
(categories 1–4) vs. support (categories 5–6); (c) graded support: non-support
(categories 1–4), weak support (category 5) and strong support (category 6).
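For illustration, the three induced scales can be expressed as simple mappings from the six annotation categories; the function names are ours.

```python
# Illustrative mappings from the six annotation categories (1-6, as listed
# above) to the induced evaluation scales.

def binary_relevance(category: int) -> int:
    return 0 if category == 1 else 1          # category 1: not relevant

def binary_support(category: int) -> int:
    return 1 if category in (5, 6) else 0     # categories 5-6: support

def graded_support(category: int) -> int:
    # 0 = non-support (1-4), 1 = weak support (5), 2 = strong support (6)
    return {5: 1, 6: 2}.get(category, 0)
```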
9 answers.yahoo.com.
10 www.crowdflower.com.
The Fleiss’ Kappa inter-annotator agreement rates are: 0.68 (substantial) for
binary relevance, 0.592 (moderate) for binary support and 0.457 (moderate) for
graded support. Table 1 provides examples of claims and relevant (on a binary
scale) sentences that either support the claim or not (i.e., binary support scale
is used).
Ten out of the fifty claims had no support sentences and were not used
for evaluation. For the forty remaining claims, on average, half of the support
sentences were weak support and the other half were strong support. On aver-
age, 23.5 % of the relevant sentences are supportive (binary scale). Per claim, the
median, average and standard deviation of the number of relevant sentences are
29, 40.5 and 29.3, respectively; for support sentences (binary scale) they are 5.5,
7.4 and 6.7.
3.2 Methods
Table 2. Relevance and support ranking performance. Performance numbers for
support ranking are based on the graded support scale, and those for p@5 are based
on the binary support scale. All performance numbers for relevance ranking are
based on the binary relevance scale. LambdaMART was trained for NDCG@10 as
this yielded, in general, better support ranking performance across the evaluation
measures than using NDCG@1 or NDCG@3. Statistically significant differences of
performance are determined using the two-tailed paired t-test with p = 0.05.

             Relevance                               Support
             NDCG@1  NDCG@3  NDCG@10  p@5            NDCG@1  NDCG@3  NDCG@10  p@5
InitSent     .775    .766    .739     .730           .083    .165    .295     .215
LinearSVM    .800    .786    .782     .770           .441i   .478i   .519i    .410i
PolySVM      .800    .839    .852i,l  .835i          .525i   .527i   .564i,l  .445i
LambdaMART   .825    .844    .808     .835i          .608i   .540i   .593i    .515i,l
3.3 Results
Table 2 presents our main results. We see that all three LTR methods outperform
the initial sentence ranking, InitSent, in terms of relevance ranking. Although few
of these improvements are statistically significant, they attest to the potential
merits of using the additional relevance-based features described in Sect. 2.2.
More importantly, all LTR methods substantially, and statistically significantly,
outperform the initial (relevance-based) sentence ranking in terms of support.
This result emphasizes the difference between relevance and support and shows
that our proposed features for support ranking are quite effective, especially
when used in a non-linear ranker such as LambdaMART.
In Sects. 1 and 4 we discuss the difference between subjective and factoid
claims. To further explore this difference, we compare our best performing Lamb-
daMART method with the P1EDA13 state-of-the-art textual entailment algo-
rithm [14] when both are used for the support-ranking task we address here.
P1EDA was designed for factual claims14 . Specifically, given a claim and a can-
didate sentence, P1EDA produces a classification decision of whether the sen-
tence entails the claim, accompanied with a confidence level. The confidence
level was used for support (and relevance) ranking. We also tested the inclusion
of P1EDA’s output (confidence level) as an additional feature in LambdaMART,
yielding the LMart+P1EDA method. Table 3 depicts the performance numbers.
We can see in Table 3 that P1EDA is (substantially) outperformed by both
InitSent and LambdaMART, for both relevance and support ranking. Since the
13 http://hltfbk.github.io/Excitement-Open-Platform/.
14 We trained P1EDA using the SNLI data set [15], which contains 549,366 examples.
Table 3. Relevance and support ranking performance of P1EDA compared with
InitSent, LambdaMART (LMart) and their combination (LMart+P1EDA).

              Relevance                                 Support
              NDCG@1   NDCG@3   NDCG@10  p@5            NDCG@1   NDCG@3   NDCG@10  p@5
InitSent      .775     .766     .739     .730           .083     .165     .295     .215
P1EDA         .525i    .496i    .462i    .475i          .066     .093     .129i    .120i
LMart         .825i,p  .844i,p  .808i,p  .835i,p        .608i,p  .540i,p  .593i,p  .515i,p
LMart+P1EDA   .850p    .836p    .811p    .815p,m        .600i,p  .571i,p  .609i,p  .490i,p
claims in our setting are simple, this finding implies that approaches for identi-
fying texts that support (or “prove”) a factoid claim may not be effective for the
task of supporting subjective claims. The integration of P1EDA as a feature in
LambdaMART improves performance (although not to a statistically significant
degree) for some of the evaluation measures, including NDCG@10 for which the
ranker was trained, and hurts performance for others — statistically significantly
so in only a single case15 .
Integrating P1EDA with only our semantic-similarity features using Lamb-
daMART, which is a conceptually similar approach to a classification method
employed in some work on argument mining [16], resulted in support-ranking
performance that is substantially worse than that of using all our proposed fea-
tures in LambdaMART. Actual numbers are omitted due to space considerations
and as they convey no additional insight.
Table 4. Using features alone (specifically, the 10 that yield the highest NDCG@10
support ranking) to rank the initial sentence list vs. integrating all features in Lamb-
daMART. Boldface: the best result in a column; ‘m’: statistically significant difference
with LambdaMART.
SentimentSim, the sentiment similarity between the claim and the sentence,
is among the most effective features when used alone for support ranking. Addi-
tional ablation tests17 reveal that removing SentimentSim from the set of all
features results in the most severe performance degradation for all three learning-
to-rank methods. Indeed, sentiment is an important aspect of subjective claims,
and therefore, of inferring support for these claims.
We also found that ranking sentences by decreasing entropy of sentiment
(SentimentEnt) is superior to ranking by increasing entropy for NDCG@1 and
NDCG@3, while for NDCG@10 the reverse holds. The former finding is conceptually
reminiscent of findings about using the entropy of a document's term distribution
as a document prior for Web search [10]: the higher the entropy, the “broader” the
textual unit is (in our case, in terms of expressed sentiment), which presumably
implies a higher prior.
Finally, Table 4 also shows that Prior-5 is the most effective claim-independent
feature18 . It is the similarity between a language model of the sentence and that
induced from Wikipedia movie pages whose movies received high-grade (5-star) reviews
in IMDB. This shows that although Wikipedia authors aim to be objective in
their writing, the style and information for highly rated movies are still quite dif-
ferent from those for lower-rated ones, and this difference can potentially be modeled via the
automatic knowledge transfer and labeling method proposed in Sect. 2.2.
17 Actual numbers are omitted due to space considerations and as they convey no additional insight.
18 Ablation tests reveal that removing this feature results in the second most substantial decrease of support-ranking performance among all features.
4 Related Work
A few lines of research are related to our work. The Textual Entailment task
is inferring the truthfulness of a textual statement (hypothesis) from a given
text [17]. A more specific incarnation of Textual Inference is automatic Ques-
tion Answering (QA). Work on these tasks focused on factoid claims for which
a clear correct/incorrect labeling should be inferred from supportive evidence.
Thus, typical textual inference approaches are designed to find the claim (e.g. a
candidate answer in QA) embedded in the supporting text, although it may be
rephrased. In contrast, in this paper, claims originate from CQA users who pro-
vide subjective recommendations rather than state facts. Our model, designed
for ranking sentences by support for a subjective claim, significantly outperforms
a state-of-the-art textual entailment method on this task, as shown in Sect. 3.3.
Blanco and Zaragoza [18] introduce methods for retrieving sentences that
explain the relationship between a Web query and a related named entity, as
part of the entity ranking task. In contrast, we rank sentences by support for a
subjective claim. Kim et al. [19] present methods for retrieving sentences that
explain reasons for sentiment expressed about an aspect of a topic. In contrast to
these sentence ranking methods [18,19], ours utilizes a learning-to-rank method
that integrates various relevance and support features not used in [18,19].
The task most related to ours is argument mining (e.g., [16,20–24]). Specif-
ically, arguments supporting or contradicting a claim about a given debatable
(often controversial) topic are sought. Some of the types of features we use for
support ranking have also been used for argument mining; namely, semantic
[16,24] and sentiment [24] similarities between the claim and a candidate argu-
ment. Yet, the actual estimates and techniques used here to induce these fea-
tures are different than those in work on argument mining [16,24]. Furthermore,
the knowledge-transfer-based features we utilize, and whose effectiveness was
demonstrated in Sect. 3.3, are novel to this study.
Interestingly, while textual entailment features were found to be effective for
argument mining [16,20], this is not the case for support ranking (see Sect. 3.3).
This finding could be attributed to the fundamentally different nature of claims
used in our work, and those used in argument mining. That is, our claims origi-
nate from answers to advice-seeking questions of subjective nature, rather than
being about a given debatable/controversial topic. Also, additional information
about the debatable topic which was utilized in work on argument mining [24]
is not available in our setting.
Often, work on argument mining [24], similarly to that on question answer-
ing (e.g., [25]), focuses on finding supporting or contradicting evidence in the
same document in which the claim appears. In contrast, we retrieve supporting
sentences from the Web for claims originating from CQA sites. In fact, there has
been very little work on using sentence retrieval for argument mining [22]. In con-
trast to our work, a Boolean retrieval method was used, different features were
utilized, and relevance-based estimates were not integrated with support-based
estimates using a learning-to-rank approach.
Acknowledgments. We thank the reviewers for their helpful comments, and Omer
Levy and Vered Shwartz for their help with the textual entailment tool used for exper-
iments. This work was supported in part by a Yahoo! faculty research and engagement
award.
References
1. Rieke, R., Sillars, M.: Argumentation and Critical Decision Making. Longman
Series in Rhetoric and Society, Longman (1997)
2. Murdock, V.: Exploring Sentence Retrieval. VDM Verlag, Saarbrücken (2008)
3. Metzler, D., Croft, W.B.: A markov random field model for term dependencies. In:
Proceedings of SIGIR, pp. 472–479 (2005)
4. Zhai, C., Lafferty, J.D.: A study of smoothing methods for language models applied
to ad hoc information retrieval. In: Proceedings of SIGIR, pp. 334–342 (2001)
5. Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retrieval
3(3), 225–331 (2009)
6. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed represen-
tations of words and phrases and their compositionality. In: Proceedings of NIPS,
pp. 3111–3119 (2013)
7. Vulic, I., Moens, M.: Monolingual and cross-lingual information retrieval models
based on (bilingual) word embeddings. In: Proceedings of SIGIR, pp. 363–372
(2015)
8. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C.:
Recursive deep models for semantic compositionality over a sentiment treebank.
In: Proceedings of EMNLP, pp. 363–372 (2013)
9. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf.
Retrieval 2(1–2), 1–135 (2007)
10. Bendersky, M., Croft, W.B., Diao, Y.: Quality-biased ranking of web documents.
In: Proceedings of WSDM, pp. 95–104 (2011)
11. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of KDD, pp.
217–226 (2006)
12. Burges, C.J.: From RankNet to LambdaRank to LambdaMART: an overview.
Technical report, Microsoft Research (2010)
13. Zhou, Y., Croft, B.: Query performance prediction in web search environments. In:
Proceedings of SIGIR, pp. 543–550 (2007)
14. Noh, T.G., Pado, S., Shwartz, V., Dagan, I., Nastase, V., Eichler, K., Kotlerman,
L., Adler, M.: Multi-level alignments as an extensible representation basis for tex-
tual entailment algorithms. In: Proceedings of SEM (2015)
15. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus
for learning natural language inference. In: Proceedings of EMNLP, pp. 632–642
(2015)
16. Boltužić, F., Šnajder, J.: Back up your stance: recognizing arguments in online dis-
cussions. In: Proceedings of the First Workshop on Argumentation Mining, Asso-
ciation for Computational Linguistics, pp. 49–58 (2014)
17. Dagan, I., Roth, D., Sammons, M., Zanzotto, F.M.: Recognizing textual entail-
ment: models and applications. Synth. Lect. Hum. Lang. Technol. 6(4), 1–220
(2013)
18. Blanco, R., Zaragoza, H.: Finding support sentences for entities. In: Proceedings
of SIGIR, pp. 339–346 (2010)
19. Kim, H.D., Castellanos, M., Hsu, M., Zhai, C., Dayal, U., Ghosh, R.: Ranking
explanatory sentences for opinion summarization. In: Proceedings of SIGIR, pp.
1069–1072 (2013)
20. Cabrio, E., Villata, S.: Combining textual entailment and argumentation theory
for supporting online debates interactions. In: Proceedings of ACL, pp. 208–212
(2012)
21. Green, N., Ashley, K., Litman, D., Reed, C., Walker, V. (eds.) Proceedings of
the First Workshop on Argumentation Mining, Baltimore, Maryland. Associa-
tion for Computational Linguistics (2014). http://www.aclweb.org/anthology/W/
W14/W14-21
22. Sato, M., Yanai, K., Yanase, T., Miyoshi, T., Iwayama, M., Sun, Q., Niwa, Y.: End-
to-end argument generation system in debating. In: Proceedings of ACL-IJCNLP
2015 System Demonstrations (2015)
23. Cardie, C. (ed.): Proceedings of the 2nd Workshop on Argumentation Mining, Den-
ver, CO. Association for Computational Linguistics (2015). http://www.aclweb.
org/anthology/W15-05
24. Rinott, R., Dankin, L., Alzate Perez, C., Khapra, M.M., Aharoni, E., Slonim, N.:
Show me your evidence - an automatic method for context dependent evidence
detection. In: Proceedings of EMNLP, pp. 440–450 (2015)
25. Brill, E., Lin, J.J., Banko, M., Dumais, S.T., Ng, A.Y., et al.: Data-intensive ques-
tion answering. In: Proceedings of TREC, vol. 56, p. 90 (2001)
Ranking
Does Selective Search Benefit from WAND
Optimization?
1 Introduction
Selective search is a technique for large-scale distributed search in which the
document corpus is partitioned into p topic-based shards during indexing. When
a query is received, a resource selection algorithm such as Taily [1] or Rank-S [13]
selects the most relevant k shards to search, where k ≪ p. Results lists from
those shards are merged to form a final answer listing to be returned to the user.
Selective search has substantially lower computational costs than partitioning
the corpus randomly and searching all index shards, which is the most common
approach to distributed search [11,12].
Dynamic pruning algorithms such as Weighted AND (WAND) [3] and term-
bounded max score (TBMS) [22] improve the computational efficiency of retrieval
systems by eliminating or early-terminating score calculations for documents
which cannot appear in the top-k of the final ranked list. But topic-based parti-
tioning and resource selection change the environment in which dynamic prun-
ing is performed, and query term posting lists are likely to be longer in shards
selected by the resource selection algorithm than in shards that are not selected.
As well, each topic-based shard should contain similar documents, meaning that
it might be difficult for dynamic pruning to distinguish amongst them using only
partial score calculations. Conversely, the documents in the shards that were not
selected for search might be the ones that a dynamic pruning algorithm would
have bypassed if it had encountered them. That is, while the behavior of dynamic
pruning algorithms on randomly-organized shards is well-understood, the inter-
action between dynamic pruning and selective search is not. As an extreme
position, it might be argued that selective search is simply achieving the same
computational savings that dynamic pruning would have produced, but incurs
the additional overhead of clustering the collection and creating the shards. To
address these concerns, we investigate the behavior of the well-known Weighted
AND (WAND) dynamic pruning algorithm in the context of selective search,
considering two research questions:
RQ1: Does dynamic pruning improve selective search, and if so, why?
RQ2: Can the efficiency of selective search be improved further using a cascaded
pruning threshold during shard search?
2 Related Work
Selective search is a cluster-based retrieval technique [6,19] that combines ideas
from conventional distributed search and federated search [12]. Modern cluster-
based systems use inverted indexes to store clusters that were defined using
criteria such as broad topics [4] or geography [5]. The shards’ vocabularies are
assumed to be random and queries are sent to a single best shard, forwarding to
additional shards as needed [5].
In selective search, the corpus is automatically clustered into query-
independent topic-based shards with skewed vocabularies and distributed across
resources. When a query arrives, a resource selection algorithm identifies a subset
of shards that are likely to contain the relevant documents. The selected shards
are searched in parallel, and their top-k lists merged to form a final answer.
Because only a few shards are searched for each query, total cost per query is
reduced, leading to higher throughput.
Previous studies showed that selective search accuracy is comparable to a
typical distributed search architecture, but that efficiency is better [1,12], where
computational cost is determined by counting the number of postings processed
[1,12], or by measuring the execution time of a proof-of-concept implementation.
Resource Selection. Choosing which index shards to search for a query is
critical to search accuracy. There are three broad categories of resource selec-
tion algorithm: term-based, sample-based, and classification-based. Term-based
algorithms model the language distribution of a shard to estimate the relevance
of the shard to a query, with the vocabulary of each shard typically treated
as a bag of words. The estimation of relevance is accomplished by adapting an
existing document scoring algorithm [8] or by developing a new algorithm specif-
ically for resource selection [1,9,15,24]. Taily [1] is one of the more successful
approaches, and fits a Gamma distribution over the relevance scores for each
term. At query time, these distributions are used to estimate the number of
highly scoring documents in the shard.
Sample-based algorithms extract a small (of the order of 1%) sample of the
entire collection, and index it. When a query is received, the sample index is
searched and each top-ranked document acts as a (possibly weighted) vote for the
corresponding index shard [13,16,20,21,23]. One example is Rank-S [13], which
uses an exponentially decaying voting function derived from the document’s
retrieval rank. The (usually small number of) resources with scores greater than
0.0001 are selected.
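A rough sketch of this style of sample-based selection is given below. It only illustrates the voting mechanism (an exponentially decaying, rank-based vote per top-ranked CSI document and a small score cut-off); the exact weighting used by Rank-S may differ, and all names are illustrative.

```python
# Sketch of rank-based shard voting in the spirit of Rank-S; not the
# published algorithm, just the mechanism described above.
from collections import defaultdict


def rank_based_shard_selection(csi_ranking, doc_to_shard, B=5.0, cutoff=1e-4):
    """csi_ranking: doc ids returned by the sample (CSI) index, best first.
    doc_to_shard: maps each sampled doc id to the shard it came from."""
    votes = defaultdict(float)
    for rank, doc in enumerate(csi_ranking, start=1):
        votes[doc_to_shard[doc]] += B ** (-rank)   # exponentially decaying vote
    ranked = sorted(votes.items(), key=lambda kv: -kv[1])
    return [shard for shard, score in ranked if score > cutoff]
```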
Classification-based algorithms use training data to learn models for
resources using features such as text, the scores of term-based and sample-
based algorithms, and query similarity to historical query logs [2,10]. While
classification-based algorithms can be more effective than unsupervised meth-
ods, they require access to training data. Their main advantage lies in combining
heterogeneous resources such as search verticals.
Rank-S [13] and Taily [1] have both been used in prior work with similar
effectiveness. However, Taily is more efficient, because lookups for Gamma
parameters are substantially faster than searching a sample index. We use both
in our experiments.
Dynamic Pruning. Weighted AND (WAND) is a dynamic pruning algorithm
that only scores documents that may become one of the current top k based on
a preliminary estimate [3]. Dimopoulos et al. [7] developed a Block-Max version
of WAND in which continuous segments of postings data are bypassed under
some circumstances by using an index where each block of postings has a local
maximum score. Petri et al. [17] explored the relationship between WAND-style
pruning and document similarity formulations. They found that WAND is more
sensitive than Block-Max WAND to the document ranking algorithm. If the
distribution of scores is skewed, as is common with BM25, then WAND alone
is sufficient. However, if the scoring regime is derived from a language model,
then the distribution of scores is top-heavy, and BlockMax WAND should be
used. Rojas et al. [18] presented a method to improve performance of systems
combining WAND and a distributed architecture with random shards.
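To make the pruning idea concrete, the following much-simplified, in-memory sketch shows the WAND pivoting mechanism: per-term score upper bounds are accumulated to find the first document that could possibly beat the current heap-entry threshold, and other cursors are skipped forward to it. It omits block skipping, compressed postings, and the other engineering of the real implementations discussed above; the interface (posting lists, upper bounds, a scoring callback) is assumed for illustration only.

```python
# Much-simplified WAND sketch over in-memory posting lists; real systems add
# compressed, block-based postings and skipping.
import heapq


def wand_top_k(postings, upper_bound, score, k):
    """postings:    term -> sorted list of doc ids containing the term
    upper_bound: term -> maximum per-document score contribution of the term
    score:       callback (doc, present_terms) -> full document score
    k:           number of results to keep"""
    cursors = {t: 0 for t in postings}       # current position in each list
    heap = []                                # min-heap of (score, doc), size <= k
    theta = float("-inf")                    # heap-entry threshold

    def current(t):
        lst, pos = postings[t], cursors[t]
        return lst[pos] if pos < len(lst) else None

    while True:
        active = sorted((t for t in postings if current(t) is not None),
                        key=current)
        if not active:
            break
        # Pivot selection: accumulate upper bounds (in current-docid order)
        # until they could exceed the threshold.
        acc, pivot = 0.0, None
        for t in active:
            acc += upper_bound[t]
            if acc > theta:
                pivot = t
                break
        if pivot is None:
            break                            # no remaining doc can enter the top-k
        pivot_doc = current(pivot)
        if current(active[0]) == pivot_doc:
            # All lists up to the pivot sit on the pivot document: score it fully.
            present = [t for t in active if current(t) == pivot_doc]
            s = score(pivot_doc, present)
            if len(heap) < k:
                heapq.heappush(heap, (s, pivot_doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot_doc))
            if len(heap) == k:
                theta = heap[0][0]
            for t in present:                # move past the scored document
                cursors[t] += 1
        else:
            # Skip the earliest list forward to the pivot document (or beyond).
            t = active[0]
            lst = postings[t]
            while cursors[t] < len(lst) and lst[cursors[t]] < pivot_doc:
                cursors[t] += 1
    return sorted(heap, reverse=True)
```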
Term-Bounded Max Score (TBMS) [22] is an alternative document-at-a-time
dynamic pruning algorithm that is currently used in the Indri Search Engine.
The key idea of TBMS is to precompute a “topdoc” list for each term, ordered
by the frequency of the term in the document, and divided by the document
length. The algorithm uses the union of the topdoc lists for the terms to deter-
mine a candidate list of documents to be scored. The number of documents in
the topdoc list for each term is experimentally determined, a choice that can
have an impact on overall performance. Kulkarni and Callan [12] explored the
use of TBMS in the context of selective search.
3 Experiments
The observations of Kulkarni and Callan [12] provide evidence that dynamic
pruning and selective search can be complementary. Our work extends that
exploration in several important directions. First, we investigate whether there
is a correlation between the rank of a shard and dynamic pruning effectiveness
for that shard. A correlation could imply that dynamic pruning effectiveness
depends on the number of shards searched. We focus on the widely-used WAND
pruning algorithm, chosen because it is both efficient and versatile, particularly
when combined with a scoring function such as BM25 that gives rise to skewed
score distributions [7,17].
Experiments were conducted using the ClueWeb09 Category B dataset, con-
taining 50 million web documents. The dataset was partitioned into 100 topi-
cal shards using k-means clustering and a KL-divergence similarity metric, as
described by Kulkarni and Callan [11], and stopped using the default Indri sto-
plist and stemmed using the Krovetz stemmer. On average, the topical shards
contain around 500k documents, with considerable variation, see Fig. 1. A second
partition of 100 random shards was also created, a system in which exhaustive
“all shards” search is the only way of obtaining effective retrieval. Each shard
in the two systems was searched using BM25, with k1 = 0.9, b = 0.4, and global
corpus statistics for idf and average document length.1
1 The values for b and k1 are based on the parameter choices reported for Atire and Lucene in the 2015 IR-Reproducibility Challenge, see http://github.com/lintool/IR-Reproducibility.
Each selected shard returned its top 1,000 documents, which were merged
by score to produce a final list of k = 1,000 documents. In selective search,
deeper ranks are necessary because most of the good documents may be in one
or two shards due to the term skew. Also, deeper k supports learning-to-rank
algorithms. Postings lists were compressed and stored in blocks of 128 entries
using the FastPFOR library [14], supporting fast block-based skipping during
the WAND traversal.
Two resource selection algorithms were used: Taily [1] and Rank-S [13]. The
Taily parameters were taken from Aly et al. [1]: n = 400 and v = 50, where v is
the cut-off score and n represents the theoretical depth of the ranked list. The
Rank-S parameters used are consistent with the values reported by Kulkarni
et al. [13]. A decay base of B = 5 with a centralized sample index (CSI) contain-
ing 1% of the documents was used – approximately the same size as the average
shard. We were unable to find parameters that consistently yielded better results
than the original published values.
We conducted evaluations using the first 1,000 unique queries from each of the
AOL query log2 and the TREC 2009 Million Query Track. We removed single-
term queries, which do not benefit from WAND, and queries where the resource
selection process did not select any shards. Removing single-term queries is a
common procedure for research with WAND [3] and allows our results to be
compared with prior work. That left 713 queries from the AOL log, and 756
queries from MQT, a total of 1,469 queries.
Our focus is on the efficiency of shard search, rather than resource selection.
To compare the efficiency of different shard search methods, we count the number
of postings scored, a metric that is strongly correlated with total processing
time [3], and is less sensitive to system-specific tuning and precise hardware
configuration than is measured execution time.

Fig. 2. Correlation between the number of postings processed for a query and the time
taken for query evaluation (query time in ms versus number of postings evaluated,
both on logarithmic scales). Data points are generated from MQT queries using both
WAND and full evaluation, applied independently to all 100 topical shards and all 100
random shards. In total, 756 × 200 × 2 ≈ 300,000 points are plotted.

2 We recognize that the AOL log has been withdrawn, but also note that it continues
to be widely used for research purposes.

As a verification of this relationship,
Fig. 2 shows the correlation between processing time per query, per shard, and
the number of postings evaluated. There is a strong linear relationship; note also
that more than 99.9 % of queries completed in under 1 s with only a few extreme
outliers requiring longer.
Pruning Effectiveness of WAND on Topical Shards. The first experiment
investigated how WAND performs on the topical shards constructed by selective
search. Each shard was searched independently, as is typical in distributed settings,
since parallelism is crucial to low response latency. The number of posting evaluations
required in each shard by WAND-based query evaluation, denoted w, was recorded. The
total length of the postings for the query terms in the selected shards was also
recorded, and is denoted as b, representing the number of postings processed by
an unpruned search in the same shard. The ratio w/b then measures the fraction
of the work WAND carried out compared to an unpruned search. The lower the
ratio, the greater the savings. Values of w/b can then be combined across queries
in two different ways: micro- and macro-averaging. In micro-averaging, w and b
are summed over the queries and a single value of w/b is calculated from the two
sums. In macro-averaging, w/b is calculated for each query, and averaged across
queries. The variance inherent in queries means that the two averaging methods
can produce different values, although broad trends are typically consistent.
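The difference between the two averaging schemes can be illustrated in a few lines; the toy numbers below merely show how a single posting-heavy query can dominate the micro-average while every query counts equally in the macro-average.

```python
# Micro- vs. macro-averaging of the w/b savings ratio; `pairs` holds one
# (w, b) tuple per query (postings evaluated by WAND, postings in total).

def micro_average(pairs):
    total_w = sum(w for w, _ in pairs)
    total_b = sum(b for _, b in pairs)
    return total_w / total_b

def macro_average(pairs):
    return sum(w / b for w, b in pairs) / len(pairs)

pairs = [(10, 100), (900, 1000)]     # a cheap query and a posting-heavy query
print(micro_average(pairs))          # 910/1100 ~= 0.83
print(macro_average(pairs))          # (0.10 + 0.90)/2 = 0.50
```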
Figure 3 and Table 1 provide insights into the behavior of macro- and micro-
averaging. Figure 3 uses the AOL queries and all 100 topical shards, plotting
w/b values on a per query per shard basis as a function of the total length of
the postings lists for that query in that shard. Queries involving only rare terms
benefit much less from WAND than queries with common terms.
Fig. 3. Ratio of savings achieved by WAND as a function of the total postings length of
each query in the AOL set, measured on a per shard basis. A total of 100×713 ≈ 71,000
points are plotted. Queries containing only rare terms derive little benefit from WAND.
Table 1. Ratio of per shard per query postings evaluated and per shard per query
execution time for WAND-based search, as ratios relative to unpruned search, averaged
over 100 topical shards and over 100 randomized shards, and over two groups each
of 700+ queries. The differences between the Topical and Random macro-averaged
ratios are significant for both query sets and both measures (paired two-tailed t-test,
p < 0.01).
Table 2. Average number of shards searched, and micro-averaged postings ratios for
those selected shards and for the complement set of shards, together with the cor-
responding query time cost ratios, in each case comparing WAND-based search to
unpruned search. Smaller numbers indicate greater savings.
            Shards     WAND postings cost ratio     WAND runtime cost ratio
            searched   Selected    Non-selected     Selected    Non-selected
Taily AOL   3.1        0.32        0.35             0.36        0.36
Taily MQT   2.7        0.23        0.37             0.30        0.40
Table 3. As for Table 2, but showing macro-averaged ratios. All differences between
selected and non-selected shards are significant (paired two-tailed t-test, p < 0.01).
In the next experiment, resource selection was used to determine which shards to
search. For each query the WAND savings were calculated for
the small set of selected shards, and the much larger set of non-selected shards.
Table 2 lists micro-averaged w/b ratios, and Table 3 the corresponding macro-
averaged ratios. While all shards see improvements with WAND, the selected
shards see a greater efficiency gain than the non-selected shards, reinforcing
our contention that resource selection is an important component in search effi-
ciency. When compared to the ratios shown in Table 1, the selected shards see
substantially higher benefit than average shards; the two orthogonal optimiza-
tions generate better-than-additive savings.
Figure 4a shows the distribution of the individual per query per shard times
for the MQT query set, covering in the first four cases only the shards chosen by
the two resource selection processes. The fifth exhaustive search configuration
includes data for all of the 100 randomly-generated shards making up the second
system, and is provided as a reference point. Figure 4b gives numeric values for
[Fig. 4(a): box plots of the per query-shard response time distributions for Rank-S
Full, Rank-S WAND, Taily Full, Taily WAND, and Exhaustive WAND.]

(b) Mean and median query response times (ms):
                     Mean    Median
  Rank-S WAND        28.5    11.3
  Taily Full         134.0   34.2
  Taily WAND         42.7    23.6
  Exhaustive WAND    26.6    21.8
Fig. 4. Distribution of query response times for MQT queries on shards: (a) as a box
plot distribution, with a data point plotted for each query-shard pair; (b) as a table
of corresponding means and medians. In (a), the center line of the box indicates the
median, the outer edges of the box the first and third quartiles, and the blue circle
the mean. The whiskers extend to include all points within 1.5 times the inter-quartile
range of the box. The graph was truncated to omit a small number of extreme points
for both Rank-S Full and Taily-Full. The maximum time for both these two runs was
6,611 ms.
Fig. 5. Normalized 1,000th document scores from shards, averaged over queries and
then shard ranks, and expressed as a fraction of the collection-wide maximum document
score for each corresponding query. The score falls with rank, as fewer high-scoring
documents appear in lower-ranked shards.
the mean and median of each of the five distributions. When WAND is combined
with selective search, it both reduces the average time required to search a
shard and also reduces the variance of the query costs. Note the large differences
between the mean and median query processing times for the unpruned search
and the reduction in that gap when WAND is used; this gain arises because query
and shard combinations that have high processing times due to long postings
lists are the ones that benefit most from WAND. Therefore, in typical distributed
environments where shards are evaluated in parallel, the slowest, bottleneck
shard will benefit the most from WAND and may result in additional gains
in latency reduction. Furthermore, while Fig. 4 shows similar per shard query
costs for selective and exhaustive search, the total work associated with selective
search is substantially less than exhaustive search because only 3–5 shards are
searched per query, whereas exhaustive search involves all 100 shards. Taken in
conjunction with the previous tables, Fig. 4 provides clear evidence that WAND
amplifies the savings generated by selective search, answering the first part of
RQ1 with a “yes”. In addition, these experiments have confirmed that execution
time is closely correlated with measured posting evaluations. The remaining
experiments utilize postings counts as the cost metric.
We now consider the second part of RQ1 and seek to explain why dynamic
pruning improves selective search. Part of the reason is that the postings lists
of the query terms associated with the highly ranked shards are longer than
they are in a typical randomized shard. With these long postings lists, there is
more opportunity for WAND to achieve early termination. Figure 5 shows nor-
malized final heap-entry thresholds, or equivalently, the similarity score of the
1,000th ranked document in each shard. The scores are expressed as a fraction
of the maximum document score for that query across all shards, then plotted
as a function of the resource selector’s shard ranking using Taily, averaged over
queries. Shards that Taily did not score because they did not contain any query
Fig. 6. The micro-average w/b ratio for WAND postings evaluations, as a function of
the per query shard rankings assigned by Taily. Early shards generate greater savings.
terms were ordered randomly. For example, for the AOL log the 1,000th doc-
ument in the shard ranked highest by Taily attains, on average across queries,
a score that is a little over 80 % of the maximum score attained by any single
document for that same query. The downward trend in Fig. 5 indicates that the
resource ranking process is effective, with the high heap-entry thresholds in the
early shards suggesting – as we would hope – that they contain more of the
high-scoring documents.
To further illustrate the positive relationship between shard ranking and
WAND, w/b was calculated for each shard in the per query shard orderings,
and then micro-averaged at each shard rank. Figure 6 plots the average as a
function of shard rank, and confirms the bias towards greater savings on the
early shards – exactly the ones selected for evaluation. As a reference point, the
same statistic was calculated for a random ordering of the randomized shards
(random since no shard ranking is applied in traditional distributed search), with
the savings ratio being a near-horizontal line. If an unpruned full search were
to be plotted, it would be a horizontal line at 1.0. The importance of resource
selection to retrieval effectiveness has long been known; Fig. 6 indicates that
effective resource selection can improve overall efficiency as well.
Improving Efficiency with Cascaded Pruning Thresholds. In the exper-
iments reported so far, the rankings were computed on each shard independently,
presuming that they would be executing in parallel and employing private top-k
heaps and private heap-entry thresholds, with no ability to share information.
This approach minimizes search latency when multiple machines are available,
and is the typical configuration in a distributed search architecture. An alter-
native approach is suggested by our second research question: what happens
if the shards are instead searched sequentially, passing the score threshold and
top-k heap from each shard to the next? The heap-entry score threshold is then
non-decreasing across the shards, and additional savings should result. While
this approach would be unlikely to be used in an on-line system, it provides
an upper bound on the efficiency gains that are possible if a single heap were
shared by all shards, and would increase throughput when limited resources are
available and latency is not a concern: for example, in off-line search and text
analytics applications.

Fig. 7. Normalized 1,000th document scores from shards relative to the highest score
attained by any document for the corresponding query, micro-averaged over queries,
assuming that shards are processed sequentially rather than in parallel, using the Taily-
based ordering of topical shards and a random ordering of the same shards.

Fig. 8. Ratio of postings evaluated by WAND for independent shard search versus
sequential shard search, AOL queries with micro-averaging. Shard ranking was deter-
mined by Taily.
Figure 7 demonstrates the threshold in the sequential WAND configuration,
with shards ordered in two ways: by Taily score, and randomly. The normalized
threshold rises quickly towards the maximum document score through the first
few shards in the Taily ordering, which is where most of the documents related
to the query are expected to reside. Figure 8 similarly plots the w/b WAND
savings ratio at each shard rank, also micro-averaged over queries, and with shard
ordering again determined by the Taily score. The independent and sequential
configurations diverge markedly in their behavior, with a deep search in the
latter processing far fewer postings than a deep search in the former. The MQT
query set displayed similar trends. Sharing the dynamic pruning thresholds has
a large effect on the efficiency of selective search.
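A sketch of this sequential configuration is shown below. Here `search_shard` stands in for any WAND-based shard search that accepts an initial heap-entry threshold; it is an assumption of the sketch, not part of the systems evaluated in this paper.

```python
# Sequential ("cascaded") selective search: the top-k heap, and hence the
# heap-entry threshold, is carried from shard to shard in resource-selection
# order instead of being private to each shard.
import heapq


def sequential_selective_search(ranked_shards, query, search_shard, k=1000):
    heap = []                          # shared min-heap of (score, doc)
    theta = float("-inf")              # non-decreasing across shards
    for shard in ranked_shards:        # shards in resource-selection order
        for score, doc in search_shard(shard, query, threshold=theta, k=k):
            if len(heap) < k:
                heapq.heappush(heap, (score, doc))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc))
        if len(heap) == k:
            theta = heap[0][0]         # pass the tightened threshold onwards
    return sorted(heap, reverse=True)
```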
Our measurements suggest that a hybrid approach between independent and
sequential search could be beneficial. A resource-ranker might be configured to
underestimate the number of shards that are required, with the understanding
that a second round of shard ranking can be instigated in situations where
deeper search is needed, identified through examining the scores or the quantity
of documents retrieved. When a second wave of shards is activated, passing the
maximum heap-entry threshold attained by the first-wave process would reduce
the computational cost. If the majority of queries are handled within the first
wave, a new combination of latency and workload will result.
4 Conclusion
Selective search reduces the computational costs of large-scale search by eval-
uating fewer postings than the standard distributed architecture, resulting in
computational work savings of up to 90 %. To date there has been only lim-
ited consideration of the interaction between dynamic pruning and selective
search [12], and it has been unclear whether dynamic pruning methods improve
selective search, or whether selective search is capturing some or all of the same
underlying savings as pruning does, just via a different approach. In this paper
we have explored WAND dynamic pruning using a large dataset and two differ-
ent query sets. In contrast to Kulkarni’s findings with TBMS [12], we show that
WAND-based evaluation and selective search generate what are effectively inde-
pendent savings, and that the combination is more potent than either technique
is alone – that is, that their interaction is a positive one. In particular, when
resource selection is used to choose query-appropriate shards, the improvements
from WAND on the selected shards are greater than the savings accruing on ran-
dom shards, confirming that dynamic pruning further improves selective search –
a rare situation where orthogonal optimizations are better-than-additive. We also
demonstrated that there is a direct correlation between the efficiency gains gen-
erated by WAND and the shard’s ranking. While it is well-known that resource
selection improves effectiveness, our results suggest that it can also improve
overall efficiency.
Finally, two different methods of applying WAND to selective search were
compared and we found that passing the top-k heap through a sequential shard
evaluation greatly reduced the volume of postings evaluated by WAND. The
significant difference in efficiency between this approach and the usual fully-
parallel mechanism suggests avenues for future development in which hybrid
models are used to balance latency and throughput in novel ways.
References
1. Aly, R., Hiemstra, D., Demeester, T.: Taily: shard selection using the tail of score
distributions. In: Proceedings of the 36th International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 673–682 (2013)
2. Arguello, J., Callan, J., Diaz, F.: Classification-based resource selection. In: Pro-
ceedings of the 18th ACM Conference on Information and Knowledge Management,
pp. 1277–1286 (2009)
3. Broder, A.Z., Carmel, D., Herscovici, M., Soffer, A., Zien, J.: Efficient query evalu-
ation using a two-level retrieval process. In: Proceedings of the 12th International
Conference on Information and Knowledge Management, pp. 426–434 (2003)
4. Cacheda, F., Carneiro, V., Plachouras, V., Ounis, I.: Performance comparison of
clustered and replicated information retrieval systems. In: Amati, G., Carpineto,
C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 124–135. Springer,
Heidelberg (2007)
5. Cambazoglu, B.B., Varol, E., Kayaaslan, E., Aykanat, C., Baeza-Yates, R.: Query
forwarding in geographically distributed search engines. In: Proceedings of the 33rd
International ACM SIGIR Conference on Research and Development in Informa-
tion Retrieval, pp. 90–97 (2010)
6. Croft, W.B.: A model of cluster searching based on classification. Inf. Syst. 5(3),
189–195 (1980)
7. Dimopoulos, C., Nepomnyachiy, S., Suel, T.: Optimizing top-k document retrieval
strategies for block-max indexes. In: Proceedings of the of the Sixth ACM Inter-
national Conference on Web Search and Data Mining, pp. 113–122 (2013)
8. Gravano, L., Garcı́a-Molina, H., Tomasic, A.: GlOSS: Text-source discovery over
the internet. ACM Trans. Database Syst. 24, 229–264 (1999)
9. Ipeirotis, P.G., Gravano, L.: Distributed search over the hidden web: Hierarchical
database sampling and selection. In: Proceedings of the 28th International Confer-
ence on Very Large Data Bases, pp. 394–405 (2002)
10. Kang, C., Wang, X., Chang, Y., Tseng, B.: Learning to rank with multi-aspect
relevance for vertical search. In: Proceedings of the Fifth ACM International Con-
ference on Web Search and Data Mining, pp. 453–462 (2012)
11. Kulkarni, A., Callan, J.: Document allocation policies for selective searching of
distributed indexes. In: Proceedings of the 19th ACM International Conference on
Information and Knowledge Management, pp. 449–458 (2010)
12. Kulkarni, A., Callan, J.: Selective search: Efficient and effective search of large
textual collections. ACM Trans. Inf. Syst. 33(4), 17:1–17:33 (2015)
13. Kulkarni, A., Tigelaar, A., Hiemstra, D., Callan, J.: Shard ranking and cutoff
estimation for topically partitioned collections. In: Proceedings of the 21st ACM
International Conference on Information and Knowledge Management, pp. 555–564
(2012)
14. Lemire, D., Boytsov, L.: Decoding billions of integers per second through vector-
ization. Soft. Prac. & Exp. 41(1), 1–29 (2015)
15. Nottelmann, H., Fuhr, N.: Evaluating different methods of estimating retrieval
quality for resource selection. In: Proceedings of the 26th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 290–297. ACM (2003)
16. Paltoglou, G., Salampasis, M., Satratzemi, M.: Integral based source selection for
uncooperative distributed information retrieval environments. In: Proceedings of
the 2008 ACM Workshop on Large-Scale Distributed Systems for Information
Retrieval, pp. 67–74 (2008)
17. Petri, M., Culpepper, J.S., Moffat, A.: Exploring the magic of WAND. In: Pro-
ceedings of the Australian Document Computing Symposium, pp. 58–65 (2013)
18. Rojas, O., Gil-Costa, V., Marin, M.: Distributing efficiently the block-max WAND
algorithm. In: Proceedings of the 2013 International Conference on Computational
Science, pp. 120–129 (2013)
19. Salton, G.: Automatic Information Organization and Retrieval. McGraw-Hill,
New York (1968)
20. Shokouhi, M.: Central-Rank-Based Collection Selection in Uncooperative Distrib-
uted Information Retrieval. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR
2007. LNCS, vol. 4425, pp. 160–172. Springer, Heidelberg (2007)
21. Si, L., Callan, J.: Relevant document distribution estimation method for resource
selection. In: Proceedings of the 26th Annual International ACM SIGIR Conference
on Research and Development in Informaion Retrieval, pp. 298–305 (2003)
22. Strohman, T., Turtle, H., Croft, W.B.: Optimization strategies for complex queries.
In: Proceedings of the 28th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 219–225 (2005)
23. Thomas, P., Shokouhi, M.: Sushi: Scoring scaled samples for server selection. In:
Proceedings of the 32nd ACM SIGIR Conference on Research and Development
in Information Retrieval, pp. 419–426 (2009)
24. Yuwono, B., Lee, D.L.: Server ranking for distributed text retrieval systems on
internet. In: Proceedings of the International Conference on Database Systems for
Advanced Applications, pp. 41–49 (1997)
Efficient AUC Optimization for Information
Ranking Applications
Sean J. Welleck
1 Introduction
In various information retrieval applications, a system may need to provide a
ranking of candidate items that satisfies a criteria. For instance, a search engine
must produce a list of results, ranked by their relevance to a user query. The
relationship between items (e.g. documents) represented as feature vectors and
their rankings (e.g. based on relevance scores) is often complex, so machine
learning is used to learn a function that generates a ranking given a list of items.
The ranking system is evaluated using metrics that reflect certain goals for
the system. The choice of metric, as well as its relative importance, varies by
application area. For instance, a search engine may evaluate its ranking sys-
tem with Normalized Discounted Cumulative Gain (NDCG), while a question-
answering system evaluates its ranking using precision at 3; a high NDCG score
is meant to indicate results that are relevant to a user’s query, while a high
precision shows that a favorable amount of correct answers were ranked highly.
Other common metrics include Recall @ k, Mean Average Precision (MAP), and
Area Under the ROC Curve (AUC).
Ranking algorithms may optimize error rate as a proxy for improving metrics
such as AUC, or may optimize the metrics directly. However, typical metrics such
as NDCG and AUC are either flat everywhere or non-differentiable with respect
to model parameters, making direct optimization with gradient descent difficult.
2 Related Work
This work relates to two areas: LambdaMART and AUC optimization in ranking.
LambdaMART was originally proposed in [15] and is overviewed in [2]. The
LambdaRank algorithm, upon which LambdaMART is based, was shown to find
a locally optimal model for the IR metrics NDCG@10, mean NDCG, MAP, and
MRR [6]. Svore et al. [14] propose a modification to LambdaMART that allows
for simultaneous optimization of NDCG and a measure based on click-through
rate.
Various approaches have been developed for optimizing AUC in binary-class
settings. Cortes and Mohri [5] show that minimum error rate training may be
insufficient for optimizing AUC, and demonstrate that the RankBoost algorithm
globally optimizes AUC. Calders and Jaroszewicz [3] propose a smooth poly-
nomial approximation of AUC that can be optimized with a gradient descent
method. Joachims [9] proposes an SVM method for various IR measures includ-
ing AUC, and evaluates the system on text classification datasets. The SVM
method is used as the comparison baseline in this paper.
3 Ranking Metrics
We will first provide a review of the metrics used in this paper. Using document
retrieval as an example, consider n queries Q1 . . . Qn , and let n(i) denote the
number of documents in query Qi. Let dij denote document j in query Qi,
where i ∈ {1, . . . , n} and j ∈ {1, . . . , n(i)}.
             y = p     y = n
f(x) = p     TP        FP
f(x) = n     FN        TN
where y denotes an example’s label, f (x) denotes the predicted label, p denotes
the class label considered positive, and n denotes the class label considered
negative.
Measuring the precision of the first k ranked documents is often impor-
tant in ranking applications. For instance, Precision@1 is important for ques-
tion answering systems to evaluate whether the system’s top ranked item is
a correct answer. Although precision is a metric for binary class labels, many
ranking applications and standard datasets have multiple class labels. To eval-
uate precision in the multi-class context we use Micro-averaged Precision and
Macro-averaged Precision, which summarize precision performance on multiple
classes [10].
In the formulas below, C denotes the number of classes, TP_c is the number of true
positives for class c, and FP_c is the number of false positives for class c.
Precision_micro@k is measured by using only the first k ranked documents in
each query:

\mathrm{Precision}_{micro}@k = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_{c=1}^{C} \left| \{ d_{ij} \mid y_j = c,\; j \in \{1, \ldots, k\} \} \right|}{C \cdot k}    (2)
\mathrm{Precision}_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}    (3)
AUC. AUC refers to the area under the ROC curve. The ROC curve plots True
Positive Rate (TPR = TP/(TP + FN)) versus False Positive Rate
(FPR = FP/(FP + TN)), with TPR appearing on the y-axis, and FPR appearing
on the x-axis.
Each point on the ROC curve corresponds to a contingency table for a given
model. In the ranking context, the contingency table is for the ranking cutoff k;
the curve shows the T P R and F P R as k changes. A model is considered to have
better performance as its ROC curve shifts towards the upper left quadrant.
The AUC measures the area under this curve, providing a single metric that
summarizes a model’s ROC curve and allowing for easy comparison.
We also note that the AUC is equivalent to the Wilcoxon-Mann-Whitney
statistic [5] and can therefore be computed using the number of correctly ordered
document pairs. Fawcett [7] provides an efficient algorithm for computing AUC.
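As an illustration, AUC can be computed without enumerating pairs explicitly, via the rank-sum form of the Wilcoxon-Mann-Whitney statistic; the sketch below ignores tied scores for simplicity and is not Fawcett's algorithm itself.

```python
# AUC as the fraction of correctly ordered positive-negative pairs,
# computed from the rank sum of the positives (ties are ignored here).

def auc(scores, labels):
    """scores: model scores; labels: 1 for positive, 0 for negative examples."""
    m = sum(labels)                  # number of positives
    n = len(labels) - m              # number of negatives
    if m == 0 or n == 0:
        raise ValueError("need at least one positive and one negative example")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {idx: r + 1 for r, idx in enumerate(order)}   # 1 = lowest score
    rank_sum = sum(rank[idx] for idx, y in enumerate(labels) if y == 1)
    correct_pairs = rank_sum - m * (m + 1) / 2           # Mann-Whitney U
    return correct_pairs / (m * n)
```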
Multi-class AUC. The standard AUC formulation is defined for binary classi-
fication. To evaluate a model using AUC on a dataset with multiple class labels,
AUC can be extended to multi-class AUC (MAUC).
We define the class reference AUC value AUC(c_i) as the AUC when class
label c_i is viewed as positive and all other labels as negative. The multi-class
AUC is then the weighted sum of class reference AUC values, where each class
reference AUC is weighted by the proportion of the dataset examples with that
class label, denoted p(c_i) [7]:

MAUC = \sum_{i=1}^{C} AUC(c_i) \cdot p(c_i).    (5)
Note that the class-reference AUC of a prevalent class will therefore impact the
MAUC score more than the class-reference AUC of a rare class.
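A direct transcription of Eq. (5) might look as follows, reusing a binary AUC routine such as the one sketched above; the data layout (one list of class-reference scores per class) is an assumption made for the sake of illustration.

```python
# Multi-class AUC (MAUC): class-reference AUCs weighted by class prevalence.
from collections import Counter


def multi_class_auc(scores_per_class, labels):
    """scores_per_class: dict mapping class c -> per-example scores with c
    treated as the positive class; labels: true class label per example."""
    n = len(labels)
    prevalence = {c: count / n for c, count in Counter(labels).items()}
    mauc = 0.0
    for c, scores in scores_per_class.items():
        binary_labels = [1 if y == c else 0 for y in labels]
        mauc += auc(scores, binary_labels) * prevalence.get(c, 0.0)
    return mauc
```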
The number of correctly ordered pairs is defined as

\mathrm{CorrectPairs} = |\{(i, j) : \ell_i > \ell_j \text{ and } f(d_i) > f(d_j)\}|.    (11)

Note that a pair with equal labels is not considered a correct pair, since a
document pair (d_i, d_j) contributes to CorrectPairs if and only if d_i is ranked
higher than d_j in the ranked list induced by the current model scores, and ℓ_i > ℓ_j.
We now derive a formula for computing the ΔAUC_ij term in O(1) time, given
the ranked list and labels. This avoids the brute-force approach of counting the
number of correct pairs before and after the swap, in turn providing an efficient
way to compute a λ-gradient for AUC. Specifically, we have:

\Delta AUC_{ij} = \frac{(\ell_j - \ell_i)(j - i)}{mn}.    (12)
Proof. To derive this formula, we start with
\Delta AUC_{ij} = \frac{CP_{swap} - CP_{orig}}{mn}    (13)
where CPswap is the value of CorrectPairs after swapping the scores assigned
to documents i and j, and CPorig is the value of CorrectPairs prior to the
swap. Note that the swap corresponds to swapping positions of documents i and
j in the ranked list. The numerator of ΔAUC_ij is the change in the number of
correct pairs due to the swap. The following lemma shows that we only need to
compute the change in the number of correct pairs for the pairs of documents
within the interval [i,j] in the ranked list.
Lemma 1. Let $(d_a, d_b)$ be a document pair where at least one of $a, b \notin [i, j]$. Then after swapping documents $(d_i, d_j)$, the pair correctness of $(d_a, d_b)$ will be left unchanged or negated by another pair.
Proof. Without loss of generality, assume a < b. There are five cases to consider.
Case $a \notin [i, j]$, $b \notin [i, j]$: Then the pair $(d_a, d_b)$ does not change due to the swap, therefore its pair correctness does not change.
Note that unless one of a or b is an endpoint i or j, the pair (da , db ) does not
change. Hence we now assume that one of a or b is an endpoint i or j.
Case $a < i$, $b = i$: The pair correctness of $(d_a, d_b)$ will change if and only if $y_a = 1$, $y_b = 1$, $y_j = 0$ prior to the swap. But then the pair correctness of $(d_i, d_j)$ will change from correct to not correct, canceling out the change (see Fig. 1).
Case $a < i$, $b = j$: Then the pair correctness of $(d_a, d_b)$ will change if and only if $y_a = 1$, $y_b = 1$, $y_i = 0$ prior to the swap. But then the pair correctness of $(d_a, d_i)$ will change from correct to not correct, canceling out the change.
Fig. 1. The swap of the documents at positions i and j, with the labels at positions a, i, and j shown before and after the swap.
Case $a = i$, $b > j$: Then the pair correctness of $(d_a, d_b)$ will change if and only if $y_a = 0$, $y_b = 0$, $y_j = 1$ prior to the swap. But then the pair correctness of $(d_j, d_b)$ will change from correct to not correct, canceling out the change.
Case $a = j$, $b > j$: Then the pair correctness of $(d_a, d_b)$ will change if and only if $y_a = 0$, $y_b = 0$, $y_i = 1$ prior to the swap. But then the pair correctness of $(d_i, d_b)$ will change from correct to not correct, canceling out the change.
Hence in all cases, either the pair correctness stays the same, or the pair
(da , db ) changes from not correct to correct and an additional pair changes from
correct to not correct, thus canceling out the change with respect to the total
number of correct pairs after the swap.
Lemma 1 shows that the difference in correct pairs $CP_{swap} - CP_{orig}$ is equivalent to $CP_{swap[i,j]} - CP_{orig[i,j]}$, namely the change in the number of correct pairs within the interval $[i,j]$. Lemma 2 tells us that this value is simply the length of the interval $[i,j]$.
Lemma 2. Assume $i < j$. Then $CP_{swap[i,j]} - CP_{orig[i,j]} = (y_j - y_i)(j - i)$.
Proof. Case $y_i = 0$, $y_j = 1$: Before swapping, each pair $(i, k)$, $i < k \le j$, such that $y_k = 0$ is not a correct pair. After the swap, each of these pairs is a correct pair. There are $n_{l_0[i,j]}$ such pairs, namely the number of documents in the interval $[i, j]$ with label 0.
Each pair $(k, j)$, $i \le k < j$, such that $y_k = 1$ is not a correct pair before swapping, and is correct after swapping. There are $n_{l_1[i,j]}$ such pairs, namely the number of documents in the interval $[i, j]$ with label 1.
Each pair $(i, k)$, $i < k \le j$, such that $y_k = 1$ remains not correct. Each pair $(k, j)$, $i \le k < j$, such that $y_k = 0$ remains not correct. Every other pair remains unchanged. Therefore
$n_{l_0[i,j]} + n_{l_1[i,j]} = j - i$   (16)
pairs changed from not correct to correct, corresponding to an increase in the number of correct pairs. Hence we have:
$CP_{swap[i,j]} - CP_{orig[i,j]} = (j - i) = (y_j - y_i)(j - i)$.
Therefore, by Lemmas 1 and 2, we have:
$\Delta AUC_{ij} = \frac{CP_{swap} - CP_{orig}}{mn} = \frac{CP_{swap[i,j]} - CP_{orig[i,j]}}{mn} = \frac{(y_j - y_i)(j - i)}{mn},$
completing the proof of Theorem 1.
Applying the formula from Theorem 1 to the list of documents sorted by the
current model scores, we define the λ-gradient for AUC as:
$\lambda_{AUC_{ij}} = S_{ij}\,\Delta AUC_{ij}\,\frac{\partial C_{ij}}{\partial o_{ij}}$   (17)
where $S_{ij}$ and $\frac{\partial C_{ij}}{\partial o_{ij}}$ are as defined previously, and $\Delta AUC_{ij} = \frac{(y_j - y_i)(j - i)}{mn}$.
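A small sketch of how these pieces fit together is given below. It is not the paper's implementation: the definitions of S_ij and ∂C_ij/∂o_ij are taken from the standard RankNet/LambdaMART formulation (they are defined earlier in the paper and only assumed here), positions are 1-based, and m and n are taken to be the numbers of positive and negative documents.

import math

# Minimal sketch of the O(1) Delta-AUC term (Eq. 12) and the lambda-gradient (Eq. 17).
def delta_auc(i, j, y, m, n):
    # i, j: 1-based positions in the ranked list; y: binary labels in ranked order.
    return (y[j - 1] - y[i - 1]) * (j - i) / (m * n)

def lambda_auc(i, j, scores, y, m, n, sigma=1.0):
    # Assumed RankNet-style conventions for S_ij and dC_ij/do_ij; equal labels give
    # delta_auc = 0, so such pairs contribute no gradient either way.
    s_ij = 1.0 if y[i - 1] > y[j - 1] else -1.0
    dC_do = -sigma / (1.0 + math.exp(sigma * (scores[i - 1] - scores[j - 1])))
    return s_ij * delta_auc(i, j, y, m, n) * dC_do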
λ-MAUC. To extend the λ-gradient for AUC to a multi-class setting, we con-
sider the multi-class AUC definition found in Eq. (5). Since MAUC is a linear
combination of class-reference AUC values, to compute ΔM AU Cij we can com-
pute the change in each class-reference AUC value ΔAU C(ck ) separately using
Eq. (12) and weight each Δ value by the proportion p(ck ), giving:
$\Delta MAUC_{ij} = \sum_{k=1}^{C} \Delta AUC(c_k)_{ij} \cdot p(c_k)$.   (18)
Using this term and the previously defined terms $S_{ij}$ and $\frac{\partial C_{ij}}{\partial o_{ij}}$, we define the λ-gradient for MAUC as:
$\lambda_{MAUC_{ij}} = S_{ij}\,\Delta MAUC_{ij}\,\frac{\partial C_{ij}}{\partial o_{ij}}$.   (19)
5 Experiments
Experiments were conducted on binary-class datasets to compare the AUC per-
formance of LambdaMART trained with the AUC λ-gradient, referred to as
LambdaMART-AUC, against a baseline model. Similar experiments were con-
ducted on multi-class datasets to compare LambdaMART trained with the
MAUC λ-gradient, referred to as LambdaMART-MAUC, against a baseline in
terms of MAUC. Differences in precision on the predicted rankings were also
investigated.
The LambdaMART implementation used in the experiments was a modified
version of the JForests learning to rank library [8]. This library showed the best
NDCG performance out of the available Java ranking libraries in preliminary
experiments. We then implemented extensions required to compute the AUC and
multi-class AUC λ-gradients. For parameter tuning, a learning rate was chosen
for each dataset by searching over the values {0.1, 0.25, 0.5, 0.9} and choosing
the value that resulted in the best performance on a validation set.
As the comparison baseline, we used a Support Vector Machine (SVM) for-
mulated for optimizing AUC. The SVM implementation was provided by the
SVM-Perf [9] library. The ROCArea loss function was used, and the regulariza-
tion parameter c was chosen by searching over the values {0.1, 1, 10, 100} and
choosing the value that resulted in the best performance on a validation set.
For the multi-class setting, a binary classifier was trained for each individual relevance class. Prediction scores for a document $d$ were then generated by computing the quantity $\sum_{c=1}^{C} c \cdot f_c(d)$, where $C$ denotes the number of classes and $f_c$ denotes the binary classifier for relevance class $c$. These scores were used to induce a ranking of documents for each query.
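The scoring scheme can be sketched as follows; the per-class classifiers here are stand-ins for the trained SVM-Perf models (any callable returning a real-valued score will do), so this is an illustration of the combination rule rather than the exact experimental code.

# Minimal sketch of the multi-class baseline scoring: sum_c c * f_c(d).
def combined_score(doc_features, classifiers):
    # classifiers: dict mapping relevance class c (1..C) to its scoring function f_c
    return sum(c * f_c(doc_features) for c, f_c in classifiers.items())

def rank_query(docs, classifiers):
    # Rank a query's documents by the combined score, highest first.
    return sorted(docs, key=lambda d: combined_score(d, classifiers), reverse=True)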
5.1 Datasets
The Yahoo! Learning to Rank Challenge datasets have integer relevance labels ranging from 0 to 4 (very relevant), while the LETOR datasets have integer relevance scores ranging from 0 to 2. The LETOR datasets have 1700 and 800 queries, respectively, while the larger Yahoo! datasets have approximately 20,000 queries.
5.2 Results
MAUC. Table 1 shows the MAUC scores on held out test sets for the
four multi-class datasets. The reported value is the average MAUC across all
dataset folds. The results indicate that in terms of optimizing Multi-class AUC,
LambdaMART-MAUC is as effective as SVM-Perf on the LETOR datasets, and
more effective on the larger Yahoo! datasets.
Additionally, the experiments found that LambdaMART-MAUC outper-
formed SVM-Perf in terms of precision in all cases. Table 2 shows the Mean Aver-
age Precision scores for the four datasets. LambdaMART-MAUC also had higher
$Precision_{micro}@k$ and $Precision_{macro}@k$ on all datasets, for $k = 1, \ldots, 10$. For instance, Fig. 2 shows the values of $Precision_{micro}@k$ and $Precision_{macro}@k$ for the Yahoo! V1 dataset.
The class-reference AUC scores indicate that LambdaMART-MAUC and
SVM-Perf arrive at their MAUC scores in different ways. LambdaMART-MAUC
focuses on the most prevalent class; each $\Delta AUC(c_i)$ term for a prevalent class receives a higher weighting than for a rare class due to the $p(c_i)$ term in the $\lambda_{MAUC}$ computation. As a result, the λ-gradients in LambdaMART-MAUC place more emphasis on achieving a high $AUC(c_1)$ than a high $AUC(c_4)$. Table 3 shows
the class-reference AUC scores for the Yahoo! V1 dataset. We observe that
LambdaMART-MAUC produces better AU C(c1 ) than SVM-Perf, but worse
AU C(c4 ), since class 1 is much more prevalent than class 4; 48 % of the docu-
ments in the training set with a positive label have a label of class 1, while only
2.5 % have a label of class 4.
Finally, we note that on the large-scale Microsoft Learning to Rank Dataset
MSLR-WEB10k [11], the SVM-Perf training failed to converge on a single fold
after 12 hours. Therefore training a model for each class for every fold was
impractical using SVM-Perf, while LambdaMART-MAUC was able to train on
all five folds in less than 5 hours. This further suggests that LambdaMART-
MAUC is preferable to SVM-Perf for optimizing MAUC on large ranking
datasets.
Table 3. Class-reference AUC scores for the Yahoo! V1 dataset.

                   AUC(c1)   AUC(c2)   AUC(c3)   AUC(c4)
LambdaMART-MAUC    0.503     0.690     0.757     0.831
SVM-Perf           0.474     0.682     0.796     0.920
6 Conclusions
We have introduced a method for optimizing AUC on ranking datasets using a
gradient-boosting framework. Specifically, we have derived gradient approxima-
tions for optimizing AUC with LambdaMART in binary and multi-class settings,
and shown that the gradients are efficient to compute. The experiments show that
the method performs as well as, or better than, a baseline SVM method, and per-
forms especially well on large, multi-class datasets. In addition to adding Lamb-
daMART to the portfolio of algorithms that can be used to optimize AUC, our
extensions expand the set of IR metrics for which LambdaMART can be used.
There are several possible future directions. One is investigating local opti-
mality of the solution produced by LambdaMART-AUC using Monte Carlo
methods. Other directions include exploring LambdaMART with multiple objec-
tive functions to optimize AUC, and creating an extension to optimize area under
a Precision-Recall curve rather than an ROC curve.
Acknowledgements. Thank you to Dwi Sianto Mansjur for giving helpful guidance
and providing valuable comments about this paper.
References
1. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N.,
Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the
22nd International Conference on Machine Learning, ICML 2005, pp. 89–96. ACM,
New York (2005)
2. Burges, C.J.: From RankNet to LambdaRank to LambdaMART: An overview. Learning
11, 23–581 (2010)
3. Calders, T., Jaroszewicz, S.: Efficient AUC optimization for classification. In: Kok,
J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron,
A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 42–53. Springer, Heidelberg
(2007)
4. Chapelle, O., Chang, Y.: Yahoo! learning to rank challenge overview (2011)
5. Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization. Adv. Neural
Inf. Process. Syst. 16(16), 313–320 (2004)
6. Donmez, P., Svore, K., Burges, C.J.: On the optimality of lambdarank. Technical
Report MSR-TR-2008-179, Microsoft Research, November 2008. http://research.
microsoft.com/apps/pubs/default.aspx?id=76530
7. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8),
861–874 (2006)
8. Ganjisaffar, Y., Caruana, R., Lopes, C.V.: Bagging gradient-boosted trees for high
precision, low variance ranking models. In: Proceedings of the 34th international
ACM SIGIR conference on Research and development in Information Retrieval,
pp. 85–94. ACM (2011)
9. Joachims, T.: A support vector method for multivariate performance measures.
In: Proceedings of the 22nd International Conference on Machine Learning,
pp. 377–384. ACM, New York (2005)
10. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Cambridge University Press, Cambridge (2008)
11. Microsoft learning to rank datasets. http://research.microsoft.com/en-us/
projects/mslr/
12. Qin, T., Liu, T.: Introducing LETOR 4.0 datasets. CoRR abs/1306.2597 (2013).
http://arxiv.org/abs/1306.2597
13. Qin, T., Liu, T.Y., Xu, J., Li, H.: LETOR: A benchmark collection for research on
learning to rank for information retrieval. Inf. Retr. 13(4), 346–374 (2010).
http://dx.doi.org/10.1007/s10791-009-9123-y
14. Svore, K.M., Volkovs, M.N., Burges, C.J.: Learning to rank with multiple objective
functions. In: Proceedings of the 20th International Conference on World Wide
Web, pp. 367–376. ACM (2011)
15. Wu, Q., Burges, C.J., Svore, K.M., Gao, J.: Adapting boosting for information
retrieval measures. Inf. Retrieval 13(3), 254–270 (2010)
Modeling User Interests for Zero-Query Ranking
1 Introduction
The recent boom of mobile internet has seen the emergence of proactive search
systems like Google Now, Apple Siri and Microsoft Cortana. Unlike traditional
reactive Web search systems where the search engines return results in response
to queries issued by users, proactive search systems actively push information
cards to users on mobile devices based on the context such as time, location,
environment (e.g., weather), and user interests. Information cards are concise
and informative snippets commonly shown in many intelligent personal assistant
L. Yang—Work primarily done when interning at Microsoft.
commercial platforms and devices to represent user interests. For card topics, entities
are extracted from the associated URLs. To reduce the feature sparsity problem, we propose entity semantic similarity features based on word embeddings and an entity taxonomy in a knowledge base. The contributions of our work can be summarized as follows:
– We present the first in-depth study on modeling user interests for improving
the ranking of proactive search systems.
– We propose a variety of approaches to modeling user interests, ranging from coarser-grained modeling of card type preferences to finer-grained modeling of entity preferences and variations in demographics.
– We perform a thorough experimental evaluation of the proposed methods with
large-scale logs from a commercial proactive system, showing that our method
significantly outperforms a strong baseline ranker deployed in the production
system.
– We conduct in-depth feature analysis, which provides insights for guiding
future feature design of the proactive ranking systems.
2 Related Work
There is a range of previous research related to our work that falls into different
categories including proactive information retrieval, information cards, search
personalization and recommender systems.
Proactive Information Retrieval. Rhodes and Maes [14] proposed the just-
in-time information retrieval agent (JITIR agent) that proactively retrieves and
presents information based on a person's local context. The motivation of modern proactive search systems is very similar to that of JITIRs, but the monitored user context and the presented content of modern proactive systems are more extensive than those of traditional JITIRs.
Information Cards. Web search has seen rapid growth in mobile search traffic,
where answer-like results on information cards are better choices than a ranked
list to address simple information needs given the relatively small size of screens
on mobile devices. For some types of information cards like weather, users could
directly find the target information from contents on cards without any clicks.
This problem was introduced by Li et al. [9] as “good abandonment”. Based
on this problem, Guo et al. [7] proposed a study of modeling interactions on
touch-enabled devices for improving Web search ranking. Lagun et al. [8] studied the browser viewport, defined as the visible portion of a web page on mobile phones, to provide a good measurement of search satisfaction in the absence of clicks.
clicks. For our experiments, we also consider viewport based dwell time to gen-
erate relevance labels for information cards to handle the good abandonment
problem.
Search Personalization. Proactive systems recommend highly personalized contents to users based on their interests and context. Hence, our work is also related to research on search personalization.
3 Method Overview
We adopt common IR terminology when we define the proactive search problem.
A proactive impression consists of a ranked list of information cards presented
to users together with the user interaction logs recording clicks and viewports.
Given a set of information cards {C1 , C2 , ..., Cn } and the corresponding relevance
labels, our task is to learn a ranking model R to rank the cards based on available
features θ and optimize a pre-defined metric E defined over the card ranked list.
We propose a framework for proactive ranking referred to as UMPRanker
(User Modeling based Proactive Ranker). Firstly we mine user interests from mul-
tiple user logs. Each user is distinguished by a unique and anonymized identifier
which is commonly used in these platforms. The information collected from these
different platforms forms the basis of our user modeling. Then we derive multi-
ple user interest features including entity based user interests, card type based
implicit feedback and user demographics based on the collected information. Infor-
mation cards are generated from multiple pre-defined information sources and
templates including weather, finance, news, calendar, places, event, sports, flight,
traffic, fitness, etc. We also extract card features from the associated URLs and
card types. Given user features and card features, we can train a learning to rank
model. Given a trigger context like particular time, location or event, information
cards are ranked by the model and pushed to the user’s device.
The information sources of user interests we consider include several types of user behavior recorded in logs.
Note that users have the right to choose whether they would like the services
to collect their behavior data. The logs we collected are from “opt-in” users only.
To represent user interests, we extract entities from the text content specified by
user behaviors. We can also represent information card topics by entities exacted
from card URLs. Entities in user profiles and cards are linked with entities in a
large scale knowledge base to get a richer representation of user interests.
This feature group is based on statistics of the user's interaction history with different card types, such as the average view time of each card type, the accept ratio of each card type, etc. This group of features aims at capturing individual user preferences for particular card types, for example news, based on the statistics of the historical interactions. Specifically, for each <user, card type> pair, the extracted features
include historical clickthrough rate (CTR), SAT CTR (i.e., clicks with more
than 30 seconds dwell time on landing pages), SAT view rate (i.e., card views
with more than 30 seconds), hybrid SAT rate (i.e., rate of either a SAT click or
a SAT view), view rate, average view time, average view speed, accept ratio of
the card suggestions, untrack ratio of the card type, ratio of the cards being a
suggestion. The details of these features are explained in Table 1.
Table 1. Card type based implicit feedback features.

Feature               Description
CTR                   Personal historical clickthrough rate of the card type
SATCTR                Personal historical SAT (landing page dwell time > 30 s) clickthrough rate of the card type
SATViewRate           Personal historical SAT (card view time > 30 s) view rate of the card type
ViewRate              Personal historical view rate of the card type
AverageViewTime       Personal historical average view time of the card type
AverageViewTimeSpeed  Personal historical average view time per pixel of the card type
AcceptRatio           Personal historical accept ratio when the card type was presented as a suggestion
UnTrackRatio          Personal historical ratio of untracking the card type
SuggestionRate        Personal historical ratio of seeing the card type being presented as a suggestion
Term Match. The exact match feature suffers from the feature sparsity problem. A better method is to treat Ui and Cj as two term sets. Then we can obtain two entity term distributions over Ui and Cj. The cosine similarity between these two entity term distributions becomes the term match feature.
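A minimal sketch of this feature, assuming unweighted term counts (the weighted variant in Table 2 simply scales the counts by entity impression counts):

import math
from collections import Counter

# Cosine similarity between the entity term distributions of U_i and C_j.
def cosine_term_match(user_terms, card_terms):
    u, c = Counter(user_terms), Counter(card_terms)
    dot = sum(u[t] * c[t] for t in u.keys() & c.keys())
    norm = math.sqrt(sum(v * v for v in u.values())) * math.sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm > 0 else 0.0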
Language Models. The feature LM Score is based on the language modeling
approach to information retrieval [16]. We treat the card entity term set as the
query and the user entity term set as the document. Then we compute the log
likelihood of generating card entity terms using a language model constructed
from user entity terms. We use Laplace smoothing in the computation of the language model score.
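A minimal sketch of the LMScore computation, with add-one (Laplace) smoothing over a vocabulary formed from both term sets; the exact vocabulary definition used in the paper is not stated, so it is an assumption here.

import math
from collections import Counter

# Log likelihood of generating the card's entity terms from a Laplace-smoothed
# unigram language model built on the user's entity terms.
def lm_score(user_terms, card_terms):
    counts = Counter(user_terms)
    total = sum(counts.values())
    vocab = len(set(user_terms) | set(card_terms))  # assumed vocabulary
    return sum(math.log((counts[t] + 1) / (total + vocab)) for t in card_terms)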
Word Embedding. We extract semantic similarity features between entities
based on word embeddings. Word embeddings [11,12] are continuous vector rep-
resentations of words learned from very large data sets based on neural networks.
The learned word vectors can capture the semantic similarity between words.
In our experiment, we trained a Word2Vec model using the skip-gram algo-
rithm with hierarchical softmax [12]. The training data was from the Wikipedia
English dump obtained on June 6th, 2015. Our model outputs vectors of size
200. The total number of distinct words is 1,425,833. We then estimate entity vectors based on word vectors. For entities that are phrases, we compute the average of the embeddings of the words within the entity phrase. After vector nor-
malization, we use the dot product of entity vectors to measure entity similarity.
To define features for the similarity of Ui and Cj , we consider feature variations
inspired by hierarchical clustering algorithms as shown in Table 2.
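The sketch below illustrates these features: entity vectors are averaged word vectors, normalized and compared by dot product, and the pairwise similarities between the entities of U_i and C_j are aggregated as in Table 2 (min, max, and unweighted average); the word_vectors lookup is an assumed interface, and the weighted-average variant is omitted for brevity.

import numpy as np

# Entity vector: normalized average of the word vectors of the words in the phrase.
def entity_vector(phrase, word_vectors):
    vecs = [word_vectors[w] for w in phrase.split() if w in word_vectors]
    if not vecs:
        return None
    v = np.mean(vecs, axis=0)
    n = np.linalg.norm(v)
    return v / n if n > 0 else None

def embedding_features(user_entities, card_entities, word_vectors):
    u_vecs = [v for v in (entity_vector(e, word_vectors) for e in user_entities) if v is not None]
    c_vecs = [v for v in (entity_vector(e, word_vectors) for e in card_entities) if v is not None]
    sims = [float(np.dot(u, c)) for u in u_vecs for c in c_vecs]
    if not sims:
        return {"WordEBDMin": 0.0, "WordEBDMax": 0.0, "WordEBDAvgNoWeight": 0.0}
    return {"WordEBDMin": min(sims),                      # single-link (as named in Table 2)
            "WordEBDMax": max(sims),                      # complete-link
            "WordEBDAvgNoWeight": sum(sims) / len(sims)}  # average-link, unweighted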
Entity Taxonomy in Knowledge Base. Another way to extract semantic simi-
larity features between entities is measuring the similarity of entity taxonomy [10].
Table 2. Entity based features.

Feature                     Description
RawMatchCount               The raw match count of entities by id in Ui and Cj
EMJaccardIndex              The Jaccard Index of entities matched by id in Ui and Cj
TMNoWeight                  The cosine similarity between two entity term distributions in Ui and Cj
TMWeighted                  Similar to TMNoWeight, but terms are weighted by impression count
LMScore                     The log likelihood of generating terms in Cj using a language model constructed from terms in Ui
WordEBDMin                  The similarity between Ui and Cj based on word embedding features (single-link)
WordEBDMax                  The similarity between Ui and Cj based on word embedding features (complete-link)
WordEBDAvgNoWeight          The similarity between Ui and Cj based on word embedding features (average-link, no weight)
WordEBDAvgWeighted          The similarity between Ui and Cj based on word embedding features (average-link, weighted)
KBTaxonomyLevel1            The similarity between Ui and Cj based on entity taxonomy similarity in level 1
KBTaxonomyLevel1Weighted    Similar to KBTaxonomyLevel1, but each entity is weighted by its impression counts
KBTaxonomyLevel2            The similarity between Ui and Cj based on entity taxonomy similarity in level 2
KBTaxonomyLevel2Weighted    Similar to KBTaxonomyLevel2, but each entity is weighted by its impression counts
As presented in Sect. 4, we link entities in the user interest profile with entities in
a large scale knowledge base. From the knowledge base, we can extract the entity
taxonomy which is the entity type information. Two entities without any common
terms could have similarities if they share some common entity types.
Table 3 shows entity taxonomy examples for “Kobe Bryant” and “Byron Scott”.
We can see that these two entities share common taxonomies like “basketball.
player”, “award.winner”. They also have their own special taxonomies.
“Kobe Bryant” has “olympics.athlete” in the taxonomies whereas “Byron Scott”
has the taxonomy named “basketball.coach”. Based on this observation, we can
measure the semantic similarity between two entities based on their taxonomies. Specifically, we measure the similarity of two entities based on the Jaccard index of the two corresponding taxonomy sets. Since all taxonomies have only two levels, we compute entity taxonomy similarity features at two different granularities. When we measure the similarity of Ui and Cj, we can compute the average similar-
Table 3. Examples of entity taxonomy for “Kobe Bryant” and “Byron Scott”.
ity of all entity pairs in these two entity sets. We compute a weighted version, where each entity is weighted by its impression count, and a non-weighted version of these features. In summary, in this feature group, we have four features that are listed in Table 2.
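The taxonomy features can be sketched as below; the taxonomy and weights inputs are assumed interfaces, and the way impression-count weighting enters the average is one plausible reading of the description rather than the paper's exact formulation.

# Jaccard index of two entities' taxonomy (type) sets.
def taxonomy_jaccard(tax_a, tax_b):
    a, b = set(tax_a), set(tax_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Average taxonomy similarity over all <user entity, card entity> pairs, optionally
# weighting each pair by the user entity's impression count.
def kb_taxonomy_feature(user_entities, card_entities, taxonomy, weights=None):
    pairs = [(u, c) for u in user_entities for c in card_entities]
    if not pairs:
        return 0.0
    if weights is None:
        return sum(taxonomy_jaccard(taxonomy[u], taxonomy[c]) for u, c in pairs) / len(pairs)
    w_total = sum(weights[u] for u, _ in pairs)
    weighted = sum(weights[u] * taxonomy_jaccard(taxonomy[u], taxonomy[c]) for u, c in pairs)
    return weighted / w_total if w_total else 0.0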
6 Experiments
We use real data from a commercial intelligent personal assistant for the experi-
ments. The training data is from one week between March 18th, 2015 and March
24th, 2015. The testing data is from one week between March 25th, 2015 and
March 31st, 2015. The statistics of card impressions and users are shown in
Table 5.
The user profiles represented by entities are built from multiple logs pre-
sented in Sect. 4. The time window of the user profiles is from March 18th, 2014
User demographics (UD) features:

Feature             Description
UserAge             Integer value of the user's age
UserGender          Binary value of the user's gender
UserLanguage        Integer value denoting the user's language
UserRegisterYears   Integer value denoting the number of years since the user's registration
CardAvgAge          Average age of all users who clicked the URLs on the card
CardAvgGender       Average gender value of all users who clicked the URLs on the card
AgeAbsDistance      The absolute distance in age between the user and all users who clicked card URLs
AgeRelDistance      The relative distance in age between the user and all users who clicked card URLs
GenderAbsDistance   The absolute distance in gender between the user and all users who clicked card URLs
GenderRelDistance   The relative distance in gender between the user and all users who clicked card URLs
to March 17th, 2015, so there is no overlap in time between the user profiles and
training/testing data. Since most proactive impressions have only one card with
a positive relevance label, we pick mean reciprocal rank (MRR) and NDCG@1 as the evaluation metrics.
SAT-View: Some types of cards do not require a click to satisfy users’ information
needs. For instance, users could scan the weather and temperature information
on the cards without any clicking behavior. Stock cards could also tell users
the real-time stock price of a company directly in the card content. Cards with
viewport duration ≥ 10 s are labeled as relevant and the others are non-relevant.
– UMPRanker-I (IF): The ranker from adding IF features on top of the features
being used in the production ranker.
– UMPRanker-IE (IF + EF): The ranker from adding IF and EF features on
top of the features being used in the production ranker.
– UMPRanker-IEU (IF + EF+ UD): The ranker from adding IF, EF and UD
features on top of the features being used in the production ranker.
Table 6. Comparison of different rankers with the production ranker. The gains and
losses are only reported in relative delta values to respect the proprietary nature of the
baseline ranker. All differences are statistically significant (p < 0.05) according to the
paired t-test.
NDCG@1 compared to the strong baseline ranker that was shipped to produc-
tion, which is very substantial. On top of the baseline features and IF features,
adding EF features, we were able to see significant larger gains of 2.37 % in MRR
and 2.38 % in NDCG@1, demonstrating the substantial additional values in the
entity-level modeling. Finally, with the UD features added, we were able to see
additional statistically significant gains, even though to a lesser extent, bringing the total improvements in both MRR and NDCG@1 to 2.39 %.
users' preferences on different card types based on the user's historical interactions with the intelligent assistant system. Entity based features like KBTaxonomyLevel1Weighted and LMScore are useful for improving proactive ranking by modeling user interests from user-engaged textual content with term matching
and semantic features. UD features, as shown in Table 7, are not as important
as IF and EF features. However, they can still contribute to a better proactive
ranking by capturing user preferences with user demographics information.
Table 8. Examples of reranked cards in the testing data. “IsSuccess” denotes the
inferred relevance labels based on SAT-Click or SAT-View with the timestamp denoted
by “Time”.
Acknowledgments. This work was done during Liu Yang’s internship at Microsoft
Research and Bing. It was supported in part by the Center for Intelligent Information
Retrieval and in part by NSF grant #IIS-1419693. Any opinions, findings and conclu-
sions or recommendations expressed in this material are those of the authors and do
not necessarily reflect those of the sponsor. We thank Jing Jiang and Jiepu Jiang for
their valuable and constructive comments on this work.
References
1. Agichtein, E., Brill, E., Dumais, S.: Improving web search ranking by incorporating
user behavior information. In: SIGIR 2006, pp. 19–26. ACM, New York (2006)
2. Allan, J., Croft, B., Moffat, A., Sanderson, M.: Frontiers, challenges, and oppor-
tunities for information retrieval: Report from SWIRL 2012, the second strategic workshop on information retrieval in Lorne. SIGIR Forum 46(1), 2–32 (2012)
3. Bennett, P.N., Radlinski, F., White, R., Yilmaz, E.: Inferring and using location
metadata to personalize web search. In: SIGIR 2011, July 2011
4. Bennett, P.N., White, R.W., Chu, W., Dumais, S.T., Bailey, P., Borisyuk, F., Cui,
X.: Modeling the impact of short- and long-term behavior on search personaliza-
tion. In: SIGIR 2012, pp. 185–194. ACM, New York (2012)
5. Burges, C., Ragno, R., Le, Q.: Learning to rank with non-smooth cost functions.
In: NIPS 2007. MIT Press, Cambridge, January 2007
6. Fox, S., Karnawat, K., Mydland, M., Dumais, S., White, T.: Evaluating implicit
measures to improve web search. ACM Trans. Inf. Syst. 23(2), 147–168 (2005)
7. Guo, Q., Jin, H., Lagun, D., Yuan, S., Agichtein, E.: Mining touch interaction
data on mobile devices to predict web search result relevance. In: SIGIR 2013, pp.
153–162 (2013)
8. Lagun, D., Hsieh, C.-H., Webster, D., Navalpakkam, V.: Towards better measure-
ment of attention and satisfaction in mobile search. In: SIGIR 2014, pp. 113–122.
ACM, New York (2014)
9. Li, J., Huffman, S., Tokuda, A.: Good abandonment in mobile and PC internet
search. In: SIGIR 2009, pp. 43–50. ACM, New York (2009)
184 L. Yang et al.
10. Lin, T., Mausam, Etzioni, O.: No noun phrase left behind: Detecting and typing
unlinkable entities. In: EMNLP-CoNLL 2012, pp. 893–903. Association for Com-
putational Linguistics, Stroudsburg (2012)
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint (2013). arxiv:1301.3781
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: NIPS 2013, pp.
3111–3119 (2013)
13. Resnick, P., Varian, H.R.: Recommender systems. Commun. ACM 40(3), 56–58
(1997)
14. Rhodes, B.J., Maes, P.: Just-in-time information retrieval agents. IBM Syst. J.
39(3–4), 685–704 (2000)
15. Shokouhi, M., Guo, Q.: From queries to cards: Re-ranking proactive card recom-
mendations based on reactive search history. In: SIGIR 2015, May 2015
16. Song, F., Croft, W.B.: A general language model for information retrieval. In:
CIKM 1999, pp. 316–321. ACM, New York (1999)
17. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Adv. Artif. Intell. 2009, 4:2–4:2 (2009)
18. Wang, L., Bennett, P.N., Collins-Thompson, K.: Robust ranking models via risk-
sensitive optimization. In: SIGIR 2012, pp. 761–770. ACM, New York (2012)
19. White, R.W., Chu, W., Hassan, A., He, X., Song, Y., Wang, H.: Enhancing per-
sonalized search by mining and modeling task behavior. In: WWW 2013, Republic
and Canton of Geneva, Switzerland, pp. 1411–1420 (2013)
20. Wu, Q., Burges, C.J., Svore, K.M., Gao, J.: Adapting boosting for information
retrieval measures. Inf. Retr. 13(3), 254–270 (2010)
21. Xu, S., Jiang, H., Lau, F.C.-M.: Mining user dwell time for personalized web search
re-ranking. In: IJCAI 2011, pp. 2367–2372 (2011)
Evaluation Methodology
Adaptive Effort for Search Evaluation Metrics
1 Introduction
Searchers wish to find more but spend less. To accurately measure their search
experience, we need to consider both the amount of relevant information they
found (gain) and the effort they spent (cost). In this paper, we use effort and cost
interchangeably because nowadays using search engines is mostly free of costs
other than users’ mental and physical effort (e.g., formulating queries, examining
result snippets, and reading result web pages). Other costs may become relevant
in certain scenarios – e.g., the price charged to search and access information in
a paid database – but we only consider users’ effort in this paper.
We show that a wide range of existing evaluation metrics can be summa-
rized as some form of gain/effort ratio. These metrics focus on modeling users’
gain [6,15] and interaction with a ranked list [2,6,13]—for example, nDCG [6],
GAP [15], RBP [13], and ERR [2]. However, they use simple effort models,
considering search effort as linear to the (expected) number of examined results.
This implicitly assumes that users spend the same effort to examine every result.
But evidence suggests that users usually invest greater effort on relevant results
than on non-relevant ones, e.g., users are more likely to click on relevant entries
[20], and they spend a longer time on relevant results [12].
To better model users’ search experience, we adapt these metrics to account
for different effort on results with different relevance grades. We examine two
approaches: a parametric one that simply employs a parameter for the ratio
of effort between relevant and non-relevant entries; and a time-based one that
measures effort by the expected time to examine the results, which is similar to
time-biased gain [17]. Both approaches model users’ effort adaptively according
to the results in the ranked list. We evaluate the adaptive effort metrics by
correlating with users’ ratings on search quality. Results show that the adaptive
effort metrics can better predict users’ ratings compared with existing ones.
$M_2 = E\left(\frac{gain}{effort}\right) = \sum_{j=1}^{k} P_{stop}(j) \cdot \frac{g_{stop}(j)}{e_{stop}(j)} = \sum_{j=1}^{k} P_{stop}(j) \cdot \frac{\sum_{i=1}^{j} g(i)}{\sum_{i=1}^{j} e(i)}$   (2)
Table 1 lists components for the M1 and M2 metrics discussed in this paper.
Note that this is only one possible way of explaining these metrics, while other
interpretations may also be reasonable. We use r(i) for the relevance of the ith
result and b(i) for its binary version, i.e., b(i) = 1 if r(i) > 0, otherwise b(i) = 0.
$GAP = \frac{N_r}{E(N_r)} \cdot \sum_{j=1}^{k} \frac{b(j)}{N_r} \cdot \frac{\sum_{i=1}^{j}\sum_{s=1}^{r(i)} g_s}{\sum_{i=1}^{j} 1}, \qquad E(N_r) = \sum_{m=1}^{r_{max}} N_m \sum_{s=1}^{m} g_s$   (3)
Rank-biased precision (RBP) [13] models that after examining a result, users
have probability p to examine the next result, and 1 − p to stop. Users always
examine the first result. RBP is an M1 metric. Users have $p^{i-1}$ probability to examine the $i$th entry. RBP and P@k have the same gain and effort function.
Note that the original RBP computes effort to an infinite rank ($E(effort) = \frac{1}{1-p}$). Here we measure both gain and effort to some cutoff $k$ ($E(effort) = \frac{1-p^k}{1-p}$). This
results in a slight numerical difference. But the two metrics are equivalent for
evaluation purposes because they are proportional when p and k are predefined.
We also extend P@k and RBP to consider graded relevance judgments using the gain function in GAP ($g(i) = \sum_{s=1}^{r(i)} g_s$). We call the extensions graded P@k (GP@k) and graded RBP (GRBP).
Reciprocal rank (RR) is an M2 metric where users always and only stop at rank t
(the rank of the first relevant result).
Expected reciprocal rank (ERR) [2] further models the chances that users
stop at different ranks while sequentially examining a ranked list. ERR models
that searchers, after examining the ith result, have probability R(i) to stop, and
$1 - R(i)$ to examine the next result. Chapelle et al. [2] define $R(i) = \frac{2^{r(i)} - 1}{2^{r_{max}}}$. In
order to stop at the jth result, users need to first have the chance to examine
the jth result (they did not stop after examining results at higher ranks) and
then stop after examining the jth result—$P_{stop}(j) = R(j)\prod_{m=1}^{j-1}(1 - R(m))$.
Both RR and ERR model that users always have 1 unit gain when they stop.
They do not have an explicit gain function for individual results, but model stop-
ping probability based on result relevance. To fit them into the M2 framework,
we define, when users stop at rank j, g(i) = 1 if i = j, otherwise g(i) = 0. For
both metrics, e(i) = 1, such that stopping at rank j costs j unit effort.
$ERR@k = \sum_{j=1}^{k} P_{stop}(j) \cdot \frac{1}{\sum_{i=1}^{j} 1}, \qquad P_{stop}(j) = R(j)\prod_{m=1}^{j-1}(1 - R(m))$   (4)
Discounted cumulative gain (DCG) [6] sums up each result’s gain in a ranked
list, with a discount factor 1/ log2 (i + 1) on the ith result. It seems that DCG
has no effort factor. But we can also consider DCG as a metric where a ranked
list of length k always costs the user a constant effort 1. We can rewrite DCG as
an M1 metric as Eq. 5. The log discount can be considered as the examination
probability. Each examined result costs users e(i) effort, such that E(effort) sums
up to 1. e(i) can be considered as a constant because it only depends on k.
$DCG@k = \sum_{i=1}^{k}\frac{2^{r(i)}-1}{\log_2(i+1)} = \frac{\sum_{i=1}^{k}\frac{2^{r(i)}-1}{\log_2(i+1)}}{\sum_{i=1}^{k}\frac{e(i)}{\log_2(i+1)}}, \qquad e(i) = \frac{1}{\sum_{i=1}^{k}\frac{1}{\log_2(i+1)}}$   (5)
The normalized DCG (nDCG) metric [6] is computed as the ratio of DCG
to IDCG (the DCG of an ideal ranked list). For both DCG and IDCG, E(effort) equals 1, which can be ignored when computing nDCG. However, E(effort) for DCG and IDCG can be different if we set e(i) adaptively for different results.
Section 2 explained many current metrics as users’ gain/effort ratio with a con-
stant effort on different results. This is oversimplified. Instead, we assign different
effort to results with different relevance grades. Let 0, 1, 2, ..., rmax be the possi-
ble relevance grades. We define an effort vector [e0 , e1 , e2 , ..., ermax ], where er is
the effort to examine a result with the relevance grade r.
We consider two ways to construct such effort vector in this paper. The first
approach is to simply differentiate the effort on relevant and non-relevant results
using a parameter er/nr . We set the effort to examine a relevant result to 1 unit.
$e_{r/nr}$ is the ratio of effort on a relevant result to a non-relevant one—the effort to examine a non-relevant result is $\frac{1}{e_{r/nr}}$. For example, if we consider three relevance grades ($r = 0, 1, 2$), the effort vector is $[\frac{1}{e_{r/nr}}, 1, 1]$. Here we restrict $e_{r/nr} \ge 1$—
relevant results cost more effort than non-relevant ones (because users are more
likely to click on relevant results [20] and spend a longer time on them [12]).
The second approach estimates effort based on observed user interaction from
search logs. Similar to time-biased gain [17], we measure effort by the amount
of time required to examine a result. We assume that, when examining a result,
users first examine its summary and make decisions on whether or not to click
on its link. If users decide to click on the link, they further spend time reading its
content. Equation 6 estimates $t(r)$, the expected time to examine a result with relevance $r$:
$t(r) = t_{summary} + P_{click}(r) \cdot t_{click}(r)$   (6)
where $t_{summary}$ is the time to examine a result summary, $t_{click}(r)$ is the time spent on a result with relevance $r$ after opening its link, and $P_{click}(r)$ is the chance to click on a result with relevance $r$ after examining its summary.
Table 2 shows the estimated time from a user study’s search log [9]. We use
this log to verify adaptive effort metrics. Details will be introduced in Sect. 4.
The log does not provide tsummary . Thus, we use the reported value of tsummary
in Smucker et al.’s article [17] (4.4 s). Pclick (r) and tclick (r) are estimated from
this log. The search log collected users’ eye movement data such that we can
estimate Pclick (r). The estimated time to examine Highly Relevant, Relevant,
and Non-relevant results is 37.6, 23.0, and 9.8 s, respectively.
We set the effort to examine a result with the highest relevance grade (2 for
this search log) to 1 unit. We set the effort on a result with the relevance grade $r$ to $\frac{t(r)}{t(r_{max})}$. The effort vector for this log is $[9.8/37.6, 23.0/37.6, 1] = [0.26, 0.61, 1]$.
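The construction can be sketched as follows; the expected_time function reflects the reconstructed form of Eq. 6 (an assumption, since the equation itself is not reproduced above), and the printed vector reproduces the values reported in the text.

# Minimal sketch of the time-based effort vector.
def expected_time(p_click, t_click, t_summary=4.4):
    # Assumed form of Eq. 6: summary examination time plus click-probability-weighted
    # reading time; t_summary follows Smucker et al. [17].
    return t_summary + p_click * t_click

def effort_vector(times):
    t_max = times[-1]                       # the highest grade costs 1 unit of effort
    return [round(t / t_max, 2) for t in times]

# Estimated examination times for grades 0, 1, 2: 9.8, 23.0, 37.6 seconds
print(effort_vector([9.8, 23.0, 37.6]))     # -> [0.26, 0.61, 1.0]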
3.2 Computation
The adaptive effort metrics are simply variants of the metrics in Table 1 using
the effort vectors introduced in Sect. 3.1—we replace e(i) by e(r(i)), i.e., the
effort to examine the $i$th result only depends on its relevance $r(i)$. For example, let a ranked list of five results have relevance $[0, 0, 1, 2, 0]$. Equation 7 computes adaptive P@k and RR using an effort vector $[\frac{1}{e_{r/nr}}, 1, 1]$:
$P_{adaptive} = \frac{2}{2 + 3\times\frac{1}{e_{r/nr}}}, \qquad RR_{adaptive} = \frac{1}{\frac{1}{e_{r/nr}} + \frac{1}{e_{r/nr}} + 1}$   (7)
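The example generalizes to the following sketch, in which the effort of each result is looked up by its relevance grade; it reproduces the two values in Eq. (7) for e_r/nr = 4 and is an illustration rather than the authors' evaluation code.

# Minimal sketch of adaptive P@k (gain/effort over the top k) and adaptive RR
# (unit gain divided by the cumulative effort spent up to the first relevant result).
def adaptive_precision_at_k(relevance, effort, k):
    top = relevance[:k]
    gain = sum(1 for r in top if r > 0)
    cost = sum(effort[r] for r in top)
    return gain / cost if cost > 0 else 0.0

def adaptive_rr(relevance, effort):
    cost = 0.0
    for r in relevance:
        cost += effort[r]
        if r > 0:
            return 1.0 / cost
    return 0.0

effort = [0.25, 1.0, 1.0]                           # parametric vector with e_r/nr = 4
ranked = [0, 0, 1, 2, 0]
print(adaptive_precision_at_k(ranked, effort, 5))   # 2 / (2 + 3 * 0.25) ~= 0.727
print(adaptive_rr(ranked, effort))                  # 1 / (0.25 + 0.25 + 1) ~= 0.667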
When we set different effort to results with different relevance grades, users’
effort is not linear to the (expected) number of examined results anymore, but
further depends on results’ relevance and positions in the ranked list. We look
into the same example, and assume an ideal ranked list [2, 2, 2, 1, 1]. In such case,
DCG has $E(effort) = \frac{1}{e_{r/nr}} + \frac{1}{e_{r/nr}\log_2 3} + \frac{1}{\log_2 4} + \frac{1}{\log_2 5} + \frac{1}{e_{r/nr}\log_2 6}$, but IDCG has $E(effort) = 1 + \frac{1}{\log_2 3} + \frac{1}{\log_2 4} + \frac{1}{\log_2 5} + \frac{1}{\log_2 6}$. The effort part is not trivial anymore
when we normalize adaptive DCG using adaptive IDCG (adaptive nDCG).
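A sketch of adaptive DCG and nDCG in this M1 form is shown below: both the expected gain and the expected effort use the log discount as the examination probability, with the per-result effort looked up from the effort vector; the function names are illustrative.

import math

# Minimal sketch: adaptive DCG as E(gain)/E(effort), and adaptive nDCG as the ratio
# of adaptive DCG on the ranked list to adaptive DCG on the ideal list.
def adaptive_dcg(relevance, effort, k=None):
    k = k if k is not None else len(relevance)
    gain = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevance[:k]))
    cost = sum(effort[r] / math.log2(i + 2) for i, r in enumerate(relevance[:k]))
    return gain / cost

def adaptive_ndcg(relevance, ideal, effort, k=None):
    return adaptive_dcg(relevance, effort, k) / adaptive_dcg(ideal, effort, k)

effort = [0.25, 1.0, 1.0]
print(adaptive_ndcg([0, 0, 1, 2, 0], [2, 2, 2, 1, 1], effort))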
Adaptive effort metrics have a prominent difference with static effort metrics.
When we replace a non-relevant result with a relevant one, the gain of the ranked
list does not increase for free in adaptive effort metrics. This is because the users’
effort on the ranked list also increases (assuming relevant items cost more effort).
Equation 8 rewrites M1 and M2 metrics as $\sum_{i=1}^{k} g(i) \cdot d(i)$. It suggests that, when discounting a result's gain, adaptive effort metrics consider users' effort on each result in the ranked list. For M1 metrics, $d(i) = \frac{P_{examine}(i)}{E(effort)}$. Increasing the effort at any rank will increase $E(effort)$, and penalize every result in the ranked list by a greater extent. For M2 metrics, $d(i)$ depends on the effort to stop at rank $i$ and each lower rank. Since $e_{stop}(j) = \sum_{m=1}^{j} e(m)$, $d(i)$ also depends on
users’ effort on each result in the ranked list. This makes the rank discounting
mechanism in adaptive effort metrics more complex than conventional ones. We
leave the analysis of such discounting mechanism for future work.
$M_1 = \sum_{i=1}^{k} g(i) \cdot \frac{P_{examine}(i)}{E(effort)}, \qquad M_2 = \sum_{i=1}^{k} g(i) \cdot \sum_{j=i}^{k} \frac{P_{stop}(j)}{e_{stop}(j)}$   (8)
The time-based effort vector looks similar to the time estimation in time-biased
gain (TBG) [17]. But we estimate tclick based on result relevance, while TBG uses
a linear model that depends on document length. We made this choice because
the former better correlates with tclick in the dataset used for evaluation.
Despite their similarity in time estimation, adaptive effort metrics and TBG
are motivated differently. TBG models “the possibility that the user stops at
some point by a decay function D(t), which indicates the probability that the
user continues until time t” [17]. The longer (the more effort) it takes to reach
a result, the less likely that users are persistent enough to examine the result.
Thus, we can consider TBG as a metric that models users’ examination behavior
(Pexamine ) adaptively according to the effort spent prior to examining a result.
U-measure [16] is similar to TBG. But the discount function d(i) is depen-
dent on the cumulative length of the texts users read after examining the ith
result. The more users have read (the more effort users have spent) when they
finish examining a result, the less likely the result will be useful. This seems
a reasonable heuristic, but it remains unclear what the discounting function
models.
In contrast to TBG, adaptive effort metrics retain the original examination
models (Pexamine and Pstop ) in existing metrics, but further discount the results’
gain by the effort spent. The motivation is that for each unit of gain users acquire,
we need to account for the cost (effort) to obtain that gain (and assess whether
or not it is worthwhile). Comparing to U-measure, our metrics are different in
that a result’s gain is discounted based on not only what users examined prior
to the result and for that result, but also those examined afterwards (as Eq. 8
shows). The motivation is that user experience is derived from and measured
for searchers’ interaction with the ranked list as a whole—assuming a fixed
contribution for an examined result regardless of what happened afterwards
(such as in TBG and U-measure) seems oversimplified.
Therefore, we believe TBG, U-measure, and the proposed metrics all consider
adaptive effort, but from different angles. This leaves room to combine them.
4 Evaluation
We evaluate a metric by how well it correlates with and predicts user perception
on search quality. By modeling search effort adaptively, we expect the metrics
can better indicate users’ search experience. We use data from a user study [9] to
examine adaptive effort metrics.¹ The user study asked participants to use search
engines to work on some search tasks, and then rate their search experience and
judge relevance of results. The dataset only collected user experience in a search
session, so we must make some assumptions to verify metrics for a single query.
Relevance of results were judged at three levels: Highly Relevant (2), Relevant
(1), or Non-relevant (0). Users rated search experience by answering: how well
do you think you performed in this task? Options are: very well (5), fairly well
¹ The dataset and source code for replicating our experiments can be accessed at https://github.com/jiepujiang/ir_metrics/.
(4), average (3), rather badly (2), and very badly (1). Users rated 22 sessions as
very well, 27 as fairly well, 22 as average, 7 as rather badly, and 2 as very badly.
When evaluating a metric, we first use it to score each query in a session.
We use the average score of queries as an indicator for the session’s perfor-
mance. We assess the metric by how well the average score of queries in a ses-
sion correlates with and predicts users’ ratings on search quality for that session.
This assumes that average quality of queries in a session indicates that session’s
quality.
We measure correlations using Pearson’s r and Spearman’s ρ. In addition, we
evaluate a metric by how well it predicts user-rated performance. This approach
was previously used to evaluate user behavior metrics [8]. For each metric, we fit
a linear regression model (with intercept). The dependent variable is user-rated
search performance in a session. The independent variable is the average metric
score of queries for that session. We measure the prediction performance by
normalized root mean square error (NRMSE). We produce 10 random partitions
of the dataset, and perform 10-fold cross validation on each partition. This yields
prediction results on 100 test folds. We report the mean NRMSE values of the
100 folds and test statistical significance using a two-tail paired t-test.
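The protocol can be sketched as follows (not the authors' code): the per-session average metric score is correlated with the user rating, and prediction error is estimated with a cross-validated one-variable linear regression; normalizing the RMSE by the rating range is one common convention and an assumption here, as the normalizer is not spelled out above.

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Minimal sketch: correlation and cross-validated NRMSE of a metric against ratings.
def evaluate_metric(session_scores, session_ratings, seed=0):
    x = np.asarray(session_scores, dtype=float).reshape(-1, 1)  # avg. metric score per session
    y = np.asarray(session_ratings, dtype=float)                # user-rated performance (1-5)
    r, _ = pearsonr(x.ravel(), y)
    rho, _ = spearmanr(x.ravel(), y)
    errors = []
    for train, test in KFold(n_splits=10, shuffle=True, random_state=seed).split(x):
        model = LinearRegression().fit(x[train], y[train])
        pred = model.predict(x[test])
        errors.append(np.sqrt(np.mean((pred - y[test]) ** 2)) / (y.max() - y.min()))
    return r, rho, float(np.mean(errors))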
5 Experiment
5.1 Parameters and Settings
For each metric in Table 1, we compare the metric using static effort with two
adaptive versions using the parametric or time-based effort vector. We evaluate
to a cutoff rank k = 9, because the dataset shows only 9 results per page.
For GP@k, GAP, and GRBP, we set the distribution of gs to P (s = 1) = 0.4
and P (s = 2) = 0.6. This parameter yields close to optimal correlations for most
metrics. For RBP and GRBP, we examine p = 0.8 (patient searcher) and p = 0.6
(less patient searcher). We set er/nr to 4 (the effort vector is thus [0.25, 1, 1]).
We compare with TBG [17] and U-measure [16]. The original TBG predicted
time spent using document length. However, in our dataset, we did not find any
significant correlation between the two (r = 0.02). Instead, there is a weak but
significant correlation between document relevance and time spent (r = 0.274,
p < 0.001). We suspect this is because our dataset includes mostly web pages,
while Smucker et al. [17] used a news corpus [18]. Web pages include many
navigational texts, which makes it difficult to assess the size of the main content.
Thus, when computing TBG, we set document click probability and expected
document examine time based on the estimation in Table 2. The dataset does not
provide document save probability. Thus, we set this probability and parameter h
by a brute force scan to maximize Pearson’s r in the dataset. The save probability
is set to Psave (r = 1) = 0.2 and Psave (r = 2) = 0.8. h is set to 31. Note that this
corresponds to a graded-relevance version of TBG well tuned on our dataset.
To be consistent with TBG, we also compute U-measure [16] based on time
spent. We set $d(i) = \max(0, 1 - \frac{t(i)}{T})$, where $t(i)$ is the expected total time spent
after users examined the ith result, and T is a parameter similar to L in the
original U-measure. We set T to maximize Pearson’s r. T is set to 99 s.
5.2 Results
Table 3 reports Pearson’s r and NRMSE for metrics. We group results as follows:
– Block A: metrics that do not consider graded relevance and rank discount.
– Block B: metrics that consider graded relevance, but not rank discount.
– Block C: metrics that consider rank discount, but not graded relevance.
– Block D: metrics that consider both graded relevance and rank discount.
– Block S: session-level metrics (for reference only).
Following Kanoulas et al.’s work [10], we set b = 2 and bq = 4 in sDCG and
nsDCG. For esNDCG, we set parameters to maximize Pearson’s r—Pdown = 0.7
and Preform = 0.8. As results show, for most examined metrics (Blocks A, B, C,
and D), their average query scores significantly correlate with users’ ratings on
search quality in a session. The correlations are similarly strong compared with
the session-level metrics (Block S). This verifies that our evaluation approach is
reasonable—average query quality in a session does indicate the session’s quality.
Adaptive Effort Vs. Static Effort. As we report in Table 3 (Block D), using
a parametric effort vector (er/nr = 4) in GBRP, ERR, DCG, and nDCG can
improve metrics’ correlations with user-rated performance. The improvements in
Pearson’s r range from about 0.03 to 0.06. The adaptive metrics with er/nr = 4
also yield lower NRMSEs in predicting users’ ratings compared with the static
effort ones. The differences are significant except DCG (p = 0.078).
Such improvements seem minor, but are in fact meaningful progress. Block
A stands for metrics typically used before 2000. Since 2000, we witness metrics
on modeling graded relevance (e.g., nDCG, GAP, and ERR) and rank discount
(e.g., nDCG, RBP, and ERR). These work improve Pearson’s r from 0.326 (P@k,
the best “static” in Block A) to 0.405 (GRBP, p = 0.8, the best “static” in Blocks
B, C, and D). The proposed adaptive effort metrics further improve Pearson’s
r from 0.405 to 0.463 (GRBP, p = 0.6, the best in the table). The magnitude
of improvements in correlating with user-rated performance, as examined in our
dataset, are comparable to those achieved by modeling graded relevance and rank
discount. We can draw similar conclusions by looking at NRMSE. Although it
requires further verification using larger datasets and query-level user ratings,
our results at least suggest that the improvements are not negligible.
The best performing metric in our evaluation is adaptive GRBP (p = 0.6)
with er/nr = 4. It also outperforms well-tuned TBG and U-measure. All these
metrics consider adaptive search effort, but from different angles. The adap-
tive GRBP metric shows stronger Pearson’s r with user-rated performance,
and yields significantly lower NRMSE than TBG (p < 0.001) and U-measure
(p < 0.05).
The preferable results of the adaptive effort metrics confirm that users do not
only care about how much relevant information they found, but are also con-
cerned with the amount of effort they spent during search. By modeling search
effort on relevant and non-relevant results, we can better measure users’ search
effort, which is the key to the improvements in correlating with user experience.
Parametric Effort Vector Vs. Time-Based One. Comparing the two ways
of constructing effort vectors, results suggest that the time-based effort vector is
not as good as the simple parametric one (er/nr = 4). Compared with the time-
based effort vector, metrics using the parametric effort vector almost always yield
stronger correlations and lower NRMSE in prediction. Compared with metrics
using static effort, the time-based effort vector can still improve GRBP, DCG,
and nDCG’s correlations, but it fails to help ERR (Table 3, Block D). In addition,
it can only significantly reduce NRMSE for GRBP (p = 0.6) and nDCG.
We suspect this is because time is only one aspect of measuring search effort.
It has an advantage—we can easily measure time-based effort from search logs—
but it does not take into account other effort such as making decisions and
cognitive burden. Whereas other types of effort are also difficult to determine
and costly to measure. Thus, it seems more feasible to tune parameters in the
effort vector to maximize correlations with user-rated performance (or minimize
prediction errors), if such data is available. In the presented results, we only
Dunlop [5] extended the metric to measure the expected time required to find
a certain amount of relevant results. Kazai et al. [11] proposed effort-precision,
the ratio of effort (the number of examined results) to find the same amount of
relevant information in the ranked list compared with in an ideal list. But these
works all assume that examining different results involves the same effort.
This paper presents a study on search effectiveness metrics using adaptive
effort components. Previous work on this topic is limited. TBG [17] considered
adaptive effort, but applies it to the discount function. U-measure [16] is similar
to TBG, and possesses the flexibility of handling SERP elements other than
documents (e.g., snippets, direct answers). De Vries et al. [4] modeled searchers’
tolerance to effort spent on non-relevant information before stopping viewing an
item. Villa et al. [19] found that relevant results cost assessors more effort to
judge than highly relevant and non-relevant ones. Yilmaz et al. [21] examined
differences between searchers’ effort (dwell time) and assessors’ effort (judging
time) on results, and features predicting such effort. Our study shows that the
adaptive effort metrics can better indicate users’ search experience compared
with conventional ones (with static effort).
The dataset for these experiments was based on session-level user ratings and
required that we make assumptions to verify query-level metrics. One important
area of future research is to extend this study to a broader set of queries of
different types to better understand the applicability of this research. Another
direction for future research is to explore the effect of different effort levels,
for example, assigning different effort for Relevant and Highly Relevant results
rather than just to distinguish relevant from non-relevant.
Acknowledgment. This work was supported in part by the Center for Intelligent
Information Retrieval and in part by NSF grant #IIS-0910884. Any opinions, findings
and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect those of the sponsor.
References
1. Carterette, B.: System effectiveness, user models, and user utility: a conceptual
framework for investigation. In: SIGIR 2011, pp. 903–912 (2011)
2. Chapelle, O., Metlzer, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for
graded relevance. In: CIKM 2009, pp. 621–630 (2009)
3. Cooper, W.S.: Expected search length: a single measure of retrieval effectiveness
based on the weak ordering action of retrieval systems. Am. Documentation 19(1),
30–41 (1968)
4. De Vries, A.P., Kazai, G., Lalmas, M.: Tolerance to irrelevance: a user-effort ori-
ented evaluation of retrieval systems without predefined retrieval unit. RIAO 2004,
463–473 (2004)
5. Dunlop, M.D.: Time, relevance and interaction modelling for information retrieval.
In: SIGIR 1997, pp. 206–213 (1997)
6. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques.
ACM Trans. Inf. Syst. 20(4), 422–446 (2002)
Adaptive Effort for Search Evaluation Metrics 199
7. Järvelin, K., Price, S.L., Delcambre, L.M.L., Nielsen, M.L.: Discounted cumulated
gain based evaluation of multiple-query IR sessions. In: Macdonald, C., Ounis, I.,
Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp.
4–15. Springer, Heidelberg (2008)
8. Jiang, J., Hassan Awadallah, A., Shi, X., White, R.W.: Understanding and pre-
dicting graded search satisfaction. In: WSDM 2015, pp. 57–66 (2015)
9. Jiang, J., He, D., Allan, J.: Searching, browsing, and clicking in a search session:
Changes in user behavior by task and over time. In: SIGIR 2014, pp. 607–616
(2014)
10. Kanoulas, E., Carterette, B., Clough, P.D., Sanderson, M.: Evaluating multi-query
sessions. In: SIGIR 2011, pp. 1053–1062 (2011)
11. Kazai, G., Lalmas, M.: Extended cumulated gain measures for the evaluation of
content-oriented XML retrieval. ACM Trans. Inf. Syst. 24(4), 503–542 (2006)
12. Kelly, D., Belkin, N.J.: Display time as implicit feedback: Understanding task
effects. In: SIGIR 2004, pp. 377–384 (2004)
13. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effective-
ness. ACM Trans. Inf. Syst. 27(1), 2:1–2:27 (2008)
14. Robertson, S.E.: A new interpretation of average precision. In: SIGIR 2008, pp.
689–690 (2008)
15. Robertson, S.E., Kanoulas, E., Yilmaz, E.: Extending average precision to graded
relevance judgments. In: SIGIR 2010, pp. 603–610 (2010)
16. Sakai, T., Dou, Z.: Summaries, ranked retrieval and sessions: a unified framework
for information access evaluation. In: SIGIR 2013, pp. 473–482 (2013)
17. Smucker, M.D., Clarke, C.L.: Time-based calibration of effectiveness measures. In:
SIGIR 2012, pp. 95–104 (2012)
18. Smucker, M.D., Jethani, C.P.: Human performance and retrieval precision revis-
ited. In: SIGIR 2010, pp. 595–602 (2010)
19. Villa, R., Halvey, M.: Is relevance hard work?: Evaluating the effort of making
relevant assessments. In: SIGIR 2013, pp. 765–768 (2013)
20. Yilmaz, E., Shokouhi, M., Craswell, N., Robertson, S.E.: Expected browsing utility
for web search evaluation. In: CIKM 2010, pp. 1561–1564 (2010)
21. Yilmaz, E., Verma, M., Craswell, N., Radlinski, F., Bailey, P.: Relevance and effort:
an analysis of document utility. In: CIKM 2014, pp. 91–100 (2014)
Evaluating Memory Efficiency and Robustness
of Word Embeddings
1 Introduction
Word embeddings, also referred to as “word vectors” [7], capture syntactic and
semantic properties of words solely from raw natural text corpora without human
intervention or language-dependent preprocessing. In natural language texts, the co-occurrence of words in the same context depends on the syntactic form and meaning of the individual words. In word embeddings, the various nuances of word-to-word relations are distributed across several dimensions in vector space. These vector spaces are high-dimensional so as to provide enough degrees of freedom for hundreds of thousands of words and to allow the relative arrangement of their embeddings to reflect as many of the pairwise relations found in the corpus statistics as possible. However, embeddings carry a large amount of information about words that is hard to understand, interpret, and quantify, and that may even be redundant or non-informative.
The NLP community has successfully exploited these embeddings in recent years, e.g., [3,6]. However, the gain in task accuracy brings the downside that high-dimensional, continuous-valued word vectors require a large amount of memory. Moreover, embeddings are trained by a fixed-size network architecture that sweeps through a huge text corpus. Consequently, the total number of parameters in the embedding matrix is implicitly defined a priori. Further, there is no natural transition to more memory-efficient embeddings by which one could trade accuracy for memory. This is particularly limiting for NLP applications on resource-limited devices, where memory is a scarce resource. An embedding matrix with 150,000 vocabulary words can easily require 60–180 Megabytes of memory, which is inconvenient to transfer to and store in a browser or mobile application. This restriction motivates post-processing methods that derive robust and memory-efficient word vectors from a trained embedding matrix.
In this paper, we investigate three post-processing methods for word vectors
trained with the Skip-Gram algorithm that is implemented in the Word2Vec
software toolkit1 [7]. The employed post-processing methods are (i) dimensional-
ity reduction (PCA), (ii) parameter reduction (Pruning) and (iii) Bit-resolution
reduction (Truncation). To isolate the effects on embeddings with different sizes,
sparsity levels and resolutions, we employ intrinsic evaluation tasks based on
word relatedness and abstain from extrinsic classification tasks. Our work makes
the following contributions:
2 Related Work
3 Methodology
The word embeddings we use in our experiments are obtained from Mikolov's Skip-Gram algorithm [7]. As a recent study [9] revealed, the algorithm factorizes an implicit word-context matrix whose entries are the pointwise mutual information (PMI) of word-context pairs shifted by a constant offset. This PMI-matrix
M ∈ R^{|V|×|V|} is factorized into a word embedding matrix W ∈ R^{|V|×d} and a context matrix C ∈ R^{d×|V|}, where |V| is the number of words in the vocabulary and d is the number of dimensions of each word vector. The context matrix is only required during training and is usually discarded afterwards. The result of optimizing the Skip-Gram objective is that word vectors (rows in W) have high cosine-similarity in case the words are syntactically or semantically similar. Besides that, the word vectors are dense and have significantly fewer dimensions than there are context words (columns in M). With sufficiently large d, the PMI-matrix could be perfectly reconstructed
from its factors W and C, and thus provide the most accurate information about
word co-occurrences in a corpus [6]. However, increasing the dimensionality d of
word vectors also increases the amount of memory required to store the embed-
ding matrix W . When using word embeddings in an application, we do not aim
for a perfect reconstruction of the PMI-matrix but for reasonably accurate word
vectors that reflect word similarities and word relations of language. Therefore,
a more memory-efficient, yet accurate version of W would be desirable.
More formally, we want a mapping τ from the full embedding matrix W to Ŵ = τ(W), where Ŵ can be stored more efficiently while its word vectors remain about as accurate as the original vectors in W. For the vectors in Ŵ to have an accuracy loss as low as possible, the word vectors in W must be robust against the mapping function τ. We consider W robust against the transformation τ if the loss of τ(W) relative to W is small across very different evaluation tasks. A memory reduction through τ can be induced by reducing the number of dimensions, the number of effective parameters, or the parameters' Bit-resolution. Accordingly, we employed three orthogonal post-processing methods that can be categorized into dimensionality-based, parameter-based, and resolution-based approaches:
Fig. 1. Schematic overview of the three post-processing approaches applied to the |V| × d embedding matrix: dimensionality-based (PCA), parameter-based (Pruning), and resolution-based (Bit-Truncation).
• Dimensionality-based: these methods map the word vectors into a space with fewer dimensions, either a linear subspace (as PCA does) or a nonlinear manifold. Both the computational overhead for estimating the nonlinear components and the memory overhead for storing the inverse of the mapping alongside the transformed embeddings are high.
• Parameter-based: Whereas dimensionality-based methods change the bases of the embedding space, parameter-based methods leave the structure untouched but directly modify individual parameter values in the embedding matrix: τ_par : R^{|V|×d} → R^{|V|×d}, where τ_par is supposed to map most values to zero and leave only a few non-zero elements in the output matrix. For instance, a simple pruning strategy can be used. The output of τ_par is then a sparse matrix that can be stored more efficiently.
• Resolution-based: τ_res : R^{|V|×d} → {0, ..., r − 1}^{|V|×d}, where r ∈ N+ is the resolution of the coordinate axes. With discrete coordinates, values can be stored at a lower Bit-precision. Resolution-based methods discretize the coordinate axes into distinct intervals and thus reduce the resolution of the word vectors. For instance, the Bit-Truncation method subdivides the embedding space into regions of equal size.
In this work, we select one method of each category. In particular, we
explore the robustness and memory efficiency of embeddings after applying PCA-
reduction, Pruning and Bit-Truncation. In the following section we describe the
selected methods along with the rationale for the selection (see Fig. 1).
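To make the three categories concrete, the following minimal NumPy sketch (not the authors' implementation; the global min–max discretization for Bit-Truncation and the magnitude threshold for Pruning are simplifying assumptions) illustrates one possible mapping τ of each type:

import numpy as np

def pca_reduce(W, d_hat):
    # dimensionality-based: project the centered matrix onto its d_hat principal components
    Wc = W - W.mean(axis=0)
    U, S, Vt = np.linalg.svd(Wc, full_matrices=False)
    return Wc @ Vt[:d_hat].T                      # shape (|V|, d_hat)

def prune(W, fraction):
    # parameter-based: zero out the smallest-magnitude entries; the result can be stored sparsely
    threshold = np.quantile(np.abs(W), fraction)
    W_hat = W.copy()
    W_hat[np.abs(W_hat) < threshold] = 0.0
    return W_hat

def truncate_bits(W, bits):
    # resolution-based: discretize each value into 2**bits equally sized intervals (bits <= 8 here)
    lo, hi = W.min(), W.max()
    levels = 2 ** bits
    codes = np.floor((W - lo) / (hi - lo) * (levels - 1)).astype(np.uint8)
    return lo + (codes.astype(np.float32) + 0.5) * (hi - lo) / (levels - 1)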
4 Experimental Setup
In all experiments we used word vectors estimated with the Skip-Gram method of
the word2vec-toolkit from a text corpus containing one billion words. The cor-
pus was collected from the latest snapshot of English Wikipedia articles2 . After
removing words that appeared fewer than 100 times, the vocabulary contained 148,958 words, both uppercase and lowercase. We used a symmetric window covering k = 9 context words and chose the negative-sampling approximation to estimate the error from neg = 20 noise words. With this setup, we computed word vectors of several sizes, d ∈ {50, 100, 150, 300, 500}. After
training, all vectors are normalized to unit length. To evaluate the robustness
and efficiency of word vectors after applying post-processing, we compare PCA-
reduction, Pruning and Bit-Truncation on three types of intrinsic evaluation
tasks: word relatedness, word analogy and linguistic properties of words. In each
of these tasks, we use two different datasets.
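For orientation, a roughly equivalent training configuration could be written with the gensim library (gensim >= 4.0); this is a hedged sketch rather than the original word2vec C toolkit invocation, and the corpus file name is a placeholder:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("enwiki-tokenized.txt")   # placeholder: one tokenized sentence per line
model = Word2Vec(sentences,
                 sg=1,             # Skip-Gram
                 vector_size=300,  # one of d in {50, 100, 150, 300, 500}
                 window=9,         # symmetric context window, k = 9
                 negative=20,      # negative sampling with neg = 20 noise words
                 min_count=100,    # drop words occurring fewer than 100 times
                 workers=4)
W = model.wv.get_normed_vectors()  # unit-length word vectors, shape (|V|, d)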
Word Relatedness: The WordSim353 (WS353) [13] and MEN [14] datasets are
used to evaluate pairwise word relatedness. Both consist of pairs of English
words, each of which has been assigned a relatedness score by human evaluators.
The WordSim353 dataset contains 353 word pairs with scores averaged over
judgments of at least 13 subjects. For the MEN dataset, a single annotator
ranked each of the 3000 word-pairs relative to each of 50 randomly sampled
word-pairs. The evaluation metric is the correlation (Spearman’s ρ) between the
human ratings and the cosine-similarities of word vectors.
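A sketch of this evaluation step, assuming unit-length vectors in a word-to-vector dictionary and a list of (word1, word2, human score) triples; out-of-vocabulary pairs are simply skipped:

import numpy as np
from scipy.stats import spearmanr

def relatedness_correlation(word_vectors, pairs):
    # pairs: iterable of (word1, word2, human_score)
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in word_vectors and w2 in word_vectors:
            human.append(score)
            # vectors are unit length, so the dot product equals the cosine similarity
            model.append(float(np.dot(word_vectors[w1], word_vectors[w2])))
    rho, _ = spearmanr(human, model)
    return rho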
Word Analogy: The word analogy task is more sensitive to changes of the
global structure in embedding space. It is formulated as a list of questions of
the form “a is to â as b is to b̂”, where b̂ is hidden and has to be guessed from
the vocabulary. The dataset we use here was proposed by Mikolov et al. [8]
and consists of 19544 questions of this kind. About half of them are morpho-
syntactical (wa-syn) (loud is to louder as tall is to taller ) and the other half
semantic (wa-sem) questions (Cairo is to Egypt as Stockholm is to Sweden).
It is assumed that the answer to a question can be retrieved by exploiting the
relationship a → â and applying it to b. Since Word2Vec-embeddings exhibit
a linear structure in embedding space, word relations are consistently reflected
2 https://dumps.wikimedia.org/enwiki/20150112/.
in sums and differences of their vectors. Thus, the answer to an analogy question
is given by the target word w_t whose embedding is closest to w_q = â − a + b
with respect to the cosine-similarity. The evaluation metric is the percentage of
questions that have been answered with the expected word.
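A minimal sketch of this retrieval step, assuming a unit-length embedding matrix W and a word-to-row-index vocabulary (the three question words themselves are excluded from the candidate answers, as is common practice):

import numpy as np

def answer_analogy(W, vocab, a, a_hat, b):
    inv_vocab = {i: w for w, i in vocab.items()}
    q = W[vocab[a_hat]] - W[vocab[a]] + W[vocab[b]]
    q = q / np.linalg.norm(q)
    sims = W @ q                       # cosine similarities to all vocabulary words
    for idx in (vocab[a], vocab[a_hat], vocab[b]):
        sims[idx] = -np.inf            # do not return the question words
    return inv_vocab[int(np.argmax(sims))]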
Linguistic Properties: Schnabel et al. [15] showed that results from intrinsic
evaluations are not always consistent with results on extrinsic evaluations. There-
fore, we include the recently proposed QVEC-evaluation3 [16] as additional task.
This evaluation uses two dictionaries of words, annotated with linguistic prop-
erties: a syntactic (QVEC-syn) dictionary (e.g., ptb.nns, ptb.dt) and a seman-
tic (QVEC-sem) dictionary (e.g., verb.motion, noun.position). The proposed
evaluation method assigns to each embedding dimension the linguistic property
that has the highest correlation across all shared words. The authors showed that
the sum over all correlation values can be used as an evaluation measure for word
embeddings. Moreover, they showed that this score has high correlation with the
accuracy the same embeddings achieve on real-world classification tasks.
5 Results
Since we evaluate the robustness of word embeddings against post-processing,
we report the relative loss induced by applying a post-processing method. The
loss is measured as the difference between the score of original embeddings and
the score of the post-processed embeddings. In the case of the word relatedness task, the score is the Spearman correlation; on the word analogy task, it is the accuracy; and on the linguistic properties task, it is the output of the QVEC evaluation method. We divide the loss by the score of the original embeddings, i.e., relative loss = (s_orig − s_post)/s_orig, to obtain a value that is comparable across tasks.
Fig. 2. Mean relative loss of embeddings after (a) PCA: percentage of removed dimen-
sions, (b) Pruning: percentage of removed parameters and (c) Bit-Truncation: remain-
ing Bits. Scores on the QVEC datasets are not shown for PCA since they are not
comparable across different word vector sizes.
Fig. 3. Relative loss of embeddings on the syntactic word analogy dataset (wa-syn)
after PCA (a), Pruning (b) and Bit-Truncation (c).
These results suggest that neither the number of dimensions or parameters nor the continuous values alone, but rather the number of distinguishable regions in embedding space, is crucial for accurate word embeddings.
With a sufficiently large Bit-resolution, the accuracy of all embedding sizes approaches the same level as with continuous values. Thus, we can confirm the finding in [12] also for Skip-Gram embeddings: the same accuracy can be achieved with discretized values at sufficiently large resolutions. Additionally, we note that this observation holds not only for cosine-similarity on word relatedness tasks but also for vector arithmetic on the word analogy task and for QVEC on the linguistic properties task.
To summarize, Skip-Gram word embeddings can be stored more efficiently using a post-processing method that reduces the number of distinguishable regions R = (2^B)^d in embedding space. Pruning does so by producing increasingly large zero-valued regions around each coordinate axis (2^B − const.). PCA does so by mapping the word vectors into an embedding space with fewer dimensions d̂ < d. And Bit-Truncation directly lowers the resolution of each coordinate by constraining the Bit-resolution to B̂ < B.
6 Discussion
Table 1. Answer words for several country-currency analogy questions from word
embeddings at different resolutions. At the highest resolutions, all answer words are currencies.
(see evaluation data online^4). Thus, in terms of memory, it can be more efficient to first train high-dimensional embeddings and afterwards reduce them with PCA to the desired size. The resolution-based approach provides the greatest potential for memory savings. With only 8-Bit precision per value, there is no loss on any of the tasks. A straightforward implementation can thus fit the whole embedding matrix in only 25 % of the memory.
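The memory argument can be illustrated as follows (a small sketch assuming 32-bit floats for the original matrix and the simple global min–max 8-Bit code from above; the random matrix merely stands in for the trained embeddings):

import numpy as np

W = np.random.randn(148_958, 300).astype(np.float32)   # stand-in for the trained embeddings
codes = np.floor((W - W.min()) / (W.max() - W.min()) * 255).astype(np.uint8)

print(W.nbytes / 2**20)       # ~170 MB at 32-Bit precision
print(codes.nbytes / 2**20)   # ~43 MB at 8-Bit precision, i.e. 25 % of the original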
Resolution and Semantic Transition: Another observation is depicted in
Table 1. On the word analogy dataset, the transition from lower to higher-
precision values not only yields increasingly better average accuracy but also
corresponds to a semantic transition from lower to higher relatedness. Even if
the embeddings’ values have only 3-Bit precision, the retrieved answer words are
not totally wrong but still in some kind related to the expected answer word. It
seems that some notions of meaning are encoded on a finer scale in embedding
space and that these require more Bits to remain distinguishable.
For coarse resolutions (3-Bit) the regions in embedding space are too large
to allow an identification of a country’s currency. Because there are many words
within the same distance to the target location, the most frequent one is retrieved
as answer to the question. As the resolution increases, regions get smaller and
thus more nuanced distances between word embeddings emerge, which yields
not only increasingly accurate but also progressively more related answers.
7 Conclusion
In this paper, we explored three methods to post-process Skip-Gram word
embeddings in order to identify means to reduce the amount of memory required
to store the embedding matrix. Therefore, we evaluated the robustness of embed-
dings against a dimensionality-based (PCA), parameter-based (Pruning) and a
resolution-based (Bit-Truncation) approach. The results indicate that embeddings are most robust against Bit-Truncation and PCA-reduction and that preserving the number of distinguishable regions in embedding space is key
4 http://tinyurl.com/jj-ecir2016-eval.
for obtaining memory-efficient (75 % reduction) and accurate word vectors. Especially resource-limited devices can benefit from these compact high-quality word features to improve NLP tasks under memory constraints.
Acknowledgments. The presented work was developed within the EEXCESS project
funded by the European Union Seventh Framework Programme FP7/2007-2013 under
grant agreement number 600601.
References
1. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of seman-
tics. J. Artif. Intell. Res. 37(1), 141–188 (2010)
2. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language
model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
3. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12,
2493–2537 (2011)
4. Mnih, A., Hinton, G.: Three new graphical models for statistical language mod-
elling. In: ICML, pp. 641–648. ACM, June 2007
5. Huang, E.H., Socher, R., Manning, C.D., Ng, A.Y.: Improving word representations
via global context and multiple word prototypes. In: 50th Annual Meeting of the
Association for Computational Linguistics, pp. 873–882. ACL, July 2012
6. Baroni, M., Dinu, G., Kruszewski, G.: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: 52nd Annual Meeting of the Association for Computational Linguistics, pp. 238–247 (2014)
7. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint (2013). arXiv:1301.3781
8. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word
representations. In: HLT-NAACL, pp. 746–751, June 2013
9. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization.
In: Advances in Neural Information Processing Systems, pp. 2177–2185 (2014)
10. Goldberg, Y., Levy, O.: word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint (2014). arXiv:1402.3722
11. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
12. Chen, Y., Perozzi, B., Al-Rfou, R., Skiena, S.: The expressive power of word embed-
dings. In: Speech and Language Proceeding Workshop, ICML, Deep Learning for
Audio (2013)
13. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G.,
Ruppin, E.: Placing search in context: the concept revisited. In: 10th International
Conference on World Wide Web, pp. 406–414, ACM, April 2001
14. Bruni, E., Tran, N.K., Baroni, M.: Multimodal distributional semantics. J. Artif.
Intell. Res. (JAIR) 49, 1–47 (2014)
15. Schnabel, T., Labutov, I., Mimno, D., Joachims, T.: Evaluation methods for unsu-
pervised word embeddings. In: Proc. Emp. Met. Nat. Lang. (2015)
16. Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G., Dyer, C.: Evaluation of word vector representations by subspace alignment. In: Proc. Emp. Met. Nat. Lang., pp. 2049–2054. ACL, Lisbon (2015)
17. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM (2003)
Characterizing Relevance
on Mobile and Desktop
1 Introduction
The primary goal of Information Retrieval (IR) systems is to retrieve highly relevant documents for a user's search query. Judges determine document relevance by assessing the topical overlap between a document's content and the user's information need. To facilitate repeated and reliable evaluation of IR systems, trained judges are asked to evaluate several documents (mainly shown on desktop computers) with respect to a query to produce large-scale test collections (such as those produced at TREC1 and NTCIR2). These collections are created with the following assumptions: (1) document rendering is device agnostic, i.e., a document appears the same whether it is viewed on a mobile or a desktop; for example, the font size of text, headings, and image resolutions remain unchanged with a change in screen size; (2) document content is device agnostic, i.e., if some text is displayed on
1 http://trec.nist.gov/.
2 http://research.nii.ac.jp/ntcir/.
a desktop, it will also be visible, for instance, on a mobile. Note that while the former assumes that the document layout remains unchanged, the latter assumes that its content (for example, the number of words, headings, or paragraphs) remains the same across devices.
While this evaluation mechanism is robust, it is greatly challenged by the explosion of devices with myriad resolutions, which requires pages to be optimized for different resolutions. A popular website today has at least two views: one optimized for traditional desktops and the other tuned for mobiles or tablets, i.e., devices with smaller screens. The small screen size limits both the layout and the amount of content visible to the user at any given time. The continuous advancement in browsers and markup languages exacerbates this problem, as web developers today can collapse a traditionally multi-page website into a single web page with several sections. The same page can be optimized for both mobile and desktop with one style sheet and minimal changes in HTML. Thus, with today's technology, a user may see separate versions of the same website on desktop and mobile, which in turn may greatly impact her judgment of the page with respect to a query.
To illustrate this further, Fig. 1 shows four web pages with their respective
queries on mobile as well as desktop. Web pages in Fig. 1a and b are relevant to
the query and have been optimized for mobiles. Judges in our study also marked
these pages relevant on mobile and desktop. However, web pages in Fig. 1c and d
are not suitable for mobile screens. In the case of Fig. 1c, the whole page loads in the viewport3, which in turn makes it extremely hard to read. Figure 1d has more ads than relevant content in the viewport, which prompts judges to assign lower relevance on mobile.
Thus, it needs to be determined whether the device on which a document is rendered influences its evaluation with respect to a query. While some work [7] compares user search behavior on mobiles and desktop, we know little about how users judge pages on these two mediums and whether there is any significant difference in judging time or in the obtained relevance labels. We need to determine whether page rendering has any impact on judgments, i.e., whether different web page layouts (on mobile or desktop) translate into different relevance labels. We also need to verify whether viewport-specific signals can be used to determine page relevance. If these signals are useful, mobile-specific relevance can be determined using a classifier, which in turn would reduce the overhead of obtaining manual judgments.
In this work we investigate the problems outlined above. We collect and compare crowdsourced judgments of query-url pairs for mobile and desktop. We report the differences in agreement, judging time, and relevance labels. We also propose novel viewport-oriented features and use them to predict page relevance on both mobile and desktop. We analyze which features are strong signals for determining relevance. Our study shows that there are certain differences between mobile and desktop judgments. We also observe different judging times, despite similar inter-rater agreement on both devices.
3 Viewport is the framed area on a display screen of a mobile or desktop for viewing information.
Fig. 1. Sample queries and resulting web pages on desktop and mobile screens; (c) average temperature Dallas winter, (d) us worst drought year precipitation.
2 Related Work
While there exists a large body of work that identifies factors affecting relevance on desktop, only a fraction characterizes search behavior or evaluates search engine result pages for user interaction on mobiles. We briefly discuss factors important for judging page relevance on desktop and contrast our work with existing mobile search studies.
Schamber et al. [13] concluded that relevance is a multidimensional con-
cept that is affected by both internal and external factors. Since then, several
studies have investigated and evaluated factors that constitute relevance. For
instance, Xu et al. [18] and Zhang et al. [20] explored factors employed by users to determine page relevance. They studied the impact of topicality, novelty, reliability, understandability, and scope on relevance, and found topicality and novelty to be the most important relevance criteria. Borlund et al. [1] have shown that as a search progresses, the structure and understandability of a document become important in determining relevance. Our work is quite different, as we do not ask users for explicit judgments on the above-mentioned factors. We compare
relevance judgments obtained for the same query-url pairs on mobile and desktop. Our primary focus is the difference in judging patterns between the two mediums.
There is some work on mining large-scale user search behavior in the wild. Several studies [3,5,6,8–10,17,19] report differences in search patterns across devices. For instance, Kamvar et al. [8] compare searches across computers and mobiles and conclude that smartphones are treated as extensions of users' computers. They suggested that mobiles would benefit from integration with a computer-based search interface. These studies found mobile queries to be short (2.3–2.5 terms) and query reformulation rates to be high. One key result of Church and Smyth [4] was that the conventional desktop-based approach did not receive any clicks for almost 90 % of searches, which they suggest may be due to unsatisfactory search results. Song et al. [15] study mobile search patterns on three devices: mobile, desktop, and tablet. Their study emphasizes that, due to significant differences between user search patterns on these platforms, using the same web page ranking methodology is not optimal. They propose a framework to transfer knowledge from desktop search such that search relevance on mobile and tablet can be improved. We train models for relevance prediction as opposed to search result ranking.
Other work includes abandonment prediction on mobile and desktop [12], query reformulation on mobile [14], and understanding mobile search intents [4]. Buchanan et al. [2] propose some ground rules for designing web interfaces for mobile. All the above mobile-related studies focus on search behavior, not on what constitutes page relevance on small screens. In this work our focus is not to study search behavior but to compare relevance judgments for the same set of pages on different devices.
paid $0.03. We collected three judgments per query-url pair. We ensured that query-url pairs were shown at random to avoid biasing the judge. We determined each judge's browser type (and device) using JavaScript. The judgments performed on Android or iOS phones are used in our analysis. To help filter malicious workers, we restricted our HITs to workers with an acceptance rate of 95 % or greater and, to ensure English language proficiency, to workers in the US. In total we collected 708 judgments from each interface. Desktop judgment HITs were submitted by 41 workers and mobile judgment HITs were completed by 28 workers on MTurk.
The final grade of each pair was obtained by taking the majority of the three labels. We also group relevance labels5 to form binary judgments from the 4-point scale judgments. The label distribution is as follows:
– Desktop: High-rel=108, rel=37, some-rel=47, non-rel=44
– Mobile: High-rel=86, rel=55, some-rel=64, non-rel=31
The inter-rater agreement (Fleiss kappa) for desktop judgments was 0.28 (fair) on the 4-point scale and 0.42 (moderate) for binary grades. Similarly, the inter-rater agreement for mobile judgments was 0.33 (fair) on the 4-point scale and 0.53 (moderate) for binary grades. The agreement rate is comparable to that observed in previous relevance crowdsourcing studies [11]. However, Cohen's kappa between the majority desktop and mobile relevance grades is only 0.127 (slight), indicating that judgments obtained on mobiles may differ greatly from those obtained on desktop. Kendall's Tau is also low, only 0.114 (p-value=0.01), suggesting that the judging device influences judges.
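The agreement figures above can be reproduced with standard library routines; the following is a hedged sketch (file names and the exact data layout are placeholders, not the authors' pipeline):

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# one row per query-url pair, three columns with the raw 4-scale labels
labels = np.loadtxt("desktop_labels.csv", delimiter=",", dtype=int)   # placeholder file
table, _ = aggregate_raters(labels)          # per-item counts for each label category
print(fleiss_kappa(table))                   # inter-rater agreement among the three judges

# agreement between the majority desktop grades and the majority mobile grades
maj_desktop = np.loadtxt("majority_desktop.csv", dtype=int)           # placeholder files
maj_mobile = np.loadtxt("majority_mobile.csv", dtype=int)
print(cohen_kappa_score(maj_desktop, maj_mobile))
print(kendalltau(maj_desktop, maj_mobile))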
Boxplots of the relevance grades assigned by crowd judges and their judging time on each interface are shown in Fig. 2. The average time (red squares) crowd judges spent labeling a highly relevant, relevant, somewhat relevant, and non-relevant page on desktop was 88 s, 159 s, 142 s, and 197 s respectively. Meanwhile,
5 rel = high-rel + rel, non-rel = some-rel + non-rel.
the average mobile judging time for a highly relevant, relevant, somewhat relevant, and non-relevant page was 65 s, 51 s, 37 s, and 48 s respectively. The plots show two interesting judging trends. Firstly, judges take longer to decide non-relevance on desktop than on mobile. This may be due to several reasons. If a web page is not optimized for mobiles, it may be inherently difficult to find the required information. Judges perhaps do not spend time zooming and pinching if information is not readily available in the viewport. It could also be a result of interaction fatigue: in the beginning, judges may thoroughly judge each page, but due to the limited interaction on mobiles, they grow impatient as time passes and spend increasingly less time evaluating each page, thus giving up more quickly than a desktop judge. For optimized pages, the smaller viewport on mobile allows judges to quickly decide whether the web page is relevant or not. For example, web pages with irrelevant ads (Fig. 1d) can be quickly marked as non-relevant. Secondly, judges spend more time analyzing highly relevant and relevant pages on mobile and desktop respectively. This is perhaps because it takes longer to consume a page on mobile: with limited information in the viewport, the user has to tap, zoom, or scroll several times to read the entire document.
Fig. 2. (a) Judging time on mobile and desktop; (b) distribution of relevance judgments.
4 Relevance Prediction
Relevance prediction is a standard but crucial problem in information retrieval. There are several signals that are computed today to determine page relevance with respect to a user query. However, our goal is not to test or compare existing features. Our primary focus is to determine whether viewport- and content-specific features show different trends for relevance prediction on mobile and desktop. Given that non-relevant pages are small in number on the 1–4 scale, we predict relevance on a binary scale. We use several combinations of features to train an AdaBoost classifier [21]. Given that our dataset is small, we perform 10-fold cross-validation. We report the average precision, recall, and F1-score across the 10 folds for mobile and desktop. We use a paired t-test to assess statistical significance. We begin by describing the features used to predict relevance in the following subsection.
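A sketch of this prediction setup (not the authors' code; the feature and label files are placeholders):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate

X = np.load("features.npy")        # placeholder: one feature row per query-url pair
y = np.load("binary_labels.npy")   # placeholder: 1 = relevant, 0 = non-relevant

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
scores = cross_validate(clf, X, y, cv=10,
                        scoring=("accuracy", "precision", "recall", "f1"))
for name in ("accuracy", "precision", "recall", "f1"):
    print(name, scores["test_" + name].mean())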
4.1 Features
We study whether there is a significant difference between the features that are useful in predicting relevance on mobile and those that predict relevance on desktop. Our objective is to capture features that have different distributions on the two devices. Features that capture link authority or content novelty will contribute equally to relevance for a page rendered on mobile and desktop (i.e., they will have the same value), so we exclude them from our analysis. Several features have been reported to be important in characterizing relevance. Zhang et al. [20] investigated five popular factors affecting relevance; understandability and reliability were highly correlated with page relevance. We capture both topicality- and understandability-oriented features in this work. Past work has also shown that textual information, page structure, and the quality of the page [16] impact the user's notion of relevance.
As mentioned before, the screen resolution of mobiles greatly affects the legibility of on-screen text. If websites are not optimized to render on small screens, users may get frustrated quickly due to repeated taps, pinching, and zooming to understand their content. Our hypothesis is that the relative position and size of text on a page are important indicators, as viewport size varies greatly between mobile and desktop, thus affecting the user's time and interaction with the page. We extract features that rely on the visible content of the web page, the position of such elements, and their rendered sizes. We evaluate several viewport-oriented (i.e., interface-oriented) features to predict page relevance.
Both content- and viewport-specific features are summarized in Table 1. Content-oriented features are calculated on two levels: the entire web page (html) and only the portion of the page visible to the user (viewport) when she first lands on the page. Geometric features capture the mean, minimum, and maximum of the absolute on-screen coordinates of query terms and headings. Display-specific features capture the absolute size, in pixels, of query terms, headings, and the remaining words in the page. These features are calculated by simulating how these pages are
rendered on mobile and desktop with the help of Selenium web browser automa-
tion.6 This provides all information about rendered DOM elements in HTML at
runtime.
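As an illustration, rendering-time features of this kind could be collected with Selenium along the following lines (a hedged sketch: the window sizes, CSS selectors, and the particular features shown are assumptions, not the authors' exact setup):

from selenium import webdriver
from selenium.webdriver.common.by import By

def page_features(url, width, height):
    opts = webdriver.ChromeOptions()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    driver.set_window_size(width, height)       # e.g. 360x640 for mobile, 1366x768 for desktop
    driver.get(url)
    headings = driver.find_elements(By.CSS_SELECTOR, "h1, h2, h3")
    features = {
        "num_headings_html": len(headings),
        # headings whose top edge lies inside the first viewport (visible without scrolling)
        "num_headings_viewport": sum(1 for h in headings if h.location["y"] < height),
        "max_heading_size": max((h.size["height"] for h in headings), default=0),
    }
    driver.quit()
    return features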
Pearson’s correlation (R) between top five statistically significant features
(p-value < 0.05) and mobile/desktop judgments is shown in Table 2. As we can
see content oriented features are highly correlated with desktop judgments but
both view and size oriented features are correlated with mobile judgments.
Table 2. Top five features correlated with desktop and mobile judgments.
Desktop: Feature | R | p-val
Number of sentences (html) | −0.13 | 0.04
Number of words (html) | −0.12 | 0.03
Unique tokens (html) | −0.12 | 0.04
Query term size (mean) | 0.26 | 0.00
Query term position (min) | −0.17 | 0.01
Mobile: Feature | R | p-val
Heading size (max) | 0.16 | 0.01
Number of headings (viewport) | 0.16 | 0.02
Number of words (viewport) | 0.14 | 0.04
Number of headings (html) | 0.13 | 0.04
Number of images (viewport) | 0.10 | 0.04
The rows labeled no.x correspond to models using all the features except features of type x. The rows labeled only.x report metrics for models trained only on features in group x. Finally, four models – geom.display, geom.view, geom.html, and view.html – are trained on pairs of feature groups (geometric, content (viewport (view) or html), and display). Treating the model trained with all the features as the baseline, statistically significant models (p-value < 0.05 for a paired t-test) have been marked with (*).
The classification accuracy for mobile relevance is significantly better than random. The best performing feature combination for mobile is the viewport features (only.view). The confusion matrix for the best model (only.view) is given in Table 4. Surprisingly, the accuracy does not improve when all features are used. There is also no improvement in performance when display-specific features are taken into account. A binary classifier trained solely on html features does worse than the classifier trained using viewport features. Among pairwise combinations, content-specific features (view.html) perform best, which is perhaps due to the presence of viewport-based features in the model.
Table 3. Relevance prediction performance on mobile and desktop (averages over 10 folds).
            Mobile                               Desktop
            Accuracy  Prec  Recall  F1-score     Accuracy  Prec  Recall  F1-score
all 0.76 0.83 0.84 0.839 0.60 0.70 0.73 0.71
no.html 0.75 0.83 0.83 0.82 0.65* 0.72 0.79* 0.75*
no.view 0.73 0.81 0.82 0.81 0.61 0.68 0.79* 0.73
no.geom 0.79* 0.85 0.86 0.86* 0.67* 0.75* 0.76 0.75*
no.display 0.78 0.857* 0.85 0.85 0.64* 0.71 0.79* 0.74
only.html 0.74 0.78 0.90* 0.83 0.61 0.70 0.75 0.72
only.view 0.81* 0.87* 0.88* 0.87* 0.64* 0.70 0.78 0.74
only.geom 0.76 0.82 0.88* 0.84 0.54 0.63 0.72 0.67
only.display 0.71 0.79 0.81 0.80 0.64* 0.71 0.79 0.74
geom.display 0.75 0.84 0.82 0.82 0.61 0.68 0.76 0.72
geom.html 0.77 0.84 0.86 0.85 0.57 0.66 0.73 0.69
geom.view 0.78 0.84 0.87 0.85 0.62 0.70 0.77* 0.73
view.html 0.78 0.83 0.89* 0.86* 0.67* 0.75* 0.78* 0.76*
Our hypothesis that geometric and display-oriented features impact relevance is not supported by the results. Display features, for instance, when used alone for binary classification have the lowest accuracy amongst all feature combinations. It is worth noting that when viewport features are dropped from training (no.view), the accuracy goes down by 3 % compared to the model trained on all the features.
The features with the highest scores in the viewport (only.view) classifier are, in decreasing order of importance: total tokens in the viewport (0.28), number of images in the viewport (0.20), number of tables (0.16), number of sentences (0.10), number of outlinks with query terms (0.07), number of outlinks (0.06), and finally number of headings with query terms (0.06).
Table 4. Confusion matrix of the best mobile model (only.view).
            Rel    NonRel
Rel         0.88   0.12
Non-Rel     0.26   0.74
Table 5. Confusion matrix of the best desktop model (view.html).
            Rel    NonRel
Rel         0.68   0.32
Non-Rel     0.34   0.66
The results for document relevance are shown in Table 3. The overall accuracy of relevance prediction on desktop pages is low; it is in fact lower than that observed on mobile. The best performing system is the one trained with content-based features, i.e., viewport and html features (view.html). The confusion matrix for the best model (view.html) is given in Table 5. The difference between the accuracy of the best performing model on mobile (only.view) and the best performing system on desktop (view.html) is 17 %. It is interesting to note that viewport features are useful indicators of relevance regardless of the judging device. The models with viewport features (only.view, geom.view, and view.html) perform better than the model built using all features. This suggests that users tend to deduce page relevance from the immediately visible content once the page finishes loading.
It is also surprising that the classifier trained on features extracted from the entire document (only.html) performs worse than the one trained using only viewport features (only.view). This could be due to the limited number of features used in our study. Perhaps, with a more extensive set of page-specific features, the classifier would perform better.
Our hypothesis that geometric features affect relevance is not supported by either experiment. Overall, geometric features are not useful in predicting relevance on desktop: the classifier trained only on geometric features achieves only 54 % accuracy, 10 % lower than the model trained with all features. It is not surprising to observe the jump in accuracy once geometric features are removed from model training. Thus, both experiments suggest that the position of query terms and headings, on both mobile and desktop, is not useful in predicting relevance. Amongst models trained on a single set of features, the model with display-specific features (only.display) performs best with 64 % accuracy and 0.74 F1-score. It seems that the font size of query terms, headings, and other words is predictive of relevance. However, the only.display model's accuracy (64 %) is still lower than that of the view.html model. Amongst models trained on pairs of feature groups,
view.html performs best (67 %), closely followed by the geom.view model (62 %). This is perhaps due to the presence of viewport features in both models. The most representative features in the content-based (view.html) classifier, in decreasing order of importance, are: number of headings (viewport) (0.23), number of images (viewport) (0.14), query term frequency (html) (0.08), number of unique tokens (html) (0.06), number of tables (viewport) (0.06), number of words (html) (0.05), and finally number of outlinks (0.04).
Despite promising results, our study has several limitations. It is worth noting that our study contained only 236 query-url pairs; with more data and a more extensive set of features, prediction accuracy would likely improve. We used query-url pairs from a previous study, which had gathered queries for only seven topic descriptions or tasks. We shall follow up with a study containing a greater number and variety of topics to further analyze the impact of device size on judgments. Overall, our experiments indicate that viewport-oriented features are useful in predicting relevance. However, the model trained with viewport features on mobile judgments significantly outperforms the model trained on desktop judgments. Our experiments also show that features such as query term or heading positions are not useful in predicting relevance.
5 Conclusion
Traditional relevance judgments have always been collected via desktops. While existing work suggests that page layout and device characteristics have some impact on page relevance, we do not know whether page relevance changes with a change in device. Thus, with this work we tried to determine whether device size and page rendering have any impact on judgments, i.e., whether different web page layouts (on mobile or desktop) translate into different relevance labels from judges. To that end, we systematically compared crowdsourced judgments of query-url pairs from mobile and desktop.
We analyzed different aspects of these judgments, mainly observing differences in how users evaluate highly relevant and non-relevant documents. We also observed strikingly different judging times, despite similar inter-rater agreement on both devices. We further used a set of viewport-oriented features to predict relevance. Our experiments indicate that they are useful in predicting relevance on both mobile and desktop; however, prediction accuracy on mobile is significantly higher than on desktop. Overall, our study shows that there are certain differences between mobile and desktop judgments.
There are several directions for future work. The first and foremost would be
to scale this study and analyze the judging behavior more extensively to draw
better conclusions. Secondly, it would be worthwhile to investigate further the
role of viewport features on user interaction and engagement on mobiles and
desktops.
References
1. Borlund, P.: The concept of relevance in IR. J. Am. Soc. Inf. Sci. Technol. 54(10),
913–925 (2003)
2. Buchanan, G., Farrant, S., Jones, M., Thimbleby, H., Marsden, G., Pazzani, M.:
Improving mobile internet usability. In: Proceedings of WWW. ACM (2001)
3. Church, K., Oliver, N.: Understanding mobile web and mobile search use in today's dynamic mobile landscape. In: Proceedings of MobileHCI. ACM (2011)
4. Church, K., Smyth, B.: Understanding the intent behind mobile information needs. In: Proceedings of IUI. ACM (2009)
5. Church, K., Smyth, B., Bradley, K., Cotter, P.: A large scale study of European mobile search behaviour. In: Proceedings of MobileHCI. ACM (2008)
6. Church, K., Smyth, B., Cotter, P., Bradley, K.: Mobile information access: a study of emerging search behavior on the mobile internet. ACM Trans. Web 1(1), 4
(2007)
7. Guo, Q., Jin, H., Lagun, D., Yuan, S., Agichtein, E.: Mining touch interaction data
on mobile devices to predict web search result relevance. In: Proceedings of SIGIR.
ACM (2013)
8. Kamvar, M., Baluja, S.: A large scale study of wireless search behavior: Google
mobile search. In: Proceedings SIGCHI. ACM (2006)
9. Kamvar, M., Baluja, S.: Deciphering trends in mobile search. Computer 40(8),
58–62 (2007)
10. Kamvar, M., Kellar, M., Patel, R., Xu, Y.: Computers and iPhones and mobile phones, oh my!: a logs-based comparison of search users on different devices. In:
Proceedings of WWW. ACM (2009)
11. Kazai, G., Kamps, J., Milic-Frayling, N.: An analysis of human factors and label
accuracy in crowdsourcing relevance judgments. Inf. Retr. (2013)
12. Li, J., Huffman, S., Tokuda, A.: Good abandonment in mobile and PC internet
search. In: Proceedings of SIGIR. ACM (2009)
13. Schamber, L., Eisenberg, M.: Relevance: The search for a definition (1988)
14. Shokouhi, M., Jones, R., Ozertem, U., Raghunathan, K., Diaz, F.: Mobile query
reformulations. In: Proceedings SIGIR. ACM (2014)
15. Song, Y., Ma, H., Wang, H., Wang, K.: Exploring and exploiting user search behav-
ior on mobile and tablet devices to improve search relevance. In: Proceedings of
WWW (2013)
16. Tombros, A., Ruthven, I., Jose, J.M.: How users assess web pages for information
seeking. J. Am. Soc. Inf. Sci. Technol. 56(4), 327–344 (2005)
17. Tossell, C., Kortum, P., Rahmati, A., Shepard, C., Zhong, L.: Characterizing web
use on smartphones. In: Proceedings of the SIGCHI. ACM (2012)
18. Xu, Y.C., Chen, Z.: Relevance judgment: What do information users consider
beyond topicality? JASIST 57(7), 961–973 (2006)
19. Yi, J., Maghoul, F., Pedersen, J.: Deciphering mobile search patterns: a study of Yahoo! mobile search queries. In: Proceedings of the WWW. ACM (2008)
20. Zhang, Y., Zhang, J., Lease, M., Gwizdka, J.: Multidimensional relevance modeling
via psychometrics and crowdsourcing. In: Proceedings of the SIGIR. ACM (2014)
21. Zhu, J., Zou, H., Rosset, S., Hastie, T.: Multi-class adaboost. Stat. Interface 2(3),
349–360 (2009)
Probabilistic Modelling
Probabilistic Local Expert Retrieval
1 Introduction
When visiting unfamiliar cities for the first time, visitors are confronted with a number of challenges related to finding the right spots to go to, sights to see, or even the most appropriate range of cuisine to sample. While residents quite naturally familiarize themselves with their surroundings, strangers often face difficulties in efficiently selecting the best locations to suit their preferences. We refer to such location-specific, and frequently sought-after [11], knowledge as local expertise.
Local expertise can be acquired with the help of online resources such as
review sites (e.g., yelp.com or tripadvisor.com) that rely on both paid profes-
sionals as well as user recommendations. General-purpose Web search engines,
especially in the form of entertainment or food verticals, provide valuable infor-
mation. However, these services merely return basic information, and results are not specifically tailored to the individual. Ideally, a more effective way of solving this task would be to ask someone who is local and/or has knowledge about the area in question. Seeking out such people is an example of expert retrieval and, more specifically, of local expert retrieval.
Location-based social networks allow users to post messages and document their whereabouts. When a user checks in at a given location, the act of checking in is not merely a user-place tuple. The physical attendance at the loca-
tion also suggests that the user, at least to some extent, gets familiar with the
location and its environment. The more frequently such evidence is observed, the
more accurate our insights into the user’s interests and expertise become. This
paper introduces two novel contributions over the state of the art in local expert
retrieval. (1) We propose a range of probabilistic models for estimating users’
local expertise on the basis of geo-tagged social network streams. (2) In a large-
scale evaluation effort, we demonstrate the merit of the methods on real-world
data sampled from the popular microblogging service Twitter.
The remainder of this paper is structured as follows: Sect. 2 gives an overview
of local expert retrieval methods as well as social question answering platforms.
Section 3 derives a range of probabilistic models for local expert retrieval that are
being evaluated on a concrete retrieval task in Sect. 4. Finally, Sect. 5 concludes
with a brief discussion of our main findings as well as an outlook on future work.
2 Related Work
The task of expertise retrieval was first addressed in the domain of enterprises managing and optimizing human resources (detailed in [1,17]). Early
expertise retrieval systems required experts to manually fill out questionnaires
about their areas of expertise to create the so-called expert profiles. Later, auto-
mated systems were employed for building and updating such profiles and prob-
abilistic models were introduced for estimating a candidate’s expertise based on
the documents they authored, e.g., [1,4,6,7,15]. These works inspire us to use
probabilistic models for estimating local expertise. Since location information
is not always presented in textual format, we approach the problem by build-
ing models based on candidates’ check-in profiles. Li et al. [11] proposed the
problem of local expert retrieval (using the term “geo-expertise”) and investi-
gated the main intuitions that could naturally support an automatic approach
which considers user check-in profiles as evidence of having knowledge regarding
locations they had visited. A preliminary empirical evaluation demonstrated its
feasibility using three heuristic methods for automating local expert retrieval, however without giving a formal derivation that would underpin the soundness of these methods. In this paper, we follow up on this line of research and investigate the probabilistic reasoning behind the methods proposed in the previous work. Cheng et al. [5] also framed finding local experts as a retrieval problem, for which they combined models of local and topical authority to rank candidates based on data collected from Twitter. The authors rely on textual queries accompanied by a location to specify spatial constraints. In our setting, queries are phrased in terms of locations. This can, for example, be either a specific restaurant or a type of restaurant that users are interested in visiting. The main difference between this study and the aforementioned one is
that we focus on evidence of location knowledge in geo-spatial movement pro-
files while Cheng et al. introduce location constraints in text-based expertise
retrieval.
The first approach we propose considers only the candidates' check-in frequency. This method focuses on knowledge about a single location or a single type of location. We take a co-occurrence modelling approach, inspired by expert finding via text documents [6]. To be specific, we rank a candidate u by their probability of having local expertise in a given topic q, i.e., P(u|q). We estimate this conditional probability by aggregating over the user's check-ins at all locations l, that is
P(u|q) = \sum_{l} P(u|l, q) P(l|q).
P(q|u) = \left( \frac{N_u}{\sum_{u' \in U} N_{u'}} \right)^{|L_q|} \prod_{l \in L_q} \frac{N_{l,u}}{N_u}.
The above scoring function indicates that check-ins at multiple distinct loca-
tions (within the queried location set) should increase the score more than
repeatedly checking in at the same location. This means that a candidate will
gain a high local expertise rating if he/she makes check-ins at a variety of rele-
vant locations. This fits the intuition that candidates with experience at a vari-
ety of locations may know more about the essence of the topic rather than mere
specifics of a single place within that category. For example, if we seek advice
about Italian restaurants, individuals who have been to many Italian restaurants
in town will be more suitable candidates than those who have been to the same
restaurant a lot.
The prior P(u) is selected for the scoring function so that the candidate-dependent denominator in the conditional probability P(l|u) is cancelled when combined with the prior. This accounts for the fact that language models represent users' topical focus rather than their knowledge, i.e., they are biased towards shorter profiles when two profiles have the same number of relevant check-ins. Since check-ins are positive evidence of candidates knowing about a location, additionally knowing about other types of locations should not negatively affect the local expertise score. For example, if a candidate has visited two place categories A and B n times each, while another candidate has only been to A n times, it is not reasonable to assume that the latter candidate has more knowledge about A than the former, even if the latter has focused on A more.
Experts are humans and as such they rely on their memories to support their expertise. Therefore, we should take into account the fact that (1) people forget knowledge they once gained and have not refreshed for a while, and (2) the world changes as time goes by, e.g., restaurants may have new chefs, and old buildings may have been replaced. The more time has passed since the creation of a memory, the more likely it is to be forgotten or outdated. To incorporate such effects, we explicitly model the candidates' memory by P(c|u), which indicates the probability that candidate u can recall his/her visit represented by the check-in c. As suggested in the domain of psychology, human memory can be assumed to decay exponentially [14]. Consequently, we use an exponential decay function to represent the retention of individual check-ins, by which we obtain:
obtain:
P_t(c|u) = \frac{e^{-\lambda(t - t_c)}}{\sum_{c' \in C_u} e^{-\lambda(t - t_{c'})}},
where t is the time of query and tc is the time when the user posted the check-in.
Similarly, we define a prior for each candidate as follows
P_t(u) = \frac{\sum_{c \in C_u} e^{-\lambda(t - t_c)}}{\sum_{c \in C} e^{-\lambda(t - t_c)}}.
The decay of the weight on check-ins models our belief on how up-to-date the
information is, while the prior reflects the average recency of knowledge borne by
the whole community on the social network. Then, for estimating the candidate’s
expertise, we weight each check-in according to its recency, i.e., we marginalize
the user’s old check-ins.
P_t(l|u) = \sum_{c \in C_u} P(l|c) P_t(c|u) = \frac{\sum_{c \in C_u} 1(l_c = l) e^{-\lambda(t - t_c)}}{\sum_{c \in C_u} e^{-\lambda(t - t_c)}},   (3)
where 1(·) is an indicator function, which equals 1 if and only if the condition
in the parentheses evaluates to true. Given these two estimations, we have
S_r(u, q) = P_t(u|q) = \sum_{l \in L} P_t(u|l) P(l|q) \stackrel{rank}{=} \sum_{l \in L} P_t(l|u) P_t(u) P(l|q) \stackrel{rank}{=} \sum_{l \in L_q} \sum_{c \in C_u, l_c = l} e^{-\lambda(t - t_c)}.
P(c|u) = \frac{e^{-\lambda(t - t_c)}}{\sum_{c' \in C_u} e^{-\lambda(t - t_{c'})}}.
Thus, we have
P(q|u) = \prod_{l \in L_q} \frac{\sum_{c \in C_u} 1(l_c = l) e^{-\lambda(t - t_c)}}{\sum_{c \in C_u} e^{-\lambda(t - t_c)}}.
By replacing the counterparts in Eq. 2 with these and applying the logarithm on both sides of the equation, we obtain
S_d(u, q) = \log \left[ \frac{\left( \sum_{c \in C_u} e^{-\lambda(t - t_c)} \right)^{|L_q|}}{\left( \sum_{c \in C} e^{-\lambda(t - t_c)} \right)^{|L_q|}} \prod_{l \in L_q} \frac{\sum_{c \in C_u} 1(l_c = l) e^{-\lambda(t - t_c)}}{\sum_{c \in C_u} e^{-\lambda(t - t_c)}} \right]
= \log \frac{1}{\left( \sum_{c \in C} e^{-\lambda(t - t_c)} \right)^{|L_q|}} + \sum_{l \in L_q} \log \sum_{c \in C_u} 1(l = l_c) e^{-\lambda(t - t_c)}
\stackrel{rank}{=} \sum_{l \in L_q} \log \sum_{c \in C_u} 1(l = l_c) e^{-\lambda(t - t_c)}.
The decay parameter λ is set to the same value as that in the WTR method.
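In practice, the rank-equivalent forms above reduce to simple sums over exponentially decayed check-ins: S_r adds a decayed contribution for every check-in at a queried location, and S_d sums, per queried location, the logarithm of that location's decayed check-in mass. A minimal sketch (check-ins given as (location id, Unix timestamp) pairs, relevant_locations as the set L_q; a candidate lacking any check-in at some queried location gets probability zero under the product model, mirrored here by -inf):

import math
import time

def recency_score(checkins, relevant_locations, lam, t=None):
    # S_r(u, q): every check-in at a relevant location contributes exp(-lambda * age)
    t = time.time() if t is None else t
    return sum(math.exp(-lam * (t - tc))
               for loc, tc in checkins if loc in relevant_locations)

def decayed_lm_score(checkins, relevant_locations, lam, t=None):
    # S_d(u, q): per relevant location, the log of the decayed check-in mass
    t = time.time() if t is None else t
    score = 0.0
    for l in relevant_locations:
        mass = sum(math.exp(-lam * (t - tc)) for loc, tc in checkins if loc == l)
        score += math.log(mass) if mass > 0 else float("-inf")
    return score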
h^{(n+1)} = \frac{M^T M \cdot h^{(n)}}{\| M^T M \cdot h^{(n)} \|}.
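The update above is a standard power iteration; the following sketch computes it under the assumption that the score vector h is indexed by the columns of a check-in matrix M (the exact layout of M in the baseline is an assumption, not stated here):

import numpy as np

def hits_scores(M, n_iter=50):
    # power iteration for h_(n+1) = (M^T M h_(n)) / ||M^T M h_(n)||
    h = np.ones(M.shape[1]) / np.sqrt(M.shape[1])
    for _ in range(n_iter):
        h = M.T @ (M @ h)
        h = h / np.linalg.norm(h)
    return h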
4 Evaluation
To evaluate the various discussed methods and profile types, we implement a
configurable local expert retrieval system. The system accepts a topic which is
composed of a city scope and a location (or type of location) and returns a list of related candidate experts. The dataset used in this study is an extended version of the collection of POI-tagged tweets proposed by Li et al. [12]. It comprises 1.3M check-ins from 8K distinct users from New York, Chicago, Los Angeles,
and San Francisco. The data collection process was set up such that each user’s
full check-in profile would be included. To filter out accounts that are solely
used for branding and advertisement purposes (e.g., by companies), we remove all users whose “speed” between consecutive check-ins exceeds 700 kph (which corresponds approximately to the speed of a passenger aircraft). Similarly, users with fewer than five geo-tagged tweets were excluded as well. As a consequence of this thresholding approach, Fig. 1 shows that the check-in distribution over users does not follow the complete power law that was observed in the previous dataset.
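The account filtering described above can be sketched as follows; the 700 kph and five-check-in thresholds come from the text, while the timestamp unit (hours) and the haversine distance are assumptions of this sketch.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def keep_user(checkins, max_speed_kph=700.0, min_checkins=5):
    """checkins: list of (timestamp_in_hours, lat, lon), sorted by time."""
    if len(checkins) < min_checkins:
        return False
    for (t1, la1, lo1), (t2, la2, lo2) in zip(checkins, checkins[1:]):
        dt = t2 - t1
        if dt > 0 and haversine_km(la1, lo1, la2, lo2) / dt > max_speed_kph:
            return False
    return True
```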
4.1 Annotation
Topic-candidate pairs are annotated by human judges that assess each candi-
date’s level of expertise about the given topics. To facilitate this process, an
interactive annotation interface for displaying the candidate’s historical check-
ins has been designed and can be accessed at https://geo-expertise.appspot.com.
For each topic-candidate pair, an annotator is asked to assign a value from 1 to 5
indicating their assessment of the candidate’s knowledge about the topic, where
“5” means the candidate knows the topic very well and “1” indicates the can-
didate knows barely anything about the topic. For greater reliability, we recruit
assessors from different channels. (1) The first run was carried out on the crowdsourcing platform CrowdFlower, where each participant was paid 0.5 USD per task (each containing 10 topic-candidate pairs to annotate). (2) Additionally, we invited students and staff from Delft University of Technology to contribute their assessments. Via Cohen’s Kappa [8], we found that annotators are inclined to agree (κ > 0.4) whenever they have strong opinions on whether a candidate is a local expert on a given topic.
We carry out separate evaluations on two runs of annotation, i.e., one from
the recruited annotators on CrowdFlower and one from the university staff and
students. Annotations are converted into binary labels, in which topic-candidate
pairs assigned with scores 4 or 5 are considered relevant (local experts) and those
with scores 3 or below are considered as irrelevant (non-experts). Based on the
binary relevance annotation, trec_eval (http://trec.nist.gov/trec_eval/) is used
for measuring the performance of the proposed methods. We test the statisti-
cal significance of differences between the performance scores using a Wilcoxon
Signed Rank Test (α < 0.05).
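A minimal sketch of this set-up follows: graded judgments are binarized (scores 4–5 are relevant) and the per-topic scores of two configurations are compared with a Wilcoxon signed-rank test. The data structures are assumptions, and the actual scoring is delegated to trec_eval.

```python
from scipy.stats import wilcoxon

def binarize(graded_judgments, threshold=4):
    """graded_judgments: dict (topic, candidate) -> score in 1..5."""
    return {key: int(score >= threshold) for key, score in graded_judgments.items()}

def compare_systems(scores_a, scores_b, alpha=0.05):
    """scores_a, scores_b: aligned per-topic effectiveness scores (e.g., P@5)."""
    stat, p = wilcoxon(scores_a, scores_b)
    return p < alpha, p
```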
According to the crowdsourced annotation shown in Table 1, the WTD
method with Active-Day profiles performs the best under P@1 and P@5, while
WTA with check-in profiles performs the best under MAP. All proposed meth-
ods with both types of profiles significantly outperform the random baseline. The
HITS baseline performs significantly better than the random selection approach
but is outperformed by the proposed movement profile methods even though
this difference was not found to be significant.
The two types of profiles (+A and +C) were designed for comparing the
potential influence of check-in gamification by Foursquare that might encourage
users to check in as often as possible. In our evaluation, however, we do not
observe a clear preference for either of the two profile types. This may suggest
that the two types of profiles do not diverge much and check-in gamification
does not have an observable influence on assessing candidates’ local expertise.
Turning towards university annotations, we note that the annotators’ rel-
evance assessment is also in favour of configurations with WTD for P@1 and
P@5 and WTA for MAP, although no significant differences between these two
methods are observed. The proposed methods and the HITS method are all
Probabilistic Local Expert Retrieval 237
significantly better than the random baseline on this set of annotations. Differ-
ent from the crowdsourced annotation, here, we observe a significant preference
of all the proposed methods (WTD, WTA, or WTRD) over HITS. At the same
time, we have not observed any significant differences between the three methods
WTD, WTA and WTRD configured with either profile types.
5 Conclusion
In this paper, we presented a range of probabilistic-model-based approaches
to the task of local expert retrieval. Based on the existing theoretical work in
expertise retrieval, we designed three models to capture the candidate’s check-in
profiles. We further designed a method for distilling users’ check-in profile to test
whether the gamification of online location-based social networks would affect
the accuracy of geo-expertise estimation. To evaluate the proposed methods, we
collected a large volume of check-ins via Twitter’s and Foursquare’s public APIs,
for which we finally collected judgements from both online recruited annotators
and university annotators. Our evaluation shows that the proposed methods do
capture local expertise better than both random as well as refined baselines.
During our experiments, we did not observe a significant difference between
Active-Day profiles and the raw check-in profiles in the evaluation.
In the future, we propose to carry out this evaluation task in-vivo by building
a dedicated local expert retrieval system. Such a system can access the Twit-
ter/Foursquare APIs for users who authorize the application to analyse their
check-in profiles as well as their friends’ geo-tagged media streams to find the
friend that is assumed to know most about the user’s desired location. In conse-
quence, it can be assumed to produce much more reliable expertise annotations.
References
1. Balog, K., Fang, Y., de Rijke, M., Serdyukov, P., Si, L.: Expertise retrieval. Found.
Trends Inf. Retrieval 6(2–3), 127–256 (2012)
2. Bao, J., Zheng, Y., Mokbel, M.F.: Location-based and preference-aware recommen-
dation using sparse geo-social networking data. In: Proceedings of the 20th Interna-
tional Conference on Advances in Geographic Information Systems - SIGSPATIAL
2012, pp. 199–208 (2012)
3. Bar-Haim, R., Dinur, E., Feldman, R., Fresko, M., Goldstein, G.: Identifying and
following expert investors in stock microblogs. In: Proceedings of the Conference on
Empirical Methods in Natural Language Processing - EMNLP 2011, pp. 1310–1319
(2011)
4. Campbell, C.S., Maglio, P.P., Cozzi, A., Dom, B.: Expertise identification using
email communications. In: Proceedings of the 12th International Conference on
Information and Knowledge Management - CIKM 2003, pp. 528–531 (2003)
5. Cheng, Z., Caverlee, J., Barthwal, H., Bachani, V.: Who is the barbecue king
of Texas? In: Proceedings of the 37th International ACM SIGIR Conference on
Research and Development in Information Retrieval - SIGIR 2014, pp. 335–344
(2014)
6. Fang, H., Zhai, C.X.: Probabilistic models for expert finding. In: Amati, G.,
Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 418–430.
Springer, Heidelberg (2007)
7. Fang, Y., Si, L., Mathur, A.P.: Discriminative models of integrating document
evidence and document-candidate associations for expert search. In: Proceedings
of the 33rd International ACM SIGIR Conference on Research and Development
in Information Retrieval - SIGIR 2010, p. 683 (2010)
8. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull.
76(5), 378–382 (1971)
9. Horowitz, D., Kamvar, S.D.: The anatomy of a large-scale social search engine. In:
Proceedings of the 19th International Conference on World Wide Web - WWW
2010, pp. 431–440 (2010)
10. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM
46(5), 604–632 (1999)
11. Li, W., Eickhoff, C., de Vries, A.P.: Geo-spatial domain expertise in microblogs.
In: de Rijke, M., Kenter, T., de Vries, A.P., Zhai, C.X., de Jong, F., Radinsky, K.,
Hofmann, K. (eds.) ECIR 2014. LNCS, vol. 8416, pp. 487–492. Springer, Heidelberg
(2014)
12. Li, W., Serdyukov, P., de Vries, A.P., Eickhoff, C., Larson, M.: The where in the
tweet. In: Proceedings of the 20th ACM International Conference on Information
and Knowledge Management - CIKM 2011, pp. 2473–2476 (2011)
13. Liu, X., Croft, W.B., Koll, M.: Finding experts in community-based question-
answering services. In: Proceedings of the 14th ACM International Conference on
Information and Knowledge Management - CIKM 2005, pp. 315–316 (2005)
Probabilistic Local Expert Retrieval 239
14. Loftus, G.R.: Evaluating forgetting curves. J. Exp. Psychol. Learn. Mem. Cogn.
11(2), 397–406 (1985)
15. Wagner, C., Liao, V., Pirolli, P., Nelson, L., Strohmaier, M.: It’s not in their
tweets: modeling topical expertise of twitter users. In: SocialCom/PASSAT 2012,
pp. 91–100 (2012)
16. Whiting, S., Zhou, K., Jose, J., Alonso, O., Leelanupab, T.: CrowdTiles: presenting
crowd-based information for event-driven information needs. In: Proceedings of the
21st ACM International Conference on Information and Knowledge Management -
CIKM 2012, pp. 2698–2700 (2012)
17. Yimam-Seid, D., Kobsa, A.: Expert-finding systems for organizations: problem
and domain analysis and the DEMOIR approach. J. Organ. Comput. Electron.
Commer. 13(1), 1–24 (2003)
18. Zhang, J., Ackerman, M.S., Adamic, L.: Expertise networks in online communities:
structure and algorithms. In: Proceedings of the 16th International Conference on
World Wide Web - WWW 2007, pp. 221–230 (2007)
Probabilistic Topic Modelling
with Semantic Graph
Long Chen, Joemon M. Jose, Haitao Yu, Fajie Yuan, and Huaizhi Zhang
1 Introduction
Topic models, such as Probabilistic Latent Semantic Analysis (PLSA) [7] and
Latent Dirichlet Allocation (LDA) [2], have been remarkably successful in ana-
lyzing textual content. Specifically, each document in a document collection is
represented as random mixtures over latent topics, where each topic is character-
ized by a distribution over words. Such a paradigm is widely applied in various
areas of text mining. In view of the fact that the information used by these models is limited to the document collection itself, some recent progress has been made
on incorporating external resources, such as time [8], geographic location [12],
and authorship [15], into topic models.
Different from previous studies, we attempt to incorporate semantic knowl-
edge into topic models. Exploring the semantic structure underlying the sur-
face text can be expected to yield better models in terms of their discovered
latent topics and performance on prediction tasks (e.g., document clustering).
For instance, by applying knowledge-rich approaches (cf. Sect. 3.2) on two news
articles, Fig. 1 presents a piece of global semantic graph. One can easily see that
“United States” is the central entity (i.e., people, places, events, concepts, etc. in
DBPedia) of these two documents with a large number of adjacent entities. It is also clear that a given entity only has a few semantic usages (connections to other
Fig. 1. A piece of global semantic graph automatically generated from two documents
(178382.txt and 178908.txt of 20 Newsgroups dataset)
entities) and thus can only concentrate on a subset of topics, and utilization of
this information can help infer the topics associated with each of the documents in the collection. Hence, it is interesting to learn the interrelationships between
entities in the global semantic graph, which allows an effective sharing of infor-
mation from multiple documents. In addition to the global semantic graph, the
inference of topics associated with a single document is also influenced by other
documents that have the same or similar semantic graphs. For example, if two
documents overlapped with their entities list, then it is highly possible that these
two documents also have a common subset of topics. Following this intuition,
we also construct local semantic graphs for each document in the collection with
the hope to utilize their semantic similarity.
In a nutshell, the contribution of this paper is the incorporation of both global and local semantic graphs, derived from DBpedia, into probabilistic topic modelling.
2 Related Work
2.1 Topic Model with Network Analysis
Topic models, such as PLSA [7] and LDA [16], provide an elegant mathematical model for analyzing large volumes of unlabeled text. Recently, a large number
of studies, such as Author-Topic Model (ATM) [15] and CFTM [4] have been
reported for integrating network information into topic model, but they mostly
focus on homogeneous networks, and consequently, the information of hetero-
geneous network is either discarded or only indirectly introduced. Besides, the
concept of graph-based regularizer is related to Mei’s seminal work [13] which
incorporates a homogeneous network into statistic topic model to overcome
the overfitting problem. The most similar work to ours is proposed by Deng
et al. [5], which utilised the Probabilistic Latent Semantic Analysis (PLSA) [7]
(cf. Sect. 3.1) together with the information learned from a heterogeneous net-
work. However, it was originally designed for academic networks, and thus did not utilize the context information from any knowledge repository. In addition, their
framework only incorporates the heterogeneous network (i.e., relations between
document and entity), while the homogeneous network (i.e., relations between
entity pairs with weight) is completely ignored, whereas we consider both of
them in our framework.
3 Models
3.1 Probabilistic Topic Model
P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i) \qquad (1)
where P (wj |zk ) is the probability of word wj according to the topic model zk , and
P(z_k | d_i) is the probability of topic z_k for document d_i. Following the likelihood
principle, these parameters can be determined by maximizing the log likelihood
of a collection C as follows:
\mathcal{L}(C) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i) \qquad (2)
The model parameters φ = P (wj |zk ) and θ = P (zk |di ) can be estimated by
using standard EM algorithm [7].
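For concreteness, a compact EM procedure for Eqs. 1–2 is sketched below; it is a generic, dense-NumPy PLSA implementation with random initialization, not the authors' code, and it omits tempering and convergence checks.

```python
import numpy as np

def plsa_em(n_dw, K, n_iter=50, seed=0):
    """Fit PLSA to a document-word count matrix n_dw (N x M) with K topics."""
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d); dense (N, M, K) tensor
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        weighted = n_dw[:, :, None] * post
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```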
Thus PLSA provides a good solution to find topics of documents in a text-rich
information network. However, this model ignores the associated heterogeneous
information network as well as other interacted objects. Furthermore, in PLSA
there is no constraint on the parameters θ = P (zk |di ), the number of which
grows linearly with the data. Therefore, the model tends to overfit the data.
To overcome these problems, we propose to use a biased propagation algorithm
by exploiting a semantic network.
P(z_k \mid e) = \frac{1}{2} \left( \sum_{d_i \in D_e} \frac{P(z_k \mid d_i)}{|D_e|} + \sum_{e_j \in C_e} P(z_k \mid e_j)\, P(e_j \mid e) \right) \qquad (3)
\mathcal{L}'(C) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid z_k)\, P_E(z_k \mid d_i) \qquad (5)
Semantic Graph Construction. When computing P(e_j | e) in the above TMSG model, we adopt the method of [14] to construct the semantic graph. We start with a set of input entities C, which is found by using the off-the-shelf named entity recognition tool DBpedia Spotlight3. We then search a sub-graph of DBpedia which involves the entities we already identified in the document, together with all edges and intermediate entities found along all paths of maximal length L that connect them. In this work, we set L = 2, as we find that when L is larger than 3 the model tends to produce very large graphs and introduces a lot of noise.
Figure 2 illustrates an example of a semantic graph generated from the set
of entities {db:Channel, db:David Cameron, db:Ed Miliband}, e.g. as
found in the sentence “Channel 4 will host head-to-head debates between David
Cameron and Ed Miliband.” Starting from these seed entities, we conduct a
depth-first search to add relevant intermediate entities and relations to G (e.g.,
dbr:Conservative Party or foaf:person). As a result, we obtain a semantic
graph with additional entities and edges, which provide us with rich knowledge
about the original entities. Notice that we create two versions of semantic graphs,
namely, the local semantic graph and global semantic graph. The local entity
graphs build a single semantic graph for each document, and it aims to capture
3 https://github.com/dbpedia-spotlight/dbpedia-spotlight
the document context information. The global entity graph is constructed with
the entities of the whole document collection, and we use it to detect the global
context information.
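A rough sketch of the graph construction follows; dbpedia_neighbours is a hypothetical lookup (e.g., backed by a local DBpedia dump or a SPARQL endpoint), and the final filter implements our reading of "paths of maximal length L that connect" the seed entities.

```python
import networkx as nx

def build_semantic_graph(seeds, dbpedia_neighbours, max_len=2):
    """Expand a DBpedia sub-graph around the seeds and keep only nodes lying on a
    path of length <= max_len between two distinct seeds.
    dbpedia_neighbours(entity) -> iterable of (relation, neighbour) pairs (hypothetical)."""
    seeds = list(seeds)
    graph = nx.Graph()
    graph.add_nodes_from(seeds)
    frontier = set(seeds)
    for _ in range(max_len):
        next_frontier = set()
        for entity in frontier:
            for relation, neighbour in dbpedia_neighbours(entity):
                graph.add_edge(entity, neighbour, relation=relation)
                next_frontier.add(neighbour)
        frontier = next_frontier
    # distances from every seed, cut off at max_len
    dist = {s: nx.single_source_shortest_path_length(graph, s, cutoff=max_len)
            for s in seeds}
    keep = set(seeds)
    for node in graph.nodes:
        for i, s1 in enumerate(seeds):
            for s2 in seeds[i + 1:]:
                if (node in dist[s1] and node in dist[s2]
                        and dist[s1][node] + dist[s2][node] <= max_len):
                    keep.add(node)
    return graph.subgraph(keep).copy()
```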
The parameters are estimated with an EM algorithm consisting of two steps, the E-step and the M-step. Formally, in the E-step we calculate the posterior probabilities P(z_k | d_i, w_j) and P(z_k | d_i, e_l):
P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P_E(z_k \mid d_i)}{\sum_{k'=1}^{K} P(w_j \mid z_{k'})\, P_E(z_{k'} \mid d_i)} \qquad (6)
(1 - \xi)\, \frac{\sum_{l} n(d_i, e_l)\, P(z_k \mid d_i, e_l)}{\sum_{l'} n(d_i, e_{l'})} \qquad (10)
4 Experimental Evaluation
We conducted experiments on two real-world datasets, namely, DBLP and 20
Newsgroups. The first dataset, DBLP4 , is a collection of bibliographic informa-
tion on major computer science journals and proceedings. The second dataset, 20
Newsgroups5 , is a collection of newsgroup documents, partitioned evenly across
20 different newsgroups. We experimented with topic modelling using a sim-
ilar set-up as in [5]: For DBLP dataset, we select the records that belong to
the following four areas: database, data mining, information retrieval, and arti-
ficial intelligence. For 20 Newsgroups dataset, we use the full dataset with 20
categories, such as atheism, computer graphics, and computer windows X.
For preprocessing, all the documents are lowercased and stopwords are
removed using a standard list of 418 words. With the disambiguated entities
(cf. Sect. 3.2), we create local and global entity collections, respectively, for construct-
ing local and global semantic graphs. The creation process of entity collections
is organized as a pipeline of filtering operations:
4 http://www.informatik.uni-trier.de/~ley/db/
5 http://qwone.com/~jason/20Newsgroups/
Table 1. Statistics of the two datasets.

                          DBLP      20 Newsgroups
  # of docs               40,000    20,000
  # of entities (local)   89,263    48,541
  # of entities (global)  9,324     8,750
  # of links (local)      237,454   135,492
  # of links (global)     40,719    37,713
1. The isolated entities, which have no paths to the other entities of the full entity collection in the DBpedia repository, are removed, since they have less
power in the topic propagation process.
2. The infrequent entities, which appear in less than five documents when con-
structing the global entity collection, are discarded.
3. Similar to step 2, we discard entities that appear less than two times in the
document when constructing the local entity collection.
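A minimal sketch of this filtering pipeline is given below, assuming per-document entity annotations are available as lists of entity IDs and that DBpedia connectivity (step 1) has been precomputed into a set.

```python
from collections import Counter

def build_entity_collections(doc_entities, connected_in_dbpedia,
                             min_docs_global=5, min_count_local=2):
    """doc_entities: dict doc_id -> list of entity IDs (with repetitions).
    connected_in_dbpedia: entities with at least one DBpedia path to another
    entity of the full collection (step 1)."""
    # Step 1: drop isolated entities
    filtered = {d: [e for e in ents if e in connected_in_dbpedia]
                for d, ents in doc_entities.items()}
    # Step 2: global collection keeps entities appearing in >= min_docs_global documents
    doc_freq = Counter(e for ents in filtered.values() for e in set(ents))
    global_entities = {e for e, df in doc_freq.items() if df >= min_docs_global}
    # Step 3: local collection keeps, per document, entities occurring >= min_count_local times
    local_entities = {d: {e for e, c in Counter(ents).items() if c >= min_count_local}
                      for d, ents in filtered.items()}
    return global_entities, local_entities
```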
The statistics of these two datasets, along with their corresponding entities and links, are shown in Table 1. We randomly split each dataset into a
training set, a validation set, and a test set with a ratio 2:1:1. We learned the
parameters in the semantic graph based topic model (TMSG) on the training
set, tuned the parameters on the validation set and tested the performance
of our model and other baseline models on the test set. The training set and
the validation set are also used for tuning parameters in baseline models. To
demonstrate the effectiveness of the TMSG method, we introduce the following
methods for comparison:
– PLSA: The baseline approach which only employs the classic Probabilistic
Latent Semantic Analysis [7].
– ATM: The state-of-the-art approach, Author Topic Model, which combines
LDA with authorship network [15], in which authors are replaced with entities.
– TMBP: The state-of-the-art approach, Topic Model with Biased Propaga-
tion [5], which combines PLSA with an entity network (without the external
knowledge, such as DBpedia).
– TMSG: The approach described in Sect. 3, namely, the Topic Model with
Semantic Graph.
In order to evaluate our model and compare it to existing ones, we use
accuracy (AC) and normalized mutual information (NMI) metrics, which are
popular for evaluating effectiveness of clustering systems. The AC is defined as
AC = \frac{\sum_{i=1}^{n} \delta(a_i, \mathrm{map}(l_i))}{n} in [17], where n denotes the total number of documents,
δ(x, y) is the delta function that equals one if x = y and equals zero oth-
erwise, and map(li ) is the mapping function that maps each cluster label li
to the equivalent label from the data corpus. Given two sets of documents, C and C', their mutual information metric MI(C, C') is defined as

MI(C, C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \cdot \log_2 \frac{p(c_i, c'_j)}{p(c_i) \cdot p(c'_j)}

[17], where p(c_i) and p(c'_j) are the probabilities that a document arbitrarily selected from the corpus belongs to the clusters c_i and c'_j, respectively, and p(c_i, c'_j) is the joint probability that an arbitrarily selected document belongs to the clusters c_i and c'_j at the same time.
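Both metrics are standard; a sketch using scipy/scikit-learn is shown below, where the mapping map(l_i) is obtained with the Hungarian algorithm on the cluster-class contingency matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, cluster_labels):
    """AC with the best one-to-one cluster-to-class mapping."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    clusters, classes = np.unique(cluster_labels), np.unique(true_labels)
    # contingency[i, j]: number of documents in cluster i whose true class is j
    contingency = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                             for t in classes] for c in clusters])
    rows, cols = linear_sum_assignment(-contingency)  # maximize matched documents
    return contingency[rows, cols].sum() / len(true_labels)

# NMI is available off the shelf:
# nmi = normalized_mutual_info_score(true_labels, cluster_labels)
```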
Fig. 3. The effect of varying parameter ξ in the TMSG framework on DBLP dataset.
the semantic graph with ξ < 0.6. It is also notable that the TMSG with local
semantic graphs (local TMSG) generally performs better than the TMSG with
global semantic graph (global TMSG), which suggests that the local context is
probably more important than the global one for document clustering task. We
further tuned the parameters on the validation dataset. When comparing TMSG
with other existing techniques, we empirically set the bias parameter ξ = 0.6
and the ratio between local and global TMSG is set as 0.6 : 0.4.
Table 2 depicts the clustering performance of different methods. For each
method, 20 test runs were conducted, and the final performance scores were calculated by averaging the scores from the 20 tests. We can observe that ATM
outperforms the baseline PLSA with additional entity network information. As
expected, TMBP outperforms the ATM since it directly incorporates the hetero-
geneous network of the entities. More importantly, our proposed model TMSG
can achieve better results than state-of-the-art ATM and TMBP algorithms.
A comparison using the paired t-test is conducted for PLSA, ATM, and TMBP
over TMSG, which clearly shows that our proposed TMSG outperforms all base-
line methods significantly. This indicates that by considering the semantic graph
information and integrating with topic modelling, TMSG can have better topic
modelling power for clustering documents.
Table 2. The clustering performance of different methods on (a) DBLP and (b) 20
Newsgroups datasets (** and * indicate degraded performance compared to TMSG with p-value < 0.01 and p-value < 0.05, respectively).
Table 3. The representative terms generated by PLSA, ATM, TMBP, and TMSG
models. The terms are vertically ranked according to the probability P (w|z).
        Topic 1 (DB)       Topic 2 (DM)      Topic 3 (IR)           Topic 4 (AI)
PLSA    data, management   data, algorithm   information, learning  learning, knowledge
5 Conclusion
The main contribution of this paper is to show the usefulness of semantic graph
for topic modelling. Our proposed TMSG (Topic Model with Semantic Graph)
supersedes the existing ones since it takes into account both homogeneous networks (i.e., entity-to-entity relations) and heterogeneous networks (i.e., entity-to-document relations), and since it exploits both local and global representations of rich knowledge that go beyond networks and spaces.
Several directions for future work remain. First, TMSG only
relies on one of the simplest latent topic models (namely PLSA), which makes
sense as a first step towards integrating semantic graphs into topic models. In
the future, we will study how to integrate the semantic graph into other topic
modeling algorithms, such as Latent Dirichlet Allocation. Secondly, it would be
also interesting to investigate the performance of our algorithm by varying the
weights of different types of entities.
References
1. Bao, Y., Collier, N., Datta, A.: A partially supervised cross-collection topic model
for cross-domain text classification. In: CIKM 2013, pp. 239–248 (2013)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
3. Cai, L., Zhou, G., Liu, K., Zhao, J.: Large-scale question classification in cqa by
leveraging wikipedia semantic knowledge. In: CIKM 2011, pp. 1321–1330 (2011)
Probabilistic Topic Modelling with Semantic Graph 251
4. Chen, X., Zhou, M., Carin, L.: The contextual focused topic model. In: KDD 2012,
pp. 96–104 (2012)
5. Deng, H., Han, J., Zhao, B., Yintao, Y., Lin, C.X.: Probabilistic topic models with
biased propagation on heterogeneous information networks. In: KDD 2011, pp.
1271–1279 (2011)
6. Guo, W., Diab, M.: Semantic topic models: Combining word distributional statis-
tics and dictionary definitions. In: EMNLP 2011, pp. 552–561 (2011)
7. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1–2), 177–196 (2001)
8. Hong, L., Dom, B., Gurumurthy, S., Tsioutsiouliklis, K.: A time-dependent topic
model for multiple text streams. In: KDD 2011, pp. 832–840 (2011)
9. Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic
labelling using DBpedia. In: WSDM 2013, pp. 465–474 (2013)
10. Kim, H., Sun, Y., Hockenmaier, J., Han, J.: Etm: Entity topic models for mining
documents associated with entities. In: ICDM 2012, pp. 349–358 (2012)
11. Li, F., He, T., Xinhui, T., Xiaohua, H.: Incorporating word correlation into tag-
topic model for semantic knowledge acquisition. In: CIKM 2012, pp. 1622–1626
(2012)
12. Li, H., Li, Z., Lee, W.-C., Lee, D.L.: A probabilistic topic-based ranking framework
for location-sensitive domain information retrieval. In: SIGIR 2009, pp. 331–338
(2009)
13. Mei, Q., Cai, D., Zhang, D., Zhai, C.: Topic modeling with network regularization.
In: WWW 2008, pp. 342–351 (2008)
14. Schuhmacher, M., Ponzetto, S.P.: Knowledge-based graph document modeling. In:
WSDM 2014, pp. 543–552 (2014)
15. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Zhong, S.: Arnetminer: extraction
and mining of academic social networks. In: KDD 2008, pp. 428–437 (2008)
16. Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: SIGIR 2006, pp. 326–335 (2006)
17. Wei, X., Liu, X., Gong, Y.: Document clustering based on non-negative matrix
factorization. In: SIGIR 2003, pp. 267–273 (2003)
Estimating Probability Density of Content
Types for Promoting Medical Records Search
1 Introduction
In general, the disease and symptom in electronic medical records (EMRs) can be
divided into multiple content types: positive (e.g., “with lung cancer”), negative
(e.g., “denies fever”), family history (e.g., “family history of lung cancer”) and
the others [1]. Traditional information retrieval (IR) systems usually treat all the EMRs equally, e.g., through keyword matching, instead of considering the different content types in different situations. For example, consider a query that describes patients
admitted with a diagnosis of dementia. Then, all EMRs including the keyword
“dementia” will be retrieved by a classical IR model such as BM25 [2]. However,
the real intention of the query is to find the patients with dementia, rather than a family member who had dementia or someone who denies dementia.
In order to handle multiple content types, previous work mainly focuses on
the following two ways: (1) transforming the format of the negative content, for
example, changing “denies fever” to “nofever” as one word; (2) directly removing
the negative and family history before indexing the EMRs. However, the above
solutions still fail to promote the influence of the right content types in the search process according to the users' true intent. Therefore, we are motivated
to utilize the content types as our relevance features for medical search.
In this paper, we propose a novel learning approach to promote the perfor-
mance of the medical records search through probability density estimation over
the content types. Specifically, we first present a Bayesian-based classification
method to identify the different contents from the EMR data and queries. Then,
a type-based weighting function is inferred, followed by estimating the kernel
density function to compute the weights of the content types. After that, how
to select the bandwidth of the density estimator is visualized.
We evaluate our approach on the TREC 2011 and 2012 Medical Records
Tracks [3] (the track has been discontinued after 2012). The task of the TREC
Medical Records Tracks requires an IR system to return a list of retrieved EMRs
by the likelihood of satisfying the user’s information desire. The retrieved EMRs
are used to identify the patients who meet the criteria, described by the queries,
for inclusion in possible clinical studies. The evaluation results show that the
proposed approach outperforms the strong baselines.
In the rest of our paper, we briefly present the related work in Sect. 2. Then, we
introduce the proposed approach in Sect. 3, including the identification of the con-
tent types, the definition of the weighting function and the density estimation. After
that, we show the experiments in Sect. 4, followed by the discussion and analysis in
Sect. 5. Finally, we draw the conclusions and describe the future work in Sect. 6.
2 Related Work
2.1 Content Types Processing
In medical records search, some content types, including the negative, history or
experience of family members other than the patients, often have a bad effect on
search performance, when users need the positive contents. Previous researchers
endeavored to solve two problems: (1) how to detect the content types in clinical
text? (2) how to prevent them from damaging the retrieval performance?
For the first problem, most algorithms have been designed based on regular
expression. Chapman et al. [4] developed NegEx which utilized several phrases
to indicate the negative content. NegEx has been widely applied to identify
the negative content. Harkema et al. [5] proposed an algorithm called ConText
which was an extension of NegEx. It was not only used to detect the negative
content, but also the hypothetical, historical and experience of family members.
In this paper, we apply the Bayesian-based classification method to discover the
positive, negative and family history in EMRs and fit each query into one of
them. The characteristic of our method is based on probability statistics rather
than designing the regular expressions as detecting rules.
Averbuch et al. [6] proposed a framework to automatically identify the neg-
ative and positive content and assure the retrieved document in which at least
one keyword appeared in its positive content. Limsopatham et al. [7] proposed
3 Methodology
First, we identify the content types in EMRs and queries by the Bayesian-based
classification method. Then, we use kernel density estimation to obtain the den-
sity functions of the content types in the training set. After that, the weights
of the content types, which utilize the kernel estimators, are incorporated into
BM25 as the types-based weighting function. Finally, the EMRs are ranked by
the weighting function.
Table 1. Examples and numbers of queries in the TREC Medical Records Tracks in
2011 and 2012
Then, four indices are built separately based on the four parts. We implement the
first round of retrieval and extract the relevance scores of a basic model (e.g., BM25) from these indices, referred to as x_1, x_2, x_3, x_4. We use the vector X = (x_1, x_2, x_3, x_4) to
represent the feature of the content types in EMRs, which is a continuous-valued
random variable [14].
Correspondingly, each query is fitted into a type denoted by Cq as well.
We use Cq to represent the feature of the content types in queries, which is a
discrete-valued random variable.
P(R = 1 \mid X, C_q) = \frac{P(X, C_q \mid R = 1)\, P(R = 1)}{P(X, C_q)} \qquad (4)

P(R = 1 \mid X, C_q) \approx \frac{p(X, C_q \mid R = 1)\, P(R = 1)}{p(X, C_q)} \qquad (5)
Note that P (R = 1) is a constant, given the EMRs collection and the query,
having no impact on the ranking. We ignore P(R = 1) in Eq. 5 and define
the weights of the content types as:
f(X, C_q) = \log\!\left( \frac{p(X, C_q \mid R = 1)}{p(X, C_q)} \right) \qquad (6)
The type-based weighting function is then

F(X, C_q) = f_{BM25} \oplus f(X, C_q) \qquad (7)

where f_{BM25} is the relevance score of BM25 and ⊕ denotes that the term f(X, C_q) is added only once for each EMR. F(X, C_q) is used as the relevance ranking function.
\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} \omega_i = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) \qquad (8)
where K(u) is the kernel function, and h is the bandwidth, ωi is the weight of
xi to influence the p̂(x). Empirically, the choice of kernel function has almost no
impact on the estimator. Hence, we choose the common Gaussian kernel, such
that:
K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} u^{2} \right) \qquad (9)
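A minimal sketch of the estimator and of the resulting type weight f(X, C_q) of Eq. 6 is shown below; it is one-dimensional and operates on the relevance-score samples of a single content type and query class, which is a simplifying assumption of this sketch.

```python
import numpy as np

def gaussian_kde_1d(samples, h):
    """Return p_hat(x), the Gaussian-kernel estimator of Eqs. (8)-(9)."""
    samples = np.asarray(samples, dtype=float)
    def p_hat(x):
        u = (x - samples) / h
        return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / h
    return p_hat

def type_weight(x, rel_samples, all_samples, h):
    """f(X, C_q) = log( p(x, C_q | R=1) / p(x, C_q) ), estimated with KDE (Eq. 6)."""
    p_rel = gaussian_kde_1d(rel_samples, h)
    p_all = gaussian_kde_1d(all_samples, h)
    return float(np.log((p_rel(x) + 1e-12) / (p_all(x) + 1e-12)))
```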
Bandwidth Selection. In our work, the asymptotic mean integrated square
error (AMISE) between the estimators p̂(x; h) and actual p(x) is utilized to
obtain the theoretical optimal bandwidth hopt . In general, since actual p(x) is
unknown, it is assumed as Gaussian distribution. AMISE is given as follows:
\mathrm{AMISE}\{\hat{p}(x; h)\} = E\!\left( \int \{\hat{p}(x; h) - p(x)\}^{2}\, dx \right)
= \int \left( \{E\,\hat{p}(x; h) - p(x)\}^{2} + E\{\hat{p}(x; h) - E\,\hat{p}(x; h)\}^{2} \right) dx \qquad (10)
= \int \left[ \bigl(\mathrm{Bias}(\hat{p}(x; h))\bigr)^{2} + \mathrm{Variance}(\hat{p}(x; h)) \right] dx
Note that AMISE is divided into bias and variance. Based on AMISE, hopt
can be calculated by minimizing Eq. 10.
Besides h_opt, we investigate further variations of h_opt: 4·h_opt, 2·h_opt, h_opt/2, h_opt/4, h_opt/8. In Fig. 1, the shapes of the estimators based on different h are shown. It can be seen that h has an obvious impact on the estimators, and the smoothness of the estimators decreases with decreasing h. Empirically, a less smooth estimator indicates a low bias and high variance, and vice versa. Hence,
the bias decreases with decreasing h. Our intuition is that a relatively small
bandwidth, corresponding to an estimator which has low bias, will lead to a
better performance on the TREC medical track data sets. The effect of the
bandwidths is evaluated in our experiments.
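Since p(x) is assumed Gaussian, minimizing Eq. 10 reduces to the familiar Gaussian-reference rule; the sketch below, including the exact constant, is our interpretation rather than necessarily the estimator used in the experiments, and it also produces the scaled variants listed above.

```python
import numpy as np

def gaussian_reference_bandwidth(samples):
    """h_opt = (4 / (3n))^(1/5) * sigma: the AMISE-minimizing bandwidth for a
    Gaussian kernel when the true density is assumed to be Gaussian."""
    samples = np.asarray(samples, dtype=float)
    return (4.0 / (3.0 * len(samples))) ** 0.2 * samples.std(ddof=1)

def bandwidth_variants(samples):
    """The scaled bandwidth variants explored in the experiments."""
    h_opt = gaussian_reference_bandwidth(samples)
    return {name: factor * h_opt for name, factor in
            [("4*h_opt", 4.0), ("2*h_opt", 2.0), ("h_opt", 1.0),
             ("h_opt/2", 0.5), ("h_opt/4", 0.25), ("h_opt/8", 0.125)]}
```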
4 Experiments
In this section, we present a set of experiments to evaluate the performance of our
proposed type-based weighting function on the TREC Medical Records Tracks.
Our evaluation baseline is the BM25 model. Moreover, we compare our approach
with the popular method which removes the negation and family history before
indexing the EMRs, referred as “negation removal” in the experiments.
The data sets in the TREC 2011 and 2012 Medical Tracks are composed of de-
identified medical records from the University of Pittsburgh NLP Repository.
Each medical record is linked to a visit, which is a patient's single stay at a
hospital. The data sets contain 93,551 medical records which are linked to 17,264
visits. Queries were provided by physicians in the Oregon Health & Science
University (OHSU) Biomedical Informatics Graduate Program. Each query is
made up of symptom, diagnosis and treatment, matching a reasonable number
of visits. All EMRs and queries are preprocessed by Porter’s stemming and
standard stopword removal. Terrier is applied for indexing and retrieval1 . We
first implement the initial search to obtain the relevance score of the content
types as the learning features. The final result is achieved by ranking the EMRs
according to Eq. 7.
We use all 35 queries in the 2011 track to obtain the kernel estimators, and all
50 queries in the 2012 track as the testing purpose. The relevance judgments for
the test collection are evaluated on a 3-point scale: not relevant, normal relevant
and highly relevant. In this work, we ignore the different degrees of relevance by regarding the highly relevant as normal relevant in the training data.
Fig. 1. Estimators based on the different bandwidths. The solid line denotes
p̂(x, Cq = 2|R = 1) and the dashed line denotes p̂(x, Cq = 2).
1 http://terrier.org
4.3 Results
5 Discussion
Here we first investigate the influence of the proposed learning approach. Then,
we analyze the effectiveness of removing negative and family history content.
After that, the impact of the bandwidth on the kernel estimators is discussed.
Finally, we show the empirically optimized parameter θ for the bandwidth.
Table 3 displays the results of BM25, negation removal and our proposed app-
roach. Here “negation removal” stands for the solution which removes the neg-
ative and family history content at all. Our best result is selected based on the
tuned parameters and the optimized bandwidth.
The results of the negation removal solution outperform BM25 in terms of P@5 and P@10. This indicates that the keyword match in BM25 is not suitable for the medical task, since BM25 ignores the differences among different content types
in the EMRs. Hence, negation removal obtains better performance when the
negation information is excluded for retrieval.
However, the negation information still has its own influence in the EMRs.
Our approach which specifically identifies negation from the EMRs and queries,
achieves the best results. Hence, we suggest to better make use of the negation
information, instead of removing it arbitrarily.
Table 2. Evaluation results of the type-based weighting function in 2012 medical track.
Figure 2 shows the results based on different bandwidths. The optimal bandwidth
achieves the best results in terms of P@10 and P@15 and the top results in terms of
P@20 and P@30. These observations are consistent with the theoretical result obtained by minimizing Eq. 10.
Table 3. Comparison between the type-based weighting function and the negation
removal in 2012 medical track
of bandwidth, we believe the optimized θ depends on the variety of the data sets.
Therefore, we only suggest the above local optimized parameters on the TREC
2011 and 2012 data sets, instead of the global one for all data sets.
References
1. Koopman, B., Zuccon, G.: Understanding negation and family history to improve
clinical information retrieval. In: Proceedings of the 37th International ACM SIGIR
Conference on Research Development in Information Retrieval, pp. 971–974. ACM
(2014)
Estimating Probability Density of Content Types 263
2. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and
Beyond. Now Publishers Inc, Hanover (2009)
3. Voorhees, E., Tong, R.: Overview of the trec medical records track. In: Proceedings
of TREC 2011 (2011)
4. Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan, B.G.: A
simple algorithm for identifying negated findings and diseases in discharge sum-
maries. J. Biomed. Inform. 34(5), 301–310 (2001)
5. Harkema, H., Dowling, J.N., Thornblade, T., Chapman, W.W.: Context: an algo-
rithm for determining negation, experiencer, and temporal status from clinical
reports. J. Biomed. Inform. 42(5), 839–851 (2009)
6. Averbuch, M., Karson, T., Ben-Ami, B., Maimon, O., Rokach, L.: Context-sensitive
medical information retrieval. In: The 11th World Congress on Medical Informatics
(MEDINFO 2004), San Francisco, CA, pp. 282–286. Citeseer (2004)
7. Limsopatham, N., Macdonald, C., McCreadie, R., Ounis, I.: Exploiting term depen-
dence while handling negation in medical search. In: Proceedings of the 35th Inter-
national ACM SIGIR Conference on Research and Development in Information
Retrieval, pp. 1065–1066. ACM (2012)
8. Karimi, S., Martinez, D., Ghodke, S., Cavedon, L., Suominen, H., Zhang, L.: Search
for medical records: Nicta at trec medical track. In: TREC 2011 (2011)
9. Amini, I., Sanderson, M., Martinez, D., Li, X.: Search for clinical records: rmit
at trec medical track. In: Proceedings of the twentieth Text Retrieval Conference
(TREC 2011). Citeseer (2011)
10. Córdoba, J.M., López, M.J.M., Dı́az, N.P.C., Vázquez, J.M., Aparicio, F., de Bue-
naga Rodrı́guez, M., Glez-Peña, D., Fdez-Riverola, F.: Medical-miner at trec med-
ical records track. In: TREC 2011 (2011)
11. King, B., Wang, L., Provalov, I., Zhou, J.: Cengage learning at trec medical track.
In: TREC 2011 (2011)
12. Limsopatham, N., Macdonald, C., Ounis, I., McDonald, G., Bouamrane, M.: Uni-
versity of glasgow at medical records track: experiments with terrier. In: Proceed-
ings of TREC 2011 (2011)
13. Zhou, X., Huang, J.X., He, B.: Enhancing ad-hoc relevance weighting using prob-
ability density estimation. In: Proceedings of the 34th international ACM SIGIR
conference on Research and development in Information Retrieval, pp. 175–184
(2011)
14. Choi, S., Choi, J.: Exploring effective information retrieval technique for the med-
ical web documents: Snumedinfo at clefehealth task 3. In: Proceedings of the
ShARe/CLEF eHealth Evaluation Lab 2014 (2014)
15. Robertson, S.E.: The probability ranking principle in IR. J. Document. 33, 294–304
(1977)
16. Gijbels, I., Delaigle, A.: Practical bandwidth selection in deconvolution kernel den-
sity estimation. Comput. Stat. Data Anal. 45(2), 249–267 (2004)
17. Raykar, V.C., Duraiswami, R.: Fast optimal bandwidth selection for kernel density estimation. In: Proceedings of the SIAM International Conference on Data Mining (2006)
18. Jones, M.C.: A brief survey of bandwidth selection for density estimation. J. Am.
Stat. Assoc. 91(433), 401–407 (1996)
19. Comaniciu, D.: An algorithm for data-driven bandwidth selection. IEEE Trans.
Pattern Anal. Mach. Intell. 25, 281–288 (2003)
Evaluation Issues
The Curious Incidence of Bias Corrections
in the Pool
1 Introduction
An important issue in Information Retrieval (IR) is the offline evaluation of IR
systems. Since the first Cranfield experiments in the 60s, the evaluation has been
performed with the support of test collections. A test collection is composed of:
a collection of documents, a set of topics, and a set of relevance assessments
for each topic. Ideally, for each topic all the documents of the test collection should be judged, but due to the size of the document collections, and their exponential growth over the years, this practice soon became impractical.
Therefore, already early in the IR history, this problem has been addressed
through the use of the pooling method [11]. The pooling method requires a
set of runs provided by a set of IR systems having as input the collection of
documents and the set of topics. Given these runs, the original pooling method
consists, per topic, of: (1) collecting all the top d retrieved documents from each selected run in a so-called pool; (2) generating relevance judgments for each document in the pool. The benefit of this method is a drastic reduction of the number of documents to be judged, a quantity regulated via the number d of
documents selected. The aim of the pooling method, as pointed out by Spärck
Jones, is to find an unbiased sample of relevant documents [6]. The bias can be
minimized via increasing either the number of topics, or the number of pooled
documents, or the number and variety of IR systems involved in the process.
But albeit the first two are controllable parameters that largely depend on the
budget invested in the creation of the test collection, the third, the number and
This research was partly funded by the Austrian Science Fund (FWF) project num-
ber P25905-N23 (ADmIRE).
2 Related Work
This section is divided into two parts. First we consider the work done in correct-
ing the pool bias for the evaluation measure P@n. Second we consider the work
conducted on the pooling strategies themselves. We will not cover the extensive
effort in creating new metrics that are less sensitive to the pool bias (the work
done for Bpref [2], followed by the work done by Sakai on the condensed lists [9]
or by Yilmaz et al. on the inferred metrics [17,18]).
Pooling was already used in the first TREC, in 1992, 17 years after it was intro-
duced by Spärck Jones and van Rijsbergen [11], in their discussion of building an ‘ideal’ test collection that would allow reusability. The algorithm [5] is described
as follows: (1) divide each set of results into results for a given topic; then, for
each topic: (2) select the top 200 (subsequently generalized to d) ranked docu-
ments of each run, for input to the pool; (3) merge results from all runs; (4) sort
results on document identifiers; (5) remove duplicate documents. This strategy
is known as fixed-depth pool.
With the aim of further reducing the cost of building a test collection, Buckley
and Voorhees [2] explored the uniformly sampled pool. At the time they observed
that P@n had the most rapid deterioration compared to a fully judged pool. The
poor behavior of this strategy for top-heavy metrics was confirmed recently in
Voorhees’s [14] short comparison on pooling methods.
Another strategy is the stratified pool [18], a generalization of both the fixed-
depth pool and the uniformly sampled pool. The stratified pool consists in layer-
ing the pool in different strata based on the highest rank obtained by a document
in any of the given runs.
A comparison of the various pooling strategies has been recently reported by
Voorhees [14]. We complement that report in several directions: First, and most
importantly, we focus on bias correction methods and the effects of the pooling
strategies on them rather than on the metrics themselves. Second, we generalize
the stratified sampling method. Third, we expand the observations from 2 to 12
test collections. We also observe that the previous study does not distinguish
between the effect of the number of documents evaluated with the effect of the
different strategies (see Table 1 in [14]). In our generalization of the stratified
pooling strategy we will ensure that the expected1 number of judged documents
is constant across different strategies.
3 Background Analysis
Here, the pooling method and its strategies are explained. Then the work con-
ducted on the pool bias correction for the evaluation measure P @n is analyzed.
In this section, to simplify the notation, the average P @n over the topics is
denoted by g.
1 Obviously, a guarantee on the actual number of judged documents cannot be provided without an a posteriori change in the sampling rates.
The simplest pooling strategy is Depth@d, which has been already described
above. SampledDepth@d&r uses the Depth@d algorithm as an intermediary
step. It produces a new pool by sampling without replacement from the resulting
set at a given rate r. Obviously, if r = 1 the two strategies are equivalent. The
Stratif ied further generalizes the pooling strategy, introducing the concepts of
stratification and stratum. A stratification is a list of n strata, with sizes s_i and sample rates r_i: z^n = [(s_1, r_1), ..., (s_n, r_n)]. A stratum is a set of documents retrieved by a set of runs on a given range of rank positions. The rank range ρ of the stratum j is: if j = 1 then 1 ≤ ρ ≤ s_1, else if j > 1 then \sum_{i=1}^{j-1} s_i < ρ ≤ \sum_{i=1}^{j} s_i. In this strategy, given a stratification z^n, we distinguish three
phases: (1) pre-pooling: each document of each run is collected in a stratum
based on its rank; (2) purification: for each stratum all the documents found
on a higher rank stratum get removed; (3) sampling: each stratum is sampled
without replacement based on its sample rate. Obviously, when the stratification
is composed by only one stratum, it boils down to SampledDepth@d&r.
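The three phases can be written down compactly; the sketch below assumes runs are ranked lists of document identifiers and uses a seeded random generator for the sampling-without-replacement step.

```python
import random

def stratified_pool(runs, stratification, seed=0):
    """runs: list of ranked lists of doc IDs.
    stratification: list of (size, rate) pairs, e.g. [(10, 1.0), (40, 0.25)]."""
    rng = random.Random(seed)
    # rank boundaries of each stratum
    bounds, upper = [], 0
    for size, _ in stratification:
        bounds.append((upper, upper + size))
        upper += size
    # Phase 1 (pre-pooling): assign each retrieved document to the stratum of its rank
    strata = [set() for _ in stratification]
    for run in runs:
        for rank, doc in enumerate(run[:upper]):
            for j, (lo, hi) in enumerate(bounds):
                if lo <= rank < hi:
                    strata[j].add(doc)
                    break
    # Phase 2 (purification): drop documents already present in a higher stratum
    seen = set()
    for j in range(len(strata)):
        strata[j] -= seen
        seen |= strata[j]
    # Phase 3 (sampling): sample each stratum without replacement at its rate
    pool = set()
    for (size, rate), stratum in zip(stratification, strata):
        k = round(rate * len(stratum))
        pool |= set(rng.sample(sorted(stratum), k))
    return pool
```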
Which strategy to choose is not clear and sometimes it depends on the domain
of study. Generally, the Depth@d is preferred because of its widespread use in
the IR community, but for recall oriented domains the Stratif ied is preferred
because of its ability to go deeper in the pool without explosively increasing
the number of documents to be judged. The SampledDepth@d&r is generally
neglected due to its lack in ability to confidently compare the performance of
two systems, especially when used with top-heavy evaluation measures.
The main factor under the control of the test collection builder is the num-
ber of judged documents. This number depends both on the number of pooled
runs and on the minimum number of judged documents per run. The following
inequality shows the relation between these two components:
g(r, Q_{d+1}^{R_p}) - g(r, Q_d^{R_p}) \;\ge\; g(r, Q_{d+1}^{R_p \setminus \{r_p\}}) - g(r, Q_d^{R_p \setminus \{r_p\}}) \qquad (1)
where r is a run, Rp is the set of runs used on the construction of the pool Q,
rp ∈ Rp , d is the minimum number of documents judged per run, and g(r, Q) is
the score of the run r evaluated on the pool Q. The proof is evident if we observe
that: Q_d^{R_p} \subseteq Q_{d+1}^{R_p}, Q_{d+1}^{R_p \setminus \{r_p\}} \subseteq Q_{d+1}^{R_p}, Q_d^{R_p \setminus \{r_p\}} \subseteq Q_{d+1}^{R_p \setminus \{r_p\}} and Q_d^{R_p \setminus \{r_p\}} \subseteq Q_d^{R_p}. When r_p = r, the inequality (Eq. 1) defines the reduced pool bias. In general
however it shows that the bias is influenced by d, the minimum number of judged
documents per run, and by |Rp | the number of runs.
Herein, we analyze the two pool bias correctors. Both attempt to calculate a
coefficient of correction that is added to the biased score.
Webber and Park [16] present a method for the correction that computes the
error introduced by the pooling method when one of the pooled runs is removed.
This value is computed for each pooled run using a leave-one-out approach and
then averaged and used as correction coefficient. Their correction coefficient for
a run r_s \notin R_p is the expectation:

E_{r_p \in R_p}\left[ g(r_p, Q_d^{R_p}) - g(r_p, Q_d^{R_p \setminus \{r_p\}}) \right] \qquad (2)
where R_p is the set of pooled runs, r_p \in R_p, and Q_d^{R_p} is a pool constructed with d documents per each run in R_p. As done in a previous study [7] we evaluate the
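The coefficient of Eq. 2 amounts to a leave-one-out loop over the pooled runs; build_pool and evaluate below are placeholders for the pooling strategy (e.g., Depth@d) and for P@n evaluation, and are assumptions of this sketch.

```python
def webber_correction(pooled_runs, d, build_pool, evaluate):
    """Eq. 2: average, over pooled runs, of the score drop a run suffers when it
    is left out of the pool.
    build_pool(runs, d) -> set of judged documents (e.g. a Depth@d pool)
    evaluate(run, pool) -> effectiveness score of `run` judged against `pool`."""
    drops = []
    full_pool = build_pool(pooled_runs, d)
    for i, r_p in enumerate(pooled_runs):
        loo_pool = build_pool(pooled_runs[:i] + pooled_runs[i + 1:], d)
        drops.append(evaluate(r_p, full_pool) - evaluate(r_p, loo_pool))
    return sum(drops) / len(drops)

# The corrected score of an unpooled run r_s is then g(r_s, Q) + webber_correction(...)
```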
method using the mean absolute error (MAE). Equation 2 is simple enough that
we can attempt to analytically observe how the method behaves with respect to
the reduced pool, in the context of a Depth@d pool at varying d. We identify
analytically a theoretical limitation of the Webber approach when used with a
Depth@d. The maximum benefit, in expectation, is obtained when the cut-off
value of the precision (n) is less than or equal to d. After this threshold the benefit is
lost.
We start analyzing the absolute error (AE) of the Webber approach for a
run rs :
\left| g(r_s, G) - \left( g(r_s, Q_d^{R_p}) + E_{r_p \in R_p}\left[ g(r_p, Q_d^{R_p}) - g(r_p, Q_d^{R_p \setminus \{r_p\}}) \right] \right) \right|

where G is the ground truth, Q_d^{R_p} is the pool constructed using a Depth@d strategy where d is its depth, and R_p is the set of pooled runs. We compare it to the absolute error of the reduced pool:

\left| g(r_s, G) - g(r_s, Q_d^{R_p}) \right| \qquad (3)
We observe that when the depth of the pool d becomes greater than or equal to n, g(r_p, Q_d^{R_p}) becomes constant. For the sake of clarity we substitute it with C_n. We substitute g(r_s, G), which is also a constant, with C_G. Finally, we also rename the components a(d) = g(r_s, Q_d^{R_p}) and b(d) = E_{r_p \in R_p}[g(r_p, Q_d^{R_p \setminus \{r_p\}})], and call f(d) the AE of the Webber method and h(d) the AE of the reduced pool:

f(d) = \left| C_G - [a(d) + C_n - b(d)] \right| \qquad (4)

h(d) = \left| C_G - a(d) \right| \qquad (5)
Therefore,
\dot{f}(d) \ge \dot{h}(d) \quad \text{iff} \quad \begin{cases} \dot{b}(d) \ge 0, & \text{if } C_G - [a(d) + C_n - b(d)] \ge 0 \\ 2\dot{a}(d) \ge \dot{b}(d), & \text{if } C_G - [a(d) + C_n - b(d)] < 0 \end{cases}
While the first condition is always verified (ḃ(d) is an average of positive quanti-
ties), the second tells us that if ḃ(d) is less than or equal to 2ȧ(d) the Webber method
decreases more slowly than the reduced pool. This inequality, as a function of rs
does not say anything about its behavior as it can be different for each rs . There-
fore we study the MAE using its expectation. We define RG as the set of runs of
the ground truth G, in which Rp ⊂ RG . Using the law of total expectation we
can write:
E_{r_s \in R_G}[\dot{b}(d)] = E_{r_s \in R_G}\!\left[ E_{r_p \in R_G \setminus \{r_s\}}\!\left[ g(r_p, Q_{d+1}^{R_G \setminus \{r_s, r_p\}}) - g(r_p, Q_d^{R_G \setminus \{r_s, r_p\}}) \right] \right] = E_{r_{s_1}, r_{s_2} \in R_G : r_{s_1} \neq r_{s_2}}\!\left[ g(r_{s_1}, Q_{d+1}^{R_G \setminus \{r_{s_1}, r_{s_2}\}}) - g(r_{s_1}, Q_d^{R_G \setminus \{r_{s_1}, r_{s_2}\}}) \right] \qquad (6)
We observe that there exists a confounding factor that is the proportion of judged
relevant to non-relevant documents. Assuming that all runs are ranked by some
probability of relevance, i.e. that there is a higher probability to find relevant
documents at the top than at the bottom of the runs, our approach (Lipani)
is sensitive to the depth of the pool because at any one moment it compares
one run, that is a set of d probably relevant documents and |rs | − d probably
non-relevant documents with all the existing runs, i.e. a set of d|Rp | probably
relevant documents and (E [|rp |] − d)|Rp | probably non-relevant documents. The
effects of this aggregation are difficult to formalize in terms of the proportion
of relevant and non-relevant documents, and we explore them experimentally in
the next section.
4 Experiments
To observe how the pool and the two pool bias correctors work in different con-
texts we used a set of 12 TREC test collections, sampled from different tracks:
8 from Ad Hoc, 2 from Web, 1 from Robust and 1 from Genomics. In order to
make possible the simulation of the different pooling strategies, the test collec-
tions needed to have been built using a Depth@d strategy with depth d ≥ 50.
For Depth@d and SampledDepth@d&r, all the possible combinations of
parameters with a step size of 10 have been explored. Figure 1 shows the MAE
of the different methods, for Depth@d at varying d. Figure 2 shows the MAE of
the different methods for the SampledDepth@d&r, with fixed depth d = 50, at
varying sample rate r from 10 % to 90 % in steps of 10.
For Stratif ied, due to its more flexible nature, we constrained the generation
of the stratifications. We should note that there are practically no guidelines in
the literature on how to define the strata. First, we defined the sizes of the
strata for each possible stratification and then for each stratification we defined
the sample rates of each stratum.
Given n, the number of strata to generate, and s ∈ S, a possible stratum size, we find all the vectors of size n, s^n = (s_1\ s_2\ \ldots\ s_n), such that \sum_{i=1}^{n} s_i = D, where D is the maximum depth of the pool available, with sizes s_i chosen in increasing order, except for the last stratum, which may be a residual smaller than the second-last. For each n \in \mathbb{N}^+, and constraining the set of stratum sizes S to multiples of 10, when D = 100, we find only ten possible solutions.
To find the sample rates ri to associate to each stratum si , we followed a
more elaborated procedure. As pointed out by Voorhees [14], the best results
are obtained fixing the sample rate of the first stratum to 100 %. From the
second to the last stratum, when available, we sample keeping the expected
minimal number of pooled documents for each run constant. This is done in
order to allow a cross-comparison among the stratifications. However, for strat-
ifications composed by 3 or more strata some other constraint is required. The
TREC practice has shown that the sampling rate decreases fast, but so far
decisions in this sense are very ad-hoc. Trying to understand how fast the rate
should drop, we are led back to studies relating retrieval status values (RSV),
i.e. scores, with probabilities of relevance. Intuitively, we would want our sam-
pling rate to be related to the latter. Nottelmann and Fuhr [8] pointed out
that mapping the RSV to the probability of relevance using a logistic function
outperforms the mapping when a linear function is used. Therefore, to create
the sampling rates, we define a logistic function with parameters b1 = 10/D,
b0 = D/2 where D is the depth of the original pool (i.e. of the ground truth).
b1 defines the slope of the logistic function and is in this case arbitrary. b0
is the minimal number of documents we want, on expectation, to assess per
run. The sample rates are then the areas under the logistic curve for each stra-
tum (Eq. 9). However, since, to keep in line with practice, we always force the
first strata to sample at 100 %, we correct the remaining sampling rates pro-
portionally (Eq. 10). To verify that the expected minimal number of sampled
documents is b0 , it is enough to observe that the sum of the areas that define
the sampling rates is b0 (Eq. 8). The resulting stratifications are listed in Table 1
and the corresponding MAEs for the different methods are shown in Fig. 3.
\int_{0}^{D} \frac{1}{1 + e^{b_1 (x - b_0)}}\, dx = b_0 \qquad (8)
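Under our reading of Eqs. 8–10, each stratum's raw sample rate is the average height of the logistic curve over the stratum, the first stratum is then forced to 100 %, and the remaining rates are rescaled so that the expected number of judged documents per run stays b0. The sketch below uses numerical integration and should be read as an approximation of the procedure described.

```python
from math import exp
from scipy.integrate import quad

def stratification_rates(sizes):
    """sizes: stratum sizes summing to D, the depth of the original pool."""
    D = sum(sizes)
    b1, b0 = 10.0 / D, D / 2.0
    logistic = lambda x: 1.0 / (1.0 + exp(b1 * (x - b0)))
    # raw rate of each stratum: area of the logistic curve over it, divided by its size
    rates, lo = [], 0.0
    for s in sizes:
        area, _ = quad(logistic, lo, lo + s)
        rates.append(area / s)
        lo += s
    rates[0] = 1.0                       # always sample the first stratum fully
    # rescale the remaining rates so that sum_i s_i * r_i stays equal to b0
    remaining = sum(s * r for s, r in zip(sizes[1:], rates[1:]))
    scale = (b0 - sizes[0]) / remaining if remaining > 0 else 0.0
    return [rates[0]] + [r * scale for r in rates[1:]]

# Example: a two-strata stratification with D = 100
print(stratification_rates([10, 90]))
```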
Table 1. List of the used stratifications z_i^n, where n is the size of the stratification, and i is the index of the solution found given the fixed constraints. E[d] is the mean number of judged documents per run for all the test collections with respect to z_1^1. † indicates
when the difference with respect to the previous stratification is statistically significant
(t-test, p < 0.05).
To measure the bias of the reduced pool and the two correcting approaches,
we run a simulation3 using only the pooled runs and a leave-one-organization-out
approach, as done in previous studies [3,7]. The leave-one-organization-out app-
roach consists in rebuilding the pool removing in sequence all the runs submitted
3 The software is available on the website of the first author.
5 Discussion
In case of Depth@d in Fig. 1 we observe that, as expected given the analytical
observations of Sect. 3.2, the Webber approach slows down its correction with
increasing depth d. The ratio between the error produced by the reduced pool
and the method decreases systematically after d becomes greater than the cut-off
value n of P@n. This trend sometimes leads to an inversion, as for Ad Hoc 2,
3, 6, 7 and 8, Web 9 and 10 and Robust 14. The Lipani approach, as expected,
is less reliable when the depth d of the pool is less than the cut-off value of the
precision. We see that very clearly in Web 9 and Web 11. It generally reaches
a peak when d and n are equal, and then improves again. Comparing the two
approaches, we see that in the majority of the cases the Lipani approach does
better than the Webber approach.
The SampledDepth@d&r strategy, shown in Fig. 2, does not display the same effects observed for Depth@d. Both corrections do better than the reduced pool. The effect in Webber's method disappears because in this case (and also in the Stratified pool later) the constant C_n in Eq. 4 is no longer a constant, even for n > d. The effect observed in Lipani's method is removed by the additional non-relevant documents introduced at the top of the pooled runs, which reduce the influence of the selected run. The Lipani approach generally does better, sometimes by a large margin, as in Ad Hoc 5 and 8, Web 9, 10 and 11, and Genomics 14.
Finally, in the Stratified case (Fig. 3), the effects observed for Depth@d are also not visible. For P@10, the corrections perform much better if we sample more from the top, most notably for the stratifications of size 3, but the correction degrades when using P@30. This is particularly visible when comparing z_3^3 and z_4^3. Although they have essentially the same number of judged documents (the difference is not statistically significant, Table 1), the stratification with a deeper first stratum makes a big difference in performance. Comparing the best stratification of size 2 (z_4^2) and the best stratification of size 3 (z_1^3), we observe only a small difference in performance between them, which could be justified by the smaller number of judged documents (the difference is statistically significant, Table 1). z_4^2 is the best overall stratification, confirming also the conclusion of Voorhees [14]. However, a cheaper solution is z_1^3, which, as shown in Table 1, evaluates fewer documents but obtains a comparably low MAE.
Cross-comparing the three pooling strategies (observe the ranges on the y-scales), we see that, fixing the number of judged documents, the best performing strategy is Depth@d, followed by Stratified and then SampledDepth@d&r.
Fig. 1. MAE, in logarithmic scale, of the ground truth (Depth@M, where M is the maximum depth of the test collection) against the Depth@d pool as the depth n varies, for the evaluation measures P@10 and P@30. MAE computed using the leave-one-organization-out approach on the pooled runs, after removing the bottom 25 % of poorly performing runs.
[Fig. 2: panels Ad Hoc 8, Web 9, Web 10, Web 11, Genomics 14, Robust 14; x-axis: sample rate r; curves for P@10 and P@30.]
Fig. 2. MAE of the ground truth (Depth@M, where M is the maximum depth of the test collection) against the SampledDepth@d&r pool with fixed d = 50 as the sample rate r varies, for the evaluation measures P@10 and P@30. MAE computed using the leave-one-organization-out approach on the pooled runs, after removing the bottom 25 % of poorly performing runs.
[Fig. 3: panels include Web 9 and Web 10; legend: Lipani, Reduced Pool, Webber; x-axis: stratification (z_1^2, z_2^2, z_3^2, z_4^2, z_1^3, z_2^3, z_3^3, z_4^3, z_1^4); curves for P@10 and P@30.]
Fig. 3. MAE of the ground truth (Depth@M, where M is the maximum depth of the test collection) against the Stratified pool for the different stratifications, for the evaluation measures P@10 and P@30, only on the test collections originally built using the Depth@100 pooling strategy. MAE computed using the leave-one-organization-out approach on the pooled runs, after removing the bottom 25 % of poorly performing runs.
6 Conclusion
We have confirmed the finding of a previous study [7] that the Lipani approach to pool bias correction outperforms the Webber approach, and we have extended these observations to various pooling strategies. We have also partially confirmed another previous study indicating that Stratified pooling with a heavy top is preferable [14]. We have extended this by showing that, in terms of pool bias, the pooling strategies are, in order of performance, Depth@d, Stratified, and SampledDepth@d&r. Additionally, we made two significant observations on the two existing pool bias correction methods. We have shown, analytically and experimentally, that the Webber approach reduces its ability to correct the runs at increasing pool depth, when this is greater than the cut-off of the measured precision. Conversely, the Lipani approach sometimes manifests an instability
when the depth of the pool is smaller than the cut-off of the measured precision. These opposite behaviors would make the Lipani estimator the better choice, since it improves with an increasing number of judged documents. Both of these side effects are reduced when a sampled strategy is used.
References
1. Bodoff, D., Li, P.: Test theory for assessing IR test collections. In: Proceedings of
SIGIR (2007)
2. Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In:
Proceedings of SIGIR (2004)
3. Büttcher, S., Clarke, C.L.A., Yeung, P.C.K., Soboroff, I.: Reliable information
retrieval evaluation with incomplete and biased judgements. In: Proceedings of
SIGIR (2007)
4. Clarke, C.L.A., Smucker, M.D.: Time well spent. In: Proceedings of IIiX (2014)
5. Harman, D.: Overview of the first TREC conference. In: Proceedings of SIGIR (1993)
6. Jones, K.S.: Letter to the editor. Inf. Process. Manage. 39(1), 156–159 (2003)
7. Lipani, A., Lupu, M., Hanbury, A.: Splitting water: precision and anti-precision to
reduce pool bias. In: Proceedings of SIGIR (2015)
8. Nottelmann, H., Fuhr, N.: From retrieval status values to probabilities of relevance
for advanced IR applications. Inf. Retr. 6(3–4), 363–388 (2003)
9. Sakai, T.: Alternatives to bpref. In: Proceedings of SIGIR (2007)
10. Sanderson, M., Zobel, J.: Information retrieval system evaluation: effort, sensitivity,
and reliability. In: Proceedings of SIGIR (2005)
11. Jones, K.S., van Rijsbergen, C.J.: Report on the need for and provision of an “ideal”
information retrieval test collection. British Library Research and Development
Report No. 5266 (1975)
12. Urbano, J., Marrero, M., Martín, D.: On the measurement of test collection relia-
bility. In: Proceedings of SIGIR (2013)
13. Voorhees, E.M.: Topic set size redux. In: Proceedings of SIGIR (2009)
14. Voorhees, E.M.: The effect of sampling strategy on inferred measures. In: Proceed-
ings of SIGIR (2014)
15. Voorhees, E.M., Buckley, C.: The effect of topic set size on retrieval experiment
error. In: Proceedings of SIGIR (2002)
16. Webber, W., Park, L.A.F.: Score adjustment for correction of pooling bias. In:
Proceedings of SIGIR (2009)
17. Yilmaz, E., Aslam, J.A.: Estimating average precision with incomplete and imper-
fect judgments. In: Proceedings of CIKM (2006)
18. Yilmaz, E., Kanoulas, E., Aslam, J.A.: A simple and efficient sampling method for
estimating AP and NDCG. In: Proceedings of SIGIR (2008)
Understandability Biased Evaluation
for Information Retrieval
Guido Zuccon(B)
1 Introduction
Traditional information retrieval (IR) evaluation relies on the assessment of top-
ical relevance: a document is topically relevant to a query if it is assessed to be
on the topic expressed by the query. The Cranfield paradigm and its subsequent
incarnations into many of the TREC, CLEF, NTCIR or FIRE evaluation cam-
paigns have used this notion of relevance, as reflected by the collected relevance
assessments and the retrieval systems evaluation measures, e.g., precision and
average precision, recall, bpref, RBP, and graded measures such as discounted
cumulative gain (DCG) and expected reciprocal rank (ERR).
Relevance is a complex concept and the nature of relevance has been widely
studied [16]. A shared agreement has emerged that relevance is a multidimen-
sional concept, with topicality being only one of the factors (or criteria) influenc-
ing the relevance of a document to a query [8,28]. Among others, core factors that
influence relevance beyond topicality are: scope, novelty, reliability and under-
standability [28]. However, these factors are often not reflected in the evaluation
framework used to measure the effectiveness of retrieval systems.
In this paper, we aim to develop a general evaluation framework for informa-
tion retrieval that extends the existing one by considering the multidimensional
2 Related Work
Research on document relevance has shown that users’ relevance assessments are
affected by a number of factors beyond topicality, although topicality has been
found to be the essential relevance criterion. Chamber and Eisenberg have syn-
thesised four families of approaches for modelling relevance, highlighting its mul-
tidimensional nature [24]. Cosijn and Ingwersen investigated manifestations of
relevance such as algorithmic, topical, cognitive, situational and socio-cognitive,
and identified relation, intention, context, inference and interaction as the key
attributes of relevance [8]. Note that relevance manifestations and attributes in
that work are different from what we refer to as factors of relevance in this paper.
Similarly, the dimensions described by Saracevic [23], which are related to those
of Cosijn and Ingwersen mentioned above, differ in nature from the factors or
dimensions of relevance we consider in this paper.
The actual factors that influence relevance vary across studies. Rees and
Schulz [20] and Cuadra and Katter [9] identified 40 and 38 factors respectively.
Xu and Chen proposed and validated a five-factor model of relevance which con-
sists of novelty, reliability, understandability, scope, along with topicality [28].
Zhang et al. have further validated this model [33]. Their empirical findings
highlight the importance of understandability, reliability and novelty along with
topicality in the relevance judgements they collected. Barry also explored fac-
tors of relevance beyond topicality [2]; of relevance to this work is that these
user experiments highlighted that criteria pertaining to user’s experience and
3 Gain-Discount Framework
M = \frac{1}{N} \sum_{k=1}^{K} d(k)\, g(d@k) \qquad (1)
where g(d@k) and d(k) are respectively the gain function computed for the
(relevance of the) document at rank k and the discount function computed for
the rank k, K is the depth of assessment at which the measure is evaluated, and
1/N is an (optional) normalisation factor, which serves to bound the value of the sum to the range [0, 1] (see also [25]).
Without loss of generality, we can express the gain provided by a document
at rank k as a function of its probability of relevance; for simplicity we shall
write g(d@k) = f (P (R|d@k)), where P (R|d@k) is the probability of relevance
given the document at k. A similar form has been used for the definition of
the gain function for time-biased evaluation measures [25]. Measures like RBP,
nDCG and ERR can still be modelled in this context, where their differences
with respect to g(d@k) are reflected in different f(.) functions being applied to the
estimations of P (R|d@k).
Different measures within the gain-discount framework use different functions for computing gains and discounts. Often in RBP the gain function is binary-valued¹ (i.e., g(d@k) = 1 if the document at k is relevant, g(d@k) = 0 otherwise); for nDCG, g(d@k) = 2^{P(R|d@k)} − 1, and for ERR², g(d@k) = (2^{P(R|d@k)} − 1)/2^{max(P(R|d))}. The discount function in RBP is modelled by d(k) = ρ^{k−1}, where ρ ∈ [0, 1] reflects user behaviour³; in nDCG the discount function is given by d(k) = 1/log_2(1 + k) and in ERR by d(k) = 1/k.
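For concreteness, a minimal sketch of Eq. 1 with the gain and discount functions listed above. The normalisation factor N is measure-specific (e.g., N = 1/(1 − ρ) recovers RBP, while nDCG uses the ideal DCG) and is left as a parameter; the example ranking is invented.

```python
import math

def gain_discount_metric(rels, gain, discount, K=None, N=1.0):
    """M = (1/N) * sum_{k=1..K} d(k) * g(d@k), cf. Eq. 1.

    rels holds the P(R|d@k) values (binary or graded) of the ranked documents.
    """
    K = K or len(rels)
    return sum(discount(k) * gain(rels[k - 1])
               for k in range(1, min(K, len(rels)) + 1)) / N

# gain/discount pairs from the text
rbp_gain  = lambda p: 1.0 if p > 0 else 0.0          # binary gain
rbp_disc  = lambda k, rho=0.8: rho ** (k - 1)
ndcg_gain = lambda p: 2 ** p - 1
ndcg_disc = lambda k: 1.0 / math.log2(1 + k)
err_gain  = lambda p, gmax=1: (2 ** p - 1) / 2 ** gmax
err_disc  = lambda k: 1.0 / k

ranking = [1, 0, 1, 0, 0]                            # hypothetical binary relevance
print(gain_discount_metric(ranking, rbp_gain, rbp_disc, N=1 / (1 - 0.8)))   # RBP, rho = 0.8
```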
When only the topical dimension of relevance is modelled, as it is in most retrieval system evaluations, then P(R|d@k) = P(T|d@k), i.e., the probability
that the document at k is topically relevant (to a query). This probability is 1 for
relevant and 0 for non-relevant documents, when considering binary relevance;
it can be seen as the values of the corresponding relevance levels when applied
to graded relevance.
4 Integrating Understandability
To integrate different dimensions of relevance in evaluation measures, we model
the probability of relevance P (R|d@k) as the joint distribution over all consid-
ered dimensions P (D1 , · · · , Dn |d@k), where each Di represents a dimension of
relevance, e.g., topicality, understandability, reliability, etc.
To compute the joint probability we assume that the dimensions are compositional events and their probabilities independent, i.e., P(D_1, ..., D_n|d@k) = \prod_{i=1}^{n} P(D_i|d@k). These are strong assumptions and are not always true. Eisen-
berg and Barry [10] highlighted that user judgements of document relevance
are affected by order relationships, and proposals to model these dynamics have
recently emerged, for example see Bruza et al. [3]. Nevertheless, Zhang et al.
used crowdsourcing to prime a psychometric framework for multidimensional
relevance modelling, where the relevance dimensions are assumed compositional
and independent [33]. While the above assumptions are unrealistic and somewhat
limiting, note that similar assumptions are common in information retrieval.
For example, the Probability Ranking Principle assumes that relevance assess-
ments are independent [21].
Following the assumptions above, the gain function with respect to different
dimensions of relevance can be expressed in the gain-discount framework as:
g(d@k) = f\big(P(R|d@k)\big) = f\big(P(D_1, \cdots, D_n|d@k)\big) = f\Big(\prod_{i=1}^{n} P(D_i|d@k)\Big)
¹ Although there is no requirement for this to be the case and RBP can be used for graded relevance [17].
² Where P(R|d@k) captures either binary (P(R|d@k) either 0 or 1) or graded relevance, and max(P(R|d)) is the highest relevance grade, e.g., 1 in the case of binary relevance.
³ High values representing persistent users, low values representing impatient users.
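A minimal sketch of this multidimensional gain under the independence assumption; the identity is used for f, which matches the instantiation g(k) = r(k)u(k) adopted later for uRBP. The dimension values are invented.

```python
def multidimensional_gain(dim_probs, f=lambda p: p):
    """Gain of a document from per-dimension relevance probabilities P(D_i|d@k),
    assuming the dimensions are independent (their joint probability is the product)."""
    joint = 1.0
    for p in dim_probs:
        joint *= p
    return f(joint)

# e.g., a topically relevant but only partially understandable document
print(multidimensional_gain([1.0, 0.4]))   # topicality = 1.0, understandability = 0.4
```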
using this collection because real-world tasks within consumer health search often
require that the retrieved information can be understood by cohorts of users with
different experience and understanding of health information [1,12,26,27,35].
Indeed, health literacy (the knowledge and understanding of health informa-
tion) has been shown to be a critical factor influencing the value of information
consumers acquire through search engines [11].
Along with the queries, we also obtain the runs that were originally submitted to the relevant tasks at CLEF 2013–2015 [12,13,18]⁴. Both the simulations and the analysis with real user assessments focus on the changes in system rankings obtained when evaluating using standard RBP and its understandability variants (uRBP and uRBPgr). System rankings are compared using Kendall rank correlation (τ) and AP correlation (τAP) [30], which assigns higher weights to changes that affect top-performing systems.
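As a sketch of how the two rank-correlation coefficients can be computed, the snippet below uses SciPy for Kendall's τ and a small hand-rolled τAP following its published definition [30] (τAP is asymmetric: one ranking is treated as the reference). The per-system scores are invented for illustration.

```python
from scipy.stats import kendalltau

def tau_ap(true_scores, estimated_scores):
    """AP rank correlation: walk down the ranking induced by `estimated_scores`
    and, at each rank, count how many of the systems placed above it are also
    above it in the reference ranking induced by `true_scores` (no ties handled)."""
    order = sorted(range(len(estimated_scores)),
                   key=lambda i: estimated_scores[i], reverse=True)
    ref = [true_scores[i] for i in order]
    total = sum(sum(ref[j] > ref[pos] for j in range(pos)) / pos
                for pos in range(1, len(ref)))
    return 2.0 * total / (len(ref) - 1) - 1.0

rbp  = [0.41, 0.38, 0.35, 0.30, 0.22]     # hypothetical per-system RBP scores
urbp = [0.33, 0.36, 0.31, 0.24, 0.20]     # hypothetical per-system uRBP scores
tau, _ = kendalltau(rbp, urbp)
print(tau, tau_ap(rbp, urbp))
```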
In all our experiments the RBP parameter ρ which models user behaviour
(RBP persistence parameter) was set to 0.8 for all variants of this measure,
following the findings of Zhang et al. [32].
where, for simplicity of notation, u1 (k) indicates the value of P1 (U |d@k) and
r(k) is the (topical) relevance assessment of document k (alternatively, the value
of P (T |d@k)); thus g(k) = f (P (T |d@k)P1 (U |d@k)) = P (T |d@k)P1 (U |d@k) =
r(k)u1 (k).
In the second user model, the probability estimation P2 (U |d@k) is similar
to the previous step function, but it is smoothed in the surroundings of the
threshold value. This provides a more realistic transition between understandable
and non-understandable information. This behaviour is achieved by the following
estimation:
P_2(U|d@k) \propto \frac{1}{2} - \frac{\arctan\big(FOG(d@k) - th\big)}{\pi} \qquad (7)
where arctan is the arctangent trigonometric function and FOG(d@k) is the FOG readability score of the document at rank k. (Other readability scores could be used instead of FOG.) Equation 7 is not a probability distribution per se, but one such distribution can be obtained by normalising Eq. 7 by its integral over [min(FOG(d@k)), max(FOG(d@k))]. However, Eq. 7 is rank-equivalent to such a distribution, and thus does not change its effect on uRBP. These settings lead to the formulation of a second simulated variant of uRBP, uRBP2, which is based on this second user model and is obtained by substituting u2(k) = P2(U|d@k) for u1(k) in Eq. 6.
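A minimal sketch of the second user model: the arctan-smoothed understandability estimate of Eq. 7 used as u2(k) inside an RBP-style accumulation of g(k) = r(k)u2(k) (the uRBP formulation of Eq. 6 is assumed here). The relevance labels and FOG scores are invented.

```python
import math

def u2(fog, th):
    """Smoothed understandability estimate of Eq. 7: a step at FOG == th softened
    by an arctangent; values lie in (0, 1), and the omitted normalisation is
    rank-preserving, so it does not change the effect on uRBP."""
    return 0.5 - math.atan(fog - th) / math.pi

def urbp2(rels, fog_scores, th=15, rho=0.8):
    """uRBP with the second user model: gain r(k) * u2(k), RBP discount rho^(k-1)."""
    return (1 - rho) * sum(r * u2(f, th) * rho ** k
                           for k, (r, f) in enumerate(zip(rels, fog_scores)))

# hypothetical topical relevance and FOG readability scores of a ranked list
print(urbp2(rels=[1, 1, 0, 1], fog_scores=[9.5, 16.0, 12.0, 21.0], th=15))
```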
while documents with FOG scores above 15 and 20 increasingly restrict the audi-
ence able to understand the text. We performed simple cleansing of the HTML
pages, although more careful pre-processing may be more appropriate [19].
In the following we report the results observed using the CLEF eHealth 2013
topics and assessments. Figure 1 reports RBP vs. uRBP for the 2013 systems.
Table 1 reports the values of Kendall rank correlation (τ ) and AP correlation
(τAP ) between system rankings obtained with RBP and uRBP.
Higher values of th produce higher correlations between the system rankings obtained with RBP and uRBP, regardless of the user model used in uRBP (Table 1). This is expected: the higher the threshold, the more documents will have P(U|d@k) = 1 (or ≈ 1 for uRBP2), in which case uRBP degenerates to RBP. Overall, uRBP2 is more strongly correlated with RBP than uRBP1 is. This is because of the smoothing effect provided by the arctan function, which increases the number of documents for which P(U|d@k) is not zero despite their readability score being above th. This in turn narrows the scope for differences between the system rankings. These observations are confirmed in Fig. 1, where only a few changes in the rank of systems are shown for th = 20 (× in Fig. 1), with more changes found for th = 10 (◦) and th = 15 (+).
The simulations reported in Fig. 1 demonstrate the impact of understand-
ability in the evaluation of systems for the considered task. The system
ranked highest according to RBP (MEDINFO.1.3.noadd) is second to a num-
ber of systems according to uRBP if user understandability up to FOG level 15 is required. Similarly, the highest uRBP1 for th = 10 is achieved by
Fig. 1. RBP vs. uRBP for CLEF eHealth 2013 systems (left: uRBP1 ; right: uRBP2 )
at varying values of readability threshold (th = 10, 15, 20).
Table 1. Correlation (τ and τAP ) between system rankings obtained with RBP and
uRBP1 or uRBP2 for different values of readability threshold on CLEF eHealth 2013.
              th = 10         th = 15         th = 20
RBP vs uRBP1  τ = .1277       τ = .5603       τ = .9574
              τAP = −.0255    τAP = .2746     τAP = .9261
RBP vs uRBP2  τ = .5887       τ = .6791       τ = .9574
              τAP = .2877     τAP = .4102     τAP = .9407
UTHealthCCB.1.3.noadd, which is ranked 28th according to RBP, and for
th = 15 by teamAEHRC.6.3, which is ranked 19th according to RBP and achieves
the highest uRBP2 for th = 10, 15.
We repeated the same simulations for the 2014 and 2015 tasks. While we omit the full results here due to space constraints, we do report in Table 2 the results of the simulations for the first user model tested on the 2015 task, so that these values can be directly compared to those obtained using the real assessments (Sect. 6). The trends observed in these results are similar to those reported for the 2013 data (and for the 2014 data), i.e., the higher the threshold th, the larger the correlation between RBP and uRBP becomes. However, larger absolute correlation values between RBP and uRBP1 are found when using the 2015 data, compared to the correlations reported in Table 1 for the 2013 task. The full set of results, including high-resolution plots, is made available at http://github.com/ielab/ecir2016-UnderstandabilityBiasedEvaluation.
Table 2. Correlation (τ and τAP ) between system rankings obtained with RBP and
uRBP1 for different values of the readability threshold on CLEF eHealth 2015.
              th = 10         th = 15         th = 20
RBP vs uRBP1  τ = .5931       τ = .8898       τ = .9986
              τAP = .5744     τAP = .8777     τAP = .9990
Fig. 2. RBP vs. uRBP for CLEF eHealth 2015 systems, with understandability judge-
ments sourced from human assessors (binary uRBP left, uRBPgr (graded) right). Cen-
tre: a detail of the correlation between RBP and binary uRBP.
7 Conclusions
References
1. Ahmed, O.H., Sullivan, S.J., Schneiders, A.G., McCrory, P.R.: Concussion informa-
tion online: evaluation of information quality, content and readability of concussion-
related websites. Br. J. Sports Med. 46(9), 675–683 (2012)
2. Barry, C.L.: User-defined relevance criteria: an exploratory study. JASIS 45(3),
149–159 (1994)
3. Bruza, P.D., Zuccon, G., Sitbon, L.: Modelling the information seeking user by the
decision they make. In: Proceedings of MUBE, pp. 5–6 (2013)
4. Carterette, B.: System effectiveness, user models, and user utility: a conceptual-
framework for investigation. In: Proceedings of SIGIR, pp. 903–912 (2011)
5. Clarke, C.L., Craswell, N., Soboroff, I., Ashkan, A.: A comparative analysis of
cascade measures for novelty and diversity. In: Proceedings of WSDM, pp. 75–84
(2011)
6. Clarke, C.L., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher,
S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In:
Proceedings of SIGIR, pp. 659–666 (2008)
7. Collins-Thompson, K., Callan, J.: Predicting reading difficulty with statistical lan-
guage models. JASIST 56(13), 1448–1462 (2005)
8. Cosijn, E., Ingwersen, P.: Dimensions of relevance. IP&M 36(4), 533–550 (2000)
9. Cuadra, C.A., Katter, R.V.: Opening the black box of ‘relevance’. J. Doc. 23(4),
291–303 (1967)
10. Eisenberg, M., Barry, C.: Order effects: a study of the possible influence of pre-
sentation order on user judgments of document relevance. JASIS 39(5), 293–300
(1988)
11. Friedman, D.B., Hoffman-Goetz, L., Arocha, J.F.: Health literacy and the world
wide web: comparing the readability of leading incident cancers on the internet.
Inf. Health Soc. Care 31(1), 67–87 (2006)
12. Goeuriot, L., Jones, G., Kelly, L., Leveling, J., Hanbury, A., Müller, H., Salanterä,
S., Suominen, H., Zuccon, G.: ShARe/CLEF eHealth Evaluation Lab 2013, Task 3:
Information retrieval to address patients’ questions when reading clinical reports.
In: Proceedings of CLEF (2013)
13. Goeuriot, L., Kelly, L., Lee, W., Palotti, J., Pecina, P., Zuccon, G., Hanbury, A.,
Jones, G.J.F., Müller, H.: ShARe/CLEF eHealth Evaluation Lab 2014, Task 3:
User-centred health information retrieval. In: Proceedings of CLEF Sheffield, UK
(2014)
The Relationship Between User Perception and User Behaviour

Abstract. Measures of user behaviour and user perception have been used to evaluate interactive information retrieval systems. However, there have been few efforts to understand the relationship between the two. In this paper, we investigated both, using user actions from log files and the results of the User Engagement Scale, both of which came from a study of people interacting with
a novel interface to an image collection, but with a non-purposeful task. Our
results suggest that selected behavioural actions are associated with selected user
perceptions (i.e., focused attention, felt involvement, and novelty), while typical
search and browse actions have no association with aesthetics and perceived
usability. This is a novel finding that can lead toward a more systematic user-
centered evaluation.
1 Introduction
(if they exist) among the various perception and behavioural measures will suggest that
the measures are evaluating the same phenomena, which may lead to a more parsimo‐
nious set of measures. Surprisingly, we still do not know which measures are the most reliable and robust, and indicative of overall results.
This paper is structured as follows: Sect. 2 discusses how both user perception and
behaviour are used in IIR evaluations. Section 3 describes the dataset used in this study,
the measures extracted from the dataset, and our approach to the analysis. Sections 4-6
deal, respectively, with the results, discussion and conclusions.
2 Background
The evaluation of IR systems has puzzled the field for half a century. Initially, relevance emerged as the preferred basis for evaluation, assessing primarily topical relevance using, e.g., mean average precision, mean reciprocal rank, and discounted cumulative gain [15].
But with interactive IR came a focus on users and their needs, which examined the effect
of individual differences [6, 9] on search, and evaluated search outcomes [16], as well
as user behaviour [24] and user perception [16] of the search process. More recently
broader aspects of user context [5, 8] have been considered.
Due to the iterative nature of the search process, we do not know if and when an
outcome meets a user’s need. A user may assess an outcome immediately, but when the
task that prompted the search is complex, that judgment may only come after a succes‐
sion of search tasks (and other types of information tasks) and over a period of time.
Individual differences such as age, gender, expertise, mental effort, and learning style
may affect the process, but there is as yet no definitive set of influential factors [1, 6, 8].
The core measures used in evaluations to date have tended to combine elements of
user behaviour (e.g., number of queries) and perception (e.g., satisfaction) as demon‐
strated by results of the various TREC, INEX and CLEF interactive tracks over the years.
These have been characterized in multiple ways [14, 19, 25]. One of the few attempts
to examine the interactions between these two dimensions is the work of Al-Maskari
and Sanderson [1, 2], who examined the relationship between selected aspects of behav‐
iour and perception, and found significant associations between user satisfaction and
user effectiveness (e.g., completeness), and user satisfaction and system effectiveness
(e.g., precision, recall). To our knowledge, there is only one measure that integrates user
behaviour with user perception: Tague-Sutcliffe’s informativeness measure [20] that
assesses the performance of the system simultaneously with the perception of the user.
But this measure is atypical and, due to the effort (e.g., constant user feedback) required in implementation, is rarely used [10].
One recent multi-dimensional measure is the User Engagement Scale (UES) [16], which calculates six dimensions of a user experience: Aesthetic Appeal, Novelty, Focused Attention, Felt Involvement, Perceived Usability, and Endurability (see definitions in Table 1). The scale contains 31 items; each item is presented as a statement rated on a 5-point scale from “strongly disagree” to “strongly agree”. Unlike
other measures, the model underpinning the UES shows how Endurability is explained
either directly or indirectly by the other five dimensions. The UES has been used to
evaluate multiple types of systems (e.g., e-shopping [16], wikiSearch [17], Facebook
[4]). This scale follows standard psychometric scale development methods [7], and has
been tested for reliability and validity. Although differences have emerged [17] in the
various applications, it is the most tested measure of user perception of a system.
How a user interacts with a search system is characterized typically by a set of low-level
user actions and selections at the interface (see [2, 14, 18, 20]):
• frequency of interface object use, e.g., number of times search box has been used;
• counts of queries, categories viewed in a menu, mouse clicks, mouse rollovers;
• time spent using objects, viewing pages.
Multiple efforts have attempted to look for patterns in these actions, patterns that might predict the likelihood of a successful outcome [21, 24]. The
challenge with user behaviour measures is that they are only descriptive of the outcome,
and are not interpretive of the process. That is to say, they lack the rationale behind why
those behaviours may lead to a successful outcome. The challenge with log files is the
voluminous number of data points and the need to find a reliable approach to defining
groups or sets based on behavioural patterns. Not all users are alike, nor do they all take the same approach to searching for the same things, as evidenced by the TREC,
INEX and CLEF interactive tracks.
3 Methods
3.1 Overview
We used the data collected by the CLEF 2013 Cultural Heritage Track (CHiC). This
section briefly describes that dataset, the measures we extracted from the dataset, and
how we approached the analysis, but see [12] for the details of that study.
3.2 Dataset
Application System. The system, an image Explorer based on Apache Solr¹, contains about one million records from the Europeana Digital Library’s English-language collection. The Explorer was accessed using a custom-developed interface (see Fig. 1 [12]), adapted from wikiSearch [22], with three types of access: (1) a hierarchical category browser, (2) a search box, and (3) a metadata filter based on the Dublin Core ontology, although the labels were modified for better user understanding. The interface used a single display panel that brought items to the surface while keeping the interface structure constant. Using one of the three access methods, participants searched or browsed the content, adding interesting items to a book-bag and at the same time providing information, via a popup box, about why the object was added.
Task. Participants first read the scenario: “Imagine you are waiting to meet a friend in
a coffee shop or pub or the airport or your office. While waiting, you come across this
website and explore it looking at anything that you find interesting, or engaging, or
relevant…” The next display, Fig. 1, presented the browse task with no explicit goals
in the upper left corner: “Your Assignment: explore anything you wish using the Cate‐
gories below or the Search box to the right until you are completely and utterly bored.
When you find something interesting, add it to the Book-bag.”
Participants. A total of 180 participants volunteered: 160 on-line participants and 20 in-lab participants, recruited via a volunteers’ mailing list.
Procedure. Participants (both lab and online) used a web-based system, SPIRES [11]
which guided them through the process. The only difference between the two is that lab
participants were interviewed, which is outside the scope of this analysis. The SPIRES
system started with an explanation of the study, acquired informed consent, and asked
for a basic demographic profile and questions about culture before presenting the
¹ http://lucene.apache.org/solr/.
Explorer and the task to participants. Once participants had executed the task, and
essentially were “bored,” they moved on to the 31 item UES questionnaire [7, 16] about
their perceptions of the search experience and the interface, and provided a brief explan‐
ation of objects in the book-bag, the metadata and the interface.
3.3 Measures
The following measures (see Table 1) were extracted from the CHiC study data:
Variable – Definition

User Perception measures – the User Engagement Scale (UES)
Aesthetic Appeal: Perception of the visual appearance of the interface.
Felt Involvement: Feelings of being drawn in and entertained in the interaction.
Focused Attention: The concentration of mental activity; flow and absorption.
Novelty: Curiosity evoked by the content.
Perceived Usability: Affective and cognitive response to the interface/content.
Endurability: Overall evaluation of the experience and future intentions.

User Behaviour measures
Queries: Number of queries used.
Query Time: Time spent issuing queries and following the links.
Items viewed (Queries): Number of items viewed from queries.
Bookbag (Queries): Number of items added to the Bookbag from queries.
Topics: Number of categories used.
Topics Time: Time spent exploring categories and following links.
Items viewed (Topics): Number of items viewed from categories.
Actions: Number of actions (e.g., keystrokes, mouse clicks).
Pages: Number of pages examined.
Bookbag Time: Total time spent reviewing the contents of the Bookbag.
Bookbag (Total): Number of items added to the Bookbag.
Bookbag (Topics): Number of items added to the Bookbag from categories.
Task Time: Total time the user spent on the task.
1. User perception measures: the UES with six user perception dimensions [16];
2. User behaviour: 13 variables that represent typical user actions e.g., examining
items, selecting categories, and deploying queries. Times were measured in seconds.
Data Preparation. After extracting the data, each participant set was scanned for
irregularities. Pilot participants and those who did not engage (e.g. left the interface for
hours) were removed. 157 participants remained. The two datasets were saved into a
spreadsheet or database for preliminary examination, and exported to SPSS.
User Perception. First, Reliability Analysis assessed the internal consistency [3] of the UES sub-scales using Cronbach’s α. Second, the inter-item correlations were used to test the distinctiveness of the sub-scales. Third, Exploratory Factor Analysis using Maximum Likelihood with Oblique Rotation (as we assumed correlated factors [18]) was used to estimate factor loadings and test the underlying factors, in order to compare with previous UES analyses and validate the scale for use in this research.
User Behaviour. First, the raw log file data were exported to a spreadsheet. A two-step data reduction process sorted 15,396 user actions into 157 participant groups, each record containing participant id, time stamp, action type and parameter. Next, Exploratory Factor Analysis (using Maximum Likelihood with Oblique Rotation) was used to identify the main behavioural classes. These were then used to calculate, per participant, the measure for each variable listed in Table 1. Finally, Cluster Analysis extracted symbolic user archetypes across the 157 participants.
Correlation Analysis. Correlation analysis using Pearson’s r was then used to examine
the relationship between user perception and user behaviour.
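The following is a sketch of the kind of computation involved in the reliability and correlation analyses (the study itself used SPSS); the sub-scale responses and the behaviour measure are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def cronbach_alpha(items):
    """Cronbach's alpha for a (participants x items) matrix of one UES sub-scale."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# hypothetical data: 5 participants, a 3-item sub-scale, and one behaviour measure
subscale = [[4, 5, 4], [2, 2, 3], [3, 3, 3], [5, 4, 5], [1, 2, 1]]
queries  = [12, 3, 6, 15, 2]                   # number of queries per participant

print(cronbach_alpha(subscale))                               # internal consistency
print(pearsonr([sum(row) for row in subscale], queries))      # perception vs. behaviour
```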
4 Results
The results first present the analysis of the user perception measures, then the user
behaviour measures and finally the analysis of the relationship between the two.
An initial examination of the scree plot (i.e., the eigenvalues of the principal compo‐
nents) that resulted from the Factor Analysis identified a four-factor solution that
accounted for 59.8 % of the variance. A five-factor solution, albeit accounting for 63 %
of the variance, was less appropriate as only two items were loaded on Factor 5 with
lower absolute loading values than those on Factor 4. The four-factor model demon‐
strated a very high Kaiser-Meyer-Olkin Measure of Sampling Adequacy (KMO =
0.924), indicating that the factors are distinct. The statistically significant result from
Bartlett’s Test of Sphericity also suggested rela‐
tionships existed amongst the dimensions. Table 3 summarises the four factors that were
generated: Factor 1 contained 11 items from Novelty (3 of 3), Focused Attention (1 of
7), Felt Involvement (3 of 3), and Endurability (4 of 5). Factor 4 remained as in the
original UES, Focused Attention (6 of 7) almost remained distinct (Factor 2), and
Perceived Usability (8 of 8) plus 1 item from Endurability formed Factor 3. Factors 2–4
had good internal consistency as demonstrated by Cronbach’s α. Correlation analysis
resulted in significant, although moderate, correlations amongst the factors. Given the
results, some of the overlapping items may be removed from Factor 1 (Cronbach’s α >
0.95) (see Table 3). However, we used the original factors in our remaining analysis.
Factor   Sub-scale        Cronbach’s α   M      n    Factor 2   Factor 3   Factor 4
1        EN, FA, FI, NO   0.95           2.67   11   0.66**     0.45       0.59**
2        FA               0.90           2.19   6               −0.26**    0.36**
3        PU, EN           0.86           3.14   9                          0.51**
4        AE               0.89           2.55   5
To assess how participants acted, one action item (i.e., the one with the highest weight, shown in italics in Table 4) was selected from each factor and submitted to a Cluster Analysis using Ward’s hierarchical clustering method [23]. The results were manually inspected, including the descriptive statistics for each action item and the resulting dendrogram. The 157 participants were best distributed into three clusters (see Table 5).
Each of the clusters represents a set of participants who exhibit certain types of behaviours illustrative of information seekers. The first represents explorers, who spent the longest time checking items in the book-bag and used, on average, the most queries. They were clearly concerned about their results, and specific about what they were looking for. The second group contains directionless followers. They do not appear to have specific interests in the content and just trailed the inter-linked categories rather than using queries. They added fewer items to the bookbag, and appeared to stop early. The third group acted much like Bates’ berrypickers [5]. Their search and browse activities interacted to sustain the participants’ interest in the collection. They seemed to obtain information by noticing and stopping to examine other content not strongly related to the item they were currently viewing. Some used queries to refine their searches. The interpretation of the three clusters suggests that the three behavioural factors describe the participants in this case. For the subsequent examination of the relationship between perception and behaviour, these three behaviour factors (Table 4) were used.
We tested the relationships among the three user behaviour factors and the six UES sub-
scales (see Table 6). The user behaviour factors do not correlate with Aesthetics and
Perceived Usability. Of the others, correlations between the searching and browsing
behaviour factors and Endurability, Focused Attention, and Novelty were also not significant. Only the general behaviour factor had a moderate correlation with Focused Attention,
Felt Involvement, and Novelty.
5 Discussion
The reliability analysis of all six original UES sub-scales demonstrated good internal
consistency, which aligns with previous studies [16, 17]. In our correlation analysis,
Perceived Usability had a positive and moderate relationship with Focused Attention,
which is in contrast to the results of the wikiSearch study, which found a negative correla‐
tion between the two [17]. A key difference between the two studies is the interface and
content, e.g., images versus Wikipedia, and multiple access tools versus only a search box.
The original six-dimensional UES structure was developed with e-shopping data
[16]. However, our results identified four factors, which is consistent with the result
obtained from the wikiSearch study [17] and Facebook [4]. This suggests that in a searching environment the dimensions of the UES structure may remain consistent regardless of data type (text or image), or perhaps this is due to the presence of rich
information and interactivity. Novelty, Felt Involvement, and Endurability had been
demonstrated to be reliable sub-scales in the e-shopping environment, and some of the
items within these sub-scales were used successfully to measure website engagement
and value as a consequence of website engagement in online travel planning [13]. This
highlights the notion that different user perception dimensions may be more relevant to
different interactive search systems. In our setting we observed that Endurability, Felt Involvement, and Novelty capture the same information.
Extracting types of user actions from the logfile resulted in three key behavioural classes
that relate to users’ search or browse behaviours and their general task-based actions.
The searching behaviours were primarily associated with query actions. The browsing
behaviours included actions related to using the categories as well as those related to
keystroke and mouse activity and what could be construed as navigational activities.
Actions and Pages, the items viewed, did not map well to any factor. The third class, which we call general, is more associated with actions related to the result and the task. Notably, actions associated with items selected as a result of using categories fit into this factor, whereas those that resulted from using a query loaded with the other query-related actions.
In addition to examining and grouping the behavioural actions into usable sets, we
found a novel set of user archetypes (explorers, followers, berrypickers) among our
participant group. The explorers submitted sets of highly relevant queries. More specif‐
ically, subsequent queries were aimed at refining former ones. For instance, an explorer
exhibited a closely related pathway: modern sculpture, modern british sculpture,
hepworth, hepworth sculpture, henry moore, henry moore sculpture, family of man,
family of man sculpture. In contrast, the query pathways input by followers and berry‐
pickers are typically short (both pathway and query length), e.g., Scotland, Edinburgh.
The user archetypes and query patterns might be useful in evaluation simulations and
in advancing log analysis techniques.
6 Conclusion
The key objective of our research was to assess whether a relationship exists between
user behaviour and user perception of information retrieval systems. This was achieved
by using actions from log files to represent behaviour and results from the UES to
represent perception. The data came from a study in which people had no defined task
while interacting with a novel interface to a set of images. In the past, studies have
considered measures of behaviour and perception as two relatively independent aspects
in evaluation. Our results showed that the aesthetics and usability perceptions of those
searching and browsing appear un-influenced by their interactions with the system.
However, general actions were associated with attention, involvement and novelty.
In addition, our research tested the UES scale, and like the wikiSearch results [17],
we found four factors. This may be because both implementations were in information
finding systems, and not the focused task of a shopper [16]. We also produced a novel
set of information-seeking user archetypes (i.e., explorers, followers, and berrypickers),
defined by their behavioural features, which may be useful in testing evaluation simulations and in building novel log analysis techniques that simulate user studies. Moreover,
these user archetypes were reflective of search reality as behavioural measures were
direct observables. On the other hand, user perception measures are based on a psycho‐
metric scale or descriptive data and thus are largely affected by context.
Our findings are preliminary and we need to replicate them using additional datasets.
We have isolated selected behavioural variables that are significant to the analysis. The
emerging relationship with the UES demonstrates that we may be able to isolate selected
variables from log files that are indicative of user perception. Being able to do so would
mean that IIR evaluations could be parsimoniously completed using only log file data.
This means that we also need to refine the UES so that it consistently yields distinctive, reliable and valid factors that represent human perception. The additional
part of the analysis lies with the task and with the user’s background and personal expe‐
rience, which may account for the remaining variance in the result.
References
1. Al-Maskari, A., Sanderson, M.: The effect of user characteristics on search effectiveness in
information retrieval. Inf. Process. Manage. 47, 719–729 (2011)
2. Al-Maskari, A., Sanderson, M.: A review of factors influencing user satisfaction in
information retrieval. JASIST 61, 859–868 (2010)
3. Aladwani, A.M., Palvia, P.C.: Developing and validating an instrument for measuring user-
perceived web quality. Inf. Manage. 39, 467–476 (2002)
4. Banhawi, F., Ali, N.M.: Measuring user engagement attributes in social networking
application. In: Semantic Technology and Information Retrieval, pp. 297–301. IEEE (2011)
5. Bates, M.J.: The design of browsing and berrypicking techniques for the online search
interface. Online Inf. Rev. 13, 407–424 (1989)
6. Borgman, C.L.: All users of information retrieval systems are not created equal: an
exploration into individual differences. Inf. Process. Manage. 25, 237–251 (1989)
7. DeVellis, R.: Scale Development. Sage, Newbury Park, California (2003)
8. Dillon, A., Watson, C.: User analysis in HCI: the historical lessons from individual differences
research. Int. J. Hum. Comput. Stud. 45, 619–637 (1996)
9. Fenichel, C.H.: Online searching: Measures that discriminate among users with different types
of experiences. JASIS 32, 23–32 (1981)
10. Freund, L., Toms, E.G.: Revisiting informativeness as a process measure for information
interaction. In: Proceedings of the WISI Workshop of SIGIR 2007, pp. 33–36 (2007)
11. Hall, M.M., Toms, E.: Building a common framework for IIR evaluation. In: Forner, P., Müller,
H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 17–28. Springer,
Heidelberg (2013)
12. Hall, M.M., Villa, R., Rutter, S.A., Bell, D., Clough, P., Toms, E.G.: Sheffield submission to
the CHiC interactive task: Exploring digital cultural heritage. In: Proceedings of the CLEF
2013 (2013)
13. Hyder, J.: Proposal of a Website Engagement Scale and Research Model: Analysis of the
Influence of Intra-Website Comparative Behaviour. Ph.D. Thesis, University of Valencia
(2010)
14. Kelly, D.: Methods for evaluating interactive information retrieval systems with users. Found.
Trends Inf. Retrieval 3, 1–224 (2009)
15. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval. Cambridge
University Press, Cambridge (2008)
16. O’Brien, H.L., Toms, E.G.: The development and evaluation of a survey to measure user
engagement. JASIST 61, 50–69 (2010)
17. O’Brien, H.L., Toms, E.G.: Examining the generalizability of the User Engagement Scale
(UES) in exploratory search. Info. Proc. Mgmt. 49, 1092–1107 (2013)
18. Reise, S.P., Waller, N.G., Comrey, A.L.: Factor analysis and scale revision. Psychol. Assess.
12, 287 (2000)
19. Su, L.T.: Evaluation measures for interactive information retrieval. Info. Proc. Mgmt. 28, 503–516 (1992)
Using Query Performance Predictors to Improve Spoken Queries

1 Introduction
speech, internally they construct a ranked n-best list of the most confident
hypotheses. In our n-best list re-ranking task, the input to the system is a spo-
ken query’s n-best list, and the goal of the system is to predict the candidate
transcription from the n-best list that maximizes retrieval performance, which
may not necessarily be the top candidate.
In certain situations, a spoken query may perform poorly due to ASR error
or the user’s failure to formulate an effective query. In our second predictive task,
the input to the system is a spoken query (specifically, the top candidate from
the ASR system’s n-best list), and the goal of the system is to predict whether
to run the input query or to ask for a reformulation.
For both tasks, we use machine learning to combine a wide range of per-
formance predictors as features. We trained and tested models using a set of
5,000 spoken queries that were collected in a crowdsourced study. Our spoken
queries were based on 250 TREC topics and were automatically transcribed
using freely available APIs from AT&T and WIT.AI. We evaluate our models
based on retrieval performance using the TREC 2004 Robust Track collection.
2 Related Work
3 Data Collection
In the next sections, we describe the user study that we ran to collect spoken
queries, our search tasks, the ASR systems used, and our spoken queries.
User Study. Spoken queries were collected using Amazon’s Mechanical Turk
(MTurk). Each MTurk Human Intelligence Task (HIT) asked the participant
to read a search task description and produce a recording of how they would
request the information from a speech-enabled search engine.1
The study protocol proceeded as follows. Participants were first given a set
of instructions and a link to a video explaining the steps required to complete
the HIT. Participants were then asked to click a “start” button to open the
main voice recording page in a new browser tab. While loading, the main page
asked participants to grant access to their computer’s microphone. Participants
were required to grant access in order to continue. The main page provided
participants with: (1) a button to display the search task description in a pop-
up window, (2) Javascript components to record the spoken query and save the
recording as a WAV file on the participant’s computer, and (3) an HTML form
to upload the WAV file to our server.
Within the main voice recording page, participants were first asked to click
a “view task” button to display the search task description in a pop-up window.
The task was displayed in a pop-up window to prevent participants from reading
the task while recording their spoken query.2 Participants were instructed to
read the task carefully and to “imagine that you are looking for information on
this specific topic and that you are going to ask a speech-enabled search engine
for help in finding this information”. Participants were asked to “not try to
memorize the task description word-by-word”. The instructions explained that
our goal was to “learn how someone might formulate the information request as
naturally as possible”.
After reading the task, participants were asked to click a “record” button
to record their spoken query and then a “save” button to save the recording
¹ Our source code and search task descriptions are available at: http://ils.unc.edu/∼jarguell/ecir2016/.
² Participants had to close the pop-up window to continue interacting with the page.
ASR Systems. In this work, we treat the ASR system as a “black box” and used
two freely available APIs provided by AT&T and WIT.AI.3 Both APIs accept
a WAV file as input and return one or more candidate transcriptions in JSON
format. The AT&T API was configured to return an n-best list in cases where
the API was less confident about the input speech. The AT&T API returned
an n-best list with at most 10 candidates along with their ranks and confidence
values. The WIT.AI API could not be configured to return an n-best list and
simply returned the single most confident transcription without a confidence
value.
Spoken Queries and ASR Output. In this section, we describe our spoken
queries and ASR output. To conserve space, we focus on the ASR output from
the AT&T API. The AT&T API was able to transcribe 4,905 of our 5,000 spoken
queries due to the quality of the recording. Spoken queries had an average length
of 5.86 ± 2.50 s and 10.04 ± 2.18 recognized tokens. The AT&T API returned
an n-best list with more than one candidate for 70 % of the 4,905 transcribed
spoken queries.
We were interested in measuring the variability between candidates from the
same n-best list. To this end, we measured the similarity between candidate-
pairs from the same n-best list in terms of their recognized tokens, top-10 docu-
ments retrieved, and retrieval performance. In terms of their recognized tokens,
³ http://developer.att.com/apis/speech and https://wit.ai/.
spoken query. Otherwise, if the system does decide to ask for a reformulation,
then the user experiences a gain equal to the retrieval performance of the new
query discounted by a factor denoted by α (in the range [0,1]). The system
must decide whether to ask for a new spoken query without knowing the true
performance of the original (e.g., using only pre- and post-retrieval performance
predictors as evidence).
To illustrate, suppose that given an input spoken query, the system decides
to ask for a reformulation. Furthermore, suppose that the original query achieves
an average precision (AP) value of 0.15 and that the reformulated query achieves
an AP value of 0.20. In this case, the user experiences a discounted gain of AP =
α × 0.20. If we set α = 0.50, then the discounted gain of the new query (0.50 ×
0.20 = 0.10) is less than the original (0.15), and so the system made the incorrect
choice. Parameter α can be varied to simulate different costs of asking a user for
a spoken query reformulation. The higher the α, the lower the cost. The goal of
the system is to maximize the gain over a set of input spoken queries for a given
value of α.
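The decision rule and the worked example above can be written compactly as follows (the numbers are those used in the example):

```python
def realized_gain(ask_reformulation, ap_original, ap_reformulated, alpha):
    """Gain experienced by the user: the original query's AP if the system runs it,
    otherwise the reformulated query's AP discounted by alpha (higher alpha = cheaper)."""
    return alpha * ap_reformulated if ask_reformulation else ap_original

print(realized_gain(True,  0.15, 0.20, alpha=0.5))   # 0.10 -> asking was the wrong choice
print(realized_gain(False, 0.15, 0.20, alpha=0.5))   # 0.15 -> running the original is better
```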
5 Features
For both tasks, we used machine learning to combine different types of evi-
dence as features. We grouped our features into three categories. The numbers
in parentheses indicate the number of features in each category.
N-best List Features (2). These features were generated from the ASR sys-
tem’s n-best list. We included two n-best list features: the rank of the transcrip-
tion in the n-best list and its confidence value. These features were only available
for the AT&T API and only used in the n-best list re-ranking task.
Pre-retrieval Features (27). Prior work shows that a query is more likely to
perform well if it contains discriminative terms that appear in only a few docu-
ments. We included five types of features aimed at capturing this type of evidence.
Our inverse document frequency (IDF) and inverse collection term frequency
(ICTF) features measure the IDF and ICTF values across query terms [2,6,18].
We included the min, max, sum, average, and standard deviation of IDF and
ICTF values across query terms. The query-collection similarity (QCS) score
measures the extent to which the query terms appear many times in only a few
documents [18]. We included the min, max, sum, average, and standard deviation
of QCS values across query terms. The query scope score is inversely proportional
to the number of documents with at least one query term [6]. Finally, the simpli-
fied clarity score measures the KL-divergence between the query and collection
language models [6].
Prior work also shows that a query is more likely to perform well if the query
terms describe a coherent topic. We included one type of feature to capture this
type of evidence. Our point-wise mutual information (PMI) features measure
the degree of co-occurrence between query terms [5]. We included the min, max,
sum, average, and standard deviation of PMI values across query-term pairs.
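As an illustration of how such aggregates are formed, the sketch below computes the min, max, sum, average, and standard deviation of per-term IDF values for a query. The IDF variant, the document frequencies, and the collection size are assumptions for illustration, not taken from the source.

```python
import math

def idf_aggregates(query_terms, doc_freqs, num_docs):
    """Aggregate per-term IDF values into pre-retrieval features; unseen terms
    are given df = 0.5 to avoid division by zero (an assumption)."""
    idfs = [math.log(num_docs / max(doc_freqs.get(t, 0), 0.5)) for t in query_terms]
    n = len(idfs)
    mean = sum(idfs) / n
    std = (sum((x - mean) ** 2 for x in idfs) / n) ** 0.5
    return {"min": min(idfs), "max": max(idfs), "sum": sum(idfs),
            "avg": mean, "std": std}

# hypothetical document frequencies from a 500,000-document collection
dfs = {"wildlife": 4200, "extinction": 1100, "prevention": 8900}
print(idf_aggregates(["wildlife", "extinction", "prevention"], dfs, 500000))
```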
6 Evaluation Methodology
Retrieval performance was measured by issuing spoken query transcriptions
against the TREC 2004 Robust Track collection. In all experiments, we used
Lucene’s implementation of the query-likelihood model with Dirichlet smoothing.
Queries and documents were stemmed using the Krovetz stemmer and stopped
using the SMART stopword list. We evaluated in terms of average precision
(AP), NDCG@30, and P@10.
Re-ranking the N-Best List. We cast this as a learning-to-rank (LTR) task,
and trained models to re-rank an n-best list in descending order of retrieval
performance. At test time, we re-rank the input n-best list and select the top
query transcription as the one to run against the collection. We used the linear
RankSVM implementation in the sofia-ml toolkit and trained separate models
for each retrieval performance metric.
Models were evaluated using 20-fold cross-validation. Recall that each TREC
topic had 20 spoken queries from different study participants. To avoid train-
ing and testing on n-best lists for the same TREC topic (potentially inflating
performance), we assigned all n-best lists for the same topic to the same fold.
We report average performance across held-out folds and measure statistical sig-
nificance using the approximation of Fisher’s randomization test described in
Smucker et al. [15]. We used the same cross-validation folds in all our experi-
ments. Thus, when testing significance, the randomization was applied to the 20
pairs of performance values for the two models being compared. We normalized
feature values to zero-min and unit-max for each spoken query (i.e., using the
min/max values from the same n-best list).
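A minimal sketch of this per-query normalization; the feature-matrix layout (rows are n-best candidates, columns are features) is our assumption:

```python
import numpy as np

def normalize_per_query(features):
    """Scale each feature column to [0, 1] using the min/max observed
    within a single n-best list (rows = candidate transcriptions)."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant features simply map to zero
    return (features - col_min) / col_range

# Example: three candidate transcriptions, two features.
print(normalize_per_query([[0.2, 5.0], [0.4, 5.0], [0.8, 7.0]]))
```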
7 Results
Results for the n-best list re-ranking task are presented in Table 1. For this task,
we used the n-best lists produced by the AT&T API. Furthermore, we focus
on the subset of 3,414 (out of 5,000) spoken queries for which the AT&T API
returned an n-best list with more than one transcription. The first and last rows
in Table 1 correspond to our two baseline approaches: selecting the top-ranked
candidate from the n-best list (top) and selecting the best-performing candidate
for the corresponding metric (oracle). The middle rows correspond to the LTR
model using all features (all), all features except for those in group x (no.x), and
only those features in group x (only.x).
Table 1. Results for the n-best list re-ranking task. The percentages indicate percent
improvement over top. Statistically significant improvements compared to top, and, for
no.x and only.x, statistically significant performance drops compared to all, are marked.
We used Bonferroni correction for multiple comparisons (p < .05).
AP NDCG@30 P@10
top 0.081 0.148 0.159
all 0.091 (13.75 %) 0.162 (12.50 %) 0.174 (12.26 %)
no.nbest 0.090 (12.50 %) 0.162 (12.50 %) 0.173 (11.61 %)
no.pre 0.089 (11.25 %) 0.155 (7.64 %) 0.166 (7.10 %)
no.post 0.085 (6.25 %) 0.154 (6.94 %) 0.168 (8.39 %)
only.nbest 0.080 (0.00 %) 0.144 (0.00 %) 0.155 (0.00 %)
only.pre 0.084 (5.00 %) 0.154 (6.94 %) 0.167 (7.74 %)
only.post 0.089 (11.25 %) 0.155 (7.64 %) 0.166 (7.10 %)
oracle 0.102 (27.50 %) 0.186 (29.17 %) 0.205 (32.26 %)
The results from Table 1 suggest five important trends. First, the LTR model
using all features (all) significantly outperformed the baseline approach of always
selecting the top-ranked transcription from the n-best list (top). The LTR model
using all features had a greater than 10 % improvement across all metrics.
Second, our results suggest that both pre- and post-retrieval query perfor-
mance predictors contribute useful evidence for this task. The LTR model using
only pre-retrieval features (only.pre) and only post-retrieval features (only.post)
significantly outperformed the top baseline across all metrics. Furthermore, in
all cases, individually ignoring pre-retrieval features (no.pre) and post-retrieval
features (no.post) resulted in a significant drop in performance compared to the
LTR model using all features (all).
Third, there is some evidence that post-retrieval features were more predic-
tive than pre-retrieval features. In terms of AP, ignoring post-retrieval features
(no.post) and using only pre-retrieval features (only.pre) had the greatest perfor-
mance drop compared to the model using all features (all). In terms of AP, post-
retrieval features were more predictive in spite of having only 5 post-retrieval
features versus 27 pre-retrieval features.
The fourth trend worth noting is that n-best list features contributed little
useful evidence. In most cases, ignoring n-best list features (no.nbest) resulted in
only a small drop in performance compared to the LTR model using all features
(all). Furthermore, the LTR model using only n-best list features (only.nbest)
was the worst-performing LTR model across all metrics and performed at the
same level as the top baseline.
The final important trend is that there is still room for improvement. Across
all metrics, the oracle performance was at least 25 % greater than the top baseline.
While not shown in Table 1, the oracle outperformed all the LTR models and
the top baseline across all metrics by a statistically significant margin (p < .05).
Results for the task of predicting when to ask for a spoken query reformu-
lation are shown in Tables 2 and 3. To conserve space, we only show results in
terms of AP. However, the results in terms of NDCG@30 and P@10 had the same
trends. We show results using the most confident transcriptions from the AT&T
API (Table 2) and the WIT.AI API (Table 3). Because the WIT.AI API only
returned the most confident transcription without a confidence value, we ignore
n-best list features in this analysis. Results are presented for different values of
α, with higher values indicating a higher cost of asking for a reformulation and
therefore fewer cases where it was the correct choice. We show results for our
four baselines (oracle, always, never, and random), as well as the regression model
using all features (all), ignoring pre-retrieval features (no.pre), and ignoring post-
retrieval features (no.post). The performance of never asking for a reformulation
(never) is constant because it is independent of α. The performance of always
asking for a reformulation (always) increases with α (lower cost).
The results in Tables 2 and 3 suggest three important trends. First, the model
using all features (all) performed equal to or better than always, never, and
random for both APIs and all values of α. The model performed at the same
level as never for low values of α (high cost) and at the same level as always
for high values of α (low cost). The model outperformed these three baselines for
values of α in the mid-range (0.4 ≤ α ≤ 0.6). For these values of α, the system
had to be more selective about when to ask for a reformulation. These results
show that pre- and post-retrieval performance predictors provide useful evidence
for predicting when the input spoken query is relatively poor.
Second, post-retrieval features were more predictive than pre-retrieval fea-
tures. This is consistent with the AP results from Table 1. For values of α in the
mid-range, ignoring post-retrieval (no.post) features resulted in a greater drop
in performance than ignoring pre-retrieval features (no.pre). The drop in per-
formance was statistically significant for two values of α for the AT&T results
Table 2. Results for predicting when to ask for a spoken query reformulation: AT&T
API, Average Precision. Statistically significant improvements compared to always,
never, and random, and statistically significant performance drops for no.pre and no.post
compared to all, are marked. We report significance for p < .05 using Bonferroni correction.
Table 3. Results for predicting when to ask for a spoken query reformulation: WIT.AI
API, Average Precision. Significance markers are as described in Table 2.
and one value of α for the WIT.AI results. Again, we observed this trend in spite
of having fewer post-retrieval than pre-retrieval features.
Finally, we note that there is room for improvement. For both APIs, the
oracle baseline (oracle) outperformed the model using all features (all) across all
values of α. While not shown in Tables 2 and 3, all differences between oracle
and all were statistically significant (p < .05).
8 Discussion
Our results from Sect. 7 show that the top candidate from an ASR system’s n-
best list is not necessarily the best-performing query and that we can use query
performance predictors to find a lower-ranked candidate that performs better.
A reasonable question is: Why is the most confident candidate not always the
best query? We examined n-best lists where a lower-ranked candidate outper-
formed the most confident, and encountered cases belonging to three categories.
In the first category, the lower-ranked candidate was a more accurate tran-
scription of the input speech. For example, the lower-ranked candidate ‘protect
children poison paint’ (AP = 0.467) outperformed the top candidate ‘protect chil-
dren poison pain’ (AP = 0.055). Similarly, the lower-ranked candidate ‘prostate
cancer detect treat’ (AP = 0.301) outperformed the top candidate ‘press can-
cer detect treat’ (AP = 0.014). Finally, the lower-ranked candidate ‘drug treat
alzheimer successful’ (AP = 0.379) outperformed the top candidate ‘drug treat
timer successful’ (AP = 0.001). We do not know why the ASR system assigned the
correct transcription a lower probability. It may be that the correct query terms
(‘paint’, ‘prostate’, ‘alzheimer’) had a lower probability in the ASR system’s lan-
guage model than those in the top candidates (‘pain’, ‘press’, ‘timer’). Such errors
might be reduced by using a language model from the target collection. However,
this may not be possible with an off-the-shelf ASR system.
9 Conclusion
We developed and evaluated models for two tasks associated with speech-enabled
search: (1) re-ranking the ASR system’s n-best hypotheses and (2) predicting
when to ask for a spoken query reformulation. Our results show that pre- and
post-retrieval performance predictors contribute useful evidence for both tasks.
With respect to the first task, our analysis shows that lower-ranked candidates
in the n-best list may perform better due to mispronunciation errors in the
input speech or because the ASR system may not explicitly favor candidates
that describe a coherent topic with respect to the target collection.
There are several directions for future work. In this work, we improved the
input query by exploring candidates in the same n-best list. Future work might
consider exploring a larger space, including reformulations of the top candidate
that are specifically designed for the speech domain (e.g., term substitutions
with similar Soundex codes). Additionally, in this work, we predicted when to
ask for a new spoken query. Future work might consider learning to ask more
targeted clarification or disambiguation questions about the input spoken query.
References
1. Aslam, J.A., Pavlu, V.: Query Hardness Estimation Using Jensen-Shannon Diver-
gence Among Multiple Scoring Functions. In: Amati, G., Carpineto, C., Romano,
G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 198–209. Springer, Heidelberg (2007)
2. Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In:
SIGIR (2002)
3. Dang, V., Bendersky, M., Croft, W.B.: Learning to rank query reformulations. In:
SIGIR (2010)
4. Diaz, F.: Performance prediction using spatial autocorrelation. In: SIGIR (2007)
5. Hauff, C.: Predicting the effectiveness of queries and retrieval systems. Dissertation,
University of Twente (2010)
6. He, B., Ounis, I.: Inferring Query Performance Using Pre-retrieval Predictors.
In: Apostolico, A., Melucci, M. (eds.) SPIRE 2004. LNCS, vol. 3246, pp. 43–54.
Springer, Heidelberg (2004)
7. Jiang, J., Jeng, W., He, D.: How do users respond to voice input errors? Lexical
and phonetic query reformulation in voice search. In: SIGIR (2013)
8. Kumaran, G., Carvalho, V.R.: Reducing long queries using query quality predic-
tors. In: SIGIR (2009)
9. Li, X., Nguyen, P., Zweig, G., Bohus, D.: Leveraging multiple query logs to improve
language models for spoken query recognition. In: ICASSP (2009)
10. Mamou, J., Sethy, A., Ramabhadran, B., Hoory, R., Vozila, P.: Improved spoken
query transcription using co-occurrence information. In: INTERSPEECH (2011)
11. Peng, F., Roy, S., Shahshahani, B., Beaufays, F.: Search results based n-best
hypothesis rescoring with maximum entropy classification. In: IEEE Workshop
on Automatic Speech Recognition and Understanding (2013)
12. Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M.,
Kamvar, M., Strope, B.: Your word is my command: Google search by voice: A case
study. In: Neustein, A. (ed.) Advances in Speech Recognition. Springer, Heidelberg
(2010)
13. Sheldon, D., Shokouhi, M., Szummer, M., Craswell, N.: Lambdamerge: Merging
the results of query reformulations. In: WSDM (2011)
14. Shtok, A., Kurland, O., Carmel, D., Raiber, F., Markovits, G.: Predicting query
performance by query-drift estimation. TOIS, 30(2) (2012)
15. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance
tests for information retrieval evaluation. In: CIKM (2007)
16. Xue, X., Huston, S., Croft, W.B.: Improving verbose queries using subset distrib-
ution. In: CIKM (2010)
17. Yom-Tov, E., Fine, S., Carmel, D., Darlow, A.: Learning to estimate query diffi-
culty: Including applications to missing content detection and distributed informa-
tion retrieval. In: SIGIR (2005)
18. Zhao, Y., Scholer, F., Tsegay, Y.: Effective Pre-retrieval Query Performance Pre-
diction Using Similarity and Variability Evidence. In: Macdonald, C., Ounis, I.,
Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp.
52–64. Springer, Heidelberg (2008)
19. Zhou, Y., Croft, W.B.: A novel framework to predict query performance. In: CIKM
(2006)
20. Zhou, Y., Croft, W.B.: Query performance prediction in web search environments.
In: SIGIR (2007)
Fusing Web and Audio Predictors
to Localize the Origin of Music Pieces
for Geospatial Retrieval
1 Introduction
Predicting the location of a person or item is an appealing task given today’s
omnipresence and abundance of information about any topic on the web and
in social media, which are easy to access through corresponding APIs. While
a majority of research focuses on automatically placing images [5], videos [22],
or social media users [1,7], we investigate the problem of placing music at its
location of origin, focusing on the country of origin, which we define as the
main country or area of residence of the artist(s). We approach the task by
audio content-based and web-based strategies and eventually propose a hybrid
method that fuses these two sources. We show that the fused method is capable
of outperforming stand-alone approaches.
The availability of information about a music piece’s or artist’s origin opens
interesting opportunities, not only for computational ethnomusicology [3], but
also for location-aware music retrieval and recommendation systems. Examples
include browsing and exploration of music from different regions in the world.
This task seems particularly important as the strong focus on Western music in
music information retrieval (MIR) research has frequently been criticized [13,17].
Other tasks that benefit from information about the origin of music are trend
analysis and prediction. If we understand better where a particular music trend
emerges – which is strongly related to the music’s origin – and how it spreads
(e.g., locally, regionally, or globally), we could use this information for person-
alized and location-aware music recommendation or for predicting the future
popularity of a song, album, artist, or music video [10,25]. Another use case
is automatically selecting music suited for a given place of interest, a topic
e-tourism is interested in [6].
The remainder of this paper is structured as follows. Section 2 presents related
work and highlights the main contributions of the paper at hand. Section 3
presents the proposed audio- and web-based methods as well as the hybrid strat-
egy. Section 4 outlines the evaluation experiments we conducted, presents and
discusses their results. Eventually, Sect. 5 rounds off the paper with a summary
and pointers to future research directions.
2 Related Work
by extracting and analyzing audio descriptors through the MARSYAS [23] soft-
ware. They use spectral, timbral and chroma features. The authors then apply
K-nearest neighbor (KNN) and random forest regression methods for prediction.
1 https://en.wikipedia.org.
2 http://www.last.fm.
3 http://www.freebase.com.
Fig. 1. Overview of the feature extraction (top) and summarization process (bottom)
in the block-level framework.
The average coordinates (x̄, ȳ, z̄) are converted into the latitude and longitude
(φp, λp) for the midpoint:

$\phi_p = \operatorname{arctan2}\!\left(\bar{z}, \sqrt{\bar{x}^2 + \bar{y}^2}\right) \quad (4)$
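A small sketch of the midpoint computation around Eq. (4); the conversion to and from Cartesian coordinates is the standard unit-sphere one, and the longitude formula is our assumption, since only the latitude equation is reproduced above:

```python
import math

def geographic_midpoint(points):
    """Average (lat, lon) pairs, given in degrees, as unit vectors on the
    sphere and convert the mean vector back to latitude and longitude."""
    xs, ys, zs = [], [], []
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        xs.append(math.cos(phi) * math.cos(lam))
        ys.append(math.cos(phi) * math.sin(lam))
        zs.append(math.sin(phi))
    x, y, z = (sum(v) / len(v) for v in (xs, ys, zs))
    lat_mid = math.degrees(math.atan2(z, math.sqrt(x * x + y * y)))  # Eq. (4)
    lon_mid = math.degrees(math.atan2(y, x))  # assumed companion formula for longitude
    return lat_mid, lon_mid

print(geographic_midpoint([(48.2, 16.4), (45.5, 9.2)]))  # midpoint of Vienna and Milan
```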
To make predictions for a given music piece p, we first fetch the top-ranked
web pages returned by the Bing Search API4 for several queries: ‘‘piece’’
music, ‘‘piece’’ music biography, and ‘‘piece’’ music origin, in which
‘‘piece’’ refers to the exact search for the music piece’s name.5,6 In the fol-
lowing, we abbreviate these query settings by M, MB, and MO, respectively. We
subsequently concatenate the content of the retrieved web pages for each p to
yield a single document for p. Previous web-based approaches for the task at
hand [14,16] only considered the problem at the artist level and only employed
the query scheme ‘‘artist’’ music.
Given a list of country names, we compute the term frequency (TF) of all
countries in the document of p, and we predict the K countries with highest
scores. We do not perform any kind of normalization, nor account for different
overall frequencies of country names. This choice was made in accordance with
previous research on the topic of country of origin detection, as [16] shows that
TF outperforms TF·IDF weighting, and also outperforms more complex rule-
based approaches.
In addition to different query settings (M, MB, and MO), we also consider
fetching either 20 or 50 web pages per music piece. Knees et al. investigate the
influence of different numbers of fetched web pages for the task of music similarity
and genre classification [8]. According to the authors, considering more than 50
web pages per music item does not significantly improve results, in some cases
4 https://datamarket.azure.com/dataset/bing/search.
5 Please note that the obvious query scheme ‘‘piece’’ (music) country does not perform well, as it results in too many irrelevant pages about country music.
6 Please further note that investigating queries in languages other than English is out of the scope of the work at hand, but will be addressed as part of future work.
even worsens them. Since overall best results were achieved when considering
between 20 and 50 pages, we investigate these two numbers.
In order to control for uncertainty in the predictions made, we further introduce
a confidence parameter α. For each of the top K countries predicted for p, we
relate its TF value to the sum of TF values of all predicted countries. We only
keep a country c predicted for p if its resulting relative TF value is at least α,
Cp being the set of top K countries predicted for p:

$\frac{TF(p, c)}{\sum_{c' \in C_p} TF(p, c')} \geq \alpha \quad (7)$
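A hedged sketch of the TF-based country prediction combined with the confidence filter of Eq. (7); the tokenization, the country list, and the handling of multi-word country names are simplifying assumptions:

```python
from collections import Counter

def predict_countries(document_text, country_names, top_k=3, alpha=0.2):
    """Count country-name term frequencies in the concatenated web pages of
    a piece, keep the top-K countries, and drop those whose TF relative to
    the sum over the predicted countries falls below alpha."""
    tokens = document_text.lower().split()
    tf = Counter(t for t in tokens if t in country_names)
    top = tf.most_common(top_k)
    total = sum(count for _, count in top)
    if total == 0:
        return []
    return [c for c, count in top if count / total >= alpha]

pages = "mali music griot tradition mali senegal recording senegal mali france"
print(predict_countries(pages, {"mali", "senegal", "france"}))  # ['mali', 'senegal']
```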
4 Evaluation
We used the dataset presented by Zhou et al. [26], containing 1,059 pieces of
music originating from 33 countries. Music was selected based on the following
two criteria: First, no “Western” music is included, as its influence is global.
We only consider the music that is strongly influenced by a specific location,
namely only traditional, ethnic, or “World” music was included in this study,
Second, any music that has ambiguous origin was removed from the dataset. The
geographical origin was collected from compact disc covers. Since most location
information is country names, we used the country’s capital city (or the province
of the area) to represent the absolute point of origin (represented as latitude and
longitude), assuming that the political capital is also the cultural capital of the
country or area. The country of origin is determined by the artist’s or artists’
main country or area of residence. If an artist has lived in several different places,
we only consider the place that presumably had major influence. For example,
if a Chinese artist is living in New York but composes a piece of traditional
Chinese music, we take it as Chinese music.
For evaluation, all music from the same country is equally distributed among
10 groups. We then apply 10-fold cross-validation and report the mean and
median error distance (in kilometers) from the true positions to their correspond-
ing predicted positions. We also measure prediction accuracy, i.e. the percentage
of music pieces assigned to the correct class, treating countries as classes.
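The per-piece error distance can be computed with the haversine formula, as in the sketch below; that the authors used exactly this great-circle measure is an assumption on our part:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Error if a piece from Nairobi (KE) is predicted as originating in Dar es Salaam (TZ).
print(round(haversine_km(-1.29, 36.82, -6.79, 39.21)))  # about 670 km
```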
Table 1 shows the predictive accuracies for the different parameter settings (α,
query scheme, and number of retrieved web pages). Obviously, the ‘‘piece’’
music (M) query setting yields the best results, compared to the other two
query settings. This is in line with similar findings that strongly restricting the
Table 1. Accuracies for different variants of the web-based approach (query settings
and number of web pages) and various confidence thresholds α. Settings yielding the
highest performance are printed in boldface.
search may yield too specific results and in turn deteriorate performance [8]. As
expected, for the same query setting, a higher confidence threshold improves the
predictive accuracy.
Regarding the number of web pages, in general, using more web pages
increases the amount of information considered. However, only the general query
setting M seems to benefit from this. For a threshold of α = 0.5, accuracy
increases by 4.3 percentage points when comparing the M20 to the M50 setting.
In contrast, for the other query settings MB and MO, no substantial increase
(MB) or even a decrease (MO) can be observed, when using a large number of
pages (and a high α). Taking a closer look at the fetched pages, we identified as
a reason for this an increase of irrelevant or noisy pages using the more specific
query settings. This might, however, also be influenced by the fact that we are
dealing with “World” music. Therefore, many pieces in the collection are not
very prominently represented on the web, meaning a rather small number of
relevant pages is available.
Figure 3 shows the predictive accuracy for 1-NN for the different parameters
(confidence threshold α and mixture coefficient ξ) of the hybrid approach. When
ξ equals 0, it means there is no audio-based prediction input. With increasing ξ
values, the weight of the audio-based predictions is increasing; when ξ reaches 1,
solely audio-based predictions are made. We can clearly observe from Fig. 3 a
strong improvement of the web-based results when adding audio-based predic-
tions, irrespective of the confidence threshold α. For large confidence thresholds,
including only a small fraction of audio-based predictions actually increases
accuracy the most. This means the more confident we are in the web-based
predictions, the less audio-based predictions we need to include. Nevertheless,
even when α = 0, i.e., we consider all web-based predictions, including audio
improves performance. We can also observe that different mixture coefficients ξ
are required for different levels of α in order to reach peak performance.
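The exact fusion rule is not spelled out here, but a plausible sketch of Borda rank aggregation with a mixture coefficient ξ, under our own scoring assumptions (each list contributes linearly decaying points, weighted by ξ for audio and 1 − ξ for web), looks as follows:

```python
def borda_fuse(web_ranking, audio_ranking, xi):
    """Fuse two ranked country lists with weighted Borda counts."""
    scores = {}
    for weight, ranking in ((1.0 - xi, web_ranking), (xi, audio_ranking)):
        for rank, country in enumerate(ranking):
            # Higher-ranked countries receive more points from this predictor.
            scores[country] = scores.get(country, 0.0) + weight * (len(ranking) - rank)
    return sorted(scores, key=scores.get, reverse=True)

print(borda_fuse(["ML", "SN", "MA"], ["SN", "TZ", "ML"], xi=0.35))
```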
Fig. 3. Accuracies achieved by the proposed hybrid approach, for different values of
parameters α (confidence threshold for web-based predictions) and ξ (mixture coeffi-
cient for Borda rank aggregation).
Table 2. Mean distance error (kilometers) for 1-NN predictions for the audio-based
prediction and the hybrid approach with different confidence thresholds α and ξ.
α ξ BLF+M50 BLF
0.0 0.35 2,656 2,621
0.1 0.16 2,791 2,616
0.2 0.06 2,540 2,576
0.3 0.06 2,221 2,769
0.4 0.06 2,077 2,833
0.5 0.06 1,825 2,749
Fig. 4. Confusion matrix for the audio-based BLF approach and the web-based app-
roach (M, p = 50, α = 0.5). Country names are encoded according to ISO 3166-1
alpha-2 codes.
Given the large amount of non-Western music in the collection, we will look
into multilingual extensions to our web-based approach. Furthermore, based on
the finding that, for the used dataset, more specific query schemes deteriorate
performance rather than boost it, owing to the small number of rele-
vant web pages, we will investigate whether this finding also holds for mainstream
Western music. To this end, we will additionally investigate larger datasets, as
ours is relatively small in comparison to the ones used in geolocalizing other
kinds of multimedia material. We also plan to create more precise annotations
for the origin of the pieces since the current granularity, i.e. the capital of the
country or area, may introduce a distortion of results. Finally, we plan to look
into data sources other than web pages, for instance social media, and to inves-
tigate aggregation techniques other than Borda rank aggregation.
References
1. Cheng, Z., Caverlee, J., Lee, K.: You are where you tweet: a content-based approach
to geo-locating twitter users. In: Proceedings of CIKM, October 2010
2. de Borda, J.-C.: Mémoire sur les élections au scrutin. Histoire de l’Académie Royale
des Sciences (1781)
3. Gómez, E., Herrera, P., Gómez-Martin, F.: Computational ethnomusicology: per-
spectives and challenges. J. New Music Res. 42(2), 111–112 (2013)
4. Govaerts, S., Duval, E.: A web-based approach to determine the origin of an artist.
In: Proceedings of ISMIR, October 2009
5. Hauff, C., Houben, G.-J.: Placing Images on the world map: a microblog-based
enrichment approach. In: Proceedings of SIGIR, August 2012
6. Kaminskas, M., Ricci, F., Schedl, M.: Location-aware music recommendation using
auto-tagging and hybrid matching. In: Proceedings of RecSys, October 2013
7. Kinsella, S., Murdock, V., O’Hare, N.: “I’m eating a sandwich in Glasgow”: mod-
eling locations with tweets. In: Proceedings of SMUC, October 2011
8. Knees, P., Schedl, M., Pohle, T.: A deeper look into web-based classification of
music artists. In: Proceedings of LSAS, June 2008
9. Koenigstein, N., Shavitt, Y.: Song ranking based on piracy in peer-to-peer net-
works. In: Proceedings of ISMIR, October 2009
10. Koenigstein, N., Shavitt, Y., Tankel, T.: Spotting out emerging artists using geo-
aware analysis of P2P query strings. In: Proceedings of KDD, August 2008
11. Liu, J., Inkpen, D.: Estimating user location in social media with stacked denoising
auto-encoders. In: Proceedings of Vector Space Modeling for NLP, June 2015
12. Ripley, B.D.: Spatial Statistics. Wiley, New York (2004)
13. Schedl, M., Flexer, A., Urbano, J.: The neglected user in music information retrieval
research. J. Intell. Inf. Syst. 41, 523–539 (2013)
14. Schedl, M., Schiketanz, C., Seyerlehner, K.: Country of origin determination via
web mining techniques. In: Proceedings of AdMIRe, July 2010
15. Schedl, M., Schnitzer, D.: Hybrid retrieval approaches to geospatial music recom-
mendation. In: Proceedings of SIGIR, July–August 2013
16. Schedl, M., Seyerlehner, K., Schnitzer, D., Widmer, G., Schiketanz, C.: Three web-
based heuristics to determine a person’s or institution’s country of origin. In: Pro-
ceedings of SIGIR, July 2010
17. Serra, X.: Data gathering for a culture specific approach in MIR. In: Proceedings
of AdMIRe, April 2012
18. Seyerlehner, K., Schedl, M., Knees, P., Sonnleitner, R.: A refined block-level feature
set for classification, similarity and tag prediction. In: Extended Abstract MIREX,
October 2009
19. Seyerlehner, K., Schedl, M., Sonnleitner, R., Hauger, D., Ionescu, B.: From
improved auto-taggers to improved music similarity measures. In: Nürnberger, A.,
Stober, S., Larsen, B., Detyniecki, M. (eds.) AMR 2012. LNCS, vol. 8382, pp.
193–202. Springer, Heidelberg (2014)
20. Seyerlehner, K., Widmer, G., Pohle, T.: Fusing block-level features for music sim-
ilarity estimation. In: Proceedings of DAFx, September 2010
21. Seyerlehner, K., Widmer, G., Schedl, M., Knees, P.: Automatic music tag classifi-
cation based on block-level features. In: Proceedings of SMC, July 2010
22. Trevisiol, M., Jégou, H., Delhumeau, J., Gravier, G.: Retrieving geo-location of
videos with a divide & conquer hierarchical multimodal approach. In: Proceedings
of ICMR, April 2013
23. Tzanetakis, G., Cook, P.: MARSYAS: a framework for audio analysis. Organ.
Sound 4, 169–175 (2000)
24. Workman, S., Souvenir, R., Jacobs, N.: Wide-area image geolocalization with aerial
reference imagery. In: Proceedings of ICCV, December 2015
25. Yu, H., Xie, L., Sanner, S.: Twitter-driven YouTube views: beyond individual
influencers. In: Proceedings of ACM Multimedia, November 2014
26. Zhou, F., Claire, Q., King, R.D.: Predicting the geographical origin of music. In:
Proceedings of ICDM, December 2014
Key Estimation in Electronic Dance Music
1 Introduction
The notion of tonality is one of the most prominent concepts in Western music.
In its broadest sense, it defines the systematic arrangements of pitch phenomena
and the relations between them, specially in reference to a main pitch class [9].
The idea of key conveys a similar meaning, but normally applied to a smaller
temporal scope, being common to have several key changes along the same musi-
cal piece. Different periods and musical styles have developed different practices
of tonality. For example, modulation (i.e. the process of digression from one local
key to another according to tonality dynamics [21]) seems to be one of the main
ingredients of musical language in euroclassical1 music [26], whereas pop music
tends to remain in a single key for a whole song or perform key changes by
different means [3,16].
Throughout this paper, we use the term electronic dance music (EDM) to
refer to a number of subgenres originating in the 1980’s and extending into the
present, intended for dancing at nightclubs and raves, with a strong presence
1 We take this term from Tagg [26] to refer to European Classical Music of the so-called common practice repertoire, on which most treatises on harmony are based.
of percussion and a steady beat [4]. Some of these styles even seem to break
with notions such as chord and harmonic progression (two basic building blocks
of tonality in the previously mentioned repertoires) and result in an interplay
between pitch classes of a given key, but without a sense of tonal direction.
These differences in the musical function of pitch and harmony suggest that
computational key estimation, a popular area in the Music Information Retrieval
(MIR) community, should take into account style-specific particularities and be
tailored to specific genres rather than aiming at all-purpose solutions.
In the particular context of EDM, automatic key detection could be useful
for a number of reasons, such as organising large music collections or facilitating
harmonic mixing, a technique used by DJ’s and music producers to mix and
layer sound files according to their tonal content.
One of the most important aspects of such an approach is the model used
in the similarity measure. Different key profiles have been proposed since the
3 Method
For this study, we gathered a collection of complete EDM tracks with a single
key estimation per item. The main source was Sha’ath’s list of 1,000 annota-
tions, determined by three human experts3 . However, we filtered out some non-
EDM songs and completed the training dataset with other manually annotated
resources from the internet4 , leaving us with a total of 925 tracks to extract new
tonal profiles.
To avoid overfitting, evaluations were carried out on an independent dataset of
EDM, the so-called GiantSteps key dataset [12], consisting of 604 two-minute
long excerpts from Beatport 5 , a well-known internet music store for DJs and
other EDM consumers. Additionally, we used Harte’s dataset [15] of 179 songs
by The Beatles reduced to a single estimation per song [20], to compare and
test our method on other popular styles that do not follow the typical EDM
conventions.
Despite the arguments presented in Sect. 2 about multi-modality in EDM,
we decided to shape our system according to a major/minor binary model. In
academic research, there has been little or no concern about tonality in electronic
popular music, normally considered as a secondary domain compared to rhythm
and timbre. In a way, the current paper stands as a first attempt at compensating
this void. Therefore, we decided to use available methodologies and datasets
(and all of these only deal with binary modality), to be able to compare our
work with existing research, showing that current algorithms perform poorly on
this repertoire. Furthermore, even in the field of EDM, commercial applications
and specialised websites seem to ignore the modal characteristics referred to above and
label their music within the classical paradigm.
drum’n’bass, electro, hip-hop, house, techno and trance. The most prominent
aspect is its bias toward the minor mode, which as stated above, seems repre-
sentative of this kind of music. Compared to the beatles dataset, of which only
10.6 % is annotated in minor (considering one single key estimation per song),
the training dataset shows exactly the inverse proportion, with 90.6 % of it in
minor. The GiantSteps dataset shows similar statistics (84.8 % minor), confirm-
ing theoretical observation [22].
Fig. 2. Distribution of major (bottom) and minor (top) keys by tonal center in beatles
(left), GiantSteps (center) and training (right) datasets.
3.1 Algorithm
Fig. 3. The four major (above) and minor (below) key profiles. Note that the major
profile of edmm is flat.
The simple method we chose is implemented with Essentia 6,7 , a C++ library
for audio information retrieval [1], and it is based on prior work by Gómez [6,7].
6 http://essentia.upf.edu/.
7 After informal testing, we decided to use the following settings in all the experiments reported: mix-down to mono; sampling rate: 44,100 Hz; window size: 4,096 (Hanning); hop size: 16,384; frequency range: 25–3,500 Hz; PCP size: 36 bins; weighting size: 1 semitone; similarity: cosine distance.
1. The ones proposed by Temperley [27], which are based on corpus analysis of
euroclassical music repertoire.
2. Manual modifications of the original Krumhansl profiles [13] by Sha’ath [23],
specifically oriented to EDM. The main differences in Sha’ath’s profiles are (a)
a slight boost of the weight for the VII degree in major; and (b) a significant
increment of the subtonic (VII) in minor. Other than these, the two profiles
remain essentially identical.
3. Major and minor profiles extracted as the median of the averaged chroma-
grams of the training set. This provided the best results compared to other gen-
eralisation methods (such as grand average, max average, etc.). Throughout
this paper we refer to these profiles as edma.
4. Manual adjustments on the extracted profiles (referenced as edmm) account-
ing for some of the tonal characteristics described in Sect. 2, especially the
prominence of the aeolian mode, and the much greater proportion of minor
keys. In that regard, given the extremely low proportion of major tracks in
the corpus, we decided to flatten the profile for major keys.
Figure 3 shows a comparison between these four profiles. They are all nor-
malised so that the sum of each vector equals 1. It is visible how the profiles
by Temperley favour the leading-tone in both modes, according to the euro-
classical tonal tradition, whilst the other three profiles increase the weight for
the subtonic. We can see that automatically generated profiles (edma) give less
prominence to the diatonic third degree in both modes, reflecting the modal
ambiguity present in much EDM. We compensated this manually, raising the
III in the minor profile, together with a decrement of the II (edmm).
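The template-matching step at the core of such profile-based key estimation can be sketched as follows; this simplified version uses a 12-bin chroma and cosine similarity, with values commonly attributed to Krumhansl [13] purely as an illustration, not the edma/edmm profiles themselves:

```python
import numpy as np

def estimate_key(chroma, major_profile, minor_profile):
    """Compare an averaged 12-bin chroma vector against all 24 rotations of
    the major and minor key profiles and return the best-matching key."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best = (-1.0, None)
    for mode, profile in (("major", major_profile), ("minor", minor_profile)):
        for tonic in range(12):
            score = cosine(chroma, np.roll(profile, tonic))
            if score > best[0]:
                best = (score, (tonic, mode))
    return best[1]  # (pitch-class index of the tonic, mode)

krumhansl_major = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
krumhansl_minor = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
# Toy chroma roughly following an A-aeolian pitch distribution (indices 0 = C, ..., 11 = B).
chroma = np.zeros(12)
chroma[[9, 11, 0, 2, 4, 5, 7]] = [1.0, 0.3, 0.8, 0.4, 0.9, 0.3, 0.5]
print(estimate_key(chroma, krumhansl_major, krumhansl_minor))  # (9, 'minor'), i.e. A minor
```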
Detuning Correction. We noted that some of the estimations with the basic
method produced tritone and semitone errors. Our hypothesis was that these
could be due to possible de-tunings produced by record players with manual
pitch/tempo corrections [23]. In order to tackle this, our algorithm uses a PCP
resolution of 3 bins per semitone, as is usual in key detection algorithms [8,19].
This allowed us to insert a post-processing stage that shifts the averaged PCP
±33 cents, depending on the position of the maximum peak in the vector.
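A rough sketch of such a detuning-correction step; which of the three bins per semitone represents the in-tune centre is an assumption here:

```python
import numpy as np

def correct_detuning(pcp36):
    """Shift a 36-bin pitch-class profile by one third of a semitone
    (about 33 cents) so that its maximum peak lands on a semitone centre,
    assuming bins 0, 3, 6, ... are the in-tune centres."""
    pcp36 = np.asarray(pcp36, dtype=float)
    offset = int(np.argmax(pcp36)) % 3
    if offset == 1:          # peak ~33 cents sharp: shift down one bin
        return np.roll(pcp36, -1)
    if offset == 2:          # peak ~33 cents flat: shift up one bin
        return np.roll(pcp36, 1)
    return pcp36             # already centred

pcp = np.zeros(36)
pcp[28] = 1.0
print(int(np.argmax(correct_detuning(pcp))))  # 27
```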
4 Results
Table 1 presents the weighted scores of our basic algorithm with the variations
we have described, now tested with 2 independent set collections, different than
those used for the algorithm development. The top four rows show the effect of
the key profiles discussed without further modifications. As expected, different
profiles provide quite different responses, depending on the repertoire. Temper-
ley’s profiles perform well on the beatles set, whereas they offer poor performance
8 http://www.ibrahimshaath.co.uk/keyfinder/.
9 http://isophonics.net/QMVampPlugins.
Table 1. MIREX weighted scores for the two datasets with four different profiles and
the proposed modifications: spectral whitening (sw) and detuning correction (dc). We
additionally report the weighted scores on the training set on the third column.
In any case, the combination of both processing stages gives the best results.
It is noteworthy that these modifications address different problems in the key-
estimation process, and consequently, the combined score results in the addition
of the two previous improvements. With these settings, the edma profiles yield
25.9 p.p. over the default settings in beatles, on which all key profiles obtain
significant improvement. On GiantSteps, however, we observe more modest
improvement.
4.1 Evaluation
Table 2 shows the results of our evaluation following the MIREX convention.
Results are organised separately for each dataset. Along with QM Key Detector
(qm) and KeyFinder (kf), we present our algorithm (with spectral whitening and
detuning correction) with three different profiles, namely Temperley’s (edmt),
automatically extracted profiles from our training dataset (edma), and the man-
ually adjusted ones (edmm).
Both benchmarking algorithms were tested using their default settings.
KeyFinder uses Sha’ath’s own profiles presented above, and provides a single
estimate per track. QM Key Detector, on the other hand, uses key profiles derived
from analysis of J. S. Bach’s Well Tempered Klavier I (1722), with window and
hop sizes of 32,768 points, providing a key estimation per frame. We have reduced
these by taking the most frequent estimation per track.
Edmt yields a weighted score of 81.2 in the beatles dataset, followed by edma
(76), above both benchmarking algorithms. Most errors concentrate on the fifth,
however, other common errors are minimised. Edmm produces 48 % parallel
errors, identifying all major keys as minor due to its flat major profile. For
GiantSteps, results are slightly lower. The highest rank is for edmm, with a
weighted score of 72.0, followed by edma and KeyFinder.
Table 2. Performance of the algorithms on the two evaluation datasets. Our method
is reported with spectral whitening and detuning correction, on three different pro-
files: temperley (edmt), edm-auto (edma) and edm-manual (edmm). Under the correct
estimations, we show results for different types of common errors.
4.2 Discussion
Among all algorithms under comparison, edma provides the best compromise
among different styles, scoring 76.0 points on beatles and 67.3 on GiantSteps.
This suggests that the modifications described are style-agnostic, since they offer
improvement over the compared methods in both styles. Spectral whitening and
detuning correction address different aspects of the key estimation process, and
their implementation works best in combination, independently of the key pro-
files used. However, results vary drastically depending on this last factor, evi-
dencing that a method based on tonality profiles should be tailored to specific
uses and hence is not suitable as a general-purpose key identification algorithm.
This is especially the case with our manually adjusted profiles, which are highly
biased toward the minor modes.
5 Conclusion
References
1. Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O.: ESSEN-
TIA: an open-source library for sound and music analysis. In: Proceedings 21st
ACM-ICM, pp. 855–858 (2013)
2. Cannam, C., Mauch, M., Davies, M.: MIREX 2013 Entry: Vamp plugins from the
Centre For Digital Music (2013). www.music-ir.org
3. Everett, W.: Making sense of rock’s tonal systems. Music Theory Online, vol. 10(4)
(2004)
4. Dayal, G., Ferrigno, E.: Electronic Dance Music. Grove Music Online. Oxford Uni-
versity Press, Oxford (2012)
5. Dressler, K., Streich, S.: Tuning frequency estimation using circular statistics. In:
Proceedings of the 8th ISMIR, pp. 2–5 (2007)
6. Gómez, E.: Tonal description of polyphonic audio for music content processing.
INFORMS J. Comput. 18(3), 294–304 (2006)
7. Gómez, E.: Tonal description of music audio signals. Ph.D. thesis, Universitat
Pompeu Fabra, Barcelona (2006)
8. Harte, C.: Towards automatic extraction of harmony information from music sig-
nals. Ph.D. thesis, Queen Mary University of London (2010)
9. Hyer, B.: Tonality. Grove Music Online. Oxford University Press, Oxford (2012)
10. James, R.: My life would suck without you / Where have you been all
my life: Tension-and-release structures in tonal rock and non-tonal EDM pop.
www.its-her-factory.com/2012/07/my-life-would-suck-without-youwhere-have-you-
been-all-my-life-tension-and-release-structures-in-tonal-rock-and-non-tonal-edm-
pop. Accessed 16th December 2014
11. Klapuri, A.: Multipitch analysis of polyphonic music and speech signals using an
auditory model. IEEE Trans. Audio Speech Lang. Process. 16(2), 255–266 (2008)
12. Knees, P., Faraldo, Á., Herrera, P., Vogl, R., Böck, S., Hörschläger, F., Le Goff, M.:
Two data sets for tempo estimation and key detection in electronic dance music
annotated from user corrections. In: Proceedings of the 16th ISMIR (2015)
13. Krumhansl, C.L.: Cognitive Foundations of Musical Pitch. Oxford University Press,
New York (1990)
14. Mauch, M., Dixon, S.: Approximate note transcription for the improved identifi-
cation of difficult chords. In: Proceedings of the 11th ISMIR, pp. 135–140 (2010)
15. Mauch, M., Cannam, C., Davies, M., Dixon, S., Harte, C., Kolozali, S., Tidjar,
D.: OMRAS2 metadata project 2009. In: Proceedings of the 10th ISMIR, Late-
Breaking Session (2009)
16. Moore, A.: The so-called “flattened seventh” in rock. Pop. Music 14(2), 185–201
(1995)
17. Müller, M., Ewert, S.: Towards timbre-invariant audio features for harmony-based
music. IEEE Trans. Audio Speech Lang. Process. 18(3), 649–662 (2010)
18. Noland, K.: Computational Tonality estimation: Signal Processing and Hidden
Markov Models. Ph.D. thesis, Queen Mary University of London (2009)
19. Noland, K., Sandler, M.: Signal processing parameters for tonality estimation. In:
Proceedings of the 122nd Convention of the Audio Engineering Society (2007)
20. Pollack, A.W.: Notes on... series. Accessed 1 February 2015. www.icce.rug.nl/
soundscapes/DATABASES/AWP/awp-notes on.shtml
21. Saslaw, J.: Modulation (i). Grove Music Online. Oxford University Press, Oxford
(2012)
22. Schellenberg, E.G., von Scheve, C.: Emotional cues in American popular music: five
decades of the Top 40. Psychol. Aesthetics Creativity Arts 6(3), 196–203 (2012)
23. Sha’ath, I.: Estimation of key in digital music recordings. Department of
Computer Science & Information Systems, Birkbeck College, University of London
(2011)
24. Spicer, M.: (Ac)cumulative form in pop-rock music. Twentieth Century Music 1(1),
29–64 (2004)
25. Tagg, P.: From refrain to rave: the decline of figure and rise of ground. Pop. Music
13(2), 209–222 (1994)
26. Tagg, P.: Everyday tonality II (Towards a tonal theory of what most people hear).
The Mass Media Music Scholars’ Press. New York and Huddersfield (2014)
27. Temperley, D.: What’s key for key? The Krumhansl-Schmuckler key-finding algo-
rithm reconsidered. Music Percept. Interdiscip. J. 17(1), 65–100 (1999)
28. Röbel, A., Rodet, X.: Efficient spectral envelope estimation and its application to
pitch shifting and envelope preservation. In: Proceedings of the 8th DAFX (2005)
29. Zhu, Y., Kankanhalli, M.S., Gao, S.: Music key detection for musical audio. In:
Proceedings of the 11th IMMC, pp. 30–37 (2005)
Summarization
Evaluating Text Summarization Systems
with a Fair Baseline from Multiple
Reference Summaries
1 Introduction
Human quality text summarization systems are difficult to design and even
more difficult to evaluate [1]. The extractive summarization task has been most
recently portrayed as ranking sentences based on their likelihood of being part
of the summary and their salience. However different approaches are also being
tried with the goal of making the ranking process more semantically meaning-
ful, for example: using synonym-antonym relations between words, utilizing a
semantic parser, relating words not only by their co-occurrence, but also by
their semantic relatedness. Work is also on going to improve anaphora resolu-
tion, defining dependency relations, etc. with a goal of improving the language
understanding of a system.
A series of workshops on text summarization (WAS 2000-2002), special ses-
sions in ACL, CoLING, SIGIR, and government sponsored evaluation efforts in
United States (DUC 2001-DUC2007) have advanced the technology and pro-
duced a couple of experimental online systems [15]. However there are no com-
mon, convenient, and repeatable evaluation methods that can be easily applied
to support system development and comparison among different summarization
techniques [8].
Several studies ([9,10,16,17]) suggest that multiple human gold-standard
summaries would provide a better ground for comparison. Lin [5] states that
multiple references tend to increase evaluation stability although human judge-
ments only refer to a single reference summary.
After considering the evaluation procedures of ROUGE [6], Pyramid [12], and
their variants e.g., ParaEval [19], we present another approach to evaluating the
performance of a summarization system which works with one or many reference
summaries.
Our major contributions are:
– We propose the average or expected size of the intersection of two randomly
generated summaries as a generic baseline (Sects. 3 and 4). Such a strategy was
discussed briefly by Goldstein et al. [1]. However, to the best of our knowledge,
we have found no direct use of the idea while scoring a summarization system.
We use the baseline to find a related (normalized) score for each reference and
machine-generated summaries.
– Using this baseline, we outline an approach (Sect. 5) to evaluating a summary.
Additionally, we outline the rationale for a new measure of summary quality,
detail some experimental results and also give an alternate derivation of the
average intersection calculation.
2 Related Work
Most of the existing evaluation approaches use absolute scales (e.g., precision,
recall, f-measure) to evaluate the performance of the participating systems. Such
measures can be used to compare summarization algorithms, but they do not indi-
cate how significant the improvement of one summarizer over another is [1].
ROUGE (Recall Oriented Understudy for Gisting Evaluation) [6] is one of the
well known techniques to evaluate single/multi-document summaries. ROUGE is
closely modelled after BLEU [14], a package for machine translation evaluation.
ROUGE includes measures to automatically determine the quality of a summary
by comparing it to other (ideal) summaries created by humans. The measures
count the number of overlapping units such as n-gram, word sequences, and word
pairs between the machine-generated summary and the reference summaries.
Among the major variants of ROUGE measures, e.g., ROUGE-N, ROUGE-L,
ROUGE-W, and, ROUGE-S, three have been used in the Document Understand-
ing Conference (DUC) 2004, a large-scale summarization evaluation sponsored
by NIST. Though ROUGE has been shown to correlate well with human judgements, it
considers fragments of various lengths to be equally important, a factor that
unfairly rewards low-informativeness fragments relative to high-informativeness
ones [3].
Our method uses a unique baseline for all (system, and reference summaries)
and it does not need the absolute scale (like f, p, r) to score the summaries.
We need to ensure a single rating for each system unit [7]. Besides, we need a
common ground for comparing available multiple references to reach a unique
standard. Precision, Recall, and F-measure are not exactly a good fit in such a case.
Another important task for an evaluation technique is defining a fair baseline.
Various ways (first sentence, last sentence, sentences overlapping most with the
title, etc.) are being tried to generate the baseline. Nenkova [13] designed a
baseline generation approach, SumBasic, which was applied to the DUC 2004 dataset.
But we need a generic way to produce the baseline for all types of documents.
The main task of a baseline is to define (quantify) the least possible result that
can be compared with the competing systems to get a comparative scenario.
For each possible size i = {0..k} of an intersecting subset, the numerator sums
the product of i and the number of different possible subsets of size i, giving the
total number of
elements in all possible intersecting subsets. For a particular size
i there are ki ways to select the i intersecting elements from K, which leaves
n−k elements from which to choose the k−i non-intersecting elements (or l−i in
the case of two randomly selected subsets). The denominator simply counts the
number of possible subsets, so that the fraction itself gives the expected number
of elements in a randomly selected subset.
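This expectation can be checked numerically; the sketch below sums over all feasible intersection sizes as described above and confirms that the result coincides with the closed form kl/n, the mean of the corresponding hypergeometric distribution:

```python
from math import comb

def expected_intersection(n, k, l):
    """Expected size of the intersection of a fixed k-subset with a random
    l-subset of an n-element set."""
    numerator = sum(i * comb(k, i) * comb(n - k, l - i)
                    for i in range(0, min(k, l) + 1))
    return numerator / comb(n, l)

n, k, l = 40, 10, 8
print(expected_intersection(n, k, l), k * l / n)  # both print 2.0
```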
Simplifying Equation 2: Equation 2 is expressed as a combinatorial construc-
tion, but the probabilistic one is perhaps simpler: the probability of any element
x being present in both subset K and subset L is the probability that x is
contained in the intersection of those two sets I = L ∩ K.
$\Pr(x \in K) \cdot \Pr(x \in L) = \Pr(x \in (L \cap K)) = \Pr(x \in I) \quad (3)$
$r = \frac{i}{k}, \qquad p = \frac{i}{l}, \qquad \text{f-measure} = \frac{2pr}{p+r}$
Therefore, f-measure (the balanced harmonic mean of p and r) for these two
random sets is:
$\text{f-measure}_{expected} = \frac{2pr}{p+r} = \frac{2(l/n)(k/n)}{l/n + k/n} = \frac{2lk/n^2}{(l+k)/n} = \frac{2lk}{n(l+k)} = \frac{2i}{l+k} = \frac{i}{(l+k)/2} \quad (6)$
$r = \frac{|K \cap L|}{|K|} = \frac{\omega}{k}, \qquad p = \frac{|K \cap L|}{|L|} = \frac{\omega}{l}$

$\text{f-measure}_{observed} = \frac{2pr}{p+r} = \frac{2\omega^2/(kl)}{(k+l)\,\omega/(kl)} = \frac{2\omega}{k+l} = \frac{\omega}{(k+l)/2} \quad (7)$
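Putting Eqs. (6) and (7) side by side, a minimal sketch of the observed i-measure and its random baseline for two bag-of-words summaries; tokenization and stop-word handling are simplified assumptions:

```python
def i_measure(system_tokens, reference_tokens):
    """Observed f-measure expressed through the intersection size omega (Eq. 7)."""
    k, l = len(set(reference_tokens)), len(set(system_tokens))
    omega = len(set(reference_tokens) & set(system_tokens))
    return omega / ((k + l) / 2)

def expected_i_measure(n, k, l):
    """Baseline: f-measure of two random subsets of sizes k and l drawn from
    an n-term vocabulary, using the expected intersection i = kl/n (Eq. 6)."""
    return (k * l / n) / ((k + l) / 2)

ref = "clinton arrives israel gaza salvage wye accord".split()
sys_summary = "clinton met israeli netanyahu put wye accord".split()
print(i_measure(sys_summary, ref), expected_i_measure(n=50, k=7, l=7))
```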
– The system’s summary size (l) does not have to be exactly the same as the
reference summary’s size (k), which is a unique feature. Giving this flexibility
encourages systems to produce more informative summaries.
– If k and l are equal, i-measure follows the observed intersection, for example
b wins over a and c. In this case i-measure shows a compatible behavior with
recall based approaches.
– For two systems with different l values, but same intersection size, the one
with smaller l wins (e.g., a,d, and g). It indicates that system g (in this case)
was able to extract important information with greater compression ratio; this
is compatible with the precision based approaches.
Our approach is generic and can be used for any summarization model that uses
multiple reference summaries. We have used the DUC-2004 structure as a model.
We use i-measure(d, xj , xk ) to denote the i-measure calculated for a particular
document d using the given summaries xj and xk .
Let λ machines (S = {s1 , s2 , . . . , sλ }) participate in a single docu-
ment summarization task. For each document, m reference summaries (H =
{h1, h2, . . . , hm}) are provided. We compute the i-measure between the $\binom{m}{2}$ pairs of
reference summaries and normalize with respect to the best pair. We also com-
pute the i-measure for each machine generated summary with respect to each
reference summary and then normalize it. We call these normalized i-measures
and denote them as
$w_d(h_p, h_q) = \frac{\text{i-measure}(d, h_p, h_q)}{\mu_d}, \qquad w_d(s_j, h_p) = \frac{\text{i-measure}(d, s_j, h_p)}{\mu_{(d,h_p)}} \quad (9)$

where

$\mu_d = \max(\text{i-measure}(d, h_p, h_q)), \; \forall h_p \in H, h_q \in H, h_p \neq h_q$
$\mu_{(d,h_p)} = \max(\text{i-measure}(d, s, h_p)), \; \forall s \in S$
The next phase is to build a heterogeneous network of systems and references
to represent the relationship.
Given t different tasks (single documents) for which there are reference and
machine generated summaries from the same sources, we can define the total
performance of system sj as
$\text{i-score}(s_j) = \frac{\sum_{i=1}^{t} \text{score}(s_j, d_i)}{t}. \quad (12)$
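A hedged sketch of the normalization and aggregation in Eqs. (9) and (12), simplified to a single reference summary per document; the nested-dictionary layout and variable names are our assumptions:

```python
def normalized_weights(i_measures):
    """Given i_measures[d][s] for the system summaries s of document d,
    normalize each value by the best i-measure obtained for that document."""
    weights = {}
    for d, scores in i_measures.items():
        best = max(scores.values()) or 1.0   # guard against all-zero scores
        weights[d] = {s: v / best for s, v in scores.items()}
    return weights

def i_score(per_document_scores):
    """Average a system's per-document scores over all t documents (Eq. 12)."""
    return sum(per_document_scores) / len(per_document_scores)

w = normalized_weights({"d1": {"s31": 0.27, "s90": 0.02},
                        "d2": {"s31": 0.06, "s90": 0.10}})
print(i_score([w[d]["s31"] for d in ("d1", "d2")]))  # i-score of system s31
```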
Reference Summary
B Clinton arrives in Israel, to go to Gaza, attempts to salvage Wye accord.
G Mid-east Wye Accord off-track as Clintons visit; actions stalled, violence
E President Clinton met Sunday with Prime Minister Netanyahu in Israel
F Clinton meets Netanyahu, says peace only choice. Office of both shaky
90 ISRAELI FOREIGN MINISTER ARIEL SHARON TOLD REPORTERS
DURING PICTURE-TAKIN=
6 VISIT PALESTINIAN U.S. President Clinton met to put Wye River
peace accord
31 Clinton met Israeli Netanyahu put Wye accord
Fig. 1. System-reference graph:
edge-weights represent the normal-
ized i-measure
Table 2 shows four reference summaries (B, G, E, F ) and three machine sum-
maries (31, 90, 6) for document D30053.APW19981213.0224. Table 3 shows the
normalized i-measure for each reference pair. While comparing the summaries,
we ignored the stop-words and punctuations. Tables 4 and Table 5, and Fig. 1
represents some intermediate calculation using Eqs. 10 and 11 for document
D30053.APW19981213.0224.
7 Experimental Results
We perform different experiments over the dataset. Section 7.1 describes how
i-measure among the reference summaries can be used to find the confidence/
$\text{score}(s_j, d) = \sum_{p=1}^{c} c_d(h_p) \times w_d(s_j, h_p) \times \frac{sen}{t_{sen}} \quad (13)$
8 Conclusion
We present a mathematical model for defining a generic baseline. We also pro-
pose a new approach to evaluate machine-generated summaries with respect to
multiple reference summaries, all normalized with the baseline. The experiments
show comparable results with existing evaluation techniques (e.g., ROUGE). Our
model correlates well with human decision as well.
The i-measure based approach shows some flexibility with summary length.
Instead of using average overlapping of words/phrases, we define pair based
conf idence calculation between each reference. Finally, we propose an extension
of the model to evaluate the quality of a summary by combining the bag-of-words
like model to accredit sentence structure while scoring.
In future work, we will extend the model so that it works with semantic relations
(e.g., synonym, hypernym, etc.). We also need to investigate further the confidence-
definition approach for question-based/topic-specific summary evaluation tasks.
A Appendix
The equivalence of Eqs. 2 and 5 can be shown using the following elemen-
tary identities on binomial coefficients: the symmetry rule, the absorption
rule and Vandermonde’s convolution [2].
Proof. Consider first the denominator of Eq. 2. The introduction of new
variables makes it easier to see that identities are appropriately applied,
and we do so here by letting s = n − k and then swapping each binomial
coefficient for its symmetrical equivalent (symmetry rule).
$\sum_{i=0}^{k} \binom{k}{i}\binom{s}{l-i} = \sum_{i=0}^{k} \binom{k}{k-i}\binom{s}{s-l+i}$
Letting j = s − l and applying Vandermonde’s convolution:

$\sum_{i=0}^{k} \binom{s}{j+i}\binom{k}{k-i} = \binom{k+s}{j+k} = \binom{k+n-k}{n-k-l+k} = \binom{n}{n-l} = \binom{n}{l}$
References
1. Goldstein, J., Kantrowitz, M., Mittal, V., Carbonell, J.: Summarizing text documents: sentence selection and evaluation metrics. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, pp. 121–128. ACM, New York (1999). http://doi.acm.org/10.1145/312624.312665
2. Graham, R., Knuth, D., Patashnik, O.: Concrete Mathematics: A Foundation for
Computer Science. Addison-Wesley, Boston (1994)
3. Hovy, E., Lin, C.-Y., Zhou, L., Fukumoto, J.: Automated summarization evalu-
ation with basic elements. In: Proceedings of the Fifth Conference on Language
Resources and Evaluation (LREC 2006) (2006)
4. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93
(1938). http://www.jstor.org/stable/2332226
5. Lin, C.Y.: Looking for a few good metrics: automatic summarization evaluation -
how many samples are enough? In: Proceedings of the NTCIR Workshop 4 (2004)
6. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries, pp. 25–26
(2004)
7. Lin, C.Y., Hovy, E.: Manual and automatic evaluation of summaries. In: Proceed-
ings of the ACL-2002 Workshop on Automatic Summarization, AS 2002, vol. 4, pp.
45–51. Association for Computational Linguistics, Stroudsburg, PA, USA (2002).
http://dx.doi.org/10.3115/1118162.1118168
8. Lin, C.Y., Hovy, E.: Automatic evaluation of summaries using n-gram co-
occurrence statistics. In: Proceedings of the 2003 Conference of the North Amer-
ican Chapter of the Association for Computational Linguistics on Human Lan-
guage Technology, NAACL 2003, vol. 1, pp. 71–78. Association for Computational
Linguistics, Stroudsburg, PA, USA (2003). http://dx.doi.org/10.3115/1073445.
1073465
9. Mani, I., Maybury, M.T.: Automatic summarization. In: Association for Computational Linguistics, 39th Annual Meeting and 10th Conference of the European Chapter, Companion Volume to the Proceedings of the Conference: Proceedings of the Student Research Workshop and Tutorial Abstracts, p. 5, Toulouse, France, 9–11 July 2001
10. Marcu, D.: From discourse structures to text summaries. In: Proceedings of the
ACL Workshop on Intelligent Scalable Text Summarization, pp. 82–88 (1997)
11. Nenkova, A., Passonneau, R., McKeown, K.: The pyramid method: Incorporat-
ing human content selection variation in summarization evaluation. ACM Trans.
Speech Lang. Process. 4(2) (2007). http://doi.acm.org/10.1145/1233912.1233913
12. Nenkova, A., Passonneau, R.J.: Evaluating content selection in summarization: the
pyramid method. In: HLT-NAACL, pp. 145–152 (2004). http://acl.ldc.upenn.edu/
hlt-naacl2004/main/pdf/91 Paper.pdf
13. Nenkova, A., Vanderwende, L.: The impact of frequency on summarization.
Microsoft Research, Redmond, Washington, Technical report MSR-TR-2005-101
(2005)
14. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on
Association for Computational Linguistics, ACL 2002, pp. 311–318. Association
for Computational Linguistics, Stroudsburg, PA, USA (2002). http://dx.doi.org/
10.3115/1073083.1073135
15. Radev, D., Blair-Goldensohn, S., Zhang, Z., Raghavan, R.: Newsinessence: a sys-
tem for domain-independent, real-time news clustering and multi-document sum-
marization. In: Proceedings of the First International Conference on Human Lan-
guage Technology Research (2001). http://www.aclweb.org/anthology/H01-1056
16. Rath, G.J., Resnick, A., Savage, T.R.: The formation of abstracts by the selection
of sentences. Part I. Sentence selection by men and machines. Am. Documentation
12, 139–141 (1961). http://dx.doi.org/10.1002/asi.5090120210
17. Salton, G., Singhal, A., Mitra, M., Buckley, C.: Automatic text struc-
turing and summarization. Inf. Process. Manage. 33(2), 193–207 (1997).
http://dx.doi.org/10.1016/S0306-4573(96)00062-3
18. Spearman, C.: The proof and measurement of association between two things. Am.
J. Psychol. 15(1), 72–101 (1904). http://www.jstor.org/stable/1412159
19. Zhou, L., Lin, C.Y., Munteanu, D.S., Hovy, E.: Paraeval: using paraphrases to eval-
uate summaries automatically. Association for Computational Linguistics, April
2006. http://research.microsoft.com/apps/pubs/default.aspx?id=69253
Multi-document Summarization
Based on Atomic Semantic Events
and Their Temporal Relationships
1 Introduction
Automatic multi-document summarization (MDS) extracts core information
from the source text and presents the most important content to the user in
a concise form [24]. The important information is contained in textual units or
groups of textual units which should be taken into consideration in generating a
coherent and salient summary. In this paper, we propose an event-based model of generic MDS in which we represent the generic summarization problem as an atomic event extraction problem as well as a topic distribution problem. Another, newer type of summarization, called update MDS, aims to produce a salient summary of updated documents under the assumption that the user has already read earlier documents about the same topic. The best recent efforts to generate update summaries use graph-based algorithms with some additional features to explore the novelty of the documents [9,20,36]. A Maximal Marginal Relevance (MMR)-based
approach [3] is used to blindly filter out the new information. These approaches
discard the sentences containing novel information if they contain some old infor-
mation from the previous document sets [7].
Steinberger et al. [33] use the sentence time information in the Latent Seman-
tic Analysis (LSA) framework to get the novel sentences. They only consider the
first time expression as the anchored time of the sentence, but sentences may
contain multiple time expressions from various chronologies. For instance, con-
sider the sentence “Two members of Basque separatist group ETA arrested while
transporting half a tonne of explosives to Madrid just prior to the March 2004
bombings received jail sentences of 22 years each on Monday”. Here we get two¹ time expressions: March 2004 and Monday. The first expression represents the
very old information, and the second one represents the accurate anchoring of
the sentence. If we consider the first time expression as the sentence’s time, like
Steinberger et al. [33] would, then it would give us false novel/update informa-
tion. This is why we take into account all of the events of a sentence to calculate
its anchored time. In this paper, we also design a novel approach by taking into
account all of the events in a sentence and their temporal relations to ensure
the novelty, as well as the saliency, in update summarization. We represent the
novelty detection problem as a chronological ordering problem of the tempo-
ral events and time expressions. Our event based sentence ranking system uses
a topic model to identify all of the salient sentences. The rest of the paper is organized as follows. Section 2 reviews previous related work in text summarization. Section 3 describes our proposed summarization models. Section 4 gives the evaluation of our systems. Section 5 presents our conclusions and future work.
2 Related Works
Every document covers a central theme or event. There are other sub-events which support the central event. There are also many words or terms throughout the document that can act as individual events and contribute to the main theme. Named entities such as time, date, person, money, organizations, locations, etc., are also significant because they build up the document structure. Although events and named entities are terms or groups of terms, they have a higher significance than normal words or terms. Those events and named entities can help to generate high-performing summaries. Filatova and Hatzivassiloglou [11] used atomic events in extractive summarization. They considered an event to be a triplet of two named entities and a verb (or action noun), where the verb (or action noun) connects the two named entities. Several greedy algorithms based on the co-occurrence statistics of events are used to generate a summary. They showed that event-based summaries achieve a much better score than summaries generated by tf*idf weighting of words. Li et al. [19] also defined the same complex structure as an event and applied the PageRank algorithm [29] to …
¹ Here ‘22 years’ is a time period. Time periods do not carry important information for detecting novelty.
3 Our Methodologies
In this paper, we use Stanford CoreNLP³ for tokenization, named entity recognition, and cross-document coreference resolution. We remove all of the candidate sentences containing quotations. We also remove the candidate sentences whose length is less than 11 words. Sentences containing quotations are not appropriate for a summary, and shorter sentences carry a small amount of relative
² http://duc.nist.gov/duc2007/tasks.html.
³ http://nlp.stanford.edu/software/corenlp.shtml.
information [17]. After tokenization, we remove stop words. We use Porter Stem-
mer [30] for stemming. Stemmed words or terms are then fed to Latent Dirichlet
Allocation (LDA) engine for further processing. We use ClearTK4 system [2] for
event and temporal relation extraction.
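A minimal sketch of this preprocessing pipeline, using NLTK as a convenient stand-in for illustration (the paper itself relies on Stanford CoreNLP for tokenization and coreference, Porter's stemmer for stemming, and ClearTK for event and temporal relation extraction); the quotation and length filters follow the thresholds stated above:

```python
# Assumes the NLTK data packages "punkt" and "stopwords" have been downloaded.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def keep_sentence(sentence):
    """Drop quoted sentences and sentences shorter than 11 words."""
    return '"' not in sentence and len(sentence.split()) >= 11

def preprocess(sentence):
    """Tokenize, remove stop words, and stem; the result is fed to the LDA engine."""
    tokens = nltk.word_tokenize(sentence.lower())
    return [STEMMER.stem(t) for t in tokens if t.isalpha() and t not in STOP]
```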
S_p = 1 - \frac{i}{SC_d} \qquad (2)
In Eq. (3), W_g is the specific weight factor for each group of terms, and TC_g is the number of terms in a group g, where g ∈ {event (e), named-entity (n), other (o)}. We empirically set W_g to 1 for the group other (the set of normal terms other than events and named entities), while W_g for the event and named-entity groups is calculated by Eq. (5).
Our weight-calculating scheme ensures larger weights for the event and named-entity groups and also prevents the most frequently occurring group from dominating the score. The steps of our generic MDS system are as follows:
⁴ http://code.google.com/p/cleartk/.
⁵ Document Creation Time (DCT) can be calculated from the document name.
1. Apply the LDA topic model on the corpus of documents for a fixed number6
of topics K.
2. Compute the probability of topic Tj by Eq. (1) and sort the topics in the
descending order of their probabilities.
3. Pick the topic Tj from the sorted list in the order of the topic probabilities, i.e., P(T1), …, P(TK).
4. For topic Tj, compute the score of all of the sentences by Eq. (3), where P(t) is the unigram probability distribution obtained from the LDA topic model.
5. For topic Tj, pick the sentence with the highest score and include it in the summary. If it is already included in the summary or it violates other requirements (the cosine score between the candidate sentence and the already-included summary sentences exceeds a certain threshold), pick the sentence with the next highest score for this topic Tj.
6. Each selected sentence is compressed according to the method described in Sect. 3.3.
7. If the summary reaches its desired length, terminate; otherwise continue from step 3.
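The selection loop in steps 1–7 could be sketched roughly as follows (not the authors' implementation); sentence_score stands in for Eq. (3), compress for the method of Sect. 3.3, and cosine for the similarity check, while the redundancy threshold and summary length are placeholder values rather than the paper's exact settings:

```python
def build_summary(topics, sentences, sentence_score, compress, cosine,
                  max_words=100, redundancy_threshold=0.5):
    """Greedy topic-driven sentence selection corresponding to steps 2-7 above."""
    # Step 2: sort topics by their probability P(T_j), highest first.
    ordered_topics = [t for t, prob in
                      sorted(topics, key=lambda tp: tp[1], reverse=True)]
    selected, summary = [], []

    def summary_length():
        return sum(len(s.split()) for s in summary)

    while summary_length() < max_words:
        added_any = False
        for topic in ordered_topics:                        # step 3
            # Step 4: score every candidate sentence under this topic.
            ranked = sorted(sentences,
                            key=lambda s: sentence_score(s, topic),
                            reverse=True)
            # Step 5: best sentence that is new and non-redundant.
            for s in ranked:
                if s in selected:
                    continue
                if any(cosine(s, t) > redundancy_threshold for t in summary):
                    continue
                selected.append(s)
                summary.append(compress(s))                 # step 6
                added_any = True
                break
            if summary_length() >= max_words:               # step 7
                return summary
        if not added_any:          # no eligible sentences remain
            break
    return summary
```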
department official said”. Here last week and Thursday are the time expressions
of the sentence. The SUTime output for the above text is shown below, where October 11, 2006 is the reference date:
“The Amish school where a gunman shot 10 girls <TIMEX3 tid=“t2” type=“DATE” value=“2006-W40”>last week</TIMEX3>, killing five of them, is expected to be demolished <TIMEX3 tid=“t3” type=“DATE” value=“2006-10-12”>Thursday</TIMEX3>, a fire department official said.”
SUTime extracts 2006-W40 and 2006-10-12 as the normalized date of
last week and Thursday, respectively. We convert them into an absolute time
end point on a universal timeline. We follow standard date and time format
(YYYY-MM-DD hh:mm:ss) for the time end point. For example, after conver-
sion of 2006-W40 and 2006-10-12, we get 2006-09-23 23:59:59 and 2006-10-12
23:59:59, respectively.
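A rough sketch of this conversion step; the exact week-numbering convention is an assumption here (ISO weeks are used below), so the end point computed for a week expression may differ slightly from the example in the text:

```python
from datetime import datetime

def to_end_timestamp(value):
    """Map a normalized SUTime date value onto an absolute end-of-period
    time point in YYYY-MM-DD hh:mm:ss format."""
    if "-W" in value:                                   # week value, e.g. "2006-W40"
        year, week = value.split("-W")
        last_day = datetime.fromisocalendar(int(year), int(week), 7)  # Sunday
        return last_day.strftime("%Y-%m-%d") + " 23:59:59"
    return value + " 23:59:59"                          # plain date, e.g. "2006-10-12"

print(to_end_timestamp("2006-W40"))    # end of week 40 of 2006 under ISO weeks
print(to_end_timestamp("2006-10-12"))  # 2006-10-12 23:59:59
```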
We anchor all of the events to absolute times based on the ‘includes’ and ‘is-
included’ relations of the event-time links. The remaining events are anchored
approximately, based on other relations which are ‘before’ and ‘after’.
Temporal Score. To obtain the temporal score, we use the ClearTK system [2] for initial temporal relation extraction and some transitive rules as described earlier. First, we relax the original event–time association problem by anchoring each event to an approximate time. Then, we calculate the temporal score of a sentence by taking the average time score of all of the events' anchored times. Then, all of the sentences are ordered in the descending order of their temporal scores
except for the first sentence of each document. Then, we calculate the temporal position score (tps) of the temporally ordered sentences. The tps of the first sentence of a document is considered to be one. The temporal position scores of the remaining sentences can be calculated by Eq. (6), where Ds is the number of sentences in document D and i ∈ {0, …, Ds − 1} is the temporally ordered sentence position index.
tps = 1 - \frac{\gamma \times i}{D_s} \qquad (6)
The parameter (γ) is used to tune the weight of the relative temporal position
of the sentences.
Sentence Ranking. From the Latent Dirichlet Allocation (LDA) topic model,
we obtain a unigram (event or named entity) probability distribution, P (t). For
each topic, the sentence score can be computed using Eq. (7).
\text{Score}(s) = tps \times \Big( \sum_{t \in S} P(t) \times \alpha \times W_g + \sum_{t \in S} P(t) \times \beta \times W_g \Big) \qquad (7)
In Eq. (7), W_g can be calculated using Eq. (5), tps is the temporal position score of a sentence obtained from Eq. (6), and α and β are the weight factors of the new terms and the topic title terms, respectively, which are learned from the TAC'2010 dataset. For each topic, one sentence is taken as a summary sentence from the ordered list of sentences (in descending order of their score, Score(s)). We use a cosine similarity score to remove redundancy from the summary. Additionally, we use the same sentence compression technique as in the generic summarization.
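A sketch of Eqs. (6) and (7) in code; which terms count as new terms and which as topic-title terms, as well as the P(t) distribution, the group weights W_g, and the term-to-group mapping, are left abstract and are assumptions of this sketch:

```python
def temporal_position_score(i, num_sentences, gamma):
    """Eq. (6): tps = 1 - (gamma * i) / D_s for temporally ordered position i."""
    return 1.0 - gamma * i / num_sentences

def sentence_score(terms, p, w_g, group_of, new_terms, title_terms,
                   alpha, beta, tps):
    """Eq. (7): tps times the weighted sums of term probabilities, with alpha
    weighting new terms and beta weighting topic-title terms; w_g maps a term
    group (event / named entity / other) to its weight from Eq. (5)."""
    new_part = sum(p(t) * alpha * w_g[group_of(t)]
                   for t in terms if t in new_terms)
    title_part = sum(p(t) * beta * w_g[group_of(t)]
                     for t in terms if t in title_terms)
    return tps * (new_part + title_part)
```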
4 Evaluation
4.1 ROUGE Evaluation: Generic Summarization
We use the DUC 2004 dataset to evaluate our generic MDS system. We perform
our experiment on 35 clusters of 10 documents each. DUC 2004 Task-2 was to
create short multi-document summaries no longer than 665 bytes. We evaluate
the summaries generated by our system using the automatic evaluation toolkit
ROUGE⁷ [22]. We compare our system with some recent systems, including the best system in DUC 2004 (Peer 65), a conceptual-units-based model [34], and G-FLOW, a recent state-of-the-art coherent summarization system [6]. As shown in Table 1, our system outperforms those three systems. It also scores better than the recent submodular-functions-based state-of-the-art system⁸ [23]
⁷ ROUGE runtime arguments for DUC 2004: ROUGE -a -c 95 -b 665 -m -n 4 -w 1.2.
⁸ We do not compare our system with the recent topic-model-based system [14] because that system is significantly outperformed by Lin and Bilmes's [23] system in terms of both ROUGE-1 recall and F1-measure.
Table 1. Evaluation on the DUC 2004 dataset (The best results are bolded)
Systems R-1 F1
Peer 65 0.3828 0.3794
Takamura and Okumura [34] 0.3850 -
G-FLOW 0.3733 0.3743
Lin 0.3935 0.3890
Our generic MDS System 0.3953 0.3983
To evaluate our update MDS system, we use the TAC'2011 dataset, which contains two groups of data, A and B. Group A contains the old dataset; group B contains the new dataset on the same topic as group A. We perform our experiment on 28 clusters of 10 documents each. The TAC'2011 guided update summarization task was to create short multi-document summaries no longer than 100 words under the assumption that the user has already read the documents from group A. Table 3 tabulates the ROUGE scores of our system and the best perform-
ing systems in TAC’2011 update summarization task. Our model outperforms
the current state-of-the-art system, which is h-uHDPSum, as well as the best
update summarization system (peer 43) of TAC’2011 summarization track. 95 %
confidence intervals in Table 4 show that our system obtains significant improve-
ment over the two systems (h-uHDPSum and Peer 43) in terms of ROUGE-2
and ROUGE-SU4. The performance of our event and temporal relation based summarizer depends on the type of documents being summarized. Our system gets very high recall and f-measures for documents that are rich in events. Our temporal relation based system reveals
all of the hidden novel information. At the same time, our event- and named-entity-based scoring scheme ensures saliency in update summarization.
Relevancy 3.92, Non-redundancy 3.50, Overall responsiveness 3.70
Novelty 4.13, Fluency 3.92, Overall responsiveness 4.07
References
1. Allen, J.F.: Maintaining knowledge about temporal intervals. Commun. ACM 26(11), 832–843 (1983)
2. Bethard, S.: Cleartk-timeml: a minimalist approach to tempeval. In: Second Joint
Conference on Lexical and Computational Semantics (* SEM), vol. 2, pp. 10–14
(2013)
3. Boudin, F., El-Bèze, M., Torres-Moreno, J. M.: A scalable MMR approach to
sentence scoring for multi-document update summarization. COLING (2008)
4. Cer, D.M., De Marneffe, M.-C., Jurafsky, D., Manning, C.D.: Parsing to stanford
dependencies: trade-offs between speed and accuracy. In: LREC (2010)
5. Chang, A.X., Manning, C.D.: Sutime: a library for recognizing and normalizing
time expressions. In: Language Resources and Evaluation (2012)
6. Christensen, J., Mausam, S.S., Etzioni, O.: Towards coherent multi-document sum-
marization. In: Proceedings of NAACL-HLT, pp. 1163–1173 (2013)
7. Delort, J.-Y., Alfonseca, E.: Dualsum: a topic-model based approach for update
summarization. In: Proceedings of the 13th Conference of the European Chapter
of the Association for Computational Linguistics, pp. 214–223 (2012)
8. Denis, P., Muller, P.: Predicting globally-coherent temporal structures from texts
via endpoint inference and graph decomposition. In: Proceedings of the Twenty-
Second International Joint Conference on Artificial Intelligence, vol. 3, pp. 1788–
1793. AAAI Press (2011)
9. Du, P., Guo, J., Zhang, J., Cheng, X.: Manifold ranking with sink points for update summarization. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 1757–1760 (2010)
10. Erkan, G., Radev, D.R.: Lexrank: graph-based lexical centrality as salience in text
summarization. J. Artif. Intell. Res. (JAIR) 22(1), 457–479 (2004)
11. Filatova, E., Hatzivassiloglou, V.: Event-based extractive summarization. In: Pro-
ceedings of ACL Workshop on Summarization, vol. 111 (2004)
12. Fisher, S., Roark, B.: Query-focused supervised sentence ranking for update sum-
maries. In: Proceeding of TAC 2008 (2008)
13. Gillick, D., Favre, B., Hakkani-Tur, D., Bohnet, B., Liu, Y., Xie, S.: The icsi/utd
summarization system at tac. In: Proceedings of the Second Text Analysis Con-
ference, Gaithersburg, Maryland, USA. NIST (2009)
14. Haghighi, A., Vanderwende, L.: Exploring content models for multi-document sum-
marization. In: Proceedings of Human Language Technologies: The Annual Con-
ference of the North American Chapter of the Association for Computational Lin-
guistics, pp. 362–370. Association for Computational Linguistics (2009)
15. Kullback, S.: The kullback-leibler distance (1987)
16. Li, J., Li, S., Wang, X., Tian, Y., Chang, B.: Update summarization using a multi-
level hierarchical dirichlet process model. In: COLING (2012)
17. Li, L., Heng, W., Jia, Y., Liu, Y., Wan, S.: Cist system report for acl multiling
2013-track 1: multilingual multi-document summarization. In: MultiLing 2013, p.
39 (2013)
18. Li, P., Wang, Y., Gao, W., Jiang, J.: Generating aspect-oriented multi-document
summarization with event-aspect model. In: Proceedings of the Conference on
Empirical Methods in Natural Language Processing, pp. 1137–1146 (2011)
19. Li, W., Mingli, W., Qin, L., Wei, X., Yuan, C.: Extractive summarization using
inter-and intra-event relevance. In: Proceedings of the 21st International Confer-
ence on Computational Linguistics and the 44th Annual Meeting of the Association
for Computational Linguistics, pp. 369–376. Association for Computational Lin-
guistics (2006)
20. Li, X., Du, L., Shen, Y.-D.: Graph-based marginal ranking for update summarization. In: SDM, pp. 486–497. SIAM (2011)
21. Li, X., Du, L., Shen, Y.-D.: Update summarization via graph-based sentence ranking. IEEE Trans. Knowl. Data Eng. 25(5), 1162–1174 (2013)
22. Lin, C.-Y.: Rouge: a package for automatic evaluation of summaries. In: Text
Summarization Branches Out: Proceedings of the ACL-2004 Workshop, pp. 74–81
(2004)
23. Lin, H., Bilmes, J.: A class of submodular functions for document summarization.
In: ACL, pp. 510–520 (2011)
24. Mani, I.: Automatic Summarization, vol. 3. John Benjamins Publishing, Amster-
dam (2001)
25. Mani, I., Schiffman, B., Zhang, J.: Inferring temporal ordering of events in news.
In: Proceedings of the Conference of the North American Chapter of the Associa-
tion for Computational Linguistics on Human Language Technology: Companion
Volume of the Proceedings of HLT-NAACL 2003-Short Papers, vol. 2, pp. 55–57.
Association for Computational Linguistics (2003)
26. Mihalcea, R., Tarau, P.: Textrank: Bringing order into texts. In: Proceedings of
EMNLP, vol. 4, p. 275, Barcelona, Spain (2004)
27. Ng, J.-P., Kan, M.-Y.: Improved temporal relation classification using dependency
parses and selective crowdsourced annotations. In: COLING, pp. 2109–2124 (2012)
28. Ng, J.-P., Kan, M.-Y., Lin, Z., Feng, W., Chen, B., Jian, S., Tan, C.L.: Exploiting
discourse analysis for article-wide temporal classification. In: EMNLP, pp. 12–23
(2013)
29. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
Bringing order to the web (1999)
30. Porter, M.F.: An algorithm for suffix stripping. Program Electron. Libr. Inf. Syst. 14(3), 130–137 (1980)
31. Pustejovsky, J., Castano, J.M., Ingria, R., Sauri, R., Gaizauskas, R.J., Setzer,
A., Katz, G., Radev, D.R.: Timeml: robust specification of event and temporal
expressions in text. In: New Directions in Question Answering, vol. 3, pp. 28–34
(2003)
32. Steinberger, J., Ježek, K.: Update summarization based on novel topic distribution.
In: Proceedings of the 9th ACM symposium on Document Engineering, pp. 205–
213 (2009)
33. Steinberger, J., Kabadjov, M., Steinberger, R., Tanev, H., Turchi, M., Zavarella, V.: JRC's participation at TAC: guided and multilingual summarization tasks. In: Proceedings of the Text Analysis Conference (TAC) (2011)
34. Takamura, H., Okumura, M.: Text summarization model based on maximum cover-
age problem and its variant. In: Proceedings of the 12th Conference of the European
Chapter of the Association for Computational Linguistics, pp. 781–789. Associa-
tion for Computational Linguistics (2009)
35. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical dirichlet processes.
J. Am. Stat. Assoc. 101, 1566–1581 (2004)
36. Li, W., Wei, F., Lu, Q., He, Y.: PNR2: ranking sentences with positive and negative reinforcement for query-oriented update summarization. In: Proceedings of the 22nd International Conference on Computational Linguistics, pp. 489–496 (2008)
37. Zhang, R., Li, W., Qin, L.: Sentence ordering with event-enriched semantics and
two-layered clustering for multi-document news summarization. In: Proceedings of
the 23rd International Conference on Computational Linguistics, pp. 1489–1497
(2010)
Tweet Stream Summarization for Online
Reputation Management
1 Introduction
Since the advent of Social Media, an essential part of Public Relations (for organizations and individuals) is Online Reputation Management, which consists of actively listening to online media, monitoring what is being said about an entity, and deciding how to act upon it in order to preserve or improve the public reputation of the entity. Monitoring the massive stream of online content is the first task of online reputation experts. Given a client (e.g. a company), the expert must provide frequent (e.g. daily) reports summarizing which issues people are discussing that involve the company.
In a typical workflow, the reputation experts start with a set of queries that try to cover all possible ways of referring to the client. Then they take the result set and filter out irrelevant content (e.g., texts about apple pies when looking for the Apple company). Next, they determine the different issues people are discussing, evaluate their priority, and produce a report for the client.
Crucially, the report must include any issue that may affect the reputation of
the client (reputation alerts) so that actions can be taken upon it. The summary,
therefore, is guided by the relative priority of issues. This notion of priority
differs from the signals that are usually considered in summarization algorithms,
and it depends on many factors, including popularity (How many people are
commenting on the issue?), polarity for reputation (Does it have positive or
negative implications for the client?), novelty (Is it a new issue?), authority (Are
opinion makers engaged in the conversation?), centrality (Is the client central
to the conversation?), etc. This complex notion of priority makes the task of
producing reputation-oriented summaries a challenging and practical scenario.
In this context, we investigate two main research questions:
RQ1. Given the peculiarities of the task, what is the most appro-
priate evaluation methodology?
Our research is triggered by the availability of the RepLab dataset [1], which
contains annotations made by reputation experts on tweet streams for 61 entities,
including entity name disambiguation, topic detection and topic priority.
We will discuss two types of evaluation methodologies, and in both cases we
will adapt the RepLab dataset accordingly. The first methodology sticks to the
traditional summarization scenario, under the hypothesis that RepLab annota-
tions can be used to infer automatically entity-oriented summaries of near-manual
quality. The second evaluation methodology models the task as producing a rank-
ing of tweets that maximizes both coverage of topics and priority. This provides an
analogy with the problem of search with diversity, where the search system must
produce a rank that maximizes both relevance and coverage.
RQ2. What is the relationship between centrality and priority?
The most distinctive feature of reputation reports is that issues related with
the entity are classified according to their priority from the perspective of repu-
tation handling (the highest priority being a reputation alert). We want to inves-
tigate how the notion of priority translates to the task of producing extractive
summaries, and how important it is to consider reputational signals of priority
when building an appropriate summary.
We will start by discussing how to turn the RepLab setting and datasets into
a test collection for entity-oriented tweet stream summarization. Then we will
introduce our experimental setting to compare priority signals with text quality
signals and assess our evaluation methodology, discuss the results, link our study
with related work, and finish with the main conclusions learned.
– Starts with a set of queries that cover all possible ways of referring to the client.
– Takes the result set and filters out irrelevant content.
– Groups tweets according to the different issues (topics) people are discussing.
The reputation report must include any issue that may affect the reputation
of the client (reputation alerts) so that action can be taken upon it. This (extrac-
tive) summary, therefore, is guided by the relative priority of issues. However,
as we pointed out in the introduction, this notion of priority differs from the
signals that are usually considered in summarization algorithms, and it depends
on many factors, including: popularity, polarity for reputation, novelty, authority, and centrality. Thus, the task is novel and attractive from the perspective of summarization, because the notion of which information nuggets are relevant is more focused and precisely defined than in other summarization tasks. Also, it explicitly connects the summarization problem with other Natural Language Processing tasks: there is a filtering component (because it is entity-oriented), a social media component (because, in principle, non-textual Twitter signals may help in discovering priority issues), a semantic understanding component (to
establish, for instance, polarity for reputation), etc.
The RepLab 2013 task is defined as (multilingual) topic detection combined with
priority ranking of the topics. Manual annotations are provided for the following
subtasks:
– Filtering. Systems are asked to determine which tweets are related to the
entity and which are not. Manual annotations are provided with two possible
values: related/unrelated. For our summarization task, we will use as input
only those tweets that are manually annotated as related to the entity.
– Polarity for Reputation Classification. The goal is to decide if the tweet con-
tent has positive or negative implications for the company’s reputation. Man-
ual annotations are: positive/negative/neutral.
– Topic Detection: Systems are asked to cluster related tweets about the entity
by topic with the objective of grouping together tweets referring to the same
subject/event/conversation.
– Priority Assignment. It involves detecting the relative priority of topics. Man-
ual annotations have three possible values: Alert, mildly important, unimpor-
tant.
RepLab 2013 uses Twitter data in English and Spanish. The collection com-
prises tweets about 61 entities from four domains: automotive, banking, uni-
versities and music. We will restrict our study to the automotive and banking
domains, because they consist of large companies which are the standard subject
experiments, we generate 1,000 model summaries for every entity using this
model. Note that the excess of simplification in our assumptions pays off, as
we are able to generate a large number of model summaries with the manual
annotations provided by the RepLab dataset.
Once we have created the models (1,000 per test case), automatic summaries
can be evaluated using standard text similarity measures. In our experiments
we use ROUGE [2], a set of evaluation metrics for summarization which mea-
sure the content overlap between a peer and one or more reference summaries.
The most popular variant is ROUGE-2, due to its high correlation with human
judges. ROUGE-2 counts the number of bigrams that are shared by the peer
and reference summaries and computes a recall-related measure [2].
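As a rough illustration of what ROUGE-2 measures (not the official ROUGE toolkit, which additionally supports stemming, stop-word removal, and jackknifing over multiple references), the bigram-recall core can be sketched as:

```python
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2_recall(peer, reference):
    """Clipped count of reference bigrams also found in the peer summary,
    divided by the total number of reference bigrams (recall orientation)."""
    peer_bg, ref_bg = bigrams(peer), bigrams(reference)
    overlap = sum(min(count, peer_bg[bg]) for bg, count in ref_bg.items())
    total = sum(ref_bg.values())
    return overlap / total if total else 0.0
```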
Our second approach to evaluating summaries does not require model summaries. It reads the summary as a ranked list of tweets and evaluates the ranking with respect to relevance and redundancy as measured against the annotated topics in the RepLab dataset. The idea is to make an analogy between the task
of producing a summary and the task of document retrieval with diversity. In
this task, the retrieval system must provide a ranked list of documents that
maximizes both relevance (documents are relevant to the query) and diversity
(documents reflect the different query intents, when the query is ambiguous, or
the different facets in the results when the query is not ambiguous).
Producing an extractive summary is, in fact, a similar task: the set of selected
sentences should maximize relevance (they convey essential information from the
documents) and diversity (sentences should minimize redundancy and maximize
coverage of the different information nuggets in the documents). The case of
reputation reports using Twitter as a source is even more clear, as relevance is
modeled by the priority of each of the topics. An optimal report should maximize
the priority of the information conveyed and the coverage of priority entity-
related topics (which, in turn, minimizes redundancy).
Let’s think of the following user model for tweet summaries: the user starts
reading the summary from the first tweet. At each step, the user goes on to the
next tweet or stops reading the summary, either because she is satisfied with
the knowledge acquired so far, or because she does not expect the summary
to provide further useful information. User satisfaction can be modeled via two
variables: (i) the probability of going ahead with the next tweet in the summary;
(ii) the amount of information gained with every tweet. The amount of informa-
tion provided by a tweet depends on the tweets that precede it in the summary:
a tweet from a topic that has already appeared in the summary contributes less
than a tweet from a topic that has not yet been covered by the preceding tweets.
To compute the expected user satisfaction, the evaluation metric must also take
into account that tweets deeper in the summary (i.e. in the rank) are less likely
3 Experimental Design
Our first research question (how to evaluate the task) is partially answered in the
previous section. We now want to compare how the two alternative evaluation
metrics behave, and we want to investigate the second research question: what is
the relationship between centrality and priority, and how priority signals can be
used to enhance summaries. For this purpose, we will compare three approaches
(two baselines and one contrastive system):
LexRank. As a standard summarization baseline, we use LexRank [5], one of
the best-known graph-based methods for multi-document summarization based
on lexical centrality. LexRank is executed through the MEAD summarizer [6]
(http://www.summarization.com/mead/) using these parameters: -extract -s
-p 10 -fcp delete. We build summaries at 5, 10, 20 and 30 % compression rate,
for LexRank and also for the other approaches.
Followers. As a priority baseline, we simply rank the tweets by the number of
followers of the tweet author, and then apply a technique to remove redundancy.
The number of followers is a basic indication of priority: things being said by
people with more followers are more likely to spread over the social networks.
Redundancy is avoided using an iterative algorithm: a tweet from the ranking is included in the summary only if its vocabulary overlap, in terms of the Jaccard measure, with every tweet already included in the summary is less than 0.02. Once the process is finished, if the resulting compression rate is higher than desired, discarded tweets are reconsidered and included by recursively increasing the threshold in steps of 0.02 similarity points until the desired compression rate is reached.
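A sketch of this iterative procedure, assuming the tweets arrive already ranked by the author's follower count and that the desired compression rate is expressed as a target number of tweets (both assumptions of this sketch):

```python
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def followers_summary(ranked_tweets, target_size, step=0.02):
    """Greedily add tweets whose Jaccard overlap with every tweet already in
    the summary stays below the threshold; if the summary is still too small,
    reconsider discarded tweets with the threshold relaxed by `step`."""
    threshold, summary = step, []
    while len(summary) < target_size and threshold <= 1.0:
        for tweet in ranked_tweets:
            if tweet in summary:
                continue
            if all(jaccard(tweet, t) < threshold for t in summary):
                summary.append(tweet)
                if len(summary) >= target_size:
                    break
        threshold += step   # relax the threshold for the next pass
    return summary
```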
Signal Voting. Our contrastive system considers a number of signals of priority
and content quality. Each signal (computed using the training set) provides a
ranking of all tweets for a given test case (an entity). We follow this procedure:
– Using the training part of the RepLab dataset, we compute two estimations
of the quality of each signal: the ratio between average values within priority
values (if priority tweets receive higher values than unimportant tweets, the signal is useful), and the Pearson correlation between the signal values and the manual priority values. The signals (which are self-descriptive) and the indicators are displayed in Fig. 2: (a) the ratio between average values for priority vs. unimportant topics; (b) the Pearson correlation between signal values and manual priority.
– We retain those signals with a Pearson correlation above 0.02 and with a ratio
of averages above 10 %. The resulting set of signals is: URLS count (number
of URLs in the tweet), 24h similar tweets (number of similar tweets pro-
duced in a time span of 24 hours), Author num followers (number of fol-
lowers of the author), Author num followees (number of people followed by
the author), neg words (number of words with negative sentiment), Num pos
emoticons (number of emoticons associated with a positive sentiment), and
Mentions count (number of Twitter users mentioned).
– Each of the selected signals produces a ranking of tweets. We combine them into a final ranking using Borda count [7], a standard voting scheme for combining rankings (see the sketch after this list).
– We remove redundancy with the same iterative procedure used in the Follow-
ers baseline.
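A minimal sketch of the Borda-count combination used in the signal-voting step above (ties and tweets missing from a ranking are ignored for simplicity):

```python
def borda_combine(rankings):
    """Each ranking awards (n - position) points to a tweet; tweets are then
    ordered by their total points across all signal rankings."""
    points = {}
    for ranking in rankings:
        n = len(ranking)
        for position, tweet in enumerate(ranking):
            points[tweet] = points.get(tweet, 0) + (n - position)
    return sorted(points, key=points.get, reverse=True)

# Example with three signals ranking the same four tweets:
combined = borda_combine([["t1", "t2", "t3", "t4"],
                          ["t2", "t1", "t4", "t3"],
                          ["t1", "t3", "t2", "t4"]])
# combined[0] == "t1"
```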
rate where the difference between signal voting and LexRank is not significant
(p = 0.08). Remarkably, at 20 % and 30 % compression rates even the Followers
baseline – which uses very little information and is completely unsupervised –
outperforms the LexRank baseline. Altogether, these are clear indicators that
priority signals play a major role for the task.
In terms of recall of relevant topics, the figure shows that Signal voting > Fol-
lowers > LexRank at all compression ratios. In terms of RBP-SUM, results are
similar. With both relevance scoring functions, signal voting outperforms the two
baselines at all compression rates, and all differences are statistically significant.
The only difference is that this evaluation methodology, which penalizes redun-
dancy more heavily (tweets from the same topic receive an explicit penalty),
gives the followers baseline a higher score than LexRank at all compression lev-
els (with both relevance scoring functions).
Relative differences are rather stable between both p values and between
both relevance scoring functions. Naturally, absolute values are lower for RBP-
SUM-B, as the scoring function is stricter. Although experimentation with users
would be needed to appropriately set the most adequate p value and relevance
scoring schema, the measure differences seem to be rather stable with respect to
both choices.
5 Related Work
5.1 Centrality Versus Priority-Based Summarization
Centrality has been one of the most widely used criteria for content selection
[8]. Centrality refers to the idea of how much a fragment of text (usually a sen-
tence) covers the main topic of the input text (a document or set of documents).
However, the information need of users frequently goes far beyond centrality
and should take into account other selection criteria such as diversity, novelty
and priority. Although the importance of enhancing diversity and novelty in
various NLP tasks has been widely studied [9,10], reputational priority is a
domain-dependent concept that has not been considered before. Other priority
criteria have been previously considered in some areas: In [11], concepts related
to treatments and disorders are given higher importance than other clinical con-
cepts when producing automatic summaries of MEDLINE citations. In opinion
summarization, positive and negative statements are given priority over neutral
ones. Moreover, different aspects of the product/service (e.g., technical perfor-
mance, customer service, etc.) are ranked according to their importance to the
user [12]. Priority is also tackled in query (or topic)-driven summarization where
terms from the user query are given more weight under the assumption that they
reflect the user's relevance criteria [13].
6 Conclusions
We have introduced the problem of generating reputation reports as a variant of
summarization that is both practical and challenging from a research perspective,
as the notion of reputational priority is different from the traditional notion of
importance or centrality. We have presented two alternative evaluation method-
ologies that rely on the manual annotation of topics and their priority. While
the first evaluation methodology maps such annotations into summaries (and
then evaluates with standard summarization measures), the second methodol-
ogy establishes an analogy with the problem of search with diversity, and adapts
an IR evaluation metric to the task (RBP-SUM).
Given the high correlation between Rouge and RBP-SUM values, we advo-
cate the use of the latter to evaluate reputation reports. There are two main
reasons: first, it avoids the need to explicitly create reference summaries, which
is a costly process (or suboptimal if, as in our case, they are generated automat-
ically from topic/priority annotations); the annotation of topics and priorities is
sufficient. Second, it allows an explicit modeling of the patience of the user when
reading the summary, and of the relative contribution of information nuggets
depending on where in the summary they appear and their degree of redun-
dancy with respect to already seen text.
As for our second research question, our experiments indicate that priority sig-
nals play a relevant role to create high-quality reputation reports. A straightfor-
ward voting combination of the rankings produced by useful signals consistently
outperforms a standard summarization baseline (LexRank) at all compression
rates and with all the evaluation metrics considered. In fact, the ranking produced
by just one signal (number of followers) also may outperform LexRank, indicating
that standard summarization methods are not competitive.
In future work we will consider including graded relevance with respect to
priority levels in the data. In our setting, we have avoided such graded relevance
to avoid bias in favor of priority-based methods, but RBP-SUM directly admits
a more sophisticated weighting scheme via ri .
References
1. Amigó, E., Carrillo-de-Albornoz, J., Chugur, I., Corujo, A., Gonzalo, J., Martı́n,
T., Meij, E., de Rijke, M., Spina, D.: Overview of RepLab 2013: Evaluating online
reputation monitoring systems. In: Forner, P., Müller, H., Paredes, R., Rosso, P.,
Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 333–352. Springer, Heidelberg
(2013)
2. Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Proceed-
ings of the ACL Workshop on Text Summarization Branches Out, pp. 74–81 (2004)
3. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effective-
ness. ACM Trans. Inf. Syst. (TOIS) 27(1), 2 (2008)
4. Amigó, E., Gonzalo, J., Verdejo, F.: A general evaluation measure for document
organization tasks. In: Proceedings of ACM SIGIR, pp. 643–652. ACM (2013)
5. Erkan, G., Radev, D.R.: Lexrank: Graph-based lexical centrality as salience in text
summarization. J. Artif. Int. Res. 22(1), 457–479 (2004)
6. Radev, D., Allison, T., Blair-Goldensohn, S., Blitzer, J., Çelebi, A., Dimitrov, S.,
Drabek, E., Hakim, A., Lam, W., Liu, D., Otterbacher, J., Qi, H., Saggion, H.,
Teufel, S., Topper, M., Winkel, A., Zhang, Z.: MEAD – A platform for multidoc-
ument multilingual text summarization. In: Proceedings of LREC (2004)
7. Van Erp, M., Schomaker, L.: Variants of the borda count method for combining
ranked classifier hypotheses. In: Proceedings of Seventh International Workshop
on Frontiers in Handwriting recognition. pp. 443–452 (2000)
8. Cheung, J.C.K., Penn, G.: Towards robust abstractive multi-document summa-
rization: A caseframe analysis of centrality and domain. In: Proceedings of ACL,
Sofia, Bulgaria. pp. 1233–1242 (2013)
9. Mei, Q., Guo, J., Radev, D.: Divrank: The interplay of prestige and diversity in
information networks. In: Proceedings of ACM SIGKDD. pp. 1009–1018 (2010)
10. Clarke, C.L., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A., Büttcher,
S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In:
Proceedings of ACM SIGIR 2008, pp. 659–666 (2008)
11. Fiszman, M., Demner-Fushman, D., Kilicoglu, H., Rindflesch, T.C.: Automatic
summarization of medline citations for evidence-based medical treatment: A topic-
oriented evaluation. J. Biomed. Inform. 42(5), 801–813 (2009)
12. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr.
2(1–2), 1–135 (2008)
13. Nastase, V.: Topic-driven multi-document summarization with encyclopedic knowl-
edge and spreading activation. In: Proceedings of EMNLP, pp. 763–772 (2008)
14. Inouye, D., Kalita, J.: Comparing twitter summarization algorithms for multiple
post summaries. In: Proceedings of the IEEE Third International Conference on
Social Computing, pp. 298–306 (2011)
15. Liu, X., Li, Y., Wei, F., Zhou, M.: Graph-based multi-tweet summarization using
social signals. In: Proceedings of COLING 2012, pp. 1699–1714 (2012)
16. Takamura, H., Yokono, H., Okumura, M.: Summarizing a document stream. In:
Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V.
(eds.) ECIR 2011. LNCS, vol. 6611, pp. 177–188. Springer, Heidelberg (2011)
17. Duan, Y., Chen, Z., Wei, F., Zhou, M., Shum, H.Y.: Twitter topic summarization
by ranking tweets using social influence and content quality. In: Proceedings of
COLING 2012, Mumbai, India, pp. 763–780 (2012)
18. Mihalcea, R., Tarau, P.: Textrank: Bringing order into texts. In: Proceedings of
EMNLP 2004, Barcelona, Spain pp. 404–411 (2004)
19. Sharifi, B., Hutton, M.A., Kalita, J.: Summarizing microblogs automatically. In:
Proceedings of NAACL, pp. 685–688 (2010)
Reproducibility
Who Wrote the Web? Revisiting Influential
Author Identification Research Applicable
to Information Retrieval
1 Introduction
Author identification is concerned with whether and how an author’s identity can
be inferred from their writing by modeling writing style. Author identification
has a long history, the first known approach dating back to the 19th century [27].
Ever since, historians and linguists have tried to settle disputes over the author-
ship of important pieces of writing by manual authorship attribution, employing
basic style markers, such as average sentence length, average word length, or hapax
legomena (i.e., words that occur only once in a given context), to name only a few.
It is estimated that more than 1,000 basic style markers have been proposed [31].
In the past two decades, author identification has become an active field of research
for computer linguists as well, who employ machine learning on top of models that
combine traditional style markers with new ones, the manual computation of which
has been infeasible before. Author identification technology is evolving at a rapid
pace. The field has diversified into many sub-disciplines where correlations of writ-
ing style with author traits are studied, such as age, gender, and other demograph-
ics. Moreover, in an attempt to scale their approaches, researchers apply them on
increasingly large datasets with up to thousands of authors and tens of thousands
of documents. Naturally, some of the document collections used for evaluation
are sampled from the web, carefully ensuring that individual documents can be
attributed with confidence to specific authors.
While applying this technology at web scale is still out of reach, we conjecture
that it is only a matter of time until tailored information retrieval systems will
index authorial style, retrieve answers to writing style-related queries as well
as queries by example, and eventually, shed light on the question: Who wrote
the web? Besides obvious applications in law enforcement and intelligence—
a domain for which little is known about the state of the art of their author
identification efforts—many other stakeholders will attempt to tap authorial
style for purposes of targeted marketing, copyright enforcement, writing support,
establishing trustworthiness, and of course as yet another search relevance signal.
Many of these applications bring about ethical and privacy issues that need to be
reconciled. Meanwhile, authorial style patterns already form a part of every text
on the web that has been genuinely written by a human. At present, however,
the two communities of information retrieval and author identification hardly
intersect, whereas integration of technologies from both fields is necessary to
scale author identification to the web.
The above observations led us to devise and carry out a novel kind of repro-
ducibility study that has an added benefit for both research fields: we team up
with a domain expert and a group of students, identify 15 influential author
identification methods of the past two decades, and have each approach reimple-
mented by the students. By reproducing performance results from the papers’
experiments, we aim at raising confidence that our implementations come close
to those of the papers’ authors. This paper surveys the approaches and reports
on their reproducibility. The resulting source code is shared publicly. We further
conduct comparative experiments among the reimplemented approaches, which
has not been done before. The primary purpose of our reproducibility study is
not to repeat every experiment reported in the selected papers, since it is unlikely
that the most influential research is outright wrong. Rather, our goal is to release
working implementations to both the information retrieval community as well
or Terrier [28] and Blei et al.’s LDA implementation [6], which have spread
throughout information retrieval (IR). Various initiatives in IR have emerged
simultaneously in 2015: the ECIR has introduced a dedicated track for repro-
ducibility [17], a corresponding workshop has been organized at SIGIR [2], and
the various groups that develop Evaluation-as-a-Service platforms for shared
tasks have met for the first time [19].
One of the traditional forms of reproducibility research is the meta study,
where existing research on a specific problem of interest is surveyed and sum-
marized with special emphasis on performance. For example, in information
retrieval, the meta study of Armstrong et al. [3] reveals that the improvements
reported in various papers of the past decade on the ad hoc search task are void,
since they employ too weak baselines. Recently, Tax et al. [40] have conducted a
similar study for 87 learning-to-rank papers, where they summarize for the first
time which of them perform best.
Still, meta studies usually do not include a reimplementation of existing
methods. Reimplementation of existing research has been conducted by Ferro
and Silvello [13] and Hagen et al. [15], the former aiming for exact replicability
and the latter for reproducibility (i.e., obtaining similar results under compara-
ble circumstances). Finally, Di Buccio et al. [11] and Lin [26] both propose the
development of a central repository of baseline IR systems on standard tasks
(e.g., ad hoc search). They observe that even the baselines referred to in most
papers may vary greatly in performance when using different parameterizations,
rendering results incomparable. A parameter model, repositories of runs, and
executable baselines are proposed as a remedy. When open baseline implemen-
tations are available in a given research field such as IR, this is a sensible next
step, whereas in the case of author identification, there are only a few publicly
available baseline implementations to date. We are the first to provide them at
scale.
Publication
[4] [5] [7] [10] [12] [22] [23] [24] [25] [29] [32] [33] [34] [35] [41]
Task cA cA cA cA cA cA cA V oA cA cA cA cA cA cA
Features lex chr lex mix chr chr chr lex chr mix lex syn lex chr chr
Paradigm p i i i i p p i p p i i i p p
Complexity ** * * * *** * ** ** * ** *** ** * * **
Citations 14 377 213 366 41 267 60 75 89 201 17 44 26 43 80
Year 09 02 02 01 11 03 03 07 11 04 12 14 06 07 03
3 Reproducibility Study
Our reproducibility study consists of seven steps: (1) paper selection, (2) stu-
dent recruitment, (3) paper assignment and instruction, (4) implementation and
experimentation, (5) auditing, (6) publication, and (7) post-publication rebuttal.
(1) Paper Selection. Every reproducibility study should supply justification for
its selection of papers to be reproduced. For example, Ferro and Silvello [13]
reproduce a method that has become important for performance measurement
in IR in order to raise confidence in its reliability; Hagen et al. [15] reproduce the
three best-performing approaches in a shared task, since shared task notebooks
are often less well-written than other papers, rendering their reproduction diffi-
cult. Other justifications may include: comparison of a method with one’s own
approach, doubts whether a particular contribution works as advertised, com-
pleting a software library, using an approach as a sub-module to solve a different
task, or identifying the best approach for an application.
4 Reproducibility Report
Each paper was assessed with regard to a number of reproducibility criteria per-
taining to (1) approach clarity, (2) experiment clarity and soundness, (3) dataset
availability or reconstructability, and (4) overall replicability, reproducibility,
simplifiability (e.g., omitting preprocessing steps without harming performance),
and improvability (e.g., with respect to runtime). The assessments result from
presentations given by the students, a questionnaire, and subsequent individual
discussions; Table 2 overviews the results.
(1) Approach Clarity. For none of the approaches was source code (or an executable) available accompanying the papers (row “Code available” of Table 2), so all students had to start from scratch.
Table 2. Assessment of the 15 selected publications ([4], [5], [7], [10], [12], [22], [23], [24], [25], [29], [32], [33], [34], [35], [41]) along the reproducibility criteria: (1) approach clarity (code available, description sound, details sufficient, paper self-contained, preprocessing, parameter settings, library versions, reimplementation language), (2) experiment clarity/soundness (setup clear, exhaustiveness, compared to others, result reproduced), (3) dataset reconstructability/availability (text length, candidate set, origin given, corpora available), (4) overall assessment (replicability, reproducibility, simplifiability, improvability). Per-paper values, where recoverable:
Language: Py Py Py C++ J Py C++ Py Py C# C++ J Py Py Py
Text length (S/M/L = message/article/book size): L L M S M M M M L M L S M M M
Candidate set (S/M/L = below five / below 15 / more authors): M M M S M M L L M M S L M L M
The students chose the programming language they are most familiar with, resulting in nine
Python reimplementations, four reimplementations in a C dialect, and two Java
reimplementations. Keeping in mind that most of the students had not worked
in text processing before, it is a good sign that overall they had no significant
problems with the approach descriptions. Some questions were answered by the
domain expert, while some students also just looked up basic concepts like tok-
enizing or cosine similarity on their own. The students with backgrounds in
math and theory mentioned a lack of formal rigor in the explanations of some
papers (row “Description sound” in Table 2); however, this was mostly a matter of taste and did not affect the understandability of the approaches.
More problematic were two papers for which not even the references contained
sufficient information, so that additional sources had to be retrieved by the stu-
dents to enable them to reimplement the approach. The lack of details on how input should be preprocessed (row “Preprocessing”), on what parameter settings were used (row “Parameter settings”), and missing version numbers of the libraries employed (row “Library versions”) render the replication of seven out of the 15 selected papers' approaches difficult. This had an effect on the perceived approach clarity at an early stage of reimplementation.
(2) Experiment Clarity / Soundness. Since the students were asked to replicate
or at least reproduce one of the experiments of their assigned papers, this gave
us first-hand insights into the clarity of presentation of the experiments as well
as their soundness. The most common problems we found were unclear splits
between training and test data (row “Setup clear”). Another problem was
that rather many approaches are evaluated only against simple baselines or only
in small-scale experiments (rows “Exhaustiveness” and “Compared to others”). To rectify this issue, we conduct our own evaluation of all implemented
approaches on three standard datasets in Sect. 5. Altogether, given the influential
nature of the 15 selected approaches, it was not unexpected that in twelve cases
the students succeeded in reproducing at least one result similar to those reported
in the original papers (see row “Result reproduced”).
(3) Dataset Availability / Reconstructability. We also asked students and our
domain expert to assess the sizes of the originally used datasets. The approaches
have been evaluated using different text lengths (S, M, and L indicate message,
article, and book size in row “Text length”) and different candidate set sizes
(S, M, and L indicate below five, below 15, or more authors in row “Candidate
set”). In eleven cases, the origin of the data was given, whereas in two cases
each, the origin could only be indirectly inferred or remained obscure. Corpora from which the datasets used for evaluation had been derived were available in
four cases, whereas we tried to reconstruct the datasets in cases where sufficient
information was given.
(4) Overall Assessment and Discussion. To complete the picture of our assess-
ment, we have judged the overall replicability, reproducibility, simplifiability, and
improvability of the original papers. Taking into account papers with only par-
tially available information on preprocessing, parameter settings, and libraries
(ten papers) as well as the non-availability of the originally used corpora, none
of the 15 publications’ results are replicable. This renders the question of at
least reproducing the results with a similar approach or using a similar dataset
even more important. To this end, students were instructed to use the latest ver-
sions of the respective libraries with default parameter settings, and if nothing
else helped, apply common sense. Regarding missing information on datasets,
our domain expert suggested substitutions. With these remedies, all but one
approach achieved results comparable to those originally reported (row “Reproducibility”). The three only partially reproducible cases are due to the non-availability of the original data and the use of incomparable substitutions.
Only the reimplementation of the approach of Seroussi et al. [32] has been
unsuccessful to date: it appears to suffer from an imbalanced text length distrib-
ution across candidate authors, resulting in all texts being attributed to authors
with the fewest words among all candidates. This behavior is at odds with the
paper, since Seroussi et al. do not mention any problems in this regard, nor
that the evaluation corpora have been manually balanced. Since the paper is
exceptionally well-written, leaving little to no room for ambiguity, we are unsure
what the problem is and suspect a subtle error in our implementation. However,
despite our best efforts, we have been unable to find this error to date. Perhaps
the post-publication rebuttal phase or future attempts at reproducing Seroussi
et al.’s work will shed light on this issue.3
In four cases, the respective students, while working on the reimplemen-
tations, identified possibilities of simplifying or even improving the original
approaches (rows “Simplifiability” and “Improvability”). A few examples
that concern runtime: when constructing the function word graph of Arun et
al. [4], it suffices to take only the n last function words in a text window into
account, where n < 5, instead of all previous ones. In Benedetto et al.’s app-
roach [5], it suffices to only use the compression dictionary of the profile instead
of recompressing profile and test text every time. In Burrows’ approach [7],
POS-tagging can be omitted, and in the approach of Teahan and Harper [41]
one can refrain from actually compressing texts, but just compute entropy. For
all of these improvements, the attribution performance was not harmed but often
even improved while the runtime was substantially decreased.
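To make the last of these shortcuts concrete, the following sketch (ours, not code from any of the original papers) scores a test text by its cross-entropy under a character n-gram profile instead of invoking a compressor; the order-2 model, the add-one smoothing, and the helper names are illustrative assumptions.

import math
from collections import Counter, defaultdict

def build_profile(text, order=2):
    # Character n-gram profile: context -> counts of the next character.
    counts = defaultdict(Counter)
    padded = " " * order + text
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts

def cross_entropy(text, profile, order=2, alphabet_size=256):
    # Average bits per character of `text` under `profile` (add-one smoothing);
    # a hypothetical stand-in for a PPM-style estimate, not the original method.
    padded = " " * order + text
    bits = 0.0
    for i in range(order, len(padded)):
        ctx = profile.get(padded[i - order:i], Counter())
        p = (ctx[padded[i]] + 1) / (sum(ctx.values()) + alphabet_size)
        bits -= math.log2(p)
    return bits / (len(padded) - order)

profiles = {"author_a": build_profile("the cat sat on the mat"),
            "author_b": build_profile("colourless green ideas sleep furiously")}
test_text = "the cat sat"
print(min(profiles, key=lambda a: cross_entropy(test_text, profiles[a])))

The candidate with the lowest cross-entropy is chosen, which mirrors the smallest-compressed-size criterion without paying for actual compression.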
On the upside, we can confirm that it is possible to reproduce almost all of
the most influential work of a field when employing students to do so. On the
downside, however, new ways of ensuring rigorous explanations of approaches
and experimental setups should be considered.
5 Evaluation
To evaluate the reimplementations under comparable conditions we use the fol-
lowing corpora:
3 See the repository of the reimplementation of Seroussi et al.’s approach to follow up on this.
Table 3. Accuracy (%) of each reimplementation on the three evaluation corpora (BR: best reported result).

Corpus  [4]   [5]   [7]   [10]  [12]  [22]  [23]  [24]  [25]  [29]  [32]  [33]  [34]  [35]  [41]  BR
C10      9.0  72.8  59.8  50.2  75.4  71.0  77.2  22.4  72.0  76.6   –    29.8  73.8  70.8  76.6  86.4
PAN11    0.1  29.6   5.4  13.5  43.1   1.8  32.8  n/a   20.2  46.2   –    n/a    7.6  34.5  65.0  65.8
PAN12   85.7  71.4  92.9  28.6  28.6  71.4  n/a   78.6  78.6  57.1   –    n/a    7.1  85.7  64.3  92.9
– C10. English news from the CCAT topic of the Reuters Corpus Volume 1 for
10 candidate authors (100 texts each). Best results reported by Escalante et al.
[12].
– PAN11. English emails from the Enron corpus for 72 candidate authors with
imbalanced distribution of texts. The corpus was used in the PAN 2011 shared
task [1].
– PAN12. English novels for 14 candidate authors with three texts each. The
corpus was used in the PAN 2012 shared task [21].
Parameters were set as specified in the original papers; where they were not supplied, parameters were optimized based on the training data.
One exception is the approach of Escalante et al. [12] where a linear kernel was
used instead of the diffusion kernel mentioned in that paper, since the latter
could not be reimplemented in time.
Table 3 shows the evaluation results. As can be seen, some approaches are
very effective on long texts (PAN12) but fail on short (C10) or very short texts
(PAN11) [4,7]. Moreover, some approaches are considerably affected by imbal-
anced datasets (PAN11) [22]. It is interesting that in two out of the three corpora
used (PAN12 and PAN11) at least one of the approaches competes with the best
reported results to date. In general, the compression-based models seem to be
more stable across corpora, probably because they have few or no parameters to fine-tune [5,23,29,41]. The best macro-average accuracies on these cor-
pora are obtained by Teahan and Harper [41] and Stamatatos [35]. Both follow
the profile-based paradigm, which seems to be more robust in cases of limited text length or a limited number of texts per author. Moreover, they use character
features which seem to be the most effective ones for this task.
6 Conclusion
To the best of our knowledge, a reproducibility study like ours, with the explicit
goal of sharing working implementations of many important approaches, is
unprecedented in information retrieval and in author identification, if not com-
puter science as a whole. In this regard, we argue that employing students to
systematically reimplement influential research and publish the resulting source
code may prove to be a way of scaling the reproducibility efforts in many branches
of computer science to a point at which a significant portion of research is cov-
ered. Conceivably, this would accelerate progress in the corresponding fields,
since the entire community would have access to the state of the art. For stu-
dents in their late education and early careers, reimplementing a given piece
of influential research, and verifying its correctness by reproducing experimen-
tal results is definitely a worthwhile learning experience. Moreover, reproducing
research from fields related to one’s own may foster collaboration between both
fields involved.
References
1. Argamon, S., Juola, P.: Overview of the international authorship identification
competition at PAN. In: CLEF 2011 Notebooks (2011)
2. Arguello, J., Diaz, F., Lin, J., Trotman, A.: RIGOR @ SIGIR (2015)
3. Armstrong, T.G., Moffat, A., Webber, W., Zobel, J.: Improvements that don’t add
up: ad-hoc retrieval results since 1998. In: CIKM 2009, pp. 601–610 (2009)
4. Arun, R., Suresh, V., Veni Madhavan, C.E.: Stopword graphs and authorship attri-
bution in text corpora. In: ICSC, pp. 192–196 (2009)
5. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev.
Lett. 88, 048702 (2002)
6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
7. Burrows, J.: Delta: a measure of stylistic difference and a guide to likely authorship.
Lit. Ling. Comp. 17(3), 267–287 (2002)
8. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM TIST
2, 27:1–27:27 (2011)
9. Collberg, C., Proebsting, T., Warren, A.M.: Repeatability and benefaction in com-
puter systems research: a study and a modest proposal. TR 14–04, University of
Arizona (2015)
10. de Vel, O., Anderson, A., Corney, M., Mohay, G.: Mining e-mail content for author
identification forensics. SIGMOD Rec. 30(4), 55–64 (2001)
11. Di Buccio, E., Di Nunzio, G.M., Ferro, N., Harman, D., Maistro, M., Silvello, G.:
Unfolding off-the-shelf IR systems for reproducibility. In: RIGOR @ SIGIR (2015)
12. Escalante, H.J., Solorio, T., Montes-y Gómez, M.: Local histograms of character
n-grams for authorship attribution. In: HLT 2011, pp. 288–298 (2011)
13. Ferro, N., Silvello, G.: Rank-biased precision reloaded: reproducibility and general-
ization. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS,
vol. 9022, pp. 768–780. Springer, Heidelberg (2015)
14. Gamon, M.: Linguistic correlates of style: authorship classification with deep lin-
guistic analysis features. In: COLING (2004)
15. Hagen, M., Potthast, M., Büchner, M., Stein, B.: Twitter sentiment detection via
ensemble classification using averaged confidence scores. In: Hanbury, A., Kazai,
G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 741–754. Springer,
Heidelberg (2015)
16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
17. Hanbury, A., Kazai, G., Rauber, A., Fuhr, N.: Proceedings of ECIR (2015)
18. Holmes, D.I.: The evolution of stylometry in humanities scholarship. Lit. Ling.
Comp. 13(3), 111–117 (1998)
19. Hopfgartner, F., Hanbury, A., Müller, H., Kando, N., Mercer, S., Kalpathy-Cramer,
J., Potthast, M., Gollub, T., Krithara, A., Lin, J., Balog, K., Eggel, I.: Report on
the Evaluation-as-a-Service (EaaS) expert workshop. SIGIR Forum 49(1), 57–65
(2015)
20. Juola, P.: Authorship attribution. FnTIR 1, 234–334 (2008)
21. Juola, P.: An overview of the traditional authorship attribution subtask. In: CLEF
Notebooks (2012)
22. Keselj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for
authorship attribution. In: PACLING 2003, pp. 255–264 (2003)
23. Khmelev, D.V., Teahan, W.J.: A repetition based measure for verification of text
collections and for text categorization. In: SIGIR 2003, pp. 104–110 (2003)
24. Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring differentiability: unmasking
pseudonymous authors. J. Mach. Learn. Res. 8, 1261–1276 (2007)
25. Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. LRE
45(1), 83–94 (2011)
26. Lin, J.: The open-source information retrieval reproducibility challenge. In: RIGOR
@ SIGIR (2015)
27. Mendenhall, T.C.: The characteristic curves of composition. Science ns–9(214S),
237–246 (1887)
28. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: a
high performance and scalable information retrieval platform. In: OSIR @ SIGIR
(2006)
29. Peng, F., Schuurmans, D., Wang, S.: Augmenting naive Bayes classifiers with sta-
tistical language models. Inf. Retr. 7(3–4), 317–345 (2004)
30. Rangel, F., Rosso, P., Celli, F., Potthast, M., Stein, B., Daelemans, W.: Overview
of the 3rd author profiling task at PAN. In: CLEF 2015 Notebooks (2015)
31. Rudman, J.: The state of authorship attribution studies: some problems and solu-
tions. Comput. Humanit. 31(4), 351–365 (1997)
32. Seroussi, Y., Bohnert, F., Zukerman, I.: Authorship attribution with author-aware
topic models. In: ACL 2012, pp. 264–269 (2012)
33. Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., Chanona-Hernández, L.:
Syntactic n-grams as machine learning features for natural language processing.
Expert Syst. Appl. 41(3), 853–860 (2014)
34. Stamatatos, E.: Authorship attribution based on feature set subspacing ensembles.
Int. J. Artif. Intell. Tools 15(5), 823–838 (2006)
35. Stamatatos, E.: Author identification using imbalanced and limited training texts.
In: DEXA 2007, pp. 237–241 (2007)
36. Stamatatos, E.: A survey of modern authorship attribution methods. JASIST 60,
538–556 (2009)
37. Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in
terms of genre and author. Comput. Linguist. 26(4), 471–495 (2000)
38. Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P.,
Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification
task at PAN. In: CLEF 2014 Notebooks (2014)
39. Stodden, V.: The scientific method in practice: reproducibility in the computational
sciences. MIT Sloan Research Paper No. 4773–10 (2010)
40. Tax, N., Bockting, S., Hiemstra, D.: A cross-benchmark comparison of 87 learning
to rank methods. IPM 51(6), 757–772 (2015)
41. Teahan, W.J., Harper, D.J.: Using compression-based language models for text cat-
egorization. In: Language Modeling for Information Retrieval, pp. 141–165 (2003)
42. van Halteren, H.: Linguistic profiling for author recognition and verification. In:
ACL 2004, pp. 199–206 (2004)
43. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification
of online messages: writing-style features and classification techniques. JASIST
57(3), 378–393 (2006)
Toward Reproducible Baselines: The
Open-Source IR Reproducibility Challenge
1 Introduction
As an empirical discipline, advances in information retrieval research are built
on experimental validation of algorithms and techniques. Critical to this process
is the notion of a competitive baseline against which proposed contributions are
measured. Thus, it stands to reason that the community should have common,
widely-available, reproducible baselines to facilitate progress in the field. The
Open-Source IR Reproducibility Challenge was designed to address this need.
In typical experimental IR papers, scant attention is usually given to baselines.
Authors might write something like “we used BM25 (or query likelihood) as the
baseline” without further elaboration. This, of course, is woefully under-specified.
For example, Mühleisen et al. [13] reported large differences in effectiveness across
four systems that all purport to implement BM25. Trotman et al. [17] pointed out
that BM25 and query likelihood with Dirichlet smoothing can actually refer to at
least half a dozen different variants; in some cases, differences in effectiveness are
statistically significant. Furthermore, what are the parameter settings (e.g., k1
and b for BM25, and μ for Dirichlet smoothing)?
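For orientation, one common textbook form of BM25 is shown below; as Trotman et al. [17] point out, implementations differ in how they compute idf, whether they include the (k1 + 1) factor, and other details, so this should not be read as the formula used by any particular system in this paper.

\mathrm{BM25}(q,d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

Here f_{t,d} is the term frequency in the document, |d| the document length, and avgdl the average document length; even with this form fixed, results still hinge on the choice of k1 and b.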
Open-source search engines represent a good step toward reproducibility, but
they alone do not solve the problem. Even when the source code is available, there
remain many missing details. What version of the software? What configuration
parameters? Tokenization? Document cleaning and pre-processing? This list goes
on. Glancing through the proceedings of conferences in the field, it is not difficult
to find baselines that purport to implement the same scoring model from the
same system on the same test collection (by the same research group, even), yet
report different results.
Given this state of affairs, how can we trust comparisons to baselines when the
baselines themselves are ill-defined? When evaluating the merits of a particular
contribution, how can we be confident that the baseline is competitive? Perhaps
the effectiveness differences are due to inadvertent configuration errors? This is a
worrisome issue, as Armstrong et al. [1] pointed to weak baselines as one reason
why ad hoc retrieval techniques have not really been improving.
As a standard “sanity check” when presented with a purported baseline,
researchers might compare against previously verified results on the same test
collection (for example, from TREC proceedings). However, this is time consum-
ing and not much help for researchers who are trying to reproduce the result
for their own experiments. The Open-Source IR Reproducibility Challenge aims
to solve both problems by bringing together developers of open-source search
engines to provide reproducible baselines of their systems in a common exe-
cution environment on Amazon’s EC2 to support comparability both in terms
of effectiveness and efficiency. The idea is to gather everything necessary in a
repository, such that with a single script, anyone with a copy of the collection
can reproduce the submitted runs. Two longer-term goals of this project are to
better understand how various aspects of the retrieval pipeline (tokenization,
document processing, stopwords, etc.) impact effectiveness and how different
query evaluation strategies impact efficiency. Our hope is that by observing how
different systems make design and implementation choices, we can arrive at gen-
eralizations about particular classes of techniques.
The Open-Source IR Reproducibility Challenge was organized as part of the
SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability
of Results (RIGOR). We were able to solicit contributions from the developers
of seven open-source search engines and build reproducible baselines for the
Gov2 collection. In this respect, we have achieved modest success. Although this
project is meant as an ongoing exercise and we continue to expand our efforts,
in this paper we share results and lessons learned so far.
2 Methodology
The product of the Open-Source IR Reproducibility Challenge is a repository
that contains everything needed to reproduce competitive baselines on standard
IR test collections.1 As mentioned, the initial phase of our project was organized
as part of a workshop at SIGIR 2015: most of the development took place between
the acceptance of the workshop proposal and the actual workshop. To begin, we
recruited developers of open-source search engines to participate. We emphasize
the selection of developers—either individuals who wrote the systems or were
otherwise involved in their implementation. This establishes credibility for the
quality of the submitted runs. In total, developers from seven open-source sys-
tems participated (in alphabetical order): ATIRE [16], Galago [6], Indri [10,12],
JASS [9], Lucene [2], MG4J [3], and Terrier [14]. In what follows, we refer to the
developer(s) from each system as a separate team.
Once commitments of participation were secured, the group (on a mailing
list) discussed the experimental methodology and converged on a set of design
decisions. First, the test collection: we wished to work with a collection that
was large enough to be interesting, but not so large as to be unwieldy.
The Gov2 collection, with around 25 million documents, seemed appropriate;
for evaluation, we have TREC topics 701–850 from 2004 to 2006 [7].
The second major decision concerned the definition of “baseline”. Naturally,
we would expect different notions by each team, and indeed, in a research paper,
the choice of the baseline would naturally depend on the techniques being stud-
ied. We sidestepped this potentially thorny issue by pushing the decisions onto
the developers. That is, the developers of each system decided what the base-
lines should be, with this guiding question: “If you read a paper that used your
system, what would you like to have seen as the baseline?” This decision allowed
the developers to highlight features of their systems as appropriate. As expected,
everyone produced bag-of-words baselines, but teams also produced baselines
based on term dependence models as well as query expansion.
The third major design decision concerned parameter tuning: proper
parameter settings, of course, are critical to effective retrieval. However, we could
not converge on an approach that was both “fair” to all participants and feasible
in terms of implementation given the workshop deadline. Thus, as a compromise,
we settled on building baselines around the default “out of the box” experience—
that is, what a naı̈ve user would experience downloading the software and using
all the default settings. We realize that in most cases this would yield sub-optimal
effectiveness and efficiency, but at least such a decision treated all systems equi-
tably. This is an issue we will revisit in future work.
The actual experiments proceeded as follows: the organizers of the challenge
started an EC2 instance2 and handed credentials to each team in turn. The EC2
instance was configured with a set of standard packages (the union of the needs
of all the teams), with the Gov2 collection (stored on Amazon EBS) mounted
at a specified location. Each team logged into the instance and implemented
their baselines within a common code repository cloned from GitHub. Everyone
agreed on a directory structure and naming conventions, and checked in their
1 https://github.com/lintool/IR-Reproducibility/.
2 We used the r3.4xlarge instance, with 16 vCPUs and 122 GiB memory, Ubuntu Server 14.04 LTS (HVM).
code when done. The code repository also contains standard evaluation tools
(e.g., trec eval) as well as the test collections (topics and qrels).
The final product for each system was an execution script that reproduced
the baselines from end to end. Each script followed the same basic pattern: it
downloaded the system from a remote location, compiled the code, built one or
more indexes, performed one or more experimental runs, and printed evaluation
results (both effectiveness and efficiency).
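As an illustration of this pattern only, the following Python sketch mirrors the steps each script performs; the URLs, binary names, and file paths are placeholders, not the actual contents of any team's script in the repository.

import subprocess

STEPS = [
    # Placeholder commands illustrating the pattern; the real scripts live in the repository.
    "git clone https://example.org/some-search-engine.git system",              # 1. download the system
    "make -C system",                                                         # 2. compile the code
    "system/bin/index --input /path/to/gov2 --output index/",                 # 3. build the index(es)
    "system/bin/search --index index/ --topics topics.701-850.txt > run.txt", # 4. perform the runs
    "trec_eval qrels.701-850.txt run.txt",                                    # 5. print effectiveness results
]

def main(dry_run=True):
    # Dry run by default: print the commands instead of executing the placeholders.
    for cmd in STEPS:
        print(">>", cmd)
        if not dry_run:
            subprocess.run(cmd, shell=True, check=True)

if __name__ == "__main__":
    main()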
Each team got turns to work with the EC2 instance as described above.
Although everyone used the same execution environment, they did not necessar-
ily interact with the same instance, since we shut down and restarted instances
to match teams’ schedules. There were two main rounds of implementation—all
teams committed initial results and then were given a second chance to improve
their implementations. The discussion of methodology on the mailing list was
interleaved with the implementation efforts, and some of the issues only became
apparent after the teams began working.
Once everyone finished their implementations, we executed all scripts for
each system from scratch on a “clean” virtual machine instance. This reduced,
to the extent practical, the performance variations inherent in virtualized envi-
ronments. Results from this set of experiments were reported at the SIGIR work-
shop. Following the workshop, we gave teams the opportunity to refine their
implementations further and to address issues discovered during discussions at
the workshop and beyond. The set of experiments reported in this paper incor-
porated all these fixes and was performed in December 2015.
3 System Descriptions
The following provides descriptions of each system, listed in alphabetical order.
We adopt the terminology of calling a “count index” one that stores only term
frequency information and a “positions index” one that stores term positions.
ATIRE. ATIRE built two indexes, both stemmed using an s-stripping stemmer;
in both cases, SGML tags were pruned. The postings lists for both indexes were
compressed using variable-byte compression after delta encoding. The first index
is a frequency-ordered count index that stores the term frequency (capped at
255), while the second index is an impact-ordered index that stores pre-computed
quantized BM25 scores at indexing time [8].
For retrieval, ATIRE used a modified version of BM25 [16] (k1 = 0.9 and
b = 0.4). Searching on the quantized index reduces ranking to a series of integer
additions (rather than floating point calculations in the non-quantized index),
which explains the substantial reduction in query latencies we observe.
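For readers unfamiliar with the compression schemes mentioned in these descriptions, the following is a generic sketch of delta (d-gap) encoding followed by variable-byte compression of a postings list; it illustrates the general technique and is not the codec of ATIRE or any other system discussed here.

def vbyte_encode(numbers):
    # Variable-byte encoding: 7 data bits per byte, high bit set on the final byte of each number.
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80                 # mark the least significant (terminating) byte
        out.extend(reversed(chunk))      # emit most significant byte first
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:                  # terminating byte
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers

def compress_postings(docids):
    # Delta-encode ascending docids into gaps, then variable-byte encode the gaps.
    gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    docids, last = [], 0
    for g in vbyte_decode(data):
        last += g
        docids.append(last)
    return docids

postings = [3, 7, 11, 120, 121, 5000]
assert decompress_postings(compress_postings(postings)) == postings

Gaps between ascending docids are small, so most fit into a single byte, which is what makes the combination effective.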
Galago (Version 3.8). Galago built a count index and a positions index, both
stemmed using the Krovetz stemmer and stored in document order. The post-
ings consist of separate segments for documents, counts, and position arrays (if
included), with a separate structure for skips every 500 documents or so. The
indexes use variable-byte compression with delta encoding for ids and positions.
Query evaluation uses the document-at-a-time MaxScore algorithm.
Galago submitted two sets of search results. The first used a query-likelihood
model with Dirichlet smoothing (μ = 3000). The second used a sequential depen-
dence model (SDM) based on Markov Random Fields [11]. The SDM features
included unigrams, bigrams, and unordered windows of size 8.
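In the standard textbook formulation (which may differ in implementation detail from Galago's), query likelihood with Dirichlet smoothing scores a document by

\log p(q \mid d) = \sum_{t \in q} \log \frac{f_{t,d} + \mu\, p(t \mid C)}{|d| + \mu}

where f_{t,d} is the term frequency in the document, p(t | C) the collection language model, and μ (here 3000) controls how strongly short documents are smoothed toward collection statistics.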
Indri (Version 5.9). The Indri index contains both a positions inverted index
and DocumentTerm vectors (i.e., a forward index). Stopwords were removed and
terms were stemmed with the Krovetz stemmer.
Indri submitted two sets of results. The first was a query-likelihood model
with Dirichlet smoothing (μ = 3000). The second used a sequential dependence
model (SDM) based on Markov Random Fields [11]. The SDM features were
unigrams, bigrams, and unordered windows of size 8.
JASS. JASS is a new, lightweight search engine built to explore score-at-a-time
query evaluation on quantized indexes and the notion of “anytime” ranking func-
tions [9]. It does not include an indexer but instead post-processes the quantized
index built from ATIRE. The reported indexing times include both the ATIRE
time to index and the JASS time to derive its index. For retrieval, JASS imple-
ments the same scoring model as ATIRE, but requires an additional parameter
ρ, the number of postings to process. In the first submitted run, ρ was set to one
billion, which equates to exhaustive processing. In the second submitted run,
ρ was set to 2.5 million, corresponding to the “10 % of document collection”
heuristic proposed by the authors [9].
Lucene (Version 5.2.1). Lucene provided both a count and a positions index.
Postings were compressed using variable-byte compression and a variant of delta
encoding; in the positions index, frequency and positions information are stored
separately. Lucene submitted two runs, one over each index; both used BM25,
with the same parameters as in ATIRE (k1 = 0.9 and b = 0.4). The English
Analyzer shipped with Lucene was used with the default settings.
MG4J. MG4J provided an index containing all tokens (defined as maximal
subsequences of alphanumerical characters) in the collection stemmed using the
Porter2 English stemmer. Instead of traditional gap compression, MG4J uses
quasi-succinct indices [18], which provide constant-time skipping and use the least amount of space among the systems examined.
MG4J submitted three runs. The first used BM25 to provide a baseline for
comparison, with k1 = 1.2 and b = 0.3. The second run utilized Model B,
as described by Boldi et al. [4], which still uses BM25, but returns first the
documents containing all query terms, then the documents containing all terms
but one, and so on; quasi-succinct indices can evaluate these types of queries very
quickly. The third run used Model B+, similar to Model B, but using positions
information to generate conjunctive subqueries that are within a window two
times the length of the query.
Terrier (Version 4.0). Terrier built three indexes: the count and positions
indexes both use the single-pass indexer, while the “Count (inc direct)”—which
includes a direct file (i.e., a forward index)—uses a slower classical indexer.
The single-pass indexer builds partial posting lists in memory, which are flushed
to disk when memory is exhausted, and merged to create the final inverted index.
In contrast, the slower classical indexer builds a direct (forward) index based on
the contents of the documents, which is then inverted through multiple passes to
create the inverted index. While slower, the classical indexer has the advantage
of creating a direct index which is useful for generating effective query expan-
sions. All indexes were stemmed using the Porter stemmer and stopped using
a standard stopword list. Both docids and term positions are compressed as gamma-encoded delta-gaps, while term frequencies are stored in unary. All of Terrier’s
indexers are single-threaded.
Terrier submitted four runs. The first was BM25 and used the parameters
k1 = 1.2, k3 = 8, and b = 0.75 as recommended by Robertson [15]. The second
run used the DPH ranking function, which is a hypergeometric parameter-free
model from the Divergence from Randomness family of functions. The query
expansion in the “DPH + Bo1 QE” was performed using the Bo1 divergence
from randomness query expansion model, from which 10 terms were added from
3 pseudo-relevance feedback documents. The final submitted run used positions
information in a divergence from randomness model called pBiL, which utilizes
sequential dependencies.
4 Results
Indexing results are presented in Table 1, which shows indexing time and the size of the generated index (1 GB = 10^9 bytes), as well as a few other statistics:
the number of terms denotes the vocabulary size, the number of postings is
equal to the sum of document frequencies of all terms, and the number of tokens
is the collection length (relevant only for positions indexes). Not surprisingly,
for systems that built both positions and count indexes, the positions index took
longer to construct. We observe a large variability in the time taken for index
construction, some of which can be explained by the use of multiple threads. In
terms of index size, it is unsurprising that the positions indexes are larger than
the count indexes, but even similar types of indexes differed quite a bit in size,
likely due to different tokenization, stemming, stopping, and compression.

[Fig. 1. System effectiveness: box-and-whiskers plot of per-topic MAP for each system/model; only the axis labels (MAP; System / Model) are preserved here.]
Table 2 shows effectiveness results in terms of MAP (at rank 1000). Figure 1
shows the MAP scores for each system on all the topics organized as a box-
and-whiskers plot: each box spans the lower and upper quartiles; the bar in the
middle represents the median and the white diamond represents the mean. The
whiskers extend to 1.5× the inter-quartile range, with values outside of those
plotted as points. The colors indicate the system that produced the run.
We see that all the systems exhibit large variability in effectiveness on a
topic-by-topic basis. To test for statistical significance of the differences, we
used Tukey’s HSD (honest significant difference) test with p < 0.05 across all
150 queries. We found that the “DPH + Bo1 QE” run of Terrier was statistically
significantly better than all other runs, and both Lucene runs were significantly better
than Terrier’s BM25 run. All other differences were not significant. Despite the
results of the significance tests, we nevertheless note that the systems exhibit a
large range in scores, even though from the written descriptions, many of them
purport to implement the same model (e.g., BM25). This is true even in the
case of systems that share a common “lineage”, for example, Indri and Galago.
We believe that these differences can be attributed to relatively uninteresting
differences in document pre-processing, tokenization, stemming, and stopwords.
This further underscores the importance of having reproducible baselines to
control for these effects.
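A test of this kind can be run along the following lines; this is a generic sketch using statsmodels on synthetic per-topic scores, not the authors' evaluation code, and it treats the comparison as a simple one-way layout rather than whatever blocking on topics the original analysis may have used.

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-topic average-precision scores, one entry per (system, topic) pair.
rng = np.random.default_rng(0)
systems = ["ATIRE: BM25", "Lucene: BM25", "Terrier: DPH + Bo1 QE"]
scores, labels = [], []
for name in systems:
    scores.extend(rng.uniform(0.1, 0.6, size=150))   # 150 topics (701-850)
    labels.extend([name] * 150)

# Tukey's honest significant difference test across all runs at p < 0.05.
result = pairwise_tukeyhsd(endog=np.array(scores), groups=np.array(labels), alpha=0.05)
print(result.summary())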
Table 2. Effectiveness in terms of MAP (at rank 1000) per topic set.
System Model Index 701–750 751–800 801–850 All
ATIRE BM25 Count 0.2616 0.3106 0.2978 0.2902
ATIRE Quantized BM25 Count + Quantized 0.2603 0.3108 0.2974 0.2897
Galago QL Count 0.2776 0.2937 0.2845 0.2853
Galago SDM Positions 0.2726 0.2911 0.3161 0.2934
Indri QL Positions 0.2597 0.3179 0.2830 0.2870
Indri SDM Positions 0.2621 0.3086 0.3165 0.2960
JASS 1B Postings Count 0.2603 0.3109 0.2972 0.2897
JASS 2.5M Postings Count 0.2579 0.3053 0.2959 0.2866
Lucene BM25 Count 0.2684 0.3347 0.3050 0.3029
Lucene BM25 Positions 0.2684 0.3347 0.3050 0.3029
MG4J BM25 Count 0.2640 0.3336 0.2999 0.2994
MG4J Model B Count 0.2469 0.3207 0.3003 0.2896
MG4J Model B+ Positions 0.2322 0.3179 0.3257 0.2923
Terrier BM25 Count 0.2432 0.3039 0.2614 0.2697
Terrier DPH Count 0.2768 0.3311 0.2899 0.2994
Terrier DPH + Bo1 QE Count (inc direct) 0.3037 0.3742 0.3480 0.3422
Terrier DPH + Prox SD Positions 0.2750 0.3297 0.2897 0.2983
Efficiency results are shown in Table 3: we report mean query latency (over
three trials). These results represent query execution on a single thread, with
timing code contributed by each team. Thus, these figures should be taken with
the caveat that not all systems may be measuring exactly the same thing, espe-
cially with respect to overhead that is not strictly part of query evaluation (for
example, the time to write results to disk). Nevertheless, to our knowledge this
is the first large-scale efficiency evaluation of open-source search engines. Previ-
ously, studies have typically considered only a couple of systems, and different experi-
mental results are difficult to compare due to underlying hardware differences. In
our case, a common platform moves us closer towards fair efficiency evaluations
across many systems.
Figure 2 shows query evaluation latency in a box-and-whiskers plot, with the
same organization as Fig. 1 (note the y axis is in log scale). We observe a large
variation in latency: for instance, the fastest systems (JASS and MG4J) achieved
a mean latency below 50 ms, while the slowest system (Indri’s SDM model) takes
substantially longer. It is interesting to note that we observe different amounts
of per-topic variability in efficiency. For example, the fastest run (JASS 2.5M
Postings) is faster than the second fastest (MG4J Model B) in terms of mean
latency, but MG4J is actually faster if we consider the median—the latter is
hampered by a number of outlier slow queries.
Table 3. Mean query latency (over three trials) per topic set.
System Model Index 701–750 751–800 801–850 All
ATIRE BM25 Count 132 ms 175 ms 131 ms 146 ms
ATIRE Quantized BM25 Count + Quantized 91 ms 93 ms 85 ms 89 ms
Galago QL Count 773 ms 807 ms 651 ms 743 ms
Galago SDM Positions 4134 ms 5989 ms 4094 ms 4736 ms
Indri QL Positions 1252 ms 1516 ms 1163 ms 1310 ms
Indri SDM Positions 7631 ms 13077 ms 6712 ms 9140 ms
JASS 1B Postings Count 53 ms 54 ms 48 ms 51 ms
JASS 2.5M Postings Count 30 ms 28 ms 28 ms 28 ms
Lucene BM25 Count 120 ms 107 ms 125 ms 118 ms
Lucene BM25 Positions 121 ms 109 ms 127 ms 119 ms
MG4J BM25 Count 348 ms 245 ms 266 ms 287 ms
MG4J Model B Count 39 ms 48 ms 36 ms 41 ms
MG4J Model B+ Positions 91 ms 92 ms 75 ms 86 ms
Terrier BM25 Count 363 ms 287 ms 306 ms 319 ms
Terrier DPH Count 627 ms 421 ms 416 ms 488 ms
Terrier DPH + Bo1 QE Count (inc. direct) 1845 ms 1422 ms 1474 ms 1580 ms
Terrier DPH + Prox SD Positions 1434 ms 1034 ms 1039 ms 1169 ms
Fig. 2. Box-and-whiskers plot for query latency (all queries); diamonds are means.
5 Lessons Learned
Overall, we believe that the Open-Source IR Reproducibility Challenge achieved
modest success, having accomplished our main goals for the Gov2 test collection.
In this section, we share some of the lessons learned.
This exercise was much more involved than it might appear, and the level of collective effort required was far greater than originally expected. We were
relying on the volunteer efforts of many teams around the world, which meant
that coordinating schedules was difficult to begin with. Nevertheless, the imple-
mentations generally took longer than expected. To facilitate scheduling, the
organizers asked the teams to estimate how long it would take to build their
implementations at the beginning. Invariably, the efforts took more time than
the original estimates. This was somewhat surprising because Gov2 is a standard
test collection that researchers surely must have worked with before.
The reproducibility efforts proved more difficult than imagined for a number
of reasons. In at least one case, the exercise revealed a hidden dependency—
a pre-processing script that had never been publicly released. In at least two
cases, the exercise exposed bugs in systems that were subsequently fixed. In
multiple cases, the EC2 instance represented a computing environment that
made different assumptions than the machines the teams originally developed
on. It seemed that the reproducibility challenge helped the developers improve
their systems, which was a nice side effect.
[Figure: Effectiveness/Efficiency Tradeoff. Scatter plot of query time (ms, log scale) against MAP, with points for Indri: SDM, Galago: SDM, Terrier: DPH+Bo1 QE, Terrier: DPH+Prox SD, Indri: QL, Galago: QL, Terrier: DPH, Terrier: BM25, and MG4J: BM25.]
6 Ongoing Work
The Open-Source IR Reproducibility Challenge is not intended to be a one-off
exercise but a living code repository that is maintained and kept up to date. The
cost of maintenance should be relatively modest, since we would not expect base-
lines to rapidly evolve. We hope that sufficient critical mass has been achieved
with the current participants to sustain the project. There are a variety of moti-
vations for the teams to remain engaged: developers want to see their systems
“used properly” and are generally curious to see how their implementations
stack up against their peers. Furthermore, as these baselines begin appearing
in research papers, there will be further incentive to keep the code up to date.
However, only time will tell if we succeed in the long term.
There are a number of ongoing efforts in the project, the most obvious of
which is to build reproducible baselines for other test collections—work has
already begun for the ClueWeb collections. We are, of course, always interested
in including new systems into the evaluation mix.
Beyond expanding the scope of present efforts, there are two substantive
(and related) issues we are currently grappling with. The first concerns the
issue of training—from simple parameter tuning (e.g., for BM25) to a com-
plete learning-to-rank setup. In particular, the latter would provide useful base-
lines for researchers pushing the state of the art in retrieval models. We have
not yet converged on a methodology for including “trained” models that is not
overly burdensome for developers. For example, would the developers also need
to include their training code? And would the scripts need to train the models
from scratch? Intuitively, the answer seems to be “yes” to both, but asking devel-
opers to contribute code that accomplishes all of this seems overly demanding.
The issue of model training relates to the second issue, which concerns the
treatment of external resources. Many retrieval models (particularly in the web
context) take advantage of sources such as anchor text, document-level features
such as PageRank, spam score, etc. Some of these (e.g., anchor text) can be
derived from the raw collection, but others incorporate knowledge outside the
collection. How shall we handle such external resources? Since many of them are
quite large, it seems impractical to store them in our repository, but the alternative
of introducing external dependencies increases the chances of errors.
A final direction involves efforts to better understand the factors that impact
retrieval effectiveness. For example, we suspect that a large portion of the effec-
tiveness differences we observe can be attributed to different document pre-
processing regimes and relatively uninteresting differences in tokenization, stem-
ming, and stopwords. We could explore this hypothesis by, for example, using a
Acknowledgments. This work was supported in part by the U.S. National Science
Foundation under IIS-1218043 and by Amazon Web Services. Any opinions, findings,
conclusions, or recommendations expressed are those of the authors and do not neces-
sarily reflect the views of the sponsors.
References
1. Armstrong, T.G., Moffat, A., Webber, W., Zobel, J.: Improvements that don’t add
up: Ad-hoc retrieval results since 1998. In: CIKM, pp. 601–610 (2009)
2. Bialecki, A., Muir, R., Ingersoll, G.: Apache Lucene 4. In: SIGIR 2012 Workshop
on Open Source Information Retrieval (2012)
3. Boldi, P., Vigna, S.: MG4J at TREC 2005. In: TREC (2005)
4. Boldi, P., Vigna, S.: MG4J at TREC 2006. In: TREC (2006)
5. Di Buccio, E., Di Nunzio, G.M., Ferro, N., Harman, D., Maistro, M., Silvello, G.:
Unfolding off-the-shelf IR systems for reproducibility. In: SIGIR 2015 Workshop
on Reproducibility, Inexplicability, and Generalizability of Results (2015)
6. Cartright, M.A., Huston, S., Field, H.: Galago: A modular distributed processing
and retrieval system. In: SIGIR 2012 Workshop on Open Source IR (2012)
7. Clarke, C., Craswell, N., Soboroff, I.: Overview of the TREC 2004 terabyte track.
In: TREC (2004)
8. Crane, M., Trotman, A., O’Keefe, R.: Maintaining discriminatory power in quan-
tized indexes. In: CIKM, pp. 1221–1224 (2013)
9. Lin, J., Trotman, A.: Anytime ranking for impact-ordered indexes. In: ICTIR,
pp. 301–304 (2015)
10. Metzler, D., Croft, W.B.: Combining the language model and inference network
approaches to retrieval. Inf. Process. Manage. 40(5), 735–750 (2004)
11. Metzler, D., Croft, W.B.: A Markov random field model for term dependencies. In:
SIGIR, pp. 472–479 (2005)
12. Metzler, D., Strohman, T., Turtle, H., Croft, W.B.: Indri at TREC 2004: Terabyte
track. In: TREC (2004)
13. Mühleisen, H., Samar, T., Lin, J., de Vries, A.: Old dogs are great at new tricks:
Column stores for IR prototyping. In: SIGIR, pp. 863–866 (2014)
14. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier:
A high performance and scalable information retrieval platform. In: SIGIR 2006
Workshop on Open Source IR (2006)
15. Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M., Gatford, M.: Okapi
at TREC-3. In: TREC (1994)
16. Trotman, A., Jia, X.F., Crane, M.: Towards an efficient and effective search engine.
In: SIGIR 2012 Workshop on Open Source IR (2012)
17. Trotman, A., Puurula, A., Burgess, B.: Improvements to BM25 and language mod-
els examined. In: ADCS, pp. 58–65 (2014)
18. Vigna, S.: Quasi-succinct indices. In: WSDM, pp. 83–92 (2013)
Experiments in Newswire Summarisation
1 Introduction
Text summarisation [15,19] is an information reduction process, where the aim
is to identify the important information within a large document, or set of docu-
ments, and infer an essential subset of the textual content for user consumption.
Examples of text summarisation being applied to assist with users’ information needs include search engine results pages, where snippets of relevant pages
are shown, and online news portals, where extracts of newswire documents are
shown. Indeed, much of the research conducted into text summarisation has
focused on multi-document newswire summarisation. For instance, the input
to a summarisation algorithm being evaluated at the Document Understanding
Conference1 or Text Analysis Conference2 summarisation evaluation campaigns
is often a collection of newswire documents about a news-worthy event. Further,
research activity related to the summarisation of news-worthy events has recently
been conducted under the TREC Temporal Summarisation Track3 . Given the
1 duc.nist.gov.
2 nist.gov/tac.
3 trec-ts.org.
(manual sentence extracts), and tested on DUC 2004 and TAC 2008. For our
experiments, we train a maximum entropy binary classifier,5 with feature values
scaled in the range [−1, 1]. The probability estimates output from the classifier
are used to score the sentences, producing a ranking of sentences that is passed
through an anti-redundancy component for summary sentence selection.
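As a rough stand-in for this setup (the original uses MALLET's MaxEnt classifier, which we do not reproduce here), the sketch below scales features to [−1, 1], trains a logistic-regression classifier (equivalent to a binary maximum entropy model) and ranks sentences by the resulting probability estimates; the two-dimensional features and labels are made up for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def rank_sentences(train_features, train_labels, test_features, test_sentences):
    # Train a binary classifier and rank test sentences by P(sentence belongs in the summary).
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train_features)
    clf = LogisticRegression(max_iter=1000).fit(scaler.transform(train_features), train_labels)
    probs = clf.predict_proba(scaler.transform(test_features))[:, 1]   # probability of label 1
    order = np.argsort(-probs)
    return [(test_sentences[i], float(probs[i])) for i in order]

# Toy example with made-up two-dimensional features (e.g., sentence position and length).
X_train = np.array([[0.0, 30], [1.0, 12], [2.0, 25], [5.0, 8]])
y_train = np.array([1, 0, 1, 0])               # 1 = sentence appears in a reference extract
X_test = np.array([[0.0, 28], [4.0, 10]])
print(rank_sentences(X_train, y_train, X_test, ["First sentence.", "Fifth sentence."]))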
Anti-redundancy Components – Each anti-redundancy component takes as
input a list of sentences, previously ranked by a summarisation scoring function.
The first, highest-scoring, sentence is always selected. Then, iterating down the
list, the next highest-scoring sentence is selected on the condition that it satis-
fies a threshold. We experiment with the following anti-redundancy threshold-
ing components, namely NewWordCount, NewBigrams, and CosineSimilarity.
NewWordCount [1] only selects the next sentence in the list, for inclusion into
the summary text, if that sentence contributes n new words to the summary text
vocabulary. In our experiments, the value of n, the new word count parameter,
ranges from [1, 20], in steps of 1. NewBigrams only selects a sentence if that
sentence contributes n new bi-grams to the summary text vocabulary. In our
experiments, the value of n, the new bi-grams parameter, ranges from [1, 20],
in steps of 1. The CosineSimilarity thresholding component only selects the
next sentence if that sentence is sufficiently dis-similar to all previously selected
sentences. In our experiments, the value of the cosine similarity threshold ranges
from [0, 1] in steps of 0.05. As cosine similarity computations require a vector
representation of the sentences, we experiment with different weighting schemes,
denoted Tf, Hy, Rt, and HyRt. Tf is textbook tf*idf, specifically log(tf) ∗ log(idf),
where tf is the frequency of a term in a sentence, and idf is N/Nt, the number
of sentences divided by the number of sentences containing the term t. Hy is
a tf*idf variant, where the tf component is computed over all sentences com-
bined into a pseudo-document, with idf computed as N/Nt. Rt and HyRt are
tf*idf variants where we do not use log smoothing, i.e. raw tf. The 4 variants of
weighting schemes are also used by Centroid and LexRank, to represent sentences
as weighted vectors.
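The following sketch illustrates the CosineSimilarity thresholding component described above; it is our illustration, using a single rough tf*idf weighting (the idf argument is assumed to map terms to log(N/Nt) values) rather than all four weighting variants.

import math
from collections import Counter

def tfidf_vector(tokens, idf):
    # idf maps term -> log(N / Nt); log(1 + tf) avoids zero weights for singletons
    # (a small deviation from the paper's log(tf) * log(idf) weighting).
    tf = Counter(tokens)
    return {t: math.log(1 + c) * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def select_non_redundant(ranked_sentences, idf, threshold=0.5, max_words=100):
    # Greedy selection down a pre-ranked list: keep a sentence only if its cosine
    # similarity to every previously selected sentence stays below the threshold.
    selected, vectors, words = [], [], 0
    for sentence in ranked_sentences:              # already ordered by the scoring function
        tokens = sentence.lower().split()
        vec = tfidf_vector(tokens, idf)
        if all(cosine(vec, v) < threshold for v in vectors):   # first sentence always passes
            selected.append(sentence)
            vectors.append(vec)
            words += len(tokens)
            if words >= max_words:                 # summaries are truncated to ~100 words
                break
    return selected

NewWordCount and NewBigrams follow the same greedy loop, with the cosine test replaced by a count of unseen words or bi-grams.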
Summarisation Datasets – In our summarisation experiments, we use
newswire documents from the Document Understanding Conference (DUC) and
the Text Analysis Conference (TAC). Each dataset consists of a number of top-
ics, where a topic is a cluster of related newswire documents. Further, each topic
has a set of gold-standard reference summaries, authored by human assessors, to
which system-produced summaries are compared in order to evaluate the effec-
tiveness of various summarisation algorithms. The DUC 2004 Task 2 dataset
has 50 topics of 10 documents per topic, and 4 reference summaries per topic.
The TAC 2008 Update Summarization Task dataset has 48 topics, and also 4
reference summaries per topic. For each topic within the TAC dataset, we use
the 10 newswire articles from document set A, and the 4 reference summaries
5 mallet.cs.umass.edu/api/cc/mallet/classify/MaxEnt.html.
for document set A, ignoring the update summarisation part of the task (set B).
Further, we use the TAC 2008 dataset for generic summarisation (ignoring the
topic statements).
The Stanford CoreNLP toolkit is used to chunk the newswire text into sen-
tences, and tokenise words. Individual tokens are then subjected to the follow-
ing text processing steps: Unicode normalisation (NFD),6 case folding, splitting of compound words, removal of punctuation, Porter stemming, and stopword removal (removing the 50 most common English words).7 When summarising
multiple documents for a topic, we combine all sentences from the input docu-
ments for a given topic into a single virtual document. The sentences from each
document are interleaved one-by-one in docid order, and this virtual document
is given as input to the summarisation algorithms.
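The construction of the virtual document is a simple round-robin interleave; a minimal sketch, assuming each input document is already a list of pre-processed sentences and the documents are given in docid order:

def interleave_sentences(documents):
    # Interleave sentences one-by-one from the documents, producing a single
    # "virtual document" for the topic.
    virtual_document = []
    longest = max(len(doc) for doc in documents)
    for i in range(longest):
        for doc in documents:            # documents are assumed sorted by docid
            if i < len(doc):
                virtual_document.append(doc[i])
    return virtual_document

docs = [["d1 s1", "d1 s2"], ["d2 s1"], ["d3 s1", "d3 s2", "d3 s3"]]
print(interleave_sentences(docs))
# ['d1 s1', 'd2 s1', 'd3 s1', 'd1 s2', 'd3 s2', 'd3 s3']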
Summarisation Evaluation – To evaluate summary texts, we use the
ROUGE [9] evaluation toolkit,8 measuring n-gram overlap between a system-
produced summary and a set of gold-standard reference summaries. Following
best practice [7], the summaries under evaluation are subject to stemming,
stopwords are retained, and we report ROUGE-1, ROUGE-2 and ROUGE-4
recall – measuring uni-gram, bi-gram, and 4-gram overlap respectively – with
results ordered by ROUGE-2 (in bold), the preferred metric due to its reported
agreement with manual evaluation [17]. Further, for all experiments, summary
lengths are truncated to 100 words. The ROUGE parameter settings used
are: “ROUGE-1.5.5.pl -n 4 -x -m -l 100 -p 0.5 -c 95 -r 1000 -f A -t 0”. For
summarisation algorithms with parameters, we learn the parameter settings via
a five-fold cross validation procedure, optimising for the ROUGE-2 metric. Sta-
tistical significance in ROUGE results is reported using the paired Student’s
t-test, 95 % confidence level, as implemented in MATLAB. ROUGE results for
various summarisation systems are obtained using SumRepo [7],9 which provides
the plain-text produced by 5 standard baselines, and 7 state-of-the-art systems,
over DUC 2004. Using this resource, we compute ROUGE results, over DUC 2004
only, for the algorithms available within SumRepo, obtaining reference results
for use in our later experiments.
Fig. 1. The interface for our user study, soliciting summary judgements via Crowd-
Flower.
10 crowdflower.com.
11 www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt.
Table 1. Reference ROUGE results, over DUC 2004, and results from our crowd-
sourced user study validating ROUGE-2 is aligned with user judgements for summary
quality.
selection behaviour of the state-of-the-art systems varies over topics. This would
confirm that the state-of-the-art systems are selecting different content for inclu-
sion into the summary, reproducing and validating the previously published [7]
results.
For our analysis, we examine the ROUGE-2 effectiveness of the state-of-the-
art systems over the 50 topics of DUC 2004 Task 2, using the summary text
from SumRepo. In Fig. 2, we visualise the distribution of ROUGE-2 scores over
topics, for the top 6 state-of-the-art systems, with the topics on the x-axis ordered
by the ROUGE-2 effectiveness of ICSISumm. In Table 2, we then quantify the
ROUGE-2 effectiveness between the top 6 state-of-the-art systems, showing the
Pearson’s linear correlation coefficient of ROUGE-2 scores across the topics.
From Fig. 2, we observe that, for each of the top 6 state-of-the-art systems,
there is variability in ROUGE-2 scores over different topics. Clearly, for some top-
ics, certain systems are more effective, while for other topics, other systems are
more effective. This variability is usually masked behind the ROUGE-2 score,
which provides an aggregated view over all topics. Further, from Table 2 we
observe that the per-topic ROUGE-2 scores of the top 6 state-of-the-art systems
Fig. 2. ROUGE-2 effectiveness profiles, across the 50 topics of DUC 2004, for the top
6 state-of-the-art systems, with the x-axis ordered by the ROUGE-2 effectiveness of
ICSISumm.
Table 2. Pearson’s linear correlation coefficient of ROUGE-2 scores between the top
six state-of-the-art systems, across the 50 topics of DUC 2004.
Table 3. ROUGE scores, over DUC 2004 and TAC 2008, for Random and Lead, the
Lead baseline augmented with different anti-redundancy components, and 5 standard
baselines.
effectiveness scores that exhibit a significant improvement over the Lead (inter-
leaved) baseline, as indicated by the “‡” symbol. In particular, over DUC 2004,
Lead (interleaved) augmented with anti-redundancy filtering results in signif-
icant improvements in ROUGE-1 scores for all anti-redundancy components
investigated, and significant improvements in ROUGE-2 scores using CosineS-
imilarityHyRt and CosineSimilarityHy. However, from Table 3, we observe that
anti-redundancy filtering of Lead (interleaved) is not as effective over TAC 2008,
where only CosineSimilarityHyRt exhibits significantly improved ROUGE-1 and
ROUGE-2 scores. From these observations, we conclude that the optimal Lead
baseline, for multi-document extractive newswire summarisation, can be derived
by augmenting an interleaved Lead baseline with anti-redundancy filtering (such
as cosine similarity).
Finally, from Table 3, we observe the 5 standard baselines, LexRank, Cen-
troid, FreqSum, TsSum, and Greedy–KL, do not exhibit significant differences in
ROUGE-2 scores, over DUC 2004, from CosineSimilarityHy, the most effective
anti-redundancy processed interleaved Lead baseline. Indeed, only Greedy–KL
exhibits a ROUGE-1 score (“✔”) that is significantly more effective than Lead
interleaved with CosineSimilarityHy, and further, LexRank shows a significant
degradation in ROUGE-4 effectiveness (“✗”). From this, we conclude that the
5 standard baselines, over DUC 2004, may be weak baselines to use in future
experiments, with any claimed improvements questionable.
Fig. 3. ROUGE-2 effectiveness profiles, over DUC 2004, for KLDivergence Lead, an
oracle system optimising selection of anti-redundancy components over topics, and the
worst case.
Table 6. Results over DUC 2004 and TAC 2008, showing the best/worst scores pos-
sible when manually selecting the most/least effective anti-redundancy components
per-topic.
2004, the oracle system is more effective under ROUGE-1 and ROUGE-4 than
the most effective anti-redundancy component (shown in bold). Over TAC 2008,
the oracle system is more effective under all ROUGE metrics than the most
effective anti-redundancy component (again, shown in bold). From the results
in Table 6, we conclude that, while we do not propose a solution for how such
an oracle system might be realised in practice, approximations of the oracle sys-
tem can potentially offer statistically significant improvements in summarisation
effectiveness.
7 Conclusions
In this paper, we have reproduced, validated, and generalised findings from the
literature. Additionally, we have reimplemented standard and state-of-the-art
baselines, making further observations from our experiments. In conclusion, we
have confirmed that the ROUGE-2 metric is aligned with crowd-sourced user
judgements for summary quality, and confirmed that several state-of-the-art sys-
tems behave differently, despite similar ROUGE-2 scores. Further, an optimal
Lead baseline can be derived from interleaving the first sentences from mul-
tiple documents, and applying anti-redundancy components. Indeed, an opti-
mal Lead baseline exhibits ROUGE-2 effectiveness with no significant difference
to standard baselines, over DUC 2004. Additionally, the effectiveness of the
standard baselines, as reported in the literature, can be improved to the point
where there is no significant difference to the state-of-the-art (as illustrated using
ICSISumm). Finally, given that an optimal choice of anti-redundancy compo-
nents, per-topic, exhibits significant improvements in summarisation effective-
ness, we conclude that future work should investigate learning algorithm-specific (or topic-specific) anti-redundancy components.
On the Reproducibility of the TAGME Entity
Linking System
1 Introduction
Recognizing and disambiguating entity occurrences in text is a key enabling
component for semantic search [14]. In recent years, various approaches
have been proposed to perform automatic annotation of documents with entities
from a reference knowledge base, a process known as entity linking [7,8,10,12,
15,16]. Of these, TAGME [8] is one of the most popular and influential ones.
TAGME is specifically designed for efficient (“on-the-fly”) annotation of short
texts, like tweets and search queries. The latter task, i.e., annotating search
queries with entities, was evaluated at the recently held Entity Recognition and
Disambiguation Challenge [1], where the first and second ranked systems both
leveraged or extended TAGME [4,6]. Despite the explicit focus on short text,
TAGME has been shown to deliver competitive results on long texts as well [8].
TAGME comes with a web-based interface and a RESTful API is also provided.1
The good empirical performance coupled with the aforementioned convenience
features make TAGME one of the obvious must-have baselines for entity linking
research. The influence and popularity of TAGME are also reflected in citations; the original TAGME paper [8] (from now on simply referred to as the TAGME paper) has been cited around 50 times according to the ACM Digital Library and nearly 200 times according to Google Scholar, at the time of writing. The authors
1 http://tagme.di.unipi.it/.
have also published an extended report [9] (with more algorithmic details and experiments) that has received over 50 citations according to Google Scholar.
Our focus in this paper is on the repeatability, reproducibility, and gener-
alizability of the TAGME system; these are obvious desiderata for reliable and
extensible research. The recent SIGIR 2015 workshop on Reproducibility, Inex-
plicability, and Generalizability of Results (RIGOR)2 defined these properties
as follows:
– Repeatability: “Repeating a previous result under the original conditions (e.g.,
same dataset and system configuration).”
– Reproducibility: “Reproducing a previous result under different, but compa-
rable conditions (e.g., different, but comparable dataset).”
– Generalizability: “Applying an existing, empirically validated technique to a
different IR task/domain than the original.”
2 Overview of TAGME
In this section, we provide an overview of the TAGME approach, as well as the
test collections and evaluation metrics used in the TAGME papers [8,9].
2.1 Approach
TAGME performs entity linking in a pipeline of three steps: (i) parsing, (ii)
disambiguation, and (iii) pruning (see Fig. 1). We note that while Ferragina and
Scaiella [8] describe multiple approaches for the last two steps, we limit ourselves
to their final suggestions; these are also the choices implemented in the TAGME
API.
Before describing the TAGME pipeline, let us define the notation used
throughout this paper. Entity linking is the task of annotating an input text
T with entities E from a reference knowledge base, which is Wikipedia here. T
contains a set of entity mentions M , where each mention m ∈ M can refer to a
set of candidate entities E(m). These need to be disambiguated such that each
mention points to a single entity e(m).
Parsing. In the first step, TAGME parses the input text and performs mention
detection using a dictionary of entity surface forms. For each entry (surface
form) the set of entities recognized by that name is recorded. This dictionary
is built by extracting entity surface forms from four sources: anchor texts of
Wikipedia articles, redirect pages, Wikipedia page titles, and variants of titles
(removing parts after the comma or in parentheses). Surface forms consisting only of numbers or of a single character, or occurring fewer than a minimum number of times (2), are discarded. Further filtering is performed on the surface forms with low
link probability (i.e., < 0.001). Link probability is defined as:
lp(m) = P(link | m) = link(m) / freq(m),    (1)

where freq(m) denotes the total number of times mention m occurs in Wikipedia (as a link or not), and link(m) is the number of times mention m appears as a link.
To detect entity mentions, TAGME matches all n-grams of the input text,
up to n = 6, against the surface form dictionary. For an n-gram contained by
another one, TAGME drops the shorter n-gram, if it has lower link probability
than the longer one. The output of this step is a set of mentions with their
corresponding candidate entities.
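A simplified sketch of this parsing step, assuming the surface-form dictionary has already been built as a mapping from surface form to its link probability and candidate entities; the whitespace tokenisation and the exact overlap handling are simplifications on our part.

```python
def detect_mentions(text, dictionary, max_n=6):
    """dictionary: surface form -> (link_probability, set of candidate entities).
    Returns a dict mention -> candidate entities, dropping an n-gram that is
    contained in a longer n-gram with a higher link probability."""
    tokens = text.lower().split()
    spans = {}  # (start, end) -> surface form found at that span
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in dictionary:
                spans[(i, i + n)] = ngram

    kept = {}
    for (s, e), ngram in spans.items():
        lp, candidates = dictionary[ngram]
        # Drop this n-gram if a longer covering n-gram has a higher link probability
        dominated = any(
            (s2 <= s and e <= e2) and (e2 - s2) > (e - s)
            and dictionary[other][0] > lp
            for (s2, e2), other in spans.items())
        if not dominated:
            kept[ngram] = candidates
    return kept
```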
where vote(m′, e) denotes the agreement between the entities of mention m′ and the entity e, computed as follows:

vote(m′, e) = ( Σ_{e′ ∈ E(m′)} relatedness(e, e′) · commonness(e′, m′) ) / |E(m′)|.    (3)
Commonness is the probability of an entity being the link target of a given mention [13]:

commonness(e′, m′) = P(e′ | m′) = link(e′, m′) / link(m′),    (4)

where link(e′, m′) is the number of times entity e′ is used as a link destination for m′ and link(m′) is the total number of times m′ appears as a link. Relatedness
measures the semantic association between two entities [17]:

relatedness(e, e′) = ( log(max(|in(e)|, |in(e′)|)) − log(|in(e) ∩ in(e′)|) ) / ( log(|E|) − log(min(|in(e)|, |in(e′)|)) ),    (5)

where in(e) is the set of entities linking to entity e and |E| is the total number of entities.
Once all candidate entities are scored using Eq. (2), TAGME selects the
best entity for each mention. Two approaches are suggested for this purpose:
(i) disambiguation by classifier (DC) and (ii) disambiguation by threshold (DT),
of which the latter is selected as the final choice. Due to efficiency concerns,
entities with commonness below a given threshold τ are discarded from the DT computations. The set of commonness-filtered candidate entities for mention m is Eτ(m) = {e ∈ E(m) | commonness(e, m) ≥ τ}. Then, DT considers the top-ε entities for each mention and selects the one with the highest commonness score:
At the end of this stage, each mention in the input text is assigned a single
entity, which is the most pertinent one to the input text.
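A compact sketch of the DT strategy described above. The aggregation of votes over the other mentions, the interpretation of top-ε as a fraction of the ranked candidates, and the parameter values are assumptions for illustration; relatedness and commonness are passed in as functions.

```python
def vote(m_prime, e, candidates, relatedness, commonness):
    """Average agreement (Eq. (3)) between the candidates of mention m_prime and entity e."""
    E_m = candidates[m_prime]
    if not E_m:
        return 0.0
    return sum(relatedness(e, e2) * commonness(e2, m_prime) for e2 in E_m) / len(E_m)

def disambiguate_by_threshold(candidates, relatedness, commonness, tau=0.02, eps=0.3):
    """candidates: mention -> iterable of candidate entities.
    Returns mention -> chosen entity, following the DT strategy sketched above."""
    chosen = {}
    for m, E_m in candidates.items():
        # Keep only entities with commonness above tau
        E_tau = [e for e in E_m if commonness(e, m) >= tau]
        if not E_tau:
            continue
        # Score each candidate by the votes it receives from all other mentions
        scores = {e: sum(vote(m2, e, candidates, relatedness, commonness)
                         for m2 in candidates if m2 != m)
                  for e in E_tau}
        # Consider the top-eps fraction of candidates by score ...
        ranked = sorted(E_tau, key=scores.get, reverse=True)
        top = ranked[:max(1, int(round(eps * len(ranked))))]
        # ... and pick the one with the highest commonness
        chosen[m] = max(top, key=lambda e: commonness(e, m))
    return chosen
```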
Pruning. The aim of the pruning step is to filter out non-meaningful annota-
tions, i.e., assign NIL to the mentions that should not be linked to any entity.
TAGME hinges on two features to perform pruning: link probability (Eq. (1)) and
coherence. The coherence of an entity is computed with respect to the candidate
annotations of all the other mentions in the text:
coherence(e, T) = ( Σ_{e′ ∈ E(T)−{e}} relatedness(e, e′) ) / ( |E(T)| − 1 ),    (7)
where E(T ) is the set of distinct entities assigned to the mentions in the input
text. TAGME takes the average of the link probability and the coherence score
to generate a ρ score for each entity, which is then compared to the pruning
threshold ρNA . Entities with ρ < ρNA are discarded, while the rest of them are
served as the final result.
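A minimal sketch of this pruning step under the description above; link_probability and relatedness are assumed to be supplied as functions.

```python
def prune(annotations, link_probability, relatedness, rho_na=0.2):
    """annotations: mention -> entity chosen by the disambiguation step.
    Keeps an annotation only if the average of its link probability and its
    coherence with the other annotated entities reaches rho_na."""
    entities = set(annotations.values())
    kept = {}
    for m, e in annotations.items():
        others = entities - {e}
        coherence = (sum(relatedness(e, e2) for e2 in others) / len(others)
                     if others else 0.0)
        rho = (link_probability(m) + coherence) / 2.0
        if rho >= rho_na:
            kept[m] = e  # annotations below the threshold are assigned NIL (dropped)
    return kept
```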
Two test collections are used in [8]: Wiki-Disamb30 and Wiki-Annot30. Both
consist of snippets of around 30 words, extracted from a Wikipedia snapshot
of November 2009, and are made publicly available.5 In Wiki-Disamb30, each
snippet is linked to a single entity; in Wiki-Annot30 all entity mentions are
annotated. We note that the sizes of these test collections (number of snippets)
deviate from what is reported in the TAGME paper: Wiki-Disamb30 and Wiki-
Annot30 contain around 2M and 185K snippets, while the reported numbers
are 1.4M and 180K, respectively. This suggests that the published test collections
might be different from the ones used in [8].
the ground truth, while the topics metrics (Ptopics and Rtopics ) only consider
entity matches. The TAGME papers [8,9] provide little information about the
evaluation metrics. In particular, the computation of the standard precision and
recall is rather unclear; we discuss it later in Sect. 4.2. Details are missing regard-
ing the two other metrics too: (i) How are overall precision, recall and F-measure
computed for the annotation metrics? Are they micro- or macro-averaged?
(ii) What are the matching criteria for the annotation metrics? Are partially
matching mentions accepted or only exact matches? In what follows, we formally
define the annotation and topics metrics, based on the most likely interpretation
we established from the TAGME paper and from our experiments.
We write G(T) = {(m̂1, ê1), . . . , (m̂m, êm)} for the ground truth annotations of the input text T, and S(T) = {(m1, e1), . . . , (mn, en)} for the annotations iden-
tified by the system. Neither G(T ) nor S(T ) contains NULL annotations. The
TAGME paper follows [12], which uses macro-averaging in computing annotation
precision and recall:6
Ptopics = ( Σ_{T∈F} |G(T) ∩ S(T)| ) / ( Σ_{T∈F} |S(T)| ),    Rtopics = ( Σ_{T∈F} |G(T) ∩ S(T)| ) / ( Σ_{T∈F} |G(T)| ).    (9)
For all metrics the overall F-measure is computed from the overall precision and
recall.
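Under this micro-averaged interpretation, the metrics can be computed as in the following sketch; exact matching of (mention, entity) pairs is our assumption.

```python
def micro_prf(gold, system):
    """gold, system: dicts mapping text id -> set of (mention, entity) pairs
    (or sets of entities only, for the topics metrics)."""
    tp = sum(len(gold[t] & system.get(t, set())) for t in gold)
    sys_total = sum(len(s) for s in system.values())
    gold_total = sum(len(g) for g in gold.values())
    p = tp / sys_total if sys_total else 0.0
    r = tp / gold_total if gold_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # overall F from overall P and R
    return p, r, f
```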
3 Repeatability
By definition (cf. Sect. 1), repeatability means that a system should be imple-
mented under the same conditions as the reference system. In our case, the
repeatability of the TAGME experiments in [8] is dependent on the availability
of (i) the knowledge base and (ii) the test collections (text snippets and gold
standard annotations).
6 As explained later by the TAGME authors, they in fact used micro-averaging. This contradicts the referred paper [12], which explicitly defines Pann and Rann as being macro-averaged.
4 Reproducibility
This section reports on our attempts to reproduce the results presented in the
TAGME paper [8]. The closest publicly available Wikipedia dump is from April
2010,8 which is about five months newer than the one used in [8]. On a side note, we were (negatively) surprised by how difficult it proved to find Wikipedia snapshots from the past, especially from this period. We have
(re)implemented TAGME based on the description in the TAGME papers [8,9]
and, when in doubt, we checked the source code. For a reference comparison,
we also include the results from (i) the TAGME API and (ii) the Dexter entity
linking framework [3]. Even though the implementation in Dexter (specifically,
the parser) slightly deviates from the original TAGME system, it is still useful
for validation, as that implementation is done by a third (independent) group of
researchers. We do not include results from running the source code provided to
us because it requires the Wikipedia dump in a format that is no longer available
for the 2010 dump we have access to; running it on a newer Wikipedia version
would give results identical to the API. In what follows, we present the challenges
we encountered during the implementation in Sect. 4.1 and then report on the
results in Sect. 4.2.
4.1 Implementation
During the (re)implementation of TAGME, we encountered several technical
challenges, which we describe here. These could be traced back to differences
between the approach described in the paper and the source code provided by
7 It was later explained by the TAGME authors that they actually used only 1.4M out of 2M snippets from Wiki-Disamb30, as Weka could not load more than that into memory. From Wiki-Annot30 they used all snippets; the difference is merely a matter of approximation.
8 https://archive.org/details/enwiki 20100408.
the authors. Without addressing these differences, the results generated by our
implementation are far from what is expected and are significantly worse than
those by the original system.
kp(m) = P(keyword | m) = key(m) / df(m),    (10)
where key(m) denotes the number of Wikipedia articles where mention m is selected as a keyword, i.e., linked to an entity (any entity), and df(m) is the number of articles containing the mention m. Since in Wikipedia a link is typically created only for the first occurrence of an entity (link(m) ≈ key(m)), we can assume that the numerators of link probability and keyphraseness are identical. This would mean that TAGME in fact uses keyphraseness. Nevertheless, as our goal in this paper is to reproduce the TAGME results, we followed their implementation of this feature, i.e., link(m)/df(m).9
4.2 Results
We report results for the intermediate disambiguation phase and for the end-
to-end entity linking task. For all reproducibility experiments, we set the ρNA
threshold to 0.2, as it delivers the best results and is also the recommended
value in the TAGME paper.
9 The proper implementation of link probability would result in lower values (as the denominator would be higher) and would likely require a different threshold value than what is suggested in [8]. This goes beyond the scope of our paper.
Method P R F
Original paper [8] 0.915 0.909 0.912
TAGME API 0.775 0.775 0.775
the TAGME API results; the relative difference to the API results is –19 % for
our implementation and –12 % for Dexter in Ftopics score. Ceccarelli et al. [3] also
report on deviations, but they attribute these to the processing of Wikipedia: “we
observed that our implementation always improves over the WikiMiner online
service, and that it behaves only slightly worse then TAGME after the top 5
results, probably due to a different processing of Wikipedia.” The difference
between Dexter and our implementation stems from the parsing step. Dexter
relies on its own parsing method and removes overlapping mentions at the end
of the annotation process. We, on the other hand, follow TAGME and delete
overlapping mentions in the parsing step (cf. Sect. 2.1). By analyzing our results,
we observed that this parsing policy resulted in early pruning of some correct
entities and led accordingly to lower results.
Our experiments show that the end-to-end results reported in [8] are repro-
ducible through the TAGME API, but not by (re)implementation of the approach
by a third partner. This is due to undocumented deviations from the published
description.
5 Generalizability
and another from May 2012 (which is part of the ClueWeb12 collection), and
Dexter’s implementation of TAGME. Including results using the 2012 version
of Wikipedia facilitates a better comparison between the TAGME API and our
implementation, as they both use similar Wikipedia dumps. It also demonstrates
how the version of Wikipedia might affect the results.
Datasets and evaluation metrics. We use two publicly available test collections
developed for the ELQ task: ERD-dev [1] and Y-ERD [11]. ERD-dev includes
99 queries, while Y-ERD offers a larger selection, containing 2398 queries. The
annotations in these test collections are confined to proper noun entities from a
specific Freebase snapshot.10 We therefore remove entities that are not present
in this snapshot in a post-filtering step. In all the experiments, ρNA is set to 0.1,
as it delivers the highest results both for the API and for our implementations,
and is also the recommendation of the TAGME API. Evaluation is performed
in terms of precision, recall, and F-measure (macro-averaged over all queries),
as proposed in [1]; this variant is referred to as strict evaluation in [11].
5.2 Results
Table 3 presents the TAGME generalizability results. Similar to the reproducibil-
ity experiments, we find that the TAGME API provides substantially better
results than any of the other implementations. The most fair comparison between
Dexter and our implementations is the one against TAGME-wp12, as that has
the Wikipedia dump closest in date. For ERD-dev they deliver similar results,
while for Y-ERD Dexter has a higher F-score (but the relative difference is
below 10 %). Concerning different Wikipedia versions, the more recent one per-
forms better on the ERD-dev test collection, while the difference is negligible for
Y-ERD. If we take the larger test collection, Y-ERD, to be the more represen-
tative one, then we find that TAGME API > Dexter > TAGME-wp10, which
Acknowledgement. We would like to thank Paolo Ferragina and Ugo Scaiella for
sharing the TAGME source code with us and for the insightful discussions and clarifi-
cations later on. We also thank Diego Ceccarelli for the discussion on link probability
computation and for providing help with the Dexter API.
References
1. Carmel, D., Chang, M.-W., Gabrilovich, E., Hsu, B.-J.P., Wang, K.: ERD’14:
Entity recognition and disambiguation challenge. SIGIR Forum 48(2), 63–77
(2014)
2. Ceccarelli, D., Lucchese, C., Orlando, S., Perego, R., Trani, S.: Dexter: An open
source framework for entity linking. In: Proceedings of the Sixth International
Workshop on Exploiting Semantic Annotations in Information Retrieval, pp. 17–
20 (2013)
3. Ceccarelli, D., Lucchese, C., Orlando, S., Perego, R., Trani, S.: Learning relatedness
measures for entity linking. In: Proceedings of CIKM 2013, pp. 139–148 (2013)
4. Chiu, Y.-P., Shih, Y.-S., Lee, Y.-Y., Shao, C.-C., Cai, M.-L., Wei, S.-L.,
Chen, H.-H.: NTUNLP approaches to recognizing and disambiguating entities in
long and short text at the ERD challenge 2014. In: Proceedings of Entity Recog-
nition & Disambiguation Workshop, pp. 3–12 (2014)
5. Cornolti, M., Ferragina, P., Ciaramita, M.: A framework for benchmarking entity-
annotation systems. In: Proceedings of WWW 2013, pp. 249–260 (2013)
6. Cornolti, M., Ferragina, P., Ciaramita, M., Schütze, H., Rüd, S.: The SMAPH
system for query entity recognition and disambiguation. In: Proceedings of Entity
Recognition & Disambiguation Workshop, pp. 25–30 (2014)
7. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data.
In: Proceedings of EMNLP-CoNLL 2007, pp. 708–716 (2007)
8. Ferragina, P., Scaiella, U.: TAGME: On-the-fly annotation of short text fragments
(by Wikipedia entities). In: Proceedings of CIKM 2010, pp. 1625–1628 (2010)
9. Ferragina, P., Scaiella, U.: Fast and accurate annotation of short texts with
Wikipedia pages. CoRR (2010). abs/1006.3498
10. Han, X., Sun, L., Zhao, J.: Collective entity linking in web text: A graph-based
method. In: Proceedings of SIGIR 2011, pp. 765–774 (2011)
11. Hasibi, F., Balog, K., Bratsberg, S.E.: Entity linking in queries: tasks and evalua-
tion. In: Proceedings of the ICTIR 2015, pp. 171–180 (2015)
12. Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation
of Wikipedia entities in web text. In: Proceedings of KDD 2009, pp. 457–466 (2009)
13. Medelyan, O., Witten, I.H., Milne, D.: Topic indexing with Wikipedia. In: Pro-
ceedings of the AAAI WikiAI Workshop, pp. 19–24 (2008)
14. Meij, E., Balog, K., Odijk, D.: Entity linking and retrieval for semantic search. In:
Proceedings of WSDM 2014, pp. 683–684 (2014)
15. Mihalcea, R., Csomai, A.: Wikify!: Linking documents to encyclopedic knowledge.
In: Proceedings of CIKM 2007, pp. 233–242 (2007)
16. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of CIKM
2008, pp. 509–518 (2008)
17. Milne, D., Witten, I.H.: An effective, low-cost measure of semantic relatedness
obtained from Wikipedia links. In: Proceedings of AAAI Workshop on Wikipedia
and Artificial Intelligence: An Evolving Synergy, pp. 25–30 (2008)
18. Usbeck, R., Röder, M., Ngonga Ngomo, A.-C., Baron, C., Both, A., Brümmer, M.,
Ceccarelli, D., Cornolti, M., Cherix, D., Eickmann, B., Ferragina, P., Lemke, C.,
Moro, A., Navigli, R., Piccinno, F., Rizzo, G., Sack, H., Speck, R., Troncy, R.,
Waitelonis, J., Wesemann, L.: GERBIL: General entity annotator benchmarking
framework. In: Proceedings of WWW 2015, pp. 1133–1143 (2015)
Twitter
Correlation Analysis of Reader’s Demographics
and Tweet Credibility Perception
1 Introduction
Tweets from reliable news sources and trusted authors via known social links are
generally trustworthy. However, when Twitter readers search for tweets regard-
ing a particular topic, the returned messages require readers to determine the
credibility of tweet content. How do readers perceive credibility, and what fea-
tures (available on Twitter) do they use to help them determine credibility?
Since Twitter readers come from all over the world, do demographic attributes
influence their credibility perception?
There are several pieces of research regarding the automated detection
of tweet credibility using various features, especially for news tweets and
rumours [3,7,12,17]. However, these studies focus on building machine learned
classifiers and not on the question of how readers perceive credibility. Other
research that studies readers’ credibility judgments was conducted on web
blogs, Internet news media, and websites [5,6,23,24]. Quantitative studies were
conducted on limited groups of participants to identify particular factors that
influenced readers’ credibility judgments. Since these user studies focused on
certain factors, the subjects for readers’ credibility assessment were controlled and
limited.
We have found that there is a gap in understanding Twitter readers and their
credibility judgments of news tweets. We aim to understand the features readers
use when judging, especially when tweets are from authors unfamiliar to them.
Therefore in this study, we address the following research questions:
1. Do Twitter readers’ demographic profiles correlate with their credibility per-
ception of news tweets?
2. Do the tweet features readers use for their credibility perception correlate
with readers’ demographic profiles?
To answer the research questions, we design a user study of 1,510 tweets
returned by 15 search topics, which are judged by 754 participants. The study
explores the correlation between readers’ demographic attributes, credibility
judgments, and features used to judge tweet credibility. We will focus only on
tweet content features as presented by the Twitter platform and available directly
to readers.
2 Related Work
A class of existing studies focus on tweet credibility prediction by supervised
learning using tweet content and textual features, the tweet author’s social net-
work, and the source of retweets. The credibility of newsworthy tweets is deter-
mined by human annotators that are then used to predict the credibility of
previously unseen tweets [3]. The tweet credibility model presented in [7] was
used to rank the tweets by credibility. Both works used a current trending topics
dataset. Other studies focused on the utility of individual features for automat-
ically predicting credibility [17] and on the credibility verification of tweets for
journalists based on the tweet authors’ influence [19].
Another class of research has examined the features influencing readers’ cred-
ibility perception of tweets. Examining only certain tweet features, Morris et
al. [16] studied just under 300 readers from the US. The authors identified that a
tweet written by authors with a topically related display name influenced reader
credibility perception. Similar research was conducted [23], comparing readers
from China and the US. People from different cultural backgrounds perceived the credibility of tweets differently in terms of which features were used and how. The differences in tweet credibility perception for different topics were also reported in [20]. The study found eight tweet-content features readers use when judging
the credibility level of tweets.
Some research has considered credibility perception in media other than
Twitter. Fogg et al. [5] discovered that different website credibility elements, such as interface, expertise, and security, are influenced by users’ demographic attributes. Another study found that the manipulation level of news
photos influenced credibility perception of news media [6]. The study showed
that people’s demographics influenced the perception of media credibility.
A Taiwan-based study of readers’ credibility perception regarding news-related blogs found that belief factors can predict users’ perceived credibility [24]. It also found that readers’ motivation for using news-related blogs as a news source influenced credibility perception. Demographic variables were also shown
to affect credibility. In another study [11], demographic attributes were also found to correlate with the use of visual features as information credibility factors for microblogs, especially among younger people.
3 Methodology
We describe the collection of credibility judgments and the techniques that we
use to analyze the data.
We also measure which cells in the contingency table influence the χ2 value. The interest or dependence of a cell c is defined as I(c) = Oc/Ec, where Oc and Ec are the observed and expected frequencies of the cell. The further the value is away from 1, the greater its influence on the χ2 value. There is a positive dependence when the interest value is greater than 1, and a negative dependence when it is lower than 1 [1].
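For illustration, assuming the judgments are available as a pandas DataFrame with one row per judgment (the column names are hypothetical), the χ2 test and the per-cell interest values could be computed as follows:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_with_interest(df, attribute, target="credibility"):
    """Cross-tabulate a demographic attribute against credibility judgments,
    run the chi-square test, and return the per-cell interest I(c) = O_c / E_c."""
    observed = pd.crosstab(df[attribute], df[target])
    chi2, p_value, dof, expected = chi2_contingency(observed)
    interest = observed / expected  # cells with values far from 1 drive the chi-square statistic
    return chi2, p_value, interest
```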
In this study, the demographic data collected from the readers are used for the chi-square analysis (see Table 1). The readers’ demographic data, except for gender, are also categorized in binary and categorical settings, based on other research [5,6], to examine any correlation of demographic attributes, or combinations of demographic attributes, with tweet credibility perception. The different ways of partitioning the demographic data are as follows:
– Age: Binary {Young adult (≤ 39 years old), Older adult (≥ 40 years old)} and Categorical {Boomers (51–69 years old), Gen X (36–50 years old), Gen Y (21–35 years old), Gen Z (6–20 years old)} [14]
– Education: Binary {Below university level, University level} and Categorical
{School level, Some college, Undergraduate, Postgraduate}
– Location: Binary {Eastern hemisphere, Western hemisphere} and Categorical
{Asia-Pacific, Americas, Europe, Africa}
We conduct the correlation analysis for each single demographic attribute, under all the different partitionings, against the credibility judgments and the features.
4 Results
A total of 10,571 credibility judgments for 1,510 news tweets were collected
from the user study. Only 9,828 judgments from 819 crowdsource workers were
accepted for this study because only those workers answered the demographic
questions and completed all 12 judgments. If a reader’s credibility judgments did not describe the features used to make the judgment, or gave nonsensical comments, all judgments of that reader were discarded. We also discarded the judgments of two readers from the Oceania continent and three readers without any reported education background, because their low counts undermine the minimum expected frequency required to apply the χ2 analysis. We were left with
a final dataset for analysis from 754 readers with 9,048 judgments.
Our final collection of data includes readers from 76 countries, with the highest number of participants coming from India (15 %). We then group the countries into continents due to data sparsity at the country level. Out of the 754 readers, the majority (69.0 %, n=521) were male, similar to prior work that uses crowdsourced
workers for user studies [11]. Most of the readers were in the 20–29 age group (43.4 %, n=327). In regard to the readers’ education background, the majority had a university degree (38.1 %, n=287). Table 1 shows the readers’ demographic profiles.
4.2 Features
The features reported by readers are features of the tweet message itself, content-
based and source-based. For features reported in free text, we applied a sum-
mative content analysis based on the list of features identified beforehand [9].
Table 2 (column 2) lists the features reported by readers when making their
credibility judgments. Since the features are sparse, it is difficult to analyze their influence on the readers’ credibility judgments. Therefore, we categorize the features into five categories and use these feature categories in all of our feature-related analyses:
– Author: features regarding the person who posted a tweet, including the
Twitter ID, display name, and the avatar image;
4.3 Findings
Table 3 shows the correlation analysis between the individual demographic attributes, under each data setting described in Subsect. 3.2 (Original (O), Binary (B), and Categorical (C)), and the credibility perceptions. In the original data setting, Education and Location are significantly correlated with the credibility judgment, with χ2 = 49.43, p<0.05 and χ2 = 80.79, p<0.05, respectively. Only Location is significantly correlated at all levels of partitioning. A post hoc analysis on the interest value
of cells in the contingency table Education × Credibility for the original data
found the cell that contributes most to the χ2 value is readers with a ‘Profes-
sional certification’, who commonly gave ‘not credible’ judgments. In regards to
the contingency table Location × Credibility, we found there was a correlation
between the readers from the African continent and the ‘cannot decide’ credi-
bility perception in the original and the categorical data settings, with a positive dependence. Both cells’ interest values are far from 1, indicating strong dependence. In the contingency table for Location × Credibility in the binary data setting, the interest value of each cell is close to 1; therefore, there is no strong dependence.
Table 5. The chi-square correlation between demographics and features used in credibility perception
5 Discussion
6 Conclusion
Although research on Twitter information credibility has been reported, most work focuses on automatically predicting or detecting tweet credibility. Our focus is on understanding Twitter readers and what influences their credibility judgments. In this study, we provided new insights into the correlation of reader demographic attributes with credibility judgments of tweets and the features readers used to make those judgments. Furthermore, the richness of the data collected for this study – derived from a wide range of demographic profiles and readers across countries – makes it the first to offer insights on Twitter readers’ direct perception of credibility and the features readers use for credibility judgments. For future work, we plan to examine whether the type of news tweet has any influence on a reader’s credibility perception. We would also like to investigate in more depth the features readers use, and how the credibility levels relate to those features and to the news type.
References
1. Brin, S., Motwani, R., Silverstein, C.: Beyond market baskets: generalizing associ-
ation rules to correlations. ACM SIGMOD Rec. 26(2), 265–276 (1997)
2. Cassidy, W.P.: Online news credibility: an examination of the perceptions of news-
paper journalists. J. Comput. Mediated Commun. 12(2), 478–498 (2007)
3. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on twitter. In:
WWW 2011, pp. 675–684. ACM (2011)
4. Cozzens, M.D., Contractor, N.S.: The effect of conflicting information on media
skepticism. Commun. Res. 14(4), 437–451 (1987)
5. Fogg, B., Marshall, J., Laraki, O., Osipovich, A., Varma, C., Fang, N., Paul, J.,
Rangnekar, A., Shon, J., Swani, P., et al.: What makes web sites credible?: a report
on a large quantitative study. In: SIGCHI 2001, pp. 61–68. ACM (2001)
6. Greer, J.D., Gosen, J.D.: How much is too much? assessing levels of digital alter-
ation of factors in public perception of news media credibility. Vis. Commun. Q.
9(3), 4–13 (2002)
7. Gupta, A., Kumaraguru, P.: Credibility ranking of tweets during high impact
events. In: PSOSM 2012, pp. 2–8. ACM (2012)
8. Hahsler, M., Grün, B., Hornik, K.: Introduction to arules: mining association rules
and frequent item sets. SIGKDD Explor. 2, 4 (2007)
9. Hsieh, H.F., Shannon, S.E.: Three approaches to qualitative content analysis. Qual.
Health Res. 15(9), 1277–1288 (2005)
10. Hu, M., Liu, S., Wei, F., Wu, Y., Stasko, J., Ma, K.L.: Breaking news on twitter.
In: SIGCHI 2012, pp. 2751–2754. ACM (2012)
11. Kang, B., Höllerer, T., O’Donovan, J.: Believe it or not? analyzing information
credibility in microblogs. In: ASONAM 2015, pp. 611–616. ACM (2015)
12. Kang, B., O’Donovan, J., Höllerer, T.: Modeling topic specific credibility on twitter.
In: IUI 2012, pp. 179–188. ACM (2012)
13. Liu, Z.: Perceptions of credibility of scholarly information on the web. Inf. Process.
Manage. 40(6), 1027–1038 (2004)
14. McCrindle, M., Wolfinger, E.: The ABC of XYZ: Understanding the Global Gen-
erations. University of New South Wales Press, Sydney (2009)
15. McDonald, J.H.: Handbook of Biological Statistics, vol. 3. Sparky House Publish-
ing, Baltimore (2014)
16. Morris, M.R., Counts, S., Roseway, A., Hoff, A., Schwarz, J.: Tweeting is believing?:
understanding microblog credibility perceptions. In: Proceedings of the ACM 2012
Conference on Computer Supported Cooperative Work, pp. 441–450. ACM (2012)
17. O’Donovan, J., Kang, B., Meyer, G., Höllerer, T., Adalii, S.: Credibility in context:
an analysis of feature distributions in twitter. In: 2012 International Conference on
Privacy, Security, Risk and Trust (PASSAT), and 2012 International Confernece
on Social Computing (SocialCom), pp. 293–301. IEEE (2012)
18. Rassin, E., Muris, P.: Indecisiveness and the interpretation of ambiguous situations.
Pers. Individ. Differ. 39(7), 1285–1291 (2005)
19. Schifferes, S., Newman, N., Thurman, N., Corney, D., Göker, A., Martin, C.: Iden-
tifying and verifying news through social media: developing a user-centred tool for
professional journalists. Digit. J. 2(3), 406–418 (2014)
20. Mohd Shariff, S., Zhang, X., Sanderson, M.: User perception of information credi-
bility of news on twitter. In: de Rijke, M., Kenter, T., de Vries, A.P., Zhai, C.X.,
de Jong, F., Radinsky, K., Hofmann, K. (eds.) ECIR 2014. LNCS, vol. 8416, pp.
513–518. Springer, Heidelberg (2014)
21. Sundar, S.S.: Effect of source attribution on perception of online news stories. J.
Mass Commun. Q. 75(1), 55–68 (1998)
22. Tan, P.N., Kumar, V.: Association analysis: basic concepts and algorithms. In:
Introduction to Data Mining, Chap. 6. Addison-Wesley (2005)
23. Yang, J., Counts, S., Morris, M.R., Hoff, A.: Microblog credibility perceptions:
comparing the usa and china. In: Proceedings of the 2013 Conference on Computer
Supported Cooperative Work, pp. 575–586. ACM (2013)
24. Yang, K.C.C.: Factors influencing internet users perceived credibility of news-
related blogs in taiwan. Telematics Inform. 24(2), 69–85 (2007)
Topic-Specific Stylistic Variations for Opinion
Retrieval on Twitter
1 Introduction
Microblogs have emerged as a popular platform for sharing information and
expressing opinion. Twitter attracts 284 million active users per month who
post about 500 million messages every day1 . Due to its increasing popularity,
Twitter has emerged as a vast repository of information and opinion on various
topics. However, all this opinionated information is hidden within a vast amount of data, and it is therefore impossible for a person to look through all of the data and extract the useful information.
Twitter opinion retrieval aims to identify tweets that are both relevant to a
user’s query and express opinion about it. Twitter opinion retrieval can be used
as a tool to understand public opinion about a specific topic, which is helpful
for a variety of applications. One typical example is enterprises capturing the views of customers about their products or their competitors. This information can then be used to improve the quality of their services or products accordingly. In addition, it is possible for the government to understand the
public view regarding different social issues and act promptly.
1 See: https://about.twitter.com/company/.
2 Related Work
With the rapid growth of social media platforms, sentiment analysis and opinion
retrieval have attracted much attention in the research community [14,15]. Early
research focused on classifying documents as expressing either a positive or a
negative opinion. The relevance of an opinionated document towards a topic
was first considered by Yi et al. [23], while Eguchi and Lavrenko [3] were the
first to consider ranking documents according to the opinion they contain about
a topic. A comprehensive review of opinion retrieval and sentiment analysis can
be found in a survey by Pang and Lee [15].
The increasing popularity of Twitter has recently stirred up research in the
field of Twitter sentiment analysis. One of the first studies was carried out by
Go et al. [4] treated the problem as one of binary classification, classifying tweets
as either positive or negative. Due to the difficulty of manually tagging the sen-
timent of the tweets, they employed distant supervision to train a supervised
machine learning classifier. The authors used a technique devised by Read [18]
to collect the data, according to which emoticons can be used to differenti-
ate the negative and positive tweets. They compared Naive Bayes (NB), Max-
imum Entropy (MaxEnt) and Support Vector Machines (SVM), among which
SVM with unigrams achieved the best result. Following Go et al. [4], Pak and
Paroubek [12] used emoticons to label training data from which they built a
multinomial Naïve Bayes classifier which used N-grams and POS tags as features.
Due to the informal language used on Twitter, which frequently contains
unique stylistic features, a number of researchers explored features such as emoti-
cons, abbreviations and emphatic lengthening, studying their impact on senti-
ment analysis. Brody and Diakopoulos [2] showed that the lengthening of words
(e.g., cooool) in microblogs is strongly associated with subjectivity and senti-
ment. Kouloumpis et al. [8] showed that Twitter-specific features such as the
presence or absence of abbreviations and emoticons improve sentiment analysis
performance. None of these approaches considered, however, the possibility that
stylistic features may depend on the topic of the tweet.
Topic-dependent approaches have been considered by researchers in rela-
tion to terms. Jiang et al. [7] used manually-defined rules to detect the syn-
tactic patterns that showed if a term was related to a specific object. They
employed a binary SVM to apply subjectivity and polarity classification and
utilised microblog-specific features to create a graph which reflects the similar-
ities of tweets. Van Canneyt et al. [20] introduced a topic-specific classifier to effec-
tively detect the tweets that express negative sentiment whereas Wang et al. [21]
leveraged the co-occurrence of hashtags to detect their sentiment polarity.
Twitter opinion retrieval was first considered by Luo et al. [10] who proposed
a learning-to-rank algorithm for ranking tweets based on their relevance and
opinionatedness towards a topic. They used SVMRank to compare different social
and opinionatedness features and showed they can improve the performance of
Twitter opinion retrieval. However, this improvement is over relevance baselines
(BM25 and VSM retrieval models) and not over an opinion baseline. Our work
is different as we propose to incorporate topic-specific stylistic variations into
a ranking function to generate an opinion score for a tweet. To the best of our
knowledge, there is no work exploring the importance of topic-specific stylistic
variations for Twitter opinion retrieval. Another important difference is that we
use both relevance and opinion baselines to compare the proposed topic-specific
stylistic opinion retrieval method.
3 Topic Classification
Topic models aim to identify text patterns in document content. Standard topic
models include Latent Dirichlet Allocation (LDA) [1] and Probabilistic Latent
Semantic Indexing (pLSI) [5]. LDA, one of the most well known topic mod-
els, is a generative document model which uses a “bag of words” approach and
1. Choose θd ∼ Dir(α),
2. Choose φz ∼ Dir(β),
3. For each of the N words wn:
   (a) choose a topic zn ∼ Multinomial(θd),
   (b) choose the word wn ∼ Multinomial(φzn).
Topic models have been applied in a wide range of areas including Twitter.
Hong and Davison [6] conducted an empirical study to investigate the best way
to train models for topic modeling on Twitter. They showed that topic mod-
els learned from aggregated messages of the same user may lead to superior
performance in classification problems. Zhao et al. [24] proposed the Twitter-
LDA model which considered the shortness of tweets to compare topics discussed
in Twitter with those in traditional media. Their results showed that Twitter-
LDA works better than LDA in terms of semantic coherence. Ramage et al. [17]
applied labeled-LDA in Twitter, a partially supervised learning model based on
hashtags. Inspired by the popularity of LDA, Krestel et al. [9] proposed using
LDA for tag recommendation. Based on the intuition that tags and words are
generated from the same set of latent topics, they used the distributions of latent
topics to represent tags and descriptions and to recommend tags.
In this work, we use LDA [1] to determine the topics of tweets, which are
then used to learn the importance of the stylistic variations for each topic.
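As a sketch of this step (the gensim-based training and the preprocessing are assumptions on our part; only the number of topics, 65, is taken from the experiments reported below), LDA topics can be learned from tokenised tweets as follows:

```python
from gensim import corpora, models

def train_tweet_lda(tweets, num_topics=65):
    """tweets: list of tweets, each already tokenised into a list of terms."""
    dictionary = corpora.Dictionary(tweets)
    corpus = [dictionary.doc2bow(tweet) for tweet in tweets]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    return lda, dictionary

def tweet_topic(lda, dictionary, tweet_tokens):
    """Return the most probable topic id for a (tokenised) tweet."""
    bow = dictionary.doc2bow(tweet_tokens)
    topics = lda.get_document_topics(bow, minimum_probability=0.0)
    return max(topics, key=lambda pair: pair[1])[0]
```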
opinion(d) = Σ_{t∈d} p(t|d) · opinion(t),    (1)

where p(t|d) = c(t, d)/|d| is the relative frequency of term t in document d and opinion(t) shows the opinionatedness of the term.
Since this is one of the most widely used methods to calculate the opinion-
atedness of a document, we also use this method as one of our baselines.
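A minimal sketch of this term-based baseline, assuming an opinion lexicon of absolute valence scores that is MinMax-normalised as described in the experimental setup below:

```python
from collections import Counter

def minmax_normalise(lexicon):
    """Map absolute valence scores into [0, 1], treating 0 as the minimum
    (i.e., as if the lexicon contained one neutral term with score 0)."""
    max_score = max(lexicon.values())
    return {term: score / max_score for term, score in lexicon.items()}

def opinion_score(tweet_terms, lexicon):
    """opinion(d) = sum_t p(t|d) * opinion(t), with p(t|d) = c(t, d) / |d|."""
    if not tweet_terms:
        return 0.0
    counts = Counter(tweet_terms)
    length = len(tweet_terms)
    return sum((c / length) * lexicon.get(t, 0.0) for t, c in counts.items())
```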
IDFProb(l) = 0 if N = n_l, and IDFProb(l) = log((N − n_l) / n_l) otherwise.    (3)
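For completeness, a direct transcription of Eq. (3); here we assume n_l is the number of tweets containing stylistic variation l and N the total number of tweets, and we guard against n_l = 0, which the equation does not cover.

```python
import math

def idf_prob(N, n_l):
    """Probabilistic IDF of a stylistic variation l (Eq. (3))."""
    if N == n_l or n_l == 0:  # the n_l == 0 guard is an addition, not part of Eq. (3)
        return 0.0
    return math.log((N - n_l) / n_l)
```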
5 Experimental Setup
for terms with a negative sentiment or from 1 to 5 for terms with a positive
sentiment. We chose this lexicon as it contains affective words that are used
in Twitter. We took the absolute values of the scores since we do not consider
sentiment polarity in our study. We use MinMax normalisation to convert the valence score of a term to an opinion score. To avoid getting zero scores for terms with absolute score 1, we consider that the lexicon also has one term with no
sentiment (assigned the score 0), so that 0 is the minimum score.
To calculate the stylistic-based component of our model, we identified, for
each tweet, the number of emoticons, exclamation marks, terms under emphatic
lengthening and opinionated hashtags as follows:
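A rough sketch of how such counts could be extracted; the emoticon list, the regular expression for emphatic lengthening, and the opinionated-hashtag test are our own simplifications, not the authors' exact rules.

```python
import re

EMOTICONS = {":)", ":-)", ":(", ":-(", ":D", ";)", ":P"}  # tiny illustrative subset

def stylistic_counts(tweet, opinion_lexicon):
    tokens = tweet.split()
    return {
        "emoticons": sum(tok in EMOTICONS for tok in tokens),
        "exclamations": tweet.count("!"),
        # Emphatic lengthening: any character repeated three or more times
        "emphatic": sum(bool(re.search(r"(\w)\1{2,}", tok)) for tok in tokens),
        # Hashtags whose text (without '#') appears in the opinion lexicon
        "opinion_hashtags": sum(
            tok.startswith("#") and tok[1:].lower() in opinion_lexicon
            for tok in tokens),
    }
```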
Evaluation. We compare the proposed opinion retrieval method with two base-
lines. The first, BM25, is the method with the best performance in Twitter opin-
ion retrieval according to the results presented in [10]. The Relevance-Baseline
is based purely on topical relevance and does not consider opinion. As a second
baseline, we use the term-based opinion score (Eq. 1). The Opinion-Baseline
considers opinion and therefore it is a more appropriate baseline to compare
our results with. To evaluate the methods, we report Mean Average Precision
(MAP), which is the only metric reported in previous work [10] in Twitter opin-
ion retrieval. To compare the different methods, we used the Wilcoxon signed-rank matched-pairs test with a significance level of 0.05.
4 See http://en.wikipedia.org/wiki/List_of_emoticons.
Table 1 shows a list of five topics which were discovered in the collection of
tweets when the number of topics was set to 65. We observe that LDA managed
to group terms that are about the same topic together.
Table 2. Performance results of the SVFLog IDFInv method under topic-based settings, using different combinations of stylistic variations, over the baselines. A star (∗) and a dagger (†) indicate a statistically significant improvement over the relevance and opinion baselines, respectively.

Method                                   MAP
Relevance-Baseline                       0.2835
Opinion-Baseline                         0.3807∗
SVFLog IDFInv-Emot-Excl                  0.4314∗†
SVFLog IDFInv-Emot-Excl-Emph             0.4413∗†
SVFLog IDFInv-Emot-Excl-Emph-OpHash      0.4344∗†
Fig. 1. Difference in performance between the topic-based SVFLog IDFProb and the non topic-based SVFLog IDFProb model. Positive/negative bars indicate improvement/decline over the non topic-based SVFLog IDFProb model in terms of MAP.

Table 4. Topics that are helped or hurt the most in the SVFLog IDFProb model under topic-based compared to non topic-based settings.
Helped                       Hurt
Title         Δ MAP          Title            Δ MAP
iran          0.1795         new start-ups    −0.1833
Lenovo        0.1185         iran nuclear     −0.0480
galaxy note   0.1017         big bang         −0.0319
features (URL, Mention, Statuses, Followers) together with BM25 score, and Query-Dependent opinionatedness (Q_D) features.
good indicators for identifying opinionated tweets and that opinion retrieval per-
formance is improved when emoticons, exclamation marks and emphatic length-
ening are taken into account. Additionally, we demonstrated that the importance
of stylistic variations in indicating opinionatedness is indeed topic dependent as
our topic model-based approaches significantly outperformed those that assumed
importance to be uniform over topics.
In future, we plan to extend the topic-based opinion retrieval method by
investigating the effect of assigning different importance weights to each stylistic
variation. We also plan to evaluate the performance of our method on other
datasets that consider opinion retrieval on short texts that share similar stylistic
variations to tweets such as MySpace and YouTube comments.
Acknowledgments. This research was partially funded by the Swiss National Science
Foundation (SNSF) under the project OpiTrack.
References
1. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. J. Mach. Learn. Res. 3,
993–1022 (2003)
2. Brody, S., Diakopoulos, N.: Cooooooollllllllll!!!!!!!!!: using word lengthening to
detect sentiment in microblogs. In: EMNLP 2011, pp. 562–570 (2011)
3. Eguchi, K., Lavrenko, V.: Sentiment retrieval using generative models. In: EMNLP
2006, pp. 345–354 (2006)
4. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. Technical report, Stanford (2009)
5. Hofmann, T.: Probabilistic latent semantic indexing. In: SIGIR 1999, pp. 50–57
(1999)
6. Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In:
SIGKDD Workshop on SMA, pp. 80–88 (2010)
7. Jiang, L., Yu, M., Zhou, M., Liu, X., Zhao, T.: Target-dependent twitter sentiment
classification. In: ACL, HLT 2011, pp. 151–160 (2011)
8. Kouloumpis, E., Wilson, T., Moore, J.: Twitter sentiment analysis: The good the
bad and the omg!. In: ICWSM 2011, pp. 538–541 (2011)
9. Krestel, R., Fankhauser, P., Nejdl, W.: Latent dirichlet allocation for tag recom-
mendation. In: RecSys 2009, pp. 61–68 (2009)
10. Luo, Z., Osborne, M., Wang, T.: An effective approach to tweets opinion retrieval.
In: WWW 2013, pp. 1–22 (2013)
11. Nielsen, F.: A new ANEW: Evaluation of a word list for sentiment analysis of
microblogs. In: ESWC 2011 Workshop on ’Making Sense of Microposts’: Big Things
Come in Small Packages, pp. 93–98 (2011)
12. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion
mining. In: LREC 2010, pp. 1320–1326 (2010)
13. Paltoglou, G., Buckley, K.: Subjectivity annotation of the microblog 2011 realtime
adhoc relevance judgments. In: Serdyukov, P., Braslavski, P., Kuznetsov, S.O.,
Kamps, J., Rüger, S., Agichtein, E., Segalovich, I., Yilmaz, E. (eds.) ECIR 2013.
LNCS, vol. 7814, pp. 344–355. Springer, Heidelberg (2013)
14. Paltoglou, G., Giachanou, A.: Opinion retrieval: searching for opinions in social
media. In: Paltoglou, G., Loizides, F., Hansen, P. (eds.) Professional Search in the
Modern World. LNCS, vol. 8830, pp. 193–214. Springer, Heidelberg (2014)
15. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr.
2(1–2), 1–135 (2008)
16. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
17. Ramage, D., Dumais, S., Liebling, D.: Characterizing microblogs with topic models.
In: ICWSM 2010, pp. 1–8 (2010)
18. Read, J.: Using emoticons to reduce dependency in machine learning techniques
for sentiment classification. In: ACL Student Research Workshop, pp. 43–48 (2005)
19. Strunk, W.: The Elements of Style. Penguin, New York (2007)
20. Van Canneyt, S., Claeys, N., Dhoedt, B.: Topic-dependent sentiment classification
on twitter. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015.
LNCS, vol. 9022, pp. 441–446. Springer, Heidelberg (2015)
21. Wang, X., Wei, F., Liu, X., Zhou, M., Zhang, M.: Topic sentiment analysis in
twitter: a graph-based hashtag sentiment classification approach. In: CIKM 2011,
pp. 1031–1040 (2011)
22. Yao, L., Mimno, D., McCallum, A.: Efficient methods for topic model inference on
streaming document collections. In: SIGKDD 2009, pp. 937–946 (2009)
23. Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment analyzer: extracting
sentiments about a given topic using natural language processing techniques. In:
ICDM 2003, pp. 427–434 (2003)
24. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing
twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin,
C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol.
6611, pp. 338–349. Springer, Heidelberg (2011)
Inferring Implicit Topical Interests on Twitter
Abstract. Inferring user interests from their activities in the social net-
work space has been an emerging research topic in recent years. While much work has been done towards detecting the explicit interests of users from their social posts, less work is dedicated to identifying implicit interests,
which are also very important for building an accurate user model. In this
paper, a graph based link prediction schema is proposed to infer implicit
interests of the users towards emerging topics on Twitter. The underly-
ing graph of our proposed work uses three types of information: users’ followerships, users’ explicit interests towards the topics, and the relatedness of the topics. To investigate the impact of each type of information
on the accuracy of inferring user implicit interests, different variants of
the underlying representation model are investigated along with several
link prediction strategies in order to infer implicit interests. Our exper-
imental results demonstrate that using topic relatedness information,
especially when determined through semantic similarity measures, has
considerable impact on improving the accuracy of user implicit interest
prediction, compared to when followership information is only used.
1 Introduction
The growth of social networks such as Twitter has allowed users to share and
publish posts on a variety of social events as they happen, in real time, even
before they are released in traditional news outlets. This has recently attracted
many researchers to analyze posts to understand the current emerging top-
ics/events on Twitter in a given time interval by viewing each topic as a com-
bination of temporally correlated words/terms or semantic concepts [2,4]. For
instance, on 2 December 2010, Russia and Qatar were selected as the locations
for the 2018 and 2022 FIFA World Cups. By looking at Twitter data on this
day, a combination of keywords like ‘FIFA World Cup’, ‘Qatar’, ‘England’ and
‘Russia’ have logically formed a topic to represent this event.
The ability to model user interests towards these emerging topics provides
the potential for improving the quality of the systems that work on the basis of
user interests such as news recommender systems [21]. Most existing approaches
build a user interest profile based on the explicit contribution of the user to the
emerging topics [1,15]. However, such approaches struggle to identify a user’s
interests if the user has not explicitly talked about them. Consider the tweets
posted by Mary:
– “Qatar’s bid to host the 2022 World Cup is gaining momentum, worrying the
U.S., which had been the favorite http://on.wsj.com/a8j3if”
– “Russia rests 2018 World Cup bid on belief that big and bold is best | Owen
Gibson (Guardian) http://feedzil.la/g2Mpbs”
Based on the keywords explicitly mentioned by Mary in her tweets, one could
easily infer that she is interested in the Russia and Qatar’s selection as the
hosts for the 2018 and 2022 FIFA World Cups. We refer to such interests that
are directly derivable from a user’s tweets as explicit interests. Expanding on
this example, another topic emerged later in 2010, which was related to Prince
William’s engagement. Looking at Mary’s tweets she never referred to this topic
in her tweet stream. However, it is possible that Mary is British and is interested
in both football and the British Royal family, although never explicitly tweeted
about the latter. If that is in fact the case, then Mary’s user profile would need
to include such an interest. We refer to these concealed user topical interests
as implicit interests, i.e., topics that the user never explicitly engaged with but
might have interest in.
The main objective of our work in this paper is to determine implicit interests
of a user over the emerging topics in a given time interval. To this end, we propose
to turn the implicit interest detection problem into a graph-based link predic-
tion problem that operates over a heterogeneous graph by taking into account
(i) users’ interest profile built based on their explicit contribution towards the
extracted topics, (ii) theory of Homophily [12], which refers to the tendency of
users to connect to users with common interests or preferences; and (iii) rela-
tionship between emerging topics, based on their similar constituent contents
and user contributions towards them. More specifically, the key contributions of
our work are as follows:
– Based on the earlier works [7,21], we model users’ interests over the emerging
topics on Twitter through a set of correlated semantic concepts. Therefore, we
are able to infer finer-grained implicit interests that refer to real-world events.
– We propose a graph-based framework to infer the implicit interests of users
toward the identified topics through a link prediction strategy. Our work con-
siders a heterogeneous graph that allows for including three types of infor-
mation: user followerships, user explicit interests and topic relatedness.
– We perform extensive experimentation to determine the impact of one or a
combination of these information types on accurately predicting the implicit
interests of users.
The rest of this paper is organized as follows. In the next section, we review
the related work. Our framework to infer users’ implicit interests is introduced
in Sect. 3. Section 4 is dedicated to the details of our empirical experimentation
and our findings. Finally, in Sect. 5, we conclude the paper.
2 Related Work
In this paper, we assume that an existing state-of-the-art technique such as those
proposed in [2,4] can be employed for extracting and modeling the emerging
topics on Twitter as sets of temporally correlated terms/concepts. Therefore,
we are not concerned with the process of identifying the topics and will
only focus on determining the implicit interests of users towards the topics once
they are identified. Given this focus, we review the works that are related to the
problem of user interest detection from social networks.
Several works extract users’ interests from social networks through the analysis
of user-generated textual content. Yang et al. [19] model a user’s interests by
representing her tweets as a bag of words and by applying a cosine similarity
measure to determine the similarity between users in order to infer common
interests. Xu et al. [18] have proposed an
author-topic model where the latent variables are used to indicate if the tweet
is related to the author’s interests.
Since bag-of-words and topic modeling approaches are designed for normal-length
texts, they may not perform as effectively on short, noisy, and informal
text such as tweets. There are insufficient co-occurrence frequencies between
keywords in short posts to enable the generation of appropriate word vector
representations [5]. Furthermore, bag of words approaches overlook the underly-
ing semantics of the text. To address these issues, some recent works have tried
to utilize external knowledge bases to enrich the representation of short texts
[8,13]. Abel et al. [1] have enriched Twitter posts by linking them to related news
articles and then modeled user’s interests by extracting the entities mentioned
in the enriched messages. DBpedia and Freebase are often used for enriching
Tweets by linking their content with unambiguous concepts from these external
knowledge bases. Such an association provides explicit semantics for the content
of a tweet and can hence be considered to be providing additional contextual
information about the tweet [8,10]. The work in [21] has inferred fine-grained
user topics of interest by extracting temporally related concepts in a given time
interval.
While most of the works mentioned above have focused on extracting explicit
interests through analysing only textual contents of users, less work has been
dedicated to inferring the implicit interests of users. Some authors have leveraged
the Homophily theory [12] to extract implicit interests. Based on this
theory, users tend to connect to users with common interests or preferences.
Mislove et al. [14] have used this theory to infer missing interests of a user based
on the information provided by her neighbors. Wang et al. [16] have extended
this theory by extracting user interests based on implicit links between users
in addition to explicit relations. While these works incorporate the relationship
between users, they do not consider the relationship between the emerging topics
themselves. In our work, we are interested in exploring whether a holistic view that
considers the semantics of the topics, the user followership information and the
explicit interests of users towards the topics can provide an efficient platform for
identifying users’ implicit interests.
In another line of work, semantic concepts and their relationships defined
in external knowledge bases are leveraged to extract implicit user interests.
Kapanipathi et al. [10] have extracted implicit interests of the user by mapping
her primitive interests to the Wikipedia category hierarchy using a spreading
activation algorithm. Similarly, Michelson and Macskassy [13] have identified the
high-level interests of the user by traversing and analyzing the Wikipedia cate-
gories of entities extracted from the user’s tweets. The main difference between
the problem we tackle here and the previously mentioned works is that we view
each topic of interest as a combination of correlated concepts as opposed to just
a single concept. So the relationship between two topics is not predefined in the
external knowledge base and we need to provide a measure of topic similarity or
relatedness.
where R_u^+ is the set of topics that user u is interested in, p_j and q_i are the learned
topic latent factors, n_u^+ is the number of topics that user u is interested in, and
α is a user-specified parameter between 0 and 1. According to [9], matrices P
and Q can be learned by minimizing a regularized optimization problem:
\[ \min_{P,Q,b_u,b_i} \; \frac{1}{2} \sum_{u,i \in R} \| r_{ui} - \hat{r}_{ui} \|_F^2 + \frac{\beta}{2} \left( \|P\|_F^2 + \|Q\|_F^2 \right) + \frac{\lambda}{2} \|b_u\|_2^2 + \frac{\gamma}{2} \|b_i\|_2^2 \qquad (3) \]
where the vectors b_u and b_i correspond to the user u and topic z_i bias terms.
The optimization problem can be solved using stochastic gradient descent
to learn the two matrices P and Q. Given P and Q as latent factors of topics, the
collaborative relatedness of two topics z_i and z_j is computed as the dot product
between the corresponding factors from P and Q, i.e., p_i and q_j.
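As an illustration only, the following minimal Python/NumPy sketch shows one way the stochastic gradient descent updates implied by Eq. (3) could look. It assumes a FISM-style prediction r̂_ui built from the factors of the user’s other topics of interest (the prediction rule itself is not reproduced in this excerpt), treats every explicit interest as r_ui = 1, and uses placeholder hyper-parameter values; it is not the authors’ implementation.

import numpy as np

def learn_topic_factors(R_plus, n_topics, k=10, alpha=0.5, lr=0.01,
                        beta=0.1, lam=0.1, gamma=0.1, epochs=30, seed=0):
    # R_plus: dict mapping each user u to the set of topics u is explicitly interested in
    rng = np.random.default_rng(seed)
    P = 0.01 * rng.standard_normal((n_topics, k))   # topic latent factors p_j
    Q = 0.01 * rng.standard_normal((n_topics, k))   # topic latent factors q_i
    b_u = {u: 0.0 for u in R_plus}                  # user biases
    b_i = np.zeros(n_topics)                        # topic biases
    for _ in range(epochs):
        for u, topics in R_plus.items():
            for i in topics:
                others = [j for j in topics if j != i]
                if not others:
                    continue
                # FISM-style estimate: aggregate the other interest factors, damped by alpha
                agg = P[others].sum(axis=0) / (len(others) ** alpha)
                r_hat = b_u[u] + b_i[i] + agg @ Q[i]
                err = 1.0 - r_hat                   # observed r_ui = 1 for explicit interests (assumption)
                b_u[u] += lr * (err - lam * b_u[u])
                b_i[i] += lr * (err - gamma * b_i[i])
                Q[i] += lr * (err * agg - beta * Q[i])
                P[others] += lr * (err * Q[i] / (len(others) ** alpha) - beta * P[others])
    return P, Q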
While the collaborative relatedness measure can find the topic relatedness
based on the user’s contributions to the topics, it overlooks the semantic relat-
edness between the two topics. In the third approach, we develop a hybrid relat-
edness measure that considers both the semantic relatedness of the concepts
within each topic and users’ contributions towards the emerging topics.
We follow the assumption of [20] for utilizing item attribute information to add
the item relationship regularization term into Eq. (3). Based on this, two topic
latent feature vectors would be considered similar if they are similar accord-
ing to their attribute information. The topic relationship regularization term is
defined as:
\[ \frac{\delta}{2} \sum_{i=1}^{|Z|} \sum_{i'=1}^{|Z|} S_{ii'} \left( \|q_i - q_{i'}\|_F^2 + \|p_i - p_{i'}\|_F^2 \right) \qquad (4) \]
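For concreteness, a direct (unoptimized) computation of the regularization term in Eq. (4) could look as follows; S is assumed to be a precomputed |Z| × |Z| matrix of pairwise topic attribute similarities and δ a placeholder weight.

import numpy as np

def topic_relationship_regularizer(P, Q, S, delta=0.1):
    # Eq. (4): penalize distant latent factors for topics whose attributes are similar
    reg = 0.0
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            reg += S[i, j] * (np.sum((Q[i] - Q[j]) ** 2) + np.sum((P[i] - P[j]) ** 2))
    return 0.5 * delta * reg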
4 Experiments
Table 1. The five link prediction strategies chosen for user implicit interest prediction
Adamic/Adar            score(x, y) = Σ_{z∈Γ(x)∩Γ(y)} 1 / log|Γ(z)|,  where Γ(x) is the set of neighbors of vertex x
Common neighbors       score(x, y) = |Γ(x) ∩ Γ(y)|
Jaccard’s coefficient  score(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|
Katz                   score(x, y) = Σ_{ℓ=1}^{∞} β^ℓ |paths⟨ℓ⟩_{x,y}|,  where paths⟨ℓ⟩_{x,y} is the set of length-ℓ paths between x and y
SimRank                score(x, y) = γ · Σ_{a∈Γ(x)} Σ_{b∈Γ(y)} score(a, b) / (|Γ(x)| · |Γ(y)|)
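To make the strategies in Table 1 concrete, the sketch below computes the neighborhood-based scores over a simple undirected graph given as a neighbor dictionary, plus a truncated version of the Katz score over an adjacency matrix; the damping value and the truncation length are placeholders (the paper’s Katz score sums over all path lengths).

import math
import numpy as np

def adamic_adar(nbrs, x, y):
    # nbrs: dict mapping each vertex to the set of its neighbors
    return sum(1.0 / math.log(len(nbrs[z]))
               for z in nbrs[x] & nbrs[y] if len(nbrs[z]) > 1)

def common_neighbors(nbrs, x, y):
    return len(nbrs[x] & nbrs[y])

def jaccard(nbrs, x, y):
    union = nbrs[x] | nbrs[y]
    return len(nbrs[x] & nbrs[y]) / len(union) if union else 0.0

def katz_truncated(A, i, j, beta=0.005, max_len=4):
    # sum over path lengths l of beta^l times the number of length-l paths from i to j
    score, power = 0.0, A.astype(float)
    for l in range(1, max_len + 1):
        score += (beta ** l) * power[i, j]
        power = power @ A
    return score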
model: (i) followership information (F) and (ii) the type of topic relatedness
measure, i.e., semantic (S), collaborative (C) or hybrid (CS). By selecting and
combining the different alternatives, we obtain 7 variants that we will systemat-
ically compare in this section. We include user’s explicit interest information in
all of the seven variants. As some brief example on how to interpret the models,
Model F only uses user followership information in addition to users’ explicit
interests. The SF Model considers topic relationships computed using semantic
relatedness in addition to user followership and user’s explicit interests. The rest
of the models can be interpreted similarly.
In order to make a fair comparison, we repeat the experimentation for all the
selected link prediction strategies introduced in Table 1. The results in terms of
AUROC and AUPR are reported in Table 2. Given that AUROC and AUPR values
can be misleading in some cases, we also visually inspect the ROC curves in
addition to the area-under-the-curve values. Due to space limitations and
the theorem proved in [6] that a curve dominates in ROC space if and
only if it dominates in PR space, we only present the ROC curves in Fig. 1.
As illustrated in Table 2 and Fig. 1, we can clearly see that the SimRank link
prediction method does not show good performance on any of the variants.
Based on our results, SimRank acts as a random predictor because for most
of the models its AUROC value is about 0.5 and its ROC curve is near y = x.
Therefore, in the rest of this section, when investigating the influence of the
different variants of our representation model on the performance of inferring
users’ implicit interests, we ignore the results of the SimRank strategy.
Fig. 2. Topmost related topics based on Hybrid (left) and semantic (right) measures
When looking at the results in Table 2, one can see that model C shows
slightly weaker results compared to S, which points to two observations:
(i) semantic relatedness of topics is a more accurate indication of the tendency
of users towards topics compared to collaborative relatedness of topics, and
(ii) while C shows a weaker performance, its performance is in most cases only
slightly weaker. This could mean that there is some degree of similarity between
the results obtained by the two methods (C and S) pointing to the fact that even
when using the collaborative relatedness measure, a comparable result to when
the semantic relatedness measure is used can be obtained. Our explanation for
this is that Twitter users seem to follow topics that are from similar domains
or genres. This is an observation that is also reported in [3] and can be seen in
the Who Likes What system. Therefore, when trying to predict a user’s implicit
interest, it would be logical to identify those that are on topics closely related
to the user’s explicit interests. Given this observation, the users that are most
similar within the context of collaborative filtering are likely to also be following
a coherent set of topics (rather than a variety of topics) and therefore provide grounds
for a reasonable estimation of the implicit interests.
The observation that S provides the best performance for predicting implicit
interests is more appealing when the computational complexity involved in its
computation is compared with the other methods. The computation of S only
involves the calculation of the semantic similarity of the concepts in each pair
of topics, which is quite an inexpensive operation, whereas the computation of
C and CS requires solving an optimization problem through stochastic gradient
descent. Additionally, by comparing C and CS, it can be concluded that adding
semantic relatedness information about the topics to the collaborative measure
leads to improved results.
References
1. Abel, F., Gao, Q., Houben, G.-J., Tao, K.: Analyzing user modeling on twitter for
personalized news recommendations. In: Konstan, J.A., Conejo, R., Marzo, J.L.,
Oliver, N. (eds.) UMAP 2011. LNCS, vol. 6787, pp. 1–12. Springer, Heidelberg
(2011)
2. Aiello, L.M., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R.,
Goker, A., Kompatsiaris, I., Jaimes, A.: Sensing trending topics in twitter. IEEE
Trans. Multimedia 15(6), 1268–1282 (2013)
3. Bhattacharya, P., Muhammad, B.Z., Ganguly, N., Ghosh, S., Gummadi, K.P.:
Inferring user interests in the twitter social network. In: RecSys 2014, pp. 357–360
(2014)
4. Cataldi, M., Di Caro, L., Schifanella, C.: Emerging topic detection on twitter based
on temporal and social terms evaluation. In: IMDMKDD 2010, p. 4 (2010)
5. Cheng, X., Yan, X., Lan, Y., Guo, J.: Btm: Topic modeling over short texts. IEEE
Trans. Knowl. Data Eng. 26(12), 2928–2941 (2014)
6. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves.
In: ICML 2006, pp. 233–240 (2006)
7. Fani, H., Zarrinkalam, F., Zhao, X., Feng, Y., Bagheri, E., Du, W.: Temporal iden-
tification of latent communities on twitter (2015). arXiv preprint arxiv:1509.04227
8. Ferragina, P., Scaiella, U.: Fast and accurate annotation of short texts with
wikipedia pages. IEEE Softw. 29(1), 70–75 (2012)
9. Kabbur, S., Ning, X., Karypis, G.: FISM: factored item similarity models for top-N
recommender systems. In: KDD 2013, pp. 659–667 (2013)
10. Kapanipathi, P., Jain, P., Venkataramani, C., Sheth, A.: User interests identifi-
cation on twitter using a hierarchical knowledge base. In: Presutti, V., d’Amato,
C., Gandon, F., d’Aquin, M., Staab, S., Tordai, A. (eds.) ESWC 2014. LNCS, vol.
8465, pp. 99–113. Springer, Heidelberg (2014)
11. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks.
J. Am. Soc. Inform. Sci. Technol. 58(7), 1019–1031 (2007)
12. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: Homophily in
social networks. Ann. Rev. Sociol. 27, pp. 415–444 (2001)
13. Michelson, M., Macskassy, S.A.: Discovering users’ topics of interest on twitter: A
first look. In: AND 2010, pp. 73–80 (2010)
14. Mislove, A., Viswanath, B., Gummadi, K.P., Druschel, P.: You are who you know:
inferring user profiles in online social networks. In: WSDM 2010, pp. 251–260
(2010)
15. Shin, Y., Ryo, C., Park, J.: Automatic extraction of persistent topics from social
text streams. World Wide Web 17(6), 1395–1420 (2014)
16. Wang, J., Zhao, W.X., He, Y., Li, X.: Infer user interests via link structure regu-
larization. TIST 5(2), 23 (2014)
17. Witten, I., Milne, D.: An effective, low-cost measure of semantic relatedness
obtained from wikipedia links. In: WikiAI 2008, pp. 25–30 (2008)
18. Xu, Z., Lu, R., Xiang, L., Yang, Q.: Discovering user interest on twitter with a
modified author-topic model. In: WI-IAT 2011, vol. 1, pp. 422–429 (2011)
19. Yang, L., Sun, T., Zhang, M., Mei, Q.: We know what@ you# tag: does the dual
role affect hashtag adoption? In: WWW 2012, pp. 261–270 (2012)
20. Yu, Y., Wang, C., Gao, Y.: Attributes coupling based item enhanced matrix factor-
ization technique for recommender systems. arXiv preprint. (2014). arxiv:1405.0770
21. Zarrinkalam, F., Fani, H., Bagheri, E., Kahani, M., Du, W.: Semantics-enabled
user interest detection from twitter. In: WI-IAT 2015 (2015)
Topics in Tweets: A User Study of Topic
Coherence Metrics for Twitter Data
1 Introduction
Twitter is an important platform for users to express their ideas and preferences.
In order to examine the information environment on Twitter, it is critical for
scholars to understand the topics expressed by users. To do this, researchers have
turned to topic modelling approaches [1,2], such as Latent Dirichlet Allocation
(LDA). In topic models, a document can belong to multiple topics, while a topic
is considered a multinomial probability distribution over terms [3]. The exami-
nation of a topic’s term distribution can help researchers to examine what the
topic represents [4,5]. To present researchers with interpretable and meaning-
ful topics, several topic coherence metrics have been previously proposed [6–8].
However, these metrics were developed based on corpora of news articles and
books, which are dissimilar to corpora of tweets, in that the latter are brief
(i.e. < 140 characters), contain colloquial statements or snippets of conversation,
and use peculiarities such as hashtags. Indeed, while topic modelling approaches
specific to Twitter have been developed (e.g. Twitter LDA [2]), the suitability
of these coherence metrics for Twitter data has not been tested.
In this paper, we empirically investigate the appropriateness of ten auto-
matic topic coherence metrics, by comparing how closely they align with human
judgments of topic coherence. Of these ten metrics, three examine the statis-
tical coherence of a topic at the term/document distribution level, while the
remaining seven consider if the terms within a topic exhibit semantic similar-
ity, as measured by their alignment with external resources such as Wikipedia or
WordNet. In this work, we propose two new coherence metrics based on semantic
similarity, which use a separate background dataset of tweets.
To evaluate which coherence metrics most closely align with human judg-
ments, we firstly use three different topic modelling approaches (namely LDA,
Twitter LDA (TLDA) [2], and Pachinko Allocation Model (PAM) [9]) to gener-
ate topics on corpora of tweets. Then, for pairs of topics, we ask crowdsourcing
workers to choose what they perceive to be the more coherent topic. By con-
sidering the pairwise preferences of the workers, we then identify the coherence
metric that is best aligned with human judgments.
Our contributions are as follows: (1) we conduct a large-scale empirical crowd-
sourced user study to identify the coherence of topics generated by three different
topic modelling approaches upon two Twitter datasets; (2) we use these pairwise
coherence preferences to assess the suitability of 10 topic coherence metrics for
Twitter data; (3) we propose two new topic coherence metrics, and show that
our proposed coherence metric based on Pointwise Mutual Information using a
Twitter background dataset is the most similar to human judgments.
The remainder of this paper is structured as follows: Sect. 2 provides an
introduction to topic modelling; Sect. 3 reports the related work of evaluating
topic models; Sect. 4 describes 10 topic coherence metrics; Sect. 5 shows how we
compare automatic metrics to human judgments; Sect. 6 describes the Twitter
datasets we use in the user study (Sect. 7), while the experimental setup and the
results are discussed in Sects. 8 and 9. Finally, we provide concluding summaries
in Sect. 10.
4 Topic Coherence Metrics
In this section, we describe the topic coherence metrics that we use to automat-
ically evaluate the topics generated by topic modelling approaches. There are
two types of coherence metrics: (1) metrics based on semantic similarity (intro-
duced in [7,8]) and (2) metrics based on statistical analysis (introduced in [6]).
We propose two new metrics based on semantic similarity, which use a Twitter
background dataset.
\[ \mathrm{Coherence(topic)} = \frac{1}{45} \sum_{i=1}^{10} \sum_{j=i+1}^{10} SS(w_i, w_j) \qquad (1) \]
\[ \mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)} \qquad (2) \]
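As a sketch of how Eqs. (1) and (2) combine, the following Python snippet scores a topic by the average pairwise PMI of its top-10 words, estimating the probabilities from document-level co-occurrence in a background corpus of tokenised tweets; the small smoothing constant is an assumption added to avoid log(0).

import math
from itertools import combinations

def pmi_coherence(top_words, background_docs, eps=1e-12):
    # background_docs: list of tokenised documents (e.g. tweets)
    doc_sets = [set(doc) for doc in background_docs]
    n_docs = len(doc_sets)
    def p(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs
    pairs = list(combinations(top_words[:10], 2))   # 45 pairs for 10 words
    total = sum(math.log((p(wi, wj) + eps) / ((p(wi) + eps) * (p(wj) + eps)))
                for wi, wj in pairs)
    return total / len(pairs)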
WordNet. WordNet groups words into synsets [14]. There are a number of
semantic similarity and relatedness methods in the existing literature. Among
them, the method designed by Leacock and Chodorow [15] (denoted as LCH) and that
designed by Jiang and Conrath [16] (denoted as JCN) are especially useful for discovering
lexical similarity [8]. Apart from these two methods, Newman et al. [8] also
showed that the method from Lesk [17] (denoted as LESK) performs well
in capturing the similarity of word pairs. Therefore, we select these 3 WordNet-
based methods to calculate the semantic similarities of the topic’s word pairs,
and produce a topic coherence score.
Wikipedia. Wikipedia has been previously used as background data to calcu-
late the semantic similarity of words [18,19]. In this paper, we select two pop-
ular approaches in the existing literature on calculating the semantic similarity
of words: Pointwise Mutual Information (PMI) and Latent Semantic Analy-
sis [20] (LSA). PMI is a popular method to capture semantic similarity [7,8,18].
Newman et al. [7,8] reported that the performance of PMI was close to human
judgments when assessing a topic’s coherence. Here, the PMI data (denoted
as W-PMI) consists of the PMI scores of word pairs from Wikipedia, computed
using Eq. (2). On the other hand, since it has been reported that the
performance of PMI is no better than LSA on capturing the semantic similarity
of word pairs [19], in this paper we also use LSA to obtain the similarity of
the word pairs. In the LSA model, a corpus is represented by a term-document
matrix. The cells represent the frequency of a term occurring in a document. To
reduce the dimensionality of this matrix, a singular value decomposition is applied
to the matrix, keeping the k largest singular values. After the decomposition, each
term is represented by a dense vector in the reduced LSA space. The semantic
similarity of terms can then be computed using a distance metric (e.g., cosine similar-
ity) between the terms’ vectors. We use Wikipedia articles as background data
and calculate the LSA space (denoted as W-LSA), which is a collection of term
vectors in 300 dimensions described in [21].
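The W-LSA and T-LSA metrics rely on term similarities in such a reduced space. The paper uses a precomputed 300-dimensional Wikipedia LSA space from the SEMILAR platform; purely as an illustration, the sketch below builds a small LSA space locally with scikit-learn and returns a cosine-similarity function over terms.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def build_lsa_similarity(documents, k=300):
    # documents: list of raw text documents used as background data;
    # k must be smaller than the vocabulary size
    vec = CountVectorizer()
    X = vec.fit_transform(documents)        # document-term frequency matrix
    svd = TruncatedSVD(n_components=k, random_state=0)
    svd.fit(X)
    term_vecs = svd.components_.T           # one k-dimensional vector per term
    vocab = vec.vocabulary_
    def similarity(w1, w2):
        if w1 not in vocab or w2 not in vocab:
            return 0.0
        v1, v2 = term_vecs[vocab[w1]], term_vecs[vocab[w2]]
        denom = np.linalg.norm(v1) * np.linalg.norm(v2)
        return float(v1 @ v2 / denom) if denom > 0 else 0.0
    return similarity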
Properties of how terms or documents are assigned to the topics can be
indicative of the coherence of a topic model. In this section, we describe the
term/document distributions of 3 types of meaningless topics defined in [6]:
a uniform distribution over terms; a semantically vacuous distribution over
terms; and a background distribution over documents. We explain how these
permit the measurement of the coherence of a topic.
Uniform Term Distribution. In a topic’s term distribution, if all terms tend
to have an equal and constant probability, this topic is unlikely to be meaningful
or easily interpreted. A typical uniform term distribution φ_uni is defined in
Eq. (3), where i is the term index and N_k is the total number of terms in topic k.
\[ \phi_{\mathrm{uni}} = \{ P(w_1), P(w_2), \ldots, P(w_{N_k}) \}, \qquad P(w_i) = \frac{1}{N_k} \qquad (3) \]
1 Note that many hashtags are not recorded in Wikipedia.
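The corresponding metric can be implemented as a divergence between a topic’s term distribution and φ_uni; a minimal sketch using KL divergence (one common choice, assumed here rather than taken from [6]) is:

import numpy as np

def distance_from_uniform(phi, eps=1e-12):
    # phi: a topic's term distribution (non-negative, summing to 1)
    phi = np.asarray(phi, dtype=float)
    uniform = 1.0 / len(phi)
    return float(np.sum(phi * np.log((phi + eps) / uniform)))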
TLDA and PAM. For each topic in each topic pair, an automatic coherence
metric gives a coherence score to each topic respectively. Thus, for each com-
parison unit, there is a group of data pairs. We apply the Wilcoxon signed-rank
test to calculate the significance level of the difference between the two groups
of data samples. For each comparison unit, an automatic coherence metric deter-
mines the better topic model between two approaches (e.g. LDA > TLDA),
which results in a ranking order of the three topic modelling approaches. For
instance, given the preferences LDA > TLDA, LDA > PAM & TLDA > PAM,
we can obtain the ranking order LDA(1st ) > TLDA(2nd ) > PAM(3rd ). However,
while it is possible for the preference results of comparison units not to permit a
ranking order to be obtained – i.e. a Condorcet paradox such as TLDA > LDA,
LDA > PAM & PAM > TLDA – we did not observe any such paradoxes in our
experiments.
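A hedged sketch of this comparison step for a single comparison unit, using SciPy’s Wilcoxon signed-rank test on the paired coherence scores, might look as follows; the significance level and the use of the score sums to decide the direction of the preference are illustrative assumptions.

from scipy.stats import wilcoxon

def compare_unit(scores_a, scores_b, name_a, name_b, alpha=0.05):
    # scores_a[i] and scores_b[i] are the coherence scores of the two topics in topic pair i
    _, p = wilcoxon(scores_a, scores_b)
    if p > alpha:
        return "no significant difference"
    return f"{name_a} > {name_b}" if sum(scores_a) > sum(scores_b) else f"{name_b} > {name_a}"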
Human Evaluation. Similarly to the above, we also rank the three topic modelling
approaches using the topic coherence assessments from humans described in
Sect. 7. The ranking order obtained from humans is compared to
that generated from the ten automatic coherence metrics to ascertain the most
suitable coherence metric when assessing a topic’s coherence.
6 Twitter Datasets
In our experiments, we use two Twitter datasets to compare the topic coherence
metrics. The first dataset we use consists of tweets posted by 2,853 newspaper
journalists in the state of New York from 20 May 2015 to 19 Aug 2015, denoted
as NYJ. To construct this dataset, we tracked the journalists’ Twitter handles
using the Twitter Streaming API2 . We choose this dataset due to the high
volume of topics discussed by journalists on Twitter. The second dataset contains
tweets related to the first TV debate during the UK General Election 2015. This
dataset was collected by searching the TV debate-related hashtags and keywords
(e.g. #TVDebate and #LeaderDebate) using the Twitter Streaming API. We
choose this dataset because social scientists want to understand what topics
people discuss. Table 2 reports the details of these two datasets. We describe our
user study and experimental setups in Sects. 7 and 8, respectively.
2 http://dev.twitter.com.
Fig. 1. The designed user interface and the associated tweets for two topics.
to the United States, we limit the workers’ country to the United States only.
The TV debate dataset contains topics that can be easily understood, and thus
we set the workers’ country to English-speaking countries (e.g. United Kingdom,
United States, etc.). Overall, 77 and 91 distinct trusted workers were selected
for the NYJ and TV debate datasets, respectively.
Human Ground-Truth Ranking Order. As described above, we obtain 5
human judgments for each topic pair. A topic receives one vote if it is preferred
by one worker. Thus, we assign each topic in each topic pair a fraction of the
5 votes received. A higher number of votes indicates that the topic is judged as
being more coherent. Hence, for each comparison unit, we obtain a number of
data pairs. Then, we apply the methodology described in Sect. 5 to obtain the
human ground-truth ranking order of the three topic modelling approaches, i.e.
1st , 2nd , 3rd .
8 Experimental Setup
In this section, we describe the experimental setup for generating the topics and
implementing the automatic metrics.
Generating Topics. We use Mallet4 and Twitter LDA5 to deploy the three
topic modelling approaches on the two datasets (described in Sect. 6). The LDA
hyper-parameters α and β are set to 50/K and 0.01 respectively, which work well
for most corpora [3]. In TLDA, we follow [2] and set γ to 20. We set the number
of topics K to a higher number, 100, for the NYJ dataset as it contains many
topics. The TV debate dataset contains fewer topics, particularly as it took place
only over a 2-hour period, and politicians were asked to respond to questions on
specific themes and ideas6 . Hence, we set K to 30 for the TV debate dataset.
Each topic modelling approach is run 5 times for the two datasets. Therefore, for
each topic modelling approach, we obtain 500 topics in the NYJ dataset and 150
topics in the TV debate dataset. We use the methodology described in Sect. 5
to generate 100 topic pairs for each comparison unit. For example, for comparison
Unit(LDA,TLDA), we generate 50 topic pairs of Pairs(LDA→TLDA) and 50
topic pairs of Pairs(TLDA→LDA).
4 http://mallet.cs.umass.edu.
5 http://github.com/minghui/Twitter-LDA.
6 http://goo.gl/JtzJDz.
Metrics Setup. Our metrics using WordNet (LCH, JCN & LESK) are imple-
mented using the WordNet::Similarity package. We use the Wikipedia LSA space
and the PMI data from the SEMILAR platform7 to implement the W-LSA and
W-PMI metrics. Since there are too many terms and tweets in our Twitter back-
ground dataset, we remove stopwords, terms occurring in fewer than 20 tweets,
tweets with fewer than 10 terms, and retweets. These steps help to reduce
the computational complexity of LSA and PMI using this Twitter background
dataset. After this pre-processing, the number of remaining tweets is 30,151,847.
Table 3 shows the size of T-LSA space and the number of word pairs in T-PMI.
Table 3. The size of LSA space and the number of word pairs.
9 Results
We first compare the ranking order of the three topic modelling approaches
using the automatic coherence metrics and human judgments. Then we show
the differences between each of the automatic metrics and human judgments.
Table 4 reports the average coherence score of the three topic models using
the ten automatic metrics (displayed in white background). We also average the
fraction of human votes of the three topic models, shown in Table 4 as column
“human” (shown in grey background). We apply the methodology introduced
in Sect. 5 to obtain the ranking orders shown in Table 4 as column “rank”. By
comparing the human ground-truth ranking orders of the three topic modelling
approaches, we observe that the three topic modelling approaches perform dif-
ferently over the two datasets.
Firstly, we observe that the ranking order from our proposed PMI-based
metric using the Twitter background dataset (T-PMI) best matches the human
ground-truth ranking order across our two Twitter datasets. This indicates that
T-PMI can best capture the performance differences of the three topic mod-
elling approaches. However, our other proposed metric T-LSA does not allow
statistically distinguishable differences between topic modelling approaches to
be identified (denoted by “×”). Second, for metrics based on semantic simi-
larity, both W-PMI and W-LSA produce the same or a similar8 ranking order
7 http://semanticsimilarity.org.
8 Part of the order matches the order from humans.
Table 4. The results of the automatic topic coherence metrics on the two datasets
and the corresponding ranking orders. “×” means no statistically significant differences
(p ≤ 0.05) among the three topic modelling approaches. Two topic modelling approaches
have the same rank if there are no significant differences between them.
as humans on the two datasets. However, both W-PMI and W-LSA perform no
better than the T-PMI metric. On the other hand, for metrics based on statistical
analysis, the B metric (statistical analysis on the document distribution) can
also lead to a similar performance as W-LSA or W-PMI compared to human
judgments. Moreover, our results show that the remaining metrics perform no
better than T-PMI, W-PMI & W-LSA metrics according to the ranking orders,
i.e. their ranking orders do not match the human ground-truth ranking order.
To further compare the automatic coherence metrics and human judgments,
we use the sign test to determine whether the 10 automatic metrics perform dif-
ferently than human judgments. Specifically, for an automatic metric or human
judgments, we obtain 100 preference data points from 100 topic pairs for a com-
parison unit (e.g., Unit(T1, T2)), where “1”/“−1” represents that the topic from
T1 /T2 is preferred and “0” means no preference. Then, we hypothesise that there
are no differences between the preference data points from an automatic met-
ric and that from humans for a comparison unit (null hypothesis), and thus
we calculate the p-values reported in Table 5. Each metric gets 6 tests (3 tests
from the NYJ dataset and 3 tests from the TV debate dataset). If p ≤ 0.05, the
null hypothesis is rejected, which means that there are differences between the
preferences of the same comparison unit between a given metric and humans.
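A minimal sketch of this sign test, assuming the standard exact binomial formulation with ties discarded, is:

from scipy.stats import binomtest

def sign_test(metric_prefs, human_prefs):
    # preference points are 1, -1 or 0 for each of the 100 topic pairs; ties are discarded
    diffs = [m - h for m, h in zip(metric_prefs, human_prefs) if m != h]
    if not diffs:
        return 1.0
    pos = sum(d > 0 for d in diffs)
    return binomtest(pos, n=len(diffs), p=0.5).pvalue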
We observe that the null hypotheses of all 6 tests of the T-PMI metric are not
rejected across the two datasets. This suggests that T-PMI is the coherence
metric most aligned with human judgments, since there are no detected differences
between T-PMI and human judgments for any of the comparison units (shown in
Table 5, p ≥ 0.05). Moreover, only one test of W-PMI shows preference differences in
a comparison unit (i.e. Unit(TLDA,PAM) in the NYJ dataset, where the null
hypothesis is rejected) while W-LSA gets two tests rejected. Apart from these
Table 5. The p-values of the sign tests comparing the preferences of each automatic metric with human judgments for each comparison unit.
NYJ
               LCH      JCN      LESK     W-LSA   W-PMI   T-LSA    T-PMI   U        V        B
LDA vs. TLDA   0.104    0.133    0.039    0.783   0.779   0.097    0.410   4.1e-11  0.787    2.2e-13
TLDA vs. PAM   2.7e-9   3.8e-10  0.0      1.8e-7  1.1e-4  1.7e-10  1.0     0.007    8.1e-13  0.007
LDA vs. PAM    2.2e-13  3.4e-11  7.2e-14  0.001   0.210   3.0e-11  0.145   1.0      2.4e-10  0.003
TV debate
               LCH      JCN      LESK     W-LSA   W-PMI   T-LSA    T-PMI   U        V        B
LDA vs. TLDA   0.010    0.104    0.075    0.999   0.401   0.651    0.999   1.2e-6   2.0e-5   0.011
TLDA vs. PAM   0.003    0.007    0.005    0.211   0.568   0.010    0.783   4.7e-5   0.003    3.6e-12
LDA vs. PAM    0.174    0.007    0.576    0.671   0.391   0.791    0.882   0.391    0.895    0.202
three metrics, the tests of the other metrics indicate that there are significant
differences between these metrics and human judgments in most of the compari-
son units. In summary, we find that the T-PMI metric demonstrates the best
alignment with human preferences.
10 Conclusions
In this paper, we used three topic modelling approaches to evaluate the effec-
tiveness of ten automatic topic coherence metrics for assessing the coherence of
topic models generated from two Twitter datasets. Moreover, we proposed two
new topic coherence metrics that use a separate Twitter dataset as background
data when measuring the coherence of topics. By using crowdsourcing to obtain
pairwise user preferences of topical coherences, we determined how closely each
of the ten metrics align with the human judgments. We showed that our pro-
posed PMI-based metric (T-PMI) provided the highest levels of agreement with
the human assessments of topic coherence. Therefore, we recommend its use in
assessing the coherence of topics generated from Twitter. If Twitter background
data is not available, then we suggest using PMI-based and LSA-based metrics
with Wikipedia as background data (cf. W-PMI & W-LSA). Among the metrics
not requiring background data, the B metric (statistical analysis on the docu-
ment distribution) is the most aligned with user preferences. For future work,
we will investigate how to use the topic coherence metrics such that the topic
modelling approaches can be automatically tuned to generate topics with high
coherence.
References
1. Hong, L., Davison, B.D.: Empirical study of topic modeling in Twitter. In: Pro-
ceedings of SOMA (2010)
2. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., Li, X.: Comparing
Twitter and traditional media using topic models. In: Clough, P., Foley, C., Gurrin,
C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol.
6611, pp. 338–349. Springer, Heidelberg (2011)
3. Steyvers, M., Griffiths, T.: Probabilistic topic models. Handb. Latent Semant.
Anal. 427(7), 424–440 (2007)
4. Mei, Q., Shen, X., Zhai, C.: Automatic labeling of multinomial topic models. In:
Proceedings of SIGKDD (2007)
5. Fang, A., Ounis, I., Habel, P., Macdonald, C., Limsopatham, N.: Topic-centric
classification of Twitter user’s political orientation. In: Proceedings of SIGIR (2015)
6. AlSumait, L., Barbará, D., Gentle, J., Domeniconi, C.: Topic significance ranking
of LDA generative models. In: Proceedings of ECMLPKDD (2009)
7. Newman, D., Karimi, S., Cavedon, L.: External evaluation of topic models. In:
Proceedings of ADCS (2009)
8. Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic
coherence. In: Proceedings of NAACL (2010)
9. Li, W., McCallum, A.: Pachinko allocation: DAG-structured mixture models of
topic correlations. In: Proceedings of ICML (2006)
10. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
11. Li, W., Blei, D., McCallum, A.: Nonparametric bayes pachinko allocation. In: Pro-
ceedings of UAI (2007)
12. Wallach, H.M., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for
topic models. In: Proceedings of ICML (2009)
13. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea leaves:
how humans interpret topic models. In: Proceedings of NIPS (2009)
14. Fellbaum, C.: WordNet. Wiley Online Library, New York (1998)
15. Leacock, C., Chodorow, M.: Combining local context and WordNet similarity for
word sense identification. WordNet Electr. Lexical Database 49(2), 265–283 (1998)
16. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and
lexical taxonomy. In: Proceedings of ICRCL (1997)
17. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries:
how to tell a pine cone from an ice cream cone. In: Proceedings of SIGDOC (1986)
18. Rus, V., Lintean, M.C., Banjade, R., Niraula, N.B., Stefanescu, D.: SEMILAR:
the semantic similarity toolkit. In: Proceedings of ACL (2013)
19. Recchia, G., Jones, M.N.: More data trumps smarter algorithms: comparing point-
wise mutual information with latent semantic analysis. Behav. Res. Meth. 41(3),
647–656 (2009)
20. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analy-
sis. Discourse Processes 25(2–3), 259–284 (1998)
21. Stefănescu, D., Banjade, R., Rus, V.: Latent semantic analysis models on wikipedia
and TASA. In: Proceedings of LREC (2014)
22. Carterette, B., Bennett, P.N., Chickering, D.M., Dumais, S.T.: Here or there. In:
Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR
2008. LNCS, vol. 4956, pp. 16–27. Springer, Heidelberg (2008)
23. Mackie, S., McCreadie, R., Macdonald, C., Ounis, I.: On choosing an effective
automatic evaluation metric for microblog summarisation. In: Proceedings of IIiX
(2014)
Retrieval Models
Supporting Scholarly Search with Keyqueries
1 Introduction
We tackle the problem of automatically supporting a scholar’s search for related
work. Given a research task, the term “related work” refers to papers on similar
topics. Scholars collect and analyze related work in order to get a better under-
standing of their research problem and already existing approaches; a survey of
the strengths and weaknesses of related work forms the basis for placing new
ideas into context. In this paper, we show how the concept of keyqueries [9] can
be employed to support search for related work.
Search engines like Google Scholar, Semantic Scholar, Microsoft Academic
Search, or CiteSeerX provide a keyword-based access to their paper collections.
However, since researchers usually have limited knowledge when they start to
investigate a new topic, it is difficult to find all the related papers with one
or two queries against such interfaces. Keyword queries help to identify a few
promising initial papers, but to find further papers, researchers usually bootstrap
their search from information in these initial papers.
Every paper provides two types of information useful for finding related work:
content and metadata. Content (title, abstract, body text) is a good resource
2 Related Work
Methods for identifying related work can be divided into citation-graph-based
and content-based approaches. Only few of the content-based methods use queries
such that we also investigate query formulation techniques for similar tasks.
we will avoid in our query formulation. Other combined approaches use topic
models for citation weighting [7] or overcoming the cold-start-problem of papers
without ratings in online reference management communities like CiteULike or
Mendeley [31]. The CiteSight system [20] recommends references while a manuscript
is being written, using both graph- and content-based features to rec-
ommend papers the author cited in the past, or cited papers from references the
author already added. Since our use case is different and considering only cited
papers appears very restrictive, we do not employ this approach either.
As a representative of the combined approaches, we use Google Scholar’s
“related articles” feature as an often-used and very strong baseline. Even
though the underlying algorithms are proprietary, it is reasonable to assume that
content-based features (e.g., text similarity) and citation-based features (e.g.,
number of citations) are combined.
Since the existing query techniques for related work search are rather simplis-
tic, we briefly review querying strategies for other problems that inspired our
approach.
There are several query-by-document approaches that derive “fingerprint”
queries for a document in near-duplicate, text reuse, or similarity detec-
tion [2,5,32]. Hagen and Stein [12] further improve these query formulation
strategies trying to satisfy a so-called covering property and the User-over-
Ranking hypothesis [27]. Recently, Gollub et al. also introduced the concept
of keyqueries for describing a document’s content [9]. A query is a keyquery for
a document if it returns the document in the top-ranked results when submitted
to a reference search engine. Instead of just representing a paper by its keyqueries
(as suggested by Gollub et al.), we further generalize the idea and assume that
the other top-ranked results returned by a paper’s keyqueries are highly related
to the paper. We will adjust the keyquery formulation to our case of poten-
tially more than one input paper and combine it with the covering property of
Hagen and Stein [12] to derive a keyquery cover for the input papers.
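Under this definition, checking whether a candidate query is a keyquery for a paper reduces to one search-engine call; a hedged sketch is given below, where search is an assumed callable returning a ranked list of document ids, and the use of the generality levels k and l (top-k containment, at least l results) is an assumed reading rather than the exact definition from [9].

def is_keyquery(query, doc_id, search, k=10, l=10):
    # search: assumed callable returning a ranked list of document ids for a query string
    results = search(query)
    # doc_id must appear in the top-k results and the query must not be overly
    # specific (at least l results)
    return doc_id in results[:k] and len(results) >= l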
3.1 Baselines
The baselines are chosen from the literature to mimic the strategies scholars
employ: formulating queries, following citations, and Google Scholar’s “related
articles.”
the topics represented by D’s vocabulary W . The covering property [12] states
that (1) in a proper set Q of queries for W , each term w ∈ W should be contained
in at least one query q ∈ Q (i.e., the queries “cover” W in a set-theoretic sense)
and (2) Q should be simple (i.e., qi ⊄ qj for any qi, qj ∈ Q with i ≠ j), to avoid
redundancy. The formal problem we tackle then is:
Keyquery Cover
Given: (1) A vocabulary W extracted from a set D of documents
(2) Levels k and l describing keyquery generality
Task: Find a simple set Q ⊆ 2^W of queries that are keyqueries for every
d ∈ D with respect to k and l and that together cover W.
The parameters k and l are typically set to 10, 50, or 100 but it will not
always be possible to find a covering set of queries that are keyqueries for all
documents in D. In such a case, we strive for queries that are keyqueries for a
(|D| − 1)-subset of D.
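A greedy sketch of one way such a cover could be computed is shown below; the candidate enumeration, the maximum query length, and the greedy order are illustrative assumptions and not the paper’s exact Keyquery Cover procedure.

from itertools import combinations

def keyquery_cover(W, docs, search, k=10, l=10, max_query_len=3):
    # W: ranked vocabulary (list of keyphrases); docs: ids of the input papers;
    # search: callable returning a ranked list of document ids for a query string
    def is_keyquery_for_all(q):
        results = search(" ".join(sorted(q)))
        return len(results) >= l and all(d in results[:k] for d in docs)
    uncovered, cover = set(W), []
    candidates = [set(c) for n in range(1, max_query_len + 1)
                  for c in combinations(W, n)]
    for q in candidates:                       # shorter candidates come first
        if not uncovered:
            break
        # keep Q simple: never add a superset of an already chosen query
        if q & uncovered and not any(prev <= q for prev in cover) and is_keyquery_for_all(q):
            cover.append(q)
            uncovered -= q
    return cover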
Solving Related Work Search. The pseudocode of our algorithm using key-
query covers for related work search is given as Algorithm 2. Using the Key-
query Cover algorithm as a subroutine, the basic idea is to first try to find
keyqueries for all given input papers, and to add the corresponding results to the
set C of candidates. If there are too few candidates after this step, Keyquery
Cover is solved for combinations with |I| − 1 input papers, then with |I| − 2
input papers, etc.
The vocabulary combination (line 3) is based on the top-20 keyphrases per
paper extracted by KP-Miner [8], the best unsupervised keyphrase extractor for
research papers in SemEval 2010 [18]. The terms (keyphrases) in the combined
vocabulary list W are ranked by the following strategy: first, all terms that
appear in all papers are ranked according to their mean rank in the different lists;
below these, all terms contained in (|D| − 1)-sized subsets are ranked according to
their mean ranks, etc.
In the next steps (lines 4–6), the Keyquery Cover instance is solved for
the subset D and its combined vocabulary W, and the found candidate papers are
added to the candidate set C. If enough candidates are found, the
algorithm stops (line 7). Otherwise, some other input subset is used in the next
iteration. If not enough candidates can be found with keyqueries for more than
one paper, the keyqueries for the single papers form the fallback option (also
applies to single-paper inputs).
In our experiments, we will set k, l = 10 and c = 100 · |I| (to be comparable,
also the baselines are set to retrieve 100 candidate papers per input paper).
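Since Algorithm 2 itself is not reproduced in this excerpt, the following sketch only illustrates the described iteration over ever smaller input subsets; combine_vocabularies stands in for the keyphrase-based vocabulary combination described above and keyquery_cover for the subroutine sketched earlier, so both names (and the stopping criterion) are assumptions.

from itertools import combinations

def related_work_candidates(input_papers, combine_vocabularies, keyquery_cover,
                            search, c, k=10, l=10):
    # Collect candidate papers from the top-ranked results of keyquery covers,
    # starting with all input papers and falling back to smaller subsets.
    candidates = set()
    for size in range(len(input_papers), 0, -1):
        for subset in combinations(input_papers, size):
            W = combine_vocabularies(subset)
            for q in keyquery_cover(W, subset, search, k, l):
                candidates.update(search(" ".join(sorted(q)))[:l])
            if len(candidates) >= c:
                return candidates
    return candidates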
4 Evaluation
User Study Design. Topics (information needs) and relevance judgments for a
Cranfield-style analysis of the Related Work Search approaches are obtained
from computer science students and scholars (other qualifications do not match
the corpus characteristics). A study participant first specifies a topic by selecting
a set of input papers (which could be just one) and then judges the found papers of
the different approaches with respect to their relevance. These judgments are
the basis for our experimental comparison.
In order to ensure a smooth workflow, we have built a web interface that also
allows a user to participate without being on site. The study itself consists of two
steps with different interfaces. In the first step, a participant is asked to enter a
research task they are familiar with and describe it in one or two sentences.
Note that we request a familiar research task because expert knowledge is later
required in order to judge the relevance of the suggested papers. After task
description, the participants have to enter titles of input papers related to their
task. While the user is entering a title, a background process automatically
suggests title auto-completions from our corpus. Whenever a title is not chosen
from the suggestions but is manually entered, it is again checked whether the
specified paper exists in our corpus. If the paper cannot be found in the
collection, the user is asked to enter another title. After this two-phase topic
formation (written description + input papers), the participants have to name
at least one paper that they expect to be found (again with the help of auto
completion). Last, the users should describe how they have chosen the input and
expected papers to get some feedback on whether, for instance, Google Scholar
was used, which might bias the judgments.
After a participant has completed the topic formation, the different recom-
mendation algorithms are run on the input papers and the pooled set of the
top-10 results of each approach is displayed in random order for judgment (i.e.,
at most 40 papers to judge). For each paper, the fields title, authors, publication
venue, and publication year are shown. Additionally, links to fade in the abstract
and, if available in our corpus, to the respective PDF-file are listed. Thus, the
participants can check the abstract or the full text if needed.
The participants assessed two criteria: relevance and familiarity. Relevance
was rated on a 4-point scale (highly relevant, relevant, marginally relevant, not relevant)
while familiarity was a two-level judgment (familiar or not). Combining the two
criteria, we can identify good papers not known to the participant before our
study. This is especially interesting since we asked for research topics the par-
ticipants are familiar with and in such a scenario algorithms identifying relevant
papers not known before are very promising.
General Retrieval Performance. We measure nDCG using the 4-point scale: highly
relevant as 3, relevant as 2, marginally relevant as 1, and not relevant as 0, and
precision considering highly relevant and relevant as the relevant class. The mean
nDCG@10 and prec@10 over all topics are given in the second and third columns
of Table 1 (Left).
Table 1. (Left) Performance values achieved by the different algorithms averaged over
all topics. (Right) Percentage of top-10 rank overlap averaged over all topics.
The top-10 of our new keyquery-cover-based approach KQC are the most relevant
among the individual methods, on a par with Google Scholar and Sofia Search;
both differences are not significant according to a two-sided paired t-test (p =
0.05). The overall best result is achieved by the KQC+Sofia+Google combination
that significantly outperforms its components. Another observation is that most
methods significantly outperform the Nascimento et al. baseline.
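For reference, the nDCG@10 computation on the 4-point scale can be sketched as below; this uses the common linear-gain, log2-discount variant of [16] and assumes the ideal ranking is formed from all judged papers of the topic, which may differ in detail from the authors’ exact setup.

import math

def ndcg_at_10(ranked_grades, all_grades):
    # ranked_grades: grades (3/2/1/0) of the top-10 retrieved papers, in rank order
    # all_grades: grades of all judged papers for the topic (for the ideal ranking)
    dcg = sum(g / math.log2(r + 2) for r, g in enumerate(ranked_grades[:10]))
    ideal = sorted(all_grades, reverse=True)[:10]
    idcg = sum(g / math.log2(r + 2) for r, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0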
Recall of Expected Papers. To analyze how many papers were retrieved that the
scholars entered as expected in the first step of our study, we use the recall of the
expected papers, denoted as rec_e. Since the computation of rec_e is not dependent
on the obtained relevance judgments, we can compute recall for any k. We choose
k = 50 since we assume that a human would often not consider many more than
the top-50 papers of a single related work search approach. The fourth column of
Table 1 (Left) shows the average rec_e@50 values for each algorithm over all topics.
The best rec_e@50 among the individual methods is Google Scholar’s 0.43. The
combinations including KQC and Google Scholar achieve an even better result
of 0.48. An explanation for the advantage of Google Scholar—beyond the proba-
bly good black-box model underlying the “related articles” functionality—can be
found in the participants’ free text fields of topic formation. For several topics, the
study participants stated that they used Google Scholar to come up with the set
of expected papers and this obviously biases the results.
Recall of Unfamiliar but Relevant Papers. We also measure how many “gems”
the different algorithms recommend. That is, how many highly relevant or relevant papers
are found that the user was not familiar with before our study. Providing such
papers is a very interesting feature that might eventually help a researcher to find
“all” relevant literature on a given topic. Again we measure recall, but since we
have familiarity judgments for the top-10 only, we measure the recall rec_ur@10 of
unexpected but relevant papers, listed in the rightmost column of Table 1 (Left).
Again, the combination KQC+Sofia+Google finds the largest number of unfamiliar
but relevant papers.
Result Overlap. Since no participant had to judge 40 different papers, the top-
10 results cannot be completely distinct; Table 1 (Right) shows the percentage
of overlap for each combination. On average, the top-10 retrieved papers of two
different algorithms share 2–4 papers meaning that the approaches retrieve a
rather diverse set of related papers. This again suggests combinations of differ-
ent approaches as the best possible system and combining the best query-based
method (our new KQC), Google Scholar, and Sofia Search indeed achieves the best
overall performance. Having in mind the “sparsity” of our crawled paper corpus’
citation graph (only about 60 % of the cited/citing papers could be crawled), Sofia
Search probably could diversify the results even more in a corpus containing all ref-
erences. In our opinion, the combination of the three systems also models the
human way of looking for related work very well, such that the KQC+Sofia+Google
combination could be viewed as very close to automated human behavior.
A Word on Efficiency. The most costly part of our approach is the number
of submitted queries. In our study, about 79 queries were submitted per topic;
results for already submitted queries were cached such that no query was sub-
mitted twice (e.g., once when trying to find a keyquery for all input papers
together, another time for a smaller subset again). On the one hand, 79 queries
might be viewed more costly than the about 27 queries submitted by the
Nascimento et al. baseline (10·|I| queries) or the about 30 requests submitted for
the Google Scholar suggestions (11 · |I| requests; one to find an individual input
paper, ten to retrieve the 100 related articles). Also Sofia Search on a good index
of the citation graph is much faster than our keyquery-based approach. On the
other hand, keyqueries could be pre-computed at indexing time for every paper
such that at retrieval time only a few posting lists from a reverted index [25] have
to be merged. This would substantially speed up the whole process, rendering
a reverted-index-based variant of our keyquery approach an important step for
deploying the first real prototype.
5 Conclusion
We have presented a novel keyquery-based approach to related work search.
The addressed common scenario is a scholar who has already found a hand-
ful of papers in an initial search and wants to find “all” the other related
papers—often a rather tedious task. Our problem formalization of Related
Work Search is meant to provide automatic support in such situations. As
for solving the problem, our new keyquery-based technique focuses on the con-
tent of the already found papers complementing most of the existing approaches
that exploit the citation graph only. Our overall idea is to get the best of both
worlds (i.e., queries and citations) from appropriate method combinations.
And in fact, in our effectiveness evaluations of a Cranfield-style experiment
on a collection of about 200,000 computer science papers, the combination of key-
queries with the citation-based Sofia Search and Google Scholar’s related article
suggestions performed best on 42 topics (i.e., sets of initial papers). The top-10
results of each approach were judged by the expert who suggested the topic.
Based on these relevance judgments, we have evaluated the different algorithms
and identified promising combinations based on the rather different returned results.
References
1. Beel, J., Langer, S., Genzmehr, M., Gipp, B., Breitinger, C., Nürnberger, A.:
Research paper recommender system evaluation: a quantitative literature survey.
In: RepSys Workshop, pp. 15–22 (2013)
2. Bendersky, M., Croft, W.B.: Finding text reuse on the web. In: WSDM, pp. 262–
271 (2009)
3. Bethard, S., Jurafsky, D.: Who should I cite: learning literature search models from
citation behavior. In: CIKM, pp. 609–618 (2010)
4. Caragea, C., Silvescu, A., Mitra, P., Giles, C.L.: Can’t see the forest for the trees?
a citation recommendation system. In: JCDL, pp. 111–114 (2013)
5. Dasdan, A., D’Alberto, P., Kolay, S., Drome, C.: Automatic retrieval of similar
content using search engine query interface. In: CIKM, pp. 701–710 (2009)
6. Ekstrand, M.D., Kannan, P., Stemper, J.A., Butler, J.T., Konstan, J.A., Riedl, J.:
Automatically building research reading lists. In: RecSys, pp. 159–166 (2010)
7. El-Arini, K., Guestrin, C.: Beyond keyword search: discovering relevant scientific
literature. In: KDD, pp. 439–447 (2011)
8. El-Beltagy, S.R., Rafea, A.A.: KP-Miner: a keyphrase extraction system for English
and Arabic documents. Inf. Syst. 34(1), 132–144 (2009)
9. Gollub, T., Hagen, M., Michel, M., Stein, B.: From keywords to keyqueries: content
descriptors for the web. In: SIGIR, pp. 981–984 (2013)
10. Golshan, B., Lappas, T., Terzi, E.: Sofia search: a tool for automating related-work
search. In: SIGMOD, pp. 621–624 (2012)
11. Hagen, M., Glimm, C.: Supporting more-like-this information needs: finding sim-
ilar web content in different scenarios. In: Kanoulas, E., Lupu, M., Clough, P.,
Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) CLEF 2014. LNCS, vol.
8685, pp. 50–61. Springer, Heidelberg (2014)
12. Hagen, M., Stein, B.: Candidate document retrieval for web-scale text reuse detec-
tion. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol.
7024, pp. 356–367. Springer, Heidelberg (2011)
13. He, Q., Kifer, D., Pei, J., Mitra, P., Giles, C.L.: Citation recommendation without
author supervision. In: WSDM, pp. 755–764 (2011)
14. He, Q., Pei, J., Kifer, D., Mitra, P., Giles, C.L.: Context-aware citation recommen-
dation. In: WWW, pp. 421–430 (2010)
15. Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C.L., Rokach, L.: Recom-
mending citations: translating papers into references. In: CIKM, pp. 1910–1914
(2012)
16. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques.
ACM Trans. Inf. Syst. 20(4), 422–446 (2002)
17. Kataria, S., Mitra, P., Bhatia, S.: Utilizing context in generative Bayesian models
for linked corpus. In: AAAI, pp. 1340–1345 (2010)
18. Kim, S.N., Medelyan, O., Kan, M.-Y., Baldwin, T.: SemEval-2010 task 5: auto-
matic keyphrase extraction from scientific articles. In: SemEval 2010, pp. 21–26
(2010)
19. Küçüktunç, O., Saule, E., Kaya, K., Catalyürek, Ü.V.: TheAdvisor: a webservice
for academic recommendation. In: JCDL, pp. 433–434 (2013)
20. Livne, A., Gokuladas, V., Teevan, J., Dumais, S., Adar, E.: CiteSight: supporting
contextual citation recommendation using differential search. In: SIGIR, pp. 807–
816 (2014)
21. Lu, Y., He, J., Shan, D., Yan, H.: Recommending citations with translation model.
In: CIKM, pp. 2017–2020 (2011)
22. Lykke, M., Larsen, B., Lund, H., Ingwersen, P.: Developing a test collection for
the evaluation of integrated search. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz,
U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS,
vol. 5993, pp. 627–630. Springer, Heidelberg (2010)
23. Nallapati, R., Ahmed, A., Xing, E.P., Cohen, W.W.: Joint latent topic models for
text and citations. In: KDD, pp. 542–550 (2008)
24. Nascimento, C., Laender, A.H.F., Soares da Silva, A., Gonçalves, M.A.: A source
independent framework for research paper recommendation. In: JCDL, pp. 297–306
(2011)
25. Pickens, J., Cooper, M., Golovchinsky, G.: Reverted indexing for feedback and
expansion. In: CIKM, pp. 1049–1058 (2010)
26. Robertson, S.E., Zaragoza, H., Taylor, M.J.: Simple BM25 extension to multiple
weighted fields. In: CIKM, pp. 42–49 (2004)
27. Stein, B., Hagen, M.: Introducing the user-over-ranking hypothesis. In: Clough, P.,
Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR
2011. LNCS, vol. 6611, pp. 503–509. Springer, Heidelberg (2011)
28. Sugiyama, K., Kan, M.-Y.: Exploiting potential citation papers in scholarly paper
recommendation. In: JCDL, pp. 153–162 (2013)
29. Tang, J., Zhang, J.: A discriminative approach to topic-based citation recommenda-
tion. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD
2009. LNCS, vol. 5476, pp. 572–579. Springer, Heidelberg (2009)
30. Tang, X., Wan, X., Zhang, X.: Cross-language context-aware citation recommen-
dation in scientific articles. In: SIGIR, pp. 817–826 (2014)
31. Wang, C., Blei, D.M.: Collaborative topic modeling for recommending scientific
articles. In: KDD, pp. 448–456 (2011)
32. Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P.G., Koudas, N., Papadias, D.: Query
by document. In: WSDM, pp. 34–43 (2009)
Pseudo-Query Reformulation
Fernando Diaz
1 Introduction
Most information retrieval systems operate by performing a single retrieval in
response to a query. Effective results sometimes require several manual reformu-
lations by the user [11] or semi-automatic reformulations assisted by the system
[10]. Although the reformulation process can be important to the user (e.g. in
order to gain perspective about the domain of interest), the process can also lead
to frustration and abandonment.
In many ways, the core information retrieval problem is to improve the initial
ranking and user satisfaction and, as a result, reduce the need for reformulations,
manual or semi-automatic. While there have been several advances in learning to
rank given a fixed query representation [15], there has been somewhat less atten-
tion, from a formal modeling perspective, given to automatically reformulating
the query before presenting the user with the retrieval results. One notable excep-
tion is pseudo-relevance feedback (PRF), the technique of using terms found in
the top retrieved documents to conduct a second retrieval [4]. PRF is known
to be a very strong baseline. However, it incurs a very high computational cost
because it issues a second, much longer query for retrieval.
In this paper, we present an approach to automatic query reformulation
which combines the iterated nature of human query reformulation with the auto-
matic behavior of PRF. We refer to this process as pseudo-query reformulation
(PQR). Figure 1 graphically illustrates the intuition behind PQR. In this figure,
each query and its retrieved results are depicted as nodes in a graph. An edge
exists between two nodes, qi and qj , if there is a simple reformulation from qi
to qj ; for example, a single term addition or deletion. This simulates the incre-
mental query modifications a user might conduct during a session. The results
in this figure are colored so that red documents reflect relevance. If we assume
that a user is following a good reformulation policy, then, starting at q0 , she
will select reformulations (nodes) which incrementally increase the number of
relevant documents. This is depicted as the path of shaded nodes in our graph.
We conjecture that a user navigates from qi to qj by using insights from the
retrieval results of qi (e.g. qj includes a highly discriminative term in the results
for qi ) or by incorporating some prior knowledge (e.g. qj includes a highly dis-
criminative term in general). PQR is an algorithm which behaves in the same
way: issuing a query, observing the results, inspecting possible reformulations,
selecting a reformulation likely to be effective, and then iterating.
Fig. 1. Query reformulation as graph search. Nodes represent queries and associated
retrieved results. Relevant documents are highlighted in red. Edges exist between nodes
whose queries are simple reformulations of each other. The goal of pseudo-query refor-
mulation is to, given a seed query q0 by a user, automatically navigate to a better
query (Color figure online).
2 Related Work
Kurland et al. present several heuristics for iteratively refining a language model
query by navigating document clusters in a retrieval system [13]. The technique
3 Motivation
As mentioned earlier, users often reformulate an initial query in response to the
system’s ranking [11]. Reformulation actions include adding, deleting, and sub-
stituting query words, amongst other transformations. There is evidence that
manual reformulation can improve the quality of a ranked list for a given infor-
mation need [11, Table 5]. However, previous research has demonstrated that
humans are not as effective as automatic methods in this task [18].
In order to estimate an upper bound on the potential improvement from refor-
mulation, we propose a simulation of an optimal user’s reformulation behavior.
Our simulator models an omniscient user based on query-document relevance
judgments, known as qrels. Given a seed query, the user first generates a set of
candidate reformulations based on one-word additions and deletions. The user
then selects the query whose retrieval has the highest performance based on the
qrels. The process then iterates up to some search depth. If we consider queries
as nodes in a graph and edges between queries with single word additions or
deletions, then the process can be considered a depth-limited graph search by
an oracle.
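To make the simulation concrete, the following is a minimal sketch of such an oracle policy; `neighbors` and `evaluate` are hypothetical stand-ins for the one-word edit generator and the qrel-based scoring function (e.g. NDCG@30), not the paper's implementation.

def oracle_reformulate(q0, neighbors, evaluate, depth=4):
    # Greedy, depth-limited oracle (a sketch): at every step, score all
    # one-word additions/deletions of the current query against the
    # qrels and move to the best one, remembering the best query seen.
    best_q, best_score = q0, evaluate(q0)
    q = q0
    for _ in range(depth):
        candidates = list(neighbors(q))      # one-word edits of q
        if not candidates:
            break
        q = max(candidates, key=evaluate)    # oracle picks the best edit
        score = evaluate(q)
        if score > best_score:
            best_q, best_score = q, score
    return best_q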
We ran this simulation on a sample of queries described in Sect. 6. The
results of these experiments (Table 1) demonstrate the range of performance
for PQR. Our oracle simulator performs quite well, even given the limited depth
of our search. Performance is substantially better than the baseline, with rel-
ative improvements greater than those in published literature. To some extent
this should be expected since the oracle can leverage relevance information. Sur-
prisingly, though, the algorithm is able to achieve this performance increase by
adding and removing a small set of up to four terms.
4 Pseudo-Query Reformulation
We would like to develop algorithms that can approximate the behavior of our
optimal policy without having access to any qrels or an oracle. As such, PQR
follows the framework of the simulator from Sect. 3. The algorithm recursively
performs candidate generation and candidate scoring up to some search depth and
returns the highest-scored query encountered.
we select the top n terms in the relevance model, θRt , associated with qt [14]. The
relevance model is the retrieval score-weighted linear interpolation of retrieved
document language models. We adopt this approach for its computational ease
and demonstrated effectiveness in pseudo-relevance feedback.
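As an illustration of this step, here is a minimal sketch of selecting the top-n relevance-model terms; the input format (a retrieval score plus a term-count map per top document) and the use of unsmoothed maximum-likelihood document models are assumptions, not the paper's code.

from collections import Counter

def relevance_model_terms(retrieved, n=10):
    # `retrieved` is a list of (score, term_counts) pairs for the top
    # documents of q_t: `score` is the retrieval score and `term_counts`
    # a Counter over the document's terms (names are illustrative).
    total = sum(score for score, _ in retrieved)
    weights = Counter()
    for score, counts in retrieved:
        doc_len = sum(counts.values())
        for term, tf in counts.items():
            # score-weighted interpolation of (unsmoothed) document
            # language models
            weights[term] += (score / total) * (tf / doc_len)
    return [term for term, _ in weights.most_common(n)]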
Fig. 2. Drift signal classes. Signals for qt include comparisons with reference queries
qt−1 and q0 to prevent query drift.
Drift signals compare the current query qt with its parent qt−1 and the
initial query q0 (Fig. 2). These signals can serve to anchor our prediction and
avoid query drift, situations where a reformulation candidate appears to be high
quality but is topically very different from the desired information need. One way
to measure drift is to compute the difference in the query signals for these pairs.
Specifically, we measure the aggregate IDF, SC, and QS values of the deleted,
preserved, and introduced keywords. We also generate two signals comparing
the results sets of these pairs of queries. The first measures the similarity of
the ordering of retrieved documents. In order to do this, we compute the τ -
AP between the rankings [22]. The τ -AP computes a position-sensitive version
of Kendall’s τ suitable for information retrieval tasks. The ranking of results
for a reformulation candidate with a very high τ -AP will be indistinguishable
from those of the reference query; the ranking of results for a reformulation
candidate with a very low τ -AP will be quite different from the reference query.
Our second result set signal measures drift by inspecting the result set language
models. Specifically, it computes B(θRt−1 , θRt ), the Bhattacharyya correlation
between the result sets.
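The two result-set drift signals can be sketched as follows, assuming both rankings are over the same (re-ranked) document set and that the result-set language models are given as term-to-probability dicts; this is an illustrative sketch, not the paper's implementation.

import math

def bhattacharyya(p, q):
    # Bhattacharyya correlation B(p, q) between two unigram language
    # models given as dicts mapping term -> probability.
    return sum(math.sqrt(p[t] * q[t]) for t in p.keys() & q.keys())

def tau_ap(reference, candidate):
    # tau-AP rank correlation of `candidate` against `reference`;
    # both are assumed to rank the same document ids, best first.
    ref_pos = {d: i for i, d in enumerate(reference)}
    n = len(candidate)
    if n < 2:
        return 1.0
    total = 0.0
    for i in range(1, n):
        d = candidate[i]
        # documents above rank i in the candidate that the reference
        # also ranks above d
        concordant = sum(1 for e in candidate[:i] if ref_pos[e] < ref_pos[d])
        total += concordant / i
    return 2.0 * total / (n - 1) - 1.0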
information need. In practice, we train this model using true performance val-
ues of candidates encountered throughout a search process started at q0 ; running
this process over a number of training q0 ’s results in a large set of training can-
didates. We are agnostic about the precise functional form of our model and opt
for a linear ranking support vector machine [15] due to its training and evalua-
tion speed, something we found necessary when conducting experiments at scale.
Precisely how this training set is collected will be described in the next section.
QuerySearch(q, d, b, dmax , m)
1 if d = dmax
2 then return q
3 Qq ← GenerateCandidates(q)
4 µ̃ ← PredictPerformance(Qq )
5 Q̃q ← TopQueries(Qq , µ̃, b)
6 Q̂q ← TopQueries(Qq , µ̃, m)
7 for qi ∈ Q̃q
8 do Q̂q ← Q̂q ∪ QuerySearch(qi , d + 1, b, dmax , m)
9 µ̂ ← PredictPerformance(Q̂q )
10 return TopQueries(Q̂q , µ̂, m)
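The recursion above can be read as the following Python sketch; GenerateCandidates and PredictPerformance are represented by the placeholder callables `generate` and `predict`, and queries are assumed to be hashable (e.g. tuples of terms). It is a sketch of the control flow, not the paper's implementation.

def query_search(q, depth, b, d_max, m, generate, predict):
    # Depth-limited search over the reformulation graph, following the
    # pseudocode above.  `generate(q)` yields one-term edits of q and
    # `predict(q)` returns the learned predictor's score for q.
    if depth == d_max:
        return [q]
    candidates = sorted(generate(q), key=predict, reverse=True)
    merged = list(candidates[:m])                # best m at this node
    for qi in candidates[:b]:                    # expand the top b children
        merged.extend(query_search(qi, depth + 1, b, d_max, m,
                                   generate, predict))
    merged = sorted(set(merged), key=predict, reverse=True)
    return merged[:m]                            # best m queries encountered

# e.g. top_query = query_search(q0, 0, b=3, d_max=4, m=10,
#                               generate=..., predict=...)[0]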
5 Training
The effectiveness of the search algorithm (Sect. 4.3) critically depends on the reli-
ability of the performance predictor (Sect. 4.2). We can gather training data for
this model by sampling batches of reformulations from the query graph. In order
to ensure that our sample includes high quality reformulations, we run the ora-
cle algorithm (Sect. 3) while recording the visited reformulations and their true
performance values. Any bias in this data collection process can be addressed
by gathering additional training reformulations with suboptimal performance
predictor models (i.e. partially trained models).
6 Methods
We use three standard retrieval corpora for our experiments. Two news corpora,
trec12 and robust, consist of large archives of news articles. The trec12 dataset
consists of the Tipster disks 1 and 2 with TREC ad hoc topics 51–200. The robust
dataset consists of Tipster disks 4 and 5 with TREC ad hoc topics 301–450 and
601–700. Our web corpus consists of the Category B section of the Clue Web 2009
dataset with TREC Web topics 1–200. We tokenized all corpora on whitespace
and then applied Krovetz stemming and removed words in the SMART stopword
list. We further pruned the web corpus of all documents with a Waterloo spam
score less than 70. We use TREC title queries in all of our experiments. We
randomly partitioned the queries into three sets: 60 % for training, 20 % for
validation, and 20 % for testing. We repeated this random split procedure five
times and present results averaged across the test set queries.
All indexing and retrieval was conducted using indri 5.7. Our SVM mod-
els were trained using liblinear 1.95. We evaluated final retrievals using NIST
trec eval 9.0. In order to support large parameter sweeps, each query reformu-
lation in PQR performed a re-ranking of the documents retrieved by q0 instead
of a re-retrieval from the full index. Pilot experiments found that the effective-
ness of re-retrieval was comparable with that of re-ranking though re-retrieval
incurred much higher latency. Aside from the performance prediction model, our
algorithm has the following free parameters: the number of term-addition can-
didates per query (n), the number of candidates to select per query (b), and
the maximum search depth (dmax ). Combined, the automatic reformulation and
the multi-pass training resulted in computationally expensive processes whose
runtime is sensitive to these parameters. Consequently, we fixed our parame-
ter settings at relatively modest numbers (n = 10, b = 3, dmax = 4) and leave
a more thorough analysis of sensitivity for an extended manuscript. Although
these numbers may seem small, we remind the reader that this results in roughly
800 reformulations considered within the graph search for a single q0 . The num-
ber of candidates to merge (m) is tuned throughout training on the validation
set v0 and ranges from five to twenty. All runs, including baselines, optimized
NDCG@30. We used QL and RM3 as baselines. QL used Dirichlet smoothing
with parameter tuned on the full training set using a range of values from 500
through 5000. We tuned the RM3 parameters on the full training set. The range
of feedback terms considered was {5, 10, 25, 50, 75, 100}; the range of feedback
documents was {5, 25, 50, 75, 100}; the range of λ was [0, 1] with a step size of 0.1.
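For concreteness, the sweep above amounts to the following grid; the variable names are illustrative and the actual tuning harness is not shown.

from itertools import product

fb_terms = [5, 10, 25, 50, 75, 100]                 # feedback terms
fb_docs = [5, 25, 50, 75, 100]                      # feedback documents
lambdas = [round(0.1 * i, 1) for i in range(11)]    # lambda from 0.0 to 1.0

grid = list(product(fb_terms, fb_docs, lambdas))
print(len(grid))    # 330 RM3 configurations swept on the full training set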
7 Results
We present the results for our experiments in Table 2. Our first baseline, QL,
reflects the performance of q0 alone and represents an algorithm which is rep-
resentationally comparable with PQR insofar as it also retrieves using a short,
unweighted query. Our second baseline, RM3, reflects the performance of a strong
algorithm that also uses the retrieval results to improve performance, although
with much richer representational power (the optimal number of terms often
hovers near 75–100). As expected, RM3 consistently outperforms QL in terms
of MAP. And while the performance is superior across all metrics for trec12,
RM3 is statistically indistinguishable from QL for higher precision metrics on
our other two data sets. The random policy, which replaces our performance pre-
dictor with random scores, consistently underperforms both baselines for robust
and web. Interestingly, this algorithm is statistically indistinguishable from QL
for trec12, suggesting that this corpus may be easier.
Table 2. Comparison of PQR to QL and RM3 baselines for our datasets. Statistically
significant differences with respect to QL and RM3 are marked (Student's paired
t-test, p < 0.05 with a Bonferroni correction). The
best performing run is presented in bold.
Next, we turn to the performance of PQR. Across all corpora and across
almost all metrics, PQR significantly outperforms QL. While this baseline might
be considered low, it is a representationally fair comparison with PQR. So, this
result demonstrates the ability of PQR to find more effective reformulations than
q0 . The underperformance of the random algorithm signifies that the effective-
ness of PQR is attributable to the performance prediction model as opposed to
merely a random walk on the reformulation graph. That said, PQR is statistically
indistinguishable from QL for higher recall metrics on the web corpus (NDCG
and MAP). In all likelihood, this results from the optimization of NDCG@30,
as opposed to higher recall metrics. This outcome is amplified when we compare
PQR to RM3. For the robust and web datasets, we notice PQR significantly
outperforming RM3 for high precision metrics but showing weaker performance
for high recall metrics. The weaker performance of PQR compared to RM3 for
trec12 might be explained by the easier nature of the corpus combined with the
richer representation of the RM3 model. Nevertheless, we stress that, for those
metrics we optimized for and for the more precision-oriented collections, PQR
dominates all baselines.
We can inspect the coefficient values in the linear model to determine the
importance of individual signals in performance prediction. In Table 3, we present
the most important signals for each of our experiments. Because our results are
averaged over several runs, we selected the signals most often occurring amongst
the highest weighted in these runs, using the final selected model (see Sect. 5).
Interestingly, many of the top ranked signals are our drift features which com-
pare the language models and rankings of the candidate result set with those of
its parent and the first query. This suggests that the algorithm is successfully
preventing query drift by promoting candidates that retrieve results similar to
the original and parent queries. On the other hand, the high weight for Clar-
ity suggests that PQR is simultaneously balancing ranked list refinement with
ranked list anchoring.
8 Discussion
strong initial retrieval (i.e. QL). As mentioned earlier, the strength of the ran-
dom run separately provides evidence of the initial retrieval’s strength. Now, if
the initial retrieval uncovered significantly more relevant documents, then RM3
will estimate a language model very close to the true relevance model, boosting
performance. Since RM3 allows a long, rich, weighted query, it follows that it
would outperform PQR’s constrained representation. That said, it is remarkable
that PQR achieves comparable performance to RM3 on many metrics with at
most |q0 | + dmax words.
Despite the strong performance for high-precision metrics, the weaker per-
formance for high-recall metrics was somewhat disappointing but should be
expected given our optimization target (NDCG@30). Post-hoc experiments
demonstrated that optimizing for MAP boosted the performance of PQR to
0.1728 on web, resulting in statistically indistinguishable performance with RM3.
Nevertheless, we are not certain that human query reformulation of the type
encountered in general web search would improve high recall metrics since users
in that context rarely inspect deep into the ranked list.
One of the biggest concerns with PQR is efficiency. Whereas our QL baseline
ran in 100–200 ms, PQR ran in 10–20 s, even using the re-ranking approach.
However, because of this approach, our post-retrieval costs scale modestly as
corpus size grows, especially compared to massive query expansion techniques
like RM3. To understand this observation, note that issuing a long RM3 query
results in a huge slowdown in performance due to the number of postings lists
that need to be evaluated and merged. We found that for the web collection, RM3
performed quite slowly, often taking minutes to complete long queries. PQR, on
the other hand, has the same overhead as RM3 in terms of an initial retrieval and
fetching document vectors. After this step, though, PQR only needs to access
the index for term statistics, not a re-retrieval. Even with this speedup, though,
PQR is unlikely to be helpful for real-time, low-latency retrieval.
However, there are several situations where such a technique may be permissible,
including ‘slow search’ scenarios where users tolerate latency in order to receive
better results [20], offline result caching, and document filtering.
9 Conclusion
We have presented a new formal model for information retrieval. The positive
results on three separate corpora provide evidence that PQR is a framework
worth investigating further. In terms of candidate generation, we considered
only very simple word additions and deletions while previous research has demon-
strated the effectiveness of applying multiword units (e.g. ordered and unordered
windows). Beyond this, we can imagine applying more sophisticated operations
such as filters, site restrictions, or time ranges. While it would increase our query
space, it may also allow for more targeted, higher-precision reformulations. In
terms of candidate scoring, we found that our novel drift signals allowed for effec-
tive query expansion. We believe that PQR provides a framework for developing
other performance predictors in a grounded retrieval model. In terms of graph
search, we believe that other search strategies might result in more effective
coverage of the space.
References
1. Bendersky, M.: Information Retrieval with Query Hypergraphs. Ph.D. thesis, Uni-
versity of Massachusetts Amherst (2012)
2. Bhattacharyya, A.: On a measure of divergence between two statistical populations
defined by probability distributions. Bull. Calcutta Math. Soc. 35, 99–109 (1943)
3. Cao, G., Nie, J.Y., Gao, J., Robertson, S.: Selecting good expansion terms for
pseudo-relevance feedback. In: SIGIR (2008)
4. Croft, W.B., Harper, D.J.: Using probabilistic models of document retrieval with-
out relevance information. J. Documentation 35(4), 285–295 (1979)
5. Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In:
SIGIR (2002)
6. Diaz, F.: Performance prediction using spatial autocorrelation. In: SIGIR (2007)
7. Hauff, C.: Predicting the Effectiveness of Queries and Retrieval Systems. Ph.D.
thesis, University of Twente (2010)
8. Hauff, C., Azzopardi, L., Hiemstra, D.: The combination and evaluation of query
performance prediction methods. In: Boughanem, M., Berrut, C., Mothe, J., Soule-
Dupuy, C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 301–312. Springer, Heidelberg
(2009)
9. He, B., Ounis, I.: Inferring query performance using pre-retrieval predictors. In:
SPIRE (2004)
10. Huang, C.K., Chien, L.F., Oyang, Y.J.: Relevant term suggestion in interactive
web search based on contextual information in query session logs. JASIST 54,
638–649 (2003)
11. Huang, J., Efthimiadis, E.N.: Analyzing and evaluating query reformulation strate-
gies in web search logs. In: CIKM (2009)
12. Kumaran, G., Carvalho, V.: Reducing long queries using query quality predictors.
In: SIGIR (2009)
13. Kurland, O., Lee, L., Domshlak, C.: Better than the real thing?: iterative pseudo-
query processing using cluster-based language models. In: SIGIR (2005)
14. Lavrenko, V., Croft, W.B.: Relevance based language models. In: SIGIR (2001)
15. Liu, T.Y.: Learning to Rank for Information Retrieval. Springer, Heidelberg (2009)
16. Lv, Y., Zhai, C.: Adaptive relevance feedback in information retrieval. In: CIKM
(2009)
17. Macdonald, C., Santos, R.L., Ounis, I.: On the usefulness of query features for
learning to rank. In: CIKM (2012)
18. Ruthven, I.: Re-examining the potential effectiveness of interactive query expan-
sion. In: SIGIR (2003)
19. Sheldon, D., Shokouhi, M., Szummer, M., Craswell, N.: Lambdamerge: merging
the results of query reformulations. In: WSDM (2011)
20. Teevan, J., Collins-Thompson, K., White, R.W., Dumais, S.: Slow search. Com-
mun. ACM 57(8), 36–38 (2014)
21. Xue, X., Croft, W.B., Smith, D.A.: Modeling reformulation using passage analysis.
In: CIKM (2010)
22. Yilmaz, E., Aslam, J.A., Robertson, S.: A new rank correlation coefficient for
information retrieval. In: SIGIR (2008)
VODUM: A Topic Model Unifying Viewpoint,
Topic and Opinion Discovery
1 Introduction
The surge of opinionated on-line texts has raised the interest of researchers and the
general public alike, since such texts are an incredibly rich source of data for analyzing
contrastive views on a wide range of issues, such as policy or commercial products. This large
volume of opinionated data can be explored through text mining techniques,
known as Opinion Mining or Sentiment Analysis. In an opinionated document,
a user expresses her opinions on one or several topics, according to her view-
point. We define the key concepts of topic, viewpoint, and opinion as follows.
A topic is one of the subjects discussed in a document collection. A viewpoint
is the standpoint of one or several authors on a set of topics. An opinion is a
wording that is specific to a topic and a viewpoint. For example, in the manually
crafted sentence Israel occupied the Palestinian territories of the Gaza strip, the
topic is the presence of Israel on the Gaza strip, the viewpoint is pro-Palestine
and an opinion is occupied. Indeed, when mentioning the action of building Israeli
communities on disputed lands, the pro-Palestine side is likely to use the verb
to occupy, whereas the pro-Israel side is likely to use the verb to settle. Both
sides discuss the same topic, but they use a different wording that conveys an
opinion.
Table 1. Properties of VODUM compared to related work.

Ref        | Model is used without supervision | Topical words and opinion words are partitioned | Viewpoint assignments are learned | Model is independent of structure-specific properties
[7]        | +                                 | −                                               | −                                 | +
[3]        | +                                 | +                                               | −                                 | +
[4,8,9,12] | −                                 | −                                               | +                                 | +
[10,11]    | +                                 | −                                               | +                                 | −
VODUM      | +                                 | +                                               | +                                 | +
The Joint Topic Viewpoint (JTV) model was proposed in [12] to jointly model topics
and viewpoints. However, the parameters inferred by both TAM and JTV were only
integrated as features into an SVM classifier to identify document-level viewpoints. TAM was extended to
perform contrastive viewpoint summarization [9], but this extension was still
weakly supervised as it leveraged a sentiment lexicon to identify viewpoints.
The task of viewpoint identification was also studied for user generated data
such as forums, where users can debate on controversial issues [10,11]. These
works proposed Topic Models that, however, rely on structure-specific properties
exclusive to forums (such as threads, posts, users, and interactions between users)
and therefore cannot be applied to infer the viewpoints of general documents.
The specific properties of VODUM compared to related work are summarized
in Table 1. VODUM is totally unsupervised. It separately models topical words
and opinion words. Document-level viewpoint assignments are learned. VODUM
is also structure-independent and thus broadly applicable. These properties are
further detailed in Sect. 3.
Topical Words and Opinion Words Separation. In our model, topical words and
opinion words are partitioned based on their part-of-speech, in line with several
viewpoint modeling and Opinion Mining works [3,5,13]. Here, nouns are assumed
to be topical words; adjectives, verbs and adverbs are assumed to be opinion
words. While this assumption seems coarse, let us stress that a more accurate
definition of topical and opinion words (e.g., by leveraging sentiment lexicons)
could be used, without requiring any modification of our model. The part-of-
speech tagging pre-processing step is further described in Sect. 4.2. The part-of-
speech category is represented as an observed variable x which takes a value of
0 for topical words and 1 for opinion words. Topical words and opinion words
are then drawn from distributions φ0 and φ1 , respectively.
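A minimal sketch of this binary partition is given below, using an off-the-shelf Penn Treebank tagger; the tagger and the exact tag set are assumptions, not the paper's pre-processing pipeline.

import nltk

# Penn Treebank tags treated as topical (x = 0) vs. opinion (x = 1) words.
TOPICAL = {"NN", "NNS", "NNP", "NNPS"}
OPINION = {"JJ", "JJR", "JJS",
           "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
           "RB", "RBR", "RBS"}

def pos_partition(tokens):
    # Return (word, x) pairs; words in other categories are dropped.
    labelled = []
    for word, tag in nltk.pos_tag(tokens):
        if tag in TOPICAL:
            labelled.append((word, 0))
        elif tag in OPINION:
            labelled.append((word, 1))
    return labelled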
[Plate diagram of the VODUM graphical model: hyperparameters α, β0, β1, η; distributions θ, π, φ0, φ1; latent assignments v, z; observed variables w, x; plates over Nm, Md, D, V, and T.]
Sentence-level Topic Variables. Most Topic Models define word-level topic vari-
ables (e.g., [2,3,8]). We hypothesize that using sentence-level topic variables,
denoted by z, better captures the dependency between the opinions expressed
in a sentence and the topic of the sentence. Indeed, coercing all words from
a sentence to be related to the same topic reinforces the co-occurrence prop-
erty leveraged by Topic Models. As a result, the topics induced by topical word
distributions φ0 and opinion word distributions φ1 are more likely to be aligned.
As for other probabilistic Topic Models, the exact inference of VODUM is not
tractable. We thus rely on approximate inference to compute parameters φ0 ,
φ1 , θ, and π, as well as the document-level viewpoint assignments v. We chose
collapsed Gibbs sampling as it was shown to be quicker to converge than approx-
imate inference methods such as variational Bayes [1].
Collapsed Gibbs sampling does not require the actual computation of the posterior probability, which is usu-
ally intractable for Topic Models. Only the marginal probability distributions of
latent variables (i.e., the probability distribution of one latent variable given all
other latent variables and all observed variables) need to be computed in order to
perform collapsed Gibbs sampling. For each sample, the collapsed Gibbs sampler
iteratively draws assignments for all latent variables using their marginal proba-
bility distributions, conditioned on the previous sample’s assignments. The mar-
ginal probability distributions used to sample the topic assignments and view-
point assignments in our collapsed Gibbs sampler are described in (1) and (2),
respectively. The derivation is omitted due to space limitation. The notation used
in the equations is defined in Table 2. Additionally, indexes or superscripts −d
and −(d, m) exclude the d-th document and the m-th sentence of the d-th doc-
ument, respectively. Similarly, indexes or superscripts d and (d, m) include only
the d-th document and the m-th sentence of the d-th document, respectively.
A superscript (·) denotes a summation over the corresponding superscripted
index.
\[
p(z_{d,m} = j \mid v_d = i, \boldsymbol{v}^{-d}, \boldsymbol{z}^{-(d,m)}, \boldsymbol{w}, \boldsymbol{x}) \;\propto\;
\frac{n_i^{(j),-(d,m)} + \alpha}{n_i^{(\cdot),-(d,m)} + T\alpha}
\times
\frac{\prod_{k \in W_0} \prod_{a=0}^{n_{0,j}^{(k),(d,m)}-1} \left(n_{0,j}^{(k),-(d,m)} + \beta_0 + a\right)}
     {\prod_{b=0}^{n_{0,j}^{(\cdot),(d,m)}-1} \left(n_{0,j}^{(\cdot),-(d,m)} + W_0\beta_0 + b\right)}
\times
\frac{\prod_{k \in W_1} \prod_{a=0}^{n_{1,i,j}^{(k),(d,m)}-1} \left(n_{1,i,j}^{(k),-(d,m)} + \beta_1 + a\right)}
     {\prod_{b=0}^{n_{1,i,j}^{(\cdot),(d,m)}-1} \left(n_{1,i,j}^{(\cdot),-(d,m)} + W_1\beta_1 + b\right)}
\tag{1}
\]

\[
p(v_d = i \mid \boldsymbol{v}^{-d}, \boldsymbol{z}, \boldsymbol{w}, \boldsymbol{x}) \;\propto\;
\frac{n^{(i),-d} + \eta}{n^{(\cdot),-d} + V\eta}
\times
\frac{\prod_{j=1}^{T} \prod_{a=0}^{n_i^{(j),d}-1} \left(n_i^{(j),-d} + \alpha + a\right)}
     {\prod_{b=0}^{n_i^{(\cdot),d}-1} \left(n_i^{(\cdot),-d} + T\alpha + b\right)}
\times
\frac{\prod_{j=1}^{T} \prod_{k \in W_1} \prod_{a=0}^{n_{1,i,j}^{(k),d}-1} \left(n_{1,i,j}^{(k),-d} + \beta_1 + a\right)}
     {\prod_{j=1}^{T} \prod_{b=0}^{n_{1,i,j}^{(\cdot),d}-1} \left(n_{1,i,j}^{(\cdot),-d} + W_1\beta_1 + b\right)}
\tag{2}
\]

\[
\pi^{(i)} = \frac{n^{(i)} + \eta}{n^{(\cdot)} + V\eta}
\tag{3}
\]
\[
\theta_i^{(j)} = \frac{n_i^{(j)} + \alpha}{n_i^{(\cdot)} + T\alpha}
\tag{4}
\]
\[
\phi_{0,j}^{(k)} =
\begin{cases}
\dfrac{n_{0,j}^{(k)} + \beta_0}{n_{0,j}^{(\cdot)} + W_0\beta_0} & \text{if } k \in W_0 \\
0 & \text{otherwise}
\end{cases}
\tag{5}
\]
\[
\phi_{1,i,j}^{(k)} =
\begin{cases}
\dfrac{n_{1,i,j}^{(k)} + \beta_1}{n_{1,i,j}^{(\cdot)} + W_1\beta_1} & \text{if } k \in W_1 \\
0 & \text{otherwise}
\end{cases}
\tag{6}
\]
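To make Eqs. (3)–(6) concrete, the following is a minimal sketch of the point estimates computed with NumPy from the count tensors of one Gibbs sample; the array names and shapes are assumptions, not the data structures of the released Java implementation.

import numpy as np

def estimate_parameters(n_doc_v, n_sent_vt, n_top_tw, n_op_vtw,
                        alpha, beta0, beta1, eta):
    # Count tensors from one Gibbs sample (shapes are assumptions):
    #   n_doc_v[i]        documents assigned viewpoint i           (V,)
    #   n_sent_vt[i, j]   sentences with viewpoint i and topic j   (V, T)
    #   n_top_tw[j, k]    topical word k assigned to topic j       (T, W0)
    #   n_op_vtw[i, j, k] opinion word k with viewpoint i, topic j (V, T, W1)
    V, T = n_sent_vt.shape
    W0 = n_top_tw.shape[1]
    W1 = n_op_vtw.shape[2]
    pi = (n_doc_v + eta) / (n_doc_v.sum() + V * eta)                                  # Eq. (3)
    theta = (n_sent_vt + alpha) / (n_sent_vt.sum(axis=1, keepdims=True) + T * alpha)  # Eq. (4)
    phi0 = (n_top_tw + beta0) / (n_top_tw.sum(axis=1, keepdims=True) + W0 * beta0)    # Eq. (5)
    phi1 = (n_op_vtw + beta1) / (n_op_vtw.sum(axis=2, keepdims=True) + W1 * beta1)    # Eq. (6)
    return pi, theta, phi0, phi1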
4 Experiments
Note that an issue similar to (H1) was already addressed in [10,11]. The
authors did not evaluate, however, the impact of this assumption on the view-
point identification task. The rest of this section is organised as follows. In
Sect. 4.1, we detail the baselines we compared VODUM against. Section 4.2
describes the dataset used for the evaluation and the experimental setup. In
Sect. 4.3, we report and discuss the results of the evaluation.
4.1 Baselines
We applied stop word removal and Porter stemming to the collection. We also performed part-
of-speech tagging and annotated data with the binary part-of-speech categories
that we defined in Sect. 3.1. Category 0 corresponds to topical words and contains
common nouns and proper nouns. Category 1 corresponds to opinion words and
contains adjectives, verbs, adverbs, qualifiers, modals, and prepositions. Tokens
labeled with other part-of-speech categories were filtered out.
We implemented our model VODUM and the baselines based on the JGibb-
LDA3 Java implementation of collapsed Gibbs sampling for LDA. The source
code of our implementation and the formatted data (after all pre-processing
steps) are available at https://github.com/tthonet/VODUM.
In the experiments, we set the hyperparameters of VODUM and baselines
to the following values. The hyperparameters in VODUM were manually tuned:
α = 0.01, β0 = β1 = 0.01, and η = 100. The rationale behind the small α
(θ’s hyperparameter) and the large η (π’s hyperparameter) is that we want a
sparse θ distribution (i.e., each viewpoint has a distinct topic distribution) and a
smoothed π distribution (i.e., a document has equal chance to be generated under
each of the viewpoints). We chose the same hyperparameters for the degenerate
versions of VODUM. The hyperparameters of TAM were set according to [9]:
α = 0.1, β = 0.1, δ0 = 80.0, δ1 = 20.0, γ0 = γ1 = 5.0, ω = 0.01. For JTV, we
used the hyperparameters’ values described in [12]: α = 0.01, β = 0.01, γ = 25.
We manually adjusted the hyperparameters of LDA to α = 0.5 and β = 0.01.
For all experiments, we set the number of viewpoints (for VODUM, VODUM-D,
VODUM-O, VODUM-W, VODUM-S, and JTV) and the number of aspects (for
TAM) to 2, as documents from the Bitterlemons collection are assumed to reflect
either the Israeli or Palestinian viewpoint.
4.3 Evaluation
We performed both quantitative and qualitative evaluation to assess the quality
of our model. The quantitative evaluation relies on two metrics: held-out per-
plexity and viewpoint identification accuracy. It compares the performance of
our model VODUM according to these metrics against the aforementioned base-
lines. In addition, the qualitative evaluation consists in checking the coherence
of topical words and the related viewpoint-specific opinion words inferred by our
model. These evaluations are further described below.
[Fig. 3: average held-out perplexity versus the number of topics for VODUM, TAM, JTV, and LDA. Fig. 4: boxplots of viewpoint identification accuracy for VODUM, VODUM-D, VODUM-O, VODUM-W, VODUM-S, TAM, JTV, and LDA.]
of the test set (i.e., a set of held-out documents). A lower perplexity for the test
set, which is equivalent to a higher likelihood for the test set, can be interpreted
as a better generalization performance of the model: the model, learned on the
training set, is less “perplexed” by the test set. In this experiment, we aimed
to investigate (H5) and compared the generalization performance of our model
VODUM against the state-of-the-art baselines. We performed a 10-fold cross-
validation as follows. The model is trained on nine folds of the collection for
1,000 iterations and inference on the remaining, held-out test fold is performed
for another 1,000 iterations. For both training and test, we only considered the
final sample, i.e., the 1,000th sample. We finally report the held-out perplexity
averaged on the final samples of the ten possible test folds. As the generaliza-
tion performance depends on the number of topics, we computed the held-out
perplexity of models for 5, 10, 15, 20, 30, and 50 topics.
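For reference, a standard definition of held-out perplexity (stated here as an assumption, since the excerpt does not include the paper's own formula) is
\[
\mathrm{perplexity}(D_{\text{test}}) =
\exp\!\left(-\,\frac{\sum_{d \in D_{\text{test}}} \log p(\boldsymbol{w}_d)}
                    {\sum_{d \in D_{\text{test}}} N_d}\right),
\]
where N_d is the number of tokens in held-out document d and p(w_d) is the likelihood of its words under the trained model.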
The results of this experiment (Fig. 3) support (H5): for all numbers of topics,
VODUM has a significantly lower perplexity than TAM, JTV, and LDA.
This implies that VODUM's ability to generalize data is better than the baselines'.
JTV presents slightly lower perplexity than TAM and LDA, especially for larger
numbers of topics. TAM and LDA obtained comparable perplexity, with TAM
slightly better for smaller numbers of topics and LDA slightly better for
larger numbers of topics.
We compared the viewpoint identification accuracy (VIA) of our model VODUM against our baselines, in order to investigate (H1), (H2),
(H3), (H4), and (H5). As the Bitterlemons collection contains two different view-
points (Israeli or Palestinian), viewpoint identification accuracy is here equiv-
alent to binary clustering accuracy: each document is assigned to viewpoint 0
or to viewpoint 1. The VIA is then the ratio of well-clustered documents. As
reported in [9], the viewpoint identification accuracy presents high variance for
different executions of a Gibbs sampler, because of the stochastic nature of the
process. For each model evaluated, we thus performed 50 executions of 1,000
iterations, and kept the final (1,000th ) sample of each execution, resulting in
a total of 50 samples. In this experiment, we set the number of topics for the
different models as follows: 12 for VODUM, VODUM-D, VODUM-O, VODUM-
W, and VODUM-S. The number of topics for state-of-the-art models was set
according to their respective authors’ recommendation: 8 for TAM (according
to [9]), 6 for JTV (according to [12]). For LDA, the number of topics was set to
2: as LDA does not model viewpoints, we evaluated to what extent LDA is able
to match viewpoints with topics.
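As a concrete illustration of how the VIA is computed, the following sketch combines the majority-vote document assignment described in the next paragraph with a label-permutation-invariant accuracy; all names, and the tie-breaking RNG, are illustrative assumptions rather than the paper's evaluation code.

import numpy as np
from collections import Counter

def viewpoint_accuracy(per_unit_assignments, gold_labels, seed=0):
    # `per_unit_assignments[d]` lists the sentence- (or word-) level
    # viewpoint assignments of document d in {0, 1}; `gold_labels[d]`
    # is its true viewpoint.
    rng = np.random.default_rng(seed)
    predicted = []
    for assignments in per_unit_assignments:
        counts = Counter(assignments)
        if counts[0] == counts[1]:
            predicted.append(int(rng.integers(2)))   # break ties at random
        else:
            predicted.append(counts.most_common(1)[0][0])
    predicted = np.array(predicted)
    gold = np.array(gold_labels)
    # cluster labels are arbitrary, so score both label permutations
    acc = (predicted == gold).mean()
    return max(acc, 1.0 - acc)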
VODUM, VODUM-D, VODUM-O, and VODUM-W provide documents’
viewpoint assignment for each sample. We thus directly used these assignments
to compute the VIA. However, VODUM-S only has sentence-level viewpoint
assignments. We assigned each document the majority viewpoint assignment of
its sentences. When the sentences of a document are evenly assigned to each
viewpoint, the viewpoint of the document was chosen randomly. We proceeded
similarly with TAM, JTV, and LDA, using their majority word-level aspect,
viewpoint, and topic assignments, respectively, to compute the document-level
viewpoint assignments. The results of the experiments are given in Fig. 4. The
boxplots show that our model VODUM overall performed the best in the view-
point identification task. More specifically, VODUM outperforms state-of-the-art
models, thus supporting (H5). Among state-of-the-art models, TAM obtained the
best results. We also observe that JTV did not outperform LDA in the viewpoint
identification task. This may be due to the fact that the dependency between
topic variables and viewpoint variables was not taken into account when we
used JTV to identify document-level viewpoints – word-level viewpoint assign-
ments in JTV are not necessarily aligned across topics. The observations of the
degenerate versions of VODUM support (H1), (H2), (H3), and (H4). VODUM-O
and VODUM-W performed very poorly compared to other models. The sepa-
ration of topical words and opinion words, as well as the use of sentence-level
topic variables – properties that were removed from VODUM in VODUM-O and
VODUM-W, respectively – are then both absolutely necessary in our model to
accurately identify documents’ viewpoint, which confirms (H2) and (H3). The
model VODUM-S obtained reasonable VIA, albeit clearly lower than VODUM.
Document-level viewpoint variables thus lead to a better VIA than sentence-
level viewpoint variables, verifying (H4). Among the degenerate versions of
VODUM, VODUM-D overall yielded the highest VIA, but still slightly lower
than VODUM. We conclude that the assumption made in [10,11], stating that
Table 3. Most probable topical and opinion (stemmed) words inferred by VODUM
for the topic manually labeled as Middle East conflicts. Opinion words are given for
each viewpoint: Israeli (I) and Palestinian (P).
Topical words:      israel palestinian syria jihad war iraq dai suicid destruct iran
Opinion words (I):  islam isra terrorist recent militari intern like heavi close american
Opinion words (P):  need win think sai don strong new sure believ commit
We expect to extend the work presented here in several ways. As the accuracy
of viewpoint identification shows a high variance between different samples, one
needs to design a method to automatically select the most accurate sample or to
deduce accurate viewpoint assignments from a set of samples. VODUM can also
integrate sentiment labels to create a separation between positive and negative
opinion words, using sentiment lexicons. This could increase the discrimination
between different viewpoints and thus improve viewpoint identification. A view-
point summarization framework can as well benefit from VODUM, selecting the
most relevant sentences from each viewpoint and for each topic by leveraging
VODUM’s inferred parameters.
References
1. Asuncion, A., Welling, M., Smyth, P., Teh, Y.W.: On smoothing and inference for
topic models. In: Proceedings of UAI 2009, pp. 27–34 (2009)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. In: Proceedings of
NIPS 2001, pp. 601–608 (2001)
3. Fang, Y., Si, L., Somasundaram, N., Yu, Z.: Mining contrastive opinions on political
texts using cross-perspective topic model. In: Proceedings of WSDM 2012, pp. 63–
72 (2012)
4. Lin, W.H., Wilson, T., Wiebe, J., Hauptmann, A.: Which side are you on? Identi-
fying perspectives at the document and sentence levels. In: Proceedings of CoNLL
2006, pp. 109–116 (2006)
5. Liu, B., Hu, M., Cheng, J.: Opinion observer: analyzing and comparing opinions
on the web. In: Proceedings of WWW 2005, pp. 342–351 (2005)
6. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf.
Retrieval 2(1–2), 1–135 (2008)
7. Paul, M.J., Girju, R.: Cross-cultural analysis of blogs and forums with mixed-
collection topic models. In: Proceedings of EMNLP 2009, pp. 1408–1417 (2009)
8. Paul, M.J., Girju, R.: A two-dimensional topic-aspect model for discovering multi-
faceted topics. In: Proceedings of AAAI 2010, pp. 545–550 (2010)
9. Paul, M.J., Zhai, C., Girju, R.: Summarizing contrastive viewpoints in opinionated
text. In: Proceedings of EMNLP 2010, pp. 66–76 (2010)
10. Qiu, M., Jiang, J.: A latent variable model for viewpoint discovery from threaded
forum posts. In: Proceedings of NAACL HLT 2013, pp. 1031–1040 (2013)
11. Qiu, M., Yang, L., Jiang, J.: Modeling interaction features for debate side cluster-
ing. In: Proceedings of CIKM 2013, pp. 873–878 (2013)
12. Trabelsi, A., Zaiane, O.R.: Mining contentious documents using an unsupervised
topic model based approach. In: Proceedings of ICDM 2014, pp. 550–559 (2014)
13. Turney, P.D.: Thumbs up or thumbs down? Semantic orientation applied to unsu-
pervised classification of reviews. In: Proceedings of ACL 2002, pp. 417–424 (2002)
Applications
Harvesting Training Images for Fine-Grained
Object Categories Using Visual Descriptions
1 Introduction
Visual object recognition has advanced greatly in recent years, partly due to
the availability of large-scale image datasets such as ImageNet [4]. However, the
availability of image datasets for fine-grained object categories, such as particular
types of flowers and birds [10,16], is still limited. Manual annotation of such
training images is a notoriously onerous task and requires domain expertise.
Thus, previous work [2,3,6–9,12,14] has automatically harvested image
datasets by retrieving images from online search engines. These images can then
be used as training examples for a visual classifier. Typically the work starts
with a keyword search of the desired category, often using the category name
e.g. querying Google for “butterfly”. As category names are often polysemous
and, in addition, a page relevant to the keyword might also contain many pic-
tures not of the required category, images are also filtered and reranked. While
some work reranks or filters images using solely visual features [3,6,9,14], others
M. Everingham—who died in 2012—is included as a posthumous author of this
paper for his intellectual contributions during the course of this work.
have shown that features from the web pages containing the images, such as the
neighbouring text and metadata information, are useful as well [2,7,8,12] (see
Sect. 1.1 for an in-depth discussion). However, prior work has solely focused on
basic level categories (such as “butterfly”) and not been used for fine-grained
categories (such as a butterfly species like “Danaus plexippus”) where the need
to avoid manual annotation is greatest for the reasons mentioned above.
Our work therefore focuses on the automatic harvesting of training images for
fine-grained object categories. Although fine-grained categories pose particular
challenges for this task (a smaller number of pictures available overall and a higher
risk of wrong picture tags due to the domain expertise required, among others), at least
for natural categories they have one advantage: their instances share strong
visual characteristics and therefore there exist ‘visual descriptions’, i.e. textual
descriptions of their appearances, in nature guides, providing a resource that
goes far beyond the usual use of category names. See Fig. 1 for an example.
We use these visual descriptions for harvesting images for fine-grained object
categories to (i) improve search engine querying compared to category name
search and (ii) rerank images by comparing their accompanying web page text
to the independent visual descriptions from nature guides as an expert source.
We show that the use of these visual descriptions can improve precision over
name-based search. To the best of our knowledge this is the first work using
visual descriptions for harvesting training images for object categorization.1
Harvesting Training Images. Fergus et al. [6] were one of the first to pro-
pose training a visual classifier by automatically harvesting (potentially noisy)
1 Previous work [1,5,15] has used visual descriptions for object recognition without any training images but not for the discovery of training images itself.
training images from the Web, in their case obtained by querying Google Images
with the object category name. Topic modelling is performed on the images,
and test images are classified by how likely they are to belong to the best topic
selected using a validation set. However, using a single best topic results in
low data diversity. Li et al. [9] propose a framework where category models are
learnt iteratively, and the image dataset simultaneously expanded at each itera-
tion. They overcome the data diversity problem by retaining a small but highly
diverse ‘cache set’ of positive images at each iteration, and using it to incre-
mentally update the model. Other related work includes using multiple-instance
learning to automatically de-emphasise false positives [14] and an active learning
approach to iteratively label a subset of the images [3].
Harvesting Using Text and Images. The work described so far involves filter-
ing only by images; the sole textual data involved are keyword queries to search
engines. Berg and Forsyth [2] model both images and their surrounding text from
Google web search to harvest images for ten animal categories. Topic modelling is
applied to the text, and images are ranked based on how likely their corresponding
text is to belong to each topic. Their work requires human supervision to iden-
tify relevant topics. Schroff et al. [12] propose generating training images with-
out manual intervention. Class-independent text-based classifiers are trained to
rerank images using binary features from web pages, e.g. whether the query term
occurs in the website title. They demonstrated superior results to [2] on the same
dataset without requiring any human supervision. George et al. [7] build on [12]
by retrieving images iteratively, while Krapac et al. [8] add contextual features
(words surrounding the image etc.) on top of the binary features of [12].
Like [2,7,8,12], our work ranks images by their surrounding text. However,
we tackle fine-grained object categories which will allow the harvesting of train-
ing images to scale to a large number of categories. In addition, we do not only
use the web text surrounding the image but also the visual descriptions from outside
resources, ranking the accompanying web text by its similarity to these visual
descriptions. In contrast to the manual topic definition in [2], this method therefore
does not require human intervention during harvesting.
1.2 Overview
We illustrate harvesting training images for ten butterfly categories of the Leeds
Butterfly Dataset [15], using the provided eNature visual descriptions. Figure 2
shows the pipeline for our method, starting from the butterfly species’ name and
visual description. We obtain a list of candidate web pages via search engine
queries (Sect. 2). These are parsed to produce a collection of images and text
blocks for each web page, along with their position and size on the page (Sect. 3).
Image-text correspondence aligns the images with text blocks on each web page
(Sect. 4). The text blocks are then matched to the butterfly description (Sect. 5),
and images ranked based on how similar their corresponding text blocks are to
the visual description (Sect. 6). The ranked images are evaluated in Sect. 7, and
conclusions offered in Sect. 8.
Fig. 2. General overview of the proposed framework, which starts from the butterfly
species name (Latin and English) and description, and outputs a ranked list of images.
The seed phrases are noun phrases and adjective phrases, obtained via phrase chunking as in [15]. The
number of seed phrases per category ranges from 5 to 17 depending on the
length of the description; an example list is shown in Fig. 3. We query Google
with the butterfly name augmented with each seed phrase individually, and with
all possible combinations of seed phrase pairs and triplets (e.g. ‘Vanessa atalanta’
bright blue patch pink bar white spots).
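A minimal sketch of this seeded-query generation follows; the exact quoting and concatenation format of the queries sent to Google is an assumption.

from itertools import combinations

def seeded_queries(name, seed_phrases):
    # The quoted species name augmented with every single seed phrase
    # and with all pairs and triplets of seed phrases.
    queries = []
    for r in (1, 2, 3):
        for combo in combinations(seed_phrases, r):
            queries.append('"{}" {}'.format(name, " ".join(combo)))
    return queries

# e.g. seeded_queries("Vanessa atalanta",
#                     ["bright blue patch", "pink bar", "white spots"])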
Two sets of seeded queries are used: one with the Latin and one with the
English butterfly name. For each category, all candidate pages from the base and
the seeded queries (54 to 1670 queries per category, mean 592) are pooled. For
de-duplication, only one copy of pages with the same web address is retained.
Description: FW tip extended, clipped. Above, black with orange-red to vermilion bars across
FW and on HW border. Below, mottled black, brown, and blue with pink bar on FW. White spots
at FW tip above and below, bright blue patch on lower HW angle above and below.
Seed phrases: black brown and blue; bright blue patch; fw tip; hw border; lower hw angle; orange
red to vermilion bars; pink bar; white spot
Fig. 3. Seed phrases for Vanessa atalanta extracted from its visual description.
We consider as text blocks all text within block-level elements (including tables and
table cells) and those delimited by any images or the <br> element. All images
and text blocks are extracted from web pages, along with their height, width,
and (x, y) coordinates as would be rendered by a browser. The renderer viewport
size is set as 1280 × 1024 across all experiments.
4 Image-text Correspondence
The list of images and text blocks with their positional information is then
used to align text blocks to images (see Fig. 4 for an illustration). An image
can correspond to multiple text blocks since we do not want to discard any
good candidate visual descriptions by limiting ourselves to only one nearest
neighbouring text. On the other hand, each text block may only be aligned to
its closest image; multiple images are allowed only if they both share the same
distance from the text block. This relies on the assumption that the closest image
is more likely to correspond to the text blocks than those further away.
An image is a candidate for alignment with a text block only if all or part
of the image is located directly above, below, or to either side of the text block. All
candidate images must have a minimum size of 120 × 120. For each text block,
we compute the perpendicular distance between the closest edges of the text
block and each image, and select the image with the minimum distance subject
to the constraint that the distance is smaller than a fixed threshold (100 pixels
in our experiments). Text blocks without a corresponding image are discarded.
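The alignment rule can be sketched as follows; the box representation and the simplified edge-distance computation are assumptions rather than the paper's exact geometry code.

def align_text_blocks(images, text_blocks, min_size=120, max_dist=100):
    # Boxes are dicts with keys x, y, w, h in rendered page coordinates
    # (an assumed representation).  Returns (text_block, image) pairs;
    # text blocks with no image within `max_dist` pixels are dropped.
    def overlaps_axis(a, b):
        # image directly above/below (x-ranges overlap) or to either
        # side (y-ranges overlap) of the text block
        horiz = a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
        vert = a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"]
        return horiz or vert

    def edge_gap(a, b):
        # perpendicular distance between closest edges; one of dx, dy is
        # zero for any candidate because of the overlap requirement
        dx = max(b["x"] - (a["x"] + a["w"]), a["x"] - (b["x"] + b["w"]), 0)
        dy = max(b["y"] - (a["y"] + a["h"]), a["y"] - (b["y"] + b["h"]), 0)
        return max(dx, dy)

    pairs = []
    for block in text_blocks:
        candidates = [img for img in images
                      if img["w"] >= min_size and img["h"] >= min_size
                      and overlaps_axis(block, img)]
        if not candidates:
            continue
        best = min(candidates, key=lambda img: edge_gap(block, img))
        if edge_gap(block, best) <= max_dist:
            pairs.append((block, best))
    return pairs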
5 Text Matching
The text matching component computes how similar a text block is to the visual
description from our outside resource, using IR methods. We treat the butter-
fly’s visual description as a query, and the set of text blocks as a collection of
documents. The goal is to search for documents which are similar to the query
and assign each document a similarity score.
There are many different ways of computing text similarity, and we only
explore one of the simplest in this paper, namely a bag of words, frequency-based
vector model. It is a matter of future research to establish whether more sophis-
ticated methods (such as compositional methods) will improve performance fur-
ther. We represent each document as a vector of term frequencies (tf ). Separate
vocabularies are used per query, with the vocabulary size varying between 1649
to 9445. The vocabulary consists of all words from the document collection,
except common stopwords and Hapax legomena (words occurring only once).
Terms are case-normalised, tokenised by punctuation and Porter-stemmed [11].
We use the lnc.ltc weighting scheme of the SMART system [13], where the query
vector uses the log-weighted term frequency with idf-weighting, while the doc-
ument vector uses the log-weighted term frequency without idf-weighting. The
relevance score between a query and a document vector is computed using the
cosine similarity measure.
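A sketch of the lnc.ltc cosine score between a visual-description query and a text block is given below; the document-frequency map and collection size are assumed to be computed over the per-category text-block collection, and the function names are illustrative.

import math
from collections import Counter

def lnc_ltc_score(query_terms, block_terms, df, n_blocks):
    # Query (visual description) uses log-tf with idf ("ltc"); the
    # document (text block) uses log-tf without idf ("lnc"); cosine
    # normalisation is applied to both via the norms below.
    def log_tf(tokens):
        return {t: 1.0 + math.log(c) for t, c in Counter(tokens).items()}

    d = log_tf(block_terms)
    q = {t: w * math.log(n_blocks / df[t])
         for t, w in log_tf(query_terms).items() if t in df}

    q_norm = math.sqrt(sum(w * w for w in q.values()))
    d_norm = math.sqrt(sum(w * w for w in d.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    return dot / (q_norm * d_norm)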
7 Experimental Results
We evaluate the image rankings via precision at selected recall levels. We com-
pare our reranked images using visual descriptions to the Google ranking pro-
duced by name search only.
Statistics and Filtering Evaluation. Table 1 provides the statistics for our anno-
tations. The table shows the level of noise, where many images on the web
pages are unrelated to the butterfly category. Filtering via metadata dramati-
cally reduces the number of negative images without too strongly reducing the
number of positive ones. The cases where the number of negative images is high
after filtering are due to the categories being visually similar to other butterflies,
which often have been confused by the web page authors.
Baselines. We use the four base queries (using predominantly category names)
as independent baselines for evaluation. For each base query, we rank each image
according to the rank of its web page returned by Google followed by its order of
appearance on the web page. Images are filtered via category name appearance in
metadata just as in our method. We also compare the results with two additional
baselines, querying Google Images with (i) “Latin name”; (ii) “English name”
+ butterfly. These are ranked using the ranks returned by Google Images.
Fig. 5. Precision at selected levels of recall for the proposed method for ten butterfly
categories, compared to baseline queries. The recall (x-axis) is in terms of number of
images. For clarity we only show the precisions at selected recalls of up to 50 images.
1. Monarch Butterfly (Danaus plexippus) Description 3 1/2-4” (89-102 mm). Very large, with FW long and drawn out. Above, bright, burnt-orange with black veins and black margins sprinkled with white dots; FW tip broadly black interrupted by larger white and orange spots. Below, paler, duskier orange. 1 black spot appears between HW cell and margin on male above and below. Female darker with black veins smudged.

2. Description : Family: Nymphalidae, Brush-footed Butterflies view all from this family Description 3 1/2-4” (89-102 mm). Very large, with FW long and drawn out. Above, bright, burnt-orange with black veins and black margins sprinkled with white dots; FW tip broadly black interrupted by larger white and orange spots. Below, paler, duskier orange. 1 black spot appears between HW cell and margin on male above and below. Female darker with black veins smudged. Similar Species Viceroy smaller, has shorter wings and black line across HW. Queen and Tropic Queen are browner and smaller. Female Mimic has large white patch across black FW tips. ...

3. The wings are bright orange with black veins and black margin decorated with white spots. Female’s veins are thicker.

5. male bright orange w/oval black scent patch (for courtship) on HW vein above, and abdominal “hair-pencil;” female dull orange, more thickly scaled black veins

6. Description: This is a very large butterfly with a wingspan between 3 3/8 and 4 7/8 inches. The upperside of the male is bright orange with wide black borders and black veins. The hindwing has a patch of scent scales. The female is orange-brown with wide black borders and blurred black veins. Both sexes have white spots on the borders and the apex. There are a few orange spots on the tip of the forewings. The underside is similar to the upperside except that the tips of the forewing and hindwing are yellow-brown and the white spots are larger. The male is slightly larger than the female.

7. General description: Wings orange with black-bordered veins and black borders enclosing small white spots. Male with small black scent patch along inner margin. Ventral hindwing as above but paler yellow-orange and with more prominent white spots in black border. Female duller orange with wider black veins; lacks black scent patch on dorsal hindwing.

8. A large butterfly, mainly orange with black wing veins and margins, with two rows of white spots in the black margins. The Monarch is much lighter below on the hindwing, and males have a scent patch - a dark spot along the vein - in the center of the hindwing.

9. Wingspan: 3 1/2 to 4 inches Wings Open: Bright orange with black veins and black borders with white spots in the male. The male also has a small oval scent patch along a vein on each hind wing. The female is brownish-orange with darker veins Wings Closed: Forewings are bright orange, but hind wings are paler

...

16. The Monarch’s wingspan ranges from 3–4 inches. The upper side of the wings is tawny-orange, the veins and margins are black, and in the margins are two series of small white spots. The fore wings also have a few orange spots near the tip. The underside is similar but the tip of the fore wing and hind wing are yellow-brown instead of tawny-orange and the white spots are larger. The male has a black patch of androconial scales responsible for dispersing pheromones on the hind wings, and the black veins on its wing are narrower than the female’s. The male is also slightly larger.
Fig. 6. Top ranked images for Danaus plexippus, along with their corresponding
descriptions. A red border indicates that the image was misclassified. The first
description is almost identical to the eNature description (Color figure online).
The main mistakes made by our method can be attributed to (i) the web
pages themselves; (ii) our algorithm.
In the first case, the ambiguity of some web page layouts causes a misalign-
ment between text blocks and images. In addition, errors arise from mistakes
made by the page authors, for example confusing the Monarch (Danaus plexip-
pus) with the Viceroy butterfly (Limenitis archippus).
For mistakes caused by our algorithm, the first involves the text similarity
component. Apart from similar butterflies having similar visual descriptions,
some keywords in the text can also be used to describe non-butterflies, e.g.
“pale yellow” can be used to describe a caterpillar or butterfly wings. The second
mistake arises from text-image misalignment as a side-effect of the filtering step: there were cases where a butterfly image did not contain the butterfly name in its metadata while a caterpillar image on the same page did. Since the butterfly
image is discarded, the algorithm matches a text block with its next nearest
image – the caterpillar. This could have been rectified by not matching text
blocks associated with a previously discarded image, but it can be argued that
such text blocks might still be useful in certain cases, e.g. when the discarded
image is an advertisement and the next closest image is a valid image.
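To make the alignment heuristic concrete, the following is a minimal sketch of matching each text block to its nearest surviving image; the function name, tuple layout, and metadata field are illustrative assumptions rather than the implementation used in the paper.

```python
def align_blocks_to_images(text_blocks, images, category_name):
    """Match each text block to the nearest image that survived metadata filtering.

    text_blocks: list of (position, text) tuples in page order.
    images:      list of (position, metadata) tuples in page order.
    """
    # Keep only images whose metadata (alt text, filename, caption) mentions the category.
    kept = [(pos, meta) for pos, meta in images
            if category_name.lower() in meta.lower()]
    pairs = []
    for block_pos, block_text in text_blocks:
        if not kept:
            break
        # Nearest surviving image by distance in page layout order.
        img_pos, meta = min(kept, key=lambda pm: abs(pm[0] - block_pos))
        pairs.append((block_text, (img_pos, meta)))
    return pairs
```

As discussed above, a discarded butterfly image can cause a text block to fall through to the next closest image on the page, which is how the caterpillar mismatches arise.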
Figure 6 shows the top ranked images for Danaus plexippus, along with the
retrieved textual descriptions. All descriptions at early stages of recall are indeed
of Danaus plexippus. This shows that our proposed method performs exception-
ally well given sufficient textual descriptions. The two image misclassifications that remain are due to image-text misalignment, as described above.
8 Conclusion
We have proposed methods for automatically harvesting training images for
fine-grained object categories from the Web, using the category name and visual
descriptions. Our main contribution is the use of visual descriptions for querying
candidate web pages and reranking the collected images. We show that this
method often outperforms the frequently used approach of using just the category name on its own, with regard to precision at early stages of recall. In addition,
it retrieves further textual descriptions of the category.
Possible future work could explore different aspects: (i) exploring better lan-
guage models and similarity measures for comparing visual descriptions and web
page text; (ii) training generic butterfly/non-butterfly visual classifiers to further
filter or rerank the images; (iii) investigating whether the reranked training set
can actually induce better visual classifiers.
Acknowledgements. The authors thank Paul Clough and the anonymous reviewers
for their feedback on an earlier draft of this paper. This work was supported by the
EU CHIST-ERA D2K 2011 Visual Sense project (EPSRC grant EP/K019082/1) and
the Overseas Research Students Awards Scheme (ORSAS) for Josiah Wang.
References
1. Ba, J.L., Swersky, K., Fidler, S., Salakhutdinov, R.: Predicting deep zero-shot
convolutional neural networks using textual descriptions. In: Proceedings of the
IEEE International Conference on Computer Vision (2015)
2. Berg, T.L., Forsyth, D.A.: Animals on the web. In: Proceedings of the IEEE Con-
ference on Computer Vision & Pattern Recognition, vol. 2, pp. 1463–1470 (2006)
3. Collins, B., Deng, J., Li, K., Fei-Fei, L.: Towards scalable dataset construction:
an active learning approach. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV
2008, Part I. LNCS, vol. 5302, pp. 86–98. Springer, Heidelberg (2008)
4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale
hierarchical image database. In: Proceedings of the IEEE Conference on Computer
Vision & Pattern Recognition, pp. 248–255 (2009)
5. Elhoseiny, M., Saleh, B., Elgammal, A.: Write a classifier: Zero-shot learning using
purely textual descriptions. In: Proceedings of the IEEE Conference on Computer
Vision & Pattern Recognition (2013)
6. Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from
Google’s image search. In: Proceedings of the IEEE International Conference on
Computer Vision, vol. 2, pp. 1816–1823 (2005)
7. George, M., Ghanem, N., Ismail, M.A.: Learning-based incremental creation of web
image databases. In: Proceedings of the 12th IEEE International Conference on
Machine Learning and Applications (ICMLA 2013), pp. 424–429 (2013)
8. Krapac, J., Allan, M., Verbeek, J., Jurie, F.: Improving web-image search results
using query-relative classifiers. In: Proceedings of the IEEE Conference on Com-
puter Vision & Pattern Recognition, pp. 1094–1101 (2010)
9. Li, L.J., Wang, G., Fei-Fei, L.: OPTIMOL: Automatic Object Picture collecTion
via Incremental MOdel Learning. In: Proceedings of the IEEE Conference on Com-
puter Vision & Pattern Recognition, pp. 1–8 (2007)
10. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, pp. 722–729 (2008)
11. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
12. Schroff, F., Criminisi, A., Zisserman, A.: Harvesting image databases from the
Web. IEEE Trans. Pattern Anal. Mach. Intell. 33(4), 754–766 (2011)
13. Singhal, A., Salton, G., Buckley, C.: Length normalization in degraded text col-
lections. In: Proceedings of Fifth Annual Symposium on Document Analysis and
Information Retrieval, pp. 149–162 (1996)
14. Vijayanarasimhan, S., Grauman, K.: Keywords to visual categories: Multiple-
instance learning for weakly supervised object categorization. In: Proceedings of
the IEEE Conference on Computer Vision & Pattern Recognition (2008)
15. Wang, J., Markert, K., Everingham, M.: Learning models for object recognition
from natural language descriptions. In: Proceedings of the British Machine Vision
Conference, pp. 2.1-2.11. BMVA Press (2009)
16. Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.:
Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001. California Institute
of Technology (2010)
17. Zhou, N., Fan, J.: Automatic image-text alignment for large-scale web image index-
ing and retrieval. Pattern Recogn. 48(1), 205–219 (2015)
Do Your Social Profiles Reveal What Languages
You Speak? Language Inference
from Social Media Profiles
Yu Xu1(✉), M. Rami Ghorab1, Zhongqing Wang2, Dong Zhou3, and Séamus Lawless1
1 ADAPT Centre, Knowledge and Data Engineering Group, School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
{xuyu,rami.ghorab,seamus.lawless}@scss.tcd.ie
2 Natural Language Processing Lab, Soochow University, Suzhou, China
[email protected]
3 School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, China
[email protected]
Abstract. In the multilingual World Wide Web, it is critical for Web applica‐
tions, such as multilingual search engines and targeted international advertise‐
ments, to know what languages the user understands. However, online users are
often unwilling to make the effort to explicitly provide this information. Addi‐
tionally, language identification techniques struggle when a user does not use all
the languages they know to directly interact with the applications. This work
proposes a method of inferring the language(s) online users comprehend by
analyzing their social profiles. It is mainly based on the intuition that a user’s
experiences could imply what languages they know. This is nontrivial, however,
as social profiles are usually incomplete, and the languages that are regionally
related or similar in vocabulary may share common features; this makes the
signals that help to infer language scarce and noisy. This work proposes a
language and social relation-based factor graph model to address this problem.
To overcome these challenges, it explores external resources to bring in more
evidential signals, and exploits the dependency relations between languages as
well as social relations between profiles in modeling the problem. Experiments
in this work are conducted on a large-scale dataset. The results demonstrate the
success of our proposed approach in language inference and show that the
proposed framework outperforms several alternative methods.
1 Introduction
As a result of globalization and cultural openness, it has become common for modern-
day humans to speak multiple languages (polyglots) [1, 2]. Knowledge of what
languages the online user comprehends1 is becoming important for many Web applica‐
tions to enable effective information services. For example, knowing the user’s language
1
In this paper, comprehend means the user is able to grasp information in that language to a good extent.
information enables search engines to deliver the multilingual search services, machine
translation tools to identify optional target translation languages, and advertisers to serve
targeted international ads.
However, online users often choose not to explicitly provide their language infor‐
mation in real-world applications, even when they have a facilitated means to do so. For
example, based on our analysis of 50,575 user profiles on LinkedIn, we found that only
11 % of users specified the languages that they speak, even though there are input fields
available for this; Ghorab et al. [3] carried out an analysis where it was found that many
users of The European Library2 entered search queries in non-English languages and
browsed documents in those languages, without bothering to use the drop down menu
that allows them to change the interface language. Therefore, studies have proposed to automatically acquire users’ language information through Language Identification (LID) techniques [4]; these detect what languages a user comprehends by identifying the languages of the texts the user has read or written in their interaction history. However, the challenge faced by this approach is the common cold-start problem, where there is only a limited history of interactions available for a new user.
This work proposes the use of social profiles to infer the languages that a user
comprehends. This idea is mainly based on three points of observation: (1) today most
users maintain a profile on a number of Social Networking Sites (SNSs), such as Face‐
book, LinkedIn, which could include basic personal information like education and work
experience; (2) the social profile provides first-hand information about the user to Third
Party Applications (TPAs) of SNSs. For example, the popular social login techniques
mostly authorize a TPA to access a user’s social profile [5]; (3) there is a chance that
the information in a user’s social profile may implicitly suggest what languages the user
comprehends. For example, if a user has conducted academic studies in Germany, this
could imply that the user at least understands German to some extent.
Using automatic techniques to infer the user’s language can serve to overcome the
cold-start problem, and benefit numerous Web applications as mentioned previously.
This work is also a first step towards automatically inferring the user’s level of expertise
in languages that they comprehend. Furthermore, this research can be integrated with
other work in user profiling where numerous other characteristics of the user can be
automatically inferred, such as the user’s gender [6] or location [7].
It is straightforward to cast the task of user language inference from social profiles
as a standard text classification problem. In other words, predicting what languages a
user comprehends relies on features defined from textual information of the social
profiles, e.g. unigram features. However, the social profiles are usually incomplete, with
critical information sometimes missing. For example, the location information of the
work experience, which, in this research, is shown to be important evidence for language
inference, is only provided by about half of the users in the collected dataset. Moreover,
some languages can be mutually intelligible (i.e. speakers of different but related
languages can readily understand each other without intentional study) or regionally
related (i.e. multiple languages are spoken in one region). These languages may share
many common features, which makes it hard to identify the discriminative features
2
http://www.theeuropeanlibrary.org.
between them. Therefore, solely relying on the textual features to infer these languages
may yield unsatisfactory results.
To address these challenges, this work investigates three factors to better model the problem: (1) Textual attributes in social profiles. They provide fundamental evidence about what languages a user may comprehend. This work also attempts to exploit external resources to enhance the textual attributes, aiming to import more information that is associated with the user and may also reflect the user’s language information.
(2) Dependency relations between languages. Languages may be related to each other
in certain ways, e.g. mutually intelligible. This relation could reflect the possibility that
a user comprehends other languages based upon a language we know the user under‐
stands. (3) Social relations between users through their social profiles. It is reasonable
to assume that users with similar academic or professional backgrounds may compre‐
hend one or many of the same languages. This relation information between users could
be extracted from their social profiles. Finally, a language and social relation-based
factor graph model (LSR-FGM) is proposed which predicts the user language informa‐
tion under the collective influence of the three factors.
Experiments are conducted on a large-scale LinkedIn profile dataset. Results show
that LSR-FGM clearly outperforms several alternative models and is able to obtain an
F1-score of over 84 %. Experiments also demonstrate that every factor contributes from
a different perspective in the process of inference.
2 Related Work
uses the textual content of static social profiles. Although a few studies were based on
the content of user profiles, they mainly aimed to extract certain target information from
the profile, like special skills [19], summary sentences [20], rather than to infer hidden
knowledge about the user.
Due to the natural structure of social networks, the factor graph model attracts much
attention from researchers in representing and mining social relationships between users
in social networks [21, 22]. Tang et al. proposed to utilize the user interactions to infer
the social relationship between users in social networks [23]. They defined multiple
factors that related to the user relationship, and proposed a partially labeled pairwise
factor graph model to infer a target social relationship between users. In social network
analyses, Tang et al. defined several types of conformity from different levels, and
modeled the effects of these conformities on users’ online behavior using a factor graph
model [24]. Our work is different from previous studies in two aspects. First, our work
is based on self-managed profiles instead of the naturally connected social networks.
Second, this paper proposes a factor graph model that collectively exploits local textual
attributes in profiles and multiple types of relations between profiles to infer the user’s
language information.
This section first gives some background information on the inference problem and then
introduces two main challenges in modeling the problem as well as corresponding solu‐
tions. Finally, it gives the formal problem definition.
3.1 Background
In general, a social profile consists of multiple fields, each of which details a particular aspect of information about the user, such as education background or hobbies. Different platforms may use different fields to construct user profiles. Without loss of generality, this work considers three commonly used fields of the social profile for the language information inference problem (SNSs like Facebook and Google+ contain all three fields, but the mini-profiles in Twitter do not apply to our model); a sketch of such a profile is given after the list:
• Summary: Unstructured text where the users give a general introduction about them‐
selves. Because there is no structure restriction, the focus of this field varies from
user to user.
• Education Background: Structured text that details each study experience of the
user by subsections. Each subsection could include attributes like school name, study
major, etc.
• Work Experience: Structured text details the user’s work experience by subsections.
Each subsection could include attributes like company name, role, work period, etc.
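As an illustration only (the class and field names below are assumptions, not an API of any SNS), a profile with these three fields could be represented as:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    organisation: str          # school name or company name
    role: str                  # study major or job role
    location: str = ""         # often missing in real profiles
    period: str = ""

@dataclass
class SocialProfile:
    summary: str                                               # unstructured self-introduction
    education: List[Experience] = field(default_factory=list)  # study experiences
    work: List[Experience] = field(default_factory=list)       # work experiences
    known_languages: List[str] = field(default_factory=list)   # explicitly stated by only ~11 % of users
```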
In practice, the language information of some users is readily obtainable through
certain means, e.g., it can be explicitly stated in the user’s profile; or it can be easily
predicted from the user’s interactions with the system. Therefore the problem is how
to infer the language information of the remaining users based on: (1) the textual information of all the profiles; and (2) the known language information of the other users.
3
https://en.wikipedia.org/wiki/List_of_multilingual_countries_and_regions.
4
https://en.wikipedia.org/wiki/Lexical_similarity.
(2) Social relation between users. Although new users have no direct friendship/follow‐
ship with other users, they can be related through available information of their social
profiles. This work focuses on the same-experience relation, i.e., two profiles share a study
experience (studied the same major in the same institute) or a work experience (worked in the same role at the same company), to help inference. For example, the fact that two
users shared the same work experience may imply that they know a common language
because communication is needed between employees in the department of the company.
Thus, it is reasonable to assume that the users with the same-experience social relation are
likely to know a common language.
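A minimal sketch of extracting such same-experience edges, assuming each profile is reduced to a list of (organisation, role) pairs; the names and structure here are illustrative assumptions:

```python
from itertools import combinations

def same_experience_edges(profile_experiences):
    """profile_experiences: one list of (organisation, role) pairs per profile.

    Two profiles are connected if they share any (organisation, role) pair
    in their study or work history."""
    index = {}  # (organisation, role) -> set of profile ids that mention it
    for pid, experiences in enumerate(profile_experiences):
        for org, role in experiences:
            index.setdefault((org.lower(), role.lower()), set()).add(pid)
    edges = set()
    for members in index.values():
        edges.update(combinations(sorted(members), 2))
    return edges

# Example: profiles 0 and 2 both worked as "engineer" at "acme corp".
print(same_experience_edges([
    [("Acme Corp", "Engineer")],
    [("TCD", "Computer Science")],
    [("acme corp", "engineer"), ("Soochow University", "NLP")],
]))  # {(0, 2)}
```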
Therefore, this paper proposes a model that collectively considers the three factors
outlined above: enhanced textual attributes, language relation and social relation, to
model the problem of language inference using social profiles.
This section details the construction of the Language and Social Relation-based Factor
Graph Model (LSR-FGM) and proposes a method to learn the model.
The LSR-FGM collectively incorporates the three factors outlined above to better model
the problem of user language inference. Its basic idea is to define these relations among
users and languages using different factor functions in a graph. Thus, an objective
function can be defined based on the joint probability of all factor functions. Learning
the LSR-FGM is to estimate the model parameters, which can be achieved by maxi‐
mizing the log-likelihood objective function based on the observation information.
Below, we introduce the construction of the objective function in detail.
For simplicity, given a correlation node vi, we use vi(u) and vi(l) to represent the user and language of this node, respectively. Note that each correlation node vi in V is also associated with an attribute vector xi, which is derived from the attribute vector of user vi(u), and X is the
attribute matrix corresponding to V. Then, we have a graph G = (V, E, Y, X), in which the
value of a label yi depends on both the local attribute vector xi and the connections related
to vi. Thus, we have the following conditional probability distribution over G:
P(Y | X, E) (1)
According to the Bayes’ rule and assuming X ⊥ E in LSR-FGM, we can further have:
P(Y | X, E) ∝ P(X | Y) P(Y | E) (2)
in which P(X|Y) represents the probability of generating attributes X associated to all corre‐
lation nodes given their labels Y, and P(Y|E) denotes the probability of labels given all
connections between correlation nodes. It is reasonable to assume that the generative prob‐
ability of attributes given the label value of each correlation node is conditionally inde‐
pendent. Thus we can factorize Eq. (2) again:
P(Y | X, E) ∝ P(Y | E) ∏i P(xi | yi) (3)
where P(xi|yi) is the probability of generating attribute vector xi given label yi. Now the
problem is how to instantiate the probability P(Y|E) and P(xi|yi). In principle, they can be
instantiated in different ways. This work models them in a Markov random field, so the two
probabilities can be instantiated based on the Hammersley-Clifford theorem [25]:
(4)
(5)
in which, Z1 and Z2 are normalization factors. In Eq. (4), d is the length of the attribute
vector; a feature function f(xij,yi) is defined for each attribute j (the jth attribute) of corre‐
lation node vi for the language vi(l), and is the weight of attribute j for language
vi(l). In Eq. (5), ELANG and EEXP are edges between nodes in V through language
Local Textual Feature Functions: The unigram features of the textual information in
social profiles are used to build the attribute vector space, and they are also used as binary
features in the local feature function for each target language. For instance, if the profile
of a user contains the jth word of the attribute vector space and specifies she knows
language l, a feature f(l,j)(xij = 1, yi = 1) is defined and its value is 1; otherwise 0. This
feature definition strategy is commonly used in graphical models like Conditional Random
Field (CRF). Therefore, the conditional probability distribution P(X|Y) over G can be
obtained:
(6)
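A small sketch of the binary unigram feature construction described above (the tokenisation and names are simplifying assumptions):

```python
def unigram_features(profile_text, vocabulary):
    """Binary attribute vector over the unigram vocabulary for one profile."""
    tokens = set(profile_text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

# One binary feature f(l,j) per (language l, attribute j) pair is then active when
# attribute j is present in the profile and the profile is labeled with language l.
```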
Language Dependency Relation Factor: Any two nodes in V are connected through a language dependency relation if they are from the same user. If nodes vi and vj have a
language dependency connection, a language dependency relation factor is defined:
(7)
(8)
where γij represents the influence weight of node vj on vi through social relation factor.
Finally, LSR-FGM can be constructed based on the above formulation. By
combining Eqs. (3)-(8), we can define the objective likelihood function:
(9)
The last issue is to learn the LSR-FGM and to infer unknown label values YU in G. Learning
the LSR-FGM is to estimate a parameter configuration of θ from a given partially labeled
G, which maximizes the log-likelihood objective function L(θ) = log Pθ (YL|X, E), i.e.,
θ∗ = argmaxθ L(θ) = argmaxθ log Pθ(YL | X, E) (10)
This work uses a gradient descent method to solve the objective function. Taking γ as an example to explain how the parameters are learned, first the gradient of each γik with respect to the objective function L(θ) can be obtained:
∂L(θ)/∂γik = E[h(vi, vk)] − EPθ[h(vi, vk)] (11)
in which E[h(vi, vk)] is the expectation of the factor function h(vi, vk), i.e., the average value of h(vi, vk) over all the same-experience connections in the training data; the second term, EPθ[h(vi, vk)], is the expectation of h(vi, vk) under the distribution given by the estimated model. Similarly, the gradients of α and β can be derived.
As the graphical structure of G can be arbitrary and may contain cycles, it is intractable to calculate the second expectation directly. This work adopts Loopy Belief Propagation (LBP) to approximate the gradients, considering its ease of implementation and effectiveness [23]. In each iteration of the learning process, LBP is employed twice: once to estimate the marginal distribution of the unknown variables yi, and once to estimate the marginal distribution over all connections. Then, the parameters θ are updated with the obtained gradients and a given learning rate in each iteration.
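A schematic sketch of this learning loop, with loopy belief propagation and the expectation computations left as caller-supplied placeholders (all names here are assumptions for illustration, not the authors' code):

```python
def learn_lsr_fgm(graph, theta, lbp, observed_expectation, model_expectation,
                  learning_rate=0.05, n_iters=100):
    """Gradient ascent on the log-likelihood; `lbp` approximates the needed marginals.

    `lbp`, `observed_expectation` and `model_expectation` are caller-supplied
    callables -- this sketch only fixes the shape of the learning loop."""
    for _ in range(n_iters):
        # LBP pass 1: marginal distributions of the unknown labels y_i.
        node_marginals = lbp(graph, theta, over="nodes")
        # LBP pass 2: marginal distributions over all connections (edges).
        edge_marginals = lbp(graph, theta, over="edges")
        for name in theta:
            # Gradient = observed expectation of the factor - expectation under the model.
            grad = observed_expectation(graph, name) - model_expectation(
                node_marginals, edge_marginals, name)
            theta[name] += learning_rate * grad
    return theta
```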
It is clear that a LBP is employed to infer the unknown YU in the learning process.
Therefore, after convergence of the learning algorithm, all nodes in YU are labeled which
maximizes the marginal probabilities. Correspondingly, the language information of
unlabeled users is inferred.
This section first describes the dataset construction and the strategy of importing location
information associated with the profile from external resources. It then introduces the
comparative methods and, finally, presents and discusses the experimental results.
This work compares the LSR-FGM with the following methods of inferring what
languages users comprehend based on their social profiles:
1. RM (Rule-based method): For each language, this method maintains a full list of countries/regions where the language is used as an official language. These lists are constructed from the Wikipedia page5 that lists the official language(s) of each country. The method decides that a user comprehends a target language only if one of the country/region names in the corresponding list appears in her social profile (a sketch of this rule is given after the list).
2. RM-L: This method is almost the same as RM but the input attribute matrix is
enhanced with the external location information, i.e., the additional location attribute
of the institute.
5
http://en.wikipedia.org/wiki/List_of_official_languages.
3. SVM (Support Vector Machine): This method uses the attribute vector of correlation
nodes to train a classification model for each language, and predicts the language
information by employing the model. The method is implemented with the SVM-
light package6 (linear kernel).
4. SVM-L: This method is almost the same as SVM but the input attribute matrix is
enhanced with the external location information.
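A minimal sketch of the RM baseline described in item 1; the toy country lists and function name are illustrative assumptions (the real lists come from the cited Wikipedia page and are far more complete):

```python
# Toy excerpt of the official-language lists described above.
OFFICIAL_LANGUAGE_COUNTRIES = {
    "French": {"france", "belgium", "switzerland", "canada", "senegal"},
    "German": {"germany", "austria", "switzerland"},
    "Chinese": {"china", "singapore", "taiwan"},
}

def rm_predict(profile_text):
    """Return every language whose country/region list has a hit in the profile text."""
    text = profile_text.lower()
    return [lang for lang, countries in OFFICIAL_LANGUAGE_COUNTRIES.items()
            if any(country in text for country in countries)]
```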
The LSR-FGM infers user language information by collectively considering the local
textual attributes, user-user social relations and language-language dependency rela‐
tions. The enhanced attribute matrix is applied on this method.
Table 1. Performance of language inference with different methods on different languages (%)
6
http://svmlight.joachims.org/.
80.1 % F1-score on French. This demonstrates that the mix of related languages (French, German, and Spanish) in the target language set increases the inference difficulty for these languages. We then compare the LSR-FGM with the other methods on different languages. Table 1 shows that LSR-FGM outperforms all four methods (in terms of F1-score), but with varying degrees of improvement across languages. For example, for the target language Chinese, LSR-FGM achieves a significant improvement of +2.42 % (F1-score) compared with SVM-L (p-value < 0.05). By comparison, LSR-FGM significantly outperforms SVM-L by 11.38 % (F1-score) on French (p-value < 0.05). This difference in performance between Chinese and French reflects two aspects of information as discussed in Sect. 3. First, it again shows that the discriminative features of Hindi and Chinese are easier to capture since they hardly share common characteristics with other languages. Second, the relations between languages and profiles contribute significantly to distinguishing the related languages.
This subsection examines the contributions of the three factors defined in the LSR-FGM. Table 2 gives the overall performance (i.e., on all target languages) of the LSR-FGM when considering different factors. Specifically, the two relation factors (language dependency relation and user social relation) are first removed and only the attribute factor is kept; each of the relation factors is then added back into the model separately. The experimental results show that both relation factors are useful for the language inference task. They also indicate that the factors contribute from different perspectives. This is demonstrated by the fact that the LSR-FGM with all three factors outperforms the instances that only consider one relation factor. For example, the same-experience factor can help for those profiles in which only a few study/work experiences are given and not enough discriminative features are available for inferring language information; the language dependency relation factor contributes for multilingual users whose profiles only contain enough evidence about certain languages.
6 Conclusion
This work studies the novel problem of inferring what languages a user comprehends
based on their social profiles. This work precisely defines the problem and proposes a
language and social relation-based factor graph model. This model collectively
considers three factors in the inference process: textual attributes of the social profile;
dependency relations between target languages; and social relations between users.
Experiments on a real-world large-scale dataset show the success of the proposed model
in inferring user language information using social profiles, and demonstrate that each
one of the three factors makes a stand-alone contribution in the model from a different
aspect. In addition, this work proposes to obtain information reflecting users’ language
information from external resources in order to help the inference, which is shown to
be effective in the experiments. Future work involves exploiting more information
related to the user, and exploring more features from the available information to infer
the actual level of expertise that a user has in a language(s).
References
1. Tucker, R.: A global perspective on bilingualism and bilingual education. In: Georgetown
University Round Table on Languages and Linguistics, pp. 332–340 (1999)
2. Diamond, J.: The benefits of multilingualism. Sci. Wash. 330(6002), 332–333 (2010)
3. Ghorab, M., Leveling, J., Zhou, D., Jones, G.J., Wade, V.: Identifying common user behaviour
in multilingual search logs. In: Peters, C., Di Nunzio, G.M., Kurimo, M., Mandl, T., Mostefa,
D., Peñas, A., Roda, G. (eds.) CLEF 2009. LNCS, vol. 6241, pp. 518–525. Springer,
Heidelberg (2010)
4. Oakes, M., Xu, Y.: A search engine based on query logs, and search log analysis at the
university of Sunderland. In: Proceedings of the 10th Cross Language Evaluation Forum
(2009)
5. Kontaxis, G., Polychronakis, M., et al.: Minimizing information disclosure to third parties in
social login platforms. Int. J. Inf. Secur. 11(5), 321–332 (2012)
6. Burger, J.D., et al.: Discriminating gender on Twitter. In: EMNLP, pp. 1301–1309 (2011)
7. Li, R., Wang, S., Deng, H., et al.: Towards social user profiling: unified and discriminative
influence model for inferring home locations. In: SIGKDD, pp. 1023–1031 (2012)
8. Dunning, T.: Statistical identification of language. Technical Report MCCS 940–273,
Computing Research Laboratory, New Mexico State University (1994)
9. Xia, F., Lewis, W.D., Poon, H.: Language ID in the context of harvesting language data off
the web. In: EACL, pp. 870–878 (2009)
10. Martins, B., et al.: Language identification in web pages. In: SAC, pp. 764–768 (2005)
11. Stiller, J., Gäde, M., Petras, V.: Ambiguity of queries and the challenges for query language
detection. In: The proceedings of Cross Language Evaluation Forum (2010)
12. Carter, S., et al.: Microblog language identification: Overcoming the limitations of short,
unedited and idiomatic text. Lang. Resour. Eval. 47(1), 195–215 (2013)
13. Qiu, F., Cho, J.: Automatic identification of user interest for personalized search. In: WWW,
pp. 727–736 (2006)
14. White, R.W., Bailey, P., Chen, L.: Predicting user interests from contextual information.
In: SIGIR, pp. 363–370 (2009)
15. Liu, J., Dolan, P., Pedersen, E.R.: Personalized news recommendation based on click
behavior. In: IUI, pp. 31–40 (2010)
16. Xu, S., et al.: Exploring folksonomy for personalized search. In: SIGIR, pp. 155–162 (2008)
17. Provost, F., Dalessandro, B., Hook, R., et al.: Audience selection for on-line brand advertising:
privacy-friendly social network targeting. In: SIGKDD, pp. 707–716 (2009)
18. Mislove, A., Viswanath, B., Gummadi, K.P., Druschel, P.: You are who you know: inferring
user profiles in online social networks. In: WSDM, pp. 251–260 (2010)
19. Maheshwari, S., Sainani, A., Reddy, P.: An approach to extract special skills to improve the
performance of resume selection. In: Kikuchi, S., Sachdeva, S., Bhalla, S. (eds.) DNIS 2010.
LNCS, vol. 5999, pp. 256–273. Springer, Heidelberg (2010)
20. Wang, Z., Li, S., Kong, F., Zhou, G.: Collective personal profile summarization with social
networks. In: EMNLP, pp. 715–725 (2013)
21. Yang, Z., Cai, K., et al.: Social context summarization. In: SIGIR, pp. 255–264 (2011)
22. Dong, Y., Tang, J., Wu, S., et al.: Link prediction and recommendation across heterogeneous
social networks. In: ICDM, pp. 181–190 (2012)
23. Tang, W., Zhuang, H., Tang, J.: Learning to infer social ties in large networks. In: Gunopulos,
D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011, Part III. LNCS,
vol. 6913, pp. 381–397. Springer, Heidelberg (2011)
24. Tang, J., Wu, S., Sun, J.: Confluence: Conformity influence in large social networks.
In: SIGKDD, pp. 347–355 (2013)
25. Hammersley, J.M., Clifford, P.: Markov fields on finite graphs and lattices. Unpublished
Manuscript (1971)
Retrieving Hierarchical Syllabus Items for Exam
Question Analysis
1 Introduction
Educators use exams to evaluate their students’ understanding of material, to
measure whether teaching methodologies help or hurt, or to be able to compare
students across different programs. While there are many issues with exams
and evaluations that could be and are being explored, we are interested in the
question of coverage – whether an evaluation is complete, in the sense that it
covers all the aspects or concepts that the designer of the evaluation hoped
to cover.
We consider classifying multiple-choice questions into a known concept hier-
archy. In our use case, an educator would upload or enter an exam into our
system, and each question would be assigned to a category from the hierarchy.
The results would allow the educator to understand and even visualize how the
questions that make up the exam cover the overall hierarchy, making it possible
to determine if this coverage achieves their goals for the examination: are all
important topics covered?
This problem is traditionally treated as one of manual question creation
and labeling, where an official, curated set of tests has been created and are
to be used widely or repeatedly. Educators who use that exam are guaranteed
“appropriate” coverage of the material. However, this centralized approach is
only a partial solution to the problem of understanding coverage of exams since
every institution and almost every teacher or professor is likely to have their
own assignments, their own quizzes, their own exams. The global exam does not
help those educators understand how their own material fits into the known set
of topics.
For this study, our dataset is a medium-sized corpus of test questions clas-
sified into the American Chemical Society (ACS) hierarchy developed by their
exams institute [12]. This dataset has been used for educational research [11,17],
but as these are actual exams used by educators, it is not available publicly.
The problem is interesting because the hierarchy is crisply but very sparsely
described and the questions are very short, on par with the size of microblog
entries. In existing text classification datasets with a hierarchical components
(e.g., Wikipedia categories, the Enron email folder dataset [14], and the Yahoo!
Directory or Open Directory Project [26]) all of the labeled documents are quite
dense, the categories were created with various levels of control, and the resulting
categories are likely to be overlapping. In contrast, all of our information is
sparse, the categories themselves were designed by experts in the field, and part
of their goal was to have questions fall into a single category.
In this study, we explore methods for classifying exam questions into a con-
cept hierarchy using information retrieval methods. We show that the best tech-
nique leverages both document expansion and concept-aware ranking methods,
and that exploiting the structure of the questions is helpful on its own, but is not shown to provide an additional advantage in conjunction with our other approaches on this dataset.
Ideally this work would be repeated on additional sets of questions with their
own hierarchy to show its broad applicability; unfortunately, such questions are
carefully guarded1 and difficult to come by so demonstrating the results on
another dataset must be left for future work.
Although our evaluation dataset is not open, we believe the results will apply
to any comparable collection of exam questions categorized into a known hierar-
chy and we hope that our success in this task will encourage other educators and
institutions to open up their data and new problems to our community. Our key
approach leverages structure present in this kind of dataset that is not available
in standard retrieval collections, but we hope to explore its generality in future
work.
1
Even most standardized tests require test-takers to sign agreements not to distribute
or mention the questions, even after the exam is taken.
2 Related Work
The problem we tackle in this study is the classification of short text passages into a concept hierarchy, sometimes with interaction. The classification of short texts is relevant even though we do not have sufficiently balanced training labels for our task. Additional prior work involves interactive techniques as well as hierarchical retrieval models.
Our domain is exam questions in chemistry. We have found very little existing
work within this domain of education-motivated IR. Omar et al. [20] develop a
rule-based system for classifying questions into a taxonomy of learning objectives
(do students have knowledge, do they comprehend, etc.) rather than topics. They
work with a small set of computer programming exam questions to develop the
rules but do not actually evaluate their utility for any task.
The problem of question classification [18,30] seems related but refers to
categorizing informational questions into major categories such as who, where,
what, or when.
There is a huge body of literature on the well known problem of text classifi-
cation, with a substantial amount devoted to classifying short passages of text.
We sketch the approaches of a sample of that work to give an idea of the major
approaches. Rather than attempt to cover it here, we refer the interested reader
to the survey by Aggarwal and Zhai [1].
Sun et al. [26] considers a problem similar to ours, classifying short web
page descriptions into the Open Directory Project’s hierarchy. In their work,
classification is done in two steps: the 15 categories most similar to the text
are selected from the larger set of over 100,000 categories, and then an SVM
is used to build a classifier for just those 15 categories so that the text can be
categorized. Their category descriptions are selected by tf·idf comparison as well
as using “explicit semantic analysis” [8]. Following related earlier work by Xue
et al. [29], they represent an inner node of the hierarchy by its own content as
well as that of its descendants. We represent leaf nodes by the content of their
ancestors as well as their descendants, and try this in conjunction with document
expansion.
Ren et al. [23] consider the problem of classifying a stream of tweets into an
overlapping concept hierarchy. They treat the problem as classification rather
than ranking, and do not explore interactive possibilities. They expand the short
texts using embedded links and references to named entities and address topic drift using time-aware topic modeling, approaches that have little utility when
processing exam questions. Banerjee et al. [3] effectively expand text by retriev-
ing articles from Wikipedia and using the titles of those articles as features. By
contrast, we expand text using an unlabeled set of questions – that is, compara-
ble instances of the items we are classifying, having found Wikipedia not to be helpful in such a focused domain.
The hierarchical retrieval models we propose and evaluate in this work draw
inspiration from hierarchical classification. They also share some similarities with
cluster-based retrieval [16]: just as a document can be represented by its terms and those of its cluster, we will represent nodes based on their features and the features belonging to their parents. Hierarchical language models show
up in the task of expert finding as well, given the hierarchy of employees in the
company [2,21]. Our task differs from expert retrieval in that the elements of
our hierarchy are precisely defined by their own descriptions, but do not interact
with documents in any way.
Lee et al. present an early work on leveraging a hierarchy in the form of a knowledge-base graph, constructed mostly of “is-a” relationships [15]. Ganesan et al. present work on exploiting hierarchical relationships between terms or objects to compute similarity between objects that are expressed in terms of elements in the hierarchy [9]; while relevant, this would be of more use if we were trying to match exams to other exams.
Each of the nodes described in the hierarchy has a succinct description, but only the nodes in Table 1 have titles, for example:
X. Visualization. Chemistry constructs meaning interchangeably at the par-
ticulate and macroscopic levels.
X.A.2.a. Schematic drawings can depict key concepts at the particulate level
such as mixtures vs. pure substance, compounds vs. elements, or dissociative
processes.
There are ten nodes at the first level, as already discussed, 61 at the level
below that, 124 at the third level, and 258 leaf nodes. Of the middling nodes,
there are between 1 and 10 children assigned to each, with most of the weight
belonging to 1, 2 and 3 (72, 59, and 37 respectively). The average length of a
node description is 18.3 terms, and there are 16.2 distinct terms per node.
The exam question has three parts. The context, “sulfuric acid is an important catalyzer”, presents the background for the question, giving the details that are needed to know what the question means and how to pick an answer. The question statement itself is “how do I recover the remaining sulfuric acid?”. In many cases, a single context will occur with several different questions, a factor that complicates simple comparison of the entire exam question. Finally, the exam question has the answers, usually multiple choice and usually with only one of them being a correct answer. We did not find question fields to be helpful in the presence of our other, less domain-specific ideas.
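For concreteness, the three-part structure could be represented as below; the class and field names are illustrative assumptions, and the answer options are placeholders rather than ACS content:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExamQuestion:
    context: str          # shared background; may be reused by several questions
    statement: str        # the actual question being asked
    answers: List[str]    # multiple-choice options, usually exactly one correct
    label: str = ""       # leaf node of the concept hierarchy, when known

q = ExamQuestion(
    context="Sulfuric acid is an important catalyzer ...",
    statement="How do I recover the remaining sulfuric acid?",
    answers=["option A", "option B", "option C", "option D"],
)
```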
The ACS dataset includes 1593 total questions, distributed across 23 exams,
with an average of 69 questions per exam. One exam has only 58 questions, and
the largest exam has 80.
The most frequently tested concepts are tested tens of times over all these exams; the most frequent occurs 47 times – on average twice per exam for 23 exams. This most common node belongs to the “experimental” sub-tree, and discusses the importance of schematic drawings in relation to key concepts. It is one of the more general nodes we have inspected. The other most frequent concepts include “quantitative relationships and conversions,” “moles,” and “molarity”.
The labeled data itself is highly skewed overall. There are 65 nodes that have ten or more questions labeled as belonging to them. There are 62 nodes that have only a single question and another 29 that have only two questions – the number of rarely tested nodes is the reason we choose to eschew supervised approaches in this work.
4 Evaluation Measures
Our task is ultimately to classify an exam question into the correct leaf node of
the concept hierarchy. In part to support reasonable interactive assistance, we
treat this as a ranking problem. That is, rather than identify a single category
for a question, we generate a ranked list of them and evaluate where the correct
category appears in the list.
An individual question’s ranking is measured by two metrics. We use recip-
rocal rank (RR), the inverse of the rank at which the correct category is
found. If there are multiple correct categories (uncommon), the first one encoun-
tered in the list determines RR. We also use normalized discounted cumulative
gain (NDCG) as implemented in the Galago search engine3 and formulated by Järvelin and Kekäläinen [13].
2
User Fiire; http://chemistry.stackexchange.com/questions/4250. This example is displayed in lieu of the proprietary ACS data.
3
http://lemurproject.org/galago.php.
        score(E) = (1/|E|) Σe∈E (1/|Qe|) Σq∈Qe m(q)
where e is a single exam from E, the set of 23 exams, Qe is the set of questions on exam e, and m(q) is either RR, NDCG, or P@1 for a query. This mean of averages is a macro-averaged score. We investigated whether micro-averaging (with each question treated equally rather than as part of an exam) made a difference, but there was no effect on the outcome of any experiment. As a result, we only report the score as described above.
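A small sketch of this macro-averaging, assuming each exam is given as a list of per-question metric values m(q):

```python
def macro_average(exams):
    """exams: list of exams, each a list of per-question metric values (RR, NDCG or P@1)."""
    per_exam_means = [sum(scores) / len(scores) for scores in exams]
    return sum(per_exam_means) / len(per_exam_means)

# Example: two exams with reciprocal-rank values per question.
print(macro_average([[1.0, 0.5, 0.25], [1.0, 1.0]]))  # mean of 0.583... and 1.0
```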
Recall that the exam questions include three parts: the context, the statement,
and the answers. We hypothesized that this structure could be leveraged to
improve matching of exams. Indeed, the context can appear in multiple questions
that are categorized differently, so although it is important, it also may be a
distractor. We define the QCM similarity between two questions as:
Hierarchical Node Scoring. The score of a leaf node given a query is given
by a retrieval model. As mentioned above, we use the SDM approach for these
experiments. However, if a query matches a leaf node well but does not match
the parent of the leaf node, the match is suspect and should be down-weighted.
To accommodate that, we use a hierarchical SDM scoring approach.
4
The beta version of chemistry.stackexchange.com.
We first define an operator that returns the ancestors of a node, A(N ), exclud-
ing the root itself. This operator is defined inductively, using the operator P (n)
that returns the parent of node n.
        A(N) = ∅                  if N is a root node
        A(N) = N ∪ A(P(N))        otherwise
We choose to exclude the root node because it has no description in our hierarchy.
Given the set of ancestors A(N) of any node N, we can assign a joint score to a node based upon its own score and those of its ancestors. If SDM(q, N) is the SDM score for node N with query q, then:
        H-SDM(q, N) = Σn∈A(N) SDM(q, n)
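A compact sketch of H-SDM scoring; the node representation (a `parent` attribute, `None` at the root) and the `sdm_score` callable are assumptions, and any retrieval model could be plugged in:

```python
def ancestors(node):
    """Return the node together with its ancestors, excluding the root (which has no description)."""
    chain = []
    while node is not None and node.parent is not None:  # stop before the root
        chain.append(node)
        node = node.parent
    return chain

def h_sdm(query, leaf, sdm_score):
    """Hierarchical SDM: sum the retrieval score of a leaf node and all of its ancestors."""
    return sum(sdm_score(query, n) for n in ancestors(leaf))
```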
6 Interactive Methods
In this section we consider the possibility that the person using our algorithm would provide a small amount of information – perhaps indicating which top-level sub-tree is appropriate for the instance being considered, which we treat as hierarchical relevance feedback; we contrast this with typical relevance feedback over our first 10 results. We expect that while users cannot remember hundreds of nodes in total, a working familiarity with the first level of the hierarchy (see Table 1) will be easier to learn and leverage in an interactive setting.
For each question to be classified, we simulate hierarchy feedback by removing
concepts from our ranked list if they are not under the same top-level node in
the concept hierarchy as the question. That is, we are simulating the case where
a user selects the correct top-level category, so any candidates in other sub-
trees can be automatically discarded. Table 3 shows that this simple approach
(“Hierarchy”) provides a substantial gain over no interaction, though it is not
as helpful as having the correct node selected from the top 10.
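The simulated hierarchy feedback amounts to a one-line filter over the ranked candidate list; `top_level` here is an assumed helper that returns a node's first-level ancestor:

```python
def apply_hierarchy_feedback(ranked_nodes, correct_top_level, top_level):
    """Drop every candidate whose top-level ancestor differs from the one the user selected."""
    return [n for n in ranked_nodes if top_level(n) == correct_top_level]
```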
While the results of this experiment may seem obvious, in lieu of having
a user study to determine which of these techniques is easier, quantifying the
gains that can be made with this kind of feedback is important. In the case
of users familiar with the hierarchy, we expect that we can get a positive gain
using both techniques, and for users who are less familiar with the hierarchy (we
doubt anyone remembers all 258 leaf nodes), the ranking methods will hopefully
provide a much smaller candidate set.
7 Conclusion
In this work we explored the challenge our users face of classifying exam questions
into a concept hierarchy, but we explored it from an IR perspective due to the
scarcity of available labels and our desire to incorporate feedback. This problem
was difficult because the exam questions are short and often quite similar and
because the concepts in the hierarchy had quite short descriptions. We explored
existing approaches, such as document expansion and typical retrieval models,
as well as our own methods – especially a hierarchical transform for existing
retrieval models that works well, and a model of question structure that provides
gains over most baselines.
We hope that our promising results encourage more collaboration between
education and information retrieval research, specifically in the identification
and exploration of new tasks and datasets that may benefit both fields.
In future work, we hope to explore this problem with other subjects, more
exams, and with expert humans in the loop to field-test the feasibility and helpfulness of our overall retrieval methods and our interactive methods.
Acknowledgments. The authors thank Prof. Thomas Holme of Iowa State University’s
Department of Chemistry for making the data used in this study available and Stephen
Battisti of UMass’ Center for Educational Software Development for help accessing and
formatting the data.
This work was supported in part by the Center for Intelligent Information Retrieval
and in part by NSF grant numbers IIS-0910884 and DUE-1323469. Any opinions, findings
and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect those of the sponsors.
References
1. Aggarwal, C.C., Zhai, C.: A survey of text classification algorithms. In: Aggarwal,
C.C., Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer, New York (2012)
2. Balog, K., Azzopardi, L., de Rijke, M.: A language modeling framework for expert
finding. Inf. Process. Manage. 45(1), 1–19 (2009)
3. Banerjee, S., Ramanathan, K., Gupta, A.: Clustering short texts using wikipedia.
In: SIGIR 2007, New York, NY, USA. ACM (2007)
4. Bekkerman, R., Raghavan, H., Allan, J., Eguchi, K.: Interactive clustering of text
collections according to a user-specified criterion. In: Proceedings of IJCAI, pp.
684–689 (2007)
5. de Melo, G., Weikum, G.: Taxonomic data integration from multilingual wikipedia
editions. Knowl. Inf. Syst. 39(1), 1–39 (2014)
6. Dumais, S., Chen, H.: Hierarchical classification of web content. In: SIGIR 2000,
pp. 256–263. ACM, New York, NY, USA (2000)
7. Efron, M., Organisciak, P., Fenlon, K.: Improving retrieval of short texts through
document expansion. In: SIGIR 2012, pp. 911–920. ACM (2012)
8. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-
based explicit semantic analysis. IJCAI 7, 1606–1611 (2007)
9. Ganesan, P., Garcia-Molina, H., Widom, J.: Exploiting hierarchical domain struc-
ture to compute similarity. ACM Trans. Inf. Syst. 21(1), 64–93 (2003)
10. Hoi, S.C., Jin, R., Lyu, M.R.: Large-scale text categorization by batch mode active
learning. In: WWW 2006, pp. 633–642. ACM (2006)
11. Holme, T.: Comparing recent organizing templates for test content between ACS
exams in general chemistry and AP chemistry. J. Chem. Edu. 91(9), 1352–1356
(2014)
12. Holme, T., Murphy, K.: The ACS exams institute undergraduate chemistry anchor-
ing concepts content Map I: general chemistry. J. Chem. Edu. 89(6), 721–723
(2012)
13. Järvelin, K., Kekäläinen, J.: IR evaluation methods for retrieving highly relevant
documents. In: SIGIR 2000, pp. 41–48. ACM (2000)
14. Klimt, B., Yang, Y.: Introducing the enron corpus. In: CEAS (2004)
15. Lee, J.H., Kim, M.H., Lee, Y.J.: Information retrieval based on conceptual distance
in IS-A hierarchies. J. Doc. 49(2), 188–207 (1993)
16. Liu, X., Croft, W.B.: Cluster-based retrieval using language models. In: SIGIR
2004, pp. 186–193. ACM, New York, NY, USA (2004)
17. Luxford, C.J., Linenberger, K.J., Raker, J.R., Baluyut, J.Y., Reed, J.J., De Silva,
C., Holme, T.A.: Building a database for the historical analysis of the general
chemistry curriculum using ACS general chemistry exams as artifacts. J. Chem.
Edu. 92, 230–236 (2014)
18. Metzler, D., Croft, W.: Analysis of statistical question classification for fact-based
questions. Inf. Retr. 8(3), 481–504 (2005)
19. Metzler, D., Croft, W.B.: A markov random field model for term dependencies. In:
SIGIR 2005, pp. 472–479. ACM (2005)
20. Omar, N., Haris, S.S., Hassan, R., Arshad, H., Rahmat, M., Zainal, N.F.A.,
Zulkifli, R.: Automated analysis of exam questions according to bloom’s taxon-
omy. Procedia - Soc. Behav. Sci. 59, 297–303 (2012)
21. Petkova, D., Croft, W.B.: Hierarchical language models for expert finding in enter-
prise corpora. Int. J. Artif. Intell. Tools 17(01), 5–18 (2008)
22. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval.
In: SIGIR 1998, pp. 275–281. ACM (1998)
23. Ren, Z., Peetz, M.-H., Liang, S., van Dolen, W., de Rijke, M.: Hierarchical multi-
label classification of social text streams. In: SIGIR 2014, pp. 213–222. ACM,
New York, NY, USA (2014)
24. Settles, B.: Active learning literature survey. Technical report, University of
Wisconsin-Madison, Computer Sciences Technical report 1648, January 2010
25. Singhal, A., Pereira, F.: Document expansion for speech retrieval. In: SIGIR 1999,
pp. 34–41. ACM (1999)
26. Sun, X., Wang, H., Yu, Y.: Towards effective short text deep classification. In:
SIGIR 2011, pp. 1143–1144. ACM, New York, NY, USA (2011)
27. Tao, T., Wang, X., Mei, Q., Zhai, C.: Language model information retrieval with
document expansion. In: NAACL 2006, pp. 407–414. ACL (2006)
28. Wang, P., Hu, J., Zeng, H.-J., Chen, Z.: Using wikipedia knowledge to improve
text classification. Knowl. Inf. Syst. 19(3), 265–281 (2009)
29. Xue, G.-R., Xing, D., Yang, Q., Yu, Y.: Deep classification in large-scale text
hierarchies. In: SIGIR 2008, pp. 619–626. ACM, New York, NY (2008)
30. Zhang, D., Lee, W.S.: Question classification using support vector machines. In:
SIGIR 2003, pp. 26–32. ACM, New York, NY, USA (2003)
Collaborative Filtering
Implicit Look-Alike Modelling in Display Ads
Transfer Collaborative Filtering to CTR Estimation
1 Introduction
higher targeting accuracy and thus brings more customers to the advertisers [29].
The current user profiling methods include building keyword and topic distribu-
tions [1] or clustering users onto a (hierarchical) taxonomy [29]. Normally, these
inferred user interest segments are then used as target restriction rules or as
features leveraged in predicting users’ ad response [32].
However, the two-stage profiling-and-targeting mechanism is not optimal (despite its advantage of explainability). First, there is no flexible relationship between the inferred tags or categories: two potentially correlated interest segments are regarded as separate and independent. For example, users who like cars tend to love sports as well, but these two segments are totally separated in the user targeting system. Second, the first stage, i.e., building the user interest segments, is performed independently and with little attention to its later use in ad response prediction [7,29], which is suboptimal. Third, the effective tag system or taxonomy structure could evolve over time, which makes it difficult to keep it up to date.
In this paper, we propose a novel framework to implicitly and jointly learn the users’ profiles from both their general web browsing behaviours and their ad response behaviours. Specifically, (i) instead of building an explicit and fixed tag system or taxonomy, we propose to directly map each user, webpage, and ad into a latent space where the shape of the mapping is automatically learned; (ii) the users’ profiles on general browsing and ad response behaviour are jointly learned from the heterogeneous data of these two scenarios (or tasks); (iii) with a maximum a posteriori framework, the knowledge from the users’ browsing behaviour similarity can be naturally transferred to their ad response behaviour modelling, which in turn improves the prediction of the users’ ad response. For instance, our model can automatically discover that users with common behaviour on www.bbc.co.uk/sport tend to click automobile ads. Due to its implicit nature, we call the proposed model implicit look-alike modelling.
Comprehensive experiments on a real-world large-scale dataset from a com-
mercial display ad platform demonstrate the effectiveness of our proposed model
and its superiority over other strong baselines. Additionally, with our model, it is
straightforward to analyse the relationship between different features and which
features are critical and cost-effective when performing transfer learning.
2 Related Work
and latent vector models [20,30] are proposed to catch the data non-linearity
and interactions between features. Recently, the authors of [12] proposed to first
learn combination features from gradient boosting decision trees (GBDT) and,
based on the tree leaves as features, learn a factorisation machine (FM) [23] to
build feature interactions to improve ad click prediction performance.
Collaborative Filtering (CF) on the other hand is a technique for person-
alised recommendation [26]. Instead of exploring content features, it learns the
user or/and item similarity based on their interactions. Besides the user(item)-
based approaches [25,28], latent factor models, such as probabilistic latent seman-
tic analysis [10], matrix factorisation [13] and factorisation machines [23], are
widely used model-based approaches. The key idea of the latent factor models is
to learn a low-dimensional vector representation of each user and item to capture the
observed user-item interaction patterns. Such latent factors have good generali-
sation and can be leveraged to predict the users’ preference on unobserved items
[13]. In this paper, we explore latent models of collaborative filtering to model user
browsing patterns and use them to infer users’ ad click behaviour.
Transfer Learning deals with learning problems where the training data of the target task is expensive to obtain, or easily outdated, by transferring knowledge learned from other tasks [21]. It has been proven to work on a variety
of problems such as classification [6], regression [16] and collaborative filtering
[15]. Different from multi-task learning, where the data from different tasks are
assumed to be drawn from the same distribution [27], transfer learning methods may allow for arbitrary source and target tasks. In the online advertising field, the
authors in a recent work [7] proposed a transfer learning scheme based on logistic
regression prediction models, where the parameters of ad click prediction model
were restricted with a regularisation term from the ones of user web browsing
prediction model. In this paper, we consider it as one of the baselines.
– Web Browsing Prediction (CF Task). Each user’s online browsing behav-
iour is logged as a list containing previously visited publishers (domains or
URLs). A common task of using the data is to leverage collaborative filtering
(CF) [23,28] to infer the user’s profile, which is then used to predict whether
the user is interested in visiting any given new publisher. Formally, we denote
the dataset for CF as Dc and an observation is denoted as (xc , y c ) ∈ Dc ,
where xc is a feature vector containing the attributes from the user and the
publisher and y c is the binary label indicating whether the user visits the
publisher or not.
This paper focuses on the latter task: ad response prediction. We, however,
observe that although they are different prediction tasks, the two tasks share
a large proportion of users, publishers and their features. We can thus build a
user-publisher interest model jointly from the two tasks. Typically we have a
large number of observations about user browsing behaviours and we can use
the knowledge learned from publisher CF recommendation to help infer display
advertising CTR estimation.
3.2 CF Prediction
For the CF task, we use a factorisation machine [23] as our prediction model. We
further define the features $x^c \equiv (x^u, x^p)$, where $x^u \equiv \{x_i^u\}$ is the set of features for a user and $x^p \equiv \{x_j^p\}$ is the set of features for a publisher.¹

¹ All the features studied in our work are one-hot encoded binary features.
The parameter $\Theta \equiv (w_0^c, w^c, V^c)$, where $w_0^c \in \mathbb{R}$ is the global bias term and $w^c \in \mathbb{R}^{I^c + J^c}$ is the weight vector of the $I^c$-dimensional user features and $J^c$-dimensional publisher features. Each user feature $x_i^u$ or publisher feature $x_j^p$ is associated with a $K$-dimensional latent vector $v_i^c$ or $v_j^c$. Thus $V^c \in \mathbb{R}^{(I^c + J^c)\times K}$.
With such a setting, the conditional probability for CF in Eq. (1) can be reformulated as:

$$\prod_{(x^c,y^c)\in D^c} P(y^c \mid x^c; \Theta) = \prod_{(x^u,x^p,y^c)\in D^c} P(y^c \mid x^u, x^p; w_0^c, w^c, V^c). \qquad (2)$$

Let $\hat{y}^c_{u,p}$ be the predicted probability of whether user u will be interested in visiting publisher p. With the FM model, the likelihood of observing the label $y^c$ given the features $(x^u, x^p)$ and parameters is

$$P(y^c \mid x^u, x^p; w_0^c, w^c, V^c) = (\hat{y}^c_{u,p})^{y^c} \cdot (1 - \hat{y}^c_{u,p})^{(1-y^c)}, \qquad (3)$$

where the prediction $\hat{y}^c_{u,p}$ is given by an FM with a logistic function:

$$\hat{y}^c_{u,p} = \sigma\Big(w_0^c + \sum_i w_i^c x_i^u + \sum_j w_j^c x_j^p + \sum_i \sum_j \langle v_i^c, v_j^c\rangle\, x_i^u x_j^p\Big), \qquad (4)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function and $\langle\cdot,\cdot\rangle$ is the inner product of two vectors: $\langle v_i, v_j\rangle \equiv \sum_{f=1}^{K} v_{i,f} \cdot v_{j,f}$, which models the interaction between a user feature i and a publisher feature j.
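As an illustration of Eq. (4), the following minimal sketch (NumPy only; all variable names and the toy parameters are hypothetical, not the authors' implementation) computes the FM visit probability for one user-publisher pair from the indices of their active one-hot features.

```python
import numpy as np

def fm_predict(w0, w, V, user_feats, pub_feats):
    """Logistic factorisation machine score in the spirit of Eq. (4).

    w0 : float, global bias.
    w  : (I+J,) first-order weights for user and publisher features.
    V  : (I+J, K) latent vectors, one K-dimensional row per feature.
    user_feats, pub_feats : lists of active (one-hot) feature indices.
    """
    score = w0 + w[user_feats].sum() + w[pub_feats].sum()
    # pairwise interactions between every active user feature and
    # every active publisher feature: <v_i, v_j> * x_i * x_j
    for i in user_feats:
        for j in pub_feats:
            score += V[i] @ V[j]
    return 1.0 / (1.0 + np.exp(-score))   # sigmoid

# toy usage with random parameters
rng = np.random.default_rng(0)
w0, w, V = 0.0, rng.normal(size=20), rng.normal(scale=0.1, size=(20, 4))
print(fm_predict(w0, w, V, user_feats=[1, 5], pub_feats=[12]))
```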
[Figure: graphical model of the joint framework. The left plate ($|D^c|$ instances over the $I^c + J^c$ user and publisher features) depicts the CF task with parameters $w_0^c$, $w^c$, $V^c$ and their Gaussian priors; the right plate ($|D^r|$ instances over the $I^r + J^r$ user/publisher features and $L^r$ ad features) depicts the CTR task with parameters $w_0^r$, $w^{r,a}$, $V^{r,a}$, whose priors are tied to the CF counterparts through the variances $\sigma^2_{w_d}$ and $\sigma^2_{V_d}$.]
where $\hat{y}^r_{u,p,a}$ is modelled by interactions among the 3-side features:

$$\hat{y}^r_{u,p,a} = \sigma\Big(w_0^r + \sum_i w_i^r x_i^u + \sum_j w_j^r x_j^p + \sum_l w_l^r x_l^a + \sum_i \sum_j \langle v_i^r, v_j^r\rangle\, x_i^u x_j^p + \sum_i \sum_l \langle v_i^r, v_l^r\rangle\, x_i^u x_l^a + \sum_j \sum_l \langle v_j^r, v_l^r\rangle\, x_j^p x_l^a\Big). \qquad (7)$$
$$w^r \sim \mathcal{N}(w^c, \sigma^2_{w_d} I), \qquad (8)$$

where $\sigma^2_{w_d}$ is the assumed variance of the Gaussian generation process between
each pair of feature weights of CF and CTR tasks and the weight generation is
assumed to be independent across features. Similarly, the latent vectors of CTR
task are assumed to be generated from the counterparts of CF task:
work together to infer our CF task target $y^c$, i.e., whether the user would visit a specific publisher or not. The right part illustrates the CTR task. Corresponding to the CF task, $w^r$ and $V^r$ here represent user and publisher features' weights and latent vectors, while $w^{r,a}$ and $V^{r,a}$ are separately depicted to represent ad features' weights and latent vectors. All these factors work together to predict the CTR task target $y^r$, i.e., whether the user would click the ad or not. On top of that, for each (user or publisher) feature i of the CF task, its weight $w_i^c$ and latent vector $v_i^c$ act as a prior of the counterparts $w_i^r$ and $v_i^r$ in the CTR task while learning the model.
Considering that the datasets of the two tasks might be seriously unbalanced, we choose to focus on the averaged log-likelihood of generating each data instance from the two tasks. In addition, we add a hyperparameter α to balance the relative importance of the tasks. As such, the joint conditional likelihood in Eq. (1) is written as

$$\prod_{(x^c,y^c)\in D^c} P(y^c \mid x^c; \Theta)^{\frac{\alpha}{|D^c|}} \cdot \prod_{(x^r,y^r)\in D^r} P(y^r \mid x^r; \Theta)^{\frac{1-\alpha}{|D^r|}} \qquad (10)$$
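As a small illustration of Eq. (10), the sketch below (hypothetical names, NumPy only; not the authors' code) evaluates the α-weighted, per-instance-averaged joint objective in its negative log form, which is what a gradient-based learner would minimise.

```python
import numpy as np

def joint_neg_log_likelihood(p_cf, y_cf, p_ctr, y_ctr, alpha):
    """Averaged joint objective in the spirit of Eq. (10), in negative-log form.

    p_cf, y_cf  : predicted probabilities and binary labels on the CF set D^c.
    p_ctr, y_ctr: predicted probabilities and binary labels on the CTR set D^r.
    alpha       : weight balancing the relative importance of the two tasks.
    """
    eps = 1e-12
    ll_cf = np.mean(y_cf * np.log(p_cf + eps) + (1 - y_cf) * np.log(1 - p_cf + eps))
    ll_ctr = np.mean(y_ctr * np.log(p_ctr + eps) + (1 - y_ctr) * np.log(1 - p_ctr + eps))
    return -(alpha * ll_cf + (1 - alpha) * ll_ctr)
```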
Moreover, from the graphical model, the prior of the model parameters can be specified as
on which task the instance belongs to, as given in Eq. (11). The detailed gradient for each specific parameter can be calculated routinely and is thus omitted here due to the page limit.
4 Experiments
4.1 Dataset
² It is common to perform negative down-sampling to balance the labels in ad CTR estimation [9]. Calibration methods [3] are then leveraged to eliminate the model bias.
fixed, we train the CTR task using Eqs. (11) and (13). Note that α in Eq. (11)
is still a hyperparameter for this method.
– DisjointLR: The transfer learning model proposed in [7], considered a state-of-the-art transfer learning method in display advertising. In this work, both the source and target tasks adopt logistic regression as the behaviour prediction model, which uses a linear model to minimise the logistic loss over each observation sample:
In our context of regarding the CF task as source task and CTR task as target
task, the learning objectives are listed below:
$$\text{CF task:}\quad w^{c\,*} = \arg\min_{w^c} \sum_{(x^c,y^c)\in D^c} L_{w^c}(x^c, y^c) + \lambda \|w^c\|_2^2 \qquad (16)$$

$$\text{CTR task:}\quad w^{r\,*} = \arg\min_{w^r} \sum_{(x^r,y^r)\in D^r} L_{w^r}(x^r, y^r) + \lambda \|w^r - w^{c\,*}\|_2^2. \qquad (17)$$
Besides the difference between the linear LR and the non-linear FM, this method is a two-stage learning scheme, where the first stage, Eq. (16), is disjoint from the second stage, Eq. (17). Thus we denote it as DisjointLR (a minimal sketch of this two-stage scheme is given after this list).
– Joint: Our proposed model, as summarised in Eq. (1), which performs the
transfer learning when jointly learning the parameters on the two tasks.
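As an illustration of the two-stage DisjointLR scheme of Eqs. (16) and (17), the following sketch (NumPy only; the data, hyperparameters and function names are hypothetical) fits a logistic regression on the CF task with an L2 penalty towards zero and then a second one on the CTR task penalised towards the CF solution.

```python
import numpy as np

def fit_logreg(X, y, w_prior, lam, lr=0.1, epochs=200):
    """Gradient descent for logistic regression with an L2 penalty that pulls
    the weights towards `w_prior`, as in Eqs. (16)-(17).
    X: (n, d) binary features, y: (n,) labels in {0, 1}."""
    w = w_prior.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / len(y) + 2 * lam * (w - w_prior)
        w -= lr * grad
    return w

# Stage 1 (CF task): the prior is the zero vector, Eq. (16).
# Stage 2 (CTR task): the prior is the CF solution w_c, Eq. (17).
# X_cf, y_cf, X_ctr, y_ctr are toy datasets sharing the same d features.
d = 10
rng = np.random.default_rng(1)
X_cf, y_cf = rng.integers(0, 2, (500, d)).astype(float), rng.integers(0, 2, 500)
X_ctr, y_ctr = rng.integers(0, 2, (200, d)).astype(float), rng.integers(0, 2, 200)
w_c = fit_logreg(X_cf, y_cf, np.zeros(d), lam=0.01)
w_r = fit_logreg(X_ctr, y_ctr, w_c, lam=0.01)
```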
4.5 Results
Basic Setting Performance. Figure 2 presents the AUC and RMSE performance of Base, Disjoint and Joint, and the improvement of Joint, against the hyperparameter α in Eq. (11), based on the basic experiment setting. As can be observed clearly, for a large region of α, i.e., [0.1, 0.7], Joint consistently outperforms the baselines Base and Disjoint on both AUC and RMSE, which demonstrates the effectiveness of our model in transferring knowledge from webpage browsing data to ad click data. Note that when α = 0, the CF side model $w^c$ is not learned, but Joint still outperforms Disjoint and Base. This is due to the different priors of $w^r$ and $V^r$ in Joint compared with those of Disjoint and Base. In addition, when α = 1, i.e., with no learning on the CTR task, the performance of Joint reasonably falls back to the initial guess, i.e., both AUC and RMSE are 0.5.
Table 1 shows the transfer learning performance comparison between Joint
and the state-of-the-art DisjointLR with both models setting optimal hyperpa-
rameters. The improvement of Joint over DisjointLR indicates the success of (1)
the joint optimisation on the two tasks to perform knowledge transfer and (2)
the non-linear factorisation machine relevance model on catching feature inter-
actions.
Appending Side Information Performance. From the Joint model as in Eq. (11) we see that when α is large, e.g., 0.8, a larger weight is allocated to the CF task when optimising the joint likelihood. As such, if a large value of α leads to the optimal CTR estimation performance, it means the transfer learning takes effect. With this method, we try adding different features into the Joint model and obtain the optimal hyperparameter α leading to the highest AUC, to check whether a certain feature helps transfer learning. On the contrary, if a low or zero value of α leads to the optimal performance of the Joint model when adding a certain feature, it means such a feature has no effect on the transfer learning. Table 2 collects the AUC improvement of the Joint model for the conducted experiments. We observe that the user browsing hour and the ad slot position in the webpage are the most valuable features for helping transfer learning, while the user screen size does not bring any transfer value. When adding all these features into the Joint model, the optimal α is around 0.5 for the AUC improvement and 0.6 for the RMSE drop (see Fig. 3), which means these features, along with the basic
user, webpage IDs provide an overall positive value of knowledge transfer from
webpage browsing behaviour to ad click behaviour.
5 Conclusion
Acknowledgement. We would like to thank Adform for allowing us to use their data
in experiments. We would also like to thank Thomas Furmston for his feedback on the
paper. Weinan thanks Chinese Scholarship Council for the research support.
References
1. Ahmed, A., Low, Y., Aly, M., Josifovski, V., Smola, A.J.: Scalable distributed
inference of dynamic user interests for behavioral targeting. In: KDD (2011)
2. Broder, A., Fontoura, M., Josifovski, V., Riedel, L.: A semantic approach to con-
textual advertising. In: SIGIR, pp. 559–566. ACM (2007)
3. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning
algorithms. In: ICML, pp. 161–168. ACM (2006)
4. Chapelle, O.: Modeling delayed feedback in display advertising. In: KDD, pp. 1097–
1105. ACM (2014)
5. Chapelle, O., et al.: A simple and scalable response prediction for display adver-
tising. ACM Trans. Intell. Syst. Technol. (TIST) 5(4), 61 (2013)
6. Dai, W., Xue, G.R., Yang, Q., Yu, Y.: Transferring naive bayes classifiers for text
classification. In: AAAI (2007)
7. Dalessandro, B., Chen, D., Raeder, T., Perlich, C., Han Williams, M., Provost, F.:
Scalable hands-free transfer learning for online advertising. In: KDD (2014)
8. Graepel, T., Candela, J.Q., Borchert, T., Herbrich, R.: Web-scale bayesian click-
through rate prediction for sponsored search advertising in microsoft’s bing search
engine. In: ICML, pp. 13–20 (2010)
9. He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R.,
Bowers, S., et al.: Practical lessons from predicting clicks on ads at facebook. In:
ADKDD, pp. 1–9. ACM (2014)
10. Hofmann, T.: Collaborative filtering via gaussian probabilistic latent semantic
analysis. In: SIGIR, pp. 259–266. ACM (2003)
11. Jebara, T.: Machine Learning: Discriminative and Generative, vol. 755. Springer
Science & Business Media, New York (2012)
12. Juan, Y.C., Zhuang, Y., Chin, W.S.: 3 Idiots Approach for Display Advertising
Challenge. Internet and Network Economics, pp. 254–265. Springer, Heidelberg
(2011)
13. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender
systems. Computer 42(8), 30–37 (2009)
14. Lee, K., Orten, B., Dasdan, A., Li, W.: Estimating conversion rate in display
advertising from past performance data. In: KDD, pp. 768–776. ACM (2012)
15. Li, B., Yang, Q., Xue, X.: Transfer learning for collaborative filtering via a rating-
matrix generative model. In: ICML, pp. 617–624. ACM (2009)
16. Liao, X., Xue, Y., Carin, L.: Logistic regression with an auxiliary data source. In:
ICML, pp. 505–512. ACM (2005)
17. Mangalampalli, A., Ratnaparkhi, A., Hatch, A.O., Bagherjeiran, A., Parekh, R.,
Pudi, V.: A feature-pair-based associative classification approach to look-alike
modeling for conversion-oriented user-targeting in tail campaigns. In: WWW, pp.
85–86. ACM (2011)
18. McAfee, R.P.: The design of advertising exchanges. Rev. Ind. Organ. 39(3), 169–
185 (2011)
19. Muthukrishnan, S.: Ad exchanges: research issues. In: Leonardi, S. (ed.) WINE
2009. LNCS, vol. 5929, pp. 1–12. Springer, Heidelberg (2009)
Implicit Look-Alike Modelling in Display Ads 601
20. Oentaryo, R.J., Lim, E.P., Low, D.J.W., Lo, D., Finegold, M.: Predicting response
in mobile advertising with hierarchical importance-aware factorization machine.
In: WSDM (2014)
21. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng.
22(10), 1345–1359 (2010)
22. PricewaterhouseCoopers: IAB internet advertising revenue report (2014). Accessed
29 July 2015. http://www.iab.net/media/file/PwC IAB Webinar Presentation
HY2014.pdf
23. Rendle, S.: Factorization machines. In: ICDM, pp. 995–1000. IEEE (2010)
24. Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-
through rate for new ads. In: WWW, pp. 521–530. ACM (2007)
25. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering
recommendation algorithms. In: WWW, pp. 285–295. ACM (2001)
26. Schafer, J.B., Frankowski, D., Herlocker, J., Sen, S.: Collaborative filtering recom-
mender systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web
2007. LNCS, vol. 4321, pp. 291–324. Springer, Heidelberg (2007)
27. Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: a
survey. J. Mach. Learn. Res. 10, 1633–1685 (2009)
28. Wang, J., De Vries, A.P., Reinders, M.J.: Unifying user-based and item-based col-
laborative filtering approaches by similarity fusion. In: SIGIR (2006)
29. Yan, J., Liu, N., Wang, G., Zhang, W., Jiang, Y., Chen, Z.: How much can behav-
ioral targeting help online advertising? In: WWW, pp. 261–270. ACM (2009)
30. Yan, L., Li, W.J., Xue, G.R., Han, D.: Coupled group lasso for web-scale ctr
prediction in display advertising. In: ICML, pp. 802–810 (2014)
31. Yuan, S., Wang, J., Zhao, X.: Real-time bidding for online advertising: measure-
ment and analysis. In: ADKDD, p. 3. ACM (2013)
32. Zhang, W., Yuan, S., Wang, J.: Real-time bidding benchmarking with ipinyou
dataset. arXiv preprint arXiv:1407.7073 (2014)
Efficient Pseudo-Relevance Feedback Methods
for Collaborative Filtering Recommendation
1 Introduction
Recommender systems are recognised as a key instrument to deliver relevant
information to the users. Although the problem that attracts most attention
in the field of Recommender Systems is accuracy, the emphasis on efficiency is
increasing. We present new Collaborative Filtering (CF) algorithms. CF methods
exploit the past interactions between items and users. Common approaches to
CF are based on nearest neighbours or matrix factorisation [17]. Here, we focus
on probabilistic techniques inspired by Information Retrieval methods.
A growing body of literature has been published on applying techniques from
Information Retrieval to the field of Recommender Systems [1,5,14,19–21]. These
papers model the recommendation task as an item ranking task with an implicit
query [1]. A very interesting approach is to formulate the recommendation prob-
lem as a profile expansion task. In this way, the users’ profiles can be expanded with
relevant items in the same way in which queries are expanded with new terms. An
effective technique for performing automatic query expansion is Pseudo-Relevance
Feedback (PRF). In [4,14,18], the authors proposed the use of PRF as a CF
method. Specifically, they adapted a formal probabilistic model designed for PRF
(Relevance-Based Language Models [12]) for the CF recommendation task. The
reported experiments showed a superior performance of this approach, in terms of
precision, compared to other recommendation methods such as the standard user-
based neighbourhood algorithm, SVD and several probabilistic techniques [14].
These improvements can be understood if we look at the foundations of Relevance-
Based Language Models since they are designed for generating a ranking of terms
(or items in the CF task) in a principled way. Meanwhile, other methods aim to
predict the users’ ratings. However, it is worth mentioning that Relevance-Based
Language Models also outperform other probabilistic methods that focus on top-N
recommendation [14].
Nevertheless, the authors in [14] did not analyse the computational cost of
generating recommendations within this probabilistic framework. For these rea-
sons, in this paper we analyse the efficiency of the Relevance-Based Language
Modelling approach and explore other PRF methods [6] that have a better trade-
off between effectiveness and efficiency and, at the same time, do not require any type of smoothing, as is required in [14].
The contributions of this paper are: (1) the adaptation of four efficient
Pseudo-Relevance Feedback techniques (Rocchio’s weights, Robertson Selection
Value, Chi-Squared and Kullback-Leibler Divergence) [6] to CF recommenda-
tion, (2) the conception of a new probability estimate that takes into account
the length of the neighbourhood in order to improve the accuracy of the rec-
ommender system and (3) a critical study of the efficiency of these techniques
compared to the Relevance-Based Language Models as well as (4) the analysis
of the recommenders from the point of view of the ranking quality, the diversity
and the novelty of the suggestions. We show that these new models improve
the trade-off between accuracy and diversity/novelty and provide a fast way for
computing recommendations.
2 Background
extracts from them the best term candidates for query expansion and performs a
second search with the expanded query.
The goal of a recommender is to choose for each user of the system (u ∈ U)
items that are relevant from a set of items (I). Given the user u, the output of
the recommender is a personalised ranked list Lku of k elements. We denote by
Iu the set of items rated by the user u. Likewise, the set of users that rated the
item i is denoted by Ui .
The adaptation of the PRF procedure for the CF task [14] is as follows.
Within the PRF framework, the users of the system are analogous to queries in
IR. Thus, the ratings of the target user act as the query terms. The goal is to
expand the original query (i.e., the profile of the user) with new terms that are
relevant (i.e., new items that may be of interest to the user). To perform the query expansion process, a pseudo-relevant set of documents is necessary, from which the expansion terms are extracted. In the context of recommender systems, the neighbours of the target user play the role of pseudo-relevant documents. Therefore, similar users are used to extract items that are candidates to expand the user profile. These candidate items form the recommendation list.
Parapar et al. [14] experimented with both estimates of the Relevance-Based
Language Models [12]: RM1 and RM2. However, as Eqs. 1 and 2 show, they are
considerably expensive. For each user u, they compute a relevance model Ru and
they estimate the relevance of each item i under it, p(i|Ru ). Vu is defined as the
neighbourhood of the user u. The prior probabilities, p(v) and p(i), are consid-
ered uniform. In addition, the conditional probability estimations, pλ (i|v) and
pλ (j|v), are obtained interpolating the Maximum Likelihood Estimate (MLE)
with the probability in the collection using Jelinek-Mercer smoothing controlled
by the parameter λ (see Eq. 3). More details can be found in [14].
$$\text{RM1:}\quad p(i|R_u) \propto \sum_{v\in V_u} p(v)\, p_\lambda(i|v) \prod_{j\in I_u} p_\lambda(j|v) \qquad (1)$$

$$\text{RM2:}\quad p(i|R_u) \propto p(i) \prod_{j\in I_u} \sum_{v\in V_u} \frac{p_\lambda(i|v)\, p(v)}{p(i)}\, p_\lambda(j|v) \qquad (2)$$

$$p_\lambda(i|u) = (1-\lambda)\,\frac{r_{u,i}}{\sum_{j\in I_u} r_{u,j}} + \lambda\,\frac{\sum_{u\in U} r_{u,i}}{\sum_{u\in U,\, j\in I} r_{u,j}} \qquad (3)$$
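As an illustration of Eqs. (2) and (3), the sketch below (hypothetical names, NumPy only; not the original implementation) ranks items for a target user with RM2. With the uniform priors p(v) and p(i) assumed above, these priors cancel in the ranking and are therefore omitted.

```python
import numpy as np

def p_lambda(R, lam):
    """Eq. (3): Jelinek-Mercer smoothed p_lambda(i|v) for every user v and item i.
    R: (n_users, n_items) rating matrix with 0 for unrated items."""
    R = np.asarray(R, dtype=float)
    user_mass = R.sum(axis=1, keepdims=True)
    p_mle = np.divide(R, user_mass, out=np.zeros_like(R), where=user_mass > 0)
    p_coll = R.sum(axis=0) / R.sum()
    return (1 - lam) * p_mle + lam * p_coll

def rm2_scores(R, u, neighbours, lam=0.5):
    """RM2 relevance (Eq. 2) of every item for user u, in log space; uniform
    priors cancel, leaving prod_{j in I_u} sum_{v in V_u} p(i|v) p(j|v)."""
    P = p_lambda(R, lam)                  # (n_users, n_items)
    Pn = P[neighbours]                    # rows of the neighbourhood V_u
    rated = np.flatnonzero(R[u])          # I_u, items rated by the target user
    scores = np.zeros(R.shape[1])
    for j in rated:
        scores += np.log(Pn.T @ Pn[:, j] + 1e-12)   # sum over v of p(i|v) p(j|v)
    return scores

# toy usage: top-5 unseen items for user 0 given a hypothetical neighbourhood
rng = np.random.default_rng(4)
R = rng.integers(0, 6, size=(40, 60)).astype(float)
scores = rm2_scores(R, u=0, neighbours=np.array([2, 5, 9]))
scores[R[0] > 0] = -np.inf               # filter already-rated items
print(np.argsort(scores)[::-1][:5])
```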
Rocchio’s Weights. This method is based on the Rocchio’s formula [16]. The
assigned score is computed as the sum of the weights for each term of the pseudo-
relevant set. This approach promotes highly rated items in the neighbourhood.
$$p_{Rocchio}(i|u) = \sum_{v\in V_u} \frac{r_{v,i}}{|V_u|} \qquad (4)$$
Chi-Squared (CHI-2). This method roots in the chi-squared statistic [6]. The
probability in the neighbourhood plays the role of the observed frequency and
the probability in the collection is the expected frequency.
$$p_{CHI\text{-}2}(i|u) = \frac{\big(p(i|V_u) - p(i|C)\big)^2}{p(i|C)} \qquad (6)$$
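As an illustration of Eqs. (4) and (6), the following sketch (hypothetical names, NumPy only) scores candidate items for a target user given a precomputed neighbourhood. The maximum-likelihood estimates used here for p(i|V_u) and p(i|C) are an assumption consistent with the MLE formulation of Eq. (3), not necessarily the exact estimates used in the evaluation.

```python
import numpy as np

def rocchio_scores(R, neighbours):
    """Eq. (4): average rating of each item over the user's neighbourhood.
    R: (n_users, n_items) rating matrix (0 = unrated); neighbours: user indices."""
    return R[neighbours].sum(axis=0) / len(neighbours)

def chi2_scores(R, neighbours):
    """Eq. (6): squared deviation of the neighbourhood item distribution from
    the collection distribution, normalised by the latter (chi-squared style)."""
    p_neigh = R[neighbours].sum(axis=0) / R[neighbours].sum()
    p_coll = R.sum(axis=0) / R.sum()
    return (p_neigh - p_coll) ** 2 / np.maximum(p_coll, 1e-12)

# toy usage: recommend the top-5 unseen items for user 0
rng = np.random.default_rng(2)
R = rng.integers(0, 6, size=(50, 100)).astype(float)
neigh = np.array([3, 7, 19])                   # hypothetical k-NN of user 0
scores = chi2_scores(R, neigh)
scores[R[0] > 0] = -np.inf                     # filter already-rated items
print(np.argsort(scores)[::-1][:5])
```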
This improvement does not make sense for the RSV item ranking function because the ranking would be the same (the scores would simply be rescaled by a constant); however, it can be useful for the CHI-2 and KLD methods, as can be seen in Sect. 5.
5 Evaluation
We used three film datasets from GroupLens1 : MovieLens 100k, MovieLens 1M
and MovieLens 10M, for the efficiency experiment. Additionally, we used the
R3-Yahoo! Webscope Music 2 dataset and the LibraryThing 3 book collection for
the effectiveness tests. The details of the collections are gathered in Table 1. We
used the splits provided by the collections. However, since Movielens 1M and
LibraryThing do not offer predefined partitions, we selected 80 % of the ratings
of each user for the training subset whilst the rest is included in the test subset.
by the target user. It has been acknowledged that considering non-rated items
as irrelevant may underestimate the true metric value (since non-rated items
can be of interest to the user); however, it provides a better estimation of the
recommender quality [2,13].
The employed metrics are evaluated at a specified cut-off rank, i.e., we con-
sider only the top k recommendations of the ranking for each user because these
are the ones presented to the user. For assessing the quality of the ranking we
employed nDCG. This metric uses graded relevance of the ratings for judging the
ranking quality. Values of nDCG increase when highly relevant documents are
located in the top positions of the ranking. We used the standard formulation
as described in [22]. We also employed the complement of the Gini index for
quantifying the diversity of the recommendations [9]. The index is 0 when only
a single item is recommended for every user. On the contrary, a value of 1 is
achieved when all the items are equally recommended among the users. Finally,
to measure the ability of a recommender system to generate unexpected recom-
mendations, we computed the mean self-information (MSI) [25]. Intuitively, the
value of this metric increases when unpopular items are recommended.
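As an illustration of the ranking metric, the sketch below (hypothetical names, NumPy only) computes nDCG@k with exponential gains; this is one common formulation, and the exact variant described in [22] may differ in details such as the gain function.

```python
import numpy as np

def ndcg_at_k(ranked_rels, ideal_rels, k=10):
    """nDCG@k with exponential gains (a common formulation; details may differ
    from the exact variant of [22]).
    ranked_rels: graded relevance (test ratings) of the recommended items in rank order.
    ideal_rels : all relevance values available for the user (for the ideal DCG)."""
    def dcg(rels):
        rels = np.asarray(rels, dtype=float)[:k]
        discounts = np.log2(np.arange(2, len(rels) + 2))
        return np.sum((2 ** rels - 1) / discounts)
    idcg = dcg(sorted(ideal_rels, reverse=True))
    return dcg(ranked_rels) / idcg if idcg > 0 else 0.0

# e.g. a ranking whose top-3 recommended items have test ratings 5, 0 and 3
print(ndcg_at_k([5, 0, 3], ideal_rels=[5, 4, 3, 3, 1]))
```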
5.2 Baselines
To assess the performance of the proposed recommendation techniques, we chose
a representative set of state-of-the-art recommenders. We used a standard user-
based neighbourhood CF algorithm (labelled as UB): the neighbours are com-
puted using k-NN with Pearson’s correlation as the similarity measure [8]. We
also tested Singular Value Decomposition (SVD), a matrix factorisation tech-
nique which is among the best methods for rating prediction [11]. Additionally,
we included an algorithm which has its roots in the IR probabilistic modelling
framework [20], labelled as UIR-Item. Finally, as the strongest baselines, we chose
the RM1 and RM2 models [14]. Instead of employing Jelinek-Mercer smoothing
as it was originally proposed [14], we used Absolute Discounting because recent
studies showed that it is more effective and stable than Jelinek-Mercer [18].
Fig. 1. Recommendation time per user (in logarithmic scale) using UIR-Item (UIR),
RM1, RM2, RSV, Rocchio’s Weights (RW), CHI-2 and KLD algorithms with NMLE
as the probability estimate on the MovieLens 100k, 1M and 10M collections.
The obtained nDCG@10 values are reported in Table 2 with statistical signif-
icance tests (two-sided Wilcoxon test with p < 0.05). Generally, RM2 is the best
recommender algorithm as it was expected—better probabilistic models should
lead to better results. Nevertheless, it can be observed that in the R3-Yahoo!
dataset, the best nDCG values of our efficient PRF methods are not statistically
different from RM2. Moreover, in the LibraryThing collection, many of the pro-
posed models significantly outperform RM2 with important improvements. This
may be caused by the sparsity of the collections, which suggests that RM2
is too complex to perform well under this more common scenario. Additionally,
although we cannot improve the nDCG figures of RM2 on the MovieLens 100k,
we significantly surpass the other baselines.
In most of the cases, the proposals that use collection statistics (i.e., KLD and
the CHI-2 methods) tend to perform better than those that only use neighbour-
hood information (Rocchio’s Weights and RSV). Regarding the proposed neigh-
bourhood length normalisation, the experiments show that NMLE improves the
ranking accuracy compared to the regular MLE in the majority of the cases.
Thus, the evidence supports the idea that the size of the users’ neighbourhoods
is an important factor to model in a recommender system.
Now we take the best baselines (UIR-Item and RM2) and our best proposal
(CHI-2 with NMLE) in order to study the diversity and novelty of the top ten
recommendations. Note that we use the same rankings which were optimized for
nDCG@10. The values of Gini@10 and MSI@10 are presented in Tables 3 and 4,
respectively. In the case of Gini, we cannot perform paired significance analysis
since it is a global metric.
Table 3. Gini@10 values of UIR-Item, RM2 and CHI-2 with NMLE (optimised for
nDCG@10). Values in bold indicate the best recommender for each dataset. Sig-
nificant differences are indicated with the same criteria as in Table 2.
Table 4. MSI@10 values of UIR-Item, RM1, RM2 and CHI-2 with NMLE (optimised
for nDCG@10). Values in bold indicate the best recommender for each dataset.
Significant differences are indicated with the same criteria as in Table 2.
6 Related Work
Exploring Information Retrieval (IR) techniques and applying them to Recom-
mender Systems is an interesting line of research. In fact, in 1992, Belkin and
Croft already stated that Information Retrieval and Information Filtering (IF)
are two sides of the same coin [1].

Fig. 2. Values of the G-measure in the MovieLens 100k collection plotted against the size of the neighbourhood (k), for the nDCG@10-MSI@10 (left) and the nDCG@10-Gini@10 (right) trade-offs.

Recommenders are automatic IF systems: their
responsibility lies in selecting relevant items for the users. Consequently, besides
the work of Parapar et al. on applying Relevance-Based Language Models to
CF recommendation [14], there is a growing amount of literature about different
approaches that exploit IR techniques for recommendation [5,19–21].
Wang et al. derived user-based and item-based CF algorithms using the clas-
sic probability ranking principle [20]. They also presented a probabilistic rel-
evance framework with three models [21]. Also, Wang adapted the language
modelling scheme to CF using a risk-averse model that penalises less reliable
scores [19].
Another approach is the one formulated by Bellogı́n et al. [5]. They devised
a general model for unifying memory-based CF methods and text retrieval algo-
rithms. They show that many IR methods can be used within this framework
obtaining better results than classic CF techniques for the item ranking task.
Relevance-Based Language Models were also adapted to CF in a different
manner. Bellogı́n et al. [4] formulate the formation of user neighbourhoods as a
query expansion task. Then, by using the negative cross entropy ranking princi-
ple, they used the neighbours to compute item recommendations.
Since Relevance Models [12] are an effective tool for item recommendation [14],
the aim of this work was to assess if other faster PRF methods could be used
for the same task. The results of this investigation revealed that, indeed, simpler
and more efficient PRF techniques are suitable for this CF task. We have car-
ried out experiments that showed that the proposed recommendation algorithms
(Rocchio’s Weigths, RSV, KLD and CHI-2) are orders of magnitude faster than
the Relevance Models for recommendation. These alternatives offer important
improvements in terms of computing time while incurring, in some cases, in a
modest decrease of accuracy. Furthermore, these methods lack of parameters:
they only rely on the neighbourhood information. In a large-scale scenario, a
speed-up of 200x can lead to notable savings in computational resources.
References
1. Belkin, N.J., Croft, W.B.: Information filtering and information retrieval: two sides
of the same coin? Commun. ACM 35(12), 29–38 (1992)
2. Bellogı́n, A., Castells, P., Cantador, I.: Precision-oriented evaluation of recom-
mender systems. In: RecSys 2011, p. 333. ACM (2011)
3. Bellogı́n, A., Parapar, J.: Using graph partitioning techniques for neighbour selec-
tion in user-based collaborative filtering. In: RecSys 2012, pp. 213–216. ACM (2012)
4. Bellogı́n, A., Parapar, J., Castells, P.: Probabilistic collaborative filtering with
negative cross entropy. In: RecSys 2013, pp. 387–390. ACM (2013)
5. Bellogı́n, A., Wang, J., Castells, P.: Bridging memory-based collaborative filtering
and text retrieval. Inf. Retr. 16(6), 697–724 (2013)
6. Carpineto, C., de Mori, R., Romano, G., Bigi, B.: An information-theoretic app-
roach to automatic query expansion. ACM Trans. Inf. Syst. 19(1), 1–27 (2001)
7. Coggeshall, F.: The arithmetic, geometric, and harmonic means. Q. J. Econ. 1(1),
83–86 (1886)
8. Desrosiers, C., Karypis, G.: A comprehensive survey of neighborhood-based rec-
ommendation methods. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.)
Recommender Systems Handbook, pp. 107–144. Springer, Heidelberg (2011)
Efficient Pseudo-Relevance Feedback Methods 613
9. Fleder, D., Hosanagar, K.: Blockbuster culture’s next rise or fall: the impact of
recommender systems on sales diversity. Manage. Sci. 55(5), 697–712 (2009)
10. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative
filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004)
11. Koren, Y., Bell, R.: Advances in collaborative filtering. In: Ricci, F., Rokach, L.,
Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 145–186.
Springer, Heidelberg (2011)
12. Lavrenko, V., Croft, W.B.: Relevance-based language models. In: SIGIR 2001, pp.
120–127. ACM (2001)
13. McLaughlin, M.R., Herlocker, J.L.: A collaborative filtering algorithm and evalua-
tion metric that accurately model the user experience. In: SIGIR 2004, pp. 329–336.
ACM (2004)
14. Parapar, J., Bellogı́n, A., Castells, P., Barreiro, A.: Relevance-based language mod-
elling for recommender systems. Inf. Process. Manage. 49(4), 966–980 (2013)
15. Robertson, S.E.: On term selection for query expansion. J. Doc. 46(4), 359–364
(1990)
16. Rocchio, J.J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The
SMART Retrieval System - Experiments in Automatic Document Processing, pp.
313–323. Prentice Hall (1971)
17. Schafer, J.B., Frankowski, D., Herlocker, J., Sen, S.: Collaborative filtering recom-
mender systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web
2007. LNCS, vol. 4321, pp. 291–324. Springer, Heidelberg (2007)
18. Valcarce, D., Parapar, J., Barreiro, A.: A study of smoothing methods for relevance-
based language modelling of recommender systems. In: Hanbury, A., Kazai, G.,
Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 346–351. Springer,
Heidelberg (2015)
19. Wang, J.: Language models of collaborative filtering. In: Lee, G.G., Song, D., Lin,
C.-Y., Aizawa, A., Kuriyama, K., Yoshioka, M., Sakai, T. (eds.) AIRS 2009. LNCS,
vol. 5839, pp. 218–229. Springer, Heidelberg (2009)
20. Wang, J., de Vries, A.P., Reinders, M.J.T.: A user-item relevance model for
log-based collaborative filtering. In: Lalmas, M., MacFarlane, A., Rüger, S.M.,
Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp.
37–48. Springer, Heidelberg (2006)
21. Wang, J., de Vries, A.P., Reinders, M.J.T.: Unified relevance models for rating
prediction in collaborative filtering. ACM Trans. Inf. Syst. 26(3), 1–42 (2008)
22. Wang, Y., Wang, L., Li, Y., He, D., Chen, W., Liu, T.-Y.: A theoretical analysis
of NDCG ranking measures. In: COLT 2013, pp. 1–30 (2013). JMLR.org
23. Wong, W.S., Luk, R.W.P., Leong, H.V., Ho, L.K., Lee, D.L.: Re-examining the
effects of adding relevance information in a relevance feedback environment. Inf.
Process. Manage. 44(3), 1086–1116 (2008)
24. Zhai, C., Lafferty, J.: Model-based feedback in the language modeling approach to
information retrieval. In: CIKM 2001, pp. 403–410. ACM (2001)
25. Zhou, T., Kuscsik, Z., Liu, J.-G., Medo, M., Wakeling, J.R., Zhang, Y.-C.: Solving
the apparent diversity-accuracy dilemma of recommender systems. PNAS 107(10),
4511–4515 (2010)
Language Models for Collaborative
Filtering Neighbourhoods
1 Introduction
Recommender systems aim to provide useful items of information to the users.
These suggestions are tailored according to the users' tastes. Considering the
increasing amount of information available nowadays, it is hard to manually fil-
ter what is interesting and what is not. Additionally, users are becoming more
demanding: they are not content with traditional browsing or searching activities, they want relevant information immediately. Therefore, recommender sys-
tems play a key role in satisfying the users’ needs.
We can classify recommendation algorithms in three main categories: content-
based systems, which exploit the metadata of the items to recommend similar
ones; collaborative filtering, which uses information of what other users have
done to suggest items; and hybrid techniques, which combine both content-
based and collaborative filtering approaches [15]. In this paper, we focus on the
collaborative filtering scenario. Collaborative techniques ignore the content of
the items since they merely rely on the feedback from other users. They tend to
perform better than content-based approaches if sufficient historical data is avail-
able. We can distinguish two main types of collaborative methods. On the one
hand, model-based techniques learn a latent factor representation from the data
after a training process [10]. On the other hand, neighbourhood-based methods
(also called memory-based algorithms) use the similarities among past user-item
interactions [6]. Neighbourhood-based recommenders, in turn, are classified in
two categories: user-based and item-based approaches depending on which type
of similarities are computed. User-based recommenders rely on user neighbour-
hoods (i.e., they recommend items that similar users like). By contrast, item-
based algorithms compute similarities between items (i.e., two items are related
if users rate them in a similar way).
Neighbourhood-based approaches are simpler than their model-based coun-
terparts because they do not require a previous training step—still, we need to
compute the neighbourhoods. Multiple approaches to generate neighbourhoods
exist in the literature [6] because this phase is crucial in the recommendation
process. The effectiveness of this type of recommender depends largely on how
we calculate the neighbourhoods. A popular approach consists in computing the
k Nearest Neighbours according to a pairwise similarity metric such as Pearson’s
correlation coefficient, adjusted cosine or cosine similarity.
Traditionally, recommender systems were designed as rating predictors; how-
ever, it has been acknowledged that it is more interesting to model the recom-
mendation problem as an item ranking task [1,8]. Top-N recommendation is the
term coined to name this new perspective [4]. For this task, the use of Infor-
mation Retrieval techniques and models is attracting more and more attention
[2,13,17,20]. The reason is that these methods were specifically conceived for
ranking documents according to an explicit query. However, they can also rank
items using the user’s profile as an implicit query.
Previous work has found that the cosine similarity yields the best results in
terms of accuracy metrics in the neighbourhood computation process [4]. In fact,
it surpasses Pearson’s correlation coefficient which is, by far, the most used sim-
ilarity metric in the recommender system literature [6]. Thinking about cosine
similarity in terms of retrieval models, we can note that it is the basic distance
measure used in the Vector Space Model [16]. Following this analogy between
Information Retrieval and Recommender Systems, if the cosine similarity is a
great metric for computing neighbourhoods, it sounds reasonable to apply to this task the more sophisticated representations and measures used in other, more effective retrieval models. Thus, in this paper we focus on modelling the finding of
user and item neighbourhoods as a text retrieval task. In particular, we propose
an adaptation of the Language Modelling retrieval functions as a method for
computing neighbourhoods. Our proposal leverages the advantages of this suc-
cessful retrieval technique for calculating collaborative filtering neighbourhoods.
Our proposal—which can be used in a user or item-based approach—in conjunc-
tion with a simple neighbourhood algorithm surpasses state-of-the-art methods
(NNCosNgbr and PureSVD [4]) in terms of accuracy and is also very efficient.
2 Background
An extensive literature has studied several neighbourhood-based approaches
because they are simple, interpretable and efficient [4–6,9]. After calculating
where bu,i denotes the bias for the user u and the item i (computed as in [9]); si,j ,
the cosine similarity between items i and j; Ji , the neighbourhood of the item
i, and ru,j , the rating that the user u gave to the item j. The major difference
between this method and the standard neighbourhood approach [6] is the absence
of the normalising denominator. Since we are not interested in predicting ratings,
we do not worry about getting scores in a fixed range. On the contrary, this method
fosters those items with high ratings by many neighbours [4,5,9].
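Eq. (1) itself is not reproduced above, so the following sketch (hypothetical names, NumPy only) is only a best-effort reading of the textual description: the user-item bias plus an unnormalised, similarity-weighted sum of the target user's bias-corrected ratings over the item's neighbourhood.

```python
import numpy as np

def nncosngbr_scores(u, R, B, item_sims, item_neighbours):
    """Item scores for user u following the description of Eq. (1): bias b_{u,i}
    plus an *unnormalised* similarity-weighted sum of bias-corrected ratings
    over the item neighbourhood J_i (assumed form, since the equation is not shown).
    R: (n_users, n_items) ratings (0 = unrated); B: (n_users, n_items) biases b_{u,i};
    item_sims: (n_items, n_items) cosine similarities; item_neighbours: dict item -> J_i."""
    n_items = R.shape[1]
    scores = np.zeros(n_items)
    for i in range(n_items):
        rated = [j for j in item_neighbours[i] if R[u, j] > 0]
        scores[i] = B[u, i] + sum(item_sims[i, j] * (R[u, j] - B[u, j]) for j in rated)
    return scores
```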
where p(q|d) is the query likelihood and p(d), the document prior. We can ignore
the query prior p(q) because it has no effect in the ranking for the same query.
Usually, a uniform document prior is chosen and the query likelihood retrieval
model is used. The most popular approach in IR to compute the query likelihood
is to use a unigram model based on a multinomial distribution:
$$p(q|d) = \prod_{t\in q} p(t|d)^{c(t,d)} \qquad (3)$$
where c(t, d) denotes the count of term t in document d. The conditional prob-
ability p(t|d) is computed via the maximum likelihood estimate (MLE) of a
multinomial distribution smoothed with a background model [23].
Our recommendation algorithm stems from the NNCosNgbr method and each mod-
ification described in this section was evaluated in Sect. 4.2. First, we kept the
biases (see Eqs. 4 and 5), instead of removing them as in Eq. 1. Removing biases is
very important in rating prediction recommenders because it allows ratings to be estimated more accurately [6,9]; however, it is useless for top-N recommendation because we are concerned with rankings. Moreover, this process adds an
extra parameter to tune [9]. Next, we focused on the similarity metric. In [4], the
authors introduced a shrinking factor into the cosine metric to promote those
similarities that are based on many shared ratings. This shrinkage procedure
has shown good results in previous studies based on error metrics [6,9] at the
expense of putting an additional parameter into the model. However, we found
that its inclusion is detrimental in our scenario. This is reasonable because the
main advantage of cosine similarity over other metrics such as Pearson’s cor-
relation coefficient is that it considers non-rated values as zeroes. In this way,
cosine already takes into account the amount of co-occurrence between vectors
of ratings, which makes the use of a shrinkage technique unnecessary.
where s is the cosine similarity between the user or item vectors. Vu is the
neighbourhood of user u, and Ji is the neighbourhood of item i.
Preliminary tests showed that our algorithm (Eqs. 4 and 5) performs very well
compared to more sophisticated ones using plain cosine similarity and evaluating
ranking quality. In fact, techniques such as biases removal or similarity shrinkage
worsened the performance and introduced additional parameters in the model.
Major differences in terms of ranking accuracy metrics occur when varying the
neighbourhood computation method. In particular, our experiments showed that
cosine similarity is a great metric for computing k Nearest Neighbours (k-NN).
This process is analogous to the document ranking procedure in the Vector Space
Model [16] if the target user plays the role of the query and the rest of the users
are the documents in the collection. The outcome of this model will be a list
of neighbours ordered by decreasing cosine similarity with respect to the user.
Thus, choosing the k nearest neighbours is the same as taking the top k results
using the user as the query.
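As an illustration of this analogy, the following sketch (hypothetical names, NumPy only) selects the k nearest neighbours of a target user by cosine similarity over rating vectors, with unrated items treated as zeroes, mirroring the Vector Space Model view described above.

```python
import numpy as np

def cosine_knn(R, u, k):
    """Top-k neighbours of user u by cosine similarity over rating vectors,
    treating unrated items as zeroes (the target user plays the role of the query)."""
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[u] / (norms * norms[u] + 1e-12)
    sims[u] = -np.inf                      # exclude the target user itself
    return np.argsort(sims)[::-1][:k]

# toy usage
rng = np.random.default_rng(5)
R = rng.integers(0, 6, size=(100, 50)).astype(float)
print(cosine_knn(R, u=0, k=5))
```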
Language Models (LM) are a successful retrieval method [14,22] that deals
with data sparsity [23], enables the introduction of a priori information and performs
document length normalisation [11]. Recommendation algorithms could benefit
from LM because: user feedback is sparse, we may have a priori information
and the profile sizes vary. We adapted LM framework to the task of finding
neighbourhoods in a user or item-based manner. If we choose the former, we
can model the generation of ratings by users as a random process given by a
probability distribution (as Language Models do with the occurrences of terms).
In this way, we can see documents and queries as users and terms as items. Thus,
the retrieval procedure results in finding the nearest neighbours of the target user
(i.e., the query). Analogously, we can flip to the item-based approach. In this
case, the query plays the role of the target item while the rest of items play the
role of the documents. In this way, a retrieval returns the most similar items.
The user-based analogy between the IR and recommendation tasks has
already been stated. The consideration of a multinomial distribution of ratings
has been used in [2,13] under the Relevance-Based Language Modelling frame-
work for computing recommendations. In our Language Modelling adaptation
for calculating neighbourhoods, we can estimate the probability of a candidate
where Iu are the items rated by user u. Here we only present the user-based
approach for the sake of space: the item-based counterpart is derived analogously.
Jelinek-Mercer (JM)

$$p_\lambda(i|u) = (1-\lambda)\,\frac{r_{u,i}}{\sum_{j\in I_u} r_{u,j}} + \lambda\, p(i|C) \qquad (9)$$
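As an illustration of the adaptation, the sketch below (hypothetical names, NumPy only) scores candidate neighbours of a target user by a log query likelihood built on the Jelinek-Mercer estimate of Eq. (9). Since the exact weighting of the target user's ratings (the analogue of the query term counts) is not shown here, using the ratings themselves as weights is an assumption.

```python
import numpy as np

def jm_item_probs(R, v, lam):
    """Eq. (9): Jelinek-Mercer smoothed p_lambda(i|v) over all items for
    candidate neighbour v. R: (n_users, n_items) ratings, 0 = unrated."""
    user_mass = R[v].sum()
    p_mle = R[v] / user_mass if user_mass > 0 else np.zeros(R.shape[1])
    p_coll = R.sum(axis=0) / R.sum()
    return (1 - lam) * p_mle + lam * p_coll

def neighbour_scores(R, u, lam=0.5):
    """Log query likelihood of each candidate neighbour v for target user u,
    treating u's rated items as the 'query' (rating weights are an assumption)."""
    rated = np.flatnonzero(R[u])
    scores = np.full(R.shape[0], -np.inf)
    for v in range(R.shape[0]):
        if v == u:
            continue
        p = jm_item_probs(R, v, lam)
        scores[v] = np.sum(R[u, rated] * np.log(p[rated] + 1e-12))
    return scores

# k nearest neighbours of user 0 under the smoothed language model
rng = np.random.default_rng(3)
R = rng.integers(0, 6, size=(30, 40)).astype(float)
print(np.argsort(neighbour_scores(R, 0))[::-1][:5])
```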
4 Experiments
We ran our experiments on four collections: the MovieLens 100k and 1M¹ film datasets, the R3-Yahoo! Webscope Music² dataset and the LibraryThing³ book dataset. We present the details of these collections in Table 1.
We used the splits that MovieLens 100 k and R3 Yahoo! provide for evaluation
purposes. Since the MovieLens 1 M and LibraryThing collections do not include
predefined splits, we put 80 % of the ratings of each user in the training subset
and the rest in the test subset randomly.
¹ http://grouplens.org/datasets/movielens/.
² http://webscope.sandbox.yahoo.com.
³ http://www.macle.nl/tud/LT/.
In this section, we analyse the different options in the WSR and NNCosNgbr algo-
rithms described in Sect. 3.1. Table 2 shows the best values of nDCG@10. We
used cosine as the similarity metric in k-NN and tuned the number of nearest
neighbours from k = 50 to 250 in steps of 50 neighbours. We chose a similarity
shrinking factor of 100 as recommended in [4]. Biases were computed using L2
regularisation with a factor of 1 [9]. The first row corresponds to the NNCos-
Ngbr algorithm [4]. The last two rows are the WSR method, our proposal for
recommendation generation (Eqs. 4 and 5, respectively).
WSR variants performed the best and they significantly surpass NNCosNgbr
(first row) in every dataset. The user-based approach reported the best figures on
the dense film datasets while the item-based algorithm yielded the best results
on the sparse songs and books collections. However, there are only statistically
significant differences between these two methods in the LibraryThing dataset.
This result agrees with the literature about neighbourhoods methods [4–6]: item-
based approaches tend to work well on sparse datasets because they compute
similarities among items, which often contain denser information than users.
5 Related Work
Table 3. nDCG@10 best values on the four datasets. Statistically significant improve-
ments (Wilcoxon two-sided test p < 0.01) w.r.t. the first, second, third, fourth and
fifth method are superscripted with a, b, c, d and e, respectively. Values underlined are
not statistically different from the best value. The number of neighbours/latent factors
used in each case is indicated in the right side.
pseudo-relevance feedback task but it has been adapted for finding user neigh-
bourhoods [2] and computing recommendations directly [13,18,19]. As a neigh-
bourhood technique, our experiments showed that its accuracy is worse than
our proposal in addition to being more computationally expensive. Regarding
their use as a recommender algorithm, Relevance-Based Language Models have
proved to be a very effective recommendation technique [13]. Since this method
is based on user neighbourhoods, it would be interesting to combine it with our
Language Modelling proposal. We leave this possibility as future work.
References
1. Bellogı́n, A., Castells, P., Cantador, I.: Precision-oriented evaluation of recom-
mender systems. In: RecSys 2011, p. 333. ACM (2011)
2. Bellogı́n, A., Parapar, J., Castells, P.: Probabilistic collaborative filtering with
negative cross entropy. In: RecSys 2013, pp. 387–390. ACM (2013)
3. Bellogı́n, A., Wang, J., Castells, P.: Bridging memory-based collaborative filtering
and text retrieval. Inf. Retr. 16(6), 697–724 (2013)
4. Cremonesi, P., Koren, Y., Turrin, R.: Performance of recommender algorithms on
top-N recommendation tasks. In: RecSys 2010, pp. 39–46. ACM (2010)
5. Deshpande, M., Karypis, G.: Item-based top-N recommendation algorithms. ACM
Trans. Inf. Syst. 22(1), 143–177 (2004)
Jean-Michel Renders(B)
1 Introduction
Real-world recommender systems have to capture the dynamic aspects of user
and item characteristics: user preferences and needs obviously change over time,
depending on her life cycle, on particular events and on social influences; sim-
ilarly, item perception could evolve in time, due to a natural slow decrease in
popularity or a sudden gain in interest after winning some award or getting pos-
itive appreciations of highly influential experts. Clearly, adopting a static app-
roach, for instance through “static matrix completion” – as commonly designed
and evaluated on a random split without considering the real temporal struc-
ture of the data – will fail to provide accurate results in the medium and long
run. Intuitively, the system should give more weight to recent observations and
should constantly update user and item “profiles” or latent factors in order to
offer the adaptivity and flexibility that are required.
At the same time, these recommender systems typically have to face the
“cold-start” problem: new users and new items constantly arrive in the system,
without any historical information. In this paper, we assume no other external
source of information than the time-stamped ratings, so that it is not possible
to use external user or item features (including social and similarity relation-
ships) to partly solve the cold-start issue. In the absence of such information, it
is impossible to provide accurate recommendation at the very early stage and
tackling the cold start problem consists then in solving the task (accurate recom-
mendation), while trying to simultaneously uncover the user and item profiles.
The problem amounts to optimally controlling the trade-off between exploration
and exploitation, to which the Multi-Armed Bandit (MAB) setting constitutes
an elegant approach.
This paper proposes to tackle both the adaptivity and the “cold start” chal-
lenges through the same framework, namely Extended Kalman Filters coupled
with contextual Multi-Armed Bandits. An extra motivation of this work is scalability and tractability, which leads us to design rather simple and efficient methods and precludes us from implementing complex inference methods derived from fully Bayesian approaches. The starting point of the framework
is the standard Matrix Completion approach to Collaborative Filtering and the
aim of the framework is to extend it to the adaptive, dynamic case, while con-
trolling the exploitation/exploration trade-off (especially in the “cold start” sit-
uations). Extended Kalman Filters constitute an ideal framework for modelling
smooth non-linear, dynamic systems with time-varying latent factors (called
“states” in this case). They maintain, in particular, a covariance matrix over the
state estimates or, equivalently, a posterior distribution over the user/item biases
and latent factors, which will then be fully exploited by the MAB mechanism
to guide its sampling strategy. Two different families of MAB are investigated:
Thompson sampling, based on the probability matching principle, and UCB-like
(Upper Confidence Bound) sampling, based on the principle of optimism in face
of uncertainty.
In a nutshell, the method’s principle is that, when a user u calls the system
at time t to have some recommendations, an arm (i.e. an item i) is chosen
that will simultaneously satisfy the user and improve the quality estimate of
the parameters related to both the user u and the proposed item. The system
then receives a new feedback (< u, i, r, t > tuple) and updates the corresponding
entries of the latent factor matrices as well as the posterior covariance matrices
over the factor estimates. We will show that the problem could be solved by a
simple algorithm, requiring only basic algebraic computations, without matrix
inversion or singular value decomposition. The algorithm could easily update
the parameters of the model and make recommendations, even with an arrival
rate of several thousands ratings per second.
2 Related Work
One of the first works to stress the importance of temporal effects in Recom-
mender Systems and to cope with it was the timeSVD++ algorithm [8]. The
approach consists in explicitly modelling the temporal patterns on historical
rating data, in order to remove the “temporal drift” biases. This means that the
time dependencies are modelled parametrically as time-series, typically in the
intervals between successive observations should ideally allow for larger updates,
while this approach does not capture this kind of implicit volatility).
On the side of Multi-Armed Bandits for item recommendation, the most
representative works are [2,4,9]. They use linear contextual bandits, where a
context is typically a user calling the system at time t and her associated feature
vector; the reward (ie. the rating) is assumed to be a linear function of this
feature vector and the coefficients of this linear function could be interpreted as
the arm (or item) latent factors that are incrementally updated. Alternatively,
they also consider binary ratings, with a logistic regression model for each arm
(or item) and then use Thompson Sampling or UCB sampling to select the best
item following an exploration/exploitation trade-off perspective. More recently,
[19] combines Probabilistic Matrix Factorization and linear contextual bandits.
Unfortunately, none of these approaches offers any adaptive behaviour: the features associated with a user are assumed to be constant and known accurately;
we could easily consider the dual problem, namely identifying the user as an
arm and the item as the context (as in [19]), but then we have no adaptation to
possible changes and drifts in the item latent factors and biases.
In a nutshell, to the best of our knowledge, there is no approach that simul-
taneously combines the dynamic tracking of both user and item latent factors
with an adequate control of the exploration/exploitation trade-off in an on-line
learning setting. The method that is proposed here is a first step to fill this gap.
$$r_{u,i} = \mu + a_u + b_i + L_u \cdot R_i^T + \epsilon$$
where au , bi , Lu and Ri are respectively the user bias, the item popularity,
the user latent factors and the item latent factors (μ is a constant which could
be interpreted as the global average rating). Both Lu and Ri are row vectors
with K components, K being the dimension of the latent space. The noise is
assumed to be i.i.d. Gaussian noise, with mean equal to 0 and variance equal to
σ 2 . The matrix completion problem typically involves the minimization of the
following loss function, combining the reconstruction error over the training set
and regularization terms:
$$\mathcal{L}(a, b, L, R) = \sum_{(u,i)\in\Omega} \big(r_{u,i} - \mu - a_u - b_i - L_u \cdot R_i^T\big)^2 + \lambda_a \|a\|^2 + \lambda_b \|b\|^2 + \lambda_L \|L\|_F^2 + \lambda_R \|R\|_F^2$$
where Ω is the training set of observed tuples. Note that this loss could be inter-
preted in a Bayesian setting as the MAP estimate, provided that all parameters
(au , bi , Lu and Ri ) have independent Gaussian priors, with diagonal covariance
matrices. In this case, λ_L = σ²/σ_L², where σ_L² is the variance of the diagonal Gaussian prior on L_u
Predictor Step:
  x̂_{t|t−1} = f(x̂_{t−1|t−1})
  P_{t|t−1} = J_f(x̂_{t|t−1}) P_{t−1|t−1} J_f(x̂_{t|t−1})ᵀ + Q_t
Corrector Step:
  K_t = P_{t|t−1} J_h(x̂_{t|t−1})ᵀ (J_h(x̂_{t|t−1}) P_{t|t−1} J_h(x̂_{t|t−1})ᵀ + σ_t²)⁻¹
  x̂_{t|t} = x̂_{t|t−1} + K_t [y_t − h(x̂_{t|t−1})]
  P_{t|t} = [I − K_t J_h(x̂_{t|t−1})] P_{t|t−1}
where J_f and J_h are the Jacobian matrices of the functions f and h, respectively (J_f = ∂f/∂x and J_h = ∂h/∂x). Basically, x̂_{t|t−1} is the prediction of the state at time t, given all observations up to t − 1, while x̂_{t|t} is the prediction that also includes the observation of the outputs at time t (y_t). In practice, we use the Iterated Extended Kalman Filter (IEKF), where the first two equations of the Corrector step are iterated until x̂^{(i)}_{t|t} is stabilised, gradually offering a better approximation of the non-linearity through the Jacobian matrices.
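As an illustration, a minimal NumPy sketch of this predictor/corrector cycle for a scalar observation is given below; the functions f and h, their Jacobians, the process noise Q_t, and the observation noise σ_t² are supplied by the caller, and the IEKF variant would simply re-evaluate J_h and the gain at the updated state for a few iterations. This is a generic sketch, not the paper's exact implementation.

```python
import numpy as np

def ekf_step(x_prev, P_prev, y_t, f, h, Jf, Jh, Q_t, sigma2_t):
    """One Extended Kalman Filter predictor/corrector cycle for a scalar observation y_t."""
    # Predictor step
    x_pred = f(x_prev)
    F = Jf(x_pred)
    P_pred = F @ P_prev @ F.T + Q_t

    # Corrector step
    H = np.atleast_2d(Jh(x_pred))            # 1 x n Jacobian of the observation function
    S = (H @ P_pred @ H.T).item() + sigma2_t  # innovation variance
    K = (P_pred @ H.T) / S                    # Kalman gain, shape (n, 1)
    x_upd = x_pred + (K * (y_t - h(x_pred))).ravel()
    P_upd = (np.eye(len(x_prev)) - K @ H) @ P_pred
    return x_upd, P_upd
```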
In order to apply these filters, let us now express our adaptive collabora-
tive filtering as a continuous-time dynamic system with the following equations,
assuming that we observe the tuple < u, i, ru,i > at time t:
with au,0 ∼ N (0, λa ), bi,0 ∼ N (0, λb ), Lu,0 ∼ N (0, ΛL ), Ri,0 ∼ N (0, ΛR ) and
t ∼ N (0, σ 2 ). Here au,t−1 denotes the value of the bias of user u when she
appeared in the system for the last time before time t. Similarly, bi,t−1 denotes
the value of the popularity of item i when it appeared in the system for the
last time before time t. So, the short-cut notation (t − 1) is contextual to an
item or to a user. All the parameters au , bi , Lu and Ri follow some kind of
Brownian motion (the continuous counterpart of a discrete random walk) with
Gaussian transition process noises wa,u , wb,i , WL,u and WR,i respectively, whose
variances are proportional to the time interval since a user / an item appeared
in the system for the last time before the current time, denoted respectively as
Δu (t − 1, t) and Δi (t − 1, t): wa,u (t − 1, t) ∼ N (0, Δu (t − 1, t).γa ), wb,i (t − 1, t) ∼
N (0, Δi (t − 1, t).γb ), WL,u (t − 1, t) ∼ N (0, Δu (t − 1, t).ΓL ) and WR,i (t − 1, t) ∼
N (0, Δi (t − 1, t).ΓR ). We call γa , γb , ΓL and ΓR the volatility hyper-parameters.
It is assumed that the hyper-parameters λa , γa and the diagonal covariance
matrices ΛL , ΓL are identical for all users, and independent from each other.
The same is assumed for the hyper-parameters related to items.
With these assumptions, the application of the Iterated Extended Kalman
filter equations gives:
Predictor Step:
  P^{a_u}_{t|t−1} = P^{a_u}_{t−1|t−1} + Δ_u(t, t − 1) γ_a
  P^{b_i}_{t|t−1} = P^{b_i}_{t−1|t−1} + Δ_i(t, t − 1) γ_b
  P^{L_u}_{t|t−1} = P^{L_u}_{t−1|t−1} + Δ_u(t, t − 1) Γ_L
  P^{R_i}_{t|t−1} = P^{R_i}_{t−1|t−1} + Δ_i(t, t − 1) Γ_R
Then:
  K_t^{a_u} = ω P^{a_u}_{t|t−1}
  K_t^{b_i} = ω P^{b_i}_{t|t−1}
  P^{R_i}_{t|t} = (I − K_t^{R_i} L_{u,t}) P^{R_i}_{t|t−1}
with P^{a_u}_{0|0} = λ_a, P^{L_u}_{0|0} = Λ_L ∀u and P^{b_i}_{0|0} = λ_b, P^{R_i}_{0|0} = Λ_R ∀i. Note that ω, K_t^{a_u}, K_t^{b_i}, P^{a_u}_{t|·} and P^{b_i}_{t|·} are scalars.
In practice, at least with the datasets that were used in our experiments, the iterative part of the Corrector step converges in very few iterations (typically 2 or 3). It should be noted that, if a user is not well known (high covariances P^{a_u} and P^{L_u}, due to a low number of ratings or a long time since her last appearance), her weight – and so her influence – in adapting the item i is decreased, and vice versa. Our independence and Gaussian assumptions make it simple to compute the posterior distribution of the rating of a new pair < u, i > at time t: it is a Gaussian with mean μ + a_{u,t} + b_{i,t} + L_{u,t} R_{i,t}ᵀ and variance σ² + P^{a_u}_{t|t} + P^{b_i}_{t|t} + R_{i,t} P^{L_u}_{t|t} R_{i,t}ᵀ + L_{u,t} P^{R_i}_{t|t} L_{u,t}ᵀ. Note also that one can easily extend the IEKF method to introduce any smooth non-linear link function (e.g. r_{u,i,t} = g(μ + a_{u,t} + b_{i,t} + L_{u,t} R_{i,t}ᵀ), with g(x) a sigmoid between the minimum and maximum rating values). The hyper-parameters could be learned from the training data through a procedure similar to the EM algorithm, using Extended Kalman smoothers (a forward-backward version of the Extended Kalman Filter) as described in [14], or by tuning them on a development set whose time interval is later than that of the training set.
Observe rating rt (for pair < ut , i∗ >), update D, update the parameters
and variances/covariances through IEKF
end for
The α parameter controls the trade-off between exploration and exploitation
(in practice, α = 2 is often used).
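As an illustration of how the posterior above can drive recommendation, the following hedged sketch scores each candidate item by its posterior mean plus α times the posterior standard deviation and returns the maximiser; all variable names and the diagonal-covariance storage are assumptions, not the paper's code.

```python
import numpy as np

def select_item(mu, a_u, P_au, L_u, P_Lu, items, sigma2, alpha=2.0):
    """Pick the item maximising an optimistic (UCB-style) rating estimate.

    items: list of (b_i, P_bi, R_i, P_Ri) tuples for candidate items,
           where P_Lu and P_Ri are diagonal covariances stored as vectors.
    """
    best_idx, best_score = None, -np.inf
    for idx, (b_i, P_bi, R_i, P_Ri) in enumerate(items):
        mean = mu + a_u + b_i + L_u @ R_i
        var = sigma2 + P_au + P_bi + R_i @ (P_Lu * R_i) + L_u @ (P_Ri * L_u)
        score = mean + alpha * np.sqrt(var)   # optimism in the face of uncertainty
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx
```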
5 Experimental Results
Experiments have been performed on two datasets: MovieLens (10M ratings) and Vodkaster¹ (2.7M ratings), each divided into three chronologically ordered splits: Train (90 %), Development (5 %), Test (5 %). Note that we beforehand removed the ratings corresponding to the early, non-representative (transient) time period from both datasets (2.5M ratings for MovieLens, 0.3M ratings for Vodkaster). The Development set is used to tune the different hyper-parameters of the algorithm. These two datasets show very different characteristics, as illustrated in Table 1, especially in the arrival rate of new users. We divided the experiments into two parts: one assessing separately the adaptive capacities of our method, the other evaluating the gain of coupling these adaptive capacities with Multi-Armed Bandits (i.e., the full story).
Table 1. Characteristics of the two datasets.

                                          MovieLens   Vodkaster
  Number of ratings                       7,501,601   2,428,163
  Median number of ratings / user         92          11
  Median number of ratings / item         121         7
  % of users with at least 100 ratings    47          24
  Total time span (years)                 9.02        3.7
  Duration Dev Set (months)               7           3
  Duration Test Set (months)              7           3
  % of new users in Dev Set               72.6        35.6
  % of new items in Dev Set               4.2         4.3
  % of new users in Test Set              79.4        35.4
  % of new items in Test Set              3.2         2.2
² It is easy to show that we can divide all values of the hyper-parameters by σ² without changing the predicted value; so we can fix σ² to 1.
6 Conclusion
We have proposed in this paper a single framework that combines the adap-
tive tracking of both user and item latent factors through Extended Non-linear
Kalman filters and the exploration/exploitation trade-off required by the on-line
learning setting (including cold-start) through Multi-Armed Bandits strategies.
Experimental results showed that, at least for the datasets and settings that we
considered, this framework constitutes an interesting alternative to more com-
mon approaches.
Of course, this is only a first step towards a more thorough analysis of the
best models to capture the underlying dynamics of user and item evolution in
real recommender systems. The use of more powerful non-linear state tracking
techniques such as Particle Filters should be investigated, especially to over-
come the limitations of the underlying Gaussian and independence assumptions.
One promising avenue of research is to allow the volatility and prior variance
hyperparameters to be user-specific (or item-specific) and to be themselves time-
dependent. Moreover, it could be useful to take into account possible dependencies between the distributions of the latent factors, which was not considered at all here. All these topics will be the subject of our future work.
Acknowledgement. This work was partially funded by the French Government under
the grant <ANR-13-CORD-0020> (ALICIA Project).
References
1. Agarwal, D., Chen, B., Elango, P.: Fast online learning through offline initialization
for time-sensitive recommendation. In: KDD 2010 (2010)
2. Chapelle, O., Li, L.: An empirical evaluation of Thompson sampling. In: NIPS 2011
(2011)
3. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-
aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
4. Mahajan, D., Rastogi, R., Tiwari, C., Mitra, A.: Logucb: an explore-exploit algo-
rithm for comments recommendation. In: CIKM 2012 (2012)
5. Gaillard, J., Renders, J.-M.: Time-sensitive collaborative filtering through adaptive
matrix completion. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR
2015. LNCS, vol. 9022, pp. 327–332. Springer, Heidelberg (2015)
6. Gultekin, S., Paisley, J.: A collaborative Kalman filter for time-evolving dyadic
processes. In: ICDM 2014 (2014)
7. Han, S., Yang, Y., Liu, W.: Incremental learning for dynamic collaborative filtering.
J. Softw. 6(6), 969–976 (2011)
8. Koren, Y.: Collaborative filtering with temporal dynamics. Commun. ACM 53(4),
89–97 (2010)
9. Li, L., Chu, W., Langford, J., Schapire, R.: A contextual-bandit approach to per-
sonalized news article recommendation. In: WWW 2010 (2010)
10. Lu, Z., Agarwal, D., Dhillon, I.: A spatio-temporal approach to collaborative fil-
tering. In: RecSys 2009 (2009)
11. Ott, P.: Incremental matrix factorization for collaborative filtering. Science, Tech-
nology and Design 01/, Anhalt University of Applied Sciences, 2008 (2008)
12. Rendle, S., Schmidt-Thieme, L.: Online-updating regularized kernel matrix factor-
ization models for large-scale recommender systems. In: RecSys 2008 (2008)
13. Stern, D., Herbrich, R., Graepel, T.: Matchbox: large scale online Bayesian recom-
mendations. In: WWW 2009 (2009)
14. Sun, J., Parthasarathy, D., Varshney, K.: Collaborative Kalman filtering for
dynamic matrix factorization. IEEE Trans. Sig. Process. 62(14), 3499–3509 (2014)
15. Sun, J., Varshney, K., Subbian, K.: Dynamic matrix factorization: A state space
approach. In: IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (2012)
16. Xiang, L., Yang, Q.: Time-dependent models in collaborative filtering based rec-
ommender system. In: IEEE/WIC/ACM International Joint Conferences on Web
Intelligence and Intelligent Agent Technologies, vol. 1 (2009)
17. Xiong, L., Chen, X., Huang, T.-K., Schneider, J., Carbonell, J.: Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In: Proceedings of
the SIAM International Conference on Data Mining (SDM), vol. 10 (2010)
18. Yu, L., Liu, C., Zhang, Z.: Multi-linear interactive matrix factorization. Knowledge-
Based Systems (2015)
19. Zhao, X., Zhang, W., Wang, J.: Interactive collaborative filtering. In: CIKM 2013
(2013)
Short Papers
A Business Zone Recommender System
Based on Facebook and Urban Planning Data
1 Introduction
Location is a pivotal factor for retail success, owing to the fact that 94 % of
retail sales are still transacted in physical stores [9]. To increase the chance
of success for their stores, business owners need to know not only where their
potential customers are, but also their surrounding competitors and potential
allies. However, assessing a store location is a cumbersome task for business
owners as numerous factors need to be considered that often require gathering
and analyzing the relevant data. To this end, business owners typically conduct
ground surveys, which are time-consuming, costly, and do not scale up well.
Moreover, with the rapidly changing environment and emergence of new business
locations, one has to continuously reevaluate the value of the store locations.
Fortunately, in the era of social media and mobile apps, we have an abun-
dance of data that capture both online activities of users and offline activities
at physical locations. For example, more than 1 billion people actively use Face-
book everyday [8]. The availability of online user, location, and other behavioral
data makes it possible now to estimate the value of a business location.
Accordingly, we develop ZoneRec, a business location recommender system
that takes a user’s description about his/her business and produces a ranked
list of zones that would best suit the business. Such ranking constitutes a fun-
damental information retrieval (IR) problem [6,7], where the user’s description
corresponds to the query, and the pairs of business profiles and zones are the
documents. Our system is targeted at business owners who have little or no prior knowledge of which zone they should set up their business in. In our current
work, the zones refer to the 55 urban planning areas, the boundaries of which are
set by the Singapore government. While we currently focus on Singapore data,
it is worth noting that our approach can be readily applied to other cities.
Figure 1 illustrates how our ZoneRec system works. First, the system asks
the user to define the type of his/her hypothetical (food) business (Fig. 1a),
and to then provide some description of the business (Fig. 1b). In turn, our
system analyzes the input data, based on which its recommendation algorithm
produces a ranked list of zones. The ranking scores of the zones are represented
by a heatmap overlaid on the Singapore map (Fig. 1c). Further details of each
recommended zone can be obtained by hovering or clicking on the zone.
Related Work. Using social media data to understand the dynamics of a soci-
ety is an increasingly popular research theme. For example, Chang and Sun [3]
analyzed the “check-ins” data of Facebook users to develop models that can pre-
dict where users will check in next, and in turn predict user friendships. Another
close work by Karamshuk et al. [5] demonstrated the power of geographic and
user mobility features in predicting the best placement of retail stores. Our work
differs from [3] in that we use Facebook data to recommend locations instead of
friendships. Meanwhile, Karamshuk et al. [5] discretized the city into a uniform
grid of multiple circles. In contrast, we use more accurate, non-uniform area
boundaries that are curated by government urban planning.
Contributions. In summary, our contributions are: (i) to the best of our knowledge, we are the first to develop a business zone recommendation method that fuses Facebook business location and urban planning data to help business owners find the optimal zone placement of their businesses; (ii) we develop a user-friendly web application to realize our ZoneRec approach, which is now available online at http://research.larc.smu.edu.sg/bizanalytics/; and (iii) we conduct empirical studies to compare different algorithms for zone recommendation, and assess the relevance of different feature groups.
2 Datasets
In this work, we use two public data sources, which we elaborate below.
Singapore Urban Planning Data. To obtain the zone information, we
retrieved the urban planning data from the Urban Redevelopment Authority (URA) of Singapore [10]. The data consist of 55 predetermined planning zones.
To get the 55 zones, URA first divided Singapore into five regions: Central, West,
North, North-East and East. Each region has a population of more than 500,000
people, and is a mix of residential, commercial, business and recreational areas.
These regions are further divided into zones, each having a population of about
150,000 and being served by a town centre and several commercial/shopping
centres.
Facebook Business Data. In this work, we focus on data from Facebook pages
about food-related businesses that are located within the physical boundaries of
Singapore. Our motivation is that food-related businesses constitute one of the
largest groups in our Singapore Facebook data. From a total of 82,566 business
profiles we extracted via Facebook’s Graph API [4], we found 20,877 (25.2 %)
profiles that are food-related. Each profile has the following attributes:
– Business Name and Description. These represent the name and the textual description of the shop, respectively.
– List of Categories. From the 20,877 food-related businesses, we retrieve 357
unique categorical labels, as standardized by Facebook. These may contain
not only food-related labels such as “bakery,” “bar,” and “coffee shop”, but
also non-food ones such as “movie theatre,” “mall,” and “train station.”
The existence of non-food labels for food businesses is Facebook’s way of
allowing the users to tag multiple labels for a business profile. For example,
a Starbucks outlet near a train station in an airport may have both food and
non-food labels, such as “airport,” “cafe,” “coffee shop,” and “train station”.
– Location of Physical Store. Each business profile has a location
attribute containing the physical address and latitude-longitude coordinates
(hereafter called “lat-long”). We map the lat-long information to the URA
data to determine which of the 55 zones the target business is in. Note that
we rule out business profiles that do not have explicit lat-long coordinates.
– Customer Check-ins. A check-in is the action of registering one’s physical
presence; and the total number of check-ins received by a business gives us
a rough estimate of how popular and well-received it is.
3 Proposed Approach
We cast the zone recommendation as a classification task, where the input fea-
tures are derived from the textual and categorical information of a business and
the class labels are the zone IDs. This formulation corresponds to the pointwise
ranking method for IR [6], whereby the ranking problem is transformed to a
conventional classification task. Our approach consists of three phases:
Data Cleaning. For each business, we first extract its (i) business name,
(ii) business description, and (iii) the tagged categories that it is associated
with. As some business profiles may have few or no descriptive text, we set
the minimum number of words in a description to be 20. This is to ensure that
our study only includes quality business profiles, as the insertion of businesses
with noisy “check-ins” will likely deterioriate the quality of the recommendations
produced by our classification algorithms. We remove all stop words and words
containing digits. Stemming is also performed to reduce inflectional forms and
derivationally related forms of a word to a common base form (e.g., car, cars,
car’s ⇒ car).
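As a concrete illustration, a minimal sketch of this cleaning step is given below; the regular-expression tokeniser and the NLTK Porter stemmer are our own assumptions, as the paper does not name its tools.

```python
import re
from nltk.corpus import stopwords        # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def clean_text(text, min_words=20):
    """Return a cleaned, stemmed token list, or None if the description is too short."""
    tokens = re.findall(r"[\w']+", text.lower())
    if len(tokens) < min_words:
        return None                                          # discard sparse profiles
    kept = [t for t in tokens
            if t not in STOP and not any(ch.isdigit() for ch in t)]
    return [STEMMER.stem(t) for t in kept]
```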
Feature Construction. Using the cleaned text from the previous stage, we construct a bag of words for each feature group, i.e., the name, description, and categories of each business profile. As not all words in the corpus are equally important, we compute the term frequency-inverse document frequency (TF-IDF) [7] to measure how important a word (or set of words) is to a business profile (i.e., a document) in the corpus. We also include bigram features, since in some cases pairs of words make more sense than the individual words. With the inclusion of unigrams and bigrams, we have a total of 51,397 unique terms. We set the minimum document frequency (DF) to 3, and retain the top 5,000 terms based on their inverse document frequency (IDF) score.
Classification Algorithms. Based on the constructed TF-IDF features of a business profile as well as the zone (i.e., class label) it belongs to, we can now craft the training data for our classification algorithms. Specifically, each classifier is trained to compute the matching score between a business profile and a zone ID. We can then apply the classifiers to the testing data and sort the matching scores in descending order, based upon which we pick the K highest scores that constitute our top-K recommended zones.
In this study, we investigate three popular classification algorithms: (i) support vector machine (SVM) with linear kernel (SVM-Linear) [2], (ii) SVM with radial basis function kernel (SVM-RBF) [2], and (iii) random forest classifier (RF) [1]. The first two aim at maximizing the margin of separation between data points from different classes, which implies a lower generalization error. Meanwhile, RF is an ensemble classifier that comprises a collection of decision trees. It works based on a bagging mechanism, i.e., each tree is built from bootstrap samples drawn with replacement from the training data, and the final prediction is made via majority voting over the decisions of the constituent trees. Being an ensemble model, RF exhibits high accuracy and robustness, and the bagging mechanism facilitates an efficient, parallelizable learning process.
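The following hedged scikit-learn sketch ties the feature construction and classification steps together: TF-IDF over unigrams and bigrams with a minimum document frequency of 3 and a 5,000-term vocabulary cap (note that scikit-learn caps by corpus term frequency rather than the IDF criterion used in the paper), feeding a random forest whose class probabilities are sorted to yield the top-K zones; function names and any hyper-parameters not stated in the text are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

def train_and_rank(texts, zones, query_texts, top_k=10):
    """Train a zone classifier on business texts and return the top-K zones per query profile."""
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_features=5000)
    X = vec.fit_transform(texts)                       # TF-IDF features of training profiles
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, zones)                                  # zone IDs are the class labels

    Xq = vec.transform(query_texts)
    proba = clf.predict_proba(Xq)                      # matching score for each zone
    ranked = np.argsort(-proba, axis=1)[:, :top_k]     # indices of the top-K zones
    return [[clf.classes_[j] for j in row] for row in ranked]
```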
of our ablation study. The first three rows show the results of ablating (remov-
ing) two feature groups, while the last three rows are for ablating one feature
group.
From the first three rows, it is evident that the “description” is the most
important feature group, consistently providing the highest Hit@10, MAP@10,
and NDCG@10 scores compared to the other two. This is reasonable, as the
“description” provides the richest set of features (in terms of word vocabulary
and frequencies) representing a business, and some of these features provide
highly discriminative inputs for our RF classifier. We can also see that the
“name” group is more discriminative than the “categories” group for all three metrics. Again, this can be attributed to the more fine-grained information
provided by the business’ name features as compared to the category features.
Finally, we find that the results in the last three rows are consistent with those of
the first three rows. That is, the “description” group constitutes the most infor-
mative features (for our RF model), followed by the “name” and “category”
groups.
5 Conclusion
We put forward the ZoneRec recommender system that can help business owners decide in which zones they should set up their businesses. Despite its promising potential, there remains room for improvement. First, the zone-level recommendations may not provide sufficiently granular information for business owners, e.g., where exactly a store should be located and how the surrounding businesses may affect this choice. It would also be fruitful to include more comprehensive residential and demographic information in our feature set, and to conduct deeper analysis of
the contribution of the individual features. To address these, we plan to develop
a two-level location recommender system, whereby ZoneRec serves as the first
level and the second level recommends the specific hotspots within each zone.
References
1. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
2. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM-
TIST 2(27), 1–27 (2011)
3. Chang, J., Sun, E.: Location3: how users share and respond to location-based data on social networking sites. In: ICWSM, pp. 74–80 (2011)
4. Facebook: Graph API reference (2015). https://goo.gl/8ejSw0
5. Karamshuk, D., Noulas, A., Scellato, S., Nicosia, V., Mascolo, C.: Geo-spotting: mining online location-based services for optimal retail store placement. In: KDD, pp. 793–801 (2013)
6. Liu, T.-Y.: Learning to rank for information retrieval. Found. Trends Inf. Retrieval 3(3), 225–331 (2009)
7. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
8. Smith, C.: 200+ amazing Facebook user statistics (2016). http://goo.gl/RUoCxE
9. Thau, B.: How big data helps chains like Starbucks pick store locations – an (unsung) key to retail success (2015). http://onforb.es/1k8VEQY
10. URA: Master plan: View planning boundaries (2015). http://goo.gl/GA3dR8
On the Evaluation of Tweet Timeline
Generation Task
1 Introduction
With the enormous volume of tweets posted daily and the associated redundancy and noise in such a vibrant information-sharing medium, a user can find it difficult to get updates about a topic or an event of interest. The Tweet Timeline
Generation (TTG) task was recently introduced at TREC-2014 microblog track
to tackle this problem. TTG aims at generating a timeline of relevant and novel
tweets that summarizes the development of a topic over time [5].
In the TREC task, a TTG system is evaluated using variants of F1 mea-
sure that combine precision and recall of the generated timeline against a gold
standard of clusters of semantically-similar tweets. Different TTG approaches
were presented in TREC-2014 [5] and afterwards [2,4]: almost all rely on an
initial step of retrieval of a ranked list of potentially-relevant tweets, followed
by applying novelty detection and duplicate removal techniques to generate the
timeline [5]. In such a design, the quality of the generated timeline naturally relies on that of the initially retrieved list. There is a major concern that the evaluation metrics do not fairly rank TTG systems, since these systems start from different retrieved ranked lists. An effective TTG system that is fed a low-quality list may achieve lower performance than a poor TTG system that is fed a high-quality list; current TTG evaluation metrics lack the ability to evaluate TTG independently of the retrieval effectiveness. This creates an evaluation challenge, especially for future approaches that use different retrieval models.
In this work, we examine the bias of the TTG evaluation methodology introduced in the track [1]. We first empirically measure the dependency of TTG
2 Experimental Setup
A set of 55 queries and corresponding relevance judgments were provided by
TREC [5]. For each query, a set of semantic clusters was identified; each consists of tweets that are relevant to an aspect of the topic and substantially similar to each other.
Precision, recall, and F1 measures over the semantic clusters were used for
evaluation. Precision (P) is defined as the proportion of tweets returned by a
TTG system representing distinct semantic clusters. Recall (R) is defined as the
proportion of the total semantic clusters that are covered by the returned tweets.
Weighted Recall (wR) is measured similarly but weighs each covered semantic
cluster by the sum of relevance grades1 of its tweets. F1 combines P and R,
while wF1 combines P and wR. Each of those measures is first computed over
the returned timeline of each query and then averaged over all queries.
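A sketch of these cluster-based measures for a single query, under our reading of the definitions above (in particular, precision counts the distinct clusters covered among the returned tweets); the data layout is an assumption.

```python
def ttg_metrics(returned, clusters, grades):
    """Cluster-based P, R, wR, F1, wF1 for one query.

    returned : list of returned tweet ids
    clusters : dict cluster_id -> set of relevant tweet ids
    grades   : dict tweet_id -> relevance grade (1 or 2)
    """
    tweet2cluster = {t: c for c, tweets in clusters.items() for t in tweets}
    covered = {tweet2cluster[t] for t in returned if t in tweet2cluster}
    precision = len(covered) / len(returned) if returned else 0.0
    recall = len(covered) / len(clusters) if clusters else 0.0
    total_w = sum(grades[t] for tweets in clusters.values() for t in tweets)
    covered_w = sum(grades[t] for c in covered for t in clusters[c])
    w_recall = covered_w / total_w if total_w else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    wf1 = 2 * precision * w_recall / (precision + w_recall) if precision + w_recall else 0.0
    return precision, recall, w_recall, f1, wf1
```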
In our experiments, we used 12 officially submitted ad-hoc runs by 3 of the top 4 participating groups in the TREC-2014 TTG task [3,6,9]. Additionally, we used a baseline run directly provided by the TREC search API [5]. This gives a total of 13 ad-hoc runs for our study, denoted by the set A = {a1, a2, ..., a13}. The retrieval
approaches used by those runs are mainly five: (1) direct search by TREC API
(D), (2) using query expansion (QE), (3) using QE that utilizes the links in tweets
(QE+Web), (4) using QE then learning to rank (QE+L2R), and (5) using relevance
modeling (RM). Table 1 presents all ad-hoc runs and their retrieval performance.
We also used 8 different TTG systems (of two TREC participants) [3,6],
denoted by T = {t1 , t2 , ... , t8 }. Their approaches are summarized as follows:
– t1 to t4 applied 1NN-clustering (using modified versions of Jaccard similarity)
on the retrieved tweets [6] and generated timelines using different retrieval
depths, which made their performance results significantly different [5,6].
– t5 is a simple TTG system that just returns the retrieved tweets after removing
exact duplicates.
– t6 to t8 applied an incremental clustering approach that treats the retrieved
tweets, sorted by their retrieval scores, as a stream and clusters each tweet
based on cosine similarity to the centroids of existing clusters. They also used
different number of top retrieved tweets and different similarity thresholds,
and considered the top-scoring tweet in each cluster as its centroid [3].
¹ 1 for a relevant tweet and 2 for a highly-relevant tweet.
Table 2 presents the performance of the 8 TTG systems when applied to a13 ,
which was selected as a sample to illustrate the quality of each TTG system. As
shown, the quality of the 8 TTG systems varies significantly. In fact, by applying a significance test on wF1 (a two-tailed t-test with α = 0.05), we found that all TTG system pairs but 6 were statistically significantly different.
Combinations of ad-hoc runs and TTG systems created a list of 104 TTG
runs that we used to study the bias of the task evaluation. We aim to show
whether the evaluation methodology used in the TREC microblog track is biased
towards retrieval quality, and if there is a way to reduce possible bias.
To measure bias and dependency of TTG on the quality of the used ad-hoc
runs, we use Kendall tau correlation (τ ) and AP correlation (τAP ) [10]. τAP is
used besides τ since it is more sensitive to errors at higher ranks [10].
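For reference, Kendall's τ is available in SciPy, and τAP can be computed as in the following sketch of our reading of [10]; the list-based inputs are assumptions.

```python
from scipy.stats import kendalltau

def tau_ap(reference, estimate):
    """AP rank correlation of `estimate` against `reference`; both are lists of
    the same items ordered from best to worst, `reference` being the ground truth."""
    pos = {item: i for i, item in enumerate(reference)}
    n = len(estimate)
    total = 0.0
    for i in range(1, n):                      # items at ranks 2..n of the estimate
        item_i = estimate[i]
        # fraction of items ranked above item_i that the reference also ranks above it
        correct = sum(pos[other] < pos[item_i] for other in estimate[:i])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

# Kendall's tau over two lists of per-run scores (e.g. MAP vs. wF1):
# tau, _ = kendalltau(scores_by_map, scores_by_wf1)
```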
Fig. 1. τ and τAP between ad-hoc runs and their corresponding TTG timelines, averaged over TTG systems.
Figure 1 reports the average τ and τAP correlations using different retrieval
and TTG performance metrics. As shown, there is always a positive correlation
between the quality rankings of ad-hoc runs and the TTG timelines. Considering
the main metrics for evaluating retrieval (MAP) and for evaluating TTG (wF1 ),
the correlation values are 0.49 for both τ and τAP. This indicates a considerable correlation, but it is not as strong as expected.
Table 3 reports the average τ and τAP correlations among all pairs of TTG
systems. The achieved correlation scores align with those in Fig. 1, but with slightly higher values. This also supports the finding that TTG system
performance depends, to some extent, on the quality of input ad-hoc runs. This
observation suggests that using different ad-hoc runs with different TTG systems
makes it unlikely to have unbiased evaluation for the TTG systems, since the
output of TTG systems, in general, depends on the quality of the retrieval run.
Table 3. Average τ and τAP correlations among all pairs of TTG systems.

  σ∗         R     wR    P     F1    wF1
  avg. τ     0.76  0.57  0.57  0.68  0.56
  avg. τAP   0.71  0.52  0.50  0.63  0.51

Table 4. Average τ and τAP correlations of TTG rankings over all pairs of ad-hoc runs.

  σ∗         R     wR    P     F1    wF1
  avg. τ     0.96  0.97  0.93  0.86  0.85
  avg. τAP   0.92  0.93  0.81  0.72  0.76
Table 4 reports the average τ and τAP correlations of TTG rankings over all
pairs of ad-hoc runs. It shows that there are strong correlation values for all of
the evaluation metrics, especially recall and precision. There are some noticeable differences in the values of τ and τAP, the latter being smaller. This is expected since τAP is more sensitive to changes in the ranks at the top of the list. According to Voorhees [8], a τ correlation over 0.9 “should be considered equivalent since it is not possible to be more precise than this. Correlations less than 0.8 generally reflect noticeable changes in ranking”. A later study by Sanderson and Soboroff [7] showed that τ takes lower values when lists with a smaller range of values are compared, which holds in our case. Thus, the correlation values achieved in Table 4 show that the ranking of TTG systems is almost equivalent under all TTG evaluation scores, regardless of the ad-hoc run used.
This finding is of high importance, since it suggests a possible solution
to achieve less-biased evaluation of the TTG task, simply by using a com-
mon/standard ad-hoc run when evaluating new TTG systems.
One possible and straightforward ad-hoc retrieval run that can be used as a standard run for evaluating different TTG systems is the baseline run a13. Such a run is easy to construct by searching the tweets collection without any processing of the queries. Although the retrieval effectiveness of a13 is expected to be one of the poorest (see Table 1), when we calculated the average τ and τAP correlations for ranking TTG systems with this run, compared to the other 12 ad-hoc runs, using the wF1 score, the values were 0.88 and 0.82 respectively. This is a high correlation according to Voorhees [8].
References
1. Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: SIGIR
(2000)
2. Fan, F., Qiang, R., Lv, C., Xin Zhao, W., Yang, J.: Tweet timeline generation via
graph-based dynamic greedy clustering. In: AIRS (2015)
3. Hasanain, M., Elsayed, T.: QU at TREC-2014: Online clustering with temporal
and topical expansion for tweet timeline generation. In: TREC (2014)
4. Hasanain, M., Elsayed, T., Magdy, W.: Improving tweet timeline generation by
predicting optimal retrieval depth. In: AIRS (2015)
5. Lin, J., Efron, M., Wang, Y., Sherman, G.: Overview of the TREC-2014 microblog
track. In: TREC (2014)
6. Magdy, W., Gao, W., Elganainy, T., Wei, Z.: QCRI at TREC 2014: Applying the
KISS principle for the TTG task in the microblog track. In: TREC (2014)
7. Sanderson, M., Soboroff, I.: Problems with Kendall’s tau. In: SIGIR (2007)
8. Voorhees, E.M.: Evaluation by highly relevant documents. In: SIGIR (2001)
9. Xu, T., McNamee, P., Oard, D.W.: HLTCOE at TREC 2014: Microblog and clinical
decision support. In: TREC (2014)
10. Yilmaz, E., Aslam, J.A., Robertson, S.: A new rank correlation coefficient for
information retrieval. In: SIGIR (2008)
Finding Relevant Relations
in Relevant Documents
1 Introduction
Constructing knowledge bases from text documents is a well-studied task in the
field of Natural Language Processing [3,5, inter alia]. In this work, we view the task of constructing query-specific knowledge bases from an IR perspective, where a
knowledge base of relational facts is to be extracted in response to a user infor-
mation need. The goal is to extract, select, and present the relevant information
directly in a structured and machine readable format for deeper analysis of the
topic. We focus on the following task:
Task: Given a query Q, use the documents from a large collection of Web doc-
uments to extract binary facts, i.e., subject–predicate–object triples (S, P, O)
between entities S and O with relation type P that are both correctly extracted
from the documents’ text and relevant for the query Q.
For example, a user who wants to know about the Raspberry Pi computer should
be provided with a knowledge base that includes the fact that its inventor Eben
Upton founded the Raspberry Pi Foundation, that he went to Cambridge Uni-
versity, which is located in the United Kingdom, and so on. This knowledge
base should include all relational facts about entities that are of interest when
understanding the topic according to a given relation schema, e.g., Raspberry Pi-
Foundation–founded by–Eben Upton. Figure 1 gives an example of such a query-
specific resource, and shows how relations from text and those from a knowledge
base (DBpedia, [1]) complement each other.
Fig. 1. Example of a knowledge base for the query “raspberry pi”. rf: denotes relations
extracted from documents, whereas dbp: and dbo: are predicates from DBpedia.
2 Method
Document Retrieval. We use the Galago2 search engine to retrieve documents
D from the given corpus that are relevant for the query Q. We build upon the
work of Dalton et al. [4] and rely on the same document pool and state-of-the-art
content-based retrieval and expansion models, namely the sequential dependence
model (SDM), the SDM model with query expansion through RM3 (SDM-RM3),
and the SDM model with query expansion through the top-ranked Wikipedia
article (WikiRM1).
¹ Dataset and additional information is available at http://relrels.dwslab.de.
² http://lemurproject.org/galago.php.
leading to a dataset with 207 relevant facts and 246 non-relevant facts across all 17 queries, an average of 26.6 assessed facts per query. In this study, we only consider queries with at least five correctly extracted facts (yielding 17 queries).
4 Evaluation
We evaluate here how well the pipeline of document retrieval and relation extraction performs at finding query-relevant facts. Relevance is evaluated separately from extraction correctness, as described in Sect. 3. In the following, we focus only on the 453 correctly extracted facts. For comparing different settings, we test statistically significant improvements in the accuracy measure through a two-sided exact binomial test on label agreements (α = 5 %).
Table 1. Experimental results for relation relevance (correctly extracted relations only)
comparing different fact retrieval features: All facts (All), facts also included in DBpedia
(DBp), fact mentioned three or more times (Frq≥3 ), facts extracted from a relevant
document (Doc). Significant accuracy improvements over “All” marked with †.
Indicators for Fact Relevance (RQ2). We study several indicators that may
improve the prediction of fact relevance. First, we confirm that the frequency of
fact mentions indicates fact relevance. If we classify a correctly extracted fact as ‘relevant’ only when it is mentioned at least three times⁵, then relevance accuracy improves by 23.6 %, from 0.457 to 0.565 (statistically significant). This also reduces the number of predicted facts to a fourth (see Table 1, column Frq≥3).
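A toy sketch of this frequency rule (classify a fact as relevant iff it is mentioned at least three times) and of the accuracy computation against gold labels; the data layout is an assumption.

```python
from collections import Counter

def relevance_by_frequency(fact_mentions, gold_relevance, min_mentions=3):
    """Classify each distinct (S, P, O) fact by mention frequency and score accuracy.

    fact_mentions  : list of (S, P, O) tuples, one entry per extracted mention
    gold_relevance : dict mapping each distinct fact to True/False relevance
    """
    counts = Counter(fact_mentions)
    predictions = {fact: counts[fact] >= min_mentions for fact in gold_relevance}
    correct = sum(predictions[f] == gold_relevance[f] for f in gold_relevance)
    accuracy = correct / len(gold_relevance)
    retained = {f for f, keep in predictions.items() if keep}   # facts predicted relevant
    return accuracy, retained
```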
Next, we compare the extracted facts with facts known to the large general-
purpose knowledge base DBpedia. When classifying extracted facts as relevant only when they are confirmed—that is, both entities are related in DBpedia (independent of the relation type)—we do not obtain any significant improve-
ments in accuracy or precision. Therefore, confirmation of a known fact in an
external knowledge base does not indicate relevance. However, we notice that
only 64 of the relevant facts are included in DBpedia, whereas another 143 new
and relevant facts are extracted from the document-centric approach (cf. Table 1,
column DBp). This indicates that extracting yet unknown relations (i.e., those
not found in the knowledge base) from query-relevant text has the potential to
provide the majority of relevant facts to the query-specific knowledge base.
Our study relies on a document retrieval system, leading to some non-relevant
documents in the result list. We confirm that the accuracy of relation relevance
improves significantly when we only consider documents assessed as relevant.
However, it comes at the cost of retaining only a tenth of the facts (cf. Table 1,
column Doc).
⁵ We chose ≥ 3 in order to be above the median of the number of sentences per fact, which is 2.
5 Conclusion
We investigate the idea of extracting query relevant facts from text documents
to create query-specific knowledge bases. Our study combines publicly available
data sets and state-of-the-art systems for document retrieval and relation extrac-
tion to answer research questions on the interplay between relevant documents
and relational facts for this task. We can summarize our key findings as follows:
(a) Query-specific documents contain relevant facts, but even with perfect
extractions, only around half of the facts are actually relevant with respect
to the query.
(b) Many relevant facts are not contained in a wide-coverage knowledge base like
DBpedia, suggesting the importance of extraction for query-specific knowledge
bases.
(c) Improving retrieval precision of documents increases the ratio of relevant
facts significantly, but sufficient recall is required for appropriate coverage.
(d) Facts that are relevant can contain entities (typically in object position) that
are—by themselves—not directly relevant.
From a practical perspective, we conclude that the combination of document
retrieval and relation extraction is a suitable approach to query-driven knowl-
edge base construction, but it remains an open research problem. For further advances, we recommend exploring the potential of integrating document
retrieval and relation extraction—as opposed to simply applying them sequen-
tially in the pipeline architecture.
References
1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann,
S.: DBpedia — A crystallization point for the web of data. J. Web Semant. 7(3),
154–165 (2009)
2. Blanco, R., Zaragoza, H.: Finding support sentences for entities. In: Proceedings of
SIGIR 2010, pp. 339–346 (2010)
3. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, E.R., Mitchell, T.M.:
Toward an architecture for never-ending language learning. In: Proceedings of AAAI
2010, pp. 1306–1313 (2010)
4. Dalton, J., Dietz, L., Allan, J.: Entity query feature expansion using knowledge base
links. In: Proceedings of SIGIR-2014, pp. 365–374 (2014)
5. Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information
extraction. In: Proceedings of EMNLP 2011, pp. 1535–1545 (2011)
6. Gabrilovich, E., Ringgaard, M., Subramanya, A.: FACC1: Freebase annotation of
ClueWeb corpora, Version 1 (2013)
7. Roth, B., Barth, T., Chrupala, G., Gropp, M., Klakow, D.: Relationfactory: A fast,
modular and effective system for knowledge base population. In: Proceedings of
EACL 2014, p. 89 (2014)
8. Schuhmacher, M., Dietz, L., Ponzetto, S.P.: Ranking entities for web queries through
text and knowledge. In: Proceedings of CIKM 2015 (2015)
9. Voskarides, N., Meij, E., Tsagkias, M., de Rijke, M., Weerkamp, W.: Learning to
explain entity relationships in knowledge graphs. In: Proceedings of ACL 2015,
pp. 564–574 (2015)
Probabilistic Multileave Gradient Descent
1 Introduction
Modern search engines are complex aggregates of multiple ranking signals. Such
aggregates are learned using learning to rank methods. Online learning to rank
methods learn from user interactions such as clicks [4,6,10,12]. Dueling Bandit
Gradient Descent [16] uses interleaved comparison methods [1,3,6,7,10] to infer
preferences and then learns by following a gradient that is meant to lead to an
optimal ranker.
We introduce probabilistic multileave gradient descent (P-MGD), an online
learning to rank method that builds on a recently proposed highly sensitive
and unbiased online evaluation method, viz. probabilistic multileave. Multileaved comparisons allow one to compare multiple candidate rankers per user interaction, though still only a limited number of them [13]. The more recently introduced probabilistic
multileave comparison method improves over this by allowing for comparisons
of an unlimited number of rankers at a time [15]. We show experimentally that P-
MGD significantly outperforms state-of-the-art online learning to rank methods
in terms of online performance, without sacrificing offline performance and at
greater learning speed than those methods. In particular, we include comparisons
between P-MGD on the one hand and multiple types of DBGD and multileaved
gradient descent methods [14, MGD] and candidate preselection [5, CPS] on
the other. We answer the following research questions: (RQ1) Does P-MGD converge on a ranker of the same quality as MGD and CPS? (RQ2) Does P-MGD require fewer queries to converge than MGD and CPS? (RQ3) Is the user experience during the learning process of P-MGD better than during that of MGD or CPS?
2 Probabilistic Multileaving
Multileaving [13] is an online evaluation approach for inferring preferences
between rankers from user clicks. Multileave methods take a set of rankers and
when a query is submitted a ranking is computed for each of the rankers. These
rankings are then combined into a single multileaved ranking. Team Draft Mul-
tileaving (TDM) assigns each document in this resulting ranking to a ranker.
The user is then presented with this multileaved ranking and his interactions
are recorded. TDM keeps track of the clicks and attributes every clicked docu-
ment to the ranker to which it was assigned. Two important aspects of online
evaluation methods are sensitivity and bias. TDM is more sensitive than exist-
ing interleaving methods [3,9,11], since it requires fewer user interactions to
infer preferences. Secondly, empirical evaluation also showed that TDM has no
significant bias [13]. Probabilistic multileave [15, PM] extends probabilistic inter-
leave [3, PI]. Unlike TDM, PM selects documents from a distribution where the
probability of being added correlates with the perceived relevance. It marginal-
izes over all possible team assignments, which makes it more sensitive and allows
it to infer preferences within a virtually unlimited set of rankers from a single
interaction. The increased sensitivity of PM and its lack of bias were confirmed
empirically [15]. Our novel contribution is that we use PM instead of TDM for
inferring preferences in our online learning to rank method, allowing the learner
to explore a virtually unlimited set of rankers.
be preferred over the current best; we consider the Mean-Winner update app-
roach [14] as it is the most robust; it updates the current best towards the mean
of all preferred candidates. The algorithm repeats this for every incoming query,
yielding an unending adaptive process.
Probabilistic Multileave Gradient Descent (P-MGD) is introduced in this paper.
The novelty of this method comes from the usage of PM instead of TDM as its
multileaving method. TDM needs to assign each document to a team in order to
infer preferences. This limits the number of rankers that are compared at each
interaction to the number of displayed documents. PM on the other hand allows
for a virtually unlimited number of rankers to be compared. The advantage of
P-MGD is that it can learn faster by having n, the number of candidates, in
Algorithm 1 exceed the length of the result list.
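A hedged sketch of the candidate-sampling and Mean-Winner steps shared by MGD and P-MGD is given below; the multileaved comparison that returns the indices of the candidates preferred over the current best (TDM for MGD, PM for P-MGD) is abstracted away, and the step sizes δ and η are assumptions.

```python
import numpy as np

def sample_candidates(w_best, n, delta=1.0, rng=np.random):
    """Sample n candidate rankers uniformly on the unit sphere around the current best."""
    dirs = rng.normal(size=(n, w_best.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    return w_best + delta * dirs, dirs

def mean_winner_update(w_best, dirs, preferred_idx, eta=0.01):
    """Mean-Winner update: move the current best towards the mean of all
    candidates that the multileaved comparison preferred over it."""
    if len(preferred_idx) == 0:
        return w_best                                     # no winner: keep the current ranker
    return w_best + eta * dirs[np.asarray(preferred_idx)].mean(axis=0)
```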
Candidate Preselection (CPS) [5], unlike MGD, does not alter the number of
candidates compared per impression. It speeds up learning by reusing histori-
cal data to select more promising candidates for DBGD. A set of candidates is
generated by repeatedly sampling the unit sphere around the current best uni-
formly. Several rounds are simulated to eliminate all but one candidate. Each
round starts by sampling two candidates between which a preference is inferred
with Probabilistic Interleave [3, PI]. The least preferred of the two candidates is
discarded; if no preference is found, one is discarded at random. The remaining
candidate is then used by DBGD.
4 Experimental Setup
We describe our experiments, designed to answer the research questions posed in Sect. 1.¹ An experiment is based on a stream of independent queries submitted by users interacting with the

Table 1. Overview of instantiations of CCM [2].

         P(click = 1|R)                 P(stop = 1|R)
  R      0     1     2     3     4      0     1     2     3     4
  per    0.0   0.2   0.4   0.8   1.0    0.0   0.0   0.0   0.0   0.0
  nav    0.05  0.3   0.5   0.7   0.95   0.2   0.3   0.5   0.7   0.9
  inf    0.4   0.6   0.7   0.8   0.9    0.1   0.2   0.3   0.4   0.5

¹ Our experimental code is available at https://bitbucket.org/ilps/lerot.
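As an illustration, the following toy sketch shows how such CCM instantiations are typically used to simulate clicks on a result list (the user scans top-down, clicks with probability P(click = 1|R), and stops with probability P(stop = 1|R) after a click); this cascade-style reading and the function shape are assumptions rather than Lerot's actual implementation.

```python
import random

# P(click = 1 | R) and P(stop = 1 | R) for relevance grades R = 0..4 (Table 1)
CCM = {
    'per': ([0.0, 0.2, 0.4, 0.8, 1.0], [0.0, 0.0, 0.0, 0.0, 0.0]),
    'nav': ([0.05, 0.3, 0.5, 0.7, 0.95], [0.2, 0.3, 0.5, 0.7, 0.9]),
    'inf': ([0.4, 0.6, 0.7, 0.8, 0.9], [0.1, 0.2, 0.3, 0.4, 0.5]),
}

def simulate_clicks(relevance_labels, model='inf', rng=random):
    """Simulate clicks on a result list scanned top-down under a cascade-style CCM."""
    p_click, p_stop = CCM[model]
    clicks = []
    for rank, rel in enumerate(relevance_labels):
        if rng.random() < p_click[rel]:
            clicks.append(rank)
            if rng.random() < p_stop[rel]:
                break                      # the user is satisfied and stops scanning
    return clicks
```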
Table 2. Offline score (NDCG) after 10,000 query impressions of each of the algorithms for the 3 instantiations of the CCM (see Table 1). Bold values indicate maximum performance per dataset and click model. Statistically significant improvements (losses) over the DBGD and TD-MGD baselines are marked at p < 0.05 and p < 0.01. Standard deviation in brackets.
the offline performance of CPS drops after an initial peak. CPS seems to overfit
because of the effect of historical data on candidate sampling. The other methods
sample candidates uniformly, thus noisy false preferences are expected in all
directions evenly; therefore, over time they will oscillate in the right direction.
Conversely, CPS samples more candidates in the directions that historical data
expects the best candidates to be, causing the method not to oscillate but drift
due to noise. The increased sensitivity of CPS does not compensate for its bias
in the long run.
To answer how the learning speed of P-MGD compares to our baselines (RQ2)
we consider Fig. 1, which shows offline performance on the NP-2003 dataset with
the informational click model. P-MGD with 99 candidates and CPS perform
substantially better than TD-MGD and DBGD during the first 1,000 queries
and it takes around 2,000 queries before TD-MGD reaches a similar level of
performance. From the substantial difference between P-MGD with 9 and 99
candidates, also present in the other runs, we conclude that P-MGD with a
large number of candidates has a greater learning speed.
To answer (RQ3) we evaluate the user experience during learning. Table 3
displays the results of our online experiments. In all runs the online performance
of P-MGD significantly improves over DBGD, again showing the positive effect
of increasing the number of candidates. Compared to TD-MGD, P-MGD per-
forms significantly better under the informational click model. We conclude that
P-MGD is a definite improvement over TD-MGD when clicks contain a large
amount of noise. We attribute this difference to the greater learning speed of P-
MGD: fewer queries are required to find rankers of the same performance as TD-
Table 3. Online score (NDCG) after 10,000 query impressions of each of the algorithms
for the 3 instantiations of the CCM (see Table 1). Notation is the same as that of
Table 2.
Fig. 1. Offline performance (NDCG) on the NP-2003 dataset for the informational click model; curves shown for PI-DBGD, CPS, TD-MGD-9c, P-MGD-9c, and P-MGD-99c.
MGD. Consequently, the rankings shown to users are better during the learning
process. When comparing CPS to TD-MGD we see no significant improvements
except on the informational and navigational runs on the NP-2003 dataset. This
is surprising as CPS was introduced as an alternative to DBGD that improves
the user experience. Thus, P-MGD is a better alternative to TD-MGD, especially when clicks are noisy; CPS does not offer reliable benefits compared to TD-MGD.
6 Conclusions
We have introduced an extension of multileave gradient descent (MGD) that
uses a recently introduced multileaving method, probabilistic multileaving. Our
extension, probabilistic multileave gradient descent (P-MGD) marginalizes over
document assignments in multileaved rankings. P-MGD has an increased sen-
sitivity as it can infer preferences over a large number of assignments. P-MGD
can be run with a virtually unlimited number of candidates. We have compared
P-MGD with dueling bandit gradient descent (DBGD), team-draft multileave
References
1. Chapelle, O., Joachims, T., Radlinski, F., Yue, Y.: Large-scale validation and
analysis of interleaved search evaluation. ACM Trans. Inf. Syst. 30(1), 41 (2012)
2. Guo, F., Liu, C., Wang, Y.M.: Efficient multiple-click models in web search. In:
WSDM 2009. ACM (2009)
3. Hofmann, K., Whiteson, S., de Rijke, M.: A probabilistic method for inferring
preferences from clicks. In: CIKM 2011. ACM (2011)
4. Hofmann, K., Whiteson, S., de Rijke, M.: Balancing exploration and exploitation
in listwise and pairwise online learning to rank for information retrieval. Inf. Retr.
16(1), 63–90 (2012)
5. Hofmann, K., Schuth, A., Whiteson, S., de Rijke, M.: Reusing historical interaction
data for faster online learning to rank for IR. In: WSDM 2013. ACM (2013)
6. Joachims, T.: Optimizing search engines using clickthrough data. In: KDD 2002.
ACM (2002)
7. Joachims, T.: Evaluating retrieval performance using clickthrough data. In: Franke,
J., Nakhaeizadeh, G., Renz, I. (eds.) Text Mining. Physica/Springer, Heidelberg
(2003)
8. Liu, T.-Y., Xu, J., Qin, T., Xiong, W., Li, H.: LETOR: benchmark dataset for
research on learning to rank for information retrieval. In: LR4IR 2007 (2007)
9. Radlinski, F., Craswell, N.: Optimized interleaving for online retrieval evaluation.
In: WSDM 2013. ACM (2013)
10. Radlinski, F., Kurup, M., Joachims, T.: How does clickthrough data reflect retrieval
quality? In: CIKM 2008. ACM (2008)
11. Radlinski, F., Bennett, P.N., Yilmaz, E.: Detecting duplicate web documents using
clickthrough data. In: WSDM 2011. ACM (2011)
12. Sanderson, M.: Test collection based evaluation of information retrieval systems.
Found. Tr. Inform. Retr. 4(4), 247–375 (2010)
13. Schuth, A., Sietsma, F., Whiteson, S., Lefortier, D., de Rijke, M.: Multileaved
comparisons for fast online evaluation. In: CIKM 2014, pp. 71–80. ACM, November
2014
14. Schuth, A., Oosterhuis, H., Whiteson, S., de Rijke, M.: Multileave gradient descent
for fast online learning to rank. In: WSDM 2016. ACM, February 2016
15. Schuth, A., et al.: Probabilistic multileave for online retrieval evaluation. In: SIGIR
2015 (2015)
16. Yue, Y., Joachims, T.: Interactively optimizing information retrieval systems as a
dueling bandits problem. In: ICML 2009 (2009)
Real-World Expertise Retrieval: The Information Seeking
Behaviour of Recruitment Professionals
UXLabs, London, UK
[email protected], [email protected]
1 Introduction
Research into how people find and share expertise can be traced back to the 1960s, with early studies focusing on knowledge workers such as engineers and scientists and the information sources they consult [1]. Since then, the process of finding human experts (or expertise retrieval) has been studied in a variety of contexts and has become the subject of a number of evaluation campaigns (e.g. the TREC Enterprise track and the Entity track [2, 3]). This has facilitated the development of numerous research systems and prototypes, and has led to significant advances in performance, particularly against a range of system-oriented metrics [4].
However, in recent years there has been a growing recognition that the effectiveness
of expertise retrieval systems is highly dependent on a number of contextual factors [5].
This has led to a more human-centred approach, where the emphasis is on how people
search for expertise in the context of a specific task. These studies have typically been
performed in an enterprise context, where the aim is to utilize human knowledge within
an organization as efficiently as possible (e.g. [5, 6]).
However, there is a more ubiquitous form of expertise retrieval that embodies expert
finding in its purest, most elemental form: the work of the professional recruiter. The
job of a recruiter is to find people that are the best match for a client brief and return a
list of qualified candidates. This involves the creation and execution of complex Boolean
expressions, including nested, composite structures such as the following:
Java AND (Design OR develop OR code OR Program) AND (“*
Engineer” OR MTS OR “* Develop*” OR Scientist OR technol-
ogist) AND (J2EE OR Struts OR Spring) AND (Algorithm OR
“Data Structure” OR PS OR Problem Solving)
Over time, many recruiters create their own collection of queries and draw on these
as a source of intellectual property and competitive advantage. Moreover, the creation
of such expressions is the subject of many community forums (such as Boolean Strings
and Undercover Recruiter) and the discussions that ensue involve topics that many IR
researchers would recognise as wholly within their field of expertise (such as query
expansion and optimisation, results evaluation, etc.).
However, despite these shared interests, the recruitment profession has been largely
overlooked by the IR community. Even recent systematic reviews of professional search
behaviour make no reference to this profession [7], and their information seeking behav‐
iours remain relatively unstudied. This paper seeks to address that omission. We summa‐
rise the results of a survey of 64 recruitment professionals, examining their search tasks
and behaviours, and the types of functionality that they value.
2 Background
We are aware of no prior work investigating the recruitment profession from an infor‐
mation seeking perspective. However, there are studies of other professions with related
characteristics, such as Joho et al.’s [8] survey of patent searchers and Geschwandtner
et al.’s [9] survey of medical professionals.
Unfilled vacancies have a high impact on the economy, costing the UK £18bn annually [10]. Recruitment, or sourcing, is the process of finding capable applicants for those vacancies. It is a skill that is to some extent emulated by expert finding systems [4], although recruiters must also take into account contextual variables such as availability, previous experience, remuneration, etc.
Sourcing is also similar to people search on the web where the goal is to analyse
large volumes of unstructured and noisy data to return a list of individuals who fit specific
criteria [11]. The professional recruiter must normalise and disambiguate the returned
results [2], and then apply additional factors to select a smaller group of qualified candi‐
dates. The gold standard for evaluation in this case is recommending one or more candi‐
dates that successfully fulfil a client brief.
3 Method
1 Available from https://isquared.wordpress.com/.
4 Results
4.1 Demographics
We then examined the broader query lifecycle. The majority of respondents
(80 %) used examples or templates at least sometimes, suggesting that the value
embodied in such expressions is recognised and re-used wherever possible. In addition,
most respondents (57 %) were prepared to share queries with colleagues in their
workgroup, and a further 22 % would share more broadly within their organisation. However,
very few (5 %) were prepared to share publicly, underlining the competitive nature of
the industry. Job boards such as Monster, CareerBuilder and Indeed were the most
commonly used databases (77 %), although a similar proportion (73 %) also targeted
social networks such as LinkedIn, Twitter and Facebook.
Table 1 shows the amount of time that recruiters spend in completing their most
frequently performed search task, the time spent formulating individual queries, and the
number of queries they use. On average, it takes around 3 h to complete a search task
which consists of roughly 5 queries, with each query taking around 5 min to formulate.
This suggests that recruitment follows a largely iterative paradigm, consisting of
successive phases of candidate search followed by other activities such as candidate selection
and evaluation. Compared to patent search the task completion time is less (3 h vs. 12 h)
but is longer than typical web search tasks [12]. Also, the number of queries is fewer
(5 compared to 15) but the average query formulation time is the same (5 min).
In this section we examine the mechanics of the query formulation process by asking
respondents to indicate a level of agreement with various statements using a 5-point Likert
scale ranging from strong disagreement (1) to strong agreement (5). The results are
shown in Fig. 1 as a weighted average across all responses.
Fig. 1. Weighted average level of agreement (1–5) for each feature: Boolean, synonyms, query
expansion, abbreviations, proximity, weighting, field operators, misspellings, wildcard,
truncation, query translation, and case sensitivity.
The results suggest two observations in common with patent search. Firstly, the
average of all but one of the features is above 3 (neutral), which suggests a willingness
to adopt a wide range of search functionality to complete search tasks. Secondly,
Boolean logic is shown to be the most important feature (4.25), closely followed by the
use of synonyms (4.16) and query expansion (4.02). These scores indicate that such
functionality is desired by recruiters, but the support offered by current search tools is highly
variable. On the one hand, support for complex Boolean expressions is provided by
many of the popular job boards. However, practical support for query formulation
and synonym generation is much more limited, with most current systems still relying
instead on the expertise and judgement of the recruiter.
This paper summarises the results of a survey of the information seeking behaviour of
recruitment professionals, uncovering their search needs in a manner that allows
comparison with other, better-studied professions. In this section we briefly discuss the
findings with verbatim quotes from respondents shown in italics where applicable.
Sourcing is shown to be something of a hybrid search task. The goal is essentially a
people search task, but the objects being returned are invariably documents (e.g. CVs
and resumes), so the practice also shares characteristics of document search. Recruiters
display a number of professional search characteristics that differentiate their behaviour
from web search [13], such as lengthy search sessions, different notions of relevance,
different sources searched separately, and the use of specific domain knowledge: “The
hardest part of creating a query is comprehending new information and developing a
mental model of the ideal search result.”
Recruitment professionals use complex search queries, and actively cultivate skills
in the formulation of such expressions. The search tasks they perform are inherently
interactive, requiring multiple iterations of query formulation and results evaluation: “it
is the limitations of available technology that force them to downgrade their concept
tree into a Boolean expression”. In contrast with patent searchers, recruiter search
behaviour is characterised by satisficing strategies, in which the objective is to identify
References
1. Menzel, H.: Information needs and uses in science and technology. Ann. Rev. Inf. Sci.
Technol. 1, 41–69 (1966)
2. Balog, K., Soboroff, I., Thomas, P., Craswell, N., de Vries, A.P., Bailey, P.: Overview of the
TREC 2008 enterprise track. In: The Seventeenth Text Retrieval Conference Proceedings
(TREC 2008), NIST (2009)
3. Balog, K., Serdyukov, P., de Vries, A.P.: Overview of the TREC 2011 entity track.
In: Proceedings of the Twentieth Text REtrieval Conference (TREC 2011) (2012)
4. Balog, K., Fang, Y., de Rijke, M., Serdyukov, P., Si, L.: Expertise Retrieval. Found. Trends
Inf. Retrieval 6(2–3), 127–256 (2012)
5. Hofmann, K., Balog, K., Bogers, T., de Rijke, M.: Contextual factors for finding similar
experts. Inf. Sci. Technol. 61(5), 994–1014 (2010)
6. Woudstra, L.S.E., Van den Hooff, B.J.: Inside the source selection process: Selection criteria
for human information sources. Inf. Process. Manage. 44, 1267–1278 (2008)
7. Vassilakaki, E., Moniarou-Papaconstantinou, V.: A systematic literature review informing
library and information professionals’ emerging roles. New Librar. World 116(1/2), 37–66
(2015)
8. Joho, H., Azzopardi, L., Vanderbauwhede, W.: A survey on patent users: An analysis of tasks,
behavior, search functionality and system requirements. In: Proceedings of the 3rd
Symposium on Information Interaction in Context (IIiX 2010) (2010)
9. Geschwandtner, M., Kritz, M., Boyer, C.: D8.1.2: Requirements of the health professional
search. Technical report, Khresmoi Project (2011)
10. Cann, J.: IOR Recruitment Sector Report: Report No.1 (UK), Institute of Recruiters (2015)
11. Guan, Z., Miao, G., McLoughlin, R., Yan, X., Cai, D.: Co-occurrence-based diffusion for
expert search on the web. IEEE Trans. Knowl. Data Eng. 25(5), 1001–1014 (2013)
12. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)
13. Lupu, M., Salampasis, M., Hanbury, A.: Domain specific search. Professional search in the
modern world. Springer International Publishing, pp. 96–117 (2014)
14. Evans, W.: Eye tracking online metacognition: Cognitive complexity and recruiter decision
making, TheLadders (2012)
15. Tait, J.: An introduction to professional search, pp. 1–5. Professional search in the modern
world. Springer International Publishing, Heidelberg (2014)
Compressing and Decoding Term Statistics
Time Series
1 Introduction
There is increasing awareness that time plays an important role in many retrieval
tasks, for example, searching newswire articles [5], web pages [3], and tweets [2].
It is clear that effective retrieval systems need to model the temporal characteris-
tics of the query, retrieved documents, and the collection as a whole. This paper
focuses on the problem of efficiently storing and accessing term statistics time
series—specifically, counts of unigrams and bigrams across a moving window
over a potentially large text collection. These retrospective term statistics are
useful for modeling the temporal dynamics of document collections. On Twitter,
for example, term statistics can change rapidly in response to external events
(disasters, celebrity deaths, etc.) [6]. Being able to store and access such data is
useful for the development of temporal ranking models.
Term statistics time series are large—essentially the cross product of the
vocabulary and the number of time intervals—but are also sparse, which makes
them amenable to compression. Naturally, we would like to achieve as much
3 Methods
In this work, we assume that counts are aggregated at five minute intervals,
so each unigram or bigram is associated with 24 × 60/5 = 288 values per day.
Previous work [7] suggests that smaller windows are not necessary for most appli-
cations, and coarser-grained statistics can always be derived via aggregation.
We compared five basic integer compression techniques: variable-byte encod-
ing (VB) [8], Simple16 [1], PForDelta (P4D) [9], discrete wavelet transform
(DWT) with Haar wavelets, and variants of Huffman codes [4]. The first three
are commonly used in IR applications, and therefore we simply refer readers to
previous papers for more details. We discuss the last two in more detail.
1 We set aside compression speed since we are working with retrospective collections.
Huffman Coding. A nice property of Huffman coding [4] is that it can find
the optimal prefix code for each symbol when the frequency information of all
symbols is given. In our case, given a list of counts, we first partition the list into
several blocks, with each block consisting of eight consecutive integers. After we
calculate the frequency counts of all blocks, we are able to construct a Huffman
tree over the blocks and obtain a code for each block. We then concatenate the
binary Huffman codes of all blocks and convert this long binary representation
into a sequence of 32-bit integers. Finally, we can apply any compression method
on top of these integer sequences. To decode, we first decompress the integer
array into its binary representation. Then, this binary code is checked bit by bit
to determine the boundaries of the original Huffman codes. Once the boundary
positions are obtained, we can recover the original integer counts by looking
up the Huffman code mapping. The decoding time is linear with respect to the
length of Huffman codes after concatenation.
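As an illustration of the scheme just described, the sketch below re-implements block-based Huffman coding over count arrays in Python. It is not the authors' released code; the final packing of the bit string into 32-bit integers, and any secondary compression applied on top, are omitted.

import heapq
from collections import Counter
from itertools import count

BLOCK = 8  # counts per Huffman symbol

def blocks_of(counts):
    """Split a count array (e.g. 288 five-minute counts) into 8-integer blocks."""
    return [tuple(counts[i:i + BLOCK]) for i in range(0, len(counts), BLOCK)]

def build_code(all_blocks):
    """Huffman code over blocks; internal nodes are lists, leaves are tuples."""
    freq = Counter(all_blocks)
    uid = count()  # unique tiebreaker so the heap never compares nodes directly
    heap = [(f, next(uid), blk) for blk, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(uid), [a, b]))
    root = heap[0][2]
    code = {}
    def walk(node, prefix):
        if isinstance(node, list):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            code[node] = prefix or "0"  # degenerate single-symbol case
    walk(root, "")
    return code

def encode(counts, code):
    """Concatenate the per-block codes into one bit string."""
    return "".join(code[blk] for blk in blocks_of(counts))

def decode(bits, code):
    """Walk the bit string, emitting a block whenever a code boundary is hit."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.extend(inverse[buf])
            buf = ""
    return out

# Toy usage: a sparse day of counts; the code table is built from block frequencies.
day = [0] * 280 + [3, 0, 0, 1, 0, 0, 0, 2]
table = build_code(blocks_of(day))
assert decode(encode(day, table), table) == day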
Beyond integer compression techniques, we can exploit the sparseness of unigram
counts to reduce storage for bigram counts. There is no need to store the bigram
count if any unigram of that bigram has a count of zero at that specific interval.
For example, suppose we have count arrays for unigram A, B and bigram AB
below: A: 00300523, B: 45200103, and AB: 00100002. In this case, we only need
to store the 3rd, 6th, and 8th counts for bigram AB (that is, 102), while the
other counts can be dropped since at least one of its unigrams has count zero
in those intervals. To keep track of these positions we allocate a bit vector 288
bits long (per day) and store this bit vector alongside the compressed data. This
truncation technique saves space but at the cost of an additional step during
decoding. When recovering the bigram counts, we need to consult the bit vector,
which is used to pad zeros in the truncated count array accordingly.
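A minimal sketch of this truncation step, using the worked example from the text; the function names are our own, and a real implementation would store the bit vector in packed form rather than as a Python list.

def truncate_bigram(unigram_a, unigram_b, bigram_ab):
    """Drop bigram counts at intervals where either unigram count is zero.
    Returns (kept_counts, presence_bits); the bit vector has one bit per
    interval (288 per day) marking which positions were kept."""
    bits, kept = [], []
    for a, b, ab in zip(unigram_a, unigram_b, bigram_ab):
        keep = a > 0 and b > 0
        bits.append(1 if keep else 0)
        if keep:
            kept.append(ab)
    return kept, bits

def restore_bigram(kept, bits):
    """Pad zeros back in using the bit vector recorded at encoding time."""
    it = iter(kept)
    return [next(it) if bit else 0 for bit in bits]

# Example from the text: A = 00300523, B = 45200103, AB = 00100002.
A  = [0, 0, 3, 0, 0, 5, 2, 3]
B  = [4, 5, 2, 0, 0, 1, 0, 3]
AB = [0, 0, 1, 0, 0, 0, 0, 2]
kept, bits = truncate_bigram(A, B, AB)   # kept == [1, 0, 2], i.e. "102"
assert restore_bigram(kept, bits) == AB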
In terms of physical storage, we maintain a global array by concatenating
the compressed representations for all terms across all days. To access the com-
pressed array for a term on a specific day, we need its offset and length in the
global array. Thus, we keep a separate table of the mapping from (term id, day)
to this information. Although in our experiments we assume that all data are
held in main memory, our approach can be easily extended to disk-based storage.
As an alternative, instead of placing data for all unigrams and bigrams for
all days together, we could partition the global array into several shards with
each shard containing term statistics for a particular day. The advantage of this
design is apparent: we can select which data to load into memory when the global
array is larger than the amount of memory available.
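The storage layout described above might be sketched as follows; the class and method names are hypothetical, and a disk-based or sharded variant would replace the in-memory bytearray with one file (or array) per day.

class TimeSeriesStore:
    """Illustrative in-memory layout: one global byte array of compressed
    per-day count blocks, plus a (term_id, day) -> (offset, length) table."""

    def __init__(self):
        self.blob = bytearray()
        self.index = {}  # (term_id, day) -> (offset, length)

    def append(self, term_id, day, compressed):
        """Concatenate a term's compressed counts for one day onto the blob."""
        self.index[(term_id, day)] = (len(self.blob), len(compressed))
        self.blob.extend(compressed)

    def lookup(self, term_id, day):
        """Return the compressed byte slice for a term on a specific day."""
        offset, length = self.index[(term_id, day)]
        return bytes(self.blob[offset:offset + length])

store = TimeSeriesStore()
store.append(term_id=42, day="2013-02-01", compressed=b"\x01\x02\x03")
assert store.lookup(42, "2013-02-01") == b"\x01\x02\x03"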
4 Experiments
We evaluated our compression techniques in terms of two metrics: size of the com-
pressed representation and decoding latency. For the decoding latency experi-
ments, we iterated over all unigrams or bigrams in the vocabulary, over all days,
and report the average time it takes to decode counts for a single day (i.e.,
288 integers). All our algorithms were implemented in Java and available open
source.2 Experiments were conducted on a server with dual Intel Xeon 4-core
processors (E5620 2.4 GHz) and 128 GB RAM.
2 https://github.com/Jeffyrao/time-series-compression.
Our algorithms were evaluated over the Tweets2011 and Tweets2013 collec-
tions. The Tweets2011 collection consists of an approximately 1 % sample of
tweets from January 23, 2011 to February 7, 2011 (inclusive), totaling approx-
imately 16 m tweets. This collection was used in the TREC 2011 and TREC
2012 microblog evaluations. The Tweets2013 collection consists of approximately
243 m tweets crawled from Twitter’s public sample stream between February 1
and March 31, 2013 (inclusive). This collection was used in the TREC 2013 and
TREC 2014 microblog track evaluations. All non-ASCII characters were removed
in the preprocessing phase. We set a threshold (by default, greater than one per
day) to filter out all low frequency terms (including unigrams and bigrams).
We extracted a total of 0.7 m unigrams and 7.3 m bigrams from the Tweets2011
collection; 2.3 m unigrams and 23.1 m bigrams from the Tweets2013 collection.
Results are shown in Table 1. Each row denotes a compression method. The
first row “Raw” is the collection without any compression (i.e., each count is
represented by a 32-bit integer). The row “VB” denotes variable-byte encoding;
row “P4D” denotes PForDelta. Next comes the wavelet and Huffman-based tech-
niques. The last row “Optimal” shows the optimal storage space with the lowest
entropy to represent all Huffman blocks. Given the frequency information of all
blocks, the optimal space can be computed by summing over the entropy bits
consumed by each block (which is also the minimum bits to represent a block).
The column “size” represents the compressed size of all data (in base two). To
make comparisons fair, instead of comparing with the (uncompressed) raw data,
we compared each approach against PForDelta, which is considered state of the
art in information retrieval for coding sequences such as postings lists [9]. The
column “percentage” shows relative size differences with respect to PForDelta.
The column “time” denotes the decompression time for each count array (the
integer list for one term in one day).
Results show that both Simple16 and PForDelta are effective in compress-
ing the data. Simple16 achieves better compression, but for unigrams is slightly
slower to decode. Variable-byte encoding, on the other hand, does not work par-
ticularly well: the reason is that our count arrays are aggregated over a relatively
small temporal window (five minutes) and therefore term counts are generally
small. This enables Simple16 and PForDelta to represent the values using very
few bits. In contrast, VB cannot represent an integer using fewer than eight bits.
5 Conclusion
The main contribution of our work is an exploration of integer compression
techniques for term statistics time series. We demonstrated the effectiveness of
our novel techniques based on Huffman codes, which exploit the sparse and
highly non-uniform distribution of blocks of counts. Our best technique can
reduce storage requirements by a factor of four to five compared to PForDelta
encoding. A small footprint means that it is practical to load large amounts of
term statistics time series into memory for efficient access.
Acknowledgments. This work was supported in part by the U.S. National Science
Foundation under IIS-1218043. Any opinions, findings, conclusions, or recommenda-
tions expressed are those of the authors and do not necessarily reflect the views of the
sponsor.
References
1. Anh, V.N., Moffat, A.: Inverted index compression using word-aligned binary
codes. Inf. Retrieval 8(1), 151–166 (2005)
2. Busch, M., Gade, K., Larson, B., Lok, P., Luckenbill, S., Lin, J.: Earlybird: real-
time search at Twitter. In: ICDE (2012)
3. Elsas, J.L., Dumais, S.T.: Leveraging temporal dynamics of document content in
relevance ranking. In: WSDM (2010)
4. Huffman, D.A., et al.: A method for the construction of minimum redundancy
codes. Proc. IRE 40(9), 1098–1101 (1952)
5. Jones, R., Diaz, F.: Temporal profiles of queries. ACM TOIS 25, Article no. 14
(2007)
6. Lin, J., Mishne, G.: A study of “churn” in tweets and real-time search queries. In:
ICWSM (2012)
7. Mishne, G., Dalton, J., Li, Z., Sharma, A., Lin, J.: Fast data in the era of big data:
Twitter’s real-time related query suggestion architecture. In: SIGMOD (2013)
8. Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J.
42(3), 193–201 (1999)
9. Zhang, J., Long, X., Suel, T.: Performance of compressed inverted list caching in
search engines. In: WWW (2008)
Feedback or Research: Separating Pre-purchase
from Post-purchase Consumer Reviews
1 Introduction
The content posted on online consumer review platforms contains a wealth of
information, which besides positive and negative judgments about product fea-
tures and services, often includes specific suggestions for their improvement and
root causes for customer dissatisfaction. Such information, if accurately identi-
fied, could be of immense value to businesses. Although previous research on
consumer review analysis has resulted in accurate and efficient methods for clas-
sifying reviews according to the overall sentiment polarity [8], segmenting reviews
into aspects and estimating the sentiment score of each aspect [12], as well as
summarizing both aspects and sentiments towards them [6,10,11], more focused
types of review analysis, such as detecting the intent or the timing of reviews,
are needed to better assist companies in making business decisions. One such
problem is separating the reviews (or review fragments) written by the users
2 Related Work
Although consumer reviews have been a subject of many studies over the past
decade, a common trend of recent research is to move from detecting sentiments
and opinions in online reviews towards the broader task of extracting actionable
insights from customer feedback. One relevant recent line of work focused just
on detecting wishes [5,9] in reviews or surveys. In particular, Goldberg et al.
[5] studied how wishes are expressed in general and proposed a template-based
method for detecting the wishes in product reviews and political discussion posts,
while Ramanand et al. [9] proposed a method to identify suggestions in product
reviews. Moghaddam [7] proposed a method based on distant supervision to
detect the reports of defects and suggestions for product improvements.
Other non-trivial textual classification problems have also been recently stud-
ied in the literature. For example, Bergsma et al. [2] used a combination of lexical
and syntactic features to detect whether the author of a scientific article is a
native English speaker, male or female, or whether an article was published in
a conference or a journal, while de Vel et al. [3] used style markers, structural
characteristics and gender-preferential language as features for the task of gender
and language background detection.
3 Experiments
To create the gold standard for experiments in this work1 , we collected the
reviews of all major car makes and models released to the market in the past
3 years from MSN Autos2 . Then we segmented the reviews into individual sen-
tences, removed punctuation except exclamation (!) and question (?) marks
(since [1] suggest that retaining them can improve the results of some classi-
fication tasks), and annotated the review sentences using Amazon Mechanical
Turk. In order to reduce the effect of annotator bias, we created 5 HITs for
each label and used the majority voting scheme to determine the final label for
each review sentence. In total, the gold standard consists of 3983 review sen-
tences. Table 3 shows the distribution of these sentences over classes. We used
unigram bag-of-words lexical feature representation for each review fragment
as a baseline, to which we added four binary features based on the dictionar-
ies and four binary features based on the POS tag patterns that we manually
compiled as described in Sect. 3.2. We used Naive Bayes (NB), Support Vec-
tor Machine (SVM) with linear kernel implemented in Weka machine learning
toolkit3 , as well as L2-regularized Logistic Regression (LR) implemented in LIB-
LINEAR4 [4] as classification methods. All experimental results reported in this
work were obtained using 10-fold cross validation and macro-averaged over the
folds.
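The paper's experiments use Weka and LIBLINEAR; the sketch below is a rough scikit-learn approximation of the same evaluation setup (unigram bag-of-words features, NB, a linear SVM and L2-regularised logistic regression, 10-fold cross validation with macro-averaged metrics), not the authors' code.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

CLASSIFIERS = {
    "NB": MultinomialNB(),
    "SVM": LinearSVC(),                      # linear-kernel SVM
    "LR": LogisticRegression(penalty="l2"),  # L2-regularised logistic regression
}

def evaluate(sentences, labels, n_folds=10):
    """Cross-validated accuracy and macro precision/recall over unigram
    bag-of-words features; `sentences` are review-sentence strings."""
    results = {}
    for name, clf in CLASSIFIERS.items():
        pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 1)), clf)
        scores = cross_validate(pipeline, sentences, labels, cv=n_folds,
                                scoring=("accuracy", "precision_macro", "recall_macro"))
        results[name] = {k: v.mean() for k, v in scores.items() if k.startswith("test_")}
    return results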
Each of the dictionaries contains terms that represent a particular concept
related to product experience, such as negative emotion, ownership, satisfaction,
etc. To create the dictionaries, we first came up with a small set of seed terms,
such as “buy”, “own”, “happy”, “warranty”, that capture the key lexical clues
related to the timing of review creation regardless of any particular type of
product. Then, we used an on-line thesaurus to expand the seed words with their
synonyms and considered each resulting set of words as a dictionary.

Table 1. Examples of dictionary words.
OWNERSHIP: own, ownership, owned, mine, individual, personal, etc.
PURCHASE: buy, bought, acquisition, purchase, purchased, etc.
SATISFACTION: happy, cheerful, content, delighted, glad, etc.
USAGE: warranty, guarantee, guaranty, cheap, cheaper, etc.

1 Gold standard and dictionaries are available at http://github.com/teanalab/prepost.
2 http://www.msn.com/en-us/autos.
3 http://www.cs.waikato.ac.nz/ml/weka.
4 http://www.csie.ntu.edu.tw/∼cjlin/liblinear.
Using a similar procedure, we also created a small set of POS tag-based pat-
terns that capture the key syntactic clues related to the timing of review creation
with respect to the purchase of a product. For example, the presence of sequences
of possessive pronouns and cardinal numbers (pattern “PRP$ CD”, e.g. match-
ing the phrases “my first”, “his second”, etc.), personal pronouns and past tense
verbs (pattern “PRP VBD”, e.g. matching “I owned”) or modal (pattern “PRP
MD”, e.g. matching “I can”, “you will”, etc.) verbs, past participles (pattern
“VBN”, e.g. matching “owned or driven”), as well as adjectives, including com-
parative and superlative (patterns “JJ”, “JJR” and “JJS”) indicates that a
review is likely to be post-purchase. More examples of dictionary words and
POS patterns are provided in Tables 1 and 2.
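A minimal sketch of how such binary dictionary and POS-pattern features could be computed for a single review sentence. The word lists are only the published samples from Table 1, the pattern set is a partial reconstruction from the examples above, and the function assumes pre-tagged input from any Penn Treebank-style POS tagger.

# Dictionaries (sample words only) and a hypothetical subset of POS patterns.
DICTIONARIES = {
    "OWNERSHIP": {"own", "ownership", "owned", "mine", "individual", "personal"},
    "PURCHASE": {"buy", "bought", "acquisition", "purchase", "purchased"},
    "SATISFACTION": {"happy", "cheerful", "content", "delighted", "glad"},
    "USAGE": {"warranty", "guarantee", "guaranty", "cheap", "cheaper"},
}
POS_PATTERNS = [("PRP$", "CD"), ("PRP", "VBD"), ("PRP", "MD"), ("VBN",)]

def binary_features(tagged_sentence):
    """Binary dictionary and POS-pattern features for one review sentence.
    `tagged_sentence` is a list of (token, Penn-Treebank-tag) pairs."""
    tokens = [tok.lower() for tok, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    feats = {name: int(bool(words & set(tokens)))
             for name, words in DICTIONARIES.items()}
    for pattern in POS_PATTERNS:
        hit = any(tuple(tags[i:i + len(pattern)]) == pattern
                  for i in range(len(tags) - len(pattern) + 1))
        feats["POS_" + "_".join(pattern)] = int(hit)
    return feats

example = [("I", "PRP"), ("owned", "VBD"), ("my", "PRP$"), ("first", "CD")]
print(binary_features(example))  # OWNERSHIP, PRP$_CD and PRP_VBD fire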
Results for the second set of experiments, aimed at determining the relative
performance of SVM, NB and LR classifiers in conjunction with: (1) combina-
tion of lexical and POS pattern-based features; (2) combination of lexical and
dictionary-based features; (3) combination of all three feature types (lexical, dic-
tionary and POS pattern features) are presented in Table 5, from which several
conclusions regarding the influence of non-lexical features on performance of
different classifiers for this task can be drawn.
First, we can observe that SVM achieves the highest performance among all
classifiers in terms of precision (0.752), recall (0.743) and accuracy (0.743), when
a combination of lexical, POS pattern and dictionary-based features was used.
5 Conclusion
References
1. Barbosa, L., Feng, J.: Robust Sentiment detection on Twitter from biased and
noisy data. In: Proceedings of the 23rd COLING, pp. 36–44 (2010)
2. Bergsma, S., Post, M., Yarowsky, D.: Stylometric analysis of scientific articles. In:
Proceedings of the NAACL-HLT, pp. 327–337 (2012)
3. de Vel, O.Y., Corney, M.W., Anderson, A.M., Mohay, G.M.: Language and gender
author cohort analysis of e-mail for computer forensics. In: Proceedings of the
Digital Forensics Workshop (2002)
4. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library
for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
5. Goldberg, A.B., Fillmore, N., Andrzejewski, D., Xu, Z., Gibson, B., Zhu, X.: May
all your wishes come true: a study of wishes and how to recognize them. In: Pro-
ceedings of the NAACL-HLT, pp. 263–271 (2009)
6. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the
10th ACM SIGKDD, pp. 168–177 (2004)
7. Moghaddam, S.: Beyond sentiment analysis: mining defects and improvements from
customer feedback. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR
2015. LNCS, vol. 9022, pp. 400–410. Springer, Heidelberg (2015)
8. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf.
Retrieval 2(1–2), 1–135 (2008)
9. Ramanand, J., Bhavsar, K., Pedanekar, N.: Wishful thinking: finding suggestions
and ‘buy’ wishes from product reviews. In: Proceedings of the NAACL-HLT Work-
shop on Computational Approaches to Analysis and Generation of Emotion in
Text, pp. 54–61 (2010)
10. Titov, I., McDonald, R.T.: A joint model of text and aspect ratings for sentiment
summarization. In: Proceedings of the 46th ACL, pp. 308–316 (2008)
11. Yang, Z., Kotov, A., Mohan, A., Lu, S.: Parametric and non-parametric user-aware
sentiment topic models. In: Proceedings of the 38th ACM SIGIR, pp. 413–422
(2015)
12. Yu, J., Zha, Z.J., Wang, M., Chua, T.-S.: Aspect ranking: identifying important
product aspects from online consumer reviews. In: Proceedings of the 49th ACL,
pp. 1496–1505 (2011)
Inferring the Socioeconomic Status of Social
Media Users Based on Behaviour and Language
1 Introduction
Online information has been used in recent research to derive new or enhance
our existing knowledge about the physical world. Some examples include the use
of social media or search query logs to model financial indices [1], understand
voting intentions [10] or improve disease surveillance [4,8,9]. At the same time,
complementary studies have focused on characterising individual users or specific
groups of them. It has been shown that user attributes, such as age [15], gen-
der [2], impact [7], occupation [14] or income [13], can be inferred from Twitter
profiles. This automatic and often large-scale information extraction has com-
mercial and research applications, from improving personalised advertisements
to facilitating answers to various questions in the social sciences.
This paper presents a method for classifying social media users according to
their socioeconomic status (SES). SES can be broadly defined as one’s access to
financial, social, cultural, and human capital resources; it also includes additional
components such as parental and neighbourhood properties [3]. We focused our
work on the microblogging platform of Twitter and formed a new data set of user
profiles together with a SES label for each one of them. To map users to a SES,
we utilised the Standard Occupation Classification (SOC) hierarchy, a broad tax-
onomy of occupations attached to socioeconomic categorisations in conjunction
with the National Statistics Socio-Economic Classification (NS-SEC) [5,17].
Users are represented by a broad set of features reflecting their behaviour
and impact on the social platform. The classification task uses a nonlinear, ker-
nelised method that can more efficiently capture the divergent feature categories.
Related work has looked into different aspects of this problem, such as inferring
the job category [14] or the income (as a regression task [13]) of social media
users. As with our work here, nonlinear methods showed better performance
in these tasks as well. However, the previously proposed models did not jointly
explore the various sets of features reported in this paper. The proposed classifier
achieves a strong performance in both 3-way and binary classification scenarios.
Table 1. 1-gram samples from a subset of the 200 latent topics (word clusters)
extracted automatically from Twitter data (D2 ).
The dimensionality of user attributes c3 and c4 , after filtering out stop words
and n-grams occurring less than two times in the data, was equal to 523 (1-grams
plus 2-grams) and 560 (1-grams), respectively. Thus, a Twitter user in our data
set is represented by a 1,291-dimensional feature vector.
We applied spectral clustering [12] on D2 to derive 200 (hard) clusters of
1-grams that capture a number of latent topics and linguistic expressions (e.g.
‘Politics’, ‘Sports’, ‘Internet Slang’), a snapshot of which is presented in Table 1.
Previous research has shown that this amount of clusters is adequate for achiev-
ing a strong performance in similar tasks [7,13,14]. We then computed the fre-
quency of each topic in the tweets of D1 as described in feature category c5 .
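A sketch of this clustering step under stated assumptions: the word representation and affinity used below (cosine similarity over arbitrary word vectors) are illustrative choices rather than necessarily those used in the paper, and the normalisation of the topic-frequency feature is likewise an assumption.

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_words(word_vectors, words, n_clusters=200, seed=0):
    """Hard word clusters via spectral clustering on a cosine-similarity
    affinity matrix. `word_vectors` is an (n_words x d) array, e.g. rows of
    a word-context count matrix (an assumption, not the paper's construction)."""
    normed = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, 1.0)  # non-negative similarities
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=seed).fit_predict(affinity)
    return {w: int(c) for w, c in zip(words, labels)}

def topic_frequencies(tokens, word_to_cluster, n_clusters=200):
    """Per-user topic frequencies: how often each cluster occurs in the tweets."""
    counts = np.zeros(n_clusters)
    for tok in tokens:
        if tok in word_to_cluster:
            counts[word_to_cluster[tok]] += 1
    return counts / max(counts.sum(), 1)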
To obtain a SES label for each user account, we took advantage of the SOC
hierarchy’s characteristics [5]. In SOC, jobs are categorised based on the required
skill level and specialisation. At the top level, there exist 9 general occupation
groups, and the scheme breaks down to sub-categories forming a 4-level struc-
ture. The bottom of this hierarchy contains more specific job groupings (369
in total). SOC also provides a simplified mapping from these job groupings to a
SES as defined by NS-SEC [17]. We used this mapping to assign an upper, mid-
dle or lower SES to each user account in our data set. This process resulted in
710, 318 and 314 users in the upper, middle and lower SES classes, respectively.2
2 The data set is available at http://dx.doi.org/10.6084/m9.figshare.1619703.
3 Classification Methods
where m(·) is the mean function (here set equal to 0) and k(·, ·) is the covariance
kernel. We apply the squared exponential (SE) kernel, also known as the radial
basis function (RBF), defined as kSE(x, x′) = θ² exp(−‖x − x′‖₂² / (2ℓ²)), where
θ² is a constant that describes the overall level of variance and ℓ is referred to as
the characteristic length-scale parameter. Note that ℓ is inversely proportional
to the predictive relevancy of x (high values indicate a low degree of relevance).
Binary classification using GPs ‘squashes’ the real-valued latent function f(x)
output through a logistic function: π(x) ≜ P(y = 1|x) = σ(f(x)), in a similar
way to logistic regression classification. In binary classification, the distribution
over the latent f∗ is combined with the logistic function to produce the prediction
π̄∗ = ∫ σ(f∗) P(f∗ | x, y, x∗) df∗. The posterior formulation has a non-Gaussian
likelihood and thus the model parameters can only be estimated. For this purpose
we use the Laplace approximation [16,18].
Based on the property that the sum of covariance functions is also a valid
covariance function [16], we model the different user feature categories with a
different SE kernel. The final covariance function, therefore, becomes

    k(x, x′) = Σ_{n=1}^{C} kSE(cn, cn′) + kN(x, x′),        (2)

where cn is used to express the features of each category, i.e., x = {c1, . . . , cC},
C is equal to the number of feature categories (in our experimental setup, C = 5),
and kN(x, x′) = θN² × δ(x, x′) models noise (δ being a Kronecker delta function).
Similar GP kernel formulations have been applied for text regression tasks [7,9,11]
as a way of capturing groupings of the feature space more effectively.
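To make Eq. (2) concrete, here is a plain NumPy sketch of the composite covariance function. The feature grouping, hyperparameter values and noise level are illustrative, and the full GP classifier (Laplace approximation, one-vs.-all) is omitted.

import numpy as np

def se_kernel(A, B, variance, lengthscale):
    """Squared-exponential (RBF) kernel between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def composite_kernel(X, Y, groups, params, noise_var=1e-3):
    """Eq. (2): one SE kernel per feature category plus a noise term.
    `groups` maps category name -> column indices; `params` maps category
    name -> (variance, lengthscale). All values here are assumptions."""
    K = sum(se_kernel(X[:, idx], Y[:, idx], *params[name])
            for name, idx in groups.items())
    if X is Y:  # Kronecker delta: noise contributes only on the diagonal of K(X, X)
        K = K + noise_var * np.eye(X.shape[0])
    return K

# Toy usage with two of the five feature categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
groups = {"c1": [0, 1], "c2": [2, 3]}
params = {"c1": (1.0, 0.5), "c2": (1.0, 2.0)}
K = composite_kernel(X, X, groups, params)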
Although related work has indicated the superiority of nonlinear approaches
in similar multimodal tasks [7,14], we also estimate a performance baseline using
a linear method. Given the high dimensionality of our task, we apply logistic
regression with elastic net regularisation [6] for this purpose. As both classifica-
tion techniques can address binary tasks, we adopt the one-vs.-all strategy for
conducting inference.
4 Experimental Results
We assess the performance of the proposed classifiers via a stratified 10-fold
cross validation. Each fold contains a random 10 % sample of the users from
each of the three socioeconomic statuses. To train the classifier on a balanced
data set, during training we over-sample the two less dominant classes (middle
and lower), so that they match the size of the one with the greatest representation
(upper). We have also tested the performance of a binary classifier, where the
middle and lower classes are merged. The cumulative confusion matrices (all data
Table 2. SES classification mean performance as estimated via a 10-fold cross valida-
tion of the composite GP classifier for both problem specifications. Parentheses hold
the SD of the mean estimate.
Fig. 1. The cumulative confusion matrices for the 3-way (left) and binary (right) clas-
sification tasks. Columns contain the Target class labels and rows the Output ones.
The row and column extensions respectively specify the Precision and Recall per class.
The numeric identifiers (1–3) are in descending SES order (upper to lower).
from the 10 folds) for both classification scenarios and the GP-based classifier
are presented in Fig. 1. Table 2 holds the respective mean performance metrics.
The mean accuracy of the 3-way classification obtained by the GP model is equal
to 75.09 % (SD = 3.28 %). The regularised logistic regression model yielded a
mean accuracy of 72.01 % (SD = 2.45 %). A two sample t-test concluded that
the 3.08 % difference between these mean performances is statistically signifi-
cant (p = 0.029). The precision and recall per class are reported in the row and
column extensions of the confusion matrices respectively. It is evident that it is
more difficult to correctly classify users from the middle class (lowest precision
and recall). The binary classifier is able to create a much better class separa-
tion, achieving a mean accuracy of 82.05 % (SD = 2.41 %) with fairly balanced
precision and recall among the classes.
Looking at the occupation titles of the users for whom false negatives occurred in
the 3-way classification, we identified the following jobs as the most error-prone:
‘sports players’ for the upper class, ‘photographers’, ‘broadcasting equipment
operators’, ‘product/clothing designers’ for the middle class, ‘fitness instructors’
and ‘bar staff’ for the lower class. Further investigation is needed to fully under-
stand the nature of these errors. However, we note that SES is influenced by
many factors, including income, education and occupation. In contrast, our clas-
sifier does not explicitly consider either income or education, and this may limit
accuracy.
References
1. Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. J. Comput.
Sci. 2(1), 1–8 (2011)
2. Burger, D.J., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on twit-
ter. In: EMNLP, pp. 1301–1309 (2011)
3. Cowan, C.D., et al.: Improving the measurement of socioeconomic status for the
national assessment of educational progress: a theoretical foundation. Technical
report, National Center for Education Statistics (2003)
4. Culotta, A.: Towards detecting influenza epidemics by analyzing Twitter messages.
In: SMA, pp. 115–122 (2010)
5. Elias, P., Birch, M.: SOC2010: revision of the standard occupational classification.
Econ. Labour Mark. Rev. 4(7), 48–55 (2010)
6. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear
models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
7. Lampos, V., Aletras, N., Preoţiuc-Pietro, D., Cohn, T.: Predicting and character-
ising user impact on Twitter. In: EACL, pp. 405–413 (2014)
8. Lampos, V., Cristianini, N.: Tracking the flu pandemic by monitoring the social
web. In: CIP, pp. 411–416 (2010)
9. Lampos, V., Miller, A.C., Crossan, S., Stefansen, C.: Advances in nowcasting
influenza-like illness rates using search query logs. Sci. Rep. 5, 12760 (2015)
10. Lampos, V., Preoţiuc-Pietro, D., Cohn, T.: A user-centric model of voting intention
from social media. In: ACL, pp. 993–1003 (2013)
11. Lampos, V., Yom-Tov, E., Pebody, R., Cox, I.: Assessing the impact of a health
intervention via user-generated Internet content. Data Min. Knowl. Disc. 29(5),
1434–1457 (2015)
12. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416
(2007)
13. Preoţiuc-Pietro, D., Volkova, S., Lampos, V., Bachrach, Y., Aletras, N.: Studying
user income through language, behaviour and affect in social media. PLoS ONE
10(9), e0138717 (2015)
14. Preoţiuc-Pietro, D., Lampos, V., Aletras, N.: An analysis of the user occupational
class through Twitter content. In: ACL, pp. 1754–1764 (2015)
15. Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes
in Twitter. In: SMUC, pp. 37–44 (2010)
16. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT
Press, Cambridge (2006)
17. Rose, D., Pevalin, D.: Re-basing the NS-SEC on SOC2010: a report to ONS. Technical
report, University of Essex (2010)
18. Williams, C.K.I., Barber, D.: Bayesian classification with Gaussian processes. IEEE
Trans. Pattern Anal. 20(12), 1342–1351 (1998)
Two Scrolls or One Click: A Cost Model
for Browsing Search Results
Abstract. Modeling how people interact with search interfaces has been
of particular interest and importance to the field of Interactive Informa-
tion Retrieval. Recently, there has been a move to developing formal
models of the interaction between the user and the system, whether
it be to run a simulation, conduct an economic analysis, measure sys-
tem performance, or simply to better understand the interactions. In
this paper, we present a cost model that characterizes a user examin-
ing search results. The model shows under what conditions the interface
should be more scroll based or more click based and provides ways to
estimate the number of results per page based on the size of the screen
and the various interaction costs. Further extensions to the model could
be easily included to model different types of browsing and other costs.
1 Introduction
Fig. 1. The area marked by the dotted line shows how much of the page is initially
visible, where k snippets can be seen. k will vary according to screen size. If the number
of results per page n is larger than k, then n − k results are below the fold.
2 Cost Model
To develop a cost model for results browsing we assume that the user will be
interacting with a standard search engine result page (SERP) with the following
layout: a query box, a list of search results (snippets), and pagination buttons
(see Fig. 1). Put more formally, the SERP displays n snippets, of which only
k are visible above-the-fold. To view the remaining n − k snippets, i.e., those
that are below-the-fold, the user needs to scroll down the page, while to see the
next n snippets, the user needs to paginate (i.e., click next). And so we wonder
whether it is better to scroll, to click, or to use some combination of the two.
Here, we consider the case where the user wants the document at the mth
result. However, m is not known a priori. To calculate the total browsing costs
we assume that the user has just entered their query and has been presented
with the result list. We further assume that there are three main actions the user
can perform: inspecting a snippet, scrolling down the list, or clicking to go to
the next page. Therefore, we are also assuming a linear traversal of the ranked
list. Each action incurs a cost: Cs to inspect a snippet, Cscr to scroll to the next
snippet1 , and when the user presses the ‘next’ button to see the subsequent n
results, they incur a click cost Cc . The click cost includes the time it takes the
user to click and the time it takes the system to respond. Given these costs, we
can now express a cost model for browsing to the mth result as follows:
Cb(n, k, m) = ⌈m/n⌉ · Cc + (⌊m/n⌋ · (n − k) + (mr − k)) · Cscr + m · Cs        (1)
where mr represents the remaining snippets to inspect on the last page. Equa-
tion 1 is composed of three distinct components: the number of clicks the user
must perform, the amount of scrolling required, and finally the number of snip-
pets they need to inspect. The number of clicks is the number of pages that need
to be viewed rounded up, because the whole page needs to be viewed. The num-
ber of scrolls is based on how many full pages of results need to be examined,
and how many results remain on the last page that need to be inspected. The
remaining snippets to inspect on the last page is mr = m − ⌊m/n⌋ · n. However,
if mr < k, then mr = k as there is no scrolling on the last page. Note that k is
bounded by n, i.e., if only two results are shown per page, the maximum number
of viewable results per page is 2. It is possible that only part of the result snippet
is viewable, so k is bound as follows: 0 < k ≤ n. The estimated cost is based on
the number of “clicks,” “scrolls” and result snippets viewed (i.e., m).
1 i.e., the scroll cost is the average cost to scroll the distance of one snippet, which
includes the time to scroll and then focus on the next result.
With this model it is possible to analyze the costs of various designs by setting
the parameters accordingly. For example, a mobile search interface with a small
screen size can be represented with a low k, while a desktop search interface
with a large screen can be characterized with a larger k. The interaction costs
for different devices can also be encoded accordingly.
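Equation (1) can be transcribed almost directly into code. The sketch below uses the interaction costs quoted below for Fig. 2 (0.25 s scroll, 2 s click, 2.5 s snippet inspection) and reproduces the m = 12 versus m = 13 comparison discussed in the text; the clamping of k and mr follows the conditions stated after the equation, and the rounding (ceiling for clicks, floor for full pages) is our reading of the prose.

from math import ceil, floor

def browsing_cost(n, k, m, c_scr=0.25, c_c=2.0, c_s=2.5):
    """Total browsing time (seconds) to reach the m-th result, per Eq. (1):
    n results per page, k of which are visible above the fold."""
    k = min(k, n)                          # k is bounded by n (0 < k <= n)
    full_pages = floor(m / n)
    m_r = m - full_pages * n               # snippets left on the last page
    if m_r < k:
        m_r = k                            # no scrolling on the last page
    clicks = ceil(m / n)
    scrolls = full_pages * (n - k) + (m_r - k)
    return clicks * c_c + scrolls * c_scr + m * c_s

# Reproduces the observation in the text: with k = 6, n = 6 is cheaper at m = 12,
# while n = 10 becomes cheaper at m = 13.
for m in (12, 13):
    print(m, {n: browsing_cost(n, k=6, m=m) for n in (1, 3, 6, 10)})

With these default costs and k = 6, k · Cscr = 1.5 < Cc = 2, which is the case where larger result pages reduce the total cost (cf. Eq. (4) below).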
Figure 2 shows an example of the cost of browsing m results when the number
of results presented per page (n) is 1, 3, 6 and 10 with up to k = 6 viewable
results in the display window. Here we have approximated the costs of interaction
as follows: 0.25 seconds to scroll, 2 seconds to click and 2.5 seconds to inspect a
snippet2 .
Intuitively, displaying one result per page requires the most time as (m − 1)
clicks are needed to find the mth document. Displaying three results per page
is also costly as m increases requiring approximately m/3 clicks. However, the
difference in costs for 6 and 10 results per page vary depending on the specific
number of results the user wants to inspect. For example, if m = 12 then 6
results per page is lower in cost; whereas if m = 13 then 10 results per page
is lower in cost. In this example, since scrolling is relatively cheap, one might
be tempted to conclude that the size of the result page should be as large as
possible. However, using the model, we can determine the optimal size of the
result page depending on the different parameters.
2 These values were based on the estimated time spent examining each snippet being
between 1.7 and 3 seconds from [1], the GOMS timings for a mouse move (1.1 s), click
(0.2 s) and system response (0.8 s) being approximately 2 seconds taken from [2] and
based on [6]. For the scroll action on a wheeled mouse, we assume that it is similar
to a key press (0.28 s) and was approximated by 0.25 s.
Fig. 2. The cost (total time in seconds) to examine m snippets for SERPs of different
sizes (cscr = 0.25, cc = 2, cs = 2.5 s).
∂Ĉb/∂n = −(m/n²) · Cc + (m/n²) · k · Cscr        (3)

and then solve the equation by setting ∂Ĉb/∂n = 0, in order to find what values
minimize Eq. 2. The following is obtained:

    −(m/n²) · Cc + (m/n²) · k · Cscr = 0
    (m/n²) · k · Cscr = (m/n²) · Cc
    k · Cscr = Cc        (4)
Interestingly, n disappears from the equation. This at first seems
counterintuitive, as it suggests that to minimize the cost of interaction n is not a factor.
However, on closer inspection we see that the number of results to show per page
depends on the balance between k and Cscr , on one hand, and Cc , on the other.
If k.Cscr is greater than, equal to, or less than Cc , then the influence on total
cost in Eq. 1 results in three different cases (see below). To help illustrate these
cases, we have plotted three examples in Fig. 3, where the user would like to
inspect m = 25 result snippets, and a maximum of k = 6 result snippets are
viewable per page.
(Plotted curves in Fig. 3: Cscr = 0.1, Cc = 2; Cscr = 0.5, Cc = 2; and Cscr = Cc = 2. The y-axis is C,
the results browsing cost, and the x-axis is n, the number of results per page.)
Fig. 3. An example of how the cost changes as the page size increases, when m = 25
and k ≤ 6 for the three different cases of k.Cscr versus Cc .
cost Cscr , it is likely that the total cost is approximated by the blue dot dashed
line and the red solid line shown in Fig. 3. This is interesting because choosing a
SERP size of n = 10 (as done by most search engines when interacting from a
PC), tends to be close to the minimum cost. While increasing the SERP
size beyond ten would lead to lower total costs, this is at a diminishing rate.
In this model, we have assumed a fixed download cost (within Cc). However,
a more realistic estimate of this cost would be proportional to n, i.e., a click cost
Cc(n), where a larger page takes longer to download. Another refinement of the
model would be to condition scrolling on the number of results that need to
be scrolled through; as users might find it increasingly difficult and cognitively
taxing to scroll through long lists. Nonetheless, our model is still informative
and a starting point for estimating the browsing costs. Future work, therefore,
could: (i) extend the model to cater for these other costs in order to obtain a
more accurate estimate of the overall cost, (ii) obtain empirical estimates for
the different costs, on different devices (e.g., laptops, mobiles, desktops, tablets,
etc.) as well as with different means of interaction (e.g., mouse with/without a
scroll wheel, touchscreen, touchmouse, voice, etc.), and, (iii) incorporate such
a browsing model into simulations, measures and analyses. A further extension
would be to consider different types of layouts (e.g., grids, lists, columns, etc.)
and different scenarios (e.g., finding an app on a tablet, mobile, etc.).
References
1. Azzopardi, L.: Modelling interaction with economic models of search. In: Proceed-
ings of the 37th ACM SIGIR Conference, pp. 3–12 (2014)
2. Azzopardi, L., Kelly, D., Brennan, K.: How query cost affects search behavior. In:
Proceedings of the 36th ACM SIGIR Conference, pp. 23–32 (2013)
3. Azzopardi, L., Zuccon, G.: An analysis of theories of search and search behavior.
In: Proceedings of the 2015 International Conference on The Theory of Information
Retrieval, pp. 81–90 (2015)
4. Baskaya, F., Keskustalo, H., Järvelin, K.: Time drives interaction: simulating ses-
sions in diverse searching environments. In: Proceedings of the 35th ACM SIGIR
Conference, pp. 105–114 (2012)
5. Baskaya, F., Keskustalo, H., Järvelin, K.: Modeling behavioral factors in interactive
IR. In: Proceedings of the 22nd ACM SIGIR Conference, pp. 2297–2302 (2013)
6. Card, S.K., Moran, T.P., Newell, A.: The keystroke-level model for user perfor-
mance time with interactive systems. Comm. ACM 23(7), 396–410 (1980)
7. Kashyap, A., Hristidis, V., Petropoulos, M.: Facetor: Cost-driven exploration of
faceted query results. In: Proceedings of the 19th ACM CIKM, pp. 719–728 (2010)
8. Kelly, D., Azzopardi, L.: How many results per page?: A study of serp size, search
behavior and user experience. In: Proceedings of the 38th ACM SIGIR Conference,
pp. 183–192. SIGIR 2015 (2015)
9. Pirolli, P., Card, S.: Information foraging. Psych. Rev. 106, 643–675 (1999)
10. Russell, D.M., Stefik, M.J., Pirolli, P., Card, S.K.: The cost structure of sensemak-
ing. In: Proceedings of the INTERACT/SIGCHI, pp. 269–276 (1993)
11. Smucker, M.D., Clarke, C.L.: Time-based calibration of effectiveness measures. In:
Proceedings of the 35th ACM SIGIR Conference, pp. 95–104 (2012)
12. Smucker, M.D., Jethani, C.P.: Human performance and retrieval precision revis-
ited. In: Proceedings of the 33rd ACM SIGIR Conference, pp. 595–602 (2010)
Determining the Optimal Session Interval for Transaction
Log Analysis of an Online Library Catalogue
1 Introduction
For the practical purposes of TLA, a variety of methods for assigning sessions have
been developed, with Gayo-Avello [4] providing a comprehensive summary of these
approaches. Some researchers have advocated methods based on query reformulation
[9], navigation patterns [12], and combinations of various metrics [10]. However, such
methods are often complex and time-consuming to implement on the logs practically
acquired from search systems. The simplest, and most widely used, method is the
adoption of a session temporal cut-off interval, which segments sessions according to a set
period of inactivity. Thus a new session is applied to logs originating from a single IP
address if server transactions attributable to that IP address are separated by a
pre-determined time interval, e.g. 30 min.
The work presented in this paper relates to a wider research project investigating the
users and uses of WorldCat.org, an online union catalogue operated by OCLC, and
containing the aggregated catalogues of over 70,000 libraries from around the world. In
conducting TLA on search logs from WorldCat.org the challenge of determining an
appropriate means of identifying and segmenting sessions within the logs arose. Whilst
a 30 min inactive period is most commonly used for web search logs, there is little
evidence to support the use of this period for the logs of an online library catalogue
system. This paper describes an attempt to determine the optimum session interval for
the WorldCat.org log by comparing the segmentation and collation of sessions for
various session intervals with human judgements. Section 2 discusses related work;
Sect. 3 describes the methodology used to study sessions; Sect. 4 presents and discusses
results; and Sect. 5 concludes the paper.
2 Related Work
While there is a long history of applying TLA to library catalogues and other resource
databases [19], little attention has been paid to the process of session segmentation.
Despite the apparent advantages of session-level analysis, studies of library systems
frequently choose to focus on query-level analysis (e.g., [7, 15]), thereby negating a need
for session segmentation; whilst other studies that employ session level analysis do not
specify how sessions have been segmented (e.g., [11, 18]). Other work in this area makes
use of system-determined session delimitations, either through the use of client-side
session cookies (e.g., [17]), or server-side system time-outs (e.g., [2, 14]); however, no
details are provided regarding the precise details of the time-out periods. Only a small
number of library system studies do define a session cut-off interval. At one extreme,
Dogan et al.’s study of PubMed [3] specifies that all actions from a single user in a 24 h
period are considered a single session. Lown [13] and Goodale and Clough [6] adopt a
30-minute session interval, with this period apparently based on the standard session
interval applied to web search logs.
Given the paucity of discussion of this issue in studies relating to library systems, it
is instructive to consider the greater body of literature relating to TLA for web search.
Here a general consensus has emerged for a session interval of 30 min [10]. This figure
is based on early search log work by Catledge and Pitkow [1], which showed that a 25.5 min
session interval meant that most events occurred within 1.5 standard deviations of the
mean inactive period. Jones and Klinkner [10] used an analysis of manually annotated
search sessions to argue that the 30-minute interval is not supported, and that any
segmentation based solely on temporal factors achieves only 70-80 % accuracy. Other
researchers have suggested both lower and higher session interval periods, ranging from
15 min [9] to 125 min [16]. In their study of Reuters Intranet logs, Goker and He [5]
compared session boundaries created by a range of session intervals with human
judgements. They also identified different error types: Type A errors being when adjacent log
activity is incorrectly split into different sessions; and Type B errors when unrelated
activities are incorrectly collated into the same session. Their findings indicate that
whilst overall accuracy was relatively stable for intervals between 10 min and an hour,
errors were split equally between the two error types for intervals of 10 to 15 min. Above
15 min, Type B errors were found to predominate. Their overall findings indicate an
optimum session interval of between 11 and 15 min. The work presented in this paper applies
this method to the logs of an online union catalogue, and represents the first attempt to
establish an optimum session interval for segmentation of this type of log.
3 Methodology
The WorldCat.org log data contained the fields shown in Table 1.

Table 1. Fields in the WorldCat.org log data.
Anonymised IP Address: A random code assigned to each unique IP address present in the log
Country of origin: The country of origin of the IP address, as determined by an IP lookup service
Date: The date of the server interaction
Time: The time of the server interaction (hh:mm:ss)
URL: The URL executed by the server
OCLCID: The OCLC ID of the item being viewed (if applicable)
Referrer URL: The page from which the URL was executed
Browser: Technical information about the browser type and version

A random sample of 10,000 lines of the logs ordered by IP address was generated (representing
721 unique IP addresses), and all instances identified where lines of the log originating from the
same IP address were separated by between 10 and 60 min. A total of 487 such instances
were found. Each instance was then manually examined in the context of the full logs
to determine whether the activity either side of the inactive period might reasonably be
considered part of the same session, and coded accordingly (“Same session” or
“Different Session”). Following [5], this judgement was primarily based on the subject
area of the queries executed and items viewed either side of the inactive period. Since
this judgement was inherently subjective a third code was also used (“Unknown”) to
limit the likelihood of incorrect judgements. This was applied in circumstances where
there was no reasonable way of judging whether the inactive period constituted a new
session or not. A subset of 20 % of instances were coded by a second assessor, and results
compared. Overall inter-coder reliability using Cohen’s kappa coefficient was shown to
be κ = 0.86, above the 0.80 required to indicate reliable coding [21].
The resulting dataset consisted of 487 inactive periods of between 10 and 60 min,
and the code assigned to each period. 99 of these were coded “Unknown”, and were not
considered for further analysis. It was subsequently possible to simulate the effectiveness
of a variety of potential session timeout durations based on the codes assigned to the
388 remaining inactive periods. Where i = the inactive period in the log, t = the proposed
timeout duration, s = “Same session” and d = “Different session”, we observe four
potential outcomes:
1. i > t, s = Incorrect session segmentation (Type A error)
2. i > t, d = Correct session segmentation
3. i < t, s = Correct session collation
4. i < t, d = Incorrect session collation (Type B error)
Outcomes were calculated for each of the coded inactive periods in the log sample
(n = 388) for session intervals at 30 s intervals between 10 and 60 min.
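The outcome rules above map directly onto a small simulation. The sketch below counts Type A and Type B error rates for candidate cut-offs at 30-second steps between 10 and 60 minutes; the coded periods shown are hypothetical placeholders standing in for the 388 used in the study, and ties (inactive period exactly equal to the timeout) are treated as collated.

def outcome(inactive_minutes, code, timeout):
    """Classify one coded inactive period for a candidate timeout (minutes).
    `code` is 's' (same session) or 'd' (different session)."""
    if inactive_minutes > timeout:
        return "correct_split" if code == "d" else "type_A"        # wrongly split
    return "correct_collation" if code == "s" else "type_B"        # wrongly collated

def error_rates(periods, timeouts):
    """Proportion of Type A / Type B errors and overall accuracy per timeout.
    `periods` is a list of (inactive_minutes, code) pairs from manual coding."""
    rates = {}
    for t in timeouts:
        outcomes = [outcome(i, c, t) for i, c in periods]
        n = len(outcomes)
        rates[t] = {
            "type_A": outcomes.count("type_A") / n,
            "type_B": outcomes.count("type_B") / n,
            "accuracy": sum(o.startswith("correct") for o in outcomes) / n,
        }
    return rates

# Hypothetical coded inactive periods (minutes, code).
sample = [(12.0, "s"), (45.0, "d"), (33.5, "s"), (27.0, "d")]
candidate_timeouts = [t / 2 for t in range(20, 121)]  # 10 to 60 min in 30 s steps
rates = error_rates(sample, candidate_timeouts)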
A session cut-off time of 39 min was found to provide the highest proportion of correctly
segmented sessions (77.1 %), although there was little variation in the proportion of
correctly segmented sessions between 26 and 57 min, with each session interval period
producing correct outcomes for over 75 % of inactive periods. However, the results
indicate that using a 10-minute cut-off time results in a high proportion (70 %) of sessions
representing a Type A error (the sessions being incorrectly split). Naturally as the session
cut-off period is extended, an increasing number of sessions are incorrectly collated.
A session cut-off time of 28 min was found to produce an equal number of the two error
types (Type A = 13 %; Type B = 13 %). Thus we can conclude that whilst session
intervals of between 26 and 57 min have little effect on the overall accuracy of session
segmentation, there is variation in the distribution of the error types. A session interval
period of between 28 and 29 min is shown to reduce the likelihood of one error type
predominating (see Fig. 1).
Whilst this goes some way to validating the commonly used 30-minute session
interval, we suggest that attention needs to be paid to the overall aims of the TLA being
undertaken. In particular, researchers may be conducting analysis for purposes where
reducing one particular error type is preferable. In conducting TLA for the development
of user-orientated learning techniques, for example, Goker and He [5] argue that mini‐
mising overall and Type B errors is a priority. Other situations, for example the inves‐
tigation of rates of query reformulation, may demand a reduction in Type A errors.
Thus whilst a 30-minute session interval provides the most effective means of mitigating
the effects of any one error type, researchers investigating library catalogue logs should
consider raising or lowering the session interval depending on their research goals.
5 Conclusions
This paper investigates the effects of using time intervals between 10 and 60 min for
segmenting search logs from WorldCat.org into sessions for subsequent analysis of
user searching behaviour. A period of 30 min is commonly used in the literature, partic‐
ularly in the analysis of web search logs. However, this is often without sufficient justi‐
fication. Analysis of library catalogue logs frequently also uses a 30 min cut-off, or does
not employ sessionisation at all. Based on a manual analysis of sessions from
WorldCat.org, our results indicate that the accuracy of segmenting sessions is relatively
stable for time intervals between 26 and 57 min, with 28 and 29 min shown to reduce
the likelihood of one error type (incorrect segmentation or incorrect collation) predom‐
inating. This work supports the use of a 30-minute timeout period in TLA studies, and
is of particular value to researchers wishing to conduct session level analysis of library
catalogue logs.
References
1. Catledge, L.D., Pitkow, J.E.: Characterizing browsing strategies in the World-Wide Web.
Comput. Netw. Isdn. 27(6), 1065–1073 (1995)
2. Cooper, M.D.: Usage patterns of a web-based library catalog. J. Am. Soc. Inf. Sci. Tec. 52(2),
137–148 (2001)
3. Dogan, R.I., Murray, G.C., Névéol, A., Lu, Z.: Understanding PubMed® user search behavior
through log analysis. Database 2009 (bap018) (2009)
4. Gayo-Avello, D.: A survey on session detection methods in query logs and a proposal for
future evaluation. Inf. Sci. 179(12), 1822–1843 (2009)
5. Göker, A., He, D.: Analysing web search logs to determine session boundaries for user-
oriented learning. In: Brusilovsky, P., Stock, O., Strapparava, C. (eds.) AH 2000. LNCS,
vol. 1892, p. 319. Springer, Heidelberg (2000)
6. Goodale, P., Clough, P.: Report on Evaluation of the Search25 Demo System. University of
Sheffield, Sheffield (2012)
7. Han, H., Jeong, W., Wolfram, D.: Log analysis of academic digital library: User query
patterns. In: 10th iConference, pp. 1002–1008. Ideals, Illinois (2014)
8. Jansen, B.: Search log analysis: What it is, what’s been done, how to do it. Libr. Inform. Sci.
Res. 28(3), 407–432 (2006)
9. Jansen, B.J., Spink, A., Kathuria, V.: How to define searching sessions on web search engines.
In: Nasraoui, O., Spiliopoulou, M., Srivastava, J., Mobasher, B., Masand, B. (eds.) WebKDD
2006. LNCS (LNAI), vol. 4811, pp. 92–109. Springer, Heidelberg (2007)
10. Jones, R., Klinkner, K.L.: Beyond the session timeout: automatic hierarchical segmentation
of search topics in query logs. In: 17th ACM conference on Information and knowledge
management, pp. 699–708. ACM, New York (2008)
11. Jones, S., Cunningham, S.J., McNab, R., Boddie, S.: A transaction log analysis of a digital
library. Int. J. Digit. Libr. 3(2), 152–169 (2000)
12. Kapusta, J., Munk, M., Drlík, M.: Cut-off time calculation for user session identification by
reference length. In: 6th International Conference on Application of Information and
Communication Technologies (AICT), pp. 1–6. IEEE, New York (2012)
13. Lown, C.: A transaction log analysis of NCSU’s faceted navigation OPAC. School of
Information and Library Science, University of North Carolina (2008)
14. Malliari, A., Kyriaki-Manessi, D.: Users’ behaviour patterns in academic libraries’ OPACs:
a multivariate statistical analysis. New Libr. World 108(3/4), 107–122 (2007)
15. Meadow, K., Meadow, J.: Search query quality and web-scale discovery: A qualitative and
quantitative analysis. College Undergraduate Librar. 19(2–4), 163–175 (2012)
16. Montgomery, A., Faloutsos, C.: Identifying web browsing trends and patterns. IEEE Comput.
34(7), 94–95 (2007)
17. Nicholas, D., Huntington, P., Jamali, H.R.: User diversity: as demonstrated by deep log
analysis. Electron. Libr. 26(1), 21–38 (2008)
18. Niu, X., Zhang, T., Chen, H.L.: Study of user search activities with two discovery tools at an
academic library. Int. J. Hum. Comput. Interact. 30(5), 422–433 (2014)
19. Peters, T.A.: The history and development of transaction log analysis. Libr. Hi Tech. 11(2),
41–66 (1993)
20. Spink, A., Park, M., Jansen, B.J., Pedersen, J.: Multitasking during web search sessions.
Inform. Process. Manag. 42(1), 264–275 (2006)
21. Yardley, L.: Demonstrating validity in qualitative psychology. In: Smith, J.A. (ed.)
Qualitative psychology: A practical guide to research methods, pp. 235–251. SAGE, London,
UK (2008)
22. Ye, C., Wilson, M.L.: A user defined taxonomy of factors that divide online information
retrieval sessions. In: 5th Information Interaction in Context Symposium, pp. 48–57. ACM,
New York (2014)
A Comparison of Deep Learning Based Query
Expansion with Pseudo-Relevance Feedback
and Mutual Information
1 Introduction
User queries are usually too short to describe the information need accurately.
Important terms can be missing from the query, leading to a poor coverage
of the relevant documents. To address this problem, automatic query expansion techniques leverage several data sources and employ different methods for finding expansion terms [2]. Selecting such expansion terms is challenging and
requires a framework capable of adding interesting terms to the query.
Different approaches have been proposed for selecting expansion terms.
Pseudo-relevance feedback (PRF) assumes that the top-ranked documents
returned for the initial query are relevant, and uses a subset of the terms
extracted from those documents for expansion. PRF has been proven to be
effective in improving retrieval performance [4].
Corpus-specific approaches analyze the content of the whole document
collection, and then compute a correlation between each pair of terms using co-
occurrence [6], mutual information [3], etc. Mutual information (MI) is a good
measure to assess how much two terms are related, by analyzing the entire col-
lection in order to extract the association between terms. For each query term,
every term that has a high mutual information score with it is used to expand
the user query.
SIM(t_1, t_2) = cos(v_{t_1}, v_{t_2})                                   (1)
1 A real-valued vector of a predefined dimension, 600 dimensions for example.
where cos(v_{t_1}, v_{t_2}) ∈ [0, 1] is the normalized cosine between the two term vectors v_{t_1} and v_{t_2}. Based on this normalized cosine similarity between terms, we now define the function that returns the k most similar terms to a term t, top_k(t):

top_k : V → 2^V                                                          (2)

Let q be a user query represented by a bag of terms, q = [t_1, t_2, ..., t_{|q|}]. Each term in the query has a frequency #(t, q). In order to expand a query q, we follow these steps:
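The enumerated steps themselves are not reproduced in this excerpt; the following minimal sketch is consistent with the surrounding description but is not the authors' procedure. It assumes a gensim-style KeyedVectors model providing most_similar, and an expansion weight alpha as in Eq. (3); all names are illustrative.

from collections import Counter

def expand_query(query_terms, vectors, k=5, alpha=0.3):
    """Expand each query term with its k most cosine-similar vocabulary terms."""
    expanded = Counter(query_terms)                        # original term frequencies #(t, q)
    for t in query_terms:
        if t not in vectors:
            continue
        for term, sim in vectors.most_similar(t, topn=k):  # top_k(t)
            expanded[term] += alpha * sim                  # weighted contribution of the expansion term
    return expanded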
3 Experiments
The first goal of our experiments is to analyze the effect of the number of expan-
sion terms k on the retrieval performance using deep learning vectors. The second
goal is to compare the proposed expansion based on deep learning vec-
tors (VEXP) with two existing expansion approaches: pseudo-relevance feedback
(PRF) [4], and mutual information (MI) [3], which both have been proven to be
effective in improving retrieval performance. In order to achieve the comparison
between VEXP, PRF, and MI, we use a language model with no expansion as a
baseline (NEXP).
Documents are retrieved using the Indri search engine [9] with two language model smoothing methods: Jelinek-Mercer and Dirichlet.
The optimization of the free parameter α (Eq. 3), which controls the importance of the expansion terms, is done using 4-fold cross-validation with Mean Average Precision (MAP) as the target metric. We vary α over [0.1, 1] in steps of 0.1. The best values of α lie between 0.2 and 0.4.
In our experiments, the statistical significance is determined using Fisher’s
randomization test with p < 0.05 [8].
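A minimal sketch of such a randomization test over paired per-query scores (illustrative code, not the authors' implementation; see [8] for the exact procedure).

import random

def randomization_test(scores_a, scores_b, trials=100000, seed=0):
    """Two-sided randomization test on paired per-query metric scores (e.g., average precision)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(trials):
        # Randomly swap the two systems' labels for each query, i.e. flip the sign of each difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= observed:
            extreme += 1
    return extreme / trials  # estimated p-value; significant if below 0.05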
Table 2. VEXP performance using MAP on test collections. k is the number of expan-
sion terms for each query term.
k Jelinek-Mercer Dirichlet
Image2010 Image2011 Image2012 Case2011 Image2010 Image2011 Image2012 Case2011
1 0.3286 0.2258 0.1997 0.1373 0.3397 0.2173 0.1947 0.1288
2 0.3298 0.2325 0.1988 0.1431 0.3361 0.2204 0.1890 0.1345
3 0.3395 0.2330 0.1996 0.1440 0.3411 0.2192 0.1902 0.1366
4 0.3399 0.2338 0.2002 0.1413 0.3561 0.2175 0.1909 0.1384
5 0.3323 0.2340 0.1909 0.1634 0.3519 0.2187 0.1787 0.1410
6 0.3402 0.2324 0.1909 0.1432 0.3603 0.2163 0.1798 0.1451
7 0.3397 0.2333 0.1881 0.1446 0.3599 0.2184 0.1778 0.1431
8 0.3397 0.2353 0.1895 0.1414 0.3584 0.2200 0.1813 0.1416
9 0.3365 0.2230 0.2004 0.1387 0.3544 0.2221 0.1953 0.1379
10 0.3362 0.2233 0.2036 0.1343 0.3510 0.2215 0.1990 0.1357
              Jelinek-Mercer                               Dirichlet
              Image2010  Image2011  Image2012  Case2011    Image2010  Image2011  Image2012  Case2011
PRF   k           15         10         20        10           15         10         10        10
      #fbdocs     10         10         20        10           10         10         10        10
MI    k           10          8          6        10           10          7          6        10
VEXP  k            6          4         10         4            5          9         10         5
In this section, we compare three expansion methods: VEXP, PRF, and MI, using
a language model with no expansion as a baseline (NEXP). We use two tests
for statistical significance: † indicates a statistically significant improvement over NEXP, and ∗ indicates a statistically significant improvement over PRF. Results are given in Table 4. We first observe that VEXP is always statistically better than NEXP on the four test collections, which is not the case for PRF and MI.
VEXP shows a statistically significant improvement over PRF in five cases.
Deep learning vectors are a promising source for query expansion because
they are learned from hundreds of millions of words, in contrast to PRF, which is obtained from the top retrieved documents, and MI, which is calculated on the
collection itself. Deep learning vectors are not only useful for collections that
were used in the training phase, but also for other collections which contain
similar documents. In our case, all collections deal with medical cases.
There are two architectures of neural networks for obtaining deep learning
vectors: skip-gram and bag-of-words [5]. We only present the results obtained
using the skip-gram architecture in our experiments. We have also evaluated the
bag-of-words architecture, but found no substantial difference in retrieval performance between the two architectures.
Jelinek-Mercer Dirichlet
Image2010 Image2011 Image2012 Case2011 Image2010 Image2011 Image2012 Case2011
NEXP 0.3016 0.2113 0.1862 0.1128 0.3171 0.2033 0.1681 0.1134
PRF 0.3090 0.2136 0.1920 0.1256 0.3219 0.2126 0.1766 0.1267
MI 0.3239 0.2116 0.1974 0.1360 0.3338 0.2110 0.1775 0.1327
VEXP 0.3402†* 0.2340† 0.2036† 0.1634†* 0.3603†* 0.2221† 0.1990†* 0.1451†*
4 Conclusions
We explored the use of the relationships extracted from deep learning vectors for
query expansion. We showed that deep learning vectors are a promising source for
query expansion by comparing it with two effective methods for query expansion:
pseudo-relevance feedback and mutual information. Our experiments on four
CLEF collections showed that using this expansion source gives a statistically
significant improvement over baseline language models with no expansion and
pseudo-relevance feedback. In addition, it is better than the expansion method
using mutual information.
References
1. Bengio, Y., Schwenk, H., Sencal, J.-S., Morin, F., Gauvain, J.-L.: Neural proba-
bilistic language models. In: Holmes, D.E., Jain, L.C. (eds.) Innovations in Machine
Learning. Studies in Fuzziness and Soft Computing, vol. 194, pp. 137–186. Springer,
Heidelberg (2006)
2. Carpineto, C., Romano, G.: A survey of automatic query expansion in information
retrieval. ACM Comput. Surv. 44(1), 1:1–1:50 (2012)
3. Jiani, H., Deng, W., Guo, J.: Improving retrieval performance by global analysis.
In: ICPR 2006, pp. 703–706 (2006)
4. Lavrenko, V., Croft, W.B.: Relevance based language models. In: SIGIR 2001, pp.
120–127. ACM, New York (2001)
5. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed represen-
tations of words and phrases and their compositionality. CoRR (2013)
6. Peat, H.J., Willett, P.: The limitations of term co-occurrence data for query expan-
sion in document retrieval systems. J. Am. Soc. Inf. Sci. 42(5), 378–383 (1991)
7. Serizawa, M., Kobayashi, I.: A study on query expansion based on topic distribu-
tions of retrieved documents. In: Gelbukh, A. (ed.) CICLing 2013, Part II. LNCS,
vol. 7817, pp. 369–379. Springer, Heidelberg (2013)
8. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance
tests for information retrieval evaluation. In: CIKM 2007. ACM (2007)
9. Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: A language model-based
search engine for complex queries. In: Proceedings of the International Conference
on Intelligence Analysis (2004)
10. Widdows, D., Cohen, T.: The semantic vectors package: New algorithms and public
tools for distributional semantics. In: ICSC, pp. 9–15 (2010)
11. Yang, X., Jones, G.J.F., Wang, B.: Query dependent pseudo-relevance feedback
based on wikipedia. In: SIGIR 2009, Boston, MA, USA, pp. 59–66 (2009)
12. Zhang, J., Deng, B., Li, X.: Concept based query expansion using wordnet. In:
AST 2009, pp. 52–55. IEEE Computer Society (2009)
13. Zhu, W., Xuheng, X., Xiaohua, H., Song, I.-Y., Allen, R.B.: Using UMLS-based
re-weighting terms as a query expansion strategy. In: 2006 IEEE International
Conference on Granular Computing, pp. 217–222, May 2006
A Full-Text Learning to Rank Dataset
for Medical Information Retrieval
1 Introduction
Health-related content is available in information archives as diverse as the gen-
eral web, scientific publication archives, or patient records of hospitals. A similar
diversity can be found among users of medical information, ranging from mem-
bers of the general public searching the web for information about illnesses,
researchers exploring the PubMed database1 , or patent professionals querying
patent databases for prior art in the medical domain2 . The diversity of informa-
tion needs, the variety of medical knowledge, and the varying language skills of
users [4] result in a lexical gap between user queries and medical information
that complicates information retrieval in the medical domain.
In this paper, we present a dataset that bridges this lexical gap by exploiting links between queries written in layman’s English and scientific articles, as provided on the www.NutritionFacts.org (NF) website. NF is a non-commercial,
public service provided by Dr. Michael Greger and collaborators who review
state-of-the-art nutrition research papers and provide transcribed videos, blog
articles and Q&A about nutrition and health for the general public. NF content
is linked to scientific papers that are mainly hosted on the PubMed database.
By extracting relevance links at three levels from direct and indirect links of
queries to research articles, we obtain a database that can be used to directly
learn ranking models for medical information retrieval. To our knowledge this is
1 www.ncbi.nlm.nih.gov/pubmed.
2 For example, the USPTO and EPO provide specialized patent search facilities at www.uspto.gov/patents/process/search and www.epo.org/searching.html.
the first dataset that provides full texts for thousands of relevance-linked queries
and documents in the medical domain. In order to showcase the potential use of
our dataset, we present experiments on training ranking models, and find that
they significantly outperform standard bag-of-words retrieval models.
2 Related Work
The NF website contains three different content sources – videos, blogs, and
Q&A posts, all written in layman’s English, which we used to extract queries
of different length and language style. Both the internal linking structure and
the citations of scientific papers establish graded relevance relations between pieces
of NF content and scientific papers. Additionally, the internal NF topic taxon-
omy, used to categorize similar NF content that is not necessarily interlinked, is
exploited to define the weakest relevance grade.
3 www.research.microsoft.com/en-us/um/beijing/projects/letor, www.research.microsoft.com/en-us/projects/mslr, www.webscope.sandbox.yahoo.com.
4 www.clefehealth2014.dcu.ie/task-3.
5 www.cl.uni-heidelberg.de/statnlpgroup/boostclir/wikiclir.
Since PubMed pages could further link to full-texts on PMC and since
extracting abstracts from these two types of pages was the least error-prone,
we included the titles and abstracts of only these two page types on the document side of the corpus.
Data. We focused on 5 types of queries that differ by length and well-formedness
of the language. In particular we tested full queries (i.e., all fields of NF pages concatenated: titles, descriptions, topics, transcripts, and comments), all titles of
NF content pages, titles of non-topic pages (i.e., titles of all NF pages except topic
pages), video titles (titles of video pages) and video descriptions (description
from videos pages). The latter three types of queries often resemble queries an
average user would type (e.g., “How to Treat Kidney Stones with Diet” or “Meat
Hormones and Female Infertility”), unlike all titles that include headers of topics
pages that often consist of just one word.
The relevance links between queries and documents were randomly assigned: 80 % to the training set and 10 % each to the dev and test sub-
sets. Retrieval was performed over the full set of abstracts (3,633 in total,
mean/median number of tokens was 147.1/76.0). Note that this makes the test
PubMed abstracts (but not the queries) available during training. The same
methodology was used in [1] who found that it only marginally affected evalua-
tion results compared to the setting without overlaps. Basic statistics about the
different query types are summarized in Table 1.
Extracting Relevance Links. We defined a special relation between queries and
documents that did not exist in the explicit NF link structure. A directly linked document of query q′ is considered marginally relevant for query q if the containment |t(q) ∩ t(q′)|/|t(q)| between the sets of topics with which the queries
4 Experiments
Systems. Our two baseline retrieval systems use the classical ranking scores: tfidf
and Okapi BM25 6 . In addition, we evaluated two learning to rank approaches
that are based on a matrix of query words times document words as feature
representation, and optimize a pairwise ranking objective [1,7]: Let q ∈ {0, 1}^Q be a query and d ∈ {0, 1}^D be a document, where the n-th vector dimension indicates the simple occurrence of the n-th word, for dictionaries of size Q and D. Both approaches learn a score function

f(q, d) = q^T W d = \sum_{i=1}^{Q} \sum_{j=1}^{D} q_i W_{ij} d_j,

where W ∈ R^{Q×D} encodes a matrix of word associations. Optimal values of W are found by pairwise ranking: given supervision data in the form of a set R of tuples (q, d^+, d^-), where d^+ is a relevant (or higher ranked) document and d^- an irrelevant (or lower ranked) document for query q, the goal is to find W such that the inequality f(q, d^+) > f(q, d^-) is violated for the fewest number of tuples
6 BM25 parameters were set to k1 = 1.2, b = 0.75.
from R. Thus, the goal is to learn weights for all domain-specific associations
of query terms and document terms that are useful to discern relevant from
irrelevant documents by optimizing the ranking objectives defined below.
The first method [7] applies the RankBoost algorithm [2], where f(q, d) is a weighted linear combination of T functions h_t such that f(q, d) = \sum_{t=1}^{T} w_t h_t(q, d). Here h_t is an indicator that selects a pair of query and document words. Given differences of query-document relevance ranks m(q, d^+, d^-) = r_{q,d^+} - r_{q,d^-}, RankBoost achieves correct ranking of R by optimizing the exponential loss

L_{exp} = \sum_{(q, d^+, d^-) ∈ R} m(q, d^+, d^-) e^{f(q, d^-) - f(q, d^+)}.
The algorithm combines batch boosting with bagging over 10 independently drawn bootstrap data samples from R, each consisting of 100k instances. In every
step, the single word pair feature ht is selected that provides the largest decrease
of Lexp . The resulting models are averaged as a final scoring function. To reduce
memory requirements we used random feature hashing with a hash size of 30 bits [5]. For regularization we rely on early stopping (T = 5000). An
additional fixed-weight identity feature is introduced that indicates the identity
of terms in query and document; its weight was tuned on the dev set.
The second method uses stochastic gradient descent (SGD) as implemented in the Vowpal Wabbit (VW) toolkit [3] to optimize the l_1-regularized hinge loss:

L_{hng} = \sum_{(q, d^+, d^-) ∈ R} ( f(q, d^-) - f(q, d^+) )_+ + λ ||W||_1,

where (x)_+ = max(0, x).
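A minimal sketch of pairwise ranking of a word-association matrix with SGD on the hinge loss above (illustrative numpy code under simplifying assumptions, not the Vowpal Wabbit setup used in the paper; the l1 step uses a plain subgradient).

import numpy as np

def sgd_pairwise_rank(tuples, Q, D, epochs=5, lr=0.1, lam=1e-4):
    """tuples: list of (q, d_pos, d_neg) binary indicator vectors of sizes Q, D, D."""
    W = np.zeros((Q, D))                                # word-association matrix
    for _ in range(epochs):
        for q, d_pos, d_neg in tuples:
            margin = q @ W @ d_pos - q @ W @ d_neg      # f(q, d+) - f(q, d-)
            if margin <= 0:                             # hinge active: ranking violated
                W += lr * np.outer(q, d_pos - d_neg)    # gradient step on the hinge term
            W -= lr * lam * np.sign(W)                  # subgradient of the l1 regularizer
    return W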
Table 2. MAP/NDCG results evaluated for different types of queries. Best NDCG
results of learning-to-rank versus bag-of-words models are highlighted in bold face.
Experimental Results. Results according to the MAP and NDCG metrics on pre-
processed data7 are reported in Table 2. Result differences between the best
performing learning-to-rank versus bag-of-words models were found to be statis-
tically significant [6]. As results show, learning-to-rank approaches outperform
classical retrieval methods by a large margin, proving that the provided corpus
is sufficient to optimize domain-specific word associations for a direct ranking
objective. As shown in row 1 of Table 2, the SGD approach outperforms Rank-
Boost in the evaluation on all fields queries, but performs worse with shorter
(and fewer) queries as in the setups listed in rows 2–5. This is due to a special
“pass-through” feature implemented in RankBoost that assigns a default feature
to word identities, thus allowing it to learn better from sparser data. The SGD implementation does not take advantage of such a feature, but it makes better use of the full matrix of word associations, which offsets the missing pass-through feature if enough word combinations are observable in the data.
5 Conclusion
We presented a dataset for learning to rank in the medical domain that has
the following key features: (1) full-text queries of various lengths, thus enabling
the development of complete learning models; (2) relevance links at 3 levels for
thousands of queries in layman’s English to documents consisting of abstracts of research articles; (3) public availability of the dataset (with links to full documents
for research articles). We showed in an experimental evaluation that the size of
the dataset is sufficient to learn ranking models based on sparse word association
matrices that outperform standard bag-of-words retrieval models.
References
1. Bai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y., Chapelle, O.,
Weinberger, K.: Learning to rank with (a lot of) word features. Inf. Retr. J. 13(3),
291–314 (2010)
2. Collins, M., Koo, T.: Discriminative reranking for natural language parsing. Com-
put. Linguist. 31(1), 25–69 (2005)
3. Goel, S., Langford, J., Strehl, A.L.: Predictive indexing for fast search. In: NIPS,
Vancouver, Canada (2008)
4. Goeuriot, L., Kelly, L., Jones, G.J.F., Müller, H., Zobel, J.: Report on the SIGIR
2014 workshop on medical information retrieval (MedIR). SIGIR Forum 48(2), 78–
82 (2014)
7 Preprocessing included lowercasing, tokenizing, filtering punctuation and stop-words, and replacing numbers with a special token.
5. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A.J., Strehl, A.L.,
Vishwanathan, V.: Hash Kernels. In: AISTATS, Irvine, CA (2009)
6. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance
tests for information retrieval evaluation. In: CIKM, Lisbon, Portugal (2007)
7. Sokolov, A., Jehl, L., Hieber, F., Riezler, S.: Boosting cross-language retrieval by
learning bilingual phrase associations from relevance rankings. In: EMNLP, Seattle
(2013)
Multi-label, Multi-class Classification
Using Polylingual Embeddings
Algorithm 1. The process of generating PE representations
Require: {G_l(d_i)}, l = 1, 2, and a trained AE
1: for each document d_i do
2:   Concatenate G_1(d_i) and G_2(d_i)
3:   Get the PE representation of d_i as the hidden encoding of the AE fed with the concatenation
4: end for

[Fig. 1 shows an autoencoder with an English/French input layer, a hidden layer, and an English/French output layer, connected by weights W and W^T.]
Fig. 1. An AE that generates the PE in its hidden layer. The dashed boxes denote the document DRs in the corresponding language.
1 http://statmt.org/.
2 https://code.google.com/p/word2vec/.
Our Approach. Using the DR model presented above, we first generated the document embeddings in English and French in a d-dimensional space with d ∈ {50, 100, 200, 300}. Then, for the AE we considered the hyperbolic tangent and the sigmoid function as activation functions. The sigmoid performed consistently better and thus we use it in the reported results. The AE was trained with tied weights using a stochastic back-propagation algorithm with mini-batches of size 10 and the Euclidean distance between input and output as the loss function. The number of neurons in the hidden layer was set to 70 % of the size of the input.3
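A minimal sketch of such a tied-weight autoencoder (plain numpy, sigmoid activation, squared reconstruction error, hidden size 70 % of the input, mini-batches of 10, as described above; this is an illustration, not the authors' implementation).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_tied_autoencoder(X, hidden_ratio=0.7, lr=0.01, epochs=10, batch=10):
    """X: one row per document, the concatenation of its English and French embeddings."""
    n, d = X.shape
    h = int(hidden_ratio * d)
    W = np.random.default_rng(0).normal(scale=0.01, size=(d, h))  # tied weights; decoder uses W.T
    for _ in range(epochs):
        for i in range(0, n, batch):
            x = X[i:i + batch]
            z = sigmoid(x @ W)                       # hidden encoding = the PE representation
            x_hat = sigmoid(z @ W.T)                 # reconstruction through the tied decoder
            grad_out = (x_hat - x) * x_hat * (1 - x_hat)
            grad_hid = (grad_out @ W) * z * (1 - z)
            W -= lr * (x.T @ grad_hid + (z.T @ grad_out).T)  # gradient from both uses of W
    return W

def pe_representation(x_concat, W):
    """Polylingual embedding of a document: the hidden encoding of the trained AE."""
    return sigmoid(x_concat @ W)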
4 Experimental Results
Table 2 presents the scores of the F1 measure when 10 % of the 12,670 documents were used for training purposes and the remaining 90 % for testing. We report
the classification performance with the four different DR models (cbow, skip-
gram, DBOWpv and DMMpv) and 2 learning algorithms (k-NN and SVMs)
for different input sizes. The columns labeled k-NNDR and SVMDR present the
(baseline) performance of SVM and k-NN trained on the monolingual DRs. Also
the last line of the table indicates the F1 score of SVM with tf-idf representation
(SVMBoW ). The best obtained result is shown in bold.
We first notice that the average pooling strategy (cbow and skip-gram)
performs better compared to when the document vectors are directly learned
(DBOWpv and DMMpv). In particular, cbow seems to be the best performing
representation, both as a baseline model and when used as base model to gener-
ate the PE representations. On the other hand, DBOWpv and DMMpv perform
significantly worse: in the baseline setting the best cbow performance achieved is
44.25 whereas the best DMMpv configuration achieves 30.49, 14 F1 points less.
3 The code is available at http://ama.liglab.fr/∼balikas/ecir2015.zip.
[Two panels (cbow, left; skip-gram, right): F1 measure (y-axis) versus the proportion of the training set, 0.1–0.9 (x-axis).]
The PE representations learned on top of the four base models improve sig-
nificantly over the performance of the monolingual DRs, especially for k-NN. For
instance, for cbow with base-model vector dimension 200, the baseline represen-
tation achieves 40.42 F1 and its corresponding PE representation obtains 46.33,
improving almost 6 points. In general, we notice such improvements between
the base DR and its respective PE, especially when the dimension of the DR
representation increases. Note that the PE improvements are independent of the
methods used to generate the DRs: for instance k-NNPE over the 200-dimensional
PE DMMpv representations gains more than 11 F1 points compared to k-NNDR .
It is also to be noted that the baseline SVMBoW is outperformed by SVMPE
especially when cbow and skip-gram DRs are used.
Comparing the two learning methods (k-NNPE and SVMPE ), we notice that
k-NNPE performs best. This can be explained by the fact that distributed representations are supposed to capture the semantics in the low-dimensional space. At the same time, the nearest-neighbour algorithm compares exactly this semantic distance between data instances, whereas SVMs try to draw separating hyperplanes
among them. Finally, it is known that SVMs benefit from high-dimensional vec-
tors such as bag-of-words representations. Notably, in our experiments increasing
the dimension of the representations consistently benefits SVMs.
We now examine the performance of the PE representations taking into
account the amount of labeled training data. Figure 2 illustrates the perfor-
mance of the SVMBoW and SVMPE and k-NNPE with PE representations when
the fraction of the available training data varies from 10 % of the initial training set to 90 %, in the case where cbow and skip-gram are used as DR repre-
sentations with an input size of 300. Note that if only a few training documents
are available, the learning approach benefits strongly from the rich PE representations, which consistently outperform the traditional SVMBoW setting. For
instance, in the experiments with 300 dimensional PE representations with cbow
DRs, when only 20 % of the data are labeled, the SVMBoW needs 20 % more data
to achieve similar performance, a pattern that is observed in most of the runs in
the figure. When, however, more training data are available, the tf-idf representation copes with the complexity of the problem and leverages this wealth of information more
efficiently than PE does.
5 Conclusion
We proposed the PE, which is a text embedding learned using neural networks by
leveraging translations of the input text. We empirically showed the effectiveness
of the bilingual embedding for classification especially in the interesting case
where few labeled training data are available for learning.
Acknowledgements. We would like to thank the anonymous reviewers for their valu-
able comments. This work is partially supported by the CIFRE N 28/2015 and by the
LabEx PERSYVAL Lab ANR-11-LABX-0025.
Learning Word Embeddings from Wikipedia
for Content-Based Recommender Systems
1 Introduction
Word Embedding techniques recently gained more and more attention due to the
good performance they showed in a broad range of natural language processing-
related scenarios, ranging from sentiment analysis [10] and machine translation
[2] to more challenging ones, such as learning a textual description of a given image1.
However, even though recent research has given new impetus to such approaches, Word Embedding techniques have their roots in the area of Distributional Semantics Models (DSMs), which date back to the late 1960s [3]. Such models
are mainly based on the so-called distributional hypothesis, which states that the
meaning of a word depends on its usage and on the contexts in which it occurs.
In other terms, according to DSMs, it is possible to infer the meaning of a term
(e.g., leash) by analyzing the other terms it co-occurs with (dog, animal, etc.).
In the same way, the correlation between different terms (e.g., leash and muzzle)
can be inferred by analyzing the similarity between the contexts in which they
are used. Word Embedding techniques have inherited the vision carried out by
DSMs, since they aim to learn in a totally unsupervised way a low-dimensional
1 http://googleresearch.blogspot.it/2014/11/a-picture-is-worth-thousand-coherent.html
2 Methodology
2.1 Overview of the Techniques
Latent Semantic Indexing (LSI) [1] is a word embedding technique which applies
Singular Value Decomposition (SVD) over a word-document matrix. The goal
of the approach is to compress the original information space through SVD
in order to obtain a smaller-scale word-concepts matrix, in which each column
models a latent concept occurring in the original vector space. Specifically, SVD
is employed to unveil the latent relationships between terms according to their
usage in the corpus.
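A minimal sketch of this idea (assuming scikit-learn; the toy corpus and dimensionality are illustrative).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the dog walks on a leash", "the cat wears no leash", "apple releases a new macbook"]
X = CountVectorizer().fit_transform(docs)     # document-term matrix

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)
term_vectors = svd.components_.T              # rows: latent (word-concept) representations of the terms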
Next, Random Indexing (RI) [8] is an incremental technique to learn a low-
dimensional word representation relying on the principles of the Random Pro-
jection. It works in two steps: first, a context vector is defined for each context
(the definition of context is typically scenario-dependent: it may be a paragraph, a sentence, or the whole document). Each context vector is ternary (it contains values in {−1, 0, 1}), very sparse, and its values are randomly distributed. Given
such context vectors, the vector space representation of each word is obtained
by just summing over all the representations of the contexts in which the word
occurs. An important peculiarity of this approach is that it is incremental and
scalable: if any new documents come into play, the vector space representation
of the terms is updated by just adding the new occurrences of the terms in the
new documents.
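A minimal sketch of this incremental scheme (here the context is the whole document; the dimensionality and sparsity values are illustrative, not those used in the paper).

import numpy as np
from collections import defaultdict

DIM, NONZERO = 300, 10                               # illustrative index-vector settings
rng = np.random.default_rng(0)

def context_vector():
    """Sparse ternary random vector with values in {-1, 0, 1}."""
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=NONZERO, replace=False)
    v[idx] = rng.choice([-1, 1], size=NONZERO)
    return v

context_vectors = {}                                 # one random vector per context (per document here)
term_vectors = defaultdict(lambda: np.zeros(DIM))

def update(documents):
    """Incremental update: add each context vector to every term occurring in that context."""
    for doc_id, tokens in documents:
        cv = context_vectors.setdefault(doc_id, context_vector())
        for token in tokens:
            term_vectors[token] += cv

update([("d1", ["dog", "leash", "animal"]), ("d2", ["cat", "leash"])])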
Finally, Word2Vec (W2V) is a recent technique proposed by Mikolov et al.
[5]. The approach learns a vector-space representation of the terms by exploiting
a two-layers neural network. In the first step, weights in the network are ran-
domly distributed as in RI. Next, the network is trained by using the Skip-gram
methodology in order to model fine-grained regularities in word usage. At each
step, weights are updated through Stochastic Gradient Descent and a vector-
space representation of each term is obtained by extracting the weights of the
network at the end of the training.
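A minimal sketch of learning such skip-gram vectors (assuming the gensim library with its version 4 API; the corpus and hyper-parameters are illustrative).

from gensim.models import Word2Vec

# Tokenized item texts (a toy corpus; in the paper, the processed Wikipedia pages of the items).
sentences = [["the", "dog", "walks", "on", "a", "leash"],
             ["apple", "releases", "a", "new", "macbook"]]

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1, epochs=10)
vector = model.wv["leash"]          # learned 300-dimensional term vector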
3 Experimental Evaluation
Experiments were performed by exploiting two state-of-the-art datasets as
MovieLens2 and DBbook3 . The first one is a dataset for movie recommendations,
2 http://grouplens.org/datasets/movielens/.
3 http://challenges.2014.eswc-conferences.org/index.php/RecSys.
while the latter comes from the ESWC 2014 Linked-Open Data-enabled Recom-
mender Systems challenge and focuses on book recommendation. Some statistics
about the datasets are provided in Table 1.
A quick analysis of the data immediately shows the very different nature of the datasets: even though both of them are very sparse, MovieLens is denser than DBbook (93.69 % vs. 99.83 % sparsity); indeed, each MovieLens user rated 84.83 items on average (against the 11.70 ratings given by DBbook users). DBbook has in turn the peculiarity of being unbalanced towards negative ratings (only 45 % of positive preferences). Furthermore, MovieLens items were rated more often than DBbook ones (48.48 vs. 10.74 ratings per item, on average).
Experimental Protocol. Experiments were performed by adopting different
protocols: as regards MovieLens, we carried out a 5-fold cross-validation, while
a single training/test split was used for DBbook. In both cases we used the splits
which are commonly used in literature. Given that MovieLens preferences are
expressed on a 5-point discrete scale, we decided to consider as positive rat-
ings only those equal to 4 and 5. On the other side, the DBbook dataset is
already available as binarized, thus no further processing was needed. Textual
content was obtained by mapping items to Wikipedia pages. All the available
items were successfully mapped by querying the title of the movie or the name
of the book, respectively. The extracted content was further processed through
a NLP pipeline consisting of a stop-words removal step, a POS-tagging step and
a lemmatization step. The outcome of this process was used to learn the Word
Embeddings. For each word embedding technique we compared two different
sizes of learned vectors: 300 and 500. As regards the baselines, we exploited
MyMediaLite library4 . We evaluated User-to-User (U2U-KNN) and Item-to-
Item Collaborative Filtering (I2I-KNN) as well as the Bayesian Personalized
Ranking Matrix Factorization (BPRMF). U2U and I2I neighborhood size was
set to 80, while BPRMF was run by setting the factor parameter equal to 100.
In both cases we chose the optimal values for the parameters.
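A minimal sketch of such a preprocessing pipeline (assuming spaCy and its small English model; this is an illustration, not the pipeline actually used).

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model has been downloaded beforehand

def preprocess(text):
    """Stop-word removal, POS tagging, and lemmatization; returns (lemma, POS) pairs."""
    doc = nlp(text)
    return [(tok.lemma_.lower(), tok.pos_) for tok in doc
            if not tok.is_stop and not tok.is_punct]

tokens = preprocess("The Shawshank Redemption is a 1994 American drama film.")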
MovieLens DBbook
Users 943 6,181
Items 1,682 6,733
Ratings 100,000 72,372
Sparsity 93.69 % 99.83 %
Positive Ratings 55.17 % 45.85 %
Avg. ratings/user ± stdev 84.83±83.80 11.70±5.85
Avg. ratings/item ± stdev 48.48±65.03 10.74±27.14
4 http://www.mymedialite.net/.
Table 2. Results of the experiments. The best word embedding approach is highlighted
in bold. The best overall configuration is highlighted in bold and underlined. The
baselines which are outperformed by at least one word embedding approach are reported in italics.
Discussion of the Results. The first six columns of Table 2 provide the results
of the comparison among the word embedding techniques. As regards MovieLens,
W2V emerged as the best-performing configuration for all the metrics taken into
account. The gap is significant when compared to both RI and LSI. Moreover,
results show that the size of the vectors did not significantly affect the overall
accuracy of the algorithms (with the exception of LSI). This is an interesting
outcome since with an even smaller word representation, word embeddings can
obtain good results. However, the outcomes emerging from this first experiments
are controversial, since DBbook data provided opposite results: in this dataset
W2V is the best-performing configuration only for F1@5. On the other side, LSI,
which performed the worst on MovieLens data, overcomes both W2V and RI on
F1@10 and F1@15. At a first glance, these results indicate non-generalizable
outcomes. However, it is likely that such behavior depends on specific pecu-
liarities of the datasets, which in turn influence the way the approaches learn
their vector-space representations. A more thorough analysis is needed to obtain
general guidelines which drive the behavior of such approaches.
Next, we compared our techniques to the above described baselines. Results
clearly show that the effectiveness of word embedding approaches is directly
dependent on the sparsity of the data. This is an expected behavior since content-
based approaches can better deal with cold-start situations. In highly sparse
dataset such as DBbook (99.13 % against 93.59 % of MovieLens), content-based
approaches based on word embedding tend to overcome the baselines. Indeed, RI
and LSI, overcome I2I and U2U on F1@10 and F1@15 and W2V overcomes I2I
on F1@5 and I2I and U2U on F1@15. Furthermore, it is worth to note that on
F1@10 and F@15 word embeddings can obtain results which are comparable (or
even better on F1@15) to those obtained by BPRMF. This is a very important
outcome, which definitely confirms the effectiveness of such techniques, even
compared to matrix factorization techniques. Conversely, on less sparse datasets
as MovieLens, collaborative filtering algorithms overcome their content-based
counterpart.
References
1. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, R.: Indexing by
latent semantic analysis. JASIS 41, 391–407 (1990)
2. Gouws, S., Bengio, Y., Corrado, G. Bilbowa: Fast bilingual distributed represen-
tations without word alignments (2014). arXiv:1410.2455
3. Harris, Z.S.: Mathematical Structures of Language. Interscience, New York (1968)
4. McCarey, F., Cinnéide, M.Ó., Kushmerick, N.: Recommending library methods: an
evaluation of the vector space model (VSM) and latent semantic indexing (LSI). In:
Morisio, M. (ed.) ICSR 2006. LNCS, vol. 4039, pp. 217–230. Springer, Heidelberg
(2006)
5. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed represen-
tations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119
(2013)
6. Musto, C., Semeraro, G., Lops, P., de Gemmis, M.: Random indexing and negative
user preferences for enhancing content-based recommender systems. In: Huemer,
C., Setzer, T. (eds.) EC-Web 2011. LNBIP, vol. 85, pp. 270–281. Springer, Heidel-
berg (2011)
7. Musto, C., Semeraro, G., Lops, P., de Gemmis, M.: Combining distributional
semantics and entity linking for context-aware content-based recommendation. In:
Dimitrova, V., Kuflik, T., Chin, D., Ricci, F., Dolog, P., Houben, G.-J. (eds.)
UMAP 2014. LNCS, vol. 8538, pp. 381–392. Springer, Heidelberg (2014)
8. Sahlgren, M.: An introduction to random indexing. In: Methods and Applications
of Semantic Indexing Workshop, TKE 2005 (2005)
9. Semeraro, G., Lops, P., Degemmis, M.: Wordnet-based user profiles for neighbor-
hood formation in hybrid recommender systems. In: Fifth International Conference
on Hybrid Intelligent Systems, HIS 2005, pp. 291–296. IEEE (2005)
10. Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., Qin, B.: Learning sentiment-specific
word embedding for twitter sentiment classification. In: Proceedings of the 52nd
Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1555–
1565 (2014)
Tracking Interactions Across Business News,
Social Media, and Stock Fluctuations
1 Introduction
The nature of the complex relationships among traditional news, social media,
and stock price fluctuations is the subject of active research. Recent studies in the
area demonstrate that it is possible to find some correlation between stock prices
and news, when the news are properly classified [1,9]. A comprehensive overview
of market data prediction from text can be found in [7]. In particular, [6] reported
an increase in Wikipedia views for company pages and financial topics before
stock market falls. Joint analysis of news and social media has been previously
studied, inter alia, by [4,5,8]. The approach followed in these papers, as well as
our approach [2], has two interrelated goals: to find information complementary
to what is found in the news, and to control the amount of data that needs to
be downloaded from social media.
We study the interplay among business news, social media, and stock prices.
We believe that the combined analysis of information derived from news, social
media and financial data can be of particular interest for specialists in various
areas: business analysts, Web scientists, data journalists, etc. We use PULS 1 to
collect on-line news articles from multiple sources and to identify the business
entities mentioned in the news texts, e.g., companies and products, and the asso-
ciated event types such as “product launch,” “recall,” “investment” [3]. Using
these entities we then construct queries to get the corresponding social media
content and its metadata, such as, Twitter posts, YouTube videos, or Wikipedia
pages. We focus on analyzing the activity of users of social media in numerical
terms, rather than on analyzing the content, polarity, sentiment, etc.
1 The Pattern Understanding and Learning System: http://puls.cs.helsinki.fi.
Fig. 1. A news text and a product recall event produced by the PULS IE engine.
The main contributions of this paper are as follows: we combine NLP with social media analysis, and we discover interesting correlations between news and social media.
2 Process Overview
We now present the processing steps. First, the system collects unstructured text
from multiple news sources on the Web. PULS uses over a thousand websites
which provide news feeds related to business (Reuters Business News, New York
Times Business Day, etc.). Next, the NLP engine is used to discover, aggregate,
and verify information obtained from the Web. The engine performs Information
Extraction (IE), which is a key component of the platform that transforms facts
found in plain text into a structured form 2.
An example event is shown in Fig. 1. The text mentions a product recall event
involving General Motors, in July 2014. For each event, the IE system extracts
a set of entities: companies, industry sectors, products, location, date, and other
attributes of the event. This structured information is stored in the database, for
querying and broader analysis. Then PULS performs deeper semantic analysis
and uses machine learning to infer some of the attributes of the events, providing
richer information than general-purpose search engines.
Next, using the entities aggregated from the texts, the system builds queries
for the social media sources, e.g. to search company and product names using
Twitter API [2]. The role of the social media component is to enable investigation
of how companies and products mentioned in the news are portrayed on social
media. Our system supports content analysis from different social media services.
In this paper, we focus on numerical measurement and analysis of the content.
We count the number of Wikipedia views of the company and the number of its
mentions in the news and then use time series correlation to demonstrate the
correspondence between news and Wikipedia views. We also correlate these with
upward vs. downward stock fluctuations.
We have complete Wikipedia page request history for all editions, starting
from early 2008, updated daily, and we can instantaneously access the daily hit-counts.
[Fig. 2 shows, for each of the three companies, three stacked plots over March–December 2014: daily stock price differences (top), daily number of mentions in PULS news (middle), and daily number of Wikipedia page hits (bottom).]
Fig. 2. Daily differences in stock prices, number of mentions in PULS news and number of Wikipedia hits in 2014 for three companies.
3 Results
In this section we demonstrate results that can be obtained using this kind of
processing. We present two types of results: A. visual analysis of correspondence
between Wikipedia views, news hits and stock prices, and B. time-series corre-
lations between news hits and Wikipedia views.
In the first experiment we chose three companies—Alstom, Malaysia Airlines,
and General Motors. We present the number of mentions in the news collected
by PULS, the number of views of the company’s English-language Wikipedia
page, and stock data, using data from March to December 2014.
In each figure, the top plot shows the daily difference in stock price—the
absolute value of the opening price on a given day minus price on the previous
day, obtained from Yahoo! Finance. The middle plot shows the number of men-
tions of the company in PULS news. The bottom plot shows the number of hits
on the company’s Wikipedia page. In each plot, the dashed line represents the
daily values and the bold line is the value smoothed over three days.
Figure 2a plots the data for the French multinational Alstom. The company
is primarily known for its train-, power-, and energy-related products and services.
In the plot we can see a pattern where the stock price and news mentions seem
to correlate rather closely. Wikipedia page hits show some correlation with the
other plots. The news plot shows three major spikes, with two spikes in Wikipedia
hits. The March peak corresponds to news about business events (investments),
whereas the other peaks had a political aspect, which could trigger activity in
social media; e.g., in June, the French government bought 20 % of Alstom shares,
which caused an active public discussion.
Fig. 3. Cross-correlation between Wikipedia views and mentions in PULS news for 11
companies.
In the second experiment, we computed the cross-correlation between the news-mention time series and the Wikipedia-view time series of the 11 companies, a total of 121 cross-correlations2. We limit the lag between the time series to seven days, based on the assumption that if there exists a connection between news and Wikipedia views it should be visible within a week.
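A minimal sketch of such a lagged correlation analysis (plain numpy; the series are illustrative, and the paper reports computing these cross-correlations in R).

import numpy as np

def best_lagged_correlation(news, wiki, max_lag=7):
    """Return (correlation, lag); a positive lag means the Wikipedia series follows the news series."""
    news, wiki = np.asarray(news, dtype=float), np.asarray(wiki, dtype=float)
    best_r, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            a, b = news[:-lag], wiki[lag:]
        elif lag < 0:
            a, b = news[-lag:], wiki[:lag]
        else:
            a, b = news, wiki
        r = np.corrcoef(a, b)[0, 1]
        if abs(r) > abs(best_r):
            best_r, best_lag = r, lag
    return best_r, best_lag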
The results of this experiment are presented in Fig. 3, where the circle size
represents the correlation strength, the colour represents the sign of the correlation (blue means positive, red negative), and the numbers give the time lag at which the
highest correlation for a given company pair was obtained: positive lag means
that Wikipedia views followed news mentions, negative lag means that news
followed Wikipedia views.
It can be seen from the figure that the largest correlations and the lowest lags
can be found on the diagonal, i.e., between news mention for a company and the
number of views of the company Wikipedia page. Among the 11 companies there
are two exceptions: The Home Depot and Netflix. For Netflix, news mentions
and Wikipedia views do not seem to be strongly correlated with any time series.
News about Alibaba shows a surprising correlation with Wikipedia hits on Home
Depot on the following day. At present we do not see a clear explanation for these
phenomena; these can be accidental, or may indicate some hidden connections
(they are both major on-line retailers).
The lag on the diagonal equals zero in most cases, which means that in
those cases the peaks occur on the same days. At a later time, we can inves-
tigate finer intervals (less than one day). We believe it would be interesting if
a larger study confirmed that we can observe regular patterns in the correla-
tions and the lags are stable—e.g., if a spike in the news regularly precedes a
spike in the Wikipedia views—since that would confirm that these models can have
predictive power.
We have presented a study of the interplay between company news, social media
visibility, and stock prices. Information extracted from on-line news by means
of deep linguistic analysis is used to construct queries to various social media
platforms. We expect that the presented framework would be useful for business
professionals, Web scientists, and researchers from other fields.
The results presented in Sect. 3 demonstrate the utility of collecting and
comparing data from a variety of sources. We were able to discover interesting
correlations between the mentions of a company in the PULS news and the views
of its page in Wikipedia. The correspondence with stock prices was less obvious.
We continue work on refining the forms of data presentation. For example, we
have found that plotting (absolute) differences in stock prices may in some cases
provide better insights than using raw stock prices.
In future work, we plan to cover a wider range of data sources and social plat-
forms, general-purpose (e.g., YouTube or Twitter) and business-specific ones
(e.g., StockTwits). We plan to analyze the social media content as well, e.g.,
2 We use the standard R ccf function to calculate the cross-correlations.
to determine the sentiment of the tweets that mention some particular com-
pany. Covering multiple sources is important due to the different nature of
the social media. Tweets are short Twitter posts, where usually a user shares
her/his impression about an entity (company or product), or posts a related link.
Wikipedia, on the other hand, is used for obtaining more in-depth information
about an entity. YouTube, in turn, is for both the consumption and creation of
reviews, reports, and endorsements.
This phase faces some technical limitations. For example, while Twitter data
can be collected through the Twitter API in near-real time, the API returns
posts only from recent history (7–10 days). This means that keyword extraction
and data collections should be done relatively soon after the company or prod-
uct appears in the news; combined with Twitter API request limits, this poses
challenges to having a comprehensive catalogue of the posts.
Our research plans include building accurate statistical models on top of
the collected data, to explore the correlations, possible cause-effect relations,
etc. We aim to find the particular event types (lay-offs, new products, lawsuits)
that cause reaction on social media and/or in stock prices. We also aim to find
predictive patterns of visibility on social media for companies and products,
based on history or on typical behaviour for a given industry sector.
Subtopic Mining Based on Three-Level
Hierarchical Search Intentions
1 Introduction
Many web queries are short and unclear. Some users do not choose appropriate
words for a web search, and others omit specific terms needed to clarify search
intentions, because it is not easy for users to express their search intentions
explicitly through keywords. This intention gap between users and queries results
in queries which are ambiguous and broad.
As one solution to these problems, subtopic mining has been proposed:
it finds possible subtopics for a given query and returns a ranked list of
them in terms of their relevance, popularity, and diversity [1,2]. A subtopic is
a query which disambiguates and specifies the search intention of the original
query, and good subtopics must be relevant to the query and satisfy both high
popularity and high diversity. The latest subtopic mining task [3] introduced a new
setting in which the two-level hierarchy of subtopics consists of at most
“five” first-level subtopics and at most “ten” second-level subtopics for each
first-level subtopic. For example, if the query is “apple,” its first-level subtopics
are “apple fruit” and “apple company,” and second-level subtopics for “apple
This work was partly supported by the ICT R&D program of MSIP/IITP
(10041807), the SYSTRAN International corporation, the BK 21+ Project, and the
National Korea Science and Engineering Foundation (KOSEF) (NRF-2010-0012662).
company” are “apple ipad” and “apple macbook.” This hierarchy better presents
the structure of diversified search intentions for the queries, and can also
limit the number of subtopics shown to users by selecting only first-level
subtopics, because users do not want to see too many subtopics.
The state-of-the-art methods [4,5] for the two-level hierarchy of subtopics
used various external resources such as Wikipedia, suggested queries, and web
documents from major web search engines instead of the resources provided
by the subtopic mining task [3]. However, these methods did not consider
the characteristics of the resources in terms of popularity and diversity. Moreover,
since the titles of web documents represent their overall subjects, these methods
generated first-level subtopics using keywords extracted from only
the titles. However, title-based first-level subtopics may not be enough to
satisfy both high popularity and high diversity, because a title, being a phrase or
a short sentence, is less informative than the document itself.
To solve these issues, we propose a subtopic mining method based on three-
level hierarchical search intentions of queries. Our method is a bottom-up
approach which mines second-level subtopics first and then generates first-level
subtopics. We extract various relevant subtopic candidates from web documents using
a simple pattern, and select second-level subtopics from this candidate set.
The selected subtopics are ranked by a proposed popularity measure, and are
expanded and re-ranked considering the characteristics of the provided resources.
Using topic modeling, we build five term clusters and generate first-level
subtopics consisting of the query and general terms. Our contributions are as
follows:
• Our method uses only the limited resources (suggested queries, query dimen-
sions¹, and web documents) provided by the subtopic mining task [3], and
we reflect the characteristics of these resources in terms of popularity and diversity
in the second-level subtopic mining step.
• Our work divides “second-level” subtopics into “higher-level (level 2)” and
“lower-level (level 3)” subtopics considering the hierarchical search intentions.
Higher-level subtopics reflect wider search intentions than their lower-level
ones. We generate high-quality first-level (level 1) subtopics using words in
higher-level subtopics as well as the titles of web documents.
candidates as new queries can be derived from nouns rather than from other parts
of speech, and can be useful in finding the hidden search intentions of the given query.
From this assumption, we create a simple pattern:
((adjective)? (noun)+ (non-noun)∗ )? (query)((non-noun)∗ (adjective)? (noun)+ )?
where the ? operator means “zero or one”; the + operator “one or more”; and
the * operator “zero or more.”
This pattern is applied to the top 1,000 relevant documents for the query,
and the extracted subtopic candidates are truly relevant because they consist
of the whole query together with noun phrases from the documents.
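As an illustration, the pattern can be applied to POS-tagged sentences by encoding each tag as a single character and matching an ordinary regular expression against the resulting tag string. The sketch below is a simplified, hypothetical rendition (the tag map, function names, and example sentence are ours, not the paper's):

```python
import re

# One character per token: J = adjective, N = noun, Q = query word, O = other.
TAG_MAP = {"JJ": "J", "JJR": "J", "JJS": "J",
           "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N"}

def extract_candidates(tagged_tokens, query_words):
    """Apply ((adj)? (noun)+ (non-noun)*)? query ((non-noun)* (adj)? (noun)+)?
    to a POS-tagged sentence given as a list of (word, tag) pairs."""
    query_words = {w.lower() for w in query_words}
    chars = []
    for word, tag in tagged_tokens:
        chars.append("Q" if word.lower() in query_words else TAG_MAP.get(tag, "O"))
    tag_string = "".join(chars)
    candidates = []
    for m in re.finditer(r"(J?N+O*)?Q+(O*J?N+)?", tag_string):
        words = [w for w, _ in tagged_tokens[m.start():m.end()]]
        candidates.append(" ".join(words))
    return candidates

sentence = [("fresh", "JJ"), ("fruit", "NN"), ("apple", "NN"),
            ("recipes", "NNS"), ("are", "VBP"), ("online", "JJ")]
print(extract_candidates(sentence, ["apple"]))   # ['fresh fruit apple recipes']
```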
Next, we define a higher-level subtopic as one that reflects a wide search intention
and whose intention is clearly distinct from the other search intentions. If the highly
relevant documents for a given query are assumed to represent all of the user’s possible
search intentions, and the appearance of a subtopic candidate in a document is
interpreted to mean that the candidate covers some search intention,
then a higher-level subtopic covers (appears in) many of the highly relevant
documents (i.e., search intentions), and its document set is more distinct
than the document sets of the other subtopics. Therefore, to select from the subtopic
candidate set the higher-level subtopics satisfying both of the above
conditions, we propose a scoring measure, the Selection Score (SS):
$$ SS(st, US) = \frac{|D(st) \cap US^{c}|}{\left|\bigcup_{st' \in ST} D(st')\right|} \times CE(st), \qquad (1) $$
query dimension are relevant to the common query, we regard all items in this
dimension as relevant items. For each resource, we assume that:
From the first assumption, we add items of relevant query dimensions to the
ranked list of second-level subtopics to improve their diversity (Fig. 1).
(1) If a higher-level subtopic contains one of the items in a relevant query dimension,
the corresponding item is replaced with the higher-level subtopic, and the original
place of the higher-level subtopic is replaced with the ranked list of items treated as
higher-level subtopics. (2) If no higher-level subtopic contains any item of
a relevant query dimension, the top item of the dimension is added to the ranked
list of second-level subtopics as the last-ranked higher-level subtopic. Meanwhile,
from the second assumption, we reflect the high popularity of suggested queries
in second-level subtopic re-ranking: if a higher-level subtopic contains the i-th
ranked suggested query, this subtopic is re-ranked as the i-th ranked higher-level
subtopic, and its lower-level subtopics are re-ranked with it. Non-matched suggested
queries are deleted from the original suggested query list.
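A minimal sketch of the suggested-query-based re-ranking step only (the query-dimension rules (1) and (2) above are omitted for brevity); the list names and the substring-matching criterion are our assumptions:

```python
def rerank_with_suggested_queries(higher_level, suggested_queries):
    """Move each higher-level subtopic that contains the i-th ranked
    suggested query to rank i; unmatched subtopics keep their relative
    order after the matched ones."""
    matched, unmatched = [], list(higher_level)
    for sq in suggested_queries:          # suggested queries are already ranked
        for st in list(unmatched):
            if sq.lower() in st.lower():
                matched.append(st)
                unmatched.remove(st)
                break                     # one subtopic per suggested query
    return matched + unmatched

subtopics = ["apple ipad price", "apple macbook pro", "apple fruit recipes"]
suggested = ["apple macbook", "apple ipad"]
print(rerank_with_suggested_queries(subtopics, suggested))
# ['apple macbook pro', 'apple ipad price', 'apple fruit recipes']
```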
Table 1. Mean results of methods for relevance, popularity, and diversity of subtopics
References
1. Song, R., Zhang, M., Sakai, T., Kato, M.P., Liu, Y., Sugimoto, M., Wang, Q., Orii,
N.: Overview of the NTCIR-9 intent task. In: Proceedings of NTCIR-9 Workshop
Meeting, pp. 82–105. National Institute of Informatics, Tokyo, Japan (2011)
2. Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., Song, R.: Overview of the
NTCIR-10 INTENT-2 task. In: Proceedings of NTCIR-10 Workshop Meeting, pp.
94–123. National Institute of Informatics, Tokyo, Japan (2013)
3. Liu, Y., Song, R., Zhang, M., Dou, Z., Yamamoto, T., Kato, M., Ohshima, H., Zhou,
K.: Overview of the NTCIR-11 imine task. In: Proceedings of NTCIR-11 Workshop
Meeting, pp. 8–23. National Institute of Informatics, Tokyo, Japan (2014)
² http://lemurproject.org/clueweb12/.
³ http://nlp.stanford.edu/software/tagger.shtml.
⁴ http://mecab.sourceforge.net.
4. Yamamoto, T., Kato, M.P., Ohshima, H., Tanaka, K.: Kuidl at the NTCIR-11 imine
task. In: Proceedings of NTCIR-11 Workshop Meeting, pp. 53–54. National Institute
of Informatics, Tokyo, Japan (2014)
5. Luo, C., Li, X., Khodzhaev, A., Chen, F., Xu, K., Cao, Y., Liu, Y., Zhang, M.,
Ma, S.: Thusam at NTCIR-11 imine task. In: Proceedings of NTCIR-11 Workshop
Meeting, pp. 55–62. National Institute of Informatics, Tokyo, Japan (2014)
6. Dou, Z., Hu, S., Luo, Y., Song, R., Wen, J.R.: Finding dimensions for queries. In:
Proceedings of the 20th ACM International Conference on Information and Knowl-
edge Management, pp. 1311–1320. Association for Computing Machinery, Glasgow,
Scotland, UK (2011)
7. Zeng, H.J., He, Q.C., Chen, Z., Ma, W.Y., Ma, J.: Learning to cluster web search
results. In: Proceedings of the 27th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 210–217. Association
for Computing Machinery, Sheffield, South Yorkshire, UK (2004)
8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res.
3, 993–1022 (2003)
9. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and
beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009)
Cold Start Cumulative Citation
Recommendation for Knowledge Base
Acceleration
1 Introduction
Recent years have witnessed rapid growth of Knowledge Bases (KBs) such as
Wikipedia. Currently, the maintenance of a KB mainly relies on human editors.
To help the editors keep KBs up-to-date, Cumulative Citation Recommendation
(CCR) has been proposed by the Text Retrieval Conference (TREC) Knowledge Base
Acceleration (KBA) track¹ since 2012.
relevant documents from a chronological document collection and evaluate their
citation-worthiness to target KB entities. The key objective of CCR is to identify
“vital” documents, which would trigger updates to the entry page of target
entities in the KB [6]. Therefore, CCR is also known as vital filtering.
Generally, the target entities are identified by a reference KB like Wikipedia,
so their KB profiles can be employed to perform entity disambiguation and
This work was done when the first author was visiting Microsoft Research Asia.
¹ http://trec-kba.org/.
In terms of cold start CCR, we perform relevance estimation for target entities
without KB entries. We follow the four-point scale relevance settings of TREC
KBA, i.e., vital, useful, unknown, and non-referent. The documents which contain
timely information about target entities’ current states, actions or situations are
“vital” documents. Vital documents would motivate a change to an already up-
to-date KB article.
Since vital signals are usually captured in the sentence or short passage sur-
rounding entity mentions in a document [6], it would be better to consider the
sentence mentioning a target entity instead of the whole document. Besides, if
a document contains several sentences mentioning the target entity, we take the
max rating of these sentences as the document’s final rating.
Time Range (TR). In vital filtering, we must assess the time lag between the
relevant event and the documents. It is intuitive to assume that the later a document
occurs, the smaller the relevance score it should get, even if two documents report
the same event. Hence, we penalize later documents in event-based clusters by
decreasing the feature value of a document over hours. The first document of a
cluster gets a feature value of 1.0, and later documents get smaller values, which
can be expressed by the decay function tr(d_i) = 1.0 − (h_i − h_0)/72.0, where h_0 is
the hour converted from the timestamp of the first document d_0 in the cluster,
and h_i is the hour of the i-th document d_i. The constant 72.0 corresponds to the
three-day window used in our experiments.
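A minimal sketch of the TR feature; the clipping to [0, 1] for documents arriving more than three days after the first one is our assumption, since the paper only gives the decay function:

```python
def time_range_feature(doc_hour, cluster_first_hour, horizon=72.0):
    """TR feature: tr(d_i) = 1.0 - (h_i - h_0) / 72.0 for documents in an
    event-based cluster. Clipping to [0, 1] is our addition so that documents
    more than three days after the first one simply contribute 0."""
    value = 1.0 - (doc_hour - cluster_first_hour) / horizon
    return max(0.0, min(1.0, value))

print(time_range_feature(doc_hour=0,  cluster_first_hour=0))   # 1.0 (first document)
print(time_range_feature(doc_hour=24, cluster_first_hour=0))   # 0.666... (one day later)
```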
Local Profile (LP). Some entity mentions in stream documents are ambiguous.
To solve this problem, we create a local profile for each target entity, which contains
profile information found around the entity mentions in documents. Usually, when
an entity is mentioned, its local profile (e.g., title and profession) is also mentioned
to let readers know who this entity is. For example, “All the other
stuff matters not, Lions coach Bill Templeton said”. From the above sentence, we
know the target entity (Bill Templeton) is a coach. Of course, if the mentioned
entity is very popular, its title or profession is usually omitted. Nevertheless,
most entities in cold start CCR are less popular entities.
In our approach, we calculate the cosine similarity between a target entity’s
local profile and the extracted local profile of its possible mention as a feature.
Firstly, we need to construct the local profile for each target entity. We acquire
the title/profession dictionaries from Freebase2 , containing 2,294 titles and 2,440
professions. Secondly, we extract the word-based n-grams (n = 1, 2, 3, 4, 5) inside
a sliding window around a target entity mention. The n-grams that exist in the
dictionaries form the local profile vector. Lastly, we construct the local profile
vector for each target entity with the n-grams extracted from all vital and useful
documents in the training data.
² https://www.freebase.com/.
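A minimal sketch of the LP feature; the window size, the toy dictionary, and the way the profile is built in the example are illustrative assumptions, while the n-gram extraction and the cosine similarity follow the description above:

```python
import math
from collections import Counter

def ngrams_in_window(tokens, mention_idx, window=5, max_n=5):
    """Word n-grams (n = 1..5) inside a sliding window centred on a mention."""
    lo, hi = max(0, mention_idx - window), min(len(tokens), mention_idx + window + 1)
    span = tokens[lo:hi]
    return [" ".join(span[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(span) - n + 1)]

def local_profile_similarity(entity_profile, tokens, mention_idx, dictionary):
    """Cosine similarity between an entity's local-profile vector and the
    dictionary n-grams found around a candidate mention."""
    mention_vec = Counter(g for g in ngrams_in_window(tokens, mention_idx)
                          if g in dictionary)
    dot = sum(entity_profile.get(g, 0) * c for g, c in mention_vec.items())
    norm_e = math.sqrt(sum(v * v for v in entity_profile.values()))
    norm_m = math.sqrt(sum(v * v for v in mention_vec.values()))
    return dot / (norm_e * norm_m) if norm_e and norm_m else 0.0

dictionary = {"coach", "lions coach"}                 # toy title/profession entries
profile = {"coach": 3, "lions coach": 1}              # built from training documents
tokens = "all the other stuff matters not lions coach bill templeton said".split()
print(local_profile_similarity(profile, tokens, tokens.index("bill"), dictionary))  # ~0.89
```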
Action Pattern (AP). Vital documents typically contain sentences that describe
events in which the target entities carry out some actions, e.g. scored a goal,
won an election. Therefore, an entity’s action in a document is a key indicator
that the document is vital to the target entity. We find that if a target entity
is involved in an action, it usually appears as the subject or object of the sentence.
So we mine triples from sentences mentioning a target entity. If a triple is found
in which the target entity occurs as subject or object, we consider that the entity
takes an action in the sentence (event).
We adopt ReVerb [4], a state-of-the-art open-domain extractor that targets
verb-centric relations, to mine the triples. Such relations are expressed
as triples <subject, verb, object>. We run ReVerb on each sentence mentioning
a target entity and extract the triples. Then, for each triple, we use
“entity + verb” and “verb + entity” as action patterns. Note that the
verb is stemmed in our experiments. For example, from the sentence “Public
Lands Commissioner Democrat Peter Goldmark won re-election”, the extracted
triple is <Peter Goldmark, won, re-election>, and the action pattern is “Peter
Goldmark win”. In our system, each action pattern is used as a binary feature: if
the sentence/document contains the pattern, the feature value is 1, otherwise 0.
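A minimal sketch of turning extracted triples into action patterns; the stemmer is passed in as a parameter, and the toy mapping used in the example is only an illustration:

```python
def action_patterns(triples, entity, stem=lambda v: v):
    """Binary action-pattern features from <subject, verb, object> triples:
    'entity + verb' when the entity is the subject, 'verb + entity' when it
    is the object. Pass a stemmer/lemmatizer as `stem` (the paper stems verbs)."""
    patterns = set()
    for subj, verb, obj in triples:
        if entity.lower() in subj.lower():
            patterns.add(f"{entity} {stem(verb)}")
        if entity.lower() in obj.lower():
            patterns.add(f"{stem(verb)} {entity}")
    return patterns

triples = [("Peter Goldmark", "won", "re-election")]
# With a lemmatizer mapping "won" -> "win", this yields {'Peter Goldmark win'}.
print(action_patterns(triples, "Peter Goldmark", stem={"won": "win"}.get))
```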
3 Experiments
3.1 Dataset
³ http://sourceforge.net/p/lemur/wiki/RankLib/.
⁴ http://trec-kba.org/kba-stream-corpus-2014.shtml.
Table 1. Results of all experimental methods. All measures are reported by the KBA
official scorer with cutoff-step-size=10.
4 Conclusion
In this paper, we focus on cold start Cumulative Citation Recommendation
for Knowledge Base Acceleration, in which the target entities do not exist in
the reference KB. Since the KB profile is unavailable in cold start CCR, we split
sentences in the stream documents and cluster them chronologically to detect
vital events related to the target entities. Based on the sentence clustering results,
we then extract three kinds of novel features: time range, local profile, and action
pattern. Moreover, we adopt a random-forest-based ranking method to perform
relevance estimation. Experimental results on the TREC-KBA-2014 dataset have
demonstrated that this two-step strategy can improve system performance under
cold start circumstances.
Acknowledgement. The authors would like to thank Jing Liu for his valuable sug-
gestions and the anonymous reviewers for their helpful comments. This work is funded
by the National Program on Key Basic Research Project (973 Program, Grant No.
2013CB329600), National Natural Science Foundation of China (NSFC, Grant Nos.
61472040 and 60873237), and Beijing Higher Education Young Elite Teacher Project
(Grant No. YETP1198).
References
1. Balog, K., Ramampiaro, H.: Cumulative citation recommendation: classification vs.
ranking. In: SIGIR, pp. 941–944. ACM (2013)
2. Balog, K., Ramampiaro, H., Takhirov, N., Nørvåg, K.: Multi-step classification
approaches to cumulative citation recommendation. In: OAIR, pp. 121–128. ACM
(2013)
3. Bonnefoy, L., Bouvier, V., Bellot, P.: A weakly-supervised detection of entity central
documents in a stream. In: SIGIR, pp. 769–772. ACM (2013)
4. Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information
extraction. In: EMNLP, pp. 1535–1545. ACL (2011)
5. Frank, J.R., Kleiman-Weiner, M., Roberts, D.A., Niu, F., Zhang, C., Re, C.,
Soboroff, I.: Building an entity-centric stream filtering test collection for TREC
2012. In: TREC, NIST (2012)
6. Frank, J.R., Kleiman-Weiner, M., Roberts, D.A., Voorhees, E., Soboroff, I.: Evalu-
ating stream filtering for entity profile updates in TREC 2012, 2013, and 2014. In:
TREC, NIST (2014)
7. Robertson, S.E., Soboroff, I.: The TREC 2002 filtering track report. In: TREC,
NIST (2002)
8. Wang, J., Song, D., Lin, C.-Y., Liao, L.: BIT and MSRA at TREC KBA CCR track
2013. In: TREC, NIST (2013)
9. Wang, J., Song, D., Wang, Q., Zhang, Z., Si, L., Liao, L., Lin, C.-Y.: An entity class-
dependent discriminative mixture model for cumulative citation recommendation.
In: SIGIR, pp. 635–644 (2015)
Cross Domain User Engagement Evaluation
1 Introduction
Twitter is a popular micro-blogging platform, which allows users to share their
opinions and thoughts as fast as possible in very short texts. This makes Twitter
a rich source of information with high speed of information diffusion. Therefore,
several web applications (e.g., IMDb) have been integrated with Twitter to let
people express their opinions about items (e.g., movie) in a popular social net-
work [2,10].
It has been shown that the amount of user interaction with tweets can be used to
measure user satisfaction. In more detail, user engagement in Twitter has
a strong positive correlation with the interest of users in the received tweets [2].
¹ In each tweet, the user rates or likes/dislikes a product.
² “User Engagement as Evaluation” Challenge, http://2014.recsyschallenge.com/.
³ There are many tweets with zero engagement and only a few tweets with positive engagement.
$$ f = \arg\min_{f \in \mathcal{H}_K} L(f(X), Y) + \sigma \|f\|_K^2 + \lambda D_{f,K}(J_s, J_t) + \gamma M_{f,K}(P_s, P_t) $$
where n(i) denotes the number of training instances with label yi . A similar idea
for coping with imbalanced data has been previously proposed in [1] for single-
task classification and in [7,9] for multi-task learning. We can now redefine the
ARTL learning formulation as follows:
$$ f = \arg\min_{f \in \mathcal{H}_K} W\, L(f(X), Y) + \sigma \|f\|_K^2 + \lambda D_{f,K}(J_s, J_t) + \gamma M_{f,K}(P_s, P_t) $$
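As an illustration of the instance weighting W, the sketch below computes weights inversely proportional to the class frequency n(y_i); the exact normalization is our choice and may differ from the one used by the authors:

```python
from collections import Counter

def balanced_instance_weights(labels):
    """Weight each training instance inversely to its class frequency n(y_i),
    so the few positive-engagement tweets are not drowned out by the many
    zero-engagement ones (normalized so that the weights average to 1)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

labels = [0, 0, 0, 0, 0, 0, 1, 1]        # toy, heavily imbalanced labels
print(balanced_instance_weights(labels))
# majority-class weight = 8 / (2 * 6) ~= 0.67, minority-class weight = 2.0
```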
2.2 Features
We extract 23 features from each tweet, which are partitioned into three categories:
user-based, item-based, and tweet-based. Note that the contents of tweets in our
task are predefined by the web applications and users usually do not edit tweet
contents. These features were previously used in [9,10]. More details about the
exact definition of the features can be found in [10]. The list of our features is as
follows:
User-Based Features. Number of followers, Number of followees, Number of
tweets, Number of tweets about domain’s items, Number of liked tweets, Number
of lists, Tweeting frequency, Attracting followers frequency, Following frequency,
Like frequency, Followers/Followees, Followers-Followees.
Item-Based Features. Number of tweets about the item.
Tweet-Based Features. Mention count, Number of hash-tags, Tweet age,
Membership age at the tweeting time, Hour of tweet, Day of tweet, Time of
tweet, Holidays or not, Same language or not, English or not.
3 Experiments
In our evaluations, we use the dataset provided by [9], which is gathered from four
diverse and popular web applications (domains): IMDb, YouTube, Goodreads,
and Pandora, which contain movies, video clips, books, and music, respectively.⁴
Statistics of the dataset are reported in Table 1.
To have a complete and fair evaluation, in our experiments all models are
trained using the same number of training instances. For each domain, we randomly
select 16,361 and 32,722 instances to create the training and test sets, respectively.
We repeat this process 30 times using random shuffling. We report the
average of the results obtained on these 30 shuffles; the task is to separate tweets with
positive engagement from the tweets with zero engagement.
⁴ The dataset is freely available at http://ece.ut.ac.ir/node/100770.
The results obtained by STL and ARTL are reported in Table 2. In this table,
significant differences between results are marked with a star. According to this
table, in some cases STL performs better and in other cases ARTL outperforms
STL. In the following, we analyze the obtained results for each target domain.
IMDb. In the case that IMDb is the target domain, ARTL significantly outper-
forms STL, in terms of BA; however, the accuracy values achieved by SVM are
higher than those obtained by ARTL. This shows that ARTL can classify the
minority class instances (tweets with positive engagement) significantly better
than SVM, but it fails in classifying the instances belonging to the majority
class. The reason is that IMDb is the most imbalanced domain in the dataset
Table 2. Accuracy and balanced accuracy achieved by single-task learning and transfer
learning methods.
Train \ Test  Metric  IMDb              YouTube           Goodreads         Pandora
                      STL      ARTL     STL      ARTL     STL      ARTL     STL      ARTL
IMDb          BA      -        -        0.6445*  0.6033   0.5802   0.5911*  0.5663*  0.5492
              Acc.    -        -        0.7889*  0.6797   0.8616*  0.6924   0.8681*  0.6796
YouTube       BA      0.5378   0.5542*  -        -        0.5534   0.5582*  0.5447*  0.5390
              Acc.    0.9529*  0.9031   -        -        0.9350*  0.9197   0.9383*  0.9031
Goodreads     BA      0.5917   0.5933   0.6767*  0.6506   -        -        0.5752*  0.5572
              Acc.    0.7830*  0.7008   0.5745   0.6360*  -        -        0.7557*  0.6720
Pandora       BA      0.5731   0.5820*  0.6602*  0.6368   0.5948   0.5985   -        -
              Acc.    0.6835*  0.6525   0.5403   0.6485*  0.6769   0.6682   -        -
⁵ The results without instance weighting are biased toward the majority class; for the sake of space, they are not reported.
(see Table 1); thus, STL cannot learn a proper model when there is a large
gap between the feature distributions of the source and the target domains. This
is why the maximum difference between the performance of ARTL and STL
occurs when YouTube is selected as the source domain.
YouTube. Unlike the previous case, when YouTube is chosen as the target
domain, STL outperforms ARTL in terms of BA. In some cases (i.e., training
on Goodreads and Pandora), ARTL achieves higher accuracy compared to STL.
The reason is that other domains are much more imbalanced than YouTube and
in that case, the trained STL model is more accurate in detecting instances from
the minority class, which leads to the better BA, but worse accuracy.
Goodreads. The results achieved over the Goodreads domain are very similar to
those obtained over the IMDb domain. In other words, ARTL is more successful
than STL in detecting tweets with positive engagement, since it achieved higher
balanced accuracy but lower accuracy. As shown in Table 2, the best performance
over this target domain is achieved when the model is trained using the IMDb
or the Pandora domains. The percentage of data with positive engagement in
these two domains are much more similar to Goodreads, compared to YouTube.
Thus, learning from these domains can achieve higher accuracy.
Pandora. According to Table 2, transferring knowledge does not help to improve
the user engagement evaluation performance. The reason could be related to
the different distributions of the data from Pandora and the other domains. As
reported in Table 1, the average engagement in this domain is much lower than
in the other domains, which leads to a very different feature distribution.
Acknowledgements. This work was supported in part by the Center for Intelli-
gent Information Retrieval. Any opinions, findings and conclusions or recommenda-
tions expressed in this material are those of the authors and do not necessarily reflect
those of the sponsor.
References
1. Akbani, R., Kwek, S.S., Japkowicz, N.: Applying support vector machines to imbal-
anced datasets. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.)
ECML 2004. LNCS (LNAI), vol. 3201, pp. 39–50. Springer, Heidelberg (2004)
2. Loiacono, D., Lommatzsch, A., Turrin, R.: An Analysis of the 2014 RecSys chal-
lenge. In: RecSysChallenge, pp. 1–6 (2014)
3. Long, M., Wang, J., Ding, G., Pan, S.J., Yu, P.S.: Adaptation regularization: a
general framework for transfer learning. IEEE Trans. Knowl. Data Eng. 26(5),
1076–1089 (2014)
4. Petrovic, S., Osborne, M., Lavrenko, V.: RT to win! predicting message propagation
in twitter. In: ICWSM, pp. 586–589 (2011)
5. Powers, D.: Evaluation: from precision, recall and F-measure to ROC, informed-
ness, markedness & correlation. J. Mach. Learn. Tech. 2(1), 37–63 (2011)
6. Said, A., Dooms, S., Loni, B., Tikk, D.: Recommender systems challenge 2014. In:
RecSys, pp. 387–388 (2014)
7. de Souza, J.G.C., Zamani, H., Negri, M., Turchi, M., Falavigna, D.: Multitask
learning for adaptive quality estimation of automatically transcribed utterances.
In: NAACL-HLT, pp. 714–724 (2015)
8. Uysal, I., Croft, W.B.: User oriented tweet ranking: a filtering approach to
microblogs. In: CIKM, pp. 2261–2264 (2011)
9. Zamani, H., Moradi, P., Shakery, A.: Adaptive user engagement evaluation via
multi-task learning. In: SIGIR, pp. 1011–1014 (2015)
10. Zamani, H., Shakery, A., Moradi, P.: Regression and learning to rank aggregation
for user engagement evaluation. In: RecSysChallenge, pp. 29–34 (2014)
An Empirical Comparison of Term Association
and Knowledge Graphs for Query Expansion
1 Introduction
Vocabulary gap, when searchers and the authors of relevant documents use dif-
ferent terms to refer to the same concepts, is one of the fundamental problems
in information retrieval. In the context of language modeling approaches to IR,
vocabulary gap is typically addressed by adding semantically related terms to
query and document language models (LM), a process known as query or docu-
ment expansion. Therefore, effective and robust query and document expansion
requires information about term relations, which can be conceptualized as a term
graph. The nodes in this graph are distinct terms, while the edges are weighted
according to the strength of the semantic relationship between pairs of terms.
A term association graph is constructed from a given document collection
by calculating a co-occurrence-based information-theoretic measure, such as
mutual information [7] or the hyperspace analog to language (HAL) [2], between each pair of
terms in the collection vocabulary. Term graphs can also be derived from knowledge
bases, such as DBpedia¹, a structured version of Wikipedia, Freebase²,
¹ http://wiki.dbpedia.org/.
² http://freebase.com/.
2 Methods
2.1 Statistical Term Association Graphs
³ http://conceptnet5.media.mit.edu/.
term, over which the sliding window is centered. Each word in the local con-
text is assigned a weight according to its distance from the center of the sliding
window (words that are closer to the center receive higher weight). An n × n
HAL space matrix H, which aggregates the local contexts for all the terms
in the vocabulary, is created after traversing an entire corpus. After that, the
global co-occurrence matrix is produced by merging the row and column corre-
sponding to each term in the HAL space matrix. Each distinct term w_i in the
vocabulary of the collection corresponds to a row in the global co-occurrence
matrix H_{w_i} = {(w_{i1}, c_{i1}), . . . , (w_{in}, c_{in})}, where c_{i1}, . . . , c_{in} are the numbers of
co-occurrences of the term w_i with all other terms in the vocabulary. After the
merge, each row H_{w_i} of the global co-occurrence matrix is normalized to obtain
a HAL-based semantic term similarity matrix for the entire collection:
$$ S_{w_i} = \frac{c_{ij}}{\sum_{j=1}^{n} c_{ij}} $$
Due to the smaller size of the context window, HAL-based term association graphs
are typically less noisy than MI-based ones.
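A minimal sketch of HAL-style graph construction with linearly decaying window weights, merged left/right contexts, and row normalization; the window size and the toy corpus are illustrative (the paper uses a window of size 20 over full collections):

```python
from collections import defaultdict

def hal_similarity(corpus_tokens, window=10):
    """Build a HAL-style co-occurrence matrix with linearly decaying weights
    inside a sliding window, merge the two directions (row and column), and
    row-normalize to obtain term-to-term similarities."""
    cooc = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(corpus_tokens):
        for d in range(1, window + 1):
            if i + d >= len(corpus_tokens):
                break
            c = corpus_tokens[i + d]
            weight = window - d + 1          # closer words receive higher weight
            cooc[w][c] += weight             # w occurs before c ...
            cooc[c][w] += weight             # ... and the merge makes it symmetric
    sim = {}
    for w, row in cooc.items():
        total = sum(row.values())
        sim[w] = {c: v / total for c, v in row.items()}
    return sim

tokens = "windows xp home edition hd video playback windows xp video".split()
S = hal_similarity(tokens, window=3)
print(sorted(S["xp"].items(), key=lambda kv: -kv[1])[:3])
```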
⁴ http://wiki.dbpedia.org/Downloads39.
3 Experiments
3.1 Datasets
For all experiments in this work we used AQUAINT, ROBUST and GOV
datasets from TREC, which were pre-processed by removing stopwords and
applying the Porter stemmer. To construct the term association graphs, all rare
terms (those that occur in fewer than 5 documents) and all frequent terms (those that occur
in more than 10 % of all documents in the collection) were removed [3,4].
Term association graphs were constructed using either the top 100 most related
terms or the terms with a similarity greater than 0.001 for each distinct
term in the vocabulary of a given collection. HAL term association graphs were
constructed using a sliding window of size 20 [4]. The reported results are based
on the optimal settings of the Dirichlet prior μ and the interpolation parameter α,
empirically determined for all the methods and the baselines. The top 85 terms most
similar to each query term were used for query expansion [1]. The KL-divergence
retrieval model with Dirichlet prior smoothing (KL-DIR) and document
expansion based on a translation model [3] (TM) were used as the baselines.
⁵ http://conceptnet5.media.mit.edu/downloads/20130917/associations.txt.gz.
3.2 Results
Retrieval performance of query expansion using different types of term graphs
and the baselines on different collections and query types is summarized in
Tables 1, 2 and 3. The best and the second best values for each metric are
highlighted in boldface and italic, while † and ‡ indicate statistical significance
in terms of MAP (p < 0.05), according to the Wilcoxon signed-rank test, over the KL-DIR
and TM baselines, respectively.
Table 1. Retrieval accuracy for (a) all queries and (b) difficult queries on AQUAINT
dataset.
(a) All queries                           (b) Difficult queries
Method     MAP      P@20    GMAP          Method     MAP      P@20    GMAP
KL-DIR     0.1943   0.3940  0.1305        KL-DIR     0.0474   0.1250  0.0386
TM         0.2033   0.3980  0.1339        TM         0.0478   0.1250  0.0386
NEIGH-MI   0.2031†  0.3970  0.1326        NEIGH-MI   0.0476   0.1375  0.0393
NEIGH-HAL  0.1989†  0.3900  0.1319        NEIGH-HAL  0.0474   0.1500  0.0378
DB-MI      0.2073†‡ 0.4160  0.1468        DB-MI      0.0528†‡ 0.1906  0.0452
DB-HAL     0.2059†‡ 0.4080  0.1411        DB-HAL     0.0544†‡ 0.1538  0.0455
FB-MI      0.2055†‡ 0.3990  0.1336        FB-MI      0.0534†‡ 0.1333  0.0437
FB-HAL     0.2056†‡ 0.3960  0.1384        FB-HAL     0.0564†‡ 0.1444  0.0471
CNET       0.2051†‡ 0.3900  0.1388        CNET       0.0504†‡ 0.1219  0.044
CNET-MI    0.2042†  0.3920  0.1371        CNET-MI    0.0496†  0.1156  0.0422
CNET-HAL   0.2058†‡ 0.3920  0.1388        CNET-HAL   0.0502†  0.1219  0.0436
Table 2. Retrieval accuracy for (a) all queries and (b) difficult queries on ROBUST
dataset.
(a) All queries                           (b) Difficult queries
Method     MAP      P@20    GMAP          Method     MAP      P@20    GMAP
KL-DIR     0.2413   0.3460  0.1349        KL-DIR     0.0410   0.1290  0.0261
TM         0.2426   0.3488  0.1360        TM         0.0458   0.1290  0.0267
NEIGH-MI   0.2432   0.3460  0.1360        NEIGH-MI   0.0429†  0.1323  0.0273
NEIGH-HAL  0.2431   0.3454  0.1333        NEIGH-HAL  0.0419   0.1260  0.0265
DB-MI      0.2482†‡ 0.3524  0.1397        DB-MI      0.0503†‡ 0.1449  0.0301
DB-HAL     0.2426   0.3444  0.1349        DB-HAL     0.0474†  0.1437  0.0273
FB-MI      0.2452†‡ 0.3526  0.1232        FB-MI      0.0381   0.1222  0.0200
FB-HAL     0.2476†‡ 0.3540  0.1261        FB-HAL     0.0393   0.1272  0.0211
CNET       0.2452†  0.3472  0.1407        CNET       0.0559†‡ 0.1487  0.0334
CNET-MI    0.2495†‡ 0.3530  0.1459        CNET-MI    0.0560†‡ 0.1487  0.0326
CNET-HAL   0.2503†‡ 0.3528  0.1463        CNET-HAL   0.0558†‡ 0.1475  0.0323
Table 3. GOV dataset results on (a) all queries and (b) difficult queries.
(a) All queries                           (b) Difficult queries
Method     MAP      P@20    GMAP          Method     MAP      P@5     GMAP
KL-DIR     0.2333   0.0464  0.0539        KL-DIR     0.0311   0.0281  0.014
TM         0.2399   0.0476  0.0551        TM         0.0343   0.0304  0.0146
NEIGH-MI   0.2415†‡ 0.0489  0.0518        NEIGH-MI   0.0333†  0.0307  0.013
NEIGH-HAL  0.2419†‡ 0.0456  0.0476        NEIGH-HAL  0.0425†‡ 0.0293  0.0122
DB-MI      0.2346   0.0467  0.0529        DB-MI      0.0312   0.0285  0.0136
DB-HAL     0.2404†  0.0467  0.053         DB-HAL     0.0306   0.0274  0.0134
FB-MI      0.2420†‡ 0.0484  0.0573        FB-MI      0.0350†‡ 0.0319  0.0154
FB-HAL     0.2404†  0.0476  0.0565        FB-HAL     0.0339†  0.0293  0.0152
CNET       0.2407†  0.0489  0.0584        CNET       0.0407†‡ 0.0333  0.0172
CNET-MI    0.2416†‡ 0.0504  0.0587        CNET-MI    0.0427†‡ 0.0367  0.0176
CNET-HAL   0.2428†‡ 0.0516  0.0586        CNET-HAL   0.0453†‡ 0.0385  0.0181
for newswire datasets (AQUAINT and ROBUST) on both regular and difficult
queries, with the HAL-based term association graph (NEIGH-HAL) outperform-
ing the term graphs derived from DBpedia and Freebase (DB-HAL and FB-HAL)
for all queries on the GOV collection. For difficult queries on the same dataset,
NEIGH-HAL outperforms Freebase- and DBpedia-based terms graphs and has
comparable performance with the term graphs derived from ConceptNet. We
attribute this to the fact that the term graph for GOV is larger in size and
less dense than the term graphs for AQUAINT and ROBUST, which results in
less noisy term associations. Second, using MI- and HAL-based weights of edges
in the ConceptNet graph (CNET-MI and CNET-HAL) results in better retrieval
accuracy than the original ConceptNet weights (CNET) in the majority of cases.
This indicates the utility of tuning the weights in term graphs derived from exter-
nal resources to particular collections. Finally, ConceptNet-based term graphs
outperformed Freebase- and DBpedia-based ones on 2 out of 3 collections used
in evaluation, which indicates the importance of commonsense knowledge in
addition to information about entities.
References
1. Bai, J., Song, D., Bruza, P., Nie, J.-Y., Cao, G.: Query expansion using term
relationships in language models for information retrieval. In: Proceedings of the
14th ACM CIKM, pp. 688–695 (2005)
2. Burgess, C., Livesay, K., Lund, K.: Explorations in context space: words, sentences
and discourse. Discourse Process. 25, 211–257 (1998)
3. Karimzadehgan, M., Zhai, C.: Estimation of statistical translation models based
on mutual information for ad hoc information retrieval. In: Proceedings of the 33rd
ACM SIGIR, pp. 323–330 (2010)
4. Kotov, A., Zhai, C.: Interactive sense feedback for difficult queries. In: Proceedings
of the 20th ACM CIKM, pp. 163–172 (2011)
5. Kotov, A., Zhai, C.: Tapping into knowledge base for concept feedback: leveraging
conceptnet to improve search results for difficult queries. In: Proceedings of the
5th ACM WSDM, pp. 403–412 (2012)
Deep Learning to Predict Patient Future Diseases
Abstract. The increasing cost of health care has motivated the drive
towards preventive medicine, where the primary concern is recognizing
disease risk and taking action at the earliest stage. We present an appli-
cation of deep learning to derive robust patient representations from
the electronic health records and to predict future diseases. Experiments
showed promising results in different clinical domains, with the best per-
formances for liver cancer, diabetes, and heart failure.
1 Introduction
Developing predictive approaches to maintain health and to prevent diseases,
disability, and death is one of the primary goals of preventive medicine. In this
context, information retrieval applied to electronic health records (EHRs) has
shown great promise in providing search engines that could support physicians
in identifying patients at risk of diseases given their clinical status. Most of the
works proposed in the literature, though, focus on only one specific disease at a
time (e.g., cardiovascular diseases [1], chronic kidney disease [2]), and patients
are often represented using ad-hoc descriptors manually selected by clinicians.
While appropriate for an individual task, this approach scales poorly, does not
generalize well, and misses patterns that are not already known.
EHRs are challenging to represent since they are high dimensional, sparse,
noisy, heterogeneous, and subject to random errors and systematic biases [3].
In addition, the same clinical concept is usually reported in different ways.
For example, a patient with “type 2 diabetes mellitus” can be identified by
hemoglobin A1C lab values greater than 7.0, presence of 250.00 ICD-9 code,
“diabetes mellitus” mentioned in the free-text clinical notes, and so on. Conse-
quently, it is hard to automatically derive robust descriptors for effective patient
indexing and retrieval. Representations based on raw vectors composed of all
the descriptors available in the hospital data warehouse have also been used [4].
However, these representations are sparse, noisy, and repetitive, and thus not ideal
for modeling the hierarchical information embedded in the EHRs.
$$ L_H(\mathbf{x}, \mathbf{z}) = - \sum_{k=1}^{d} \left[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \right]. \qquad (3) $$
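A minimal numpy sketch of one denoising autoencoder layer trained with masking noise and the reconstruction loss in (3); the layer size, noise level, and training loop are illustrative and not the configuration used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DenoisingAutoencoder:
    """One layer of a stacked denoising autoencoder: corrupt the input,
    encode, reconstruct, and minimize the cross-entropy L_H(x, z)."""
    def __init__(self, n_visible, n_hidden, noise=0.1, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))  # tied weights
        self.b = np.zeros(n_hidden)       # encoder bias
        self.c = np.zeros(n_visible)      # decoder bias
        self.noise, self.lr = noise, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def reconstruct(self, h):
        return sigmoid(h @ self.W.T + self.c)

    def step(self, x):
        # Masking noise: randomly zero a fraction `noise` of the inputs.
        x_tilde = x * (rng.random(x.shape) > self.noise)
        h = self.encode(x_tilde)
        z = self.reconstruct(h)
        # Cross-entropy L_H averaged over the mini-batch.
        loss = -np.mean(np.sum(x * np.log(z + 1e-9)
                               + (1 - x) * np.log(1 - z + 1e-9), axis=1))
        # Backprop for sigmoid output + cross-entropy, with tied weights.
        dz = z - x                                   # (batch, visible)
        dh = (dz @ self.W) * h * (1 - h)             # (batch, hidden)
        gW = x_tilde.T @ dh + dz.T @ h               # both uses of W contribute
        self.W -= self.lr * gW / len(x)
        self.b -= self.lr * dh.mean(axis=0)
        self.c -= self.lr * dz.mean(axis=0)
        return loss

# Toy run on random, sparse binary "patient vectors".
x = (rng.random((64, 100)) < 0.05).astype(float)
dae = DenoisingAutoencoder(n_visible=100, n_hidden=20)
for _ in range(50):
    loss = dae.step(x)
print(round(float(loss), 3))   # reconstruction loss decreases over the iterations
```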
3 Experimental Setup
This section describes the evaluation performed to validate the deep learning
framework for future disease prediction using the Mount Sinai data warehouse.
3.1 Dataset
The Mount Sinai Health System generates a high volume of structured, semi-
structured, and unstructured data as part of its healthcare and clinical opera-
tions. The entire EHR dataset is composed of approximately 4.2 million patients
as of March 2015, with 1.2 million of them having at least one diagnosed dis-
ease expressed as a numerical ICD-9 code. In this context, we considered all the
records till December 31, 2013 (i.e., “split-point”) as training data and all the
diseases diagnosed in 2014 as testing data.
We randomly selected 105,000 patients with at least one new disease diag-
nosed in 2014 and at least ten records before that (e.g., medications, lab tests,
diagnoses). These patients composed the validation (i.e., 5,000 patients) and test
(i.e., 100,000 patients) sets. In particular, all the diagnoses in 2014 were used
to validate the predictions computed using the patient data recorded before
the split-point (i.e., clinical status). We then sampled another 350,000 different
patients with at least ten records before the split-point to use as the training set.
The evaluation was performed on a vocabulary of 72 diseases, covering dif-
ferent clinical domains, such as oncology, endocrinology, and cardiology. This
was obtained by initially using the ICD-9 codes to determine the diagnosis of a
disease to a patient. However, since different codes can refer to the same disease,
we mapped the codes to a categorization structure, which groups ICD-9s into
a vocabulary of 231 general disease definitions [10]. This list was then filtered
down to remove diseases not present in the data warehouse or not considered
predictable using EHRs alone (e.g., physical injuries, poisoning), leading to the
final vocabulary.
³ While in this study we favored a basic pipeline to process EHRs, it should be noted that more sophisticated techniques might lead to better features as well as to better predictive results.
3.3 Evaluation
We first extracted all the descriptors available in the data warehouse related to
the EHR categories mentioned in Sect. 3.2 and removed those that were either
very frequent or rare in the training set. This led to vector-based patient repre-
sentations of 41,072 entries (i.e., “raw”).
We then applied a 3-layer SDA to the training set to derive the deep fea-
tures. The autoencoders in the network shared the same configuration with 500
hidden units and a noise corruption factor ν = 0.1. For comparison, we also
derived features using principal component analysis (i.e., “PCA” with 100 prin-
cipal components) and k-means clustering (i.e., “kMeans” with 500 centroids)4 .
Predictions were performed using random forests and SVMs with radial basis
function kernel. Deep features were also fine-tuned adding a logistic regression
layer on top of the last autoencoder as described in Sect. 2.1 (i.e., “sSDA”).
Hence, for all the model combinations, we computed the probability of each
test patient developing every disease in the vocabulary, and we evaluated how
many of these predictions were correct within a one-year interval⁵. For each disease,
we measured the area under the ROC curve (i.e., AUC-ROC) and the F-score (with
a classification threshold equal to 0.6).
Table 1. Future disease prediction results averaged over 72 diseases and 100,000
patients. The symbols (†) and (*) after a numeric value mean that the difference with
the corresponding second best measurement in the classification algorithm and overall,
respectively, is statistically significant (p ≤ 0.05, t-test).
4 Results
Table 1 shows the classification results averaged over all 72 diseases in the vocab-
ulary. As can be seen, SDA features lead to significantly better predictions than
“raw”, as well as than PCA and kMeans, with both classification models. In addi-
tion, fine-tuning the SDA features for the specific task further improved the final
results, with 50 % and 10 % improvements over “raw” in F-score and AUC-ROC,
respectively. Table 2 reports the top 5 performing diseases for sSDA based on
AUC-ROC, showing promising results in different clinical domains. Some dis-
eases in the vocabulary did not show high predictive power (e.g., HIV, ovarian
⁴ All parameters in the feature learning models were identified through preliminary experiments, not reported here for brevity, on the validation set.
⁵ This experiment only evaluates the prediction of new diseases for each patient, therefore not considering the re-diagnosis of a disease previously reported.
Table 2. Top 5 performing diseases for sSDA (with respect to AUC-ROC results).
5 Conclusion
This article demonstrates the feasibility of using deep learning to predict patients’
diseases from their EHRs. Future works will apply this framework to other clin-
ical applications (e.g., therapy recommendation) and will incorporate additional
EHR descriptors as well as more sophisticated pre-processing techniques.
References
1. Kennedy, E., Wiitala, W., Hayward, R., Sussman, J.: Improved cardiovascular risk
prediction using non-parametric regression and electronic health record data. Med
Care 51(3), 251–258 (2013)
2. Perotte, A., Ranganath, R., Hirsch, J.S., Blei, D., Elhadad, N.: Risk prediction for
chronic disease progression using heterogeneous electronic health record data and
time series analysis. J Am Med Inform Assoc 22(4), 872–880 (2015)
3. Jensen, P.B., Jensen, L.J., Brunak, S.: Mining electronic health records: towards
better research applications and clinical care. Nat. Rev. Genet. 13(6), 395–405
(2012)
4. Wu, J., Roy, J., Stewart, W.: Prediction modeling using EHR data: Challenges,
strategies, and a comparison of machine learning approaches. Med. Care 48(Suppl
6), 106–113 (2010)
5. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new
perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)
6. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444
(2015)
7. Helmstaedter, M., Briggman, K.L., Turaga, S.C., Jain, V., Seung, H.S., Denk,
W.: Connectomic reconstruction of the inner plexiform layer in the mouse retina.
Nature 500(7461), 168–174 (2013)
8. Ma, J.S., Sheridan, R.P., Liaw, A., Dahl, G.E., Svetnik, V.: Deep neural nets as
a method for quantitative structure-activity relationships. J. Chem. Inf. Model
55(2), 263–274 (2015)
9. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.A.: Stacked denois-
ing autoencoders: learning useful representations in a deep network with a local
denoising criterion. J. Mach. Learn Res. 11, 3371–3408 (2010)
774 R. Miotto et al.
10. Cowen, M.E., Dusseau, D.J., Toth, B.G., Guisinger, C., Zodet, M.W., Shyr, Y.:
Casemix adjustment of managed care claims data using the clinical classification
for health policy research method. Med. Care 36(7), 1108–1113 (1998)
11. LePendu, P., Iyer, S., Fairon, C., Shah, N.: Annotation analysis for testing drug
safety signals using unstructured clinical notes. J. Biomed. Semant. 3(S–1), S5
(2012)
12. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. J. Mach. Learn. Res. 3,
993–1022 (2003)
Improving Document Ranking for Long Queries
with Nested Query Segmentation
1 Introduction
Query segmentation [1–5] is one of the first steps towards query understand-
ing where complex queries are partitioned into semantically coherent word
sequences. Past research [1–5] has shown that segmentation can potentially lead
to better IR performance. Till date, almost all the works on query segmenta-
tion have dealt with flat or non-hierarchical segmentations, such as: windows xp
home edition | hd video | playback, where pipes (|) represent flat segment
boundaries. For short queries of up to three or four words, such flat segmenta-
tion may suffice. However, slightly longer queries of about five to ten words
are increasing over the years ( 27 % in our May 2010 Bing Australia log) and
present a challenge to the search engine. One of the shortcomings of flat segmen-
tation is that it fails to capture the relationships between segments which can
provide important information towards the document ranking strategy, particu-
larly in the case of a long query.
These relationships can be discovered if we allow nesting of segments inside
bigger segments. For instance, instead of a flat segmentation, our running exam-
ple query could be more meaningfully represented as in Fig. 1. Here, the atomic
segments – windows xp and hd video – are progressively joined with other words
This research was completed while the author was at IIT Kharagpur.
where the t_i are query terms matched in the document and n(q) is the nested
segmentation of q. However, we do not wish to penalize D when the words
are close by in the document but relatively far apart in the tree. This analysis
drives us to create a tree distance threshold (cut-off) parameter δ. In other
words, only if td(a, b; n(q)) < δ is the word pair a and b considered in
the computation of RrSV. The original rank for a page (obtained using TF-IDF
scoring, say) and the new rank obtained using RrSV are fused using the
method of Agichtein et al. [8], with w as a tuning parameter. We refer to
this entire strategy as the Tree model. We use three re-ranking baselines: flat
segmentation (word pairs are limited to cases where both words come from a
single flat segment), document distances only (no scaling using tree distance; Doc
model), and query distances (scaling document distances using query distances
(Sect. 1); Query model).
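A minimal sketch of computing the tree distance td(a, b; n(q)) on a nested segmentation represented as nested tuples; the representation and the running-example nesting are our assumptions, since the figure and the RrSV formula are not reproduced here:

```python
def leaf_paths(tree, path=()):
    """Map each leaf word of a nested segmentation (nested tuples of strings)
    to the sequence of child indices leading to it (assumes distinct words)."""
    if isinstance(tree, str):
        return {tree: path}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (i,)))
    return paths

def tree_distance(a, b, nested_query):
    """Number of edges on the path between two query words in the nesting tree."""
    paths = leaf_paths(nested_query)
    pa, pb = paths[a], paths[b]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

# Assumed nesting of the running example: ((windows xp) home edition) ((hd video) playback)
nq = ((("windows", "xp"), "home", "edition"), (("hd", "video"), "playback"))
print(tree_distance("windows", "xp", nq))        # 2: siblings in the same atomic segment
print(tree_distance("windows", "playback", nq))  # 5: far apart in the tree
```

A word pair would then contribute to the re-ranking score only when its tree distance is below the cut-off δ.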
Dataset: SGCL12
Metric    Unseg    Hagen et al. [5]    Mishra et al. [4]   Saha Roy et al. [2]  Huang et al. [9]
                   Flat     Nested     Flat     Nested     Flat     Nested      Nested
nDCG@5    0.6839   0.6815   0.6982     0.6977   0.6976     0.6746   0.7000†     0.6996
nDCG@10   0.6997   0.7081   0.7262†    0.7189   0.7274     0.7044   0.7268†     0.7224
nDCG@20   0.7226   0.7327   0.7433†    0.7389   0.7435     0.7321   0.7433†     0.7438
MAP       0.8337   0.8406   0.8468†    0.8411   0.8481†    0.8423   0.8477      0.8456
Dataset: TREC-WT
Metric    Unseg    Flat     Nested     Flat     Nested     Flat     Nested      Nested
nDCG@5    0.1426   0.1607   0.1750†    0.1604   0.1752†    0.1603   0.1767†     0.1746
nDCG@10   0.1376   0.1710   0.1880†    0.1726   0.1882†    0.1707   0.1884†     0.1845
nDCG@20   0.1534   0.1853   0.1994†    0.1865   0.2000†    0.1889   0.2010†     0.1961
MAP       0.2832   0.2877   0.3298†    0.3003   0.3284†    0.3007   0.3296†     0.3263
the average tree height is 2.96 for our nesting strategy, while it is about 2.23 for
Huang et al. (SGCL12). Note that due to the strict binary partitioning at each
step for Huang et al., one would normally expect a greater average tree height
for this method. Thus, it is the inability of Huang et al. to produce a suitably
deep tree for most queries (inability to discover fine-grained concepts) that is
responsible for its somewhat lower performance. Most importantly, the fact that all nesting
strategies fare favorably (none of the differences between Huang et al. and the other
nesting methods are statistically significant) with respect to flat segmentation
bodes well for the usefulness of nested segmentation.
(c) Comparison of Re-ranking Strategies: We find that the Tree model performs
better than the Doc and Query models. We observed that the numbers of queries on
which Doc, Query, and Tree perform best (possibly multiple strategies) are
102, 94, and 107 (SGCL12, 250 test queries) and 30, 29.7, and 30.8 (TREC-WT, 40 test
queries, mean over 10 splits), respectively.
4 Conclusions
This research is one of the first systematic explorations of nested query segmentation.
We have shown that the tree structure inherent in the hierarchical segmentation
can be used for effective re-ranking of result pages (about 7 % nDCG@10
improvement over the unsegmented query for SGCL12 and 40 % for TREC-WT).
Importantly, since n-gram scores can be computed offline, our algorithms have
minimal runtime overhead. We believe that this work will generate sufficient
interest and that several improvements over the present scheme will be proposed
in the near future. In fact, nested query segmentation can be viewed as the
first step towards query parsing, and can lead to a generalized query grammar.
References
1. Li, Y., Hsu, B.J.P., Zhai, C., Wang, K.: Unsupervised query segmentation using
clickthrough for information retrieval. In: SIGIR 2011, pp. 285–294 (2011)
2. Saha Roy, R., Ganguly, N., Choudhury, M., Laxman, S.: An IR-based evaluation
framework for web search query segmentation. In: SIGIR 2012, pp. 881–890 (2012)
3. Tan, B., Peng, F.: Unsupervised query segmentation using generative language mod-
els and Wikipedia. In: WWW 2008, pp. 347–356 (2008)
4. Mishra, N., Saha Roy, R., Ganguly, N., Laxman, S., Choudhury, M.: Unsupervised
query segmentation using only query logs. In: WWW 2011, pp. 91–92 (2011)
5. Hagen, M., Potthast, M., Stein, B., Bräutigam, C.: Query segmentation revisited.
In: WWW 2011, pp. 97–106 (2011)
6. Cummins, R., O’Riordan, C.: Learning in a pairwise term-term proximity framework
for information retrieval. In: SIGIR 2009, pp. 251–258 (2009)
Improving Document Ranking for Long Queries 781
7. Chaudhari, D.L., Damani, O.P., Laxman, S.: Lexical co-occurrence, statistical sig-
nificance, and word association. In: EMNLP 2011, pp. 1058–1068 (2011)
8. Agichtein, E., Brill, E., Dumais, S.: Improving web search ranking by incorporating
user behavior information. In: SIGIR 2006, pp. 19–26 (2006)
9. Huang, J., Gao, J., Miao, J., Li, X., Wang, K., Behr, F., Giles, C.L.: Exploring web
scale language models for search query processing. In: WWW 2010, pp. 451–460
(2010)
Sketching Techniques for Very Large
Matrix Factorization
1 Introduction
Collaborative filtering is one of the successful techniques used in modern rec-
ommender systems. It uses the user-item rating relationship, materialized by
a large and sparse matrix R, to provide recommendations. The common scenario
in collaborative filtering systems is that maintaining the entire matrix R
in memory is impractical because it consumes too much memory. Also, the
observed elements of R are in many cases not even available as static data
beforehand, but instead arrive as a stream of tuples. Alternative representations for R
have been invented, such as the popular latent factor model, which maps both
users and items to a low-dimensional representation, retaining pairwise similar-
ity. Matrix factorization learns these representation vectors from some observed
ratings, to predict (by inner product) the missing entries and thereby to fill the
incomplete user-item matrix [4]. Compared to other methods, matrix factoriza-
tion is simple, space efficient and can generalize well with missing information.
It can be easily customized with different loss functions, regularization methods
and optimization techniques. The storage requirements for the latent factors are
significantly lower; however, the memory required for the factors grows linearly
with the number of users and items. This becomes increasingly cumbersome
when the numbers are in the millions, a frequent real-world situation for domains
like advertising and search personalization. The situation is complicated further
when there are new incoming users and/or items. Overall, supporting real-world,
large-scale, and dynamic recommendation applications calls for a much
more compact representation that can be updated efficiently while facilitating
the insertion of new users and new items. This paper proposes to use
sketching techniques to represent the latent factors in order to achieve the above
goals. Sketching techniques enable extremely compact representations of
the parameters, which helps scaling. They also, by construction, facilitate updates
and inserts. We find through experimental results that sketch-based factorization
improves storage efficiency without compromising much on prediction quality.
2 Background
2.1 Matrix Factorization
where L′_{u,i} = ∂L(r_{u,i}, r̂_{u,i})/∂r̂_{u,i}. As the algorithm is sequential, only the latent
factors have to be stored in main memory. However, the number of latent factors
is linear in the number of users and items, which is problematic for large-scale
and dynamic environments. Allocating a d-dimensional vector to every
sporadically recurring new user or item quickly becomes intractable. A better
representation for the factors is therefore needed. We propose the use of count
sketches, which are typically used in other contexts.
reconstruct user and item latent vectors. Both the user ID u and the component index
l, 1 ≤ l ≤ d, are used as inputs to the k pairs of address and sign hash functions
(h_j^{u,l} = h_j(u, l), s_j^{u,l} = s_j(u, l)) to get a mean estimate of the vector component
as follows (the same holds for items): ∀l ∈ {1, . . . , d},

$$ \tilde{p}_{u,l} = \frac{1}{k} \sum_{j=1}^{k} s_j^{u,l} \cdot c_{j,\, h_j^{u,l}}, \qquad \tilde{q}_{i,l} = \frac{1}{k} \sum_{j=1}^{k} s_j^{i,l} \cdot c_{j,\, h_j^{i,l}} \qquad (2) $$
Yet for the same user or item, the kd accessed cells in the count sketch
structure are not adjacent but at random locations addressed by the hash functions.
The estimated rating r̃_{u,i} = p̃_u^⊤ q̃_i is compared with r_{u,i} to get the loss
L(r_{u,i}, r̃_{u,i}) and the derivative L′_{u,i} w.r.t. r̃_{u,i}. The gradient updates for p̃_u and q̃_i
are computed just as in (1). Then, for each component of p̃_u (as well as of q̃_i), the
k respective cells C are updated with their sign corrected gradients: ∀l ∈ {1...d}
δcj,hu,l = −ηsu,l j Lu,i .q̃i,l + λp̃u,l , δcj,hi,l = −ηsi,l
j Lu,i .p̃u,l + λq̃i,l (3)
j j
The expression of the vector $\partial \tilde{p}_u / \partial c_{j,m}$ is derived from the read access to the count
sketch (2): its $l$-th component equals $s_j^{u,l}\, k^{-1}\, [h_j^{u,l} = m]$ (the same holds for $\partial \tilde{q}_i / \partial c_{j,m}$). In
the end, this yields the following update rule: $\forall l \in \{1,\dots,d\}$

$$\delta c_{j,m} = -\frac{\eta}{k}\Bigl(\bigl(s_j^{u,l} L_{u,i}\,\tilde{q}_{i,l} + \lambda \tilde{p}_{u,l}\bigr)[h_j^{u,l} = m] + \bigl(s_j^{i,l} L_{u,i}\,\tilde{p}_{u,l} + \lambda \tilde{q}_{i,l}\bigr)[h_j^{i,l} = m]\Bigr)$$
We recover the same update rules as in (3), up to a factor $k^{-1}$ due to the read
access to the count sketch based on the mean operator. However, this is not an
issue as the gradient is in the end multiplied by the learning rate $\eta$.
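To make the read and update operations above concrete, the following Python sketch stores all latent components in a k × w count-sketch array and implements the mean-based read of (2) and a sign-corrected write-back in the spirit of (3). This is only a minimal illustration: the salted hash construction and the squared loss are assumptions, not the authors' exact implementation.

```python
import numpy as np

class SketchedFactors:
    """Count-sketch storage for all user and item latent factors (illustrative)."""

    def __init__(self, k=4, w=2**18, d=32, seed=0):
        self.k, self.w, self.d = k, w, d
        self.C = np.zeros((k, w))                      # k rows of w shared cells
        rng = np.random.RandomState(seed)
        self.salts = rng.randint(1, 2**31, size=(k, 2))

    def _hash(self, key, l, j):
        # Address and sign hashes for row j; simple salted hashes stand in for the
        # pairwise-independent hash functions assumed by a count sketch.
        a, b = self.salts[j]
        h = hash((key, l, int(a))) % self.w
        s = 1 if hash((key, l, int(b))) % 2 == 0 else -1
        return h, s

    def read(self, key):
        """Mean estimate of a d-dimensional latent vector (read rule of Eq. 2)."""
        v = np.zeros(self.d)
        for l in range(self.d):
            acc = 0.0
            for j in range(self.k):
                h, s = self._hash(key, l, j)
                acc += s * self.C[j, h]
            v[l] = acc / self.k
        return v

    def update(self, key, grad, eta):
        """Write the sign-corrected gradient into the k cells of each component."""
        for l in range(self.d):
            for j in range(self.k):
                h, s = self._hash(key, l, j)
                self.C[j, h] -= eta * s * grad[l]


def sgd_step(sketch, u, i, r, eta=0.01, lam=0.02):
    # One SGD step on a single observed rating, with a squared loss assumed.
    p, q = sketch.read(("user", u)), sketch.read(("item", i))
    err = p @ q - r                                    # plays the role of L_{u,i}
    sketch.update(("user", u), err * q + lam * p, eta)
    sketch.update(("item", i), err * p + lam * q, eta)
```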
4 Experiments
In this section, we benchmark our method against regular online matrix fac-
torization and feature hashing based factorization [3] as it is a special case of
our approach (k = 1). We use four publicly available datasets: MovieLens 1M
and 10M, EachMovie, and Netflix. Data characteristics, along with results, are
given in Table 1. The data is preprocessed and randomly partitioned into train-
ing, validation, and test sets with proportions [0.8, 0.1, 0.1]. User, item, and global
means are subtracted from the ratings to remove user and item bias. Ratings with
user/item frequency < 10 are removed from the test and validation sets. The
same procedure is repeated 10 times to obtain a 10-fold dataset. We use root
mean square error to measure the quality of recommendations:
$$\mathrm{RMSE}(R') = \sqrt{\frac{1}{\|R'\|_0}\sum_{r_{u,i} \in R'} \bigl(\tilde{p}_u^\top \tilde{q}_i - r_{u,i}\bigr)^2},$$
where $R'$ is the restriction of $R$ to the testing
set. We compare the performance for various configurations (w, k) of the sketch
and different latent factor dimensions d. The sketch depth k is picked from {1, 4}
and the latent factor dimension d is chosen from {1, 2, 3, 4, 6, 8, 11, 16, 23, 32}.
We measure the space gain γ by the ratio of space that the regular factorization
would need for the same dimension d to the space actually utilized by sketch-based
factorization. We vary γ within {1, 2, 2.83, 4, 5.66, 8, 11.31, 16, 23, 32}. For
a given setup $(\gamma, d, k)$, we determine the sketch width as $w = \frac{(|U|+|I|)\,d}{\gamma k}$. We
choose the optimal learning rate η and regularization constant λ by a two-stage
line search in log-scale, based on the validation-set prediction score. We also
initialize the parameters to uniformly sampled, small random values. We iterate
for T = 20 epochs over the training set before predicting on the testing set. The
learning rate is scaled down at every iteration using the formula $\eta_t = \frac{\eta}{1 + t/T}$.
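As a small illustration of how the experimental configuration is derived, the snippet below computes the sketch width for a given space gain and applies the learning-rate decay; the user and item counts are placeholders, not exact dataset statistics.

```python
def sketch_width(n_users, n_items, d, gamma, k):
    # w = (|U| + |I|) * d / (gamma * k), rounded down to an integer number of cells
    return int((n_users + n_items) * d / (gamma * k))

def decayed_learning_rate(eta, t, T=20):
    # eta_t = eta / (1 + t / T)
    return eta / (1.0 + t / T)

# Example with placeholder sizes roughly in the MovieLens 1M range
w = sketch_width(n_users=6040, n_items=3706, d=32, gamma=8, k=4)
```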
Fig. 1. (a) Heatmaps of RMSE for feature hashing (left) and count sketch (right) on
EachMovie. (b) Convergence on MovieLens 10 M. (c) Dynamic setting on EachMovie.
The RMSE increases as the space gain γ increases, because a larger γ implies a
smaller sketch width w and hence a higher variance for p̃u and q̃i. We also observe
an improvement in the RMSE score with higher k; this effect is amplified for low γ values.
5 Conclusion
The memory intensive nature of matrix factorization techniques calls for efficient
representations of the learned factors. This work investigated the use of count
sketch for storing the latent factors. Its compact and controllable representation
makes it a good candidate for efficient storage of these parameters. We show
that the optimization of the latent factors through the count sketch storage is
indeed equivalent to finding the optimal count sketch structure for predicting
the observed ratings. Experimental evaluations show the trade-off between per-
formance and space and also reveal that count-sketch factorization needs less
data for training. This property is very useful in dynamic settings.
References
1. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams.
In: ICALP (2002)
2. Cormode, G.: Sketch techniques for approximate query processing. In: Foundations
and Trends in Databases. NOW publishers (2011)
3. Karatzoglou, A., Weimer, M., Smola, A.J.: Collaborative filtering on a budget. In:
AISTATS (2010)
4. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender
systems. Computer 8, 30–37 (2009)
Diversifying Search Results Using Time
An Information Retrieval Method for Historians
1 Introduction
Large born-digital document collections are a treasure trove of historical knowl-
edge. Searching these large longitudinal document collections is only possible if
we take into account the temporal dimension to organize them. We present a
method for diversifying search results using temporal expressions in document
contents. Our objective is to specifically address the information need underly-
ing history-oriented queries; we define them to be keyword queries describing a
historical event or entity. An ideal list of search results for such queries should
constitute a timeline of the event or portray the biography of the entity. This
work should yield a useful tool for scholars in history and the humanities who would
like to search large text collections with history-oriented queries without knowing
the relevant dates a priori.
No work, to the best of our knowledge, has addressed the problem of diver-
sifying search results using temporal expressions in document contents. Prior
approaches in the direction of diversifying documents along time have relied
largely on the publication dates of documents. However, a document's publication
date may not necessarily be the time that the text refers to. It is quite com-
mon to have articles that contain a historical perspective on a past event from
the current time. Hence, the use of publication dates is clearly insufficient for
history-oriented queries.
In this work, we propose a probabilistic framework to diversify search results
using temporal expressions (e.g., 1990s) from their contents. First, we identify
time intervals of interest to a given keyword query, using our earlier work [7],
which extracts them from pseudo-relevant documents. Having identified time
intervals of interest (e.g., [2000,2004] for the keyword query george w. bush), we
use them as aspects for diversification. More precisely, we adapt a well-known
diversification method [1] to determine a search result that consists of relevant
documents which cover all of the identified time intervals of interest.
Evaluation of historical text can be highly subjective and biased in nature.
To overcome this challenge, we view the evaluation of our approach from a
statistical perspective and adopt an objective evaluation measure from automatic
summarization to assess the effectiveness of our methods. We create a large
history-oriented query collection consisting of long-lasting wars, important
events, and eminent personalities drawn from reliable encyclopedic resources and
prior research. As ground truth we use the Wikipedia1 articles concerning the
queries. We evaluate our methods on two large document collections, the New
York Times Annotated corpus and the Living Knowledge corpus. Our approach
is thus tested on two different types of textual data: one highly authoritative in
nature, in the form of news articles, and the other authored by real-world users,
in the form of web documents. Our results show that, using our method of
diversifying search results using time, we can present documents that serve the
information need of a history-oriented query very well.
2 Method
Notation. We consider a document collection D. Each document d ∈ D consists
of a multiset of keywords dtext drawn from vocabulary V and a multiset of
temporal expressions dtime . Cardinalities of the multisets are denoted by |dtext |
and |dtime|. To model temporal expressions such as 1990s, where the begin and
end of the interval cannot be identified exactly, we utilize the work by Berberich et al. [3].
They allow for this uncertainty in the time interval by associating lower and
upper bounds with its begin and end. Thus, a temporal expression T is represented
by a four-tuple $\langle b_l, b_u, e_l, e_u \rangle$, where the time interval [b, e] has its begin bounded
as $b_l \le b \le b_u$ and its end bounded as $e_l \le e \le e_u$. The temporal expression
1990s is thus represented as $\langle 1990, 1999, 1990, 1999 \rangle$. More concretely, elements
of temporal expression T are from time domain T and intervals from T × T .
The number of such time intervals that can be generated is given by |T |.
Time Intervals of Interest to the given keyword query qtext are identified
using our earlier work [7]. A time interval [b, e] is deemed interesting if it is referred to
1 https://en.wikipedia.org/.
frequently by highly relevant documents of the given keyword query. This intu-
ition is modeled as a two-step generative model. Given a set of pseudo-relevant
documents R, a time interval [b, e] is deemed interesting with probability:

$$P([b, e] \mid q_{text}) = \sum_{d \in R} P([b, e] \mid d_{time})\, P(d_{text} \mid q_{text}).$$
To diversify search results, we keep all the time intervals generated with their
probabilities in a set qtime .
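A minimal sketch of this estimation step is shown below; the per-document interval distributions P([b,e] | d_time) and the query-likelihood scores P(d_text | q_text) are assumed to be given (e.g., from the retrieval model and the annotated temporal expressions) and are represented here as plain dictionaries.

```python
from collections import defaultdict

def interval_interest(pseudo_relevant, interval_given_doc, doc_given_query):
    """P([b,e] | q) = sum over pseudo-relevant d of P([b,e] | d_time) * P(d_text | q_text)."""
    q_time = defaultdict(float)
    for d in pseudo_relevant:
        p_d = doc_given_query[d]                     # P(d_text | q_text)
        for interval, p_int in interval_given_doc[d].items():
            q_time[interval] += p_int * p_d          # accumulate interest mass
    return dict(q_time)                              # {(b, e): probability}
```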
Temporal Diversification. To diversify search results we adapt the approach
proposed by Agrawal et al. [1]. Formally, the objective is to maximize the prob-
ability that the user sees at least one result relevant to her time interval of
interest. We thus aim to determine a query result S ⊆ R that maximizes
$$\sum_{[b,e] \in q_{time}} P([b, e] \mid q_{text}) \left(1 - \prod_{d \in S} \bigl(1 - P(q_{text} \mid d_{text})\, P([b, e] \mid d_{time})\bigr)\right).$$
The probability P ([b, e] | qtext ) is estimated as described above and reflects the
salience of time interval [b, e] for the given query. We make an independence
assumption and estimate the probability that document d is relevant and cov-
ers the time interval [b, e] as P (qtext | dtext ) P ([b, e] | dtime ). To determine the
diversified result set S, we use the greedy algorithm described in [1].
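The greedy selection can be sketched as follows: at each step it adds the document that most increases the objective above, which amounts to keeping, for every interval, the residual probability that no selected document has covered it yet. This is an illustrative re-implementation in the spirit of Agrawal et al. [1], not the authors' exact code; relevance and coverage probabilities are assumed to be precomputed.

```python
def greedy_diversify(candidates, q_time, rel, cov, k=10):
    """Greedily pick k documents maximizing expected coverage of time intervals.

    q_time: {interval: P(interval | query)}
    rel:    {doc: P(query_text | doc_text)}
    cov:    {doc: {interval: P(interval | doc_time)}}
    """
    residual = dict(q_time)          # probability mass not yet covered per interval
    selected = []
    pool = set(candidates)
    for _ in range(min(k, len(pool))):
        def gain(d):
            return sum(residual[iv] * rel[d] * cov[d].get(iv, 0.0) for iv in residual)
        best = max(pool, key=gain)
        selected.append(best)
        pool.remove(best)
        # shrink the residual mass for intervals the chosen document covers
        for iv in residual:
            residual[iv] *= 1.0 - rel[best] * cov[best].get(iv, 0.0)
    return selected
```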
3 Evaluation
Document Collections. We used two document collections: one from a news
archive and one from a web archive. The Living Knowledge2 corpus is a col-
lection of news and blogs on the Web amounting to approximately 3.8 mil-
lion documents [8]. The documents are provided with annotations for temporal
expressions as well as named-entities. The New York Times (NYT) Annotated3
corpus is a collection of news articles published in The New York Times. It
covers the period from 1987 to 2007 and consists of around 2 million news articles.
The temporal annotations for it were done via SUTime [6]. Both explicit and
implicit temporal expressions were annotated, resolved, and normalized.
Indexing. The document collections were preprocessed and subsequently
indexed using the ElasticSearch software4 . As an ad-hoc retrieval baseline
and for retrieval of the pseudo-relevant set of documents, we utilized the state-of-
the-art Okapi-BM25 retrieval model implemented in ElasticSearch.
Collecting History-Oriented Queries. In order to evaluate the usefulness of
our method for scholars in history, we need to find keyword queries that are highly
2 http://livingknowledge.europarchive.org/.
3 https://catalog.ldc.upenn.edu/LDC2008T19.
4 https://www.elastic.co/.
792 D. Gupta and K. Berberich
ambiguous in the temporal domain. That is, multiple interesting time intervals are
associated with the queries. For this purpose we considered three categories of
history-oriented queries: long-lasting wars, recurring events, and famous person-
alities. For constructing the queries we utilized reliable sources on the Web and
data presented in prior research articles [7,9]. Queries for long-lasting wars were
constructed from the WikiWars corpus [9]. The corpus was created for the purpose
of temporal information extraction. For ambiguous important events we utilized
the set of ambiguous queries used in our earlier work [7]. For famous personali-
ties we use a list of most influential people available on the USA Today5 website.
The names of these famous personalities were used based on the intuition that
there would important events associated with them at different points of time.
The keyword queries are listed in our accompanying technical report [11]. The
entire testbed is publicly available at the following url:
http://resources.mpi-inf.mpg.de/dhgupta/data/ecir2016/.
The objective of our method is to present documents that depict the histor-
ical timeline or biography associated with a keyword query describing an event or
entity. We thus treat the diversified set of documents as a historical sum-
mary of the query. In order to evaluate this diversified summary we obtain the
corresponding Wikipedia (see footnote 1) pages of the queries as ground truth
summaries.
Baselines. We considered three baselines, with increasing sophistication. As a
naı̈ve baseline, we first consider the pseudo-relevant documents retrieved for the
given keyword query. The next two baselines use a well-known implicit diversifi-
cation algorithm, maximum marginal relevance (MMR) [5]. Formally, it is defined
as: $\arg\max_{d \in R \setminus S} \bigl[\lambda \cdot sim_1(q, d) - (1-\lambda) \cdot \max_{d' \in S} sim_2(d', d)\bigr]$. MMR was simulated
with $sim_1$ using query likelihoods and $sim_2$ using cosine similarity between the
term-frequency vectors of the documents.
with λ = 0.5 giving equal importance to query likelihood and diversity. While
the final baseline considered MMR with λ = 0.0 indicating complete diversity.
For all methods the summary is constructed by concatenating all the top-k doc-
uments into one large document.
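For reference, a minimal MMR re-ranking sketch under these assumptions could look as follows; sim1 (per-document query-likelihood score) and sim2 (cosine similarity of term-frequency vectors) are placeholder callables supplied by the retrieval pipeline.

```python
import numpy as np

def mmr_rerank(docs, sim1, sim2, lam=0.5, k=10):
    """Greedy MMR: trade off relevance (sim1) against redundancy (sim2)."""
    selected, remaining = [], list(docs)
    while remaining and len(selected) < k:
        def mmr_score(d):
            redundancy = max((sim2(d2, d) for d2 in selected), default=0.0)
            return lam * sim1(d) - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

def cosine(u, v):
    # cosine similarity between two term-frequency vectors (numpy arrays)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0
```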
Parameters. There are two parameters to our system: (i) The number of docu-
ments considered for generating time intervals of interest |R| and (ii) The number
of documents considered for historical summary |S|. We consider the following
settings of these parameters: |R| ∈ {100, 150, 200} and |S| ∈ {5, 10}.
Metrics. We use the Rouge-N measure [12] (implementation6 ) to evaluate the
historical summary constituted by diversified set of documents with respect to
the ground truth. Rouge-N is a recall-oriented metric which reports the number
of n-grams matches between a candidate summary and a reference summary.
The n in ngram is the length of the gram to be considered; we limit ourselves to
n ∈ {1, 3}. We report the recall, precision, and Fβ=1 for each Rouge-N measure.
5 http://usatoday30.usatoday.com/news/top25-influential.htm.
6 http://www.berouge.com/Pages/default.aspx.
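As a rough illustration of the metric (not the official ROUGE implementation referenced in footnote 6), Rouge-N recall, precision, and F1 can be computed from n-gram multisets as follows:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate_tokens, reference_tokens, n=1):
    """Multiset n-gram overlap: recall, precision, and F1 in the style of Rouge-N."""
    cand, ref = ngrams(candidate_tokens, n), ngrams(reference_tokens, n)
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1
```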
4 Related Work
Diversifying search results using time was first explored in [2]. In their prelimi-
nary study, the authors limited themselves to using document publication dates,
but posed the open problem of diversifying search results using temporal expres-
sions in document contents and the challenging problem of evaluation. Both these
aspects have been adequately addressed in our article. More recently, Nguyen
and Kanhabua [10] diversify search results based on dynamic latent topics. The
authors study how the subtopics for a multi-faceted query change with time. For
this they utilize a time-stamped document collection and an external query log.
However, for the temporal analysis they limit themselves to document publication
dates. The recent survey of temporal information retrieval by Campos et al. [4]
also highlights the lack of any research that addresses the challenges of utilizing
temporal expressions in document contents for search result diversification.
5 Conclusion
In this work, we considered the task of diversifying search results by using tem-
poral expressions in document contents. Our proposed probabilistic framework
utilized time intervals of interest derived from the temporal expressions present
in pseudo-relevant documents and then subsequently using them as aspects for
diversification along time. To evaluate our method we constructed a novel testbed
of history-oriented queries derived from authoritative resources and their corre-
sponding Wikipedia entries. We showed that our diversification method presents
a more complete retrospective set of documents for the given history-oriented
query set. This work is largely intended to help scholars in history and humani-
ties to explore large born-digital document collections quickly and find relevant
information without knowing time intervals of interest to their queries.
References
1. Agrawal, R., et al.: Diversifying search results. In: WSDM (2009)
2. Berberich, K., Bedathur, S.: Temporal diversification of search results. In: TAIA
(2013)
3. Berberich, K., Bedathur, S., Alonso, O., Weikum, G.: A language modeling app-
roach for temporal information needs. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz,
U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS,
vol. 5993, pp. 13–25. Springer, Heidelberg (2010)
4. Campos, R.: Survey of temporal information retrieval and related applications. ACM
Comput. Surv. 47(2), 15:1–15:41 (2014)
5. Carbonell, J.G., Goldstein, J.: The use of MMR, diversity-based reranking for
reordering documents and producing summaries. In: SIGIR (1998)
6. Chang, A.X., Manning, C.D.: SUTime: a library for recognizing and normalizing time
expressions. In: LREC (2012)
7. Gupta, D., Berberich, K.: Identifying time intervals of interest to queries. In: CIKM
(2014)
8. Joho, H., et al.: NTCIR temporalia: A test collection for temporal information
access research. In: WWW (2014)
9. Mazur, P.P., Dale, R.: WikiWars: a new corpus for research on temporal expressions.
In: EMNLP (2010)
10. Nguyen, T.N., Kanhabua, N.: Leveraging dynamic query subtopics for time-aware
search result diversification. In: de Rijke, M., Kenter, T., de Vries, A.P., Zhai, C.X.,
de Jong, F., Radinsky, K., Hofmann, K. (eds.) ECIR 2014. LNCS, vol. 8416, pp.
222–234. Springer, Heidelberg (2014)
11. Gupta, D., Berberich, K.: Diversifying search results using time. Research Report
MPI-I–5-001 (2016)
12. Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: ACL (2004)
On Cross-Script Information Retrieval
1 College of Computer and Information Science, Northeastern University, Boston, USA
[email protected]
2 Center for Intelligent Information Retrieval, College of Information and Computer Sciences,
University of Massachusetts Amherst, Amherst, USA
[email protected]
1 Introduction
The Web contains huge amounts of user-generated text in different writing systems and
languages, but most popular platforms lack the mechanism of implicitly cross-matching
Romanized versus native script texts. Twitter’s language identifiers seem to only attempt
to detect a language when written in its native/official character set. While it succeeds
at identifying Arabic most of the time, Twitter neither detects Arabizi tweets as Arabic
nor counts Arabizi as a stand-alone language. Therefore, potentially
novel and pertinent content is unreachable by simple search. Our proposed method for
identifying Arabizi is intended to help with that challenge. The contributions of this
paper are the following: (1) We describe an Arabic to Arabizi transliteration that works
N. Naji−This work was done while the author was at the University of Massachusetts
Amherst, supported by the Swiss National Science Foundation Early Postdoc.Mobility
fellowship project P2NEP2_151940.
in the absence of lexica and parallel corpora. (2) We develop an approach to evaluate
the quality of such a transliterator. (3) We demonstrate that our transliterator is superior
to reasonable automatic baselines for identifying valid Arabizi transliterations. (4) We
make the annotated data publicly available for future research1.
2 Related Work
The problem of spelling variation in Romanized Arabic has been studied closely for
tasks such as named entity recognition, machine translation (MT) of Arabic names
[8], and conversion of English to Arabic [9]. However, to the best of our knowledge,
no work has been done so far on cross-script Information Retrieval (CSIR) for the Arabic
language. Some studies addressed dialect identification in Arabic or Arabizi [1, 3–5]
and statistical MT from Arabizi to English via de-romanization to Arabic [10]. Arabic
to Arabizi conversion has only been done as one-to-one mapping such as Qalam2 and
Buckwalter3 resulting in Romanized vowel-less text. Darwish [2] uses a Conditional
Random Field (CRF) to identify Arabizi from a corpus of mixed English and Arabic
tweets with accuracy of 98.5 %. We are typically transcribing single words or short
phrases, where the CRF rules do not work well. Gupta et al.’s work on mixed-script IR
(MSIR) [6, 7] proposes a query expansion method to retrieve mixed text in English and
Hindi using deep learning, achieving a 12 % increase in MRR over other baselines.
In contrast to their work, we are using a transliteration-based technique that does not
rely on lexica or datasets. Also, we are faced with very short documents lacking the
redundancy that can be used to grasp language features. Bies et al. [11] released a parallel
Arabic-Arabizi SMS and chat corpus of 45,246 case-sensitive tokens. Although it is a
valuable resource, it only covers Egyptian Arabic and Arabizi.
Let q be a query in language l written in script s1. A CSIR system retrieves documents
from a corpus C in language l in response to q, where the documents are written in script
s1 or an alternative script s2 or both s1 and s2, and where s2 is an alternative writing system
for l. The underlying corpus C may consist of documents in n languages and m scripts
such that n ≥ 1 and m ≥ 2. Our definition of the CSIR problem is analogous to Gupta et
al.’s definition of MSIR [6], but in their experimental setup, Gupta et al. focus on bilin‐
gual MSIR (n = 2 and m = 2). We address the problem of a corpus that is both multi-lingual
and multi-scripted (n ≥ 2 and m ≥ 2), which is a complex task since vocabulary overlap
between different languages becomes more likely as more languages and more scripts
co-exist in the searchable space. We describe our transliteration and statistical selection
algorithms below:
1 https://ciir.cs.umass.edu/downloads/.
2 Webpage accessed January 3rd 2016, 19:17: http://langs.eserver.org/qalam.
3 Webpage accessed January 3rd 2016, 19:18: http://languagelog.ldc.upenn.edu/myl/ldc/morph/buckwalter.html.
Table 1. Arabic to Arabizi mapping chart. Parenthesized letters are optional. ‘?’ indicates
an optional single character depending on the immediate subsequent character
2- Map and handle long vowels, diphthongs and hamza: (’ )’و,(‘ )’ي,(‘ ’ى,‘ )’اor
(’ ’ء,‘ ’ ٔا,‘ٓ ’ا,‘ ’ ُٔا,‘)‘ ٕا, with an option to introduce‘2’ for hamza either alone or combined
with a long vowel. Since (“ )”ﻛﺘﺎبcontains the long vowel (‘‘ )’اa’ is inserted accord‐
ingly “ktab”.
3- Generate possible tashdeed (emphasis) instance(s) for the second and subsequent
consonants or (‘ )’وor (‘)’ي, then apply the remaining steps on all enumerated
instances. “kttab”, “kttabb”, “kttabb”.
4- Pad consecutive non-emphasis consonants or (‘ )’وor (‘ )’يwith an optional short
vowel (v) (one of ‘a’, ‘e’, ‘i’ ‘o’, or ‘u’). “k(v)tab”, “k(v)ttab”, “k(v)tabb”,
“k(v)ttabb” kitab, kuttab, ktabb, kattabb.
Steps 3 and 4 allow accounting for the dropped diacritics in Arabic. For example,
“ ”ﻣﴫcan be found as ““( ” ِﻣ ْﴫmisr”) (Egypt) and can also be written as “masr”,“m9r”,
etc.
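The enumeration behind steps 3 and 4 can be sketched as follows. This is a highly simplified, ASCII-only illustration: it only covers optional gemination and optional short-vowel padding over a Romanized consonant skeleton, whereas the real transliterator also applies the full mapping chart, long-vowel, diphthong, and hamza handling.

```python
from itertools import product

SHORT_VOWELS = ["", "a", "e", "i", "o", "u"]   # "" means the vowel is omitted

def candidate_forms(skeleton):
    """Enumerate Arabizi candidates for a Romanized skeleton such as 'ktab'."""
    # Step 3 (simplified): optionally double each character after the first (tashdeed)
    geminated = set()
    for mask in product([1, 2], repeat=max(len(skeleton) - 1, 0)):
        form = skeleton[0] + "".join(c * m for c, m in zip(skeleton[1:], mask))
        geminated.add(form)
    # Step 4 (simplified): pad every position with an optional short vowel
    candidates = set()
    for form in geminated:
        for vowels in product(SHORT_VOWELS, repeat=len(form) - 1):
            candidates.add(form[0] + "".join(v + c for v, c in zip(vowels, form[1:])))
    return candidates

# e.g. candidate_forms("ktab") contains "ktab", "kitab", "kuttab", among others
```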
2. For each transliteration WiARZ in WARZ, find the subset of tweets TWi that contain WiARZ
at least once: TWi = {t1Wi, t2Wi, .., tSWi }
3. For each tweet set TWi, find the union of all the tokens appearing in the tweets’ set
TWiUnion
4. Given a predefined set of Arabizi stopwords SW, find the number of stopwords
appearing in TWiUnion: K = | TWiUnion ∩ SW |
A higher K value indicates the presence of more Arabizi stopwords in the tweet union
when the transliteration form in question appears, hence reflecting more potential
Arabiziness. A lower K means that there is less confidence that the word is in Arabizi.
For example, let WTma9r = {WT1ma9r, WT2ma9r, …, WTnma9r} be the set of Arabizi trans‐
literations of the Arabic word for Egypt generated by our AR→ARZ transliterator such that:
WTma9r = {“m9r”, “ma9r”, “masr”, “masar”, “miser”, “misr”, “mo9ur”,
“mu9irr”}. First, WTma9r elements are projected against the inverted index’s list of words
Ix. Only“mo9ur” doesn’t appear in Ix and is therefore excluded from the resulting
Wma9r. Each transliteration element in Wma9r is then linked to the list of tweets in which
it appears and a set of the words appearing in those tweets is formed. Assume that “masr”
appeared in the following pseudo-tweets: t1masr = “la fe masr.. ana fe masr delwaty fel
beet”, t2masr = “salam keef el 2hal f masr”, t3masr = “creo que en brasil hay masr
argentinos que brasileros”, whose term union yields the set: TmasrUnion = {“2hal”,
“ana”, “argentinos”, “beet”, “brasil”, “brasileros”, “creo”, “delwaty”, “el”, “en”,
“f”, “fe”, “fel”, “hay”, “keef”, “la”, “masr”, “que”, “salam”}. The last step is to
obtain the number of Arabizi stopwords that appear in TmasrUnion, in this case we have
“el”, “f”, “fe”, “fel”, and “la”. Despite the fact that “el” and “la” overlap with other
languages such as Spanish, the other stopwords do not, which makes them distinctive
features for Arabizi in this case. Finally, the K score is equal to the number of stopwords
in TmasrUnion, hence Kmasr = 5. The same process is repeated with the other transliterations
to obtain their respective K values and the transliterations are then sorted accordingly
to reflect their Arabiziness.
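A minimal sketch of this K-score ranking under stated assumptions (a prebuilt mapping from each candidate transliteration to the tweets containing it, and a small Arabizi stopword set taken from the example above) might look as follows:

```python
def k_scores(candidates, tweets_by_term, arabizi_stopwords):
    """Rank candidate transliterations by Arabizi stopword overlap (K score)."""
    scores = {}
    for w in candidates:
        tweets = tweets_by_term.get(w, [])
        if not tweets:
            continue                      # candidate absent from the inverted index
        union = set()
        for tweet in tweets:
            union.update(tweet.lower().split())
        scores[w] = len(union & arabizi_stopwords)
    # higher K first: more stopword overlap suggests more "Arabiziness"
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# tiny illustrative subset of the 54 stopwords used in the paper
stopwords = {"el", "f", "fe", "fel", "la"}
```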
Definite articles, prepositions, and conjunctions are attached to the word in Arabic script.
Surprisingly, Arabizi writers tend to separate such articles from words [2]. We expanded
the set of stopwords indicated by Darwish [2] to include more forms with dialectal
variants (54 in total).
Our results are shown in Table 2, which reports the MRR and MAP values. As expected,
allHuman performs perfectly. The allCommon run is our operational baseline. The K
score results show that ranking by overlap of stopwords improves results: MAP increases
from 56.28 % to 64.18 %, an almost 8 % absolute gain and a 14 % relative improvement
over allCommon. The top-ranked choice improves with MRR increasing by just over
7 % absolute, or almost 11 % relative. We originally hypothesized that very low stopword
overlap may indicate that a word is unlikely to be Arabizi. Dropping all terms with zero
overlap (+1SW) causes a large drop in MAP and a modest drop in MRR. Each successive
drop of candidates lowers both scores consistently. It seems that a weak (in terms of K
score) match is better than no match at all. Both K score and +1SW returned matches
for all 50 queries. However, K score clearly outperforms +1SW, as it always returns
relevant matches, 58 % of the time at ranks as early as the first. The degradation in
performance is proportional to the cutoff value K. A close examination of the results
shows that unanswered queries appear starting at +2SW and the problem gradually
worsens as K increases (Fig. 1). The K score run is the second highest run at
low recall and it maintains the highest precision across all levels of recall. As expected,
the Buckwalter representation does not constitute a suitable real-life Arabizi transliter‐
ation system as can be seen from Table 2.
Table 2. K score and two baselines evaluation. * and † denote statistically significant difference
with respect to allCommon and K score runs (two-tailed t-test, α = 5 %)
Fig. 1. Interpolated Precision-Recall curves for the K score and CSIR baselines.
Our system can be seen as a module that existing search engines can integrate into their
retrieval pipeline to cater for languages that are alternatively Romanized such as Arabic,
Hindi, Russian, and the like. By doing so, relevant transliterated documents will be
retrieved at an average rank as early as the second or first as opposed to not being
retrieved at all. We plan to extend this work to handle multi-term queries, inflectional
and morphological variants and attached articles and pronouns. We believe that it is
fairly feasible to implement our work on other Romanizable languages given our
preliminary work in other languages, in which non-linguist Arabizi users were able to
cover about 80 % of the mapping and conversion rules within a reasonably short amount
of time (less than 30 min), as opposed to the creation of parallel corpora, which is far
more costly and time-consuming.
Acknowledgements. This work is supported by the Swiss National Science Foundation Early
Postdoc.Mobility fellowship project P2NEP2_151940 and is supported in part by the Center for
Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations
expressed in this material are those of the authors and do not necessarily reflect those of the
sponsor.
References
1. Chalabi, A., Gerges, H.: Romanized arabic transliteration. In: Proceedings of the Second
Workshop on Advances in Text Input Methods, pp. 89–96 (Mumbai, India, 2012). The
COLING 2012 Organizing Committee (2012)
2. Darwish, K.: Arabizi detection and conversion to Arabic (2013). arXiv:1306.6755 [cs.CL],
arXiv. http://arxiv.org/abs/1306.6755
3. Al-Badrashiny, M., Eskander, R., Habash, N., Rambow, O.: Automatic transliteration of
romanized dialectal arabic. In: Proceedings of the 18th Conference on Computational
Language Learning (Baltimore, Maryland USA, 2014) (2014)
4. Habash, N., Ryan, R., Owen, R., Ramy, E., Nadt, T.: Morphological analysis and
disambiguation for dialectal arabic. In: Proceedings of Conference of the North American
Association for Computational Linguistics (NAACL) (Atlanta, Georgia, 2013) (2013)
5. Arfath, P., Al-Badrashiny, M., Diab, T.M., Habash, N., Pooleery, M., Rambow, O., Roth,
M.R., Altantawy, M.: DIRA: Dialectal arabic information retrieval assistant. demo paper. In:
Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP)
(Nagoya, Japan, 2013) (2013)
6. Gupta, P., Bali, K., Banchs, R.E., Choudhury, M., Rosso, P.: Query expansion for mixed-
script information retrieval. In: Proceedings of the 37th International ACM SIGIR 2014. New
York, NY, USA, pp. 677–686 (2014)
7. Saha Roy, R., Choudhury, M., Majumder, P., Agarwal, K.: Overview and datasets of FIRE
2013 track on transliterated search. In: 5th Forum for Information Retrieval Evaluation (2013)
8. Al-Onaizan, Y., Knight, K.: Machine transliteration of names in arabic text. In: Proceedings
of the ACL Workshop on Computational Approaches to Semitic Languages (2002)
9. AbdulJaleel, N., Larkey, S.L.: Statistical transliteration for English-Arabic cross language
information retrieval. In: Proceedings of the 12th International Conference on Information
and Knowledge Management (CIKM 2003). ACM (New York, NY, USA, 2003), pp. 139–
146 (2003)
10. May, J., Benjira, Y., Echihabi, A.: An arabizi-english social media statistical machine
translation system. In: Proceedings of the Eleventh Biennial Conference of the Association
for Machine Translation in the Americas, Vancouver, Canada (2014)
11. Bies, A., Song, Z., Maamouri, M., Grimes, S., Lee, H., Wright, J., Strassel, S., Habash, N.,
Eskander, R., Rambow, O.: Transliteration of arabizi into arabic orthography: developing a
parallel annotated arabizi-arabic script SMS/Chat corpus. In: Proceedings of the EMNLP
2014 Workshop on Arabic Natural Langauge Processing (ANLP), pp. 93–103 (Doha, Qatar,
2014)
LExL: A Learning Approach for Local Expert
Discovery on Twitter
1 Introduction
Identifying experts is a critical component for many important tasks. For exam-
ple, the quality of movie recommenders can be improved by biasing the under-
lying models toward the opinions of experts [1]. Making sense of information
streams – like the Facebook newsfeed and the Twitter stream – can be improved
by focusing on content contributed by experts. Along these lines, companies like
Google and Yelp are actively soliciting expert reviewers to improve the coverage
and reliability of their services [7].
Indeed, there has been considerable effort toward expert finding and recom-
mendation, e.g., [2,3,6,10,11]. These efforts have typically sought to identify
general topic experts – like the best Java programmer on github – often by min-
ing information sharing platforms like blogs, email networks, or social media.
However, there is a research gap in our understanding of local experts. Local
experts, in contrast to general topic experts, have specialized knowledge focused
around a particular location. Note that a local expert in one location may not
be knowledgeable about a different location. To illustrate, consider the following
two local experts:
• A “health and nutrition” local expert in Chicago is someone who may be knowl-
edgeable about Chicago-based pharmacies, local health providers, local health
insurance options, and markets offering specialized nutritional supplements or
restricted diet options (e.g., for gluten allergies or strictly vegan diets).
• An “emergency response” local expert in Seattle is someone who could connect
users to trustworthy information in the aftermath of a Seattle-based disaster,
including evacuation routes and the locations of temporary shelters.
3 Evaluation
Our experiments rely on the dataset described in [4], totaling 15 million list
relationships in which the coordinates of labeler and candidate are known.
3.2 Results
Comparison Versus Baselines. We begin by comparing the proposed learn-
ing method (LExL) versus the two baselines. Figure 1 shows the Precision@10,
Recall@10, and NDCG@10 of each method averaged over all queries.1 We con-
sider the LambdaMART version of LExL, in addition to methods using Ranknet,
MART and Random Forest. First, we observe that three versions of LExL clearly
outperform all alternatives, resulting in a Precision@10 of around 0.78, an aver-
age Rating@10 of more than 3, and an NDCG of around 0.8.
1 Note that the results reported here for LocalRank differ from the results in [4] as the
experimental setups are different. First, our rating has 5 scales, which is intended to
capture more detailed expertise level. Second, [4] only considers ideal ranking order
for the top 10 results from LocalRank when calculating maximum possible (ideal)
DCG@10, while we consider a much larger corpus.
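For reference, a small sketch of the graded metrics used here (Precision@10 over the 5-level rating scale and NDCG@10 with the ideal DCG computed over the full candidate pool, as described in the footnote) is given below; the relevance threshold used for precision is an assumption.

```python
import numpy as np

def dcg(gains):
    return sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    """NDCG@k where the ideal DCG is taken over the whole candidate pool."""
    ideal = dcg(sorted(all_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal if ideal > 0 else 0.0

def precision_at_k(ranked_gains, k=10, threshold=3):
    # treat ratings >= threshold on the 5-level scale as hits -- an assumed cutoff
    top = ranked_gains[:k]
    return sum(1 for g in top if g >= threshold) / max(len(top), 1)
```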
Fig. 1. Evaluating the proposed learning-based local expertise approach versus two
alternatives. ‘+’ marks a statistically significant difference with LExL[LambdaMART]
according to a paired t-test at significance level 0.05 (Colour figure online).
4 Conclusion
In this paper, we have proposed and evaluated a geo-spatial learning-to-rank
framework for identifying local experts that leverages the fine-grained GPS coor-
dinates of millions of Twitter users and carefully curated Twitter list data. We
introduced four categories of features for the learning model, including a group of
location-sensitive graph random-walk features that capture both the dynamics
of expertise propagation and physical distances. Through extensive experimen-
tal investigation, we find that the proposed learning framework produces significant
improvements over previous methods.
References
1. Amatriain, X., Lathia, N., Pujol, J.M., et al.: The wisdom of the few: a collaborative
filtering approach based on expert opinions from the web. In: SIGIR (2009)
2. Balog, K., Azzopardi, L., De Rijke, M.: Formal models for expert finding in enter-
prise corpora. In: SIGIR (2006)
3. Campbell, C.S., Maglio, P.P., et al.: Expertise identification using email commu-
nications. In: CIKM (2003)
4. Cheng, Z., Caverlee, J., Barthwal, H., Bachani, V.: Who is the barbecue king of
texas?: a geo-spatial approach to finding local experts on twitter. In: SIGIR (2014)
5. Fleiss, J.L., et al.: Large sample standard errors of kappa and weighted kappa.
Psychol. Bull. 72(5), 323 (1969)
6. Ghosh, S., Sharma, N., et al.: Cognos: crowdsourcing search for topic experts in
microblogs. In: SIGIR (2012)
7. Google: overview of local guides (2015). https://goo.gl/NFS0Yz
Abstract. This paper proposes a new model for the detection of click-
bait, i.e., short messages that lure readers to click a link. Clickbait is
primarily used by online content publishers to increase their readership,
whereas its automatic detection will give readers a way of filtering their
news stream. We contribute by compiling the first clickbait corpus of
2992 Twitter tweets, 767 of which are clickbait, and, by developing a
clickbait model based on 215 features that enables a random forest clas-
sifier to achieve 0.79 ROC-AUC at 0.76 precision and 0.76 recall.
1 Introduction
Clickbait refers to a certain kind of web content advertisement that is designed
to entice its readers into clicking an accompanying link. Typically, it is spread on
social media in the form of short teaser messages that may read like the following
examples:
– A Man Falls Down And Cries For Help Twice. The Second Time, My Jaw Drops
– 9 Out Of 10 Americans Are Completely Wrong About This Mind-Blowing Fact
– Here’s What Actually Reduces Gun Violence
When reading such and similar messages, many get the distinct impression that
something is odd about them; something unnamed is referred to, some emo-
tional reaction is promised, some lack of knowledge is ascribed, some authority
is claimed. Content publishers of all kinds discovered clickbait as an effective tool
to draw attention to their websites. The level of attention captured by a web-
site determines the price of displaying ads there, whereas attention is measured
in terms of unique page impressions, usually caused by clicking on a link that
points to a given page (often abbreviated as “clicks”). Therefore, a clickbait’s
target link alongside its teaser message usually redirects to the sender’s website
if the reader is afar, or else to another page on the same site. The content found
at the linked page often encourages the reader to share it, suggesting clickbait
for a default message and thus spreading it virally. Clickbait on social media has
been on the rise in recent years, and even some news publishers have adopted
this technique. These developments have caused general concern among many
outspoken bloggers, since clickbait threatens to clog up social media channels,
and since it violates journalistic codes of ethics.
2 Related Work
The rationale why clickbait works is widely attributed to teaser messages open-
ing a so-called “curiosity gap,” increasing the likelihood of readers to click the
target link to satisfy their curiosity. Loewenstein’s information-gap theory of
curiosity [19] is frequently cited to provide a psychological underpinning (p. 87):
“the information-gap theory views curiosity as arising when attention becomes
focused on a gap in one’s knowledge. Such information gaps produce the feeling
of deprivation labeled curiosity. The curious individual is motivated to obtain the
missing information to reduce or eliminate the feeling of deprivation.” Loewen-
stein identifies stimuli that may spark involuntary curiosity, such as riddles
or puzzles, event sequences with unknown outcomes, expectation violations,
information possessed by others, or forgotten information. The effectiveness by
which clickbait exploits this cognitive bias results from data-driven optimiza-
tion. Unlike with printed front page headlines, for example, where feedback
about their potential contribution to newspaper sales is indirect, incomplete,
and delayed, clickbait is optimized in real-time, recasting the teaser message to
maximize click-through [16]. Some companies allegedly rely mostly on clickbait
for their traffic. Their success on social networks recently caused Facebook to
take action against clickbait as announced by El-Arini and Tang [8]. Yet, little
is known about Facebook’s clickbait filtering approach; no corresponding pub-
lications have surfaced. El-Arini and Tang’s announcement mentions only that
context features such as dwell time on the linked page and the ratio of clicks to
likes are taken into account.
To the best of our knowledge, clickbait has been subject to research only
twice to date, both times by linguists: first, Vijgen [26] studies articles that
compile lists of things, so-called “listicles.” Listicles are often under suspicion to
be clickbait. The authors study 720 listicles published at BuzzFeed in two weeks
of January 2014, which made up about 30 % of the total articles published in this
period. The titles of listicles, which are typically shared as teaser messages, exert
a very homogeneous structure: all titles contain a cardinal number—the number
of items listed—and 85 % of the titles start with it. Moreover, these titles contain
strong nouns and adjectives to convey authority and sensationalism. Moreover,
the main articles consistently achieve easy readability according to the Gunning
fog index [10]. Second, Blom and Hansen [3] study phoricity in headlines as a
means to arouse curiosity. They analyze 2000 random headlines from a Danish
news website and identify two common forms of forward-references: discourse
deixis and cataphora. The former are references at discourse level (“This news
will blow your mind”.), and the latter at phrase level (“This name is hilarious”.).
Based on a dictionary of basic deictic and cataphoric expressions, the share of
such phoric expressions at 10 major Danish news websites reveals that they
occur mostly in commercial, ad-funded, and tabloid news websites. However, no
detection approach is proposed.
Besides, some dedicated individuals have taken the initiative: Gianotto [9]
implements a browser plugin that transcribes clickbait teaser messages based
on a rule set so that they convey a more “truthful,” or rather ironic meaning.
We employ the rule set premises as features and as a baseline for evaluation.
Beckman [2], Mizrahi [20], Stempeck [24], and Kempe [15] manually re-share
clickbait teaser messages, adding spoilers. Eidnes [7] employs recurrent neural
networks to generate nonsense clickbait for fun.
Table 1. Left: Top 20 publishers on Twitter according to NewsWhip [21] in 2014. The
darker a cell, the more prolific the publisher; white cells indicate missing data. Right:
Our clickbait corpus in terms of tweets with links posted in week 24, 2015, tweets
sampled for manual annotation, and tweets labeled as clickbait (absolute and relative)
by majority vote of three assessors.
(2) Linked web page. Analyzing the web pages linked from a tweet, Features 204–
209 are again bag-of-words features, whereas Features 210 and 211 measure
readability and length of the main content when extracted with Boilerpipe [17].
(3) Meta information. Feature 212 encodes a tweet’s sender, Feature 213 whether
media (e.g., an image or a video) has been attached to a tweet, Feature 214
whether a tweet has been retweeted, and Feature 215 the part of day in which
the tweet was sent (i.e., morning, afternoon, evening, night).
5 Evaluation
We randomly split our corpus into datasets for training and testing at a 2:1
training-test ratio. To avoid overfitting, we discard all features that have non-
trivial weights in less than 1 % of the training dataset. The features listed
in Table 2 remained, whereas many individual features from the bag-of-words
feature types were discarded (see the feature IDs marked with a ∗ ). Before train-
ing our clickbait detection model, we balance the training data by oversampling
clickbait. We compare the three well-known learning algorithms logistic regres-
sion [18], naive Bayes [14], and random forest [4] as implemented in Weka 3.7 [12]
using default parameters. To assess detection performance, we measure preci-
sion and recall for the clickbait class, and the area under the curve (AUC) of
the receiver operating characteristic (ROC). We evaluate the performance of all
features combined, each feature category on its own, and each individual feature
(type) in isolation. Table 2 shows the results.
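A minimal sketch of this pipeline with scikit-learn stand-ins (rather than the Weka 3.7 setup used in the paper) could look as follows; character n-grams over the teaser text stand in for feature category (1a), and the oversampling is a naive duplication of the minority (clickbait) class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

def train_clickbait_detector(tweets, labels, seed=0):
    # character n-grams capture writing style, one of the strongest feature types
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    X = vec.fit_transform(tweets)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, np.array(labels), test_size=1/3, random_state=seed, stratify=labels)

    # naive oversampling, assuming clickbait (label 1) is the minority class
    pos, neg = np.where(y_tr == 1)[0], np.where(y_tr == 0)[0]
    extra = np.random.RandomState(seed).choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])

    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_tr[idx], y_tr[idx])

    scores = clf.predict_proba(X_te)[:, 1]
    preds = (scores >= 0.5).astype(int)
    return {"precision": precision_score(y_te, preds),
            "recall": recall_score(y_te, preds),
            "roc_auc": roc_auc_score(y_te, scores)}
```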
All features combined achieve a ROC-AUC of 0.74 with random forest,
0.72 with logistic regression, and 0.69 with naive Bayes. The precision scores
on all features do not differ much across classifiers, the recall ranges from 0.66
with naive Bayes to 0.73 with random forest. Interestingly, the teaser message
features (1a) alone compete with or even outperform all features combined in terms
of precision, recall, and ROC-AUC, using naive Bayes and random forest. The
character n-gram features and the word 1-gram feature (IDs 1–4) appear to
contribute most to this performance. Character n-grams are known to capture
writing style, which may partly explain their predictive power for clickbait. The
other features from category (1a) barely improve over chance as measured by
ROC-AUC, yet, some at least achieve high precision, recall, or both. We fur-
ther employ feature selection based on the χ2 test to study the dependency of
performance on the number of high-performing features. Selecting the top 10,
100, and 1000 features, overall performance with random forest outperforms that
of feature category (1a) with 0.79 ROC-AUC. Features from all categories are
selected, but mostly n-gram features from the teaser message and the linked web
page.
Finally, as a baseline for comparison, the Downworthy rule sets [9] achieve
about 0.69 recall at about 0.64 precision, whereas their ROC-AUC is only 0.54.
This baseline is not only outperformed by combinations of other features,
but also by individual features, such as the General Inquirer dictionary “You”
(9 pronouns indicating another person is being addressed directly) as well as
Table 2. Evaluation of our clickbait detection model. Some features are feature types
that expand to many individual frequency-weighted features (i.e., IDs 1–9 and IDs
204–209). As classifiers, we evaluate logistic regression (LR), naive Bayes (NB), and
random forest (RF).
6 Conclusion
This paper presents the first machine learning approach to clickbait detection:
the goal is to identify messages in a social stream that are designed to exploit
cognitive biases to increase the likelihood of readers clicking an accompanying
link. Clickbait’s practical success, and the resulting flood of clickbait in social
media, may cause it to become another form of spam, clogging up social networks
and being a nuisance to its users. The adoption of clickbait by news publishers
is particularly worrisome. Automatic clickbait detection would provide for a
solution by helping individuals and social networks to filter respective messages,
and by discouraging content publishers from making use of clickbait. To this end,
we contribute the first evaluation corpus as well as a strong baseline detection
model. However, the task is far from being solved, and our future work will be
on contrasting clickbait between different social media, and improving detection
performance.
References
1. Ajani, S.: A full 63% of buzzfeed’s posts are clickbait (2015). http://keyhole.co/
blog/buzzfeed-clickbait/
2. Beckman, J.: Saved you a click—don’t click on that. I already did (2015).
https://twitter.com/savedyouaclick
3. Blom, J.N., Hansen, K.R.: Click bait: forward-reference as lure in online news
headlines. J. Pragmat. 76, 87–100 (2015)
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
5. Rocca, J.: Dale-Chall easy word list (2013). http://countwordsworth.com/
download/DaleChallEasyWordList.txt
6. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves.
In: Proceedings of ICML 2006, pp. 233–240 (2006)
7. Eidnes, L.: Auto-generating clickbait with recurrent neural networks (2015).
http://larseidnes.com/2015/10/13/auto-generating-clickbait-with-recurrent-
neural-networks/
8. El-Arini, K., Tang, J.: News feed FYI: click-baiting (2014). http://newsroom.fb.
com/news/2014/08/news-feed-fyi-click-baiting/
9. Gianotto, A.: Downworthy—a browser plugin to turn hyperbolic viral headlines
into what they really mean (2014). http://downworthy.snipe.net
10. Gunning, R.: The fog index after twenty years. J. Bus. Commun. 6(2), 3–13 (1969)
11. Hagey, K.: Henry Blodget’s Second Act (2011). http://www.wsj.com/articles/
SB10000872396390444840104577555180608254796
12. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
13. Imagga Image Tagging Technology (2015). http://imagga.com
14. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers.
In: Proceedings of UAI 1995, pp. 338–345 (1995)
15. Kempe, R.: Clickbait spoilers—channeling traffic from clickbaiting sites back to
reputable providers of original content (2015). http://www.clickbaitspoilers.org
16. Koechley, P.: Why the title matters more than the talk (2012). http://blog.
upworthy.com/post/26345634089/why-the-title-matters-more-than-the-talk
17. Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow
text features. In: Proceedings of WSDM 2010, pp. 441–450 (2010)
18. le Cessie, S., van Houwelingen, J.C.: Ridge estimators in logistic regression. Appl.
Stat. 41(1), 191–201 (1992)
19. Loewenstein, G.: The psychology of curiosity: a review and reinterpretation. Psy-
chol. Bull. 116(1), 75 (1994)
20. Mizrahi, A.: HuffPo spoilers—I give in to click-bait so you don’t have to (2015).
https://twitter.com/huffpospoilers
21. NewsWhip Media Tracker (2015). http://www.newswhip.com
22. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: a
high performance and scalable information retrieval platform. In: OSIR @ SIGIR
(2006)
23. Smith, B.: Why buzzfeed doesn’t do clickbait (2015). http://www.buzzfeed.com/
bensmith/why-buzzfeed-doesnt-do-clickbait
24. Stempeck, M.: Upworthy spoiler—words that describe the links that follow (2015).
https://twitter.com/upworthyspoiler
25. Stone, P.J., Dunphy, D.C., Smith, M.S.: The General Inquirer: A Computer Approach to
Content Analysis. MIT Press, Cambridge (1966)
26. Vijgen, B.: The listicle: an exploring research on an interesting shareable new media
phenomenon. Stud. Univ. Babes-Bolyai-Ephemerides 1, 103–122 (2014)
Informativeness for Adhoc IR Evaluation:
A Measure that Prevents Assessing
Individual Documents
1 Introduction
Information Retrieval (IR) aims at retrieving the relevant information from a
large volume of available documents. Evaluating IR implies defining evaluation
frameworks. In adhoc retrieval, the Cranfield framework is the prevailing one
[1]; it is composed of documents, queries, relevance assessments, and measures.
Moreover, document relevance is considered independent of the document
rank and generally as a Boolean function (a document is relevant or not to
a given query), even though levels of relevance can be used [7]. Effectiveness
measurement is based on comparing the retrieved documents with the reference
list of relevant documents. Moreover, it is based on the assessment assumption,
that is, the relevance of documents is known in advance for each query. This implies
that the collection is static since it is assessed by humans. The Cranfield paradigm
facilitates reproducibility of experiments: at any time it is possible to evaluate a
new IR method and to compare it against previous results; this is one of its main
$$cP^{1,0}_{i=d}(D, R) \;\ge\; \frac{\sum_{j=1}^{d} |D_j \cap R|}{\sum_{i=1}^{d} |D_i|}$$
but conversely, D ∩ R = ∅ does not imply cPλ (D) = 0 since there can be some
overlap between n-grams in documents and in the reference.
So our approach, based on document contents instead of document IDs, does
not require exhaustive references and can therefore be applied to incomplete
references based on pools of relevant documents. Moreover, while adhoc IR
returns a ranked list of documents independently of their respective lengths,
relevance judgments can be used to automatically generate text references
(t-rels) by concatenating the textual content of relevant documents.
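A minimal sketch of this t-rels construction from standard TREC qrels is shown below, under the assumption that the collection text is accessible through a document-ID lookup; the function names are illustrative.

```python
def build_trels(qrels_path, get_doc_text):
    """Concatenate the text of relevant documents per topic to form textual references.

    qrels_path:   TREC qrels file with lines "topic iteration docid relevance"
    get_doc_text: callable mapping a document ID to its textual content
    """
    trels = {}
    with open(qrels_path) as f:
        for line in f:
            topic, _, docid, rel = line.split()
            if int(rel) > 0:                       # keep relevant documents only
                trels.setdefault(topic, []).append(get_doc_text(docid))
    return {topic: " ".join(texts) for topic, texts in trels.items()}
```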
Table 1. Summary of TREC test collections and size in tokens of generated t-rels used
for evaluation.
the official measure. For all collections, the official measure is the Mean Average
Precision (MAP), except for Web2010 and Web2011 where Expected Reciprocal
Rank (ERR@20) was preferred. These official rankings constitute the ground
truth ranking, against which we will compare the rankings produced by:
– $cP_{10^3}$, based on the first 1000 tokens of each run and on their log frequencies.
– $cP_n$, based on all tokens of each run.
In this section we report the correlation results of the ground truth ranking
(TREC official measure depending on the track) and the content-based ranking
produced by the cPλ informativeness measure. All correlations reported are sig-
nificantly different from zero with a p-value < 0.001. While we chose Kendall’s τ
as the correlation measure, we also report the Pearson’s linear correlation coef-
ficient for convenience. A τ > 0.5 typically indicates a strong correlation since
it implies an agreement between the two measures over more than half of all
ordered pairs.
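For reference, comparing two system rankings in this way is straightforward with SciPy; the per-system score arrays below are placeholders, one entry per evaluated system.

```python
from scipy.stats import kendalltau, pearsonr

# placeholder per-system scores: official measure (e.g., MAP) vs. informativeness
official = [0.31, 0.27, 0.25, 0.22, 0.18, 0.15]
informativeness = [0.42, 0.40, 0.36, 0.37, 0.29, 0.25]

tau, tau_p = kendalltau(official, informativeness)
r, r_p = pearsonr(official, informativeness)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3g}), Pearson r = {r:.3f} (p = {r_p:.3g})")
```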
Table 2. Retrieval system ranking correlations between the official ground truth and
the $cP_\lambda$ informativeness measure. $cP_\lambda^{1,1}$ stands for uniterms while $cP_\lambda^{2,2}$ corresponds
to bigrams with skip. We use either the first $10^3$ tokens or all the tokens from the ordered
list of retrieved documents.
When looking at Table 2, we see that cPλ accurately reproduces official rank-
ing based on MAP for early TREC tracks (TREC6-7-8, Web2000) as well as for
Robust2004, Terabyte2004-5 and Web2011. LogSim-score applied to all tokens
in runs is often the most effective whenever systems are ranked based on MAP.
However, on early TREC tracks, $cP_{10^3}^{2,2}$ can perform better even though only the
first 1000 tokens of each run are considered after concatenating ranked retrieved
documents. Indeed, the traditional TREC adhoc and Robust tracks used news-
paper articles as document collection. Since a single article often deals with a
single subject, relevant concepts are likely to occur together, which might be
less the case in web pages for example. A relevant news article is very likely to
contain only relevant information, whereas a long web document that deals with
several subjects might not be relevant as a whole.
4 Conclusion
In this paper, we proposed a framework to evaluate adhoc IR using the LogSim
informativeness measure based on token n-grams. To evaluate this measure, we
compared the ranks of the systems we obtained with the official rankings based
on document relevance, considering various TREC collections on adhoc tasks.
We showed that (1) rankings obtained based on n-gram informativeness and with
Mean Average Precision are strongly correlated; and (2) LogSim informativeness
can be estimated on top ranked documents in a robust way. The advantage of
this evaluation framework is that it does not rely on an exhaustive reference and
can be used in a changing environment in which new documents occur, and for
which relevance has not been assessed. In future work, we will evaluate the
influence of various LogSim parameters.
What Multimedia Sentiment Analysis Says
About City Liveability
1 Introduction
Posting messages on social networks is a popular means for people to communi-
cate and to share thoughts and feelings about their daily lives. Previous studies
investigated the correlation between sentiment extracted from user-generated
text and various demographics [6]. However, as technology improves, the band-
width available for users also increases. As a result, users can share images and
videos with greater ease. This led to a change in types of media being shared
on these online networks. More particularly, user-generated content often consists
of a combination of modalities, e.g., text, images, video and audio. As a result,
more recent studies have tried to predict sentiment from visual content too [2].
A recent study on urban computing conducted by Zheng et al. underlines the
potential of utilizing user-generated content for solving various challenges a mod-
ern metropolis is facing, ranging from urban planning and transportation
to public safety and security [7]. In this paper we investigate whether the senti-
ment analysis of spontaneously generated social multimedia can be utilized for
detecting areas of the city facing such problems. More specifically, we aim at
creating a sentiment map of Amsterdam that may help paint a clearer picture of
the city liveability. Since visual content may contain complementary information
to text, in our approach we choose to utilize them jointly. Additionally, we make
use of automatically captured metadata (i.e., geotags) to analyse the messages
in the context of the locations where they were posted.
A direct evaluation of our results would require a user study in which the
participants would be asked about their sentiment on different neighbourhoods
of a city. As conducting such a study would be both extremely time consum-
ing and labour intensive, here we propose the use of open data as an indirect
ground-truth. We consider a large number of demographic, economic and safety
parameters comprising the liveability index of a neighbourhood and investigate
their association with the automatically produced sentiment scores in different
scenarios.
2 Research Methodology
Our approach consists of three steps outlined in Fig. 1 and described below.
The data used in this research comes from two different online social networks,
which both include textual and visual content. The emphasis of the analysis will
be on modalities dominant on a particular platform, i.e., textual data in case of
Twitter and visual data for Flickr. However, visual data shared on Twitter and
text shared on Flickr will also be used for a better understanding of the content
hosted on these platforms and an increased quality of the sentiment analysis.
For about two months, 64 thousand tweets were collected within a 10-mile
radius of the city center of Amsterdam. The dataset only includes tweets that
have a geo-location available. A total of 64 thousand images were downloaded
from Flickr that are taken in and around the city of Amsterdam.
Fig. 1. Overview of the approach: sentiment scores computed from the social media data are combined with open data (17 variables) for regression analysis and correlation testing.
From open data as provided by the city of Amsterdam we utilize the following
neighbourhood variables: percentages of non-western immigrants, western immi-
grants, autochthonous inhabitants, income, children in households with minimum
incomes, people working, people living on welfare, people with low, average and
high level of education, recreation area size, housing prices, physical index, nui-
sance index, social index, and liveability index [5].
Visual Content. We analyse visual content of the images using SentiBank [2]
and detect 1200 adjective noun pairs (ANPs). For example, using SentiBank a
‘happy person’ can be detected, which combines the adjective ‘happy’ with the
noun ‘person’. Each of the ANPs detected has a sentiment score associated with
it and for each image, we compute the average of the top 10 detected ANPs.
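The per-image score described above can be sketched as follows; the detection list and the ANP sentiment lookup are assumed inputs (SentiBank's actual output format may differ):

```python
# Sketch: average the sentiment of the top-10 detected adjective-noun pairs.
def image_sentiment(detections, anp_sentiment, top_k=10):
    """detections: list of (anp, confidence); anp_sentiment: anp -> sentiment score."""
    top = sorted(detections, key=lambda d: d[1], reverse=True)[:top_k]
    if not top:
        return 0.0
    return sum(anp_sentiment[anp] for anp, _ in top) / len(top)

# Example: a detected 'happy person' pulls the image score towards positive.
score = image_sentiment([("happy person", 0.9), ("dark street", 0.7)],
                        {"happy person": 0.8, "dark street": -0.4})
```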
3 Experimental Results
Our analysis shows that no significant relationship can be found between the
visual sentiment scores from Flickr and the selected liveability indicators. The
most significant relationships are found with the percentage of people living on
governmental welfare checks and the level of education.
The combined sentiment scores showed more promising results. The safety
index and the people living on governmental welfare showed significant associa-
tions with the sentiment scores (p = 0.037 and p = 0.028, cf. Fig. 2).
To compute reliable scores, only neighbourhoods with more than 40 tweets were
taken into account. However, no significant relationship is found between the
open data and the scores based on the analysis of textual content.
However, the combined score shows multiple significant relationships with the
liveability indicators. The first interesting relationship is found between ethnic
composition of the neighbourhood and sentiment scores. More particularly, there
is a positive association between sentiment scores and the percentage of native
Dutch inhabitants (cf. Fig. 3). Similarly, a positive association can be found
between level of education and the sentiment scores. This is not surprising as
these two variables are strongly correlated.
scores based on Twitter data. Further research is needed to investigate the nature
of these relationships. However, it is interesting to observe that for both plat-
forms the economic indicators (i.e., people living on welfare checks and income)
show significant relationships with our computed sentiment scores. On the other
hand, the liveability or social index of a neighbourhood showed no significant
relationship. Since these indices are designed to measure the subjective feelings
of the inhabitants, we would have expected these to be more significant in our
research.
Using the Twitter data shows more significant relationships than using the
data from Flickr. A possible explanation for this might be that people do not
tend to share opinions or feelings on Flickr but mainly use it as a means to
share their photographs.
To further improve this research, it would also be interesting to see if the
sentiment prediction could be adjusted to factors that are important for resi-
dents of a city. Examples are detectors created specifically for urban phenomena
such as noise nuisance or graffiti, which are known to influence the liveability of
the city [4]. Finally, combining sentiment scores from user-generated data and
open data allows for new research opportunities.
Our research shows that sentiment scores may give additional insights into a
geographic area. The big advantage of training on social multimedia data is that
it provides for real-time insights. Additionally, sentiment in these areas can prove
to be an indication of important factors like crime rate or infrastructure quality.
This may be useful for government services to know what area to improve or for
new businesses to find a convenient location.
References
1. Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL
on Interactive Presentation Sessions, pp. 69–72. Association for Computational Lin-
guistics (2006)
2. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.-F.: Large-scale visual sentiment
ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM
International Conference on Multimedia, pp. 223–232. ACM (2013)
3. De Smedt, T., Daelemans, W.: Pattern for python. J. Mach. Learn. Res. 13,
2063–2067 (2012)
4. Dignum, K., Jansen, J., Sloot, J.: Growth and decline - demography as a driving
force. PLAN Amsterdam 17(5), 25–27 (2011)
5. C.G.A.O. for Research and Statistics. Fact sheet leefbaarheidsindex periode
2010–2013, February 2014. https://www.amsterdam.nl/publish/pages/502037/fact
sheet 6 leefbaarheidsindex 2010 - 2013 opgemaakt def.pdf
6. Mitchell, L., Frank, M.R., Harris, K.D., Dodds, P.S., Danforth, C.M.: The geogra-
phy of happiness: connecting twitter sentiment and expression, demographics, and
objective characteristics of place. PLoS ONE 8(5), e64417 (2013)
7. Zheng, Y., Capra, L., Wolfson, O., Yang, H.: Urban computing: concepts, method-
ologies, and applications. ACM Trans. Intell. Syst. Technol. (TIST) 5(3), 38 (2014)
Demos
Scenemash: Multimodal Route Summarization
for City Exploration
1 Introduction
When visiting a city, tourists often have to rely on travel guides to get informa-
tion about interesting places in their vicinity or between two locations. Existing
crowdsourced tourist websites, such as TripAdvisor, primarily focus on providing
point of interest (POI) reviews. The available data on social media platforms
allows for new use-cases, stemming from a much richer impression about places.
Efforts to utilize the richness of social media for tourism applications have been
made by e.g., extracting user demographics from visual content of the images
[3], modelling POIs and user mobility patterns by analysing Wikipedia pages
and image metadata [2] or by representing users and venues by topic modelling
in both text and visual domains [7].
We propose Scenemash1 , a system that supports way-finding for tourists by
automatically generating multimodal summaries of several alternative routes
between locations in a city and describing the geographic area around a given
location. To represent geographic areas along the route, we make use of user-
contributed images and their associated annotations. For this purpose, we sys-
tematically collect information about venues and the images depicting them
from the location-based social networking platform Foursquare, and we turn to
the content-sharing website Flickr for a richer set of images and metadata capturing
1 Scenemash demo: https://www.youtube.com/watch?v=oAnj6A1oq2M.
2 Approach Overview
The pipeline of our approach is illustrated in Fig. 1. In this section we describe
data collection and analysis steps as well as the procedure for generating multi-
modal representations of the geographic areas (i.e., steps 1 and 2).
the current location provided by GPS sensor. Scenemash features “explore” and
“get route” functions. On the server side, a graph illustrated in step 3 of Fig. 1
is used to get the neighbouring nodes/grid cells of a node containing user coor-
dinates when the explore function is selected. In the get route mode, we apply the
breadth-first search algorithm on the same graph for computing a route between
two locations. Alternative routes are computed by selecting different neighbour
nodes of the origin node. To give the users an opportunity to avoid crowded
places, we create a weighted version of the same graph which uses the number
of images captured in a geographic cell as a proxy for crowdedness. If the crowd
avoidance feature is selected, we deploy Dijkstra’s shortest path algorithm for
computing the route between two locations.
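A minimal sketch of the two routing modes follows; the grid graph and the per-cell image counts are assumed inputs, and the cost function is only one plausible way to penalize crowded cells:

```python
import heapq
from collections import deque

def bfs_route(graph, start, goal):
    """Default route: breadth-first search on the grid graph (cell -> neighbours)."""
    prev, queue = {start: None}, deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for nxt in graph[cell]:
            if nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return _backtrack(prev, goal)

def crowd_avoiding_route(graph, image_count, start, goal):
    """Crowd avoidance: Dijkstra, where entering a cell costs more if many images were taken there."""
    dist, prev, heap = {start: 0}, {start: None}, [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        for nxt in graph[cell]:
            nd = d + 1 + image_count.get(nxt, 0)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, cell
                heapq.heappush(heap, (nd, nxt))
    return _backtrack(prev, goal)

def _backtrack(prev, goal):
    path, cell = [], goal
    while cell is not None:
        path.append(cell)
        cell = prev.get(cell)
    return list(reversed(path))
```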
The data collection and analysis steps described in Sect. 2 are precomputed
offline, in order to reduce online computation load. Figure 2 illustrates the user
interfaces. Each relevant geographic area (i.e., on the route or in the user's vicinity)
is represented by a circular thumbnail displayed in Google Maps. If the smart-
phone is paired with a smartwatch, the images are shown as a slideshow on
the smartwatch. When a user interacts with the map by tapping on one of the
images, the image is enlarged and an info-box with the most relevant terms for
the area is shown. The effectiveness of the prototype gives us confidence that
Scenemash could be deployed in other cities as well.
References
1. Ah-Pine, J., Clinchant, S., Csurka, G., Liu, Y.: XRCE's participation in ImageCLEF.
In: Working Notes of CLEF 2009 Workshop Co-located with the 13th European
Conference on Digital Libraries (ECDL 2009) (2009)
2. Brilhante, I., Macedo, J.A., Nardini, F.M., Perego, R., Renso, C.: TripBuilder:
a tool for recommending sightseeing tours. In: de Rijke, M., Kenter, T., de Vries,
A.P., Zhai, C.X., de Jong, F., Radinsky, K., Hofmann, K. (eds.) ECIR 2014. LNCS,
vol. 8416, pp. 771–774. Springer, Heidelberg (2014)
3. Cheng, A.-J., Chen, Y.-Y., Huang, Y.-T., Hsu, W.H., Liao, H.-Y.M.: Personalized
travel recommendation by mining people attributes from community-contributed
photos. In: Proceedings of the 19th ACM International Conference on Multimedia,
MM 2011, pp. 83–92. ACM, New York (2011)
4. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science
315(5814), 972–976 (2007)
5. Li, X., Snoek, C., Worring, M.: Learning social tag relevance by neighbor voting.
IEEE Trans. Mult. 11(7), 1310–1322 (2009)
6. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: 2015 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, June
2015
7. Zahálka, J., Rudinac, S., Worring, M.: New yorker melange: interactive brew of
personalized venue recommendations. In: Proceedings of the ACM International
Conference on Multimedia, MM 2014, pp. 205–208. ACM, New York (2014)
8. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora.
In: Proceedings of LREC Workshop New Challenges for NLP Frameworks, pp.
46–50. University of Malta, Valletta (2010)
Exactus Like: Plagiarism Detection
in Scientific Texts
1 Introduction
Plagiarism is a serious problem in education and science. Improper citations,
textual borrowings, and plagiarism often occur in student and research papers.
Academics, peer reviewers, and editors of scientific journals should detect pla-
giarism in all forms and prevent substandard works from being published [1].
Numerous computer-assisted plagiarism detection systems (CaPD) have been
developed recently: Turnitin, Antiplagiat.ru, The Plagiarism Checker, PlagScan,
Chimpsky, Copyscape, PlagTracker, Plagiarisma.ru. The difference between
these systems lies in the search engines used to find similar textual fragments, rank-
ing schemas, and result presentations. Most of the aforementioned systems imple-
ment simple techniques to detect “copy-and-paste” borrowings based on exact
textual matching or w-shingling algorithms [2,3]. Such an approach shows good
computational performance, but it cannot find heavily disguised plagiarism [4].
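For illustration, a basic w-shingling comparison of the kind used for copy-and-paste detection can be sketched as follows (not the Exactus Like implementation; shingle size and threshold are illustrative):

```python
def shingles(text, w=5):
    """Return the set of w-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def is_copy_paste(fragment, source, w=5, threshold=0.5):
    """Flag a fragment as a likely verbatim borrowing if its shingles overlap a source strongly."""
    return jaccard(shingles(fragment, w), shingles(source, w)) >= threshold
```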
In this demonstration we present Exactus Like1, an applied plagiarism detection
system which detects not only simple copy-and-paste plagiarism but also moderately
disguised borrowings (word/phrase reordering, substitution of some words
with synonyms). To do this, the system leverages deep parsing techniques.
Fig. 2. Found reused fragments in the checked document and their sources
4 Conclusion
The demo of Exactus Like is available online at http://like.exactus.ru/index.
php/en/. We are working on improving the computational performance of our
linguistic tools to provide faster detection. Our current research is focused on the detection
of heavily disguised plagiarism.
References
1. Osipov, G., Smirnov, I., Tikhomirov, I., Sochenkov, I., Shelmanov, A., Shvets, A.:
Information retrieval for R&D support. In: Paltoglou, G., Loizides, F., Hansen,
P. (eds.) Professional Search in the Modern World. LNCS, vol. 8830, pp. 45–69.
Springer, Heidelberg (2014)
2. Stein, B.: Fuzzy-fingerprints for text-based information retrieval. In: Proceedings of
the 5th International Conference on Knowledge Management, pp. 572–579 (2005)
3. Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital doc-
uments. In: Proceedings of the 1995 ACM SIGMOD International Conference on
Management of Data, vol. 24, pp. 398–409 (1995)
4. Hagen, M., Potthast, M., Stein, B.: Source retrieval for plagiarism detection from
large web corpora: recent approaches. In: Working Notes of CLEF 2015 - Conference
and Labs of the Evaluation Forum (2015)
5. Osipov, G., Smirnov, I., Tikhomirov, I., Shelmanov, A.: Relational-situational
method for intelligent search and analysis of scientific publications. In: Proceed-
ings of the Workshop on Integrating IR technologies for Professional Search, in
conjunction with the 35th European Conference on Information Retrieval, vol. 968,
pp. 57–64 (2013)
6. Shvets, A., Devyatkin, D., Sochenkov, I., Tikhomirov, I., Popov, K., Yarygin, K.:
Detection of current research directions based on full-text clustering. In: Proceedings
of Science and Information Conference, pp. 483–488. IEEE (2015)
7. Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections
with mapreduce. In: Proceedings of the 46th Annual Meeting of the Association
for Computational Linguistics on Human Language Technologies: Short Papers,
pp. 265–268 (2008)
8. Takuma, D., Yanagisawa, H.: Faster upper bounding of intersection sizes. In:
Proceedings of the 36th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 703–712 (2013)
9. Zubarev, D., Sochenkov, I.: Using sentence similarity measure for plagiarism source
retrieval. In: Working Notes for CLEF 2014 Conference, pp. 1027–1034 (2014)
Jitter Search: A News-Based Real-Time
Twitter Search Interface
1 Introduction
This demo presents a real-time search system that monitors news sources on
the social network Twitter and provides an online search interface that performs
query expansion on the users’ queries using terms extracted from news head-
lines. We tackle a specific problem related to news aggregated search, which
deals with the integration of fresh content extracted from news article collections
into the microblog retrieval process. In this work, news published by reputable
news sources on Twitter is indexed into different shards to be used in query
expansion. The demo is available online at https://jitter.io.
2 Jitter Search
Our system listens to the Twitter 1 % public sample stream, indexes this
information in real time into Lucene1 indexes, and provides facilities for this data
to be searched in near real time. To improve results and provide a better
experience, the system also indexes all the tweets published by a curated list of
news sources and other media producers, which are later used for query expan-
sion. A manually curated source-topic mapping was created to build topical
shards, i.e., each shard indexes a set of sources and corresponding documents
by topic. Different topical shards are created inspired by the topics suggested
by Twitter in their account creation process and the subjects of the TREC
Microblog queries. We arrived at the following manually curated topic-based
1 https://lucene.apache.org.
2.1 Frontpage
The front page (see Fig. 1) presents a large traditional search box where the
users can enter keyword queries. In addition, a live news stream is shown as a
timeline below the search box with the most recent news appearing at the top.
It shows the news posted on Twitter by known media producers, hinting that
the system is capable of monitoring Twitter in real time. The user can observe
what news is being disseminated at the moment by different media outlets on
a variety of topics: places and events, brands and products, and other entities.
This feature enhances discoverability by letting the user peek into the stream of
news.
2.2 Searching
Once the user submits a query, the system shows the results page (see Fig. 2). The
system retrieves a ranked list of the most interesting tweets and presents them
in the main area of the results page in the middle. Each tweet is accompanied by
its age and authorship information including the username and a profile image.
This ranked list of tweets is obtained by augmenting the original query with
additional high quality terms extracted from a Twitter news corpus. First, the
system uses the original query to retrieve a set of candidate pseudo-relevant doc-
uments from the indexed news stream. Second, the system uses standard resource
selection algorithms over this candidate set to obtain a rank of the most probable
topics for the query. Then the most probable topics are selected and the candidate
set is filtered to remove news not belonging to these topics. Finally, the filtered can-
didate set of news is given as input to a pseudo-relevance feedback method and the
terms obtained are added to the original query to obtain the final expanded query,
which is used to retrieve the results from the tweets index.
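The expansion pipeline can be summarized with the sketch below; the search and resource-selection helpers stand in for the Lucene index calls and the actual selection algorithm, and term scoring by raw frequency is a simplification of the pseudo-relevance feedback step:

```python
from collections import Counter

def expand_query(query, search_news, rank_topics, n_docs=50, n_topics=2, n_terms=10):
    # 1) Pseudo-relevant news documents for the original query.
    candidates = search_news(query, k=n_docs)
    # 2) Resource selection: rank topical shards and keep the most probable ones.
    top_topics = set(rank_topics(candidates)[:n_topics])
    # 3) Filter out candidates that do not belong to the selected topics.
    filtered = [doc for doc in candidates if doc["topic"] in top_topics]
    # 4) Feedback: append the most frequent terms of the filtered set to the query.
    counts = Counter(t for doc in filtered for t in doc["text"].lower().split())
    expansion = [t for t, _ in counts.most_common(n_terms) if t not in query.lower().split()]
    return query + " " + " ".join(expansion)
```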
A sidebar on the left shows the top 5 topics and the top news sources detected
for that query. The user is able to override the topics selected automatically and
fine-tune their search to one of the topics presented. They can select the desired topic
and obtain a new ranked list of tweets. The sidebar on the right presents tweets
from the news stream, which are filtered by the topics selected at the time, i.e., the
topics selected automatically by the system or the user’s selected topic.
Fig. 3. Jitter world map: real-time heatmap of geotagged tweets (Color figure online).
The regions of the heatmap with a higher volume of tweets show up in shades
of red and other locations show up in shades of green.
TimeMachine: Entity-Centric Search
and Visualization of News Archives
1 Introduction
Online publication of news articles has become standard practice for news out-
lets, and the public has joined the movement using desktop or mobile terminals.
The resulting setup is a cooperative dialog between news outlets
and the public at large. The latest events are covered and commented on by both
parties on a continuous basis through social networks such as Twitter. At the same
time, it is necessary to convey how story elements develop over time and to
integrate the story into the larger context. This is extremely challenging when jour-
nalists have to deal with news archives that grow by thousands of articles every day.
Never before has computation been so tightly connected with the practice
of journalism. In recent years, the computer science community has researched [1–8]
and developed1 new ways of processing and exploring news archives to help jour-
nalists perceive news content with an enhanced perspective.
TimeMachine, as a computational journalism tool, brings together a set of
Natural Language Processing, Text Mining and Information Retrieval technolo-
gies to automatically extract and index entity-related knowledge from the news
articles [5–11]. It allows users to issue queries containing keywords and phrases
about news stories or events, and retrieves the most relevant entities mentioned
1 NewsExplorer (IBM Watson): http://ibm.co/1OsBO1a.
in the news articles through time. TimeMachine provides readable and user-friendly
insights and a visual perspective on news stories and the evolution of entities
over time, by presenting co-occurrence networks of public personalities men-
tioned in the news, using a force atlas algorithm [12] for the interactive and
real-time clustering of entities.
3 Demonstration
The setup for demonstration uses a news archive of Portuguese news. It com-
prises two different datasets: a repository from the main Portuguese news agency
(1990–2010), and a stream of online articles provided by the main web portal
in Portugal (SAPO) which aggregates news articles from 50 online newspapers.
At the time of writing, the news archive used in this demonstration comprises
over 12 million articles. The system is working
on a daily basis, processing articles as they are collected from the news stream.
TimeMachine allows users to explore its news archive through an entity search
box or by selecting a specific date. Both options are available on the website
homepage and in the top bar on every page. There are a set of “stories” recom-
mendations on the homepage suited for first time visitors. The entity search box
is designed to be the main entry point to the website as it is connected to the
entity retrieval module of TimeMachine.
Fig. 2. Cristiano Ronaldo page headline (left) and egocentric network (right).
Users may search for surface names of entities (e.g. “Cristiano Ronaldo”) if
they know which entities they are interested in exploring in the news, although the
most powerful queries are the ones containing keywords or phrases describing
topics or news stories, such as “eurozone crisis” or “ballon d’or nominees”. When
selecting an entity from the ranked list of results, users access the entity pro-
file page, which contains a set of automatically extracted entity-specific data:
name, profession, a set of news articles, quotations from the entity and related
entities. Figure 2, left side, represents an example of the entity profile headline.
The entity timeline allows users to navigate entity specific data through time. By
selecting a specific period, different news articles, quotations and related entities
are retrieved. Furthermore, users have the option of “view network” which con-
sists of an interactive network depicting connections among entities mentioned in
news articles for the selected time span. This visualization is depicted in Fig. 2,
right side, and it is implemented using the graph drawing library Sigma JS,
together with the “Force Atlas” algorithm for the clustering of entities. Nodes con-
sist of entities and edges represent a co-occurrence of mentioned entities in the
same news article. The size of the nodes and the width of the edges are proportional
to the number of mentions and co-occurrences, respectively. Different node col-
ors represent specific news topics where entities were mentioned. By selecting
a date interval on the homepage, instead of issuing a query, users get a global
interactive network of mentions and co-occurrences of the most frequent entities
mentioned in the news articles for the selected period of time.
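As an illustration of how such a network could be derived from per-article entity annotations (a sketch; the article-to-entities mapping is an assumed input, and the real system additionally tracks topics and time spans):

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(articles):
    """articles: iterable of sets of entity names mentioned in one news article."""
    mentions, edges = Counter(), Counter()
    for entities in articles:
        mentions.update(entities)                        # node size ~ number of mentions
        edges.update(combinations(sorted(entities), 2))  # edge width ~ co-occurrences
    return mentions, edges

nodes, links = build_cooccurrence_network([
    {"Cristiano Ronaldo", "Lionel Messi"},
    {"Cristiano Ronaldo", "Real Madrid"},
])
```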
As future work we plan to enhance TimeMachine with semantic extraction
and retrieval of relations between mentioned entities.
References
1. Demartini, G., Missen, M.M.S., Blanco, R., Zaragoza, H.: Taer: time aware entity
retrieval. In: CIKM. ACM, Toronto, Canada (2010)
2. Matthews, M., Tolchinsky, P., Blanco, R., Atserias, J., Mika, P., Zaragoza, H.:
Searching through time in the new york times. In: Human-Computer Interaction
and Information Retrieval, pp. 41–44 (2010)
3. Balog, K., de Rijke, M., Franz, R., Peetz, H., Brinkman, B., Johgi, I., Hirschel, M.:
Sahara: discovering entity-topic associations in online news. In: ISWC (2009)
4. Alonso, O., Berberich, K., Bedathur, S., Weikum, G.: Time-based exploration of
news archives. In: HCIR 2010 (2010)
5. Saleiro, P., Amir, S., Silva, M., Soares, C.: Popmine: tracking political opinion on
the web. In: IEEE IUCC (2015)
6. Teixeira, J., Sarmento, L., Oliveira, E.: Semi-automatic creation of a reference news
corpus for fine-grained multi-label scenarios. In: CISTI (2011)
7. Sarmento, L., Nunes, S., Teixeira, J., Oliveira, E.: Propagating fine-grained topic
labels in news snippets. In: IEEE/WIC/ACM WI-IAT (2009)
8. Abreu, C., Teixeira, J., Oliveira, E.: Encadear encadeamento automático de
notícias. Linguistica, Informatica e Traducao: Mundos que se Cruzam, Oslo Studies
in Language 7(1) (2015)
9. Saleiro, P., Rei, L., Pasquali, A., Soares, C.: Popstar at replab 2013: name ambi-
guity resolution on twitter. In: CLEF (2013)
10. Saleiro, P., Sarmento, L.: Piaf vs Adele: classifying encyclopedic queries using auto-
matically labeled training data. In: OAIR (2013)
11. Teixeira, J., Sarmento, L., Oliveira, E.: A bootstrapping approach for training a
NER with conditional random fields. In: Antunes, L., Pinto, H.S. (eds.) EPIA 2011.
LNCS, vol. 7026, pp. 664–678. Springer, Heidelberg (2011)
12. Jacomy, M., Venturini, T., Heymann, S., Bastian, M.: Forceatlas2, a continuous
graph layout algorithm for handy network visualization designed for the gephi
software. PLoS ONE (2014)
OPMES: A Similarity Search Engine
for Mathematical Content
1 Introduction
Mathematical content is widely used in technical documents such as the publi-
cations and course materials from STEM fields. To better utilize such a valuable
digitalized mathematical asset, it is important to offer search users the ability
to find similar mathematical expressions. For example, some students may want
to collect additional information about the formula that they have learned in
the class, and others may want to find an existing proof for an equation. Unfor-
tunately, major search engines do not support the similarity-based search for
mathematical content.
The goal of this paper is to present our efforts on developing OPMES
(Operator-tree Pruning based Math Expression Search), a similarity-based
search engine for mathematical content. Given a query written as a mathemati-
cal expression, the system will return a ranked list of relevant math expressions
from the underlying math collections.
Compared with existing mathematical search systems, such as MIaS1 , Tan-
gent2 , and Zentralblatt math from Math Web Search3 (MWS), the developed
OPMES is unique in that operator trees [1] are leveraged in all the system com-
ponents to enable efficient and effective search.
More specifically, OPMES parses a math expression into an operator tree,
and then extracts leaf-root paths from the operator tree to represent structural
1 https://mir.fi.muni.cz/mias/.
2 http://saskatoon.cs.rit.edu/tangent/random.
3 http://search.mathweb.org.
Fig. 1. (1) Operator tree representation of the math expression k(y + 2); (2) leaf-root paths generated from the operator tree; (3) tree index (after tokenizing).
2 System Description
We now provide details for the three major components of our system: the parser,
the indexer and the searcher.
The OPMES parser is responsible for extracting math-mode LaTeX markup from
HTML files. A LALR (look-ahead LR) parser implemented with Bison/Flex is
used to transform math-mode LaTeX markup into an in-memory operator tree.
posting lists under that common directory. This process is repeated recursively to
prune indexes (directories) that are not common at the deeper level. The second
step is to rank all the structurally relevant expressions identified in the first step
based on their symbolic similarities with the query. The scoring algorithm Mark
and Cross, which addresses both symbol set equivalence and α-equivalence, is
fully explained in the first author’s master thesis [5].
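To make the path-based representation concrete, the following sketch extracts leaf-root paths from a small operator tree; the node class is hypothetical, whereas OPMES builds the tree with its Bison/Flex parser:

```python
class OpNode:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def leaf_root_paths(node, path_above=()):
    """Return every leaf-to-root path as a tuple of labels, leaf first."""
    path = (node.label,) + path_above
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(leaf_root_paths(child, path))
    return paths

# k(y + 2) as times(k, add(y, 2)) yields ('k', 'times'), ('y', 'add', 'times'), ('2', 'add', 'times').
tree = OpNode("times", [OpNode("k"), OpNode("add", [OpNode("y"), OpNode("2")])])
print(leaf_root_paths(tree))
```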
3 Demonstration Plan
In our demo, we first illustrate some of the key ideas mentioned above. We will choose
a simple query and show its operator tree representation as an ASCII graph as well
as its leaf-root paths (through the output of the parser). Then we demonstrate the
structure of our index tree, and walk through the steps and directories where the
searcher goes to find relevant expressions for the input query. Users are invited to
enter queries and experience our search engine on a collection (with over
8 million math expressions crawled from the Math Stack Exchange website) that
contains the most frequently used and elementary math expressions.
References
1. Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions.
Int. J. Doc. Anal. Recogn. (IJDAR) 15(4), 331–357 (2012)
2. Ichikawa, H., Hashimoto, T., Tokunaga, T., Tanaka, H.: New methods of retrieve
sentences based on syntactic similarity. IPSJ SIG Technical Reports, pp. 39–46
(2005)
3. Hijikata, Y., Hashimoto, H., Nishida, S.: An investigation of index formats for the
search of mathml objects. In: Web Intelligence/IAT Workshops, pp. 244–248. IEEE
(2007)
4. Yokoi, K., Aizawa, A.: An approach to similarity search for mathematical expressions
using MathML. In: Towards a Digital Mathematics Library, Grand Bend, Ontario,
Canada (2009)
5. Zhong, W.: A Novel Similarity-Search Method for Mathematical Content in LaTeX-
Markup and Its Implementation (2015). http://tkhost.github.io/opmes/thesis-ref.
pdf
SHAMUS: UFAL Search and Hyperlinking
Multimedia System
1 Introduction
and the user needs to skim through the retrieved recordings, which can be very
tedious. Therefore, our system retrieves relevant video segments instead of full
videos. Relevant segments are retrieved in both Search and Hyperlinking com-
ponents, and the most informative segments are suggested by the Anchoring
component.
The information retrieval framework Terrier [6] is used in all system components.
We segment all recordings into 60-second long passages first and use them as
documents in our further setup. In the Search component, the retrieval is run on
the list of created segments and the segments relevant to the user-typed query
are retrieved. In the Hyperlinking component, the currently played segment is
converted into a text-based query using the subtitles and the same setup as the
one in the Search component is used. To formulate the Hyperlinking query, we
use the 20 most frequent words lying inside the query segment and filter out the
most common words using the stopwords list.
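A sketch of this query formulation step is given below; the subtitle text of the query segment and the stopword list are assumed inputs, and the actual retrieval over the 60-second passages is performed in Terrier:

```python
import re
from collections import Counter

def hyperlinking_query(segment_subtitles, stopwords, n_terms=20):
    """Build a text query from the 20 most frequent non-stopword words of a segment."""
    words = re.findall(r"[a-z']+", segment_subtitles.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return " ".join(term for term, _ in counts.most_common(n_terms))

query = hyperlinking_query("the talk is about coral reefs and why reefs are under threat",
                           {"the", "is", "about", "and", "why", "are", "under"})
```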
Both retrieval components, including the segmentation method, were tuned
on almost 2700 hours of BBC TV programmes provided in the Search and Hyper-
linking Task at the MediaEval 2014 Benchmark. The setup used in the Search
and Hyperlinking components is described in more detail in the Task report
papers [2,3]. Even though the proposed methods are simple, they outperformed
the rest of the methods submitted to the Benchmark. For the Anchoring, we
use our system proposed for the MediaEval 2015 Search and Anchoring Task
[4]. We assume that the most informative segments of videos are often similar
to the video description as the description usually provides the summary of the
document. Therefore, we convert the available metadata description of each file
into a textual query. Information retrieval is then applied on the video segments
and the highest-ranked segments are considered as the Anchoring ones. The
list of the Anchoring segments is pre-generated in advance for each video.
3 User Interface
The user interface consists of three main pages. The first one serves for the Search
query input. The second page displays the Search results, including the
metadata, the transcript of the beginning of the retrieved segment, and the time of this
segment. The video with its title, description, source, marked Anchoring seg-
ments, and list of related segments is displayed on the third page (see Fig. 2).
Fig. 2. Video player with detected most informative Anchoring segments and recom-
mended links to related video segments.
The JWPlayer is used for the video playback. The Anchoring segments are
marked as individual chapters of the video. The transcript of the beginning of
each segment is retrieved and used instead of the chapter name – users can thus
overview the most important segments of the video without the need to navigate
there. The list of three most related segments is displayed on the right side, next
to the video player.
SHAMUS demo interface currently works with a collection of 1219 TED
talks. We used a list of talks available in the TED dataset [7] and downloaded
available subtitles and videos for each talk. However, the system should be easily
adaptable to work with any collection of videos for which subtitles or automatic
transcripts and metadata descriptions are available.
References
1. Davidson, J., Liebald, B., Liu, J., Nandy, P., Van Vleet, T., Gargi, U., Gupta,
S., He, Y., Lambert, M., Livingston, B., Sampath, D.: The YouTube video rec-
ommendation system. In: Proceedings of RecSys, pp. 293–296, Barcelona, Spain
(2010)
2. Galuščáková, P., Kruliš, M., Lokoč, J., Pecina, P.: CUNI at MediaEval 2014 search
and hyperlinking task: visual and prosodic features in hyperlinking. In: Proceedings
of MediaEval, Barcelona, Spain (2014)
3. Galuščáková, P., Pecina, P.: CUNI at MediaEval 2014 search and hyperlinking task:
search task experiments. In: Proceedings of MediaEval, Barcelona, Spain (2014)
4. Galuščáková, P., Pecina, P.: CUNI at MediaEval 2015 search and anchoring in
video archives: anchoring via information retrieval. In: Proceedings of MediaEval,
Wurzen, Germany (2015)
5. Moyal, A., Aharonson, V., Tetariy, E., Gishri, M.: Phonetic Search Methods for
Large Speech Databases. Springer, New York (2013)
6. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier:
a high performance and scalable information retrieval platform. In: Proceedings
of ACM SIGIR 2006 Workshop on Open Source Information Retrieval, pp. 18–25,
Seattle, Washington, USA (2006)
7. Pappas, N., Popescu-Belis, A.: Combining content with user preferences for TED
lecture recommendation. In: Proceedings of CBMI, pp. 47–52. IEEE, Veszprém,
Hungary (2013)
8. Schafer, J.B., Frankowski, D., Herlocker, J., Sen, S.: Collaborative filtering recom-
mender systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web
2007. LNCS, vol. 4321, pp. 291–324. Springer, Heidelberg (2007)
9. Schoeffmann, K.: A user-centric media retrieval competition: the video browser
showdown 2012–2014. IEEE Multimedia 21(4), 8–13 (2014)
10. Schoeffmann, K., Hopfgartner, F., Marques, O., Boeszoermenyi, L., Jose, J.M.:
Video browsing interfaces and applications: a review. SPIE Rev. 1(1), 018004 (2010)
Industry Day
Industry Day Overview
1 Introduction
The goal of the Industry Day track is to bring an exciting programme that con-
tains a mix of invited talks by industry leaders with presentations of novel and
innovative ideas from the search industry. The final program consists of four
invited talks by Domonkos Tikk (Gravity), Etienne Sanson (Criteo), Debora
Donato (StumbleUpon) and Nicola Montecchio (Spotify), and four accepted pro-
posal talks.
Talk proposals were selected based on problem space, real-world system readi-
ness, and technical depth. The list of accepted proposal talks is:
events that occur in a newspaper article we can assume that aggressive events
(attack, hit, tackle) in sports-related articles are not assigned the semantic mean-
ing of similarly named events in warfare, etc. In this talk, we will present how
we use state of the art multilingual generative topic modeling using LDA in
combination with multiclass SVM classifiers to categorize documents into seven
base categories. This talk will focus on the engineering aspects of building a
classification system. In the talk we will present the software architecture, the
combination of open source software used, and the problems encountered while
implementing, testing and deploying the system. The focus will be on the use of
open source software in order to build and deploy state-of-the-art information
retrieval systems.
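The described combination could be prototyped with open source components such as scikit-learn, as in the sketch below; the feature sizes and the labeled training corpus are assumptions, and the production system described in the talk is multilingual and more elaborate:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Documents are mapped to topic proportions by LDA and then assigned to one of
# the seven base categories by a linear (one-vs-rest) SVM.
classifier = Pipeline([
    ("counts", CountVectorizer(max_features=50000, stop_words="english")),
    ("lda", LatentDirichletAllocation(n_components=100, learning_method="online")),
    ("svm", LinearSVC()),
])

# train_texts: list of documents; train_labels: one of the seven base categories.
# classifier.fit(train_texts, train_labels)
# predictions = classifier.predict(["Striker scores twice in yesterday's derby ..."])
```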
Thinking Outside the Search Box: A Visual Approach to Complex
Query Formulation. UXLabs is developing 2dSearch: a radical alternative
to traditional keyword search. Instead of a one-dimensional search box, users
express and manipulate concepts as objects on a two-dimensional canvas, using
a visual syntax that is simpler and more transparent than traditional query for-
mulation methods. This guides the user toward the formulation of syntactically
correct expressions, exposing their semantics in a comprehensible manner. Con-
cepts may be combined to form aggregate structures, such as lists (unordered
sets sharing a common operator) or composites (nested structures containing a
combination of sub-elements). In this way 2dSearch supports the modular cre-
ation of queries of arbitrary complexity. 2dSearch is deployed as a framework and
interactive development environment for managing complex queries and search
strategies. It provides support for all stages in the query lifecycle, from creation
and editing, through to sharing and execution. In so doing, it offers the potential
to improve query quality and efficiency and promote the adoption and sharing
of templates and best practices.
Get on with it! Recommender System Industry Challenges Move
Towards Real-World, Online Evaluation. Recommender systems have enor-
mous commercial importance in connecting people with content, information,
and services. Historically, the recommender systems community has benefited
from industry-academic collaborations and the ability of industry challenges to
drive forward the state of the art. However, today’s challenges look very dif-
ferent from the NetFlix Prize in 2009. This talk features speakers represent-
ing two ongoing recommendation challenges that typify the direction in which
such challenges are rapidly evolving. The first direction is the move to leverage
information beyond the user-item matrix. Success requires algorithms capable
of integrating multiple sources of information available in real-world scenarios.
The second direction is the move from evaluation on offline data sets to evalu-
ation with online systems. Success requires algorithms that are able to produce
recommendations that satisfy users, but that are also able to satisfy technical
constraints (i.e., response time, system availability) as well as business metrics
(i.e., item coverage). The challenges covered in this talk are supported by the
EC-funded project CrowdRec, which studies stream recommendation and user
engagement for next-generation recommender systems. We describe each in turn.
1 GESIS – Leibniz Institute for the Social Sciences, Unter Sachsenhausen 6-8, 50667 Cologne, Germany
[email protected]
2 Department of Computer Science and Technology, University of Bedfordshire, Luton, UK
[email protected]
3 Department of Computer Science, IRIT UMR 5505 CNRS, University of Toulouse, 118 Route de Narbonne, 31062 Toulouse Cedex 9, France
[email protected]
Abstract. The BIR workshop brings together experts in Bibliometrics and Infor‐
mation Retrieval. While sometimes perceived as rather loosely related, these
research areas share various interests and face similar challenges. Our motivation
as organizers of the BIR workshop stemmed from a twofold observation. First,
both communities only partly overlap, albeit sharing various interests. Second, it
will be profitable for both sides to tackle some of the emerging problems that
scholars face today when they have to identify relevant and high quality literature
in the fast growing number of electronic publications available worldwide.
Bibliometric techniques are not yet used widely to enhance retrieval processes in
digital libraries, although they offer value-added effects for users. Information
professionals working in libraries and archives, however, are increasingly
confronted with applying bibliometric techniques in their services. The first BIR
workshop in 2014 set the research agenda by introducing each group to the other,
illustrating state-of-the-art methods, reporting on current research problems, and
brainstorming about common interests. The second workshop in 2015 further
elaborated these themes. This third BIR workshop aims to foster a common
ground for the incorporation of bibliometric-enhanced services into scholarly
search engine interfaces. In particular we will address specific communities, as
well as studies on large, cross-domain collections like Mendeley and Research‐
Gate. This third BIR workshop addresses explicitly both scholarly and industrial
researchers.
1 Introduction
IR and Bibliometrics are two fields in Information Science that have grown apart in
recent decades. But today ‘big data’ scientific document collections (e.g., Mendeley,
ResearchGate) bring together aspects of crowdsourcing, recommendations, interactive
retrieval, and social networks. There is a growing interest in revisiting IR and biblio‐
metrics to provide cutting-edge solutions that help to satisfy the complex, diverse, and
long-term information needs that scientific information seekers have, in particular the
challenge of the fast growing number of publications available worldwide in workshops,
conferences and journals that have to be made accessible to researchers. This interest
was shown in the well-attended recent workshops, such as “Computational Sciento‐
metrics” (held at iConference and CIKM 2013), “Combining Bibliometrics and Infor‐
mation Retrieval” (at the ISSI conference 2013) and the previous BIR workshops at
ECIR. Exploring and nurturing links between bibliometric techniques and IR is bene‐
ficial for both communities (e.g., Abbasi and Frommholz, 2015; Cabanac, 2015;
Wolfram, 2015). The workshops also revealed that substantial future work in this direc‐
tion depends on a rise in ongoing awareness in both communities, manifesting itself in
tangible experiments/exploration supported by existing retrieval engines.
It is also of growing importance to combine bibliometrics and information retrieval
in real-life applications (see Jack et al., 2014; Hienert et al., 2015). These include moni‐
toring the research front of a given domain and operationalizing services to support
researchers in keeping up-to-date in their field by means of recommendation and inter‐
active search, for instance in ‘researcher workbenches’ like Mendeley /ResearchGate
or search engines like Google Scholar that utilize large bibliometric collections. The
resulting complex information needs require the exploitation of the full range of biblio‐
metric information available in scientific repositories. To this end, this third edition of
the BIR workshop will contribute to identifying and explorating further applications and
solutions that will bring together both communities to tackle this emerging challenging
task.
The first two bibliometric-enhanced Information Retrieval (BIR) workshops at ECIR
20141 and ECIR 20152 attracted more than 40 participants (mainly from academia) who
engaged in lively discussions and future actions. For the third BIR workshop3 we build
on this experience.
Our workshop aims to engage the IR community with possible links to bibliometrics.
Bibliometric techniques are not yet widely used to enhance retrieval processes in digital
libraries, yet they offer value-added effects for users (Mutschke et al., 2011). To give
an example, recent approaches have shown that alternative ranking methods based
on citation analysis can lead to enhanced IR.
Our interests include information retrieval, information seeking, science modelling,
network analysis, and digital libraries. Our goal is to apply insights from bibliometrics,
scientometrics, and informetrics to concrete, practical problems of information retrieval
and browsing. More specifically we ask questions such as:
1 http://www.gesis.org/en/events/events-archive/conferences/ecirworkshop2014/.
2 http://www.gesis.org/en/events/events-archive/conferences/ecirworkshop2015/.
3 http://www.gesis.org/en/events/events-archive/conferences/ecirworkshop2016/.
The workshop will start with an inspirational keynote by Marijn Koolen “Bibliometrics
in Online Book Discussions: Lessons for Complex Search Tasks” to kick-start thinking
and discussion on the workshop topic (for the keynote from 2015 see Cabanac, 2015).
This will be followed by paper presentations in a format that we found to be successful
at BIR 2014 and 2015: each paper is presented as a 10 min lightning talk and discussed
for 20 min in groups among the workshop participants followed by 1-minute pitches
from each group on the main issues discussed and lessons learned. The workshop will
conclude with a round-robin discussion of how to progress in enhancing IR with biblio‐
metric methods.
4 Audience
The audiences (or clients) of IR and bibliometrics partially overlap. Traditional IR serves
individual information needs, and is, consequently, embedded in libraries, archives
and collections alike. Scientometrics, and with it bibliometric techniques, has matured
in serving science policy.
We propose a half-day workshop that should bring together IR and DL researchers
with an interest in bibliometric-enhanced approaches. Our interests include information
retrieval, information seeking, science modelling, network analysis, and digital libraries.
The goal is to apply insights from bibliometrics, scientometrics, and informetrics to
concrete, practical problems of information retrieval and browsing.
The workshop is closely related to the BIR workshops at ECIR 2014 and 2015 and
strives to feature contributions from core bibliometricians and core IR specialists who
already operate at the interface between scientometrics and IR. In this workshop,
however, we focus more on real experimentations (including demos) and industrial
participation.
4 http://www.gesis.org/fileadmin/upload/issi2013/BMIR-workshop-ISSI2013-Larsen.pdf.
5 Output
The papers presented at the BIR workshops in 2014 and 2015 have been published in the
online proceedings http://ceur-ws.org/Vol-1143, http://ceur-ws.org/Vol-1344. We plan
to set up online proceedings for BIR 2016 again. Another output of our BIR initiative was
prepared after the ISSI 2013 workshop on “Combining Bibliometrics and Information
Retrieval” as a special issue in Scientometrics. This special issue has attracted eight high
quality papers and will appear in early 2015 (see Mayr and Scharnhorst, 2015). We aim
to have a similar dissemination strategy for the proposed workshop, but now oriented
towards core-IR. In this way we shall build up a sequence of explorations, visions, results
documented in scholarly discourse, and create a sustainable bridge between bibliometrics
and IR.
References
1 Motivations
In the last few years the phenomenon of multilingual information overload has
received significant attention due to the huge availability of information coded
in many different languages. We have in fact witnessed a growing popularity of
tools that are designed for collaborative editing by contributors across the
world, which has led to an increased demand for methods capable of effectively
and efficiently searching, retrieving, managing and mining collections of documents
written in different languages. The multilingual information overload phenom-
enon introduces new challenges to modern information retrieval systems. By
better searching, indexing, and organizing such rich and heterogeneous infor-
mation, we can discover and exchange knowledge at a larger world-wide scale.
1 http://events.dimes.unical.it/multilingmine/.
3 Advisory Board
4 Related Events
A COLING’08 workshop [1] was one of the earliest events that emphasized
the importance of analyzing multilingual document collections for information
extraction and summarization purposes. The topic also attracted attention from
the semantic web community: in 2014, [2] solicited works to discuss principles
on how to publish, link and access mono and multilingual knowledge data col-
lections; in 2015, another workshop [3] took place on similar topics in order to
allow researchers to continue addressing multilingual knowledge management prob-
lems. A tutorial on Multilingual Topic Models was presented at WSDM 2014 [4]
focusing on how to statistically model document collections written in different lan-
guages. In 2015, a WWW workshop aimed at advancing the state-of-the-art in
Multilingual Web Access [5]: the contributing papers covered different aspects
of multilingual information analysis, drawing attention to the limitations of current
information retrieval techniques and the necessity of new techniques especially
tailored to manage, search, analyze and mine multilingual textual information.
The main event related to our workshop is the CLEF initiative [6] which has
long provided a premier forum for the development of new information access and
evaluation strategies in multilingual contexts. However, differently from Multi-
LingMine, it has not emphasized research contributions on tasks such as
searching, indexing, mining and modeling of multilingual corpora.
Our intention is to continue the lead of previous events on multilingual-related
topics, but from a broader perspective that is relevant to various
information retrieval and document mining fields. We aim to solicit contri-
butions from scholars and practitioners in information retrieval who are inter-
ested in multi/cross-lingual document management, search, mining, and eval-
uation tasks. Moreover, differently from previous workshops, we emphasize
some specific trends, such as cross-view cross/multilingual IR, as well as the
increasingly tight interaction between knowledge-based and statistical/algorithmic
approaches for dealing with multilingual information overload.
References
1. Bandyopadhyay, S., Poibeau, T., Saggion, H., Yangarber, R.: Proceedings of the
Workshop on Multi-source Multilingual Information Extraction and Summarization
(MMIES). ACL (2008)
2. Chiarcos, C., McCrae J.P., Montiel, E., Simov, K., Branco, A., Calzolari, N.,
Osenova, P., Slavcheva, M., Vertan, C.: Proceedings of the 3rd Workshop on Linked
Data in Linguistics: Multilingual Knowledge Resources and NLP (LDL) (2014)
3. McCrae, J.P., Vulcu, G.: CEUR Proceedings of the 4th Workshop on the Multilin-
gual Semantic Web (MSW4), vol. 1532 (2015)
4. Moens, M.-F., Vulić, I.: Multilingual probabilistic topic modeling and its applica-
tions in web mining and search. In: Proceedings of the 7th ACM WSDM Conference
(2014)
5. Steichen, B., Ferro, N., Lewis, D., Chi, E.E.: Proceedings of the International Work-
shop on Multilingual Web Access (MWA) (2015)
6. The CLEF Initiative. http://www.clef-initiative.eu/
Proactive Information Retrieval: Anticipating
Users’ Information Need
1 Motivation
The ultimate goal of an IR system is to fulfill the user's information need. Traditionally, IR systems have been reactive in nature: the system reacts only after the information need has been expressed by the user as a query. Oftentimes, users are unable to express their needs clearly using a few keywords, which has led to considerable research effort to close the gap between the information need in the user's mind and the data residing in the system's index, through approaches such as query expansion, query suggestions, relevance feedback and click-through information, and personalization techniques that fine-tune the search results to users' liking. These techniques, though helpful, increase the complexity of the system and often also require additional effort from the
user, especially on a mobile device (typing queries, providing feedback, selecting and trying multiple suggestions, etc.). Given that search traffic from mobile devices is set to overtake that from desktops/PCs in the near future1, and that the growing popularity of Internet-enabled wearable devices (smartwatches, smart wrist bands, Google Glass, etc.) provides an ever-increasing amount of user data, it behooves the IR community to move beyond the keyword-query/ten-blue-links paradigm and to develop systems that are more proactive in nature and can anticipate and fulfill the user's information need with minimal effort from the user. Push models have increasingly replaced pull models on various platforms, and the result pages of commercial search engines have changed to display results in ways that are more easily consumable by the user.
Systems like Google Now, Apple Siri, and IBM Watson proactively provide useful and personalized information such as weather updates, flight and traffic alerts, trip planning2, etc., and even try to answer often noisy and underspecified user questions. Examples include suggesting nearby restaurants given a user's location and preferences (vegetarian, specific cuisines, etc.), suggesting fitness articles or diet plans based on exercise logs, and suggesting possible recipients based on an email's content.
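As a concrete illustration of this kind of proactive suggestion, the following minimal Python sketch filters and ranks restaurants by a user's stated preferences and proximity. All data, field names, and the distance threshold are invented for illustration and do not describe how any of the systems mentioned above actually work.

"""Toy sketch of a proactive suggestion rule: given a user's location and
stated preferences, surface nearby restaurants without an explicit query.
Everything here is an illustrative assumption."""

from dataclasses import dataclass
from math import hypot
from typing import List, Set, Tuple


@dataclass
class Restaurant:
    name: str
    cuisine: str
    vegetarian_friendly: bool
    x: float  # simplified planar coordinates instead of real geo-coordinates
    y: float


def proactive_suggestions(restaurants: List[Restaurant],
                          user_xy: Tuple[float, float],
                          cuisines: Set[str],
                          vegetarian: bool,
                          max_distance: float = 2.0) -> List[str]:
    """Rank restaurants that match the user's profile and are close enough."""
    candidates = []
    for r in restaurants:
        distance = hypot(r.x - user_xy[0], r.y - user_xy[1])
        if distance > max_distance:
            continue                      # too far away to be useful
        if vegetarian and not r.vegetarian_friendly:
            continue                      # violates a hard dietary preference
        if cuisines and r.cuisine not in cuisines:
            continue                      # not among the preferred cuisines
        candidates.append((distance, r.name))
    return [name for _, name in sorted(candidates)]


if __name__ == "__main__":
    places = [
        Restaurant("Trattoria Verde", "italian", True, 0.5, 0.2),
        Restaurant("Steak Corner", "grill", False, 0.3, 0.1),
        Restaurant("Curry House", "indian", True, 1.5, 1.0),
    ]
    # Prints the vegetarian-friendly Italian and Indian options, nearest first.
    print(proactive_suggestions(places, (0.0, 0.0), {"italian", "indian"}, True))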
The underlying theme of the workshop will be systems that can proactively
anticipate and fulfill the information needs of the user. To cover the various
aspects of proactive IR systems, the topic areas of the workshop include (but
are not limited to):
4 Intended Audience
We expect to bring together researchers from both industry and academia with
diverse backgrounds spanning information retrieval, natural language process-
ing, speech processing, user modeling and profiling, data mining, machine learn-
ing, human computer interface design, to propose new ideas, identify promising
research directions and potential challenges. We also wish to attract practitioners
who seek novel ideals for applications. Participation and Selection Process:
We plan to encourage attendance and attract quality submissions by:
Abstract. The news industry has gone through seismic shifts in the
past decade with digital content and social media completely redefining
how people consume news. Readers check for accurate fresh news from
multiple sources throughout the day using dedicated apps or social media
on their smartphones and tablets. At the same time, news publishers rely
more and more on social networks and citizen journalism as a frontline
to breaking news. In this new era of fast-flowing instant news delivery
and consumption, publishers and aggregators have to overcome a great
number of challenges. These include the verification or assessment of a
source’s reliability; the integration of news with other sources of infor-
mation; real-time processing of both news content and social streams in
multiple languages, in different formats and in high volumes; dedupli-
cation; entity detection and disambiguation; automatic summarization;
and news recommendation. Although Information Retrieval (IR) applied
to news has been a popular research area for decades, fresh approaches
are needed due to the changing type and volume of media content avail-
able and the way people consume this content. The goal of this workshop
is to stimulate discussion around new and powerful uses of IR applied
to news sources and the intersection of multiple IR tasks to solve real
user problems. To promote research efforts in this area, we released a new
dataset consisting of one million news articles to the research community
and introduced a data challenge track as part of the workshop.
2 Workshop Goals
The main goal of the workshop is to bring together scientists conducting rele-
vant research in the field of news and information retrieval. In particular, scien-
tists can present their latest breakthroughs with an emphasis on the application
of their findings to research from a wide range of areas including: information
retrieval; natural language processing; journalism (including data journalism);
network analysis; and machine learning. This will facilitate discussion and debate
about the problems we face and the solutions we are exploring, hopefully finding common ground and potential synergies between different approaches. We
aim to have a substantial representation from industry, from small start-ups to
large enterprises, to strengthen their relationships with the academic community.
This also represents a unique opportunity to understand the different problems and priorities of each community and to recognize areas that are not currently receiving much academic attention but are nonetheless of considerable commercial interest. Finally, to accompany the workshop, we have released a new dataset suitable for conducting research on news IR. We describe the dataset in the next section. Detailed information about the workshop can be found on the workshop website2.
1 https://groups.google.com/forum/#!forum/news-ir.
The workshop also includes a panel discussion with members drawn from
academia, from large companies and from SMEs. This includes Dr. Jochen
Leidner (Thomson Reuters), Dr. Gabriella Kazai (Lumi) and Dr. Julio Gonzalo
(UNED). This panel focuses on the commonalities and differences between the
communities as they face related challenges in news-based information retrieval.
5 Programme Committee
References
1. De Francisci Morales, G., Gionis, A., Lucchese, C.: From chatter to headlines: harnessing the real-time web for personalized news recommendation. In: Proceedings of WSDM (2012)
2. Mathioudakis, M., Koudas, N.: Twittermonitor: trend detection over the Twitter
stream. In: Proceedings of SIGMOD (2010)
3. Aslam, J., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., Sakai, T.: TREC temporal summarization. In: Proceedings of TREC (2013)
Tutorials
Collaborative Information Retrieval:
Concepts, Models and Evaluation
2 Outline
Part 1: Collaborative Information Retrieval Fundamental Notions. In this part, our primary objective is to provide a broad review of collaborative search by presenting a detailed notion of collaboration in a search context, including its definition [6,20], dimensions [2,5], paradigms [4,9], and the underlying behavioral search process [3,7].
possible types of division of labor, only the algorithmic and the role-based ones are appropriate for the CIR domain [9]. One main line of approaches relies on search-strategy differences between collaborators through roles [12,15–17,19], whereas the algorithmic division of labor considers that users have similar objectives (in this case, users can be seen as peers) [4,11]. The latter approaches are contrasted with those built around a division of labor guided by collaborators' roles, in which users are characterized by asymmetric roles with distinct search strategies or intrinsic peculiarities.
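As an illustration of the algorithmic division of labor discussed above, the following minimal Python sketch splits a shared ranked result list across collaborators so that no two users review the same document. The round-robin policy and all names are illustrative assumptions, not the mechanism of any cited system.

"""Illustrative sketch of an algorithmic division of labor in collaborative
search: a shared ranked list is partitioned so collaborators avoid reviewing
the same documents. The round-robin rule below is a simplifying assumption."""

from typing import Dict, List


def split_results_round_robin(ranked_docs: List[str],
                              collaborators: List[str]) -> Dict[str, List[str]]:
    """Assign documents to collaborators in rank order, one at a time."""
    assignment: Dict[str, List[str]] = {user: [] for user in collaborators}
    for rank, doc in enumerate(ranked_docs):
        user = collaborators[rank % len(collaborators)]
        assignment[user].append(doc)
    return assignment


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(1, 11)]   # a shared ranked result list
    team = ["alice", "bob"]                    # peers with similar objectives
    for user, portion in split_results_round_robin(docs, team).items():
        print(user, portion)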
References
1. Azzopardi, L., Pickens, J., Sakai, T., Soulier, L., Tamine, L.: Ecol: first international
workshop on the evaluation on collaborative information seeking and retrieval. In:
CIKM 2015, pp. 1943–1944 (2015)
2. Capra, R., Velasco-Martin, J., Sams, B.: Levels of “working together” in collabo-
rative information seeking and sharing. In: CSCW 2010, ACM (2010)
3. Evans, B.M., Chi, E.H.: An elaborated model of social search. Inf. Process. Manag.
(IP&M) 46(6), 656–678 (2010)
4. Foley, C., Smeaton, A.F.: Synchronous collaborative information retrieval: tech-
niques and evaluation. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy,
C. (eds.) ECIR 2009. LNCS, vol. 5478, pp. 42–53. Springer, Heidelberg (2009)
5. Golovchinsky, G., Qvarfordt, P., Pickens, J.: Collaborative information seeking.
IEEE Comput. 42(3), 47–51 (2009)
6. Gray, B.: Collaborating: Finding Common Ground for Multiparty Problems. Jossey
Bass Business and Management Series. Jossey-Bass, San Francisco (1989)
7. Hyldegård, J.: Beyond the search process - exploring group members' information behavior in context. IP&M 45(1), 142–158 (2009)
8. Joho, H., Hannah, D., Jose, J.M.: Revisiting IR techniques for collaborative search
strategies. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR
2009. LNCS, vol. 5478, pp. 66–77. Springer, Heidelberg (2009)
9. Kelly, R., Payne, S.J.: Division of labour in collaborative information seeking:
current approaches and future directions. In: CIS Workshop at CSCW 2013, ACM
(2013)
10. Morris, M.R.: Collaborative search revisited. In: CSCW 2013, pp. 1181–1192. ACM
(2013)
11. Morris, M.R., Teevan, J., Bush, S.: Collaborative web search with personalization:
groupization, smart splitting, and group hit-highlighting. In: CSCW 2008, pp. 481–
484. ACM (2008)
12. Pickens, J., Golovchinsky, G., Shah, C., Qvarfordt, P., Back, M.: Algorithmic medi-
ation for collaborative exploratory search. In: SIGIR 2008, pp. 315–322. ACM
(2008)
13. Shah, C.: Collaborative Information Seeking - The Art and Science of Making the
Whole Greater than the Sum of All. pp. I–XXI, 1–185. Springer, Heidelberg (2012)
14. Shah, C., González-Ibáñez, R.: Evaluating the synergic effect of collaboration in
information seeking. In: SIGIR 2011, pp. 913–922. ACM (2011)
15. Shah, C., Pickens, J., Golovchinsky, G.: Role-based results redistribution for collab-
orative information retrieval. Inf. Process. Manag. (IP&M) 46(6), 773–781 (2010)
16. Soulier, L., Shah, C., Tamine, L.: User-driven system-mediated collaborative infor-
mation retrieval. In: SIGIR 2014, pp. 485–494. ACM (2014)
17. Soulier, L., Tamine, L., Bahsoun, W.: On domain expertise-based roles in collab-
orative information retrieval. Inf. Process. Manag. (IP&M) 50(5), 752–774 (2014)
18. Spence, P.R., Reddy, M.C., Hall, R.: A survey of collaborative information seek-
ing practices of academic researchers. In: SIGGROUP Conference on Supporting
Group Work, GROUP 2005, pp. 85–88. ACM (2005)
19. Tamine, L., Soulier, L.: Understanding the impact of the role factor in collaborative
information retrieval. In: CIKM 2015, ACM, October 2015
20. Twidale, M.B., Nichols, D.M., Paice, C.D.: Browsing is a collaborative process.
Inf. Process. Manag. (IP&M) 33(6), 761–783 (1997)
Group Recommender Systems: State of the Art,
Emerging Aspects and Techniques, and Research
Challenges
Ludovico Boratto
1 Tutorial Outline
Recommender systems are designed to provide information items that are expected to interest a user [11]. Given their capability to increase revenue in commercial environments, they are nowadays employed by major websites such as Amazon and Netflix.
Group recommender systems are a class of systems designed for contexts in
which more than one person is involved in the recommendation process [7]. Group
recommendation has been highlighted as a challenging research area, with the
first survey on the topic [7] being placed in the Challenges section of the widely-
known book “The Adaptive Web”, and recent research indicating it as a future
direction in recommender systems, since it presents numerous open issues and
challenges [10].
Compared with classic recommendation, a system that works with groups has to complete a set of additional, specific tasks. This tutorial will present how state-of-the-art approaches in the literature handle these tasks in order to produce recommendations for groups.
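As a concrete example of such group-specific tasks, the following minimal Python sketch shows two classic preference-aggregation strategies, average satisfaction and least misery, applied to hypothetical predicted ratings. The data and names are invented, and real systems in the literature use considerably richer models.

"""Minimal sketch of two classic preference-aggregation strategies used in
group recommendation: average satisfaction and least misery. All ratings and
item names are made up for illustration."""

from typing import Dict, List, Tuple

# Hypothetical predicted ratings per group member (user -> item -> score).
predicted = {
    "anna":  {"movie_a": 4.5, "movie_b": 3.0, "movie_c": 2.0},
    "marco": {"movie_a": 4.0, "movie_b": 3.5, "movie_c": 4.5},
    "lucia": {"movie_a": 1.0, "movie_b": 3.5, "movie_c": 4.0},
}


def aggregate(ratings: Dict[str, Dict[str, float]],
              strategy: str) -> List[Tuple[str, float]]:
    """Rank items for the whole group under the chosen aggregation strategy."""
    items = next(iter(ratings.values())).keys()
    scores = {}
    for item in items:
        member_scores = [user_ratings[item] for user_ratings in ratings.values()]
        if strategy == "average":
            scores[item] = sum(member_scores) / len(member_scores)
        elif strategy == "least_misery":
            scores[item] = min(member_scores)  # the least satisfied member decides
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    print("average:     ", aggregate(predicted, "average"))
    print("least misery:", aggregate(predicted, "least_misery"))

With these ratings, the average strategy ranks movie_c first while least misery prefers movie_b, showing how the choice of aggregation strategy changes the group recommendation.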
2 Target Audience
This tutorial is aimed at anyone interested in the topic of producing recommendations for groups of users, from data mining and machine learning researchers to practitioners from industry. For those not familiar with group recommendation or recommender systems in general, the tutorial will cover the necessary background material to understand these systems and will provide a state-of-the-art survey of the topic. Additionally, the tutorial aims to offer a new perspective that will be valuable and interesting even for more experienced researchers who work in this domain, by presenting recent advances in the area and by illustrating the current research challenges.
3 Instructor
Ludovico Boratto is a research assistant at the University of Cagliari (Italy). His main research area is recommender systems, with a special focus on those that work with groups of users and in social environments. In 2010 and 2014, he spent 10 months at the Yahoo! Lab in Barcelona as a visiting researcher.
References
1. Amer-Yahia, S., Omidvar-Tehrani, B., Basu Roy, S., Shabib, N.: Group recommen-
dation with temporal affinities. In: Proceedings of 18th International Conference
on Extending Database Technology (EDBT), pp. 421–432. OpenProceedings.org
(2015)
2. Boratto, L., Carta, S.: State-of-the-art in group recommendation and new
approaches for automatic identification of groups. In: Soro, A., Vargiu, E., Armano,
G., Paddeu, G. (eds.) IRMDE 2010. SCI, vol. 324, pp. 1–20. Springer, Heidelberg
(2010)
3. Boratto, L., Carta, S.: Using collaborative filtering to overcome the curse of dimen-
sionality when clustering users in a group recommender system. In: ICEIS 2014 -
Proceedings of 16th International Conference on Enterprise Information Systems,
vol. 2. pp. 564–572. SciTePress (2014)
labs in practice are now in place and are available to the community. Specifically, the Living Labs for IR Evaluation (LL4IR)1 initiative runs as a benchmarking campaign at CLEF, but it also operates monthly challenges so that people do not have to wait for a yearly evaluation cycle. The most recent initiative is the OpenSearch track at TREC2, which focuses on academic literature search. Understanding the differences between online and offline evaluation is still a largely unexplored area of research. Much of the fundamental research in this space has not yet happened because of the lack of experimental resources available to the academic community. With recent developments, we believe that online evaluation will be an exciting area to work on in the future. The motivation for this tutorial is twofold: (1) to raise awareness and promote this form of evaluation (i.e., online evaluation with living labs) in the community, and (2) to help people get started by working through all the steps of the development and deployment process, using the LL4IR evaluation platform.
This half-day tutorial aims to provide a comprehensive overview of the underlying theory and to complement it with practical guidance. The tutorial is organized in two 1.5-hour sessions with a break in between. Each session interleaves theoretical, practical, and interactive elements to keep the audience engaged. For the practical parts, we break with the traditional format by using hands-on instructional techniques. We make use of an online tool, called DataJoy,3 that proved invaluable in our previous classroom experience. This allows participants to (1) run Python code in a browser window without having to install anything locally, (2) follow the presenter's screen on their own laptop, and (3), at the same time, have their own private copy of the project in a different browser tab.
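To give a flavour of what the hands-on part looks like, the sketch below outlines a participant-side loop against a living-labs-style evaluation API in Python. The base URL, endpoint paths, payload fields, and credential are illustrative placeholders only; the actual LL4IR API should be taken from its own documentation.

"""Sketch of the participant-side loop against a living-labs-style evaluation
API. All endpoint paths and payload fields below are placeholders, not the
real LL4IR interface."""

import requests

API = "http://example-living-lab.net/api/participant"  # placeholder base URL
KEY = "YOUR-PARTICIPANT-KEY"                            # placeholder credential


def get_queries():
    # Fetch the frozen set of queries participants are asked to rank for.
    return requests.get(f"{API}/query/{KEY}").json()


def upload_run(query_id, ranked_doc_ids):
    # Upload a ranking; the live site can then interleave it with the
    # production ranking and later expose click-based outcomes.
    payload = {"qid": query_id, "doclist": [{"docid": d} for d in ranked_doc_ids]}
    return requests.put(f"{API}/run/{KEY}/{query_id}", json=payload).status_code


def get_outcome():
    # Retrieve per-query outcomes (e.g., wins against the production system).
    return requests.get(f"{API}/outcome/{KEY}").json()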
1 http://living-labs.net.
2 http://trec-open-search.org/.
3 http://getdatajoy.com.
References
1. Balog, K., Kelly, L., Schuth, A.: Head first: living labs for ad-hoc search evaluation.
In: CIKM 2014, pp. 1815–1818. ACM Press, New York, USA, November 2014
2. Belkin, N.J.: Salton award lecture: people, interacting with information. In: Pro-
ceedings of 38th International ACM SIGIR Conference on Research and Develop-
ment in Information Retrieval, SIGIR 2015, pp. 1–2. ACM (2015)
3. Chuklin, A., Markov, I., de Rijke, M.: Click Models for Web Search. Synthesis
Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool
Publishers, San Rafael (2015)
4. Cleverdon, C.W., Keen, M.: Aslib Cranfield research project-factors determining
the performance of indexing systems; Volume 2, Test results, National Science
Foundation (1966)
5. Diaz, F., White, R., Buscher, G., Liebling, D.: Robust models of mouse move-
ment on dynamic web search results pages. In: CIKM, pp. 1451–1460. ACM Press,
October 2013
6. Guo, Q., Agichtein, E.: Understanding “abandoned” ads: towards personalized
commercial intent inference via mouse movement analysis. In: SIGIR-IRA (2008)
7. Guo, Q., Agichtein, E.: Towards predicting web searcher gaze position from mouse movements. In: CHI EA, p. 3601, April 2010
8. Hassan, A., Shi, X., Craswell, N., Ramsey, B.: Beyond clicks: query reformulation
as a predictor of search satisfaction. In: CIKM (2013)
9. He, J., Zhai, C., Li, X.: Evaluation of methods for relative comparison of retrieval
systems based on clickthroughs. In: CIKM 2009, ACM (2009)
10. He, Y., Wang, K.: Inferring search behaviors using partially observable Markov model with duration (POMD). In: WSDM (2011)
11. Hersh, W., Turpin, A.H., Price, S., Chan, B., Kramer, D., Sacherek, L., Olson, D.:
Do batch and user evaluations give the same results? In: SIGIR, pp. 17–24 (2000)
12. Hofmann, K., Whiteson, S., de Rijke, M.: A probabilistic method for inferring
preferences from clicks. In: CIKM 2011, ACM (2011)
13. Huang, J., Lin, T., White, R.W.: No search result left behind. In: WSDM, p. 203 (2012)
14. Joachims, T., Granka, L.A., Pan, B., Hembrooke, H., Radlinski, F., Gay, G.: Eval-
uating the accuracy of implicit feedback from clicks and query reformulations in
web search. ACM Trans. Inf. Syst. 25(2), 7 (2007)
15. Kim, Y., Hassan, A., White, R., Zitouni, I.: Modeling dwell time to predict click-
level satisfaction. In: WSDM (2014)
16. Kohavi, R.: Online controlled experiments: introduction, insights, scaling, and
humbling statistics. In: Proceedings of UEO 2013 (2013)
17. Lagun, D., Hsieh, C.H., Webster, D., Navalpakkam, V.: Towards better measurement of attention and satisfaction in mobile search. In: SIGIR (2014)
18. Li, J., Huffman, S., Tokuda, A.: Good abandonment in mobile and pc internet
search. In: SIGIR 2009, pp. 43–50 (2009)
19. Liu, T.-Y.: Learning to Rank for Information Retrieval. Springer, Heidelberg (2011)
20. Radlinski, F., Craswell, N.: Optimized interleaving for online retrieval evaluation.
In: WSDM 2013, ACM (2013)
21. Radlinski, F., Kurup, M., Joachims, T.: How does clickthrough data reflect retrieval
quality? In: CIKM 2008, ACM (2008)
22. Sanderson, M.: Test collection based evaluation of information retrieval systems.
Found. Trends Inf. Retrieval 4(4), 247–375 (2010)
23. Schuth, A., Balog, K., Kelly, L.: Overview of the Living Labs for Information Retrieval Evaluation (LL4IR) CLEF lab. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, pp. 484–496. Springer, Heidelberg (2015)
24. Schuth, A., Bruintjes, R.-J., Büttner, F., van Doorn, J., Groenland, C., Oosterhuis,
H., Tran, C.-N., Veeling, B., van der Velde, J., Wechsler, R., Woudenberg, D., de
Rijke, M.: Probabilistic multileave for online retrieval evaluation. In: Proceedings
of SIGIR (2015)
25. Schuth, A., Hofmann, K., Radlinski, F.: Predicting search satisfaction metrics with
interleaved comparisons. In: SIGIR 2015 (2015)
26. Schuth, A., Hofmann, K., Whiteson, S., de Rijke, M.: Lerot: an online learning to
rank framework. In: LivingLab 2013, pp. 23–26. ACM Press, November 2013
27. Schuth, A., Sietsma, F., Whiteson, S., Lefortier, D., de Rijke, M.: Multileaved
comparisons for fast online evaluation. In: CIKM 2014 (2014)
28. Song, Y., Shi, X., White, R., Hassan, A.: Context-aware web search abandonment
prediction. In: SIGIR (2014)
29. Teevan, J., Dumais, S., Horvitz, E.: The potential value of personalizing search.
In: SIGIR, pp. 756–757 (2007)
30. Turpin, A., Hersh, W.: Why batch and user evaluations do not give the same
results. In: SIGIR, pp. 225–231 (2001)
31. Turpin, A., Scholer, F.: User performance versus precision measures for simple search tasks. In: SIGIR, pp. 11–18 (2006)
32. Wang, K., Walker, T., Zheng, Z.: PSkip: estimating relevance ranking quality from
web search clickthrough data. In: KDD, pp. 1355–1364 (2009)
33. Wang, K., Gloy, N., Li, X.: Inferring search behaviors using partially observable
Markov (POM) model. In: WSDM (2010)
34. Yilmaz, E., Verma, M., Craswell, N., Radlinski, F., Bailey, P.: Relevance and effort:
an analysis of document utility. In: CIKM (2014)
35. Yue, Y., Joachims, T.: Interactively optimizing information retrieval systems as a
dueling bandits problem. In: ICML 2009, pp. 1201–1208 (2009)
Real-Time Bidding Based Display Advertising:
Mechanisms and Algorithms
1 Introduction
2 Outline
The tutorial will be structured as follows:
1. Background
   (a) The history and evolution of computational advertising
   (b) The emergence of RTB
2. The framework and platform
   (a) Auction mechanisms
   (b) The current eco-system of RTB
   (c) Mining RTB auctions
3. Research problems and techniques
   (a) Dynamic pricing and information matching with economic constraints
   (b) Click-through rate and conversion prediction
   (c) Bidding strategies
   (d) Attribution models
   (e) Fraud detection
4. Datasets and evaluations
   (a) Datasets and evaluation methodologies
   (b) Live test and APIs
5. Panel discussion: research challenges and opportunities for the IR community
4 Description of Topics
Online advertising is now one of the fastest-advancing areas in the IT industry. In display and mobile advertising, the most significant development in recent years is the growth of Real-Time Bidding (RTB), which allows online display advertising to be sold and bought per ad impression in real time [16]. Since its emergence, RTB has fundamentally changed the landscape of the digital media market by scaling the buying process across a large number of available inventories. It also encourages behaviour targeting, audience extension, look-alike modelling, etc., and marks a significant shift toward buying driven by user data rather than contextual data.
To support evaluation and promote research in the field, we will also cover a few datasets that are publicly available.
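To make the per-impression mechanics concrete, the following toy Python sketch shows bidders turning a predicted click-through rate into a bid and an exchange running a second-price auction. The linear bidding rule and all numbers are illustrative assumptions, not a prescription from the tutorial.

"""Toy sketch of the per-impression RTB flow: each bidder converts a predicted
click-through rate into a bid, and the exchange runs a second-price auction.
All values below are invented for illustration."""

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Bidder:
    name: str
    value_per_click: float  # advertiser's value of a click for this campaign

    def bid(self, predicted_ctr: float) -> float:
        # Simple truthful linear rule: bid the expected value of the impression.
        return self.value_per_click * predicted_ctr


def second_price_auction(bidders: List[Bidder],
                         predicted_ctr: Dict[str, float]) -> Tuple[str, float]:
    """Return the winner and the price it pays (the second-highest bid)."""
    bids = sorted(((b.bid(predicted_ctr[b.name]), b.name) for b in bidders),
                  reverse=True)
    winner, runner_up = bids[0], bids[1]
    return winner[1], runner_up[0]


if __name__ == "__main__":
    bidders = [Bidder("dsp_a", 2.0), Bidder("dsp_b", 1.5), Bidder("dsp_c", 3.0)]
    ctr = {"dsp_a": 0.010, "dsp_b": 0.020, "dsp_c": 0.008}  # per-bidder CTR estimates
    winner, price = second_price_auction(bidders, ctr)
    print(f"impression won by {winner}, paying the second price {price:.4f}")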
References
1. Chen, B., Yuan, S., Wang, J.: A dynamic pricing model for unifying programmatic
guarantee and real-time bidding in display advertising. In: Proceedings of the 8th
International Workshop on Data Mining for Online Advertising (ADKDD 2014),
p. 9 (2014). http://arxiv.org/abs/1405.5189
2. Elmeleegy, H., Li, Y., Qi, Y., Wilmot, P., Wu, M., Kolay, S., Dasdan, A., Chen, S.:
Overview of turn data management platform for digital advertising. Proc. VLDB
Endowment 6(11), 1138–1149 (2013)
3. Ghosh, A., Rubinstein, B.: Adaptive bidding for display advertising. In: Proceed-
ings of the 18th International Conference on World Wide Web, pp. 251–260 (2009).
http://dl.acm.org/citation.cfm?id=1526744
4. Gummadi, R., Key, P.B., Proutiere, A.: Optimal bidding strategies in dynamic
auctions with budget constraints. In: 2011 49th Annual Allerton Conference on
Communication, Control, and Computing, Allerton 2011, p. 588 (2011)
5. Harvey, M., Crestani, F., Carman, M.J.: Building user profiles from topic models
for personalised search. In: Proceedings of the 22nd ACM International Conference
on Information & Knowledge Management - CIKM 2013, pp. 2309–2314 (2013).
http://dl.acm.org/citation.cfm?d=2505515.2505642
6. Jaworska, J., Sydow, M.: Behavioural targeting in on-line advertising: an empirical
study. In: Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.)
WISE 2008. LNCS, vol. 5175, pp. 62–76. Springer, Heidelberg (2008)
7. Kitts, B., Wei, L., Au, D., Powter, A., Burdick, B.: Attribution of conversion events
to multi-channel media. In: Proceedings - IEEE International Conference on Data
Mining, ICDM, pp. 881–886 (2010)
8. McMahan, H.B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L.,
Phillips, T., Davydov, E., Golovin, D., Chikkerur, S., Liu, D., Wattenberg, M.,
Hrafnkelsson, A.M., Boulos, T., Kubica, J.: Ad click prediction: a view from the
trenches. In: Proceedings of the 19th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 1222–1230 (2013). http://dl.acm.org/
citation.cfm?id=2488200
9. Shao, X., Li, L.: Data-driven multi-touch attribution models. In: Proceedings of
the 17th ACM SIGKDD, pp. 258–264 (2011)
10. Stitelman, O., Perlich, C., Dalessandro, B., Hook, R., Raeder, T., Provost, F.:
Using co-visitation networks for detecting large scale online display advertising
exchange fraud. In: Proceedings of the 19th ..., pp. 1240–1248 (2013). http://dl.
acm.org/citation.cfm?id=2487575.2488207
11. Tagami, Y., Ono, S., Yamamoto, K., Tsukamoto, K., Tajima, A.: CTR prediction
for contextual advertising. In: Proceedings of the Seventh International Workshop
on Data Mining for Online Advertising - ADKDD 2013 (2013)
12. Wang, J., Chen, B.: Selling futures online advertising slots via option contracts. In:
WWW, vol. 1000, pp. 627–628 (2012). http://dl.acm.org/citation.cfm?id=2188160
13. Yan, J., Liu, N., Wang, G., Zhang, W.: How much can behavioral targeting help online advertising? In: Proceedings of the 18th International Conference on World Wide Web (WWW 2009), pp. 261–270 (2009). http://portal.acm.org/citation.cfm?id=1526745
14. Yuan, S., Chen, B., Wang, J., Mason, P., Seljan, S.: An empirical study of reserve price optimisation in real-time bidding. In: Proceedings of the 20th ACM SIGKDD Conference (2014)
15. Yuan, S., Wang, J.: Sequential selection of correlated ads by POMDPs. In: Pro-
ceedings of the 21st ACM International Conference on Information and Knowl-
edge Management - CIKM 2012, p. 515 (2012). http://dl.acm.org/citation.cfm?
id=2396761.2396828
16. Yuan, S., Wang, J., Zhao, X.: Real-time bidding for online advertising: measure-
ment and analysis. In: Proceedings of the Seventh International Workshop on Data
Mining for Online Advertising (2013). http://dl.acm.org/citation.cfm?id=2501980
17. Zhang, L., Guan, Y.: Detecting click fraud in pay-per-click streams of online adver-
tising networks. In: Proceedings - The 28th International Conference on Distributed
Computing Systems, ICDCS 2008, pp. 77–84 (2008)
Author Index