Information Storage
and Retrieval
CS418
Dr. Ebtsam AbdelHakam
Computer Science Dept.
Minia University
What is Information Retrieval?
• You have a collection of documents
‣ Books, web pages, journal articles, photographs, video clips,
tweets, a weather database, …
• You have an information need (query)
‣ “How many species of sparrow are native to New England?”
‣ “Find a new musician I’d enjoy listening to.”
‣ “Is it cold outside?”
• You want the documents that best satisfy that need
Web Search
Site-specific Search
Product Search
But also grouping related documents
And mining the web for knowledge
And answering everyday questions
Course Goals
• To help you understand the fundamentals of search
engines.
‣ How to crawl, index, and search documents
‣ How to evaluate and compare different search engines
‣ How to modify search engines for specific applications
• To provide broad coverage of the major issues in
information retrieval
• To take a closer look at particular applications of
Information Retrieval in industry
Course Materials
• Suggested books:
‣ Search Engines: Information Retrieval in Practice,
by Croft, Metzler, and Strohman
‣ Introduction to Information Retrieval, by Manning,
Raghavan, and Schütze
‣ Available for free online!
Course Topics
• Architecture of a search engine
• Data acquisition / web crawling
• Text representation
• Information extraction
• Indexing
• Query processing
• Ranking
• Evaluation
• Classification and clustering
• Social search
• More…
A brief history of IR
• Search of digital libraries was one of the earliest tasks
computers were used for.
• By the 1950s, rudimentary search systems could find
documents that contained particular terms.
• Documents were ranked based on how often the specific
search terms appeared in them — term frequency
weighting
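As an illustrative sketch of this kind of term-frequency ranking (the function and document names here are invented for the example, not part of any historical system):

```python
from collections import Counter

def tf_score(query_terms, doc_text):
    """Score a document by how often the query terms occur in it."""
    counts = Counter(doc_text.lower().split())
    return sum(counts[t] for t in query_terms)

docs = {
    "d1": "sparrow species of new england sparrow habitats",
    "d2": "weather report for new england",
}
query = ["sparrow", "england"]

# Rank documents by descending term-frequency score
ranked = sorted(docs, key=lambda d: tf_score(query, docs[d]), reverse=True)
```

Here "d1" outranks "d2" simply because the query terms occur in it more often.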
A brief history of IR
• In the 60s, new techniques were developed that treated a document as a
term vector.
‣ Using a “bag of words” model: assuming that the number of
occurrences of each term matters but term order does not
‣ A query can also be represented as a term vector, and the vectors can
be compared to measure similarity between the document and query
• Work also started on clustering documents with similar content
• The concept of relevance feedback was introduced: the best few
documents returned are assumed to be relevant, and documents that are
similar to them are assumed to be relevant to the original query as well.
• Some of the first commercial systems appeared in the 60s, sold to
companies who wanted to search their private records
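The term-vector and bag-of-words ideas above can be sketched in a few lines (a toy illustration with invented names, not a production implementation):

```python
import math
from collections import Counter

def vectorize(text):
    """Bag of words: term -> occurrence count; word order is discarded."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# A query is just another (short) term vector, compared the same way
doc = vectorize("the cat sat on the mat")
query = vectorize("cat mat")
sim = cosine(query, doc)
```

Documents can then be ranked by their cosine similarity to the query vector; note that "the cat sat" and "sat the cat" produce identical vectors, which is exactly the bag-of-words assumption.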
A brief history of IR
• Before the Internet, search was mainly about finding documents in your own collection
• The emphasis was largely on recall — making sure you find every relevant document
• Documents were mainly text files, and did not contain references to other documents
• With the arrival of the Internet, all of this changed
‣ Collection sizes jumped to billions of documents
‣ Documents are structured in networks, providing extra relevance information, and
often have other useful metadata (e.g. how many Facebook likes?)
‣ You can’t possibly know what’s in every document
‣ A “document” can be pages long or just 140 characters, or could be an image or
video clip, a file download, an abstract fact, or something else entirely
‣ You usually care more about precision — making sure your first few results are
relevant — because people only look at the first few results (except for when they
don’t…)
Challenges of IR
• Text documents are generally free-form
‣ The metadata is there, but you have to find it
‣ Most web pages contain lots of extra content —
ads, navigation bars, comments — that might or
might not be of interest
‣ Spam filtering is hard
• Searching multimedia content has its own challenges
‣ What are the features? How do you extract them?
Challenges of IR
• Running a query is hard
‣ You have less than one second to search the full text of
billions of documents to find the best ten matches
‣ …and the user only gave you two or three words
‣ …and one was misspelled, and one was “the”
‣ …and maybe throw a good relevant ad in, so you can
pay the bills
• Working at web scale means massive distributed systems,
sub-linear algorithms, and careful use of heuristics
Challenges of IR
• Comparing the query text to the document text and
determining what is a good match is the core issue of
information retrieval
‣ Exact matching of words is not enough
‣ Many different ways to write the same thing in a “natural
language” like English
‣ e.g., does a news story containing the text “bank
director in Amherst steals funds” match the query “bank
scandals in western mass”?
‣ Some stories will be better matches than others
Relevance
• What is relevance?
• Simple (and simplistic) definition: A relevant
document contains the information that a person
was looking for when they submitted a query to the
search engine
• Many factors influence a person’s decision about
what is relevant: e.g., task, context, novelty, style
Relevance
• Retrieval models define a particular view of
relevance based on some idea of what users want
• Ranking algorithms used in search engines are
based on retrieval models
• Most models are based on statistical properties of
text rather than deep linguistic analysis
• i.e., counting simple text features such as words
instead of parsing and analyzing the sentences
Users and Information Needs
• Search evaluation is user-centered
• Keyword queries are often poor descriptions of
actual information needs
• Interaction and context are important for
understanding user intent
• Query refinement techniques such as query
expansion, query suggestion, and relevance feedback
improve ranking
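One of these techniques, query expansion via pseudo-relevance feedback, can be sketched roughly as follows (a simplified toy version with invented names; real systems weight candidate terms much more carefully):

```python
from collections import Counter

def expand_query(query_terms, top_docs, k=2):
    """Assume the top-ranked docs are relevant; add their k most
    common terms (excluding the original query terms) to the query."""
    counts = Counter()
    for text in top_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    return list(query_terms) + [term for term, _ in counts.most_common(k)]

# Top-ranked documents from an initial retrieval pass (made up here)
top_docs = ["bank director steals funds", "bank funds scandal in amherst"]
expanded = expand_query(["bank", "scandals"], top_docs)
```

The expanded query now also contains "funds", so a second retrieval pass can match relevant documents that never used the original wording.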
Research and Industry
• A search engine is the practical application of information
retrieval techniques to large scale text collections
• Web search engines are the best-known examples, but
there are many others
• Open source search engines are important for research
and development
• e.g., Lucene, Lemur/Indri, Galago
• Researchers are focused on many, but not all, of the
tasks that industry search engines care about
Research and Industry
Research Tasks
• Relevance
‣ Effective ranking
• Evaluation
‣ Testing and measuring
• Information needs
‣ User interaction

Search Engines
• Performance
‣ Efficient search and indexing
• Incorporating new data
‣ Coverage and freshness
• Scalability
‣ Growing with data and users
• Adaptability
‣ Tuning for applications
• Specific problems
‣ e.g. Spam
Search Engine Issues
• Performance
• Measuring and improving the efficiency of search
• e.g., reducing response time, increasing query
throughput, increasing indexing speed
• Indexes are data structures designed to improve
search efficiency
• Designing and implementing them are major
issues for search engines
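As a rough sketch of the idea (hypothetical names, ignoring compression, term positions, and ranking), an inverted index maps each term to the set of documents that contain it, so a query only touches the lists for its own terms instead of scanning every document:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query_terms):
    """Return docs containing all query terms (boolean AND)."""
    sets = [index.get(t, set()) for t in query_terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "web search engines", 2: "search and indexing", 3: "web crawling"}
index = build_index(docs)
hits = search(index, ["web", "search"])
```

Real engines store far more in each posting (frequencies, positions, scores) and compress the lists, but the lookup-then-intersect pattern is the same.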
Search Engine Issues
• Dynamic data
• The “collection” for most real applications is constantly
changing in terms of updates, additions, deletions
• e.g., web pages
• Acquiring or “crawling” the documents is a major task
• Typical measures are coverage (how much has been
indexed) and freshness (how recently was it indexed)
• Updating the indexes while processing queries is also a
design issue
Search Engine Issues
• Scalability
• Making everything work with millions of users
every day, and many terabytes of documents
• Distributed processing is essential
• Adaptability
• Changing and tuning search engine components
such as ranking algorithms, indexing strategies,
interfaces for different applications
Search Engine Issues
• Spam
• For web search, spam in all its forms is one of the major
issues
• Affects the efficiency of search engines and, more
seriously, the effectiveness of the results
• Proliferation of spam varieties
• e.g. spamdexing or term spam, link spam, “optimization”
• New subfield called adversarial IR, since spammers are
“adversaries” with different goals