Section a-UNIT 1
Introduction: History of IR, Components of IR, The IR Problem, The IR
System, The Software Architecture of the IR System, The impact of the
web on IR, The role of artificial intelligence (AI) in IR, IR Versus Web
Search, Components of a Search engine. [5 Hours]
https://en.wikipedia.org/wiki/Information_retrieval#:~:text=The%20first
%20description%20of%20a,1957%20romantic%20comedy%2C
%20Desk%20Set.
Information retrieval (IR) in computing and information science is the process of
obtaining information system resources that are relevant to an information need from a collection
of those resources. Searches can be based on full-text or other content-based indexing.
Information retrieval is the science of searching for information in a document, searching for
documents themselves, and also searching for the metadata that describes data, and
for databases of texts, images or sounds.
Automated information retrieval systems are used to reduce what has been called information
overload. An IR system is a software system that provides access to books, journals and other
documents; it also stores and manages those documents. Web search engines are the most visible
IR applications.
In 1992, the US Department of Defense, along with the National Institute of Standards and
Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER
text program. Its aim was to support the information retrieval community by supplying
the infrastructure needed to evaluate text retrieval methodologies on a very large
text collection. This catalyzed research on methods that scale to huge corpora. The introduction
of web search engines has boosted the need for very large scale retrieval systems even further.
Applications.
Areas where information retrieval techniques are employed include (the entries are in alphabetical
order within each category):
General applications
  Digital libraries
  Information filtering
    o Recommender systems
  Media search
    o Blog search
    o Image retrieval
    o 3D retrieval
    o Music retrieval
    o News search
    o Speech retrieval
    o Video retrieval
  Search engines
    o Site search
    o Desktop search
    o Enterprise search
    o Federated search
    o Mobile search
    o Social search
    o Web search
Domain-specific applications
https://www.geeksforgeeks.org/what-is-information-retrieval/
Information Retrieval (IR) can be defined as a software process that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information Retrieval is the activity of obtaining material
of an unstructured nature, usually text, that satisfies an information need from within large
collections stored on computers. For example, an information retrieval process begins when a
user enters a query into the system.
Librarians and professional searchers are no longer the only people engaged in information
retrieval: nowadays hundreds of millions of people engage in IR every day when they use
web search engines. Information Retrieval is believed to be the dominant form of information
access. The IR system assists users in finding the information they require, but it does not
explicitly return answers to their questions. Instead, it informs them of the existence and
location of documents that might contain the required information. Information retrieval also
supports users in browsing or filtering document collections and in processing a set of
retrieved documents. The system searches over billions of documents stored on millions of
computers. Spam filtering is one example: email programs provide manual or automatic means of
classifying mail so that it can be placed directly into particular folders.
An IR system has the ability to represent, store, organize, and access information items. A set
of keywords is required to search. Keywords are the terms people enter into search
engines; they summarize the description of the information being sought.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has asked
for in the form of a query. The documents and the queries are
represented in a similar manner, so that document selection and ranking can be formalized by a
matching function that returns a retrieval status value (RSV) for each document in the
collection. Many Information Retrieval systems represent document contents by a set of
descriptors, called terms, belonging to a vocabulary V. An IR model determines the
query-document matching function according to four main approaches:
The estimation of the probability of user’s relevance rel for each document d and
query q with respect to a set Rq of training documents: Prob(rel | d, q, Rq)
Types of IR Models
Components of Information Retrieval/ IR Model/ Software Architecture of IR System
Acquisition: In this step, documents and other objects are selected from various
web resources that consist of text-based documents. The required data is
collected by web crawlers and stored in the database.
Representation: It consists of indexing, which involves free-text terms, controlled
vocabulary, and both manual and automatic techniques. Example: abstracting involves
summarizing, and the bibliographic description contains author, title, sources, data,
and metadata.
File Organization: There are two basic types of file organization methods.
Sequential: documents are stored document by document. Inverted: the file is organized term
by term, with a list of records under each term. A combination of both is also possible.
Query: An IR process starts when a user enters a query into the system. Queries are
formal statements of information needs, for example, search strings in web search
engines. In information retrieval, a query does not uniquely identify a single object
in the collection. Instead, several objects may match the query, perhaps with
different degrees of relevancy.
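The inverted file organization described above can be illustrated with a minimal sketch. The toy collection, the function name build_inverted_index, and the whitespace tokenization are assumptions made for illustration; a real system would also apply the text operations discussed elsewhere in these notes.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy collection, stored "sequentially" (document by document).
docs = {
    1: "information retrieval systems",
    2: "data retrieval from databases",
    3: "web search engines",
}

index = build_inverted_index(docs)
print(index["retrieval"])  # -> [1, 2]
```

The sequential form (docs) answers "which terms are in document 2?", while the inverted form (index) answers "which documents contain this term?", which is the question a query asks.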
Information Retrieval: The software program that deals with the organization, storage,
retrieval, and evaluation of information from document repositories, particularly textual
information.
Data Retrieval: Data retrieval deals with obtaining data from a database management system
such as an ODBMS. It is a process of identifying and retrieving the data from the database,
based on the query provided by the user or application.
Information Retrieval: Small errors are likely to go unnoticed.
Data Retrieval: A single error object means total failure.
Information Retrieval: Does not provide a solution to the user of the database system.
Data Retrieval: Provides solutions to the user of the database system.
The User Task: The user first has to translate the information need into a query. In
an information retrieval system, the query is a set of words that conveys the semantics of the
required information, whereas in a data retrieval system a query expression is used to
convey the constraints that the objects must satisfy. Example: a user sets out to search for
one thing but ends up looking at something else. This means that the user is browsing and
not searching. The above figure shows the interaction of the user through different tasks.
Logical View of the Documents: A long time ago, documents were represented
through a set of index terms or keywords. Nowadays, modern computers can represent
documents by their full set of words. The set of representative keywords can then be
reduced by eliminating stop words, i.e. articles and connectives. These
operations are called text operations. They reduce the complexity of the
document representation from full text to a set of index terms.
1. Early Developments: As the need for information grew, it
became necessary to build data structures that give faster access. The index is the data
structure for faster retrieval of information. For centuries, indexes were built through
manual categorization into hierarchies.
2. Information Retrieval in Libraries: Libraries were the first to adopt IR systems for
information retrieval. The first generation consisted of the automation of previous
technologies, and search was based on author name and title. The second generation added
searching by subject heading, keywords, etc. The third generation added graphical interfaces,
electronic forms, hypertext features, etc.
3. The Web and Digital Libraries: The web is cheaper than various other sources of
information, it provides greater access to networks due to digital communication, and it
gives free access to publish on a larger medium.
https://www.geeksforgeeks.org/issues-in-information-retrieval/
https://www.upgrad.com/blog/information-retrieval-system-explained/
Impact of Web on Information Retrieval
The science surrounding search engines is commonly referred to as information retrieval, in
which algorithmic principles are developed to match user interests to the best information about
those interests.
Google started as a result of our founders' attempt to find the best matching between the user
queries and Web documents, and do it really fast. During the process, they uncovered a few basic
principles: 1) best pages tend to be those linked to the most; 2) best description of a page is
often derived from the anchor text associated with the links to a page. Theories were
developed to exploit these principles to optimize the task of retrieving the best documents
for a user query.
Search and Information Retrieval on the Web has advanced significantly from those early days:
1) the notion of "information" has greatly expanded from documents to much richer
representations such as images, videos, etc., 2) users are increasingly searching on their Mobile
devices with very different interaction characteristics from search on the Desktops; 3) users are
increasingly looking for direct information, such as answers to a question, or seeking to complete
tasks, such as appointment booking. Through our research, we are continuing to enhance and
refine the world's foremost search engine by aiming to scientifically understand the implications
of those changes and address new challenges that they bring.
Information retrieval is a key technology for knowledge management. It deals with the search for
information and the representation, storage and organization of knowledge. Information retrieval
is concerned with search processes in which a user needs to identify a subset of information
which is relevant for his information need within a large amount of knowledge. The information
seeker formulates a query trying to describe his information need. The query is compared to
document representations which were extracted during an indexing phase. The representations of
documents and queries are typically matched by a similarity function such as the Cosine. The
most similar documents are presented to the users who can evaluate the relevance with respect to
their problem (Belkin, 2000). The problem of properly representing documents and matching
imprecise representations soon led to the application of techniques developed within
Artificial Intelligence to information retrieval.
BACKGROUND
In the early days of computer science, information retrieval (IR) and artificial intelligence
(AI) developed in parallel. In the 1980s, they started to cooperate and the term intelligent
information retrieval was coined for AI applications in IR. In the 1990s, information retrieval
saw a shift from set-based Boolean retrieval models to ranking systems like the vector space
model and probabilistic approaches. These approximate reasoning systems opened the door for
more intelligent value added components. The large amount of text documents available in
professional databases and on the internet has led to a demand for intelligent methods in text
retrieval and to considerable research in this area. The need for better preprocessing to extract
more knowledge from data has become an important way to improve systems. Off the shelf
approaches promise worse results than systems adapted to users, domain and information needs.
Today, most techniques developed in AI have been applied to retrieval systems with more or less
success. When data from users is available, systems often use machine learning to optimize
their results.
Artificial intelligence methods are employed throughout the standard information retrieval
process and for novel value added services. The first section gives a brief overview of
information retrieval. The subsequent sections are organized along the steps in the retrieval
process and give examples for applications.
Information Retrieval
Information retrieval deals with the storage and representation of knowledge and the retrieval of
information relevant for a specific user problem. The information seeker formulates a query
trying to describe his information need. The query is compared to document representations. The
representations of documents and queries are typically matched by a similarity function such as
the Cosine or the Dice coefficient. The most similar documents are presented to the users who
can evaluate the relevance with respect to their problem.
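The two similarity functions named above can be sketched as follows. This is an illustrative sketch, not a reference implementation: documents and queries are assumed to be sparse term-weight dictionaries, the Dice coefficient is computed in its set form (term presence only), and the names cosine, dice, query and doc are hypothetical.

```python
import math


def cosine(q, d):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def dice(q, d):
    """Dice coefficient on term sets: 2 * |Q and D| / (|Q| + |D|)."""
    q_terms, d_terms = set(q), set(d)
    total = len(q_terms) + len(d_terms)
    return 2 * len(q_terms & d_terms) / total if total else 0.0


# Hypothetical query and document representations.
query = {"house": 1.0, "price": 1.0}
doc = {"house": 2.0, "garden": 1.0}

print(round(cosine(query, doc), 3))
print(dice(query, doc))
```

Both functions return 0 when the query and document share no terms, and larger values the more their representations overlap, which is what allows the system to sort documents by similarity.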
Indexing usually consists of several phases. First, after word segmentation, stop words are
removed. These common words like articles or prepositions contain little meaning by themselves
and are ignored in the document representation. Second, word forms are transformed into their
basic form, the stem. During the stemming phase, e.g., houses would be transformed into house.
For the document representation, different word forms are usually not necessary. Third, the
terms are weighted. The importance of a word for a document can differ: some words describe
the content of a document better than others.
This weight is determined by the frequency of a stem within the text of a document (Savoy,
2003).
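The indexing phases described above can be sketched in a few lines. The stop-word list is a tiny assumed sample, and naive_stem is a deliberately crude stand-in that only strips a plural "s"; production systems use a proper stemming algorithm. The weight here is the raw frequency of each stem.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "are", "is", "to"}


def naive_stem(word):
    """Toy stemmer: strip a trailing plural 's' (not a real stemming algorithm)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def index_terms(text):
    """Segment words, drop stop words, stem, and count stem frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


weights = index_terms("The houses in the village are old houses")
print(weights["house"])  # -> 2
```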
In multimedia retrieval, the context is essential for the selection of a form of query and
document representation. Different media representations may be matched against each other or
transformations may become necessary (e.g. to match terms against pictures or spoken language
utterances against documents in written text).
As information retrieval needs to deal with vague knowledge, exact processing methods are
not appropriate. Vague retrieval models like the probabilistic model are more suitable. Within
these models, terms are provided with weights corresponding to their importance for a document.
These weights mirror different levels of relevance.
The results of current information retrieval systems are usually sorted lists of documents
where the top results are more likely to be relevant according to the system. In some approaches,
the user can judge the documents returned to him and tell the system which ones are relevant for
him. The system then re-sorts the result set. Documents which contain many of the words present
in the relevant documents are ranked higher. This relevance feedback process is known to greatly
improve the performance. Relevance feedback is also an interesting application for machine
learning. Based on human decisions, the optimization step can be modeled with several
approaches, e.g. with rough sets (Singh & Dey 2005). In Web environments, a click is often
interpreted as an implicit positive relevance judgment (Joachims & Radlinski, 2007).
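One classical way to realize relevance feedback, which the text above does not name, is a Rocchio-style query update: the query vector is moved toward the average of the documents judged relevant. The sketch below uses only the positive part of that idea; the parameter values alpha and beta and all example data are assumptions.

```python
def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Return a new query vector shifted toward the centroid of relevant documents."""
    new_query = {term: alpha * w for term, w in query.items()}
    for doc in relevant_docs:
        for term, w in doc.items():
            # Each relevant document contributes its share of the centroid.
            new_query[term] = new_query.get(term, 0.0) + beta * w / len(relevant_docs)
    return new_query


# Hypothetical query and two documents the user marked as relevant.
query = {"jaguar": 1.0}
relevant = [
    {"jaguar": 1.0, "car": 2.0},
    {"car": 1.0, "engine": 1.0},
]

updated = rocchio(query, relevant)
print(sorted(updated, key=updated.get, reverse=True))
```

After the update the query also carries weight for "car" and "engine", so documents containing the words of the relevant documents score higher when the result set is re-sorted.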
In order to represent documents in natural language, the content of these documents needs to
be analyzed. This is a hard task for computer systems. Robust semantic analysis for large text
collections or even multimedia objects has yet to be developed. Therefore, text documents are
represented by natural language terms mostly without syntactic or semantic context. This is often
referred to as the bag-of-words approach. These keywords or terms can only imperfectly
represent an object because their context and relations to other terms are lost.
However, great progress has been made and systems for semantic analysis are getting
competitive. Advanced syntactic and semantic parsing for robust processing of mass data has
been derived from computational linguistics (Hartrumpf, 2006).
For application and domain specific knowledge, another approach is taken to improve the
representation of documents. The representation scheme is enriched by exploiting knowledge
about concepts of the domain (Lin & Demner-Fushman, 2006).
Once the representation has been derived, a crucial aspect of an information retrieval system
is the similarity calculation between query and document representation. Most systems use
mathematical similarity functions such as the Cosine. The decision for a specific function is
based on heuristics or empirical evaluations. Several approaches use machine learning for long
term optimization of the matching between term and document. E.g. one approach applies
genetic algorithm to adapt a weighting function to a collection (Almeida et al., 2007).
Neural networks have been applied widely in IR. Several network architectures have been
applied for retrieval tasks, most often the so-called spreading activation networks are used.
Spreading activation networks are simple Hopfield-style networks, however, they do not use the
learning rule of Hopfield networks. They typically consist of two layers representing terms and
documents. The weights of connections between the layers are bi-directional and initially set
according to the results of the traditional indexing and weighting algorithms (Belkin, 2000). The
neurons corresponding to the terms of the user’s query are activated in the term layer and
activation spreads along the weights into the document layer and back. Activation represents
relevance or interest and reaches potentially relevant terms and documents. The most highly
activated documents are presented to the user as result. A closer look at the models reveals that
they very much resemble the traditional vector space model of Information Retrieval (Mandl,
2000). It is not until after the second step that the associative nature of the spreading activation
process leads to results different from a vector space model. The spreading activation networks
successfully tested with mass data do not take advantage of this associative property. In some
systems the process is halted after only one step from the term layer into the document layer,
whereas others make one more step back to the term layer to facilitate learning (Kwok &
Grunfeld, 1996).
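The two-layer spreading activation process described above can be sketched as follows. This is a simplified illustration under assumed data: weights, spread_activation and the three toy documents are hypothetical, and no learning rule is modeled. With one step the result equals a plain dot product between query and document vectors; the second step reaches a document that shares no term with the query.

```python
def spread_activation(weights, query_terms, steps=1):
    """Return document activations after `steps` passes from terms to documents.

    weights maps each document to {term: weight}; the same weight is used in
    both directions, mirroring the bi-directional connections in the text.
    """
    term_act = {t: 1.0 for t in query_terms}
    doc_act = {}
    for step in range(steps):
        # Term layer -> document layer.
        doc_act = {
            doc: sum(w * term_act.get(t, 0.0) for t, w in terms.items())
            for doc, terms in weights.items()
        }
        if step == steps - 1:
            break
        # Document layer -> term layer (activates associated terms).
        term_act = {}
        for doc, terms in weights.items():
            for t, w in terms.items():
                term_act[t] = term_act.get(t, 0.0) + w * doc_act[doc]
    return doc_act


# Hypothetical term-document weights.
weights = {
    "d1": {"cat": 1.0, "pet": 1.0},
    "d2": {"pet": 1.0, "dog": 1.0},
    "d3": {"car": 1.0},
}

one_step = spread_activation(weights, ["cat"], steps=1)
two_step = spread_activation(weights, ["cat"], steps=2)
print(one_step)
print(two_step)
```

In this toy example d2 does not contain "cat" and stays inactive after one step, but it is activated in the second step through the shared term "pet".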
Queries in information retrieval systems are usually short and contain few words. Longer
queries have a higher probability of achieving good results. As a consequence, systems try to add
good terms to a query entered by a user. Several techniques have been applied. Either these terms
are taken from top ranked documents or terms similar to the original ones are used. Another
technique is to use terms from documents from the same category. For this task, classification
algorithms from machine learning are used (Sebastiani, 2002).
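The first expansion technique mentioned above, taking terms from top-ranked documents, can be sketched as follows. The function name expand_query, its parameters top_k and n_new, and the example ranking are assumptions; the sketch simply picks the most frequent new terms and does no term weighting.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, n_new=1):
    """Add the n_new most frequent non-query terms from the top_k ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    new_terms = [term for term, _ in counts.most_common(n_new)]
    return list(query_terms) + new_terms


# Hypothetical initial result list, best match first.
ranked_docs = [
    "python programming language tutorial",
    "python language reference",
    "python snake habitat",
]

print(expand_query(["python"], ranked_docs))  # -> ['python', 'language']
```

Only the top two documents are consulted here, so the lower-ranked document about snakes does not influence the expanded query.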
Link analysis applies well known measures from bibliometric analysis to the Web. The
number of links pointing to a Web page is used as an indicator for its quality (Borodin et al., 2005).
PageRank assigns an authority value to each Web page which is primarily a function of its back
links. Additionally, it assumes that links from pages with high authority should be weighed
higher and should result in a higher authority for the receiving page. To account for the different
values each page has to distribute, the algorithm is carried out iteratively until the result
converges (Borodin et al., 2005). Machine Learning approaches complement link analysis.
Decisions of humans about the quality of Web pages are used to determine design features of
these pages which are good indicators of their quality. Machine learning models are applied to
determine the quality of pages not judged yet (Mandl, 2006, Marti & Hearst, 2002).
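The iterative computation behind PageRank can be sketched as follows. This is a textbook-style illustration rather than the production algorithm: the damping factor of 0.85, the convergence tolerance, the handling of pages without outgoing links, and the three-page link graph are all assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate authority values until they converge.

    links maps every page to the list of pages it links to; every linked
    page must itself appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page distributes its authority evenly over its out-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Assumption: a page without out-links spreads evenly to all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        converged = sum(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical link graph: A and B both link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # -> ['C', 'A', 'B']
```

C receives two back links and ranks highest. A has only one back link, but it comes from the high-authority page C, so A ranks well above B, which has none.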
Learning from users has been an important strategy to improve systems. In addition to the
content, artificial intelligence methods have been used to improve the user interface.
Adaptive information retrieval approaches intend to tailor the results of a system to one user
and his interests and preferences. The most popular representation scheme relies on the
representation scheme used in information retrieval where a document-term-matrix stores the
importance or weight of each term for each document. When a term appears in a document, this
weight should be different from zero. User interest can also be stored like a document. Then the
interest is a vector of terms. These terms can be ones that a user has entered or selected in a user
interface or which the system has extracted from documents for which the user has shown
interest by viewing or downloading them (Agichtein et al., 2006).
An example of such a system is UCAIR, which can be installed as a browser
plugin. UCAIR relies on a standard web search engine to obtain a search result and a primary
ranking. This ranking is then modified by re-ranking the documents based on implicit
feedback and a stored user interest profile (Shen et al., 2005).
Most systems use this method of storing the user interest in a term vector. However, this
method has several drawbacks. The interest profile may not be stable and the user may have a
variety of diverging interests for work and leisure which are mixed in one profile.
Advanced individualization techniques personalize the underlying system functions. The
results of empirical studies have shown that relevance feedback is an effective technique to
improve retrieval quality. Learning methods for information retrieval need to extend the range of
relevance feedback effects beyond the modification of the query in order to achieve long-term
adaptation to the subjective point of view of the user. The mere change of the query often results
in improved quality; however, the information is lost after the current session.
Some systems change the document representation according to the relevance feedback
information. In a vector space metaphor, the relevant documents are moved toward the query
representation. This approach also has some problems. Because only a fraction of the
documents are affected by the modifications, the basic data from the indexing process is changed
to a somewhat heterogeneous state. The original indexing result is not available anymore.
Certainly, this technique is inadequate for fusion approaches where several retrieval methods
are combined. In this case, several basic representations would need to be changed according to
the influence of the corresponding methods on the relevant documents. The indexes are usually
heterogeneous, which is often considered an advantage of fusion approaches. A high
computational overhead would be the consequence.
The MIMOR (Multiple Indexing and Method-Object Relations) approach does not rely on
changes to the document or the query representation when processing relevance feedback
information for personalization. Instead, it focuses on the central aspect of a retrieval function,
the calculation of the similarity between document and query. Like other fusion methods,
MIMOR accepts the results of individual retrieval systems as if from a black box. These results
are fused by a linear combination which is stored during many sessions. The weights for the
systems experience a change through learning. They adapt according to relevance feedback
information provided by users and create a long-term model for future use. That way, MIMOR
learns which systems were successful in the past (Mandl & Womser-Hacker, 2004).
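The idea of fusing black-box results by a learned linear combination can be sketched as follows. This is not the published MIMOR update rule; the additive reward in update_weights, the learning rate, and the two hypothetical systems "boolean" and "vector" are assumptions chosen only to illustrate how weights could adapt to relevance feedback.

```python
def fuse(scores_by_system, weights):
    """Combine per-system document scores by a weighted sum."""
    fused = {}
    for system, scores in scores_by_system.items():
        for doc, score in scores.items():
            fused[doc] = fused.get(doc, 0.0) + weights[system] * score
    return fused


def update_weights(weights, scores_by_system, relevant_docs, rate=0.1):
    """Reward systems that scored the relevant documents highly, then renormalize."""
    raw = {}
    for system, scores in scores_by_system.items():
        reward = sum(scores.get(doc, 0.0) for doc in relevant_docs)
        raw[system] = weights[system] + rate * reward
    total = sum(raw.values())
    return {system: w / total for system, w in raw.items()}


# Hypothetical scores returned by two retrieval systems for one query.
scores_by_system = {
    "boolean": {"d1": 1.0, "d2": 0.0},
    "vector": {"d1": 0.2, "d2": 0.9},
}
weights = {"boolean": 0.5, "vector": 0.5}

fused = fuse(scores_by_system, weights)
# The user marks d2 as relevant; the weights are kept for later sessions.
new_weights = update_weights(weights, scores_by_system, ["d2"])
print(new_weights)
```

Because the weights are stored rather than the modified query, the adaptation survives the current session, which is the long-term effect the text describes.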
FUTURE TRENDS
Information retrieval systems are applied in more and more complex and diverse
environments. Searching e-mail, social computing collections and other specific domains pose
new challenges which lead to innovative systems. These retrieval applications require thorough
and user oriented evaluation. New evaluation measures and standardized test collections are
necessary to achieve reliable evaluation results.
In user adaptation, recommendation systems are an important trend for future improvement.
Recommendation systems need to be seen in the context of social computing applications.
System developers face the growth of user generated content which allows new reasoning
methods.
New applications such as question answering, which rely on more intelligent processing, can be
expected to gain more market share in the near future (Hartrumpf, 2006).
CONCLUSION
Machine learning can be applied to find optimized functions for collections or queries.
KEY TERMS
Search Engine
A search engine is a software program that provides information according to the user query. It
finds various websites or web pages that are available on the internet and gives related results
according to the search. For example, a student who wants to learn the C++ language searches
for “C++ tutorial” in the search engine and gets a list of links to tutorials.
Put differently, a search engine is an internet-based software program whose main task
is to collect a large amount of data or information about what is on the internet, categorize
it, and then help the user find the required information from the categorized
information. Google, Yahoo, and Bing are among the most popular search engines.
1. Crawling: Search engines have a number of computer programs that are responsible for
finding information that is publicly available on the internet. These programs scan the web and
create a list of all available websites. Then they visit each website and, by reading the HTML
code, try to understand the structure of the page, the type of the content, the meaning of the
content, and when it was created or updated. Why is crawling important? Because your first
concern when optimizing your website for search engines is to make sure that they can access
it correctly. If they cannot find your content, you won’t get any ranking or search engine traffic.
2. Indexing: Information identified by the crawler needs to be organized, sorted, and stored so
that it can be processed later by the ranking algorithm. Search engines don’t store all the
information on a page in their index, but they keep things like the title and description of the
page, the type of content, associated keywords, the number of incoming and outgoing links, and
a lot of other parameters that are needed by the ranking algorithm. Why is indexing important?
Because if your website is not in their index, it will not appear for any searches. This also
means that the more pages you have indexed, the more chances you have of appearing in the
search results for a related query.
3. Ranking: Ranking is the position at which your website is listed in a search engine's
results. Ranking works in three steps.
Step 1: Analyze user query – This step is to understand what kind of information
the user is looking for. To do that, search engines analyze the user’s query by breaking
it down into a number of meaningful keywords. A keyword is a word that has a specific
meaning and purpose. For example, when you type “how to make a chocolate cupcake”,
search engines know that you are looking for specific information, so the results will
contain recipes and step-by-step instructions. They can also understand that
“how to change a light bulb” means the same as “how to replace a light bulb”.
Search engines are also clever enough to interpret spelling mistakes.
Step 2: Finding matching pages – This step is to look into the index and find the
best matching pages. For example, if you search for “dark wallpaper”, the results
are images, not text.
Step 3: Present the results to the users – A typical search results page includes ten
organic results; in most cases it is enriched with other elements like paid ads, direct
answers for specific queries, etc.
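The three ranking steps above can be sketched in a very reduced form. The index contents, page names, stop-word list, and the scoring rule (number of matched keywords) are all assumptions; real search engines use far richer signals, including synonym and spelling handling that this sketch omits.

```python
def search(index, query, stop_words=frozenset({"how", "to", "a", "the"}), limit=10):
    """Analyze the query, find matching pages, and return the best ones first."""
    # Step 1: break the query down into meaningful keywords.
    keywords = [w for w in query.lower().split() if w not in stop_words]
    # Step 2: look up each keyword in the index and count matches per page.
    scores = {}
    for keyword in keywords:
        for page in index.get(keyword, []):
            scores[page] = scores.get(page, 0) + 1
    # Step 3: present at most `limit` results, highest score first.
    ranked = sorted(scores, key=lambda page: (-scores[page], page))
    return ranked[:limit]


# Hypothetical index mapping keywords to pages.
index = {
    "make": ["diy.example/shelf", "recipes.example/cupcake"],
    "chocolate": ["recipes.example/cupcake", "shop.example/candy"],
    "cupcake": ["recipes.example/cupcake"],
}

results = search(index, "how to make a chocolate cupcake")
print(results[0])  # -> recipes.example/cupcake
```

The default limit of ten mirrors the ten organic results of a typical results page.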
There are three components in a search engine: the web crawler, the database, and the search
interface.
Web crawler: A search engine uses multiple web crawlers to crawl through the World
Wide Web and gather information. A crawler is a piece of software that is also known as
a bot or spider.
Database: The information gathered by the web crawlers while crawling through the
internet is stored in the database.
Search interface: The search interface is the interface to the database which the
user employs to search through the database.
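How a web crawler reads the HTML code of a page can be illustrated with Python's standard html.parser module. The sketch below only extracts the page title and the outgoing links from an in-memory string; the class name PageScanner and the sample page are hypothetical, and fetching pages over the network, politeness rules, and storage in the database are left out.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect the title and the outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # candidate page for the crawler to visit next
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page source, as a crawler might have downloaded it.
page_source = (
    "<html><head><title>C++ tutorial</title></head>"
    '<body><a href="/intro">Intro</a> <a href="https://example.com/">More</a></body></html>'
)

scanner = PageScanner()
scanner.feed(page_source)
print(scanner.title)
print(scanner.links)
```

The collected links are what lets a crawler move from page to page, and the title is one of the items the text lists as being kept in the index.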
Basic building blocks of search engine:
There are basically two building blocks which perform various activities.
Indexing
Querying
Searching for information: People use a search engine to search for any kind of
information present on the internet. For example, Rohit wants to buy a mobile
phone but he does not know which one is the best mobile phone. So he searches
“best mobile phones in 2021” in the search engine and gets the list of best mobile
phones along with their features, reviews, and prices.
Searching images and videos: Search engines are also used to search images and
videos. There are so many videos and images available on the internet in different
categories like plants, animals, flowers, etc., you can search them according to your
need.
Searching location: Search engines are also used to find locations. For example,
Seema is on a Goa trip but she doesn’t know the location of Palolem beach. So she
searches “Palolem beach” on the search engine and then the search engine gives the
best route to reach Palolem beach.
Searching people: Search engines are also used to find people on the internet
around the world.
Shopping: Search engines are also used for shopping. Search engines optimize the
pages to meet the needs of the user and give the lists of all the websites that contain
the specified product according to the best price, reviews, free shipping, etc.
Entertainment: Search engines are also used for entertainment purposes. They are used
to search for videos, movies, games, movie trailers, reviews of movies, social
networking sites, etc. For example, Rohan wants to watch a movie named “Ram”, so
he searches for this movie on a search engine, and the search engine returns a list
of links to websites that contain the movie.
Education: Search engines are also used for education. With the help of search
engines, people can learn anything they want to learn, like cooking, programming
languages, home decoration, etc. It is like an open school where you can learn
anything for free.
How do We Use a Search Engine?
Search engines are easy to use. Billions of searches are performed using search
engines each day; it is estimated that more than 5.6 billion searches are made per day. For
example, to search on Google, simply open your web browser, type
“www.google.com” in the address bar of your web browser, and press “Enter”. The Google
search engine will open, and you are ready to search for any information on it.
Always remember that the results returned by the search engine may not all be relevant to
your search: it returns results that contain the search words, but not necessarily
in the same order you typed them in.