
Section a-UNIT 1

Uploaded by

Tania Gupta
Copyright
© All Rights Reserved

Part A

Introduction: History of IR, Components of IR, The IR Problem, The IR
System, The Software Architecture of the IR System, The impact of the
web on IR, The role of artificial intelligence (AI) in IR, IR Versus Web
Search, Components of a Search engine. [5 Hours]
https://en.wikipedia.org/wiki/Information_retrieval#:~:text=The%20first%20description%20of%20a,1957%20romantic%20comedy%2C%20Desk%20Set.
Information retrieval (IR) in computing and information science is the process of
obtaining information system resources that are relevant to an information need from a collection
of those resources. Searches can be based on full-text or other content-based indexing.
Information retrieval is the science of searching for information in a document, searching for
documents themselves, and also searching for the metadata that describes data, and
for databases of texts, images or sounds.

Automated information retrieval systems are used to reduce what has been called information
overload. An IR system is a software system that provides access to books, journals and other
documents; it also stores and manages those documents. Web search engines are the most visible
IR applications.

History of Information Retrieval


The idea of using computers to search for relevant pieces of information was popularized in the
article As We May Think by Vannevar Bush in 1945. It would appear that Bush was inspired by
patents for a 'statistical machine' - filed by Emanuel Goldberg in the 1920s and '30s - that
searched for documents stored on film. The first description of a computer searching for
information was given by Holmstrom in 1948, with an early mention of
the Univac computer. Automated information retrieval systems were introduced in the 1950s:
one even featured in the 1957 romantic comedy, Desk Set. In the 1960s, the first large
information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s
several different retrieval techniques had been shown to perform well on small text corpora such
as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as
the Lockheed Dialog system, came into use early in the 1970s.

In 1992, the US Department of Defense along with the National Institute of Standards and
Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER
text program. Its aim was to support research within the information retrieval community by
supplying the infrastructure needed to evaluate text retrieval methodologies on a very large
text collection. This catalyzed research on methods that scale to huge corpora. The introduction
of web search engines has boosted the need for very large scale retrieval systems even further.

Applications.
Areas where information retrieval techniques are employed include (the entries are in alphabetical
order within each category):

General applications

• Digital libraries
• Information filtering
  o Recommender systems
• Media search
  o Blog search
  o Image retrieval
  o 3D retrieval
  o Music retrieval
  o News search
  o Speech retrieval
  o Video retrieval
• Search engines
  o Site search
  o Desktop search
  o Enterprise search
  o Federated search
  o Mobile search
  o Social search
  o Web search

Domain-specific applications

• Expert search finding
• Genomic information retrieval
• Geographic information retrieval
• Information retrieval for chemical structures
• Information retrieval in software engineering
• Legal information retrieval
• Vertical search

https://www.geeksforgeeks.org/what-is-information-retrieval/

Information Retrieval (IR) can be defined as a software process that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information Retrieval is the activity of obtaining material
of an unstructured nature, usually text, that satisfies an information need from within large
collections stored on computers. For example, an IR process starts when a user enters a query
into the system.

Information retrieval is no longer practised only by librarians and professional searchers;
hundreds of millions of people engage in IR every day when they use web search engines.
Information Retrieval is believed to be the dominant form of information access. The IR
system assists users in finding the information they require, but it does not explicitly
return answers to their questions. Instead, it reports the existence and location of
documents that might contain the required information. Information retrieval also supports
users in browsing or filtering document collections and in processing a set of retrieved
documents. A web-scale system searches over billions of documents stored on millions of
computers. Email programs are a smaller-scale example: a spam filter, together with manual or
automatic means of classifying mail, lets messages be placed directly into particular folders.
An IR system has the ability to represent, store, organize, and access information items. A set
of keywords is required to search. Keywords are what people enter into search engines; they
summarize the description of the information sought.

What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has asked
for in the form of a query. The documents and the queries are represented in a similar
manner, so that document selection and ranking can be formalized by a matching function that
returns a retrieval status value (RSV) for each document in the collection. Many Information
Retrieval systems represent document contents by a set of descriptors, called terms,
belonging to a vocabulary V. An IR model determines the query-document matching function
according to four main approaches:

The estimation of the probability of the user's relevance rel for each document d and
query q with respect to a set R_q of training documents: Prob(rel | d, q, R_q)
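
As a minimal sketch of the matching function described above, assume that documents and queries are both represented as sets of terms from V and that the RSV is simply the number of shared terms. The function names and the toy collection are illustrative assumptions, not taken from any particular IR system; real models use weighted or probabilistic scores.

```python
def rsv(query_terms, doc_terms):
    """Retrieval status value: here, simply the count of shared terms."""
    return len(set(query_terms) & set(doc_terms))


def rank(query_terms, collection):
    """Return (doc_id, RSV) pairs for every document, best match first."""
    scored = [(doc_id, rsv(query_terms, terms)) for doc_id, terms in collection.items()]
    # Sort by descending RSV; ties are broken by document id for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy collection: document id -> terms drawn from a vocabulary V.
collection = {
    "d1": {"information", "retrieval", "system"},
    "d2": {"database", "system"},
    "d3": {"web", "search", "engine"},
}
ranking = rank({"information", "system"}, collection)
```

Every document receives an RSV, including those that share no term with the query; selection then amounts to cutting the ranked list at some threshold.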
Types of IR Models
Components of Information Retrieval / IR Model / Software Architecture of IR System

• Acquisition: In this step, documents and other objects are selected from various
web resources that consist of text-based documents. The required data is
collected by web crawlers and stored in the database.
• Representation: This consists of indexing, which may use free-text terms, a controlled
vocabulary, and manual as well as automatic techniques. Examples: abstracting, which
involves summarizing, and bibliographic description, which covers author, title,
sources, data, and metadata.
• File Organization: There are two basic file organization methods.
Sequential: records are stored document by document. Inverted: records are stored term
by term, with a list of records under each term. A combination of both is also possible.
• Query: An IR process starts when a user enters a query into the system. Queries are
formal statements of information needs, for example, search strings in web search
engines. In information retrieval, a query does not uniquely identify a single object
in the collection. Instead, several objects may match the query, perhaps with
different degrees of relevancy.
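
The two file organizations named above can be contrasted in a short sketch. The record contents and the helper name `build_inverted_index` are assumptions made for illustration; tokenization here is plain whitespace splitting.

```python
from collections import defaultdict

# Sequential organization: records are stored document by document.
sequential_file = [
    ("d1", "web crawlers collect documents"),
    ("d2", "users enter a query"),
    ("d3", "documents match a query"),
]


def build_inverted_index(records):
    """Inverted organization: for each term, the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in records:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


inverted_index = build_inverted_index(sequential_file)
```

Answering a one-word query from the sequential file means scanning every record, whereas the inverted file answers it with a single lookup such as `inverted_index["query"]`.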

Difference Between Information Retrieval and Data Retrieval

Information Retrieval | Data Retrieval
--------------------- | --------------
The software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Data retrieval deals with obtaining data from a database management system such as ODBMS. It is a process of identifying and retrieving the data from the database, based on the query provided by user or application.
Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
Small errors are likely to go unnoticed. | A single error object means total failure.
Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
The results obtained are approximate matches. | The results obtained are exact matches.
Results are ordered by relevance. | Results are unordered by relevance.
It is a probabilistic model. | It is a deterministic model.

User Interaction with Information Retrieval System

• The User Task: The user first has to translate the information need into a query. In
an information retrieval system, the query is a set of words that conveys the semantics of the
information required, whereas in a data retrieval system, a query expression conveys the
constraints that the retrieved objects must satisfy. Example: a user sets out to search for
one thing but ends up looking at another. This means that the user is browsing and
not searching. The above figure shows the interaction of the user through different tasks.
• Logical View of the Documents: Historically, documents were represented
by a set of index terms or keywords. Modern computers can represent a document
by its full set of words, which is then reduced to a smaller set of representative
keywords. This is done by eliminating stop words, i.e. articles and connectives. These
operations are called text operations. They reduce the complexity of the
document representation from full text to a set of index terms.

Past, Present, and Future of Information Retrieval

1. Early Developments: As the volume of information grew, it became necessary to build
data structures for faster access. The index is the data structure
for faster retrieval of information. For centuries, indexes were built by manual
categorization into hierarchies.
2. Information Retrieval in Libraries: Libraries were the first to adopt IR systems for
information retrieval. The first generation automated previous technologies, and
search was based on author name and title. The second generation added searching
by subject heading, keywords, etc. The third generation added graphical interfaces,
electronic forms, hypertext features, etc.
3. The Web and Digital Libraries: The web is cheaper than various other sources of
information, it provides greater access to networks due to digital communication, and it
gives free access to publish on a larger medium.

Issues in Information Retrieval


Indexing is the most vital part of any Information Retrieval System. It is a process in which
the documents required by the users are transformed into searchable data structures. Indexing
can also be described as a process of extraction rather than analysis of particular content. It
is core functionality of the IR process, since it is the first step in IR and enables
efficient information retrieval.
In the process, first, document surrogates are created to represent each document.
Secondly, the original documents are analyzed, which involves both simple data (identifying
meta-information, e.g. author, title, subject) and complex data (linguistic analysis of content).
Indexes are the data structures that are used to make the search faster.

Evaluation in Information Retrieval is the process of systematically determining a subject's
merit, worth, and significance by using certain criteria that are governed by a set of standards.
Issues in Information Retrieval:
The main issues of the Information Retrieval (IR) are Document and Query Indexing, Query
Evaluation, and System Evaluation.

1. Document and Query Indexing –
The main goal of Document and Query Indexing is to find important meanings and
create an internal representation. The factors to be considered are accuracy in
representing semantics, exhaustiveness, and ease of manipulation by a computer.
2. Query Evaluation –
The retrieval model determines how a document is represented with the selected
keywords and how document and query representations are compared to calculate
a score. Information Retrieval (IR) deals with issues like uncertainty and vagueness
in information systems.
• Uncertainty:
The available representation does not typically reflect the true semantics of
objects such as images, videos, etc.
• Vagueness:
The information that the user requires lacks clarity and is only vaguely
expressed in a query, feedback, or user action.
3. System Evaluation –
System Evaluation concerns determining the impact of the information given on
user achievement. Here, we also examine the efficiency of the
particular system in terms of time and space.

https://www.geeksforgeeks.org/issues-in-information-retrieval/
https://www.upgrad.com/blog/information-retrieval-system-explained/
Impact of Web on Information Retrieval
The science surrounding search engines is commonly referred to as information retrieval, in
which algorithmic principles are developed to match user interests to the best information about
those interests.

Google started as a result of our founders' attempt to find the best matching between the user
queries and Web documents, and do it really fast. During the process, they uncovered a few basic
principles: 1) best pages tend to be those linked to the most; 2) best description of a page is
often derived from the anchor text associated with the links to a page. Theories were
developed to exploit these principles to optimize the task of retrieving the best documents
for a user query.

Search and Information Retrieval on the Web has advanced significantly from those early days:
1) the notion of "information" has greatly expanded from documents to much richer
representations such as images, videos, etc.; 2) users are increasingly searching on their mobile
devices with very different interaction characteristics from search on desktops; 3) users are
increasingly looking for direct information, such as answers to a question, or seeking to complete
tasks, such as appointment booking. Through our research, we are continuing to enhance and
refine the world's foremost search engine by aiming to scientifically understand the implications
of those changes and address new challenges that they bring.

The Role of Artificial Intelligence (AI) in IR

Information retrieval is a key technology for knowledge management. It deals with the search for
information and the representation, storage and organization of knowledge. Information retrieval
is concerned with search processes in which a user needs to identify a subset of information
which is relevant for his information need within a large amount of knowledge. The information
seeker formulates a query trying to describe his information need. The query is compared to
document representations which were extracted during an indexing phase. The representations of
documents and queries are typically matched by a similarity function such as the Cosine. The
most similar documents are presented to the users who can evaluate the relevance with respect to
their problem (Belkin, 2000). The problem of properly representing documents and matching
imprecise representations soon led to the application of techniques developed within
Artificial Intelligence to information retrieval.

BACKGROUND

In the early days of computer science, information retrieval (IR) and artificial intelligence
(AI) developed in parallel. In the 1980s, they started to cooperate and the term intelligent
information retrieval was coined for AI applications in IR. In the 1990s, information retrieval
saw a shift from set-based Boolean retrieval models to ranking systems like the vector space
model and probabilistic approaches. These approximate reasoning systems opened the door for
more intelligent value-added components. The large amount of text documents available in
professional databases and on the internet has led to a demand for intelligent methods in text
retrieval and to considerable research in this area. Better preprocessing to extract
more knowledge from data has become an important way to improve systems. Off-the-shelf
approaches promise worse results than systems adapted to users, domain and information needs.
Today, most techniques developed in AI have been applied to retrieval systems with more or less
success. When data from users is available, systems often use machine learning to optimize their
results.

Artificial Intelligence Methods in Information Retrieval

Artificial intelligence methods are employed throughout the standard information retrieval
process and for novel value added services. The first section gives a brief overview of
information retrieval. The subsequent sections are organized along the steps in the retrieval
process and give examples for applications.

Information Retrieval

Information retrieval deals with the storage and representation of knowledge and the retrieval of
information relevant for a specific user problem. The information seeker formulates a query
trying to describe his information need. The query is compared to document representations. The
representations of documents and queries are typically matched by a similarity function such as
the Cosine or the Dice coefficient. The most similar documents are presented to the users who
can evaluate the relevance with respect to their problem.
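
The two similarity functions named above can be sketched as follows. This is an illustration under simple assumptions: the cosine is computed on sparse term-weight vectors stored as dictionaries, and the Dice coefficient on plain term sets; the example query and document are made up.

```python
import math


def cosine(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def dice(query_terms, doc_terms):
    """Dice coefficient on term sets: twice the overlap divided by the summed set sizes."""
    a, b = set(query_terms), set(doc_terms)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical query and document representations.
query = {"information": 1.0, "retrieval": 1.0}
doc = {"information": 2.0, "retrieval": 1.0, "history": 1.0}
cos_score = cosine(query, doc)
dice_score = dice(query, doc)  # iterating a dict yields its terms
```

Both functions return values between 0 and 1 for non-negative weights, so documents can be sorted by either score to produce a ranked list.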

Indexing usually consists of several phases. First, after word segmentation, stop words are
removed. These common words like articles or prepositions contain little meaning by themselves
and are ignored in the document representation. Second, word forms are transformed into their
basic form, the stem. During the stemming phase, e.g. houses would be transformed into house.
For the document representation, different word forms are usually not necessary. The importance
of a word for a document can differ: some words describe the content of a document better
than others, and this is expressed as a weight.
The weight is determined by the frequency of a stem within the text of a document (Savoy,
2003).
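
The indexing phases above can be sketched in a few lines. The stop-word list and the one-rule stemmer are deliberately tiny assumptions for illustration; production systems use full stop lists and a proper stemming algorithm such as Porter's.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "are", "is"}


def naive_stem(word):
    """Toy stemmer (an assumption): strips a plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def index_document(text):
    """Segment into words, drop stop words, stem, and weight each stem by its frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)


weights = index_document("The houses in the street are old houses and a house is a home")
```

In this example "houses" and "house" collapse into the single stem "house", which therefore receives the highest frequency weight in the document representation.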

In multimedia retrieval, the context is essential for the selection of a form of query and
document representation. Different media representations may be matched against each other or
transformations may become necessary (e.g. to match terms against pictures or spoken language
utterances against documents in written text).
As information retrieval needs to deal with vague knowledge, exact processing methods are
not appropriate. Vague retrieval models like the probabilistic model are more suitable. Within
these models, terms are provided with weights corresponding to their importance for a document.
These weights mirror different levels of relevance.
The results of current information retrieval systems are usually sorted lists of documents
where the top results are more likely to be relevant according to the system. In some approaches,
the user can judge the documents returned to him and tell the system which ones are relevant for
him. The system then re-sorts the result set. Documents which contain many of the words present
in the relevant documents are ranked higher. This relevance feedback process is known to greatly
improve the performance. Relevance feedback is also an interesting application for machine
learning. Based on human decisions, the optimization step can be modeled with several
approaches, e.g. with rough sets (Singh & Dey 2005). In Web environments, a click is often
interpreted as an implicit positive relevance judgment (Joachims & Radlinski, 2007).
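
One classic way to realise the relevance feedback loop described above is the Rocchio formula, which moves the query vector toward the documents judged relevant. The sketch below uses only the positive part of that formula; the parameter values `alpha` and `beta`, the toy documents, and the assumed user judgment are illustrative, and this is not the rough-set method of the cited work.

```python
def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Move the query vector toward the centroid of the documents judged relevant."""
    new_query = {term: alpha * w for term, w in query.items()}
    for doc in relevant_docs:
        for term, w in doc.items():
            new_query[term] = new_query.get(term, 0.0) + beta * w / len(relevant_docs)
    return new_query


def score(query, doc):
    """Dot product between sparse term-weight vectors."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())


def ranked(query, docs):
    """Document ids sorted by descending score, ties broken by id."""
    return sorted(docs, key=lambda d: (-score(query, docs[d]), d))


# Hypothetical collection with an ambiguous query term.
docs = {
    "d1": {"jaguar": 1.0, "car": 1.0, "engine": 1.0},
    "d2": {"jaguar": 1.0, "cat": 1.0, "habitat": 1.0},
    "d3": {"car": 2.0, "engine": 1.0},
}
query = {"jaguar": 1.0}
before = ranked(query, docs)
# Assume the user marks d1 as relevant.
expanded = rocchio(query, [docs["d1"]])
after = ranked(expanded, docs)
```

After feedback, d3 overtakes d2 although it does not contain the original query term, because it shares many words with the document the user judged relevant.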

Advanced Representation Models

In order to represent documents in natural language, the content of these documents needs to
be analyzed. This is a hard task for computer systems. Robust semantic analysis for large text
collections or even multimedia objects has yet to be developed. Therefore, text documents are
represented by natural language terms mostly without syntactic or semantic context. This is often
referred to as the bag-of-words approach. These keywords or terms can only imperfectly
represent an object because their context and relations to other terms are lost.
However, great progress has been made and systems for semantic analysis are getting
competitive. Advanced syntactic and semantic parsing for robust processing of mass data has
been derived from computational linguistics (Hartrumpf, 2006).
For application and domain specific knowledge, another approach is taken to improve the
representation of documents. The representation scheme is enriched by exploiting knowledge
about concepts of the domain (Lin & Demner-Fushman, 2006).

Match Between Query and Document

Once the representation has been derived, a crucial aspect of an information retrieval system
is the similarity calculation between query and document representation. Most systems use
mathematical similarity functions such as the Cosine. The decision for a specific function is
based on heuristics or empirical evaluations. Several approaches use machine learning for
long-term optimization of the matching between term and document. For example, one approach
applies a genetic algorithm to adapt a weighting function to a collection (Almeida et al., 2007).
Neural networks have been applied widely in IR. Several network architectures have been
applied for retrieval tasks, most often the so-called spreading activation networks are used.
Spreading activation networks are simple Hopfield-style networks, however, they do not use the
learning rule of Hopfield networks. They typically consist of two layers representing terms and
documents. The weights of connections between the layers are bi-directional and initially set
according to the results of the traditional indexing and weighting algorithms (Belkin, 2000). The
neurons corresponding to the terms of the user’s query are activated in the term layer and
activation spreads along the weights into the document layer and back. Activation represents
relevance or interest and reaches potentially relevant terms and documents. The most highly
activated documents are presented to the user as result. A closer look at the models reveals that
they very much resemble the traditional vector space model of Information Retrieval (Mandl,
2000). It is not until after the second step that the associative nature of the spreading
activation process leads to results different from a vector space model. The spreading activation networks
successfully tested with mass data do not take advantage of this associative property. In some
systems the process is halted after only one step from the term layer into the document layer,
whereas others make one more step back to the term layer to facilitate learning (Kwok &
Grunfeld, 1996).
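
The two-layer spreading activation process described above can be sketched as follows. The weight table, the function names, and the absence of any normalization or decay are simplifying assumptions made for illustration.

```python
# Hypothetical term-document weights, as a traditional indexing step might produce them.
weights = {
    "neural": {"d1": 1.0, "d2": 1.0},
    "hopfield": {"d2": 1.0, "d3": 1.0},
}


def terms_to_docs(term_activation, weights):
    """Spread activation from the term layer into the document layer."""
    doc_activation = {}
    for term, act in term_activation.items():
        for doc, w in weights.get(term, {}).items():
            doc_activation[doc] = doc_activation.get(doc, 0.0) + act * w
    return doc_activation


def docs_to_terms(doc_activation, weights):
    """Spread activation back from the document layer into the term layer."""
    term_activation = {}
    for term, postings in weights.items():
        act = sum(doc_activation.get(doc, 0.0) * w for doc, w in postings.items())
        if act > 0:
            term_activation[term] = act
    return term_activation


# Step 1: the query term activates documents (same result as a dot-product match).
first = terms_to_docs({"neural": 1.0}, weights)
# Step 2: activation flows back to terms, then forward to documents again.
back = docs_to_terms(first, weights)
second = terms_to_docs(back, weights)
```

After the first step only d1 and d2 are active, exactly as in a vector space match. Only after the second pass does d3 receive activation through the associated term "hopfield", which is the associative effect the text refers to.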
Queries in information retrieval systems are usually short and contain few words. Longer
queries have a higher probability to achieve good results. As a consequence, systems try to add
good terms to a query entered by a user. Several techniques have been applied. Either these terms
are taken from top ranked documents or terms similar to the original ones are used. Another
technique is to use terms from documents from the same category. For this task, classification
algorithms from machine learning are used (Sebastiani, 2002).
Link analysis applies well-known measures from bibliometric analysis to the Web. The
number of links pointing to a Web page is used as an indicator of its quality (Borodin et al., 2005).
PageRank assigns an authority value to each Web page which is primarily a function of its back
links. Additionally, it assumes that links from pages with high authority should be weighed
higher and should result in a higher authority for the receiving page. To account for the different
values each page has to distribute, the algorithm is carried out iteratively until the result
converges (Borodin et al., 2005). Machine Learning approaches complement link analysis.
Decisions of humans about the quality of Web pages are used to determine design features of
these pages which are good indicators of their quality. Machine learning models are applied to
determine the quality of pages not judged yet (Mandl, 2006, Marti & Hearst, 2002).
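
The iterative authority computation described above for PageRank can be sketched with a simple power iteration. The damping factor of 0.85, the treatment of pages without outgoing links, and the four-page graph are assumptions of this sketch and do not come from the text.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits the authority it distributes evenly over its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumed handling of a page without links: spread evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Page c has the most back links and ends up with the highest value, while a ranks above b because its single back link comes from the high-authority page c.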
Learning from users has been an important strategy to improve systems. In addition to the
content, artificial intelligence methods have been used to improve the user interface.

Value Added Components for User Interfaces

Several researchers have implemented information retrieval systems based on the
Kohonen self-organizing map (SOM), a neural network model for unsupervised classification.
They provide an associative user interface where neighborhood of documents expresses a
semantic relation. Implementations for large collections can be tested on the internet (Kohonen,
1998). The SOM consists of a usually two-dimensional grid of neurons, each associated with a
weight vector. Input documents are classified according to the similarity between the input
pattern and the weight vectors, and the algorithm adapts the weights of the winning neuron and
its neighbors. In that way, neighboring clusters have a high similarity.
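
A single SOM training step, as described above, can be sketched in plain Python. The 2x2 grid, the fixed learning rate, and the fixed neighborhood radius are simplifying assumptions; practical implementations shrink both the rate and the radius over many passes through the collection.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def find_winner(grid, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    return min(cells, key=lambda rc: squared_distance(grid[rc[0]][rc[1]], x))


def train_step(grid, x, learning_rate=0.5, radius=1):
    """Move the winner and its grid neighbours within `radius` toward x."""
    win_r, win_c = find_winner(grid, x)
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            if abs(r - win_r) + abs(c - win_c) <= radius:
                grid[r][c] = [w + learning_rate * (xi - w) for w, xi in zip(grid[r][c], x)]
    return win_r, win_c


# Hypothetical 2x2 map over a two-term vocabulary.
grid = [[[0.0, 0.0], [1.0, 0.0]],
        [[0.0, 1.0], [1.0, 1.0]]]
winner = train_step(grid, [0.9, 0.1])
```

Because neighbours of the winner are pulled toward the same input, adjacent neurons end up with similar weight vectors, which is why neighboring clusters on the map are similar.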
The information retrieval applications of SOMs classify documents and assign the
dominant term as name for the cluster. For real-world, large-scale collections, one
two-dimensional grid is not sufficient: either the grid would be too big or each node would
contain too many documents. Neither would be helpful for users; therefore, a layered
architecture is adopted. The highest layer consists of nodes which represent clusters of
documents. The documents of these nodes are again analyzed by a SOM. For the user, the
system consists of several two-dimensional maps of terms where similar terms are close to each
other. After choosing one node, he may reach another two-dimensional SOM.
The information retrieval paradigm for the SOM is browsing and navigating between
layers of maps. The SOM seems to be a very natural visualization. However, the SOM approach
has some serious drawbacks.
• The interface for interacting with several layers of maps makes the system difficult to
browse.
• Users of large text collections need primarily search mechanisms which the SOM itself does
not offer.
• The similarity of the document collection is reduced to two dimensions omitting many
potentially interesting aspects.
• The SOM unfolds its advantages for human-computer-interaction better for a small number of
documents. A very encouraging application would be the clustering of the result set. The neurons
would fit on one screen, the number of terms would be limited and therefore, the reduction to
two dimensions would not omit so many aspects.

User Classification and Personalization

Adaptive information retrieval approaches intend to tailor the results of a system to one user
and his interests and preferences. The most popular representation scheme relies on the
representation scheme used in information retrieval where a document-term-matrix stores the
importance or weight of each term for each document. When a term appears in a document, this
weight should be different from zero. User interest can also be stored like a document. Then the
interest is a vector of terms. These terms can be ones that a user has entered or selected in a user
interface or which the system has extracted from documents for which the user has shown
interest by viewing or downloading them (Agichtein et al., 2006).
An example of such a system is UCAIR, which can be installed as a browser
plugin. UCAIR relies on a standard web search engine to obtain a search result and a primary
ranking. This ranking is then modified by re-ranking the documents based on implicit
feedback and a stored user interest profile (Shen et al., 2005).
Most systems use this method of storing the user interest in a term vector. However, this
method has several drawbacks. The interest profile may not be stable and the user may have a
variety of diverging interests for work and leisure which are mixed in one profile.
Advanced individualization techniques personalize the underlying system functions. The
results of empirical studies have shown that relevance feedback is an effective technique to
improve retrieval quality. Learning methods for information retrieval need to extend the range of
relevance feedback effects beyond the modification of the query in order to achieve long-term
adaptation to the subjective point of view of the user. The mere change of the query often results
in improved quality; however, the information is lost after the current session.
Some systems change the document representation according to the relevance feedback
information. In a vector space metaphor, the relevant documents are moved toward the query
representation. This approach also has some problems. Because only a fraction of the
documents is affected by the modifications, the basic data from the indexing process is changed
to a somewhat heterogeneous state. The original indexing result is no longer available.
Certainly, this technique is inadequate for fusion approaches where several retrieval methods
are combined. In this case, several basic representations would need to be changed according to
the influence of the corresponding methods on the relevant documents. The indexes are usually
heterogeneous, which is often considered an advantage of fusion approaches. A high
computational overload would be the consequence.
The MIMOR (Multiple Indexing and Method-Object Relations) approach does not rely on
changes to the document or the query representation when processing relevance feedback
information for personalization. Instead, it focuses on the central aspect of a retrieval function,
the calculation of the similarity between document and query. Like other fusion methods,
MIMOR accepts the results of individual retrieval systems as if from a black box. These results
are fused by a linear combination which is stored over many sessions. The weights for the
systems change through learning: they adapt according to relevance feedback
information provided by users and create a long-term model for future use. That way, MIMOR
learns which systems were successful in the past (Mandl & Womser-Hacker, 2004).
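
The linear fusion idea can be illustrated with a small sketch. The text does not specify MIMOR's learning rule, so the update in `update_weights` is an assumption made for illustration only: each system is rewarded by the mean score it gave to the documents the user judged relevant, and the weights are then renormalized. The system names and scores are hypothetical.

```python
def fuse(system_scores, weights):
    """Linear combination of the per-system scores for each document."""
    fused = {}
    for system, scores in system_scores.items():
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + weights[system] * s
    return fused


def update_weights(system_scores, weights, relevant_docs, rate=0.5):
    """ASSUMED rule: reward each system by the mean score it gave to relevant documents."""
    raw = {}
    for system, scores in system_scores.items():
        reward = sum(scores.get(d, 0.0) for d in relevant_docs) / max(len(relevant_docs), 1)
        raw[system] = weights[system] * (1.0 + rate * reward)
    total = sum(raw.values())
    return {system: w / total for system, w in raw.items()}


# Hypothetical black-box systems and the scores they returned for one query.
system_scores = {
    "boolean": {"d1": 1.0, "d2": 0.0, "d3": 1.0},
    "vector": {"d1": 0.2, "d2": 0.9, "d3": 0.4},
}
weights = {"boolean": 0.5, "vector": 0.5}
fused_before = fuse(system_scores, weights)
# Assume the user judges d2 relevant; the weights persist for later sessions.
weights = update_weights(system_scores, weights, relevant_docs=["d2"])
fused_after = fuse(system_scores, weights)
```

Neither the documents nor the query are modified; only the per-system weights change, so the original indexing results of every system stay intact.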
FUTURE TRENDS

Information retrieval systems are applied in more and more complex and diverse
environments. Searching e-mail, social computing collections and other specific domains pose
new challenges which lead to innovative systems. These retrieval applications require thorough
and user oriented evaluation. New evaluation measures and standardized test collections are
necessary to achieve reliable evaluation results.
In user adaptation, recommendation systems are an important trend for future improvement.
Recommendation systems need to be seen in the context of social computing applications.
System developers face the growth of user generated content which allows new reasoning
methods.
New applications like question answering, which rely on more intelligent processing, can be
expected to gain more market share in the near future (Hartrumpf, 2006).

CONCLUSION

Knowledge management is of central importance for the information society. Documents
written in natural language contain an important share of the knowledge available. Consequently,
retrieval is crucial for the success of knowledge management systems. AI technologies have been
widely applied in retrieval systems. Exploiting knowledge more efficiently is a major research
field. In addition, user oriented value added systems require intelligent processing and machine
learning in many forms.
An important future trend for AI methods in IR will be the context-specific adaptation of
retrieval methods. Machine learning can be applied to find optimized functions for collections
or queries.

KEY TERMS

Adaptation: Adaptation is a process of modification based on input or observation. An
information system should adapt itself to the specific needs of individual users in order to
produce optimized results.
Indexing: Indexing means the assignment of terms (words) which represent a document in an
index. Indexing can be carried out manually or automatically. Automatic indexing typically
involves stop-word elimination and stemming.
Information Retrieval: Information retrieval is concerned with the representation of
knowledge and the subsequent search for relevant information within these knowledge sources.
Information retrieval provides the technology behind search engines.
Link Analysis: The links between pages on the web are a large knowledge source which is
exploited by link analysis algorithms for many ends. Many algorithms similar to PageRank
determine a quality or authority score based on the number of incoming links of a page.
Furthermore, link analysis is applied to identify thematically similar pages, web communities
and other social structures.
Recommendation Systems: Recommendation systems suggest actions or content to the user based
on past experience collected from other users. Very often, documents are recommended based on
similarity profiles between users.
Term Expansion: Terms that are not present in the user's original query to an information
retrieval system are added automatically. The expanded query is then sent to the system again.
Weighting: Weighting determines the importance of a term for a document. Weights are
calculated using many different formulas which consider the frequency of each term in a
document and in the collection as well as the length of the document and the average or
maximum length of any document in the collection.
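The weighting entry above can be made concrete with a small Python sketch. The formula used here (term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency) is only one common variant and is an assumption of this example; the function name `tf_idf` and the toy collection are hypothetical.

```python
import math


def tf_idf(term, doc, collection):
    """Weight of `term` in `doc` (a non-empty list of tokens) within `collection`."""
    tf = doc.count(term) / len(doc)  # frequency normalized by document length
    df = sum(1 for d in collection if term in d)  # documents containing the term
    if df == 0:
        return 0.0  # the term does not occur anywhere in the collection
    idf = math.log(len(collection) / df)  # rarer terms get a higher factor
    return tf * idf


collection = [
    ["search", "engine", "index"],
    ["search", "query", "ranking"],
    ["crawler", "index", "index"],
]
w_common = tf_idf("search", collection[0], collection)   # term in 2 of 3 documents
w_rare = tf_idf("crawler", collection[2], collection)    # term in 1 of 3 documents
```

A term that occurs in few documents of the collection receives a higher weight than one that occurs in many, which is why `w_rare` exceeds `w_common` here.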

IR Versus Web Search

Web search is the application of information retrieval techniques to the largest
corpus of text anywhere, the web, and it is the context in which many people
interact with IR systems most frequently.

Search Engine
A search engine is a software program that provides information in response to a user query. It
finds websites or web pages that are available on the internet and returns results related to
the search. For example, a student who wants to learn the C++ language searches for
“C++ tutorial” in the search engine and gets a list of links to tutorials.
Put differently, a search engine is an internet-based software program whose main task
is to collect a large amount of data about what is on the internet, categorize
that data, and then help the user find the required information in the categorized
data. Google, Yahoo, and Bing are among the most popular search engines.

How do Search Engines Work?


Search engines generally work in three stages: crawling, indexing, and ranking.

1. Crawling: Search engines have a number of computer programs that are responsible for
finding information that is publicly available on the internet. These programs scan the web and
create a list of all available websites. Then they visit each website and, by reading its HTML
code, try to understand the structure of the page, the type of the content, the meaning of the
content, and when it was created or updated. Why is crawling important? Because your first
concern when optimizing your website for search engines is to make sure that they can access
it correctly. If they cannot find your content, you won’t get any ranking or search engine traffic.
2. Indexing: Information identified by the crawler needs to be organized, sorted, and stored so
that it can be processed later by the ranking algorithm. Search engines don’t store all the
information found on a page in their index, but they keep things like the title and description
of the page, the type of content, associated keywords, the number of incoming and outgoing
links, and a lot of other parameters that are needed by the ranking algorithm. Why is indexing
important? Because if your website is not in their index, it will not appear for any searches.
This also means that the more pages you have indexed, the more chances you have of appearing
in the search results for a related query.
3. Ranking: Ranking is the position at which your website is listed in a search engine's results.
Ranking works in three steps:
• Step 1: Analyze the user query – This step is to understand what kind of information
the user is looking for. To do that, search engines analyze the user’s query by breaking it
down into a number of meaningful keywords. A keyword is a word that has a specific meaning
and purpose. For example, when you type “how to make a chocolate cupcake”, search
engines know that you are looking for specific information, so the results will
contain recipes and step-by-step instructions. They can also understand that
“how to change a light bulb” means the same as “how to replace a light bulb”, and they
are clever enough to interpret spelling mistakes.
• Step 2: Find matching pages – This step is to look into the index and find the
best matching pages. For example, if you search for “dark wallpaper”, the results
are images, not text.
• Step 3: Present the results to the user – A typical search results page includes ten
organic results; in most cases it is enriched with other elements such as paid ads and
direct answers for specific queries.
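The three ranking steps above can be sketched as follows. This is a deliberately simplified illustration, not how any real search engine works: the stop-word set, the synonym table, the toy index, and the function names `analyze_query`, `find_matches`, and `present` are all assumptions made for the example.

```python
STOP_WORDS = {"how", "to", "a", "the"}
SYNONYMS = {"replace": "change"}  # hypothetical synonym table


def analyze_query(query):
    """Step 1: break the query into meaningful, normalized keywords."""
    words = query.lower().split()
    return [SYNONYMS.get(w, w) for w in words if w not in STOP_WORDS]


def find_matches(keywords, index):
    """Step 2: count, for each page, how many query keywords it contains."""
    hits = {}
    for keyword in keywords:
        for page in index.get(keyword, ()):
            hits[page] = hits.get(page, 0) + 1
    return hits


def present(hits, limit=10):
    """Step 3: return up to `limit` pages, best match first."""
    return sorted(hits, key=lambda page: (-hits[page], page))[:limit]


# Toy index: keyword -> set of pages containing it.
index = {
    "change": {"page1", "page2"},
    "light": {"page1", "page3"},
    "bulb": {"page1"},
}
results = present(find_matches(analyze_query("How to replace a light bulb"), index))
```

Because “replace” is mapped to “change” in step 1, both phrasings of the light bulb query produce the same keywords and therefore the same results.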

Performance of Search Engine

The performance of a search engine is determined by two requirements:

• Effectiveness (quality of results).
• Efficiency (response time and throughput).

Components of Search Engine

There are three components in a search engine: the web crawler, the database, and the search
interface.

• Web crawler: A search engine uses multiple web crawlers to crawl through the World
Wide Web and gather information. A crawler is a piece of software that is also known as a
bot or spider.
• Database: The information gathered by the web crawlers while crawling the
internet is stored in the database.
• Search interface: The search interface is the interface to the database that the
user employs to search through the database.
Basic building blocks of search engine:

There are two basic building blocks, each of which performs several activities:

• Indexing
• Querying

1. Indexing: Indexing performs three main activities: text acquisition, text
transformation, and index creation.
i) Text acquisition: Text acquisition identifies documents and stores them in the
database for indexing.
• It converts a variety of documents into a consistent data format.
• It also stores the text, metadata, and other related information of each document.
ii) Text transformation: It transforms documents into index terms.
• Parser: It recognizes the “words” in the text with the help of a tokenizer and
processes the sequence of text tokens to recognize structural patterns.
• Stopping: It removes stop words like “and”, “or”, and “the”.
• Stemming: It groups together all the words derived from the same stem.
• Link analysis: It is used to identify the popularity of a page. It uses the links and
anchor text from web pages.
• Information extraction: It identifies classes of index terms
that are important for some applications.
• Classifier: It identifies class-related data of a document.
iii) Index creation:
• Document statistics: It collects features such as the positions and counts of words.
• Weighting: It calculates the weights of index terms.
• Inversion: It converts document-term information into term-document information,
because the inverted file format is fast for query processing.
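The indexing activities above (tokenizing, stopping, stemming, collecting positions, and inversion) can be sketched together in Python. This is a minimal illustration under stated assumptions: the whitespace tokenizer and the suffix-stripping `stem` function are toy stand-ins for a real parser and stemmer, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

STOP_WORDS = {"and", "or", "the"}


def stem(word):
    """Toy stemmer (an assumption of this sketch): strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_inverted_index(documents):
    """Map each index term to {doc_id: [positions]} (term-document information)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        tokens = text.lower().split()          # parsing / tokenizing
        for position, token in enumerate(tokens):
            if token in STOP_WORDS:            # stopping
                continue
            term = stem(token)                 # stemming
            # document statistics: record where the term occurs
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


documents = {
    1: "the crawler indexes links",
    2: "crawlers and indexing",
}
index = build_inverted_index(documents)
```

Looking up a term in `index` directly yields the documents that contain it, which is what makes the inverted format fast at query time.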
2. Querying: It consists of the following three tasks:
• User interaction: User interaction provides query input, that is, an interface
and a parser for the query language. It then transforms the query to improve it
and produces the output by constructing the display of ranked documents for the
query.
• Ranking: It calculates the scores of documents using a ranking algorithm and
processes queries in a distributed environment.
• Score: Score = Σ_i q_i · d_i, where q_i and d_i are the weights of term i in the query
and in the document.
• Evaluation: In this step, it logs user queries and interactions to improve the search
engine's efficiency and effectiveness.
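The score used in the ranking task, the sum over terms of the query weight times the document weight, can be sketched in Python as follows. The term weights and document names below are made up for illustration, and `score` and `rank` are hypothetical helper names.

```python
def score(query_weights, doc_weights):
    """Sum q_i * d_i over the terms that occur in both query and document."""
    return sum(
        q * doc_weights[term]
        for term, q in query_weights.items()
        if term in doc_weights
    )


def rank(query_weights, documents):
    """Return document ids ordered by descending score (ties broken by id)."""
    scored = [(score(query_weights, weights), doc_id)
              for doc_id, weights in documents.items()]
    return [doc_id for s, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Illustrative term weights for one query and three documents.
query = {"search": 1.0, "engine": 0.5}
documents = {
    "d1": {"search": 0.2, "engine": 0.4},
    "d2": {"search": 0.6},
    "d3": {"crawler": 0.9},
}
ranking = rank(query, documents)
```

Terms that are missing from either the query or the document contribute nothing to the sum, so a document sharing no terms with the query scores zero.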
Usage of Search Engine

Search engines have many uses. Some of them are:

• Searching for information: People use a search engine to search for any kind of
information present on the internet. For example, Rohit wants to buy a mobile
phone but does not know which one is best. So he searches for
“best mobile phones in 2021” in the search engine and gets a list of the best mobile
phones along with their features, reviews, and prices.
• Searching for images and videos: Search engines are also used to search for images and
videos. Many videos and images are available on the internet in different
categories such as plants, animals, and flowers, and you can search for them according
to your need.
• Searching for locations: Search engines are also used to find locations. For example,
Seema is on a trip to Goa but doesn’t know the location of Palolem beach. So she
searches for “Palolem beach” in the search engine, and the search engine gives the
best route to reach Palolem beach.
• Searching for people: Search engines are also used to find people on the internet
around the world.
• Shopping: Search engines are also used for shopping. They tailor the results
pages to the needs of the user and list the websites that offer
the specified product, ordered by criteria such as best price, reviews, and free shipping.
• Entertainment: Search engines are also used for entertainment purposes. They are used
to search for videos, movies, games, movie trailers, movie reviews, social
networking sites, etc. For example, Rohan wants to watch a movie named “Ram”,
so he searches for this movie in a search engine, and the search engine returns a list
of links to the websites that carry the movie.
• Education: Search engines are also used for education. With the help of search
engines, people can learn anything they want to learn, such as cooking, programming
languages, or home decoration. It is like an open school where you can learn
anything for free.
How do We Use a Search Engine?
Search engines are easy to use. Billions of searches are performed using search
engines each day; it is estimated that more than 5.6 billion searches are made per day. For
example, to search on Google, simply open your web browser, type
“www.google.com” in the address bar, and press “Enter”. The Google
search engine will open, and you are ready to search for any information. Always
remember that the results returned by the search engine may not all be relevant to
your search, because the engine returns results that contain the search words, but not
necessarily in the same order you typed them in.
