Understanding Information Retrieval Systems

The document discusses information retrieval systems and how they work. Information retrieval systems store and manage documents to help users find relevant information. They index documents and return results based on similarity to user queries rather than directly answering questions. Key aspects include defining the text database, building an index, processing user queries, ranking results, and evaluating systems using precision and recall.


Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Information Retrieval System

An information retrieval system is a software programme that stores and manages information on
documents, often textual documents but possibly multimedia. The system assists users in finding the
information they need. It does not explicitly return information or answer questions; instead, it informs
the user of the existence and location of documents that might contain the desired information.

Difference Between Information Retrieval and Data Retrieval

1. Information Retrieval retrieves information based on the similarity between the query and the document. Data Retrieval retrieves data based on the keywords in the query entered by the user.

2. In Information Retrieval, small errors are tolerated and will likely go unnoticed. In Data Retrieval there is no room for errors, since an error results in complete system failure.

3. Information Retrieval queries are ambiguous and do not have a defined structure. Data Retrieval queries have a defined structure with respect to semantics.

4. An Information Retrieval system does not provide a solution to the user of the database system. A Data Retrieval system provides solutions to the user of the database system.

5. An Information Retrieval system produces approximate results. A Data Retrieval system produces exact results.

6. In Information Retrieval, displayed results are sorted by relevance. In Data Retrieval, displayed results are not sorted by relevance.

7. The IR model is probabilistic by nature. The Data Retrieval model is deterministic by nature.

Information retrieval system architecture

First of all, before the retrieval process can even be initiated, it is necessary to define the text
database. This is usually done by the manager of the database, who specifies the following: (a)
the documents to be used, (b) the operations to be performed on the text, and (c) the text model
(i.e., the text structure and what elements can be retrieved). The text operations transform the
original documents and generate a logical view of them.
Once the logical view of the documents is defined, the database manager builds an index
of the text. An index is a critical data structure because it allows fast searching over large
volumes of data. Different index structures might be used, but the most popular one is the
inverted file. The resources (time and storage space) spent on defining the text database and
building the index are amortized by querying the retrieval system many times.
Given that the document database is indexed, the retrieval process can be initiated. The
user first specifies an information need, which is then parsed and transformed by the same text
operations applied to the documents. Query operations might then be applied before the actual
query, which provides a system representation for the user need, is generated. The query is then
processed to obtain the retrieved documents. Fast query processing is made possible by the index
structure previously built.
Before being sent to the user, the retrieved documents are ranked according to their
likelihood of relevance. The user then examines the set of ranked documents in search of useful
information. At this point, the user might pinpoint a subset of the documents seen as definitely of
interest and initiate a user feedback cycle. In such a cycle, the system uses the documents
selected by the user to change the query formulation. Hopefully, this modified query is a better
representation of the real user need.
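The pipeline above (apply text operations, build an inverted index, then process queries against it) can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not a production design; all names and documents are made up for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Text operations: lowercase the text and split it into terms."""
    return text.lower().split()

def build_index(docs):
    """Inverted file: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def query(index, q):
    """Conjunctive query: return the documents containing every query term."""
    postings = [index.get(term, set()) for term in tokenize(q)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "information retrieval finds relevant documents",
    2: "data retrieval returns exact records",
    3: "an index allows fast retrieval of documents",
}
index = build_index(docs)
print(query(index, "retrieval documents"))  # documents 1 and 3
```

The cost of building the index is paid once and is then amortized: each subsequent query only intersects the (typically short) posting sets instead of scanning every document.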

Issues with IR systems:


 Are the retrieved documents relevant? (precision)
 Are all the relevant documents retrieved? (recall)

Evaluation of IR system

Two of the evaluation measures are precision and recall.

 Precision is the proportion of retrieved documents that are relevant.
 Recall is the proportion of relevant documents that are retrieved.

 Precision = |Relevant documents ∩ Retrieved documents| / |Retrieved documents|

 Recall = |Relevant documents ∩ Retrieved documents| / |Relevant documents|

 When the recall measure is used, there is an assumption that all the
relevant documents for a given query are known. Such an assumption is
clearly problematic in a web search environment, but with smaller test
collections of documents this measure can be useful. It is not suitable
for large volumes of log data.

You can increase recall by returning more documents:

 Recall is a non-decreasing function of the number of documents retrieved.
 A system that returns all documents has 100% recall!
 The converse is also true (usually): it is easy to get high precision at very low recall.

Q. Calculate the precision and recall for the following confusion-matrix counts.

Sol.
TP = 20, FP = 40, FN = 60
Precision = TP / (TP + FP) = 20 / 60 ≈ 0.33
Recall = TP / (TP + FN) = 20 / 80 = 0.25
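Using the counts above (TP = 20, FP = 40, FN = 60), both measures follow directly from the definitions in the previous section:

```python
def precision(tp, fp):
    """Proportion of retrieved documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of relevant documents that are retrieved."""
    return tp / (tp + fn)

tp, fp, fn = 20, 40, 60
print(round(precision(tp, fp), 2))  # 0.33
print(round(recall(tp, fn), 2))     # 0.25
```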

Luhn’s Idea

One of the first text summarization algorithms was published in 1958 by Hans Peter Luhn, working at
IBM research. Luhn’s algorithm is a naive approach based on TF-IDF and looking at the “window size” of
non-important words between words of high importance.

Luhn’s algorithm is an approach based on TF-IDF. It selects only the words of higher importance
according to their frequency, and higher weights are assigned to the words present at the beginning of
the document. The method considers only the words lying in a middle band of the word-frequency
distribution: the highest-occurring words at one end and the least-occurring words at the other are
excluded. Luhn introduced the following criteria during text pre-processing:

1. Removing stopwords

2. Stemming (Likes->Like)

In this method we select the sentences with the highest concentration of salient content terms. For
example, suppose a sentence has 10 words, 4 of which are significant, and the span containing those
significant words is 6 words long.
To calculate the significance, instead of dividing the number of significant words by the total number
of words, we divide its square by the span that contains those words. Thus the score obtained for our
example would be
Score = 4² / 6 ≈ 2.7

Application
Luhn’s method rests on the following points:

1. Very low-frequency words are not significant.

2. Very high-frequency words are also not significant (e.g. “is”, “and”).

3. Removing low-frequency words is easy:

 set a minimum frequency threshold

4. Removing common (high-frequency) words:

 set a maximum frequency threshold (statistically obtained)

 compare against a common-word list

5. The method is used for summarizing technical documents.

Algorithm
Luhn’s method is a simple technique for generating a summary from a given text. The algorithm can be
implemented in two stages.
In the first stage, we determine which words are more significant towards the meaning of the
document. Luhn states that this is first done with a frequency analysis, then by finding words which are
significant but are not common English words.
In the second phase, we find the most common words in the document, and then take a subset of
those that are not among these most common English words but are still important. It usually consists of
the following three steps:
1. It begins with transforming the content of sentences into a mathematical expression, or vector
(for example, a binary representation). Here we use a bag of words, which ignores all the filler
words. Filler words are the supporting words that do not have any impact on the meaning of the
document. Then we count all the valuable words that are left. Stopwords such as “an” and “a” are
not considered in this evaluation.
2. In this step we evaluate sentences using a sentence-scoring technique. We can use the scoring
method illustrated below.

Score = (Number of meaningful words)² / (Span of meaningful words)

A span here refers to the part of the sentence (in our case) or document consisting of all the meaningful
words. TF-IDF can also be used to prioritize the words in a sentence.
3. Once the sentence scoring is complete, the last step is simply to select the sentences with the
highest overall rankings.
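The three steps above can be sketched as follows. This is an illustrative Luhn-style scorer, not Luhn's original code; the stopword list and the minimum-frequency threshold are arbitrary choices made for the example.

```python
from collections import Counter

# Stopwords (filler words) to ignore; illustrative, not a standard list.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}

def significant_words(sentences, min_freq=2):
    """Step 1: words occurring at least min_freq times, stopwords excluded."""
    counts = Counter(w for s in sentences for w in s if w not in STOPWORDS)
    return {w for w, c in counts.items() if c >= min_freq}

def luhn_score(sentence, significant):
    """Step 2: (count of significant words)^2 / span containing them."""
    positions = [i for i, w in enumerate(sentence) if w in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span

sentences = [
    "the index allows fast searching over large volumes".split(),
    "an index is a critical data structure for searching".split(),
    "luhn proposed a simple scoring method".split(),
]
sig = significant_words(sentences)
# Step 3: pick the highest-scoring sentence as the one-line summary.
best = max(sentences, key=lambda s: luhn_score(s, sig))
print(" ".join(best))
```

Here “index” and “searching” each occur twice, so the first sentence, where they sit closest together, gets the highest score.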
TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how
relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse
document frequency of the word across a set of documents.

TF-IDF was invented for document search and information retrieval. A word’s score increases
proportionally to the number of times it appears in a document, but is offset by the number of
documents that contain the word. So, words that are common in every document, such as “this”, “what”,
and “if”, rank low even though they may appear many times, since they don’t mean much to that document.

How is TF-IDF calculated?

TF-IDF for a word in a document is calculated by multiplying two different metrics:

 The term frequency of a word in a document. There are several ways of calculating this
frequency, with the simplest being a raw count of instances a word appears in a document. Then,
there are ways to adjust the frequency, by length of a document, or by the raw frequency of the
most frequent word in a document.
 The inverse document frequency of the word across a set of documents. This measures how
common or rare a word is in the entire document set. The closer it is to 0, the more common the
word is. This metric is calculated by taking the total number of documents, dividing it by the
number of documents that contain the word, and taking the logarithm.
 So, if the word is very common and appears in many documents, this number will approach 0.
Otherwise, it will grow large.

Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the
score, the more relevant that word is in that document.

To put it in more formal mathematical terms, the TF-IDF score for the word t in the document d from the
document set D is calculated as follows:

tfidf(t, d, D) = tf(t, d) · idf(t, D)

Where:
tf(t, d) = the frequency of the term t in the document d
idf(t, D) = log(N / df(t)), where N is the number of documents in D and df(t) is the number of documents in D that contain t
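Under these definitions, a minimal TF-IDF computation might look like the following sketch. It uses raw counts for tf and an unsmoothed logarithm for idf; real libraries such as scikit-learn apply smoothed variants.

```python
import math
from collections import Counter

def tf(term, doc):
    """Raw count of the term in the document (a list of tokens)."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(N / df): near 0 for common terms, larger for rare ones."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
# "the" appears in 2 of 3 documents, "cat" in only 1,
# so "cat" scores higher in the first document.
print(round(tfidf("cat", docs[0], docs), 3))
print(round(tfidf("the", docs[0], docs), 3))
```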
Conflation algorithms
Conflation algorithms are used in Information Retrieval (IR) systems for matching the morphological
variants of terms for efficient indexing and faster retrieval operations. The conflation process can be done
either manually or automatically. The automatic conflation operation is also called stemming.

Conflation algorithms are used for improving IR performance by finding morphological variants of search
terms. For example, if a searcher enters the term stemming as part of a query, it is likely that he or she will
also be interested in such variants as stemmed and stem. We use the term conflation, meaning the act of
fusing or combining, as the general term for the process of matching morphological term variants.
Conflation can be either manual, using regular expressions, or automatic, via programs called stemmers.

Stemming is the process of reducing a word to its word stem by removing affixes (suffixes and
prefixes), or to the root form of the word, known as the lemma. For example, words such as “likes”,
“liked”, “likely” and “liking” will be reduced to “like” after stemming.
There are four automatic approaches. Affix removal algorithms remove suffixes or prefixes
from terms, leaving a stem. Successor variety stemmers use the frequencies of letter
sequences in the text as the basis for stemming. The n-gram method conflates terms based
on the number of digrams or n-grams they share. Table lookup stemmers simply store stems
and their variants in a table. Stemmers are judged on correctness, retrieval effectiveness,
and compression performance. There are two ways stemming can be incorrect: over-stemming
and under-stemming. When a term is over-stemmed, too much of it is removed, which may
cause unrelated terms to be conflated. Under-stemming is the removal of too little of a term,
which prevents related terms from being conflated.
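A toy affix-removal stemmer, the first approach above, might be sketched as follows. Real systems use much more careful rules (e.g. the Porter stemmer, available through NLTK); the suffix list and the longest-match rule here are deliberately simplified, and note that a stem need not be a dictionary word.

```python
# Illustrative suffix list; a real stemmer has many more rules.
SUFFIXES = ["ing", "ed", "ly", "es", "s"]

def stem(word):
    """Strip the longest matching suffix, keeping a stem of >= 3 letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["likes", "liked", "liking"]:
    print(w, "->", stem(w))  # all three conflate to the stem "lik"
```

Even this toy version shows the trade-off discussed above: stripping too aggressively (over-stemming) conflates unrelated terms, while stripping too little (under-stemming) leaves related variants unmatched.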
