
Boolean Retrieval Model

Information Retrieval

A model is an idealization or abstraction of an actual process.

Information retrieval models can describe the computational process, for example how documents
are ranked. Note that how documents or indexes are stored is a matter of implementation, not
part of the model.

The goal of information retrieval (IR) is to provide users with the documents that will
satisfy their information need. Retrieval models can also attempt to describe the human process,
such as the information need and user interaction.

Retrieval models can be categorized as

1. Boolean retrieval model
2. Vector space model
3. Probabilistic model
4. Model based on belief networks

The Boolean model is a classical information retrieval (IR) model, and it is the first and most
widely adopted one. Many IR systems still use it today.

Exact vs Best match

In exact match, a query specifies precise criteria. Each document either matches or fails to
match the query. The result of an exact-match query is a set of documents (without ranking).

In best match, a query describes good or best matching documents. In this case the result is a
ranked list of documents. The Boolean model discussed here is the most common
exact-match model.
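
The difference between the two result types can be sketched in a few lines of Python. The toy collection, the function names, and the scoring rule (counting shared terms) are assumptions made for illustration only; counting shared terms is not a real ranking model.

```python
# Hypothetical toy collection: document ID -> set of index terms.
docs = {
    "d1": {"boolean", "retrieval", "model"},
    "d2": {"vector", "space", "model"},
    "d3": {"boolean", "model"},
}

def exact_match(required_terms):
    """Return the unranked set of documents containing every required term."""
    return {doc_id for doc_id, terms in docs.items() if required_terms <= terms}

def best_match(query_terms):
    """Return all documents ordered by how many query terms they contain.

    Counting shared terms is only an illustrative score (an assumption).
    """
    scores = {doc_id: len(query_terms & terms) for doc_id, terms in docs.items()}
    # Sort by descending score, then by ID so that ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

query = {"boolean", "retrieval"}
print(exact_match(query))  # a set: only documents that contain both terms
print(best_match(query))   # a list: every document, best match first
```

Exact match answers with a set that a document is either in or out of, while best match always returns an ordering.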

Basic Assumptions of the Boolean Model

1. An index term is either present (1) or absent (0) in a document.
2. All index terms provide equal evidence with respect to the information need.
3. Queries are Boolean combinations of index terms.
o X AND Y: represents the documents that contain both X and Y
o X OR Y: represents the documents that contain X, Y, or both
o NOT X: represents the documents that do not contain X
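
One way to read the three operators is as set operations on the IDs of the documents that contain each term. The sketch below assumes a hypothetical collection of four documents; the document IDs and variable names are made up for illustration.

```python
# Hypothetical posting sets: the IDs of the documents that contain each term.
all_docs = {1, 2, 3, 4}
docs_with_x = {1, 2, 4}
docs_with_y = {2, 3}

x_and_y = docs_with_x & docs_with_y   # documents containing both X and Y
x_or_y = docs_with_x | docs_with_y    # documents containing X, Y, or both
not_x = all_docs - docs_with_x        # documents that do not contain X

print(sorted(x_and_y))  # [2]
print(sorted(x_or_y))   # [1, 2, 3, 4]
print(sorted(not_x))    # [3]
```

Note that NOT needs the full set of document IDs, because it is defined relative to the whole collection.
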
Consider the sentences below:
1. I am a cow.
2. Cow is what I am.
3. Today is Tuesday.
Now suppose I ask you a question: which sentences contain the term ‘cow’ but
not ‘Tuesday’?
For a human, it is easy to see that the answer is sentence 1 and sentence 2.
But how can we model this problem mathematically so that a machine can solve it?

Term-Document Incidence Matrix

The term-document incidence matrix is one of the basic techniques for representing text data:
> We collect the unique words across all the documents.
> For each term and each document, we put 1 in the cell if the term occurs in the document and 0 otherwise.

For the sentences in our problem statement, the term-document incidence matrix looks like this:

Term       Sentence 1   Sentence 2   Sentence 3
i          1            1            0
am         1            1            0
a          1            0            0
cow        1            1            0
is         0            1            1
what       0            1            0
today      0            0            1
tuesday    0            0            1

Term-document incidence matrix for sentences 1, 2 and 3.

Note: words are normalized, i.e. each distinct word gets a single row no matter how many
documents/sentences contain it, and ‘Cow’ and ‘cow’ count as the same term.
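
As a minimal sketch, the matrix for the three sentences can be built as follows. The tokenizer (lowercasing and keeping only alphabetic runs) and the function names are simplifying assumptions, not a prescribed method.

```python
import re

sentences = ["I am a cow.", "Cow is what I am.", "Today is Tuesday."]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def build_incidence_matrix(documents):
    """Map each unique term to a 0/1 vector with one entry per document."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_sets))
    return {
        term: [1 if term in tokens else 0 for tokens in token_sets]
        for term in vocabulary
    }

matrix = build_incidence_matrix(sentences)
for term, row in matrix.items():
    print(f"{term:<8} {row}")
```

Each key of `matrix` is one row of the incidence matrix, so `matrix["cow"]` is the term vector for ‘cow’.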

Boolean Retrieval Model

It is one application of this matrix: we can answer any query that is a Boolean expression
of terms, that is, one in which terms are combined with the
operators AND, OR, and NOT.

For our query, i.e. get the sentences which contain the term ‘cow’ but not ‘tuesday’:
> We get the term vector of each query term, which is simply the row for that term
in the term-document matrix. For example, the vector for ‘cow’ is [1,1,0].
> We take the complement of the vector of every negated term (here, ‘tuesday’).
> We perform a bitwise AND operation between the resulting vectors.

Let’s apply the algorithm and see if we get the right answer.

1. Cow vector = [1,1,0]

2. Tuesday vector = [0,0,1]

3. NOT Tuesday vector = [1,1,0]. The NOT vector is obtained by taking the complement of the
original vector.

Perform the bitwise AND operation:

[1,1,0] & [1,1,0] => [1,1,0]

Inference from the result:

In the result of the bitwise AND operation, each index that holds a 1 identifies a sentence
that satisfies the input query. Hence, sentences 1 and 2 contain the word ‘cow’
but not ‘tuesday’ and are returned as the result for the query.
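
The worked example above can be reproduced with a short sketch. The helper names `complement` and `bitwise_and` are illustrative assumptions; the two vectors are the ones given in the text.

```python
# Term vectors from the worked example (one entry per sentence).
cow = [1, 1, 0]
tuesday = [0, 0, 1]

def complement(vector):
    """NOT: flip every bit of a 0/1 vector."""
    return [1 - bit for bit in vector]

def bitwise_and(left, right):
    """AND: combine two 0/1 vectors position by position."""
    return [a & b for a, b in zip(left, right)]

result = bitwise_and(cow, complement(tuesday))
# Positions holding 1 are the matching sentences (numbered from 1).
matches = [index + 1 for index, bit in enumerate(result) if bit == 1]

print(result)   # [1, 1, 0]
print(matches)  # [1, 2]
```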

Conclusion

The term-document incidence matrix is one of the basic mathematical models for representing text,
and it can be used to answer Boolean expression queries via the Boolean retrieval
model. The key points to consider are:

1. It can answer any query that is a Boolean expression.
2. It views a document as a set of terms.
3. It offers good precision, since a document is retrieved only if it matches the condition.
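
To illustrate the first point, here is a hedged sketch of a recursive evaluator for nested Boolean queries. The nested-tuple query format and the `evaluate` function are hypothetical choices for this example; the three rows are taken from the incidence matrix of the example sentences.

```python
# Incidence rows for three terms of the example sentences.
matrix = {
    "cow": [1, 1, 0],
    "is": [0, 1, 1],
    "tuesday": [0, 0, 1],
}

def evaluate(query):
    """Evaluate a nested query such as ("AND", "cow", ("NOT", "tuesday")).

    The tuple format is a hypothetical representation chosen for this sketch.
    """
    if isinstance(query, str):  # a bare term: look up its row
        return matrix[query]
    operator, *operands = query
    vectors = [evaluate(operand) for operand in operands]
    if operator == "NOT":
        return [1 - bit for bit in vectors[0]]
    if operator == "AND":
        return [int(all(bits)) for bits in zip(*vectors)]
    if operator == "OR":
        return [int(any(bits)) for bits in zip(*vectors)]
    raise ValueError(f"unknown operator: {operator}")

print(evaluate(("AND", "cow", ("NOT", "tuesday"))))       # [1, 1, 0]
print(evaluate(("OR", "tuesday", ("AND", "cow", "is"))))  # [0, 1, 1]
```

Because every operator returns a vector of the same length, sub-expressions can be nested to any depth.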

Advantages

• Clean formalism
• Easy to implement
• Intuitive concept
• If the resulting document set is too small or too big, it is immediately clear which
operators will produce a bigger or a smaller set, respectively.
• It gives (expert) users a sense of control over the system. It is immediately clear why a
document has been retrieved for a given query.

Disadvantages

• Exact matching frequently retrieves too few or too many documents in response to a user
query
• The information need has to be translated into a Boolean expression, which most users find
hard and awkward
• All terms are equally weighted
• More like data retrieval than information retrieval
• Retrieval is based on binary decision criteria, with no notion of partial matching
• No ranking of the documents is provided (absence of a grading scale)
• The Boolean queries formulated by users are most often too simplistic
