Boolean Retrieval Model
The goal of information retrieval (IR) is to provide users with those documents that will
satisfy their information need. Retrieval models can attempt to describe the human process,
such as the information need and the interaction with the system.
The Boolean model of information retrieval is a classical IR model. It was the first such
model and is the most widely adopted one. It is used by virtually all commercial IR systems today.
In exact match, a query specifies precise criteria. Each document either matches or fails to
match the query. The result retrieved in exact match is a set of documents (without ranking).
In best match, a query describes good or best matching documents. In this case the result is a
ranked list of documents. The Boolean model I’m going to deal with here is the most common
exact match model.
The term-document incidence matrix is one of the basic techniques for representing text data.
It is built as follows:
> We get the unique words across all the documents.
> For each document, we put 1 in the cell if the term exists in the document, otherwise 0.
For the sentences from our problem statement, the Term-Document Incidence Matrix
will look something like this:
Term-Document Incidence Matrix for sentences 1, 2 and 3.
Note: Words are normalized, i.e. the same word is not counted twice across the
documents/sentences, so each unique term gets exactly one row.
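The two construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own implementation: the three sentences in `docs` are hypothetical stand-ins for the problem-statement sentences, chosen so that the row for "cow" comes out as [1,1,0], and the helper names `tokenize` and `build_incidence_matrix` are assumptions made for this example.

```python
# Hypothetical three-sentence corpus (stand-ins, not the article's sentences).
docs = [
    "The cow jumped over the moon",
    "A cow eats grass",
    "It rained on Tuesday",
]


def tokenize(text):
    # Lowercase and split on whitespace; a real system would also
    # strip punctuation and possibly stem the words.
    return text.lower().split()


def build_incidence_matrix(documents):
    # Step 1: collect the unique words of every document.
    tokenized = [set(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*tokenized))
    # Step 2: one row per term, one column per document,
    # 1 if the term occurs in that document and 0 otherwise.
    return {
        term: [1 if term in doc else 0 for doc in tokenized]
        for term in vocabulary
    }


matrix = build_incidence_matrix(docs)
print(matrix["cow"])      # [1, 1, 0]
print(matrix["tuesday"])  # [0, 0, 1]
```

Storing the matrix as a dictionary from term to row keeps the lookup of a term vector to a single key access, which is all the Boolean queries need.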
One application of this matrix is answering any query that is in the form
of a Boolean expression of terms, that is, one in which terms are combined with the
operators AND, OR, and NOT.
For our query, i.e. get the sentences which contain the term ‘cow’ but not ‘tuesday’:
> We get the term vector of each query term, which is simply the row for that term
in the Term-Document Matrix. For example, for ‘cow’ the vector is [1,1,0].
> We perform a bitwise AND operation between the vectors of the terms provided in the input
query, after complementing the vector of any term preceded by NOT.
Let’s apply the algorithm and see if we get the right answer.
3. Not Tuesday Vector = [1,1,0]. The NOT vector is obtained by taking the complement of the
original vector.
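The query can be traced in code as a quick check. The sketch below takes the ‘cow’ vector [1,1,0] from the text; the ‘tuesday’ vector [0,0,1] is an assumption inferred from the Not Tuesday vector given above, and the helper names `negate` and `and_vectors` are made up for this example.

```python
def negate(vector):
    # Complement of a 0/1 vector: every 1 becomes 0 and every 0 becomes 1.
    return [1 - bit for bit in vector]


def and_vectors(a, b):
    # Element-wise bitwise AND of two vectors of equal length.
    return [x & y for x, y in zip(a, b)]


cow = [1, 1, 0]      # row for 'cow', as given in the text
tuesday = [0, 0, 1]  # assumed row for 'tuesday' (complement of [1,1,0])

not_tuesday = negate(tuesday)
result = and_vectors(cow, not_tuesday)

# Positions holding a 1 are the matching sentences (numbered from 1).
matching = [i + 1 for i, bit in enumerate(result) if bit]

print(not_tuesday)  # [1, 1, 0]
print(result)       # [1, 1, 0]
print(matching)     # [1, 2]
```

Under these assumptions, sentences 1 and 2 contain ‘cow’ but not ‘tuesday’.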
Conclusion
The Term-Document Incidence Matrix is one of the basic mathematical models for representing text,
and it can be used to answer Boolean expression queries via a model called the Boolean Retrieval
Model. Below are the key points to consider:
Advantages
Clean formalism
Easy to implement
Intuitive concept
If the resulting document set is either too small or too big, it is immediately clear which
operators will produce a bigger or a smaller set, respectively.
It gives (expert) users a sense of control over the system. It is immediately clear why a
document has been retrieved given a query.
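The point about resizing the result set can be illustrated with a small sketch. The posting sets below are hypothetical document numbers, not data from the article: adding a term with AND can only shrink the result, and adding one with OR can only grow it.

```python
# Hypothetical posting sets: the document numbers that contain each term.
postings = {
    "cow": {1, 2},
    "grass": {2},
    "tuesday": {3},
}

# AND is set intersection, so the result is never larger than either operand.
narrow = postings["cow"] & postings["grass"]

# OR is set union, so the result is never smaller than either operand.
wide = postings["cow"] | postings["tuesday"]

print(sorted(narrow))  # [2]
print(sorted(wide))    # [1, 2, 3]
```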
Disadvantages