IR Practical Manual 2

The document outlines various practical exercises related to information retrieval systems, including tasks such as document indexing, retrieval models, spelling correction, evaluation metrics, text categorization, clustering, web crawling, and link analysis. Each practical includes specific aims, code implementations, and outputs demonstrating the results of the algorithms and techniques applied. The exercises utilize libraries like NLTK, scikit-learn, and BeautifulSoup to implement and evaluate different information retrieval methodologies.


PRACTICAL NO 1

Aim: Document Indexing and Retrieval

Implement an inverted index construction algorithm.

Build a simple document retrieval system using the constructed index.

CODE :

import nltk
from nltk.corpus import stopwords

document1 = "The quick brown fox jumped over the lazy dog"
document2 = "The lazy dog slept in the sun"

nltk.download('stopwords')
stopWords = stopwords.words('english')

tokens1 = document1.lower().split()
tokens2 = document2.lower().split()
terms = list(set(tokens1 + tokens2))

inverted_index = {}
occ_num_doc1 = {}
occ_num_doc2 = {}

# Build the inverted index, skipping stop words and recording term frequencies
for term in terms:
    if term in stopWords:
        continue
    documents = []
    if term in tokens1:
        documents.append("Document 1")
        occ_num_doc1[term] = tokens1.count(term)
    if term in tokens2:
        documents.append("Document 2")
        occ_num_doc2[term] = tokens2.count(term)
    inverted_index[term] = documents

# Print each term with its posting list and term frequencies
for term, documents in inverted_index.items():
    print(term, "->", end=" ")
    for doc in documents:
        if doc == "Document 1":
            print(f"{doc} ({occ_num_doc1.get(term, 0)}),", end=" ")
        else:
            print(f"{doc} ({occ_num_doc2.get(term, 0)}),", end=" ")
    print()

print("Performed by 740_Pallavi & 743_Deepak")

OUTPUT :

[nltk_data] Downloading package stopwords to

[nltk_data] C:\Users\shivl\AppData\Roaming\nltk_data...

[nltk_data] Unzipping corpora\stopwords.zip.

over -> Document 2 (0),

Performed by 740_Pallavi & 743_Deepak
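The aim also asks for a simple retrieval system built on the constructed index, which the listing above does not show: it only builds and prints the index. A minimal retrieval sketch, assuming the inverted_index dictionary from the code above (the retrieve helper is illustrative, not part of the original practical):

def retrieve(query, index):
    # Keep only query terms that actually occur in the index
    terms = [t for t in query.lower().split() if t in index]
    if not terms:
        return []
    # AND semantics: intersect the posting lists of all query terms
    result = set(index[terms[0]])
    for term in terms[1:]:
        result &= set(index[term])
    return sorted(result)

print(retrieve("lazy dog", inverted_index))  # expected here: ['Document 1', 'Document 2']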


PRACTICAL NO 2

Aim: Retrieval Models

Implement the Boolean retrieval model and process queries.

Implement the vector space model with TF-IDF weighting and cosine similarity.

CODE :

documents = {
    1: "apple banana orange",
    2: "apple banana",
    3: "banana orange",
    4: "apple"
}

def build_index(docs):
    # Map each term to the set of document ids that contain it
    index = {}
    for doc_id, text in docs.items():
        terms = set(text.split())
        for term in terms:
            if term not in index:
                index[term] = {doc_id}
            else:
                index[term].add(doc_id)
    return index

inverted_index = build_index(documents)

def boolean_and(operands, index):
    # With no operands, every document matches
    if not operands:
        return list(range(1, len(documents) + 1))
    result = index.get(operands[0], set())
    for term in operands[1:]:
        result = result.intersection(index.get(term, set()))
    return list(result)

def boolean_or(operands, index):
    result = set()
    for term in operands:
        result = result.union(index.get(term, set()))
    return list(result)

def boolean_not(operand, index, total_docs):
    operand_set = set(index.get(operand, set()))
    all_docs_set = set(range(1, total_docs + 1))
    return list(all_docs_set.difference(operand_set))

query1 = ["apple", "banana"]
query2 = ["apple", "orange"]

result1 = boolean_and(query1, inverted_index)
result2 = boolean_or(query2, inverted_index)
result3 = boolean_not("orange", inverted_index, len(documents))

print("Documents containing 'apple' and 'banana':", result1)
print("Documents containing 'apple' or 'orange':", result2)
print("Documents not containing 'orange':", result3)

print("Performed by 740_Pallavi & 743_Deepak")

OUTPUT :

Documents containing 'apple' and 'banana': [1, 2]

Documents containing 'apple' or 'orange': [1, 2, 3, 4]

Documents not containing 'orange': [2, 4]

Performed by 740_Pallavi & 743_Deepak

CODE :
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import nltk
from nltk.corpus import stopwords
import numpy as np
from numpy.linalg import norm

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright."]

nltk.download('stopwords')
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
transformer = TfidfTransformer()

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

# Cosine similarity between two vectors
cx = lambda a, b: round(np.inner(a, b) / (norm(a) * norm(b)), 3)

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)

transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

OUTPUT :

[nltk_data] Downloading package stopwords to

[nltk_data] C:\Users\shivl\AppData\Roaming\nltk_data...

[nltk_data] Package stopwords is already up-to-date!

Fit Vectorizer to train set [[1 0 1 0]

[0 1 0 1]]

Transform Vectorizer to test set [[0 1 1 1]]

[1 0 1 0]

[0 1 1 1]

0.408

[0 1 0 1]

[0 1 1 1]

0.816

[[0.70710678 0. 0.70710678 0. ]

[0. 0.70710678 0. 0.70710678]]

[[0. 0.57735027 0.57735027 0.57735027]]
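The listing computes cosine similarity on raw term counts and applies TF-IDF weighting separately. As a sketch of the combined vector space model (not part of the original practical), scikit-learn's TfidfVectorizer and cosine_similarity can score the test sentence against the training sentences directly:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright."]

tfidf = TfidfVectorizer(stop_words='english')
doc_vectors = tfidf.fit_transform(train_set)   # TF-IDF vectors for the "documents"
query_vector = tfidf.transform(test_set)       # same vocabulary applied to the query

# One cosine score per training document; higher means more similar
print(cosine_similarity(query_vector, doc_vectors))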


PRACTICAL NO 3

Aim: Spelling Correction in IR Systems

Develop a spelling correction module using edit distance algorithms.

Integrate the spelling correction module into an information retrieval system.

CODE :

def editDistance(str1, str2, m, n):
    # Base cases: if one string is empty, insert all characters of the other
    if m == 0:
        return n
    if n == 0:
        return m
    # Last characters match: no edit needed for them
    if str1[m-1] == str2[n-1]:
        return editDistance(str1, str2, m-1, n-1)
    # Otherwise try insert, delete and replace, and take the cheapest option
    return 1 + min(editDistance(str1, str2, m, n-1),    # insert
                   editDistance(str1, str2, m-1, n),    # delete
                   editDistance(str1, str2, m-1, n-1))  # replace

str1 = "sunday"
str2 = "saturday"
print('Edit Distance is: ', editDistance(str1, str2, len(str1), len(str2)))

OUTPUT :

Edit Distance is: 3
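The recursive editDistance above runs in exponential time, and the second part of the aim, integrating the correction module into a retrieval system, is not shown. A minimal sketch covering both points, using a dynamic-programming edit distance and a small illustrative vocabulary (edit_distance_dp, correct_term and the vocabulary are hypothetical additions, not part of the original listing):

def edit_distance_dp(a, b):
    # Iterative dynamic-programming version; avoids the exponential recursion above
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            elif a[i-1] == b[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i][j-1], dp[i-1][j], dp[i-1][j-1])
    return dp[m][n]

def correct_term(term, vocabulary):
    # Replace a (possibly misspelled) query term with the closest vocabulary term
    return min(vocabulary, key=lambda w: edit_distance_dp(term, w))

vocabulary = ["sunday", "saturday", "monday"]
print(correct_term("sundey", vocabulary))  # expected: sunday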


PRACTICAL NO 4

Aim: Evaluation Metrics for IR Systems

Calculate precision, recall, and F-measure for a given set of retrieval results.

Use an evaluation toolkit to measure average precision and other evaluation metrics.

CODE :

A)

def calculate_metrics(retrieved_set, relevant_set):
    true_positive = len(retrieved_set.intersection(relevant_set))
    false_positive = len(retrieved_set.difference(relevant_set))
    false_negative = len(relevant_set.difference(retrieved_set))
    '''
    (Optional)
    PPT values:
    true_positive = 20
    false_positive = 10
    false_negative = 30
    '''
    print("True Positive: ", true_positive,
          "\nFalse Positive: ", false_positive,
          "\nFalse Negative: ", false_negative, "\n")

    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

retrieved_set = set(["doc1", "doc2", "doc3"])

relevant_set = set(["doc1", "doc4"])


precision, recall, f_measure = calculate_metrics(retrieved_set, relevant_set)

print(f"Precision: {precision}")

print(f"Recall: {recall}")

print(f"F-measure: {f_measure}")

OUTPUT :

True Positive: 1

False Positive: 2

False Negative: 1

Precision: 0.3333333333333333

Recall: 0.5

F-measure: 0.4


B)

from sklearn.metrics import average_precision_score

y_true = [0, 1, 1, 0, 1, 1]

y_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

average_precision = average_precision_score(y_true, y_scores)

print(f'Average precision-recall score: {average_precision}')


OUTPUT :

Average precision-recall score: 0.8041666666666667
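average_precision_score evaluates a single ranked result list. When there are several queries, the usual summary figure is mean average precision (MAP). A short sketch of that extension; the second query's labels and scores below are made-up illustrative data:

import numpy as np
from sklearn.metrics import average_precision_score

# One (y_true, y_scores) pair per query
queries = [
    ([0, 1, 1, 0, 1, 1], [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]),
    ([1, 0, 1, 0],       [0.7, 0.2, 0.6, 0.1]),
]

ap_per_query = [average_precision_score(y, s) for y, s in queries]
print("Mean average precision:", np.mean(ap_per_query))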


PRACTICAL NO 5

Aim: Text Categorization

Implement a text classification algorithm (e.g., Naive Bayes or Support Vector Machines).

Train the classifier on a labelled dataset and evaluate its performance.

CODE :

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import accuracy_score, classification_report

# Load and preprocess the training data

df_train = pd.read_csv('Dataset.csv') # Assuming 'Dataset.csv' is in the same folder

df_train["data"] = df_train["covid"].astype(str) + " " + df_train["fever"].astype(str)

X, y = df_train["data"], df_train["flu"]

# Split data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text data

vectorizer = CountVectorizer()

X_train_counts = vectorizer.fit_transform(X_train)

X_test_counts = vectorizer.transform(X_test)

# Train the model

classifier = MultinomialNB()

classifier.fit(X_train_counts, y_train)

# Load and preprocess the test data

df_test = pd.read_csv('Test.csv') # Assuming 'Test.csv' is in the same folder

df_test["data"] = df_test["covid"].astype(str) + " " + df_test["fever"].astype(str)


new_data_counts = vectorizer.transform(df_test["data"])

# Predict and evaluate

predictions = classifier.predict(new_data_counts)

print(predictions)

# Evaluate the model's performance

print(f"\nAccuracy: {accuracy_score(y_test, classifier.predict(X_test_counts)):.2f}")

print("Classification Report:")

print(classification_report(y_test, classifier.predict(X_test_counts)))

# Save predictions to CSV

df_test['flu_prediction'] = predictions

df_test.to_csv('Test_with_predictions.csv', index=False) # Save predictions in the same folder

OUTPUT :

['Yes' 'No' 'Yes']

Accuracy: 1.00

Classification Report:

              precision    recall  f1-score   support

          No       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1
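The listing reads Dataset.csv and Test.csv, which are not included in this manual. To try the same CountVectorizer + MultinomialNB pipeline without those files, a self-contained sketch with a made-up toy dataset (the texts and labels below are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy labelled data: symptom descriptions and a Yes/No flu label (illustrative only)
texts = ["covid yes fever high", "covid no fever low",
         "covid yes fever low", "covid no fever high"]
labels = ["Yes", "No", "No", "Yes"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

new_texts = ["covid yes fever high", "covid no fever low"]
print(clf.predict(vectorizer.transform(new_texts)))
print("Training accuracy:", accuracy_score(labels, clf.predict(X)))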


PRACTICAL NO 6

Aim: Clustering for Information Retrieval

Implement a clustering algorithm (e.g., K-means or hierarchical clustering).

Apply the clustering algorithm to a set of documents and evaluate the clustering results.

CODE :

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Cats are known for their agility and grace",
             "Dogs are often called mans best friend.",
             "Some dogs are trained to assist people with disabilities.",
             "The sun rises in the east and sets in the west.",
             "Many cats enjoy climbing trees and chasing toys."]

# Convert the documents to TF-IDF vectors and cluster them with K-means
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print(kmeans.labels_)

OUTPUT :

[0 2 0 1 0]
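The aim also asks for an evaluation of the clustering results, which the listing does not include. One common intrinsic measure is the silhouette coefficient; a minimal sketch, reusing X and kmeans from the code above:

from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1; values nearer 1 indicate compact, well-separated clusters
score = silhouette_score(X, kmeans.labels_)
print(f"Silhouette score: {score:.3f}")
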
PRACTICAL NO 7

Aim: Web Crawling and Indexing

Develop a web crawler to fetch and index web pages.

Handle challenges such as robots.txt, dynamic content, and crawling delays.

CODE :

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
import time

def get_html(url):
    try:
        return requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
    except requests.RequestException:
        return None

def save_and_print_robots_txt(url):
    robots_url = urljoin(url, '/robots.txt')
    robots_content = get_html(robots_url)
    if robots_content:
        with open('robots.txt', 'w', encoding='utf-8') as file:
            file.write(robots_content)
        print("robots.txt content:")
        print(robots_content)
    else:
        print("No robots.txt found.")

def is_allowed_by_robots(url):
    try:
        with open('robots.txt', 'r') as file:
            parser = RobotFileParser()
            parser.parse(file.read().splitlines())
            return parser.can_fetch('*', url)
    except Exception:
        return True

def crawl(start_url, max_depth=3, delay=1):
    visited = set()
    save_and_print_robots_txt(start_url)  # Save and print robots.txt content

    def crawl_recursive(url, depth):
        if depth > max_depth or url in visited or not is_allowed_by_robots(url):
            return
        visited.add(url)
        time.sleep(delay)  # Politeness delay between requests
        html = get_html(url)
        if html:
            print(f"Crawling: {url}")
            links = [urljoin(url, a['href'])
                     for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True)]
            for link in links:
                crawl_recursive(link, depth + 1)

    crawl_recursive(start_url, 1)

# Example usage:
crawl('https://wikipedia.com', max_depth=2, delay=2)

OUTPUT :

robots.txt content:

Crawling: https://wikipedia.com
Crawling: https://en.wikipedia.org/

Crawling: https://ja.wikipedia.org/

Crawling: https://ru.wikipedia.org/

Crawling: https://de.wikipedia.org/

Crawling: https://es.wikipedia.org/

Crawling: https://fr.wikipedia.org/

Crawling: https://it.wikipedia.org/

Crawling: https://zh.wikipedia.org/

Crawling: https://fa.wikipedia.org/

Crawling: https://pl.wikipedia.org/

Crawling: https://ar.wikipedia.org/

Crawling: https://arz.wikipedia.org/

Crawling: https://nl.wikipedia.org/

Crawling: https://pt.wikipedia.org/

Crawling: https://ceb.wikipedia.org/

Crawling:

# robots.txt for http://www.wikipedia.org/ and friends

# Please note: There are a lot of pages on this site, and there are

# some misbehaved spiders out there that go _way_ too fast. If you're

# irresponsible, your access to the site may be blocked.

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN

# and ignoring 429 ratelimit responses, claims to respect robots:

# http://mj12bot.com/

User-agent: MJ12bot

Disallow: /
# advertising-related bots:

User-agent: Mediapartners-Google*

Disallow: /

# Wikipedia work bots:

User-agent: IsraBot

Disallow:

User-agent: Orthogaffections
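The crawler above re-parses robots.txt from disk for every URL it checks. An alternative sketch (not part of the original listing) lets the standard library's RobotFileParser fetch and parse the file once, after which the same parser object answers every can_fetch query:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def make_robots_checker(start_url, user_agent='*'):
    # Hypothetical helper: fetch and parse robots.txt once, reuse for all URL checks
    parser = RobotFileParser()
    parser.set_url(urljoin(start_url, '/robots.txt'))
    parser.read()
    return lambda url: parser.can_fetch(user_agent, url)

is_allowed = make_robots_checker('https://wikipedia.com')
print(is_allowed('https://wikipedia.com/'))
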
PRACTICAL NO 8

Aim: Link Analysis and PageRank

Implement the PageRank algorithm to rank web pages based on link analysis.

Apply the PageRank algorithm to a small web graph and analyze the results.

CODE :

import numpy as np

def page_rank(graph, damping_factor=0.85, max_iterations=100, tolerance=1e-6):
    num_nodes = len(graph)
    page_ranks = np.ones(num_nodes) / num_nodes
    for _ in range(max_iterations):
        prev_page_ranks = np.copy(page_ranks)
        for node in range(num_nodes):
            # Nodes that link to the current node
            incoming_links = [i for i, v in enumerate(graph) if node in v]
            if not incoming_links:
                continue
            page_ranks[node] = (1 - damping_factor) / num_nodes + \
                damping_factor * sum(prev_page_ranks[link] / len(graph[link])
                                     for link in incoming_links)
        # Stop once the ranks change by less than the tolerance
        if np.linalg.norm(page_ranks - prev_page_ranks, 2) < tolerance:
            break
    return page_ranks

if __name__ == "__main__":
    # Adjacency lists: web_graph[i] holds the pages that page i links to
    web_graph = [
        [1, 2],
        [0, 2],
        [0, 1],
        [1, 2],
    ]
    result = page_rank(web_graph)
    for i, pr in enumerate(result):
        print(f"Page {i}: {pr}")

OUTPUT :

Page 0: 0.6725117940472367

Page 1: 0.7470731975560085

Page 2: 0.7470731975560085

Page 3: 0.25
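For a cross-check, the same graph can be fed to networkx's built-in PageRank, which normalises the scores so that they sum to 1; the hand-rolled version above does not normalise, so the absolute values differ. A sketch, assuming networkx is installed:

import networkx as nx

# Same link structure as web_graph above, written as directed edges (source, target)
edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1), (3, 1), (3, 2)]
G = nx.DiGraph(edges)

ranks = nx.pagerank(G, alpha=0.85)
for node, score in sorted(ranks.items()):
    print(f"Page {node}: {score:.4f}")
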
PRACTICAL NO 9

Aim: Learning to Rank

Implement a learning to rank algorithm (e.g., RankSVM or RankBoost).

Train the ranking model using labelled data and evaluate its effectiveness.

CODE :

import numpy as np

from sklearn.svm import SVC

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

# Example training data (features for each document in a query)

X_train = np.array([

[0.2, 0.3], # Feature vector for first document in query 1

[0.1, 0.4], # Feature vector for second document in query 1

[0.4, 0.2], # Feature vector for first document in query 2

[0.3, 0.5] # Feature vector for second document in query 2

])

# Labels: 1 for better rank (higher score), 0 for lower rank

y_train = np.array([1, 0, 1, 0])

# LabelEncoder to encode the ranking

le = LabelEncoder()

y_train = le.fit_transform(y_train) # Make sure labels are in the proper range (0, 1)

# RankSVM: Using the Support Vector Classification (SVC) to simulate RankSVM

rank_svm = SVC(kernel='linear', C=1)

# Train RankSVM

rank_svm.fit(X_train, y_train)

# Example prediction (ranking for a new query)


test_data = np.array([[0.3, 0.4], [0.2, 0.5]]) # Test query

predictions = rank_svm.predict(test_data)

# Output predictions (Ranked results)

print(f"Predictions for ranking: {predictions}")

OUTPUT :

Predictions for ranking: [0 0]
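The listing trains an ordinary binary SVC on per-document labels, which only approximates ranking. A proper RankSVM learns from pairwise preferences within each query; a sketch of that pairwise transform is shown below (the query grouping and data values are illustrative, not from the original practical):

import numpy as np
from sklearn.svm import SVC

# Per-query documents: (features, relevance); higher relevance should rank higher
queries = [
    [([0.2, 0.3], 1), ([0.1, 0.4], 0)],   # query 1
    [([0.4, 0.2], 1), ([0.3, 0.5], 0)],   # query 2
]

# Pairwise training data: x_i - x_j labelled +1 if i should outrank j, else -1
X_pairs, y_pairs = [], []
for docs in queries:
    for (xi, ri) in docs:
        for (xj, rj) in docs:
            if ri != rj:
                X_pairs.append(np.array(xi) - np.array(xj))
                y_pairs.append(1 if ri > rj else -1)

rank_svm = SVC(kernel='linear', C=1)
rank_svm.fit(np.array(X_pairs), np.array(y_pairs))

# Score new documents with the learned weight vector and sort by score
test_docs = np.array([[0.3, 0.4], [0.2, 0.5]])
scores = test_docs @ rank_svm.coef_.ravel()
print("Ranking (best first):", np.argsort(-scores))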


PRACTICAL NO 10

Aim: Advanced Topics in Information Retrieval

Implement a text summarization algorithm (e.g., extractive or abstractive).

Build a question-answering system using techniques such as information extraction

CODE :

from transformers import pipeline

# Initialize the pipelines

summarizer_extractive = pipeline("summarization", model="facebook/bart-large-cnn")

summarizer_abstractive = pipeline("summarization", model="t5-base")

qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

# Sample text

text = """

Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural


intelligence displayed by humans.

Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives
its environment and takes actions

that maximize its chance of achieving its goals. Colloquially, the term "artificial intelligence" is often
used to describe machines

(or computers) that mimic "cognitive" functions.

"""

# 1. Extractive Summarization

extractive_summary = summarizer_extractive(text, max_length=50, min_length=25,


do_sample=False)

print("Extractive Summary:")

print(extractive_summary[0]['summary_text'])

print("\n" + "="*80 + "\n")

# 2. Abstractive Summarization
abstractive_summary = summarizer_abstractive(text, max_length=50, min_length=25,
do_sample=False)

print("Abstractive Summary:")

print(abstractive_summary[0]['summary_text'])

print("\n" + "="*80 + "\n")

# 3. Question Answering

context = text # The passage from which the answer will be extracted

question = "What is artificial intelligence?"

answer = qa_pipeline(question=question, context=context)

print("Question Answering Result:")

print("Question:", question)

print("Answer:", answer['answer'])

OUTPUT :

Device set to use cpu

Device set to use cpu

Device set to use cpu

Extractive Summary:

Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural


intelligence displayed by humans. Colloquially, the term "artificial intelligence" is often used to
describe machines(or computers) that mimic "c

================================================================================

Abstractive Summary:

leading AI textbooks define the field as the study of "intelligent agents" the term "artificial
intelligence" is often used to describe machines that mimic "cognitive" functions .

================================================================================

Question Answering Result:


Question: What is artificial intelligence?

Answer: intelligence demonstrated by machines
