
SOCIAL MEDIA & WEB ANALYTICS

END TERM PROJECT (TERM IV)

Auto Corrector Feature Using NLP and Text Analysis of
the Ebook "The Adventures of Sherlock Holmes" in Python

Under the guidance of


Prof. Mayank Sharma

Submitted by Group - 12
Amrutha Varshini MBAA22006
Arushi Golia MBAA22016
Nihali Sawant MBAA22040
Parul Saraswat MBAA22046
Shambhavi Gupta MBAA22063
Table of Contents

01. Introduction
02. Background
03. Reading and Preprocessing of Text
04. Word Frequency and Probability
05. N-Gram Analysis
06. Word Cloud
07. Sentiment Analysis
08. Topic Modeling
09. Sentiment Analysis
10. Named Entity Recognition (NER)
11. Geospatial Analysis
12. Summary

Introduction
Autocorrect is a staple of modern communication, using machine learning and natural language processing (NLP) to support everyday writing tasks. It predicts and corrects misspellings, streamlining the creation of paragraphs, reports, and articles, and many websites and social media platforms integrate it to improve the user experience.

Python, a versatile programming language, is a popular choice for building autocorrection systems. This project is built on the Natural Language Toolkit (NLTK), a widely used library for NLP tasks.

An autocorrection generator analyzes input text and suggests correct spellings for misspelled words. Machine learning models trained on large datasets enable the system to identify and correct errors with high accuracy, while NLP techniques supply the contextual understanding needed for contextually appropriate corrections and for handling the nuances of natural language.

Beyond spelling, autocorrect also improves grammar, punctuation, and overall writing quality, reducing errors in written content across the platforms we use every day. Its continued development and integration into digital environments show how much technology now shapes the way we write.


ARUSHI GOLIA - MBAA22016


Background
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI)
that focuses on the interaction between computers and human language. It
encompasses a wide range of techniques and algorithms aimed at enabling
computers to understand, interpret, and generate human language in a way that
is both meaningful and useful. NLP plays a crucial role in various applications
involving text data, including but not limited to:

1. Language Translation: NLP powers machine translation systems like Google Translate, allowing users to translate text between different languages.
2. Speech Recognition: NLP is fundamental to speech recognition technology,
enabling devices like smartphones and virtual assistants to understand
spoken language.
3. Chatbots and Virtual Assistants: NLP is at the core of chatbots and virtual
assistants, enabling them to hold natural conversations with users.
4. Text Summarization: NLP can be used to automatically summarize lengthy
texts, making it useful in content curation and news aggregation.
5. Information Retrieval: NLP powers search engines, helping users find
relevant information from vast amounts of textual data.

The Need for Autocorrection in Text Editing and Messaging Applications

Autocorrection is a critical feature in text editing and messaging applications for several reasons:



Spelling Errors: People often make spelling mistakes while typing quickly on smartphones or keyboards. Autocorrect identifies and corrects these errors, ensuring that the text is accurate and easily understandable.
Efficiency: Autocorrection increases typing efficiency by reducing the need for manual correction. Users can type faster and with fewer interruptions when they rely on autocorrect to fix mistakes.
User Experience: Messaging applications and word processors aim to provide a seamless, user-friendly experience. Autocorrect contributes to this by preventing users from sending or publishing content with glaring spelling errors.
Consistency: Autocorrect ensures that commonly misspelled words are consistently corrected, promoting clarity and uniformity in written communication.
Predictive Typing: Autocorrect often includes predictive features that suggest words or phrases as users type, correcting errors while helping users complete their sentences more quickly.
Multilingual Support: For users who communicate in multiple languages, autocorrect helps ensure accurate text across different language contexts.

In essence, autocorrection using NLP enhances the quality of text communication by automatically addressing spelling and typing errors, making it an indispensable feature in today's digital communication landscape. It leverages NLP techniques to analyze and understand the context of the text, allowing for intelligent and contextually appropriate corrections.



READING AND PREPROCESSING OF TEXT

The script processes "The Adventures of Sherlock Holmes" by Sir Arthur Conan Doyle, stored in a file named 'final.txt'. It counts words, sentences, and paragraphs, extracts keywords, calculates average word length, and generates a summary of the book's content, giving insight into the book's structure, language usage, and key themes.

Converting the text to lowercase is crucial because it ensures uniformity in the subsequent analysis: it prevents variations in letter casing from being counted as distinct words, so the script can focus on content rather than case distinctions. This consistency makes the frequency counts that follow accurate and reliable.
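The report does not show the preprocessing code itself; the following is a minimal sketch of the step described above, using only the standard library. The `preprocess` helper and the sample sentence are illustrative, not the report's actual implementation.

```python
import re

def preprocess(text):
    # lowercase for uniformity, then split into word tokens
    text = text.lower()
    return re.findall(r"[a-z']+", text)

# In the report the text comes from 'final.txt', e.g.:
# with open('final.txt', encoding='utf-8') as f:
#     words = preprocess(f.read())
sample = "To Sherlock Holmes she is always THE woman."
print(preprocess(sample))
```

The regex keeps letters and apostrophes, so contractions like "don't" survive tokenization intact.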



WORD FREQUENCY AND PROBABILITY
In the code, the task at hand is to analyze a book or text document to
determine how frequently each word appears within it. This process involves
counting the occurrence of each unique word in the text and recording its
frequency. By doing so, we can gain insights into which words are used most
often and thereby understand their significance and prevalence in the text.

Fig: Frequency of Words | Fig: Probability of an Occurrence of the Word
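A sketch of this counting step, assuming a `collections.Counter` over the tokenized text (the toy word list stands in for the book's tokens):

```python
from collections import Counter

def word_probabilities(words):
    # count each unique word, then convert counts to probabilities
    counts = Counter(words)
    total = sum(counts.values())
    return counts, {w: c / total for w, c in counts.items()}

words = "the game is afoot the game is on".split()
counts, probs = word_probabilities(words)
print(counts.most_common(2))   # the two most frequent words
print(probs["the"])
```

The probability of a word is simply its count divided by the total number of tokens, which is what the autocorrector later uses to rank candidate corrections.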

Creation of all types of words

The further code is divided into 5 main parts, covering the creation of all the different candidate words that are possible. To do this, we use:

01. Lemmatization
02. Deletion of a letter
03. Switching letters
04. Replacing a letter
05. Inserting a new letter


PARUL SARASWAT - MBAA22046


Lemmatization
To perform lemmatization we will be using the pattern module. You can install it using the command below.

Lemmatization is a natural language processing (NLP) technique used to reduce words to their base or root form, known as a lemma, in order to simplify word variations and facilitate text analysis. The primary goal of lemmatization is to transform words with different inflections into a common base form.
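The pattern-based code is not shown in the report. As a dependency-free illustration of the idea only, here is a toy suffix-stripping reducer; a real lemmatizer (pattern, NLTK's WordNetLemmatizer, spaCy) uses a dictionary and part-of-speech information rather than crude suffix rules.

```python
def toy_lemma(word):
    # illustrative rules only: map a few common inflections to a base form
    for suffix, repl in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + repl
    return word

print(toy_lemma("stories"))   # story
print(toy_lemma("walked"))    # walk
```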

Deletion of Letter
A function that removes a letter from a given word.
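A one-line sketch of such a function (the report's own code is not shown):

```python
def delete_letter(word):
    # every string obtained by deleting exactly one character
    return [word[:i] + word[i + 1:] for i in range(len(word))]

print(delete_letter("cat"))   # ['at', 'ct', 'ca']
```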



Switching Letter
This function swaps two letters of the word.
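A sketch of the swap, assuming adjacent letters are switched (the standard choice for edit-distance candidates):

```python
def switch_letter(word):
    # every string obtained by swapping two adjacent characters
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]

print(switch_letter("the"))   # ['hte', 'teh']
```

Switching adjacent letters targets the common transposition typo, as in "teh" for "the".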

Replace Letter
It changes one letter to another.
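A possible implementation, trying every lowercase letter at every position:

```python
import string

def replace_letter(word):
    # every string obtained by replacing one character with a different letter
    return [word[:i] + c + word[i + 1:]
            for i in range(len(word))
            for c in string.ascii_lowercase
            if c != word[i]]

print(len(replace_letter("cat")))   # 3 positions x 25 letters = 75
```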



Insert new Letter
It inserts each letter of the alphabet, one at a time, at every position in the word.
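A possible implementation of this insertion step:

```python
import string

def insert_letter(word):
    # every string obtained by inserting one letter at any position
    return [word[:i] + c + word[i:]
            for i in range(len(word) + 1)
            for c in string.ascii_lowercase]

print(len(insert_letter("at")))   # 3 positions x 26 letters = 78
```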

Now that we have implemented all five steps, it's time to merge all the words produced by those functions.

01. Collecting all the words in a set (so that no word is repeated)
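The merging step can be sketched as a single function that generates all four kinds of one-letter edits and collects them in a set (the name `edit_one_letter` is the conventional one for this; the report's own code is not shown):

```python
import string

def edit_one_letter(word):
    # split the word at every position, build all four edit types,
    # and collect the results in a set so no candidate is repeated
    splits   = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [l + r[1:] for l, r in splits if r]
    switches = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r
                for c in string.ascii_lowercase]
    inserts  = [l + c + r for l, r in splits for c in string.ascii_lowercase]
    return set(deletes + switches + replaces + inserts)

print(len(edit_one_letter("at")))
```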


SHAMBHAVI GUPTA - MBAA22063


02. Keeping only those candidates that appear in the vocab

The main task now is to extract the correct words among all the candidates. To do so we use a get_corrections function.

Now that the code is ready, we can test it on any user input and print the top 3 suggestions made by the autocorrect.
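The report's get_corrections implementation is not shown; the following self-contained sketch shows one common way to do it: keep only edit candidates that appear in the vocabulary, rank them by corpus probability, and return the top n. The toy corpus stands in for the book's word list.

```python
import string
from collections import Counter

def edit_one_letter(word):
    # all strings one delete, switch, replace, or insert away from word
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits  = [l + r[1:] for l, r in splits if r]                         # delete
    edits += [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]  # switch
    edits += [l + c + r[1:] for l, r in splits if r
              for c in string.ascii_lowercase]                           # replace
    edits += [l + c + r for l, r in splits for c in string.ascii_lowercase]  # insert
    return set(edits)

def get_corrections(word, probs, vocab, n=3):
    # prefer the word itself if it is known; otherwise fall back to
    # one-edit candidates found in the vocabulary, ranked by probability
    candidates = ({word} & vocab) or (edit_one_letter(word) & vocab) or {word}
    return sorted(candidates, key=lambda w: probs.get(w, 0), reverse=True)[:n]

corpus = "the then they them there the then".split()
counts = Counter(corpus)
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}
print(get_corrections("thn", probs, set(counts), n=3))
```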

The initial implementation involves a basic auto-corrector using Python and NLTK. To enhance it,
the next step is to develop a high-level auto-corrector system that leverages extensive datasets for
improved efficiency and accuracy in correcting spelling and grammar errors in text, making it more
robust and capable.


N-GRAM ANALYSIS
N-gram analysis is a valuable technique in natural language processing and text
analysis. In the context of "The Adventures of Sherlock Holmes," it involves
breaking down the text into sequences of words, known as n-grams. N-grams can
be of various lengths, such as unigrams (single words), bigrams (two-word
sequences), trigrams (three-word sequences), and so on.
By applying n-gram analysis to this classic work, researchers and literary analysts
can uncover essential insights. For example, it can reveal frequently occurring
phrases and idiomatic expressions used by Sir Arthur Conan Doyle in his writing.
It also helps identify recurring sentence structures and grammatical patterns
unique to the Sherlock Holmes stories.
This analysis aids in understanding the author's writing style, thematic elements,
and the narrative's intricacies. It can also assist in characterizing the language and
tone specific to Sherlock Holmes tales.
Moreover, n-gram analysis has applications beyond literature. In fields like
machine learning and natural language processing, it is used for tasks like
language modeling, text generation, and sentiment analysis. In essence, by
dissecting text into n-grams, we gain valuable linguistic and structural insights,
offering a deeper understanding of the text's nuances and contributing to various
analytical and creative endeavors.
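The extraction described above can be sketched in a few lines of standard-library Python; the quoted sentence is only a sample standing in for the book's text:

```python
from collections import Counter

def ngrams(words, n):
    # slide a window of length n over the token list
    return list(zip(*(words[i:] for i in range(n))))

tokens = "it is a capital mistake to theorize before one has data".split()
bigrams = ngrams(tokens, 2)
print(bigrams[:3])
print(Counter(bigrams).most_common(1))
```

Counting the resulting tuples with `Counter` gives the frequently occurring phrases the analysis looks for.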



Plotting N-Gram



WORD CLOUD
Word Clouds are used to visualise the most frequent words in the text.
This provides a quick overview of the main themes of the book.
For example, in the book "The Adventures of Sherlock Holmes," common
words like "Sherlock," "Holmes," and "Adventure" might appear
prominently in the word cloud.


AMRUTHA VARSHINI - MBAA22006


SENTIMENT ANALYSIS
Sentiment analysis can be applied to the text to assess the overall
sentiment of the ebook. It determines whether the text has a positive,
negative, or neutral sentiment.
While not typically used for literature analysis, in this context, it can
provide a general sense of the emotional tone of the stories.

Using "analyze_sentiment", we can see that the tone/sentiment of the text is Positive.

Further, to see the most used words other than stop words (commonly used words like pronouns, conjunctions, and prepositions), the stopwords package was used and the following 10 most-used words were listed.
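The report's analyze_sentiment helper is not shown; this toy lexicon-based sketch only illustrates the idea. The POSITIVE/NEGATIVE word lists and the STOPWORDS set below are tiny illustrative samples, not the NLTK stopwords package the report actually uses.

```python
from collections import Counter

POSITIVE = {"good", "great", "capital", "remarkable", "love"}
NEGATIVE = {"bad", "terrible", "crime", "fear", "hopeless"}
STOPWORDS = {"the", "a", "an", "is", "it", "to", "of", "and", "in"}

def analyze_sentiment(words):
    # net count of positive vs negative words decides the overall tone
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

def top_words(words, k=10):
    # most used words other than stop words
    return Counter(w for w in words if w not in STOPWORDS).most_common(k)

words = "it is a remarkable case and a great story the story is good".split()
print(analyze_sentiment(words))   # Positive
print(top_words(words, 3))
```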



TOPIC MODELING
Topic modelling was done using CountVectorizer and Latent Dirichlet Allocation (LDA). LDA helps identify the primary topics or subjects discussed in the text; the number of topics was set to 5.

The top words under each of the topics are found to be as follows



NAMED ENTITY
RECOGNITION (NER)
Named Entity Recognition (NER) is applied to the text using spaCy.
NER identifies and extracts entities such as names of characters, locations,
and organizations.

In the context of the book, this section would extract and categorize
entities like "Sherlock Holmes," "221B Baker Street," and other character
names and locations mentioned in the stories.

Named Entity Relationship Graph


NIHALI SAWANT - MBAA22040


CO-OCCURRENCE NETWORK

LINGUISTIC ANALYSIS
Number of sentences: 7
Passive voice sentence:
He was still, as ever, deeply attracted by the study of crime, and occupied his immense faculties and extraordinary powers of observation in following out those clues, and clearing up those mysteries which had been abandoned as hopeless by the official police.
Average sentence complexity: 9.571428571428571
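The checks behind these numbers are not shown in the report; the following is a simple standard-library heuristic sketch. A sentence is flagged as passive when a form of "to be" is followed (optionally via an adverb) by a word ending in -ed/-en, and "complexity" is taken as average words per sentence. Real passive detection needs part-of-speech tagging; this is only a toy.

```python
import re

def passive_sentences(sentences):
    # be-verb, optional -ly adverb, then a word ending in -ed or -en
    pattern = re.compile(r"\b(is|are|was|were|been|being|be)\b"
                         r"(\s+\w+ly)?\s+\w+(ed|en)\b")
    return [s for s in sentences if pattern.search(s.lower())]

def avg_complexity(sentences):
    # average number of words per sentence
    return sum(len(s.split()) for s in sentences) / len(sentences)

sents = [
    "He was deeply attracted by the study of crime.",
    "Holmes solved the case.",
]
print(passive_sentences(sents))
print(avg_complexity(sents))   # (9 + 4) / 2 = 6.5
```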



GEOSPATIAL ANALYSIS
The code performs geospatial analysis to extract and geocode geographical
locations mentioned in the text. Locations mentioned in the book are
displayed on a map.

This step can help visualize the various places where the adventures take
place in the book.

SUMMARY
In summary, the code is designed to perform a wide range of text analysis tasks on
"The Adventures of Sherlock Holmes" text file ('final.txt'). It extracts valuable
information about the content, structure, and sentiment of the book, making it a
versatile tool for gaining insights into the text and its themes.

