Reference Material For NLP - 1
Text Similarity
Cosine Similarity
Cosine similarity tells us whether two embeddings point in roughly the same direction or not. When the embeddings point in the same direction the angle between them is zero, so their cosine similarity is 1; when they are orthogonal the angle is 90 degrees and the cosine similarity is 0; finally, when the angle between them is 180 degrees the cosine similarity is -1.
Cosine Similarity therefore ranges from 1 to -1, where 1 represents the most similar pair and -1 the least similar. It is computed as:
cos(A, B) = (A · B) / (||A|| ||B||)
Consider the two texts:
Text A: Julie loves me more than Linda loves me
Text B: Jane likes me more than Julie loves me
We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts:
Now we count the number of times each of these words appears in each text:
Word    Text A  Text B
me      2       2
Jane    0       1
Julie   1       1
Linda   1       0
likes   0       1
loves   2       1
more    1       1
than    1       1
A: [2, 0, 1, 1, 0, 2, 1, 1]
B: [2, 1, 1, 0, 1, 1, 1, 1]
The cosine of the angle between them is approximately 0.822, obtained by applying the formula for cosine similarity given above.
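Here is a minimal Python sketch (not part of the original reference) that applies the dot-product formula to the two count vectors above:

import math

def cosine_similarity(a, b):
    # dot product of the two count vectors
    dot = sum(x * y for x, y in zip(a, b))
    # magnitudes (Euclidean norms) of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A = [2, 0, 1, 1, 0, 2, 1, 1]
B = [2, 1, 1, 0, 1, 1, 1, 1]
print(round(cosine_similarity(A, B), 3))  # 0.822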
Jaccard Similarity, also called the Jaccard Index or Jaccard Coefficient, is a simple measure of the similarity between data samples. The similarity is computed as the ratio of the size of the intersection of the data samples to the size of the union of the data samples.
It is represented as:
J(A, B) = |A ∩ B| / |A ∪ B|
It is used to find the similarity or overlap between two binary vectors, numeric vectors, or strings, and is commonly denoted J. There is also a closely related term, Jaccard Dissimilarity or Jaccard Distance, which is a measure of dissimilarity between data samples and is given by (1 – J), where J is the Jaccard Similarity.
For example:
d1 = [1 3 2]
d2 = [5 0 3]
In this case, d1 ∩ d2 is [3] and d1 ∪ d2 is [1 3 2 5 0], so J(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2| = 1/5 = 0.2.
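The following short Python sketch (added for illustration) treats the two lists as sets and applies the ratio above:

def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B| over the unique elements of each sample
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

d1 = [1, 3, 2]
d2 = [5, 0, 3]
print(jaccard_similarity(d1, d2))  # 0.2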
Note: Cosine similarity measures the similarity between two vectors, while Jaccard similarity is normally used to measure the similarity between two sets or two binary vectors.
TF-IDF:
Plain word counts treat every term as equally important, even when words such as "the" appear in every document; TF-IDF comes into play at this stage, to solve this problem. The term TF stands for term frequency, and the term IDF stands for inverse document frequency. Let us look at each component separately:
TF(w, d) measures how frequently the word w appears in the document d, where:
— w is the word in the document.
— d is the document.
TF-IDF Score
The TF-IDF score of a word in a document is the product of its TF and IDF scores: TF captures how frequent the word is within the document, while IDF down-weights words that are common across the entire corpus. Terms with higher TF-IDF scores are considered more important to the document.
As a worked example, consider a query Q against a small collection of documents D1, D2, and D3:
● Q: The cat.
There are several ways of calculating TF, with the simplest being a raw count of how many times a word appears in a document. Here we compute TF scores as the ratio of the count of instances over the length of the document:
TF(word, document) = (number of occurrences of the word in the document) / (number of words in the document)
Let’s compute the TF scores of the words “the” and “cat” (i.e. the query terms) in each document using the formula above. The TF score reflects the frequency of a word within a document: the higher the score, the more relevant that word is in that particular document.
IDF(word) = log(number of documents in the corpus / number of documents containing the word)
Let’s compute the TF-IDF scores of the words “the” and “cat”.
TF-IDF(“the”, D1) = 0.33 * 0 = 0
TF-IDF(“cat”, D3) = 0 * 0 = 0
We can use the average TF-IDF word scores over each document to get an overall relevance score of that document for the query.
Average TF-IDF of D3 = (0 + 0) / 2 = 0
Looks like the word “the” does not contribute to the TF-IDF scores of any document, since it appears in every document and its IDF is 0. When ranking the collection of documents D1, D2, and D3 for this query, the ordering is therefore driven entirely by each document’s score for “cat”.
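To make the mechanics concrete, here is a small self-contained Python sketch of these TF and IDF definitions; the three documents below are invented placeholders, not the original D1, D2, and D3:

import math

docs = {
    "D1": "the dog sat",          # placeholder texts
    "D2": "the dog ran home",
    "D3": "the bird sang",
}

def tf(word, document):
    # number of occurrences of the word / number of words in the document
    words = document.split()
    return words.count(word) / len(words)

def idf(word, documents):
    # log(number of documents / number of documents containing the word)
    containing = sum(1 for text in documents.values() if word in text.split())
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(word, doc_id, documents):
    return tf(word, documents[doc_id]) * idf(word, documents)

for word in ("the", "dog"):
    for doc_id in docs:
        print(word, doc_id, round(tf_idf(word, doc_id, docs), 3))

Here "the" appears in every placeholder document, so its IDF (and hence its TF-IDF) is 0 everywhere, mirroring the observation above.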
Consider now a small collection of numbered sentences, such as:
1. I am a cow.
3. Today is Tuesday.
Now, if I ask you a question — can you tell me which sentences contain the word ‘cow’ but not the word ‘tuesday’? You can easily answer: sentence 1 and sentence 2.
But how can the same problem be solved by a machine? The query can be expressed as a Boolean combination of terms, that is, one in which terms are combined with the operators AND, OR, and NOT.
For each term we will get the term vector, which is basically the values (0 or 1) indicating whether that term occurs in each sentence.
Let’s apply the algorithm and see if we get the right answer.
The term vectors are combined according to the operators in the input query. Hence, sentences one and two contain the word ‘cow’ but not ‘tuesday’ and will be returned as the result for the query.
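A tiny Python sketch of this idea follows; the 0/1 term vectors are assumed from the example (sentences 1 and 2 contain ‘cow’, only sentence 3 contains ‘tuesday’):

# incidence vectors over sentences 1..3
cow = [1, 1, 0]        # 'cow' occurs in sentences 1 and 2
tuesday = [0, 0, 1]    # 'tuesday' occurs only in sentence 3

# query: cow AND NOT tuesday, evaluated element-wise on the term vectors
result = [c & (1 - t) for c, t in zip(cow, tuesday)]

# report the matching sentence numbers
print([i + 1 for i, hit in enumerate(result) if hit])  # [1, 2]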
Inverted index
In this method, a vector is formed where each document is given a document ID and the terms act as pointers. The list of terms is then sorted in alphabetical order.
For example:
Formation of vector
Finally, an inverted index structure is created: an array-like structure is formed containing, for each term, the list of document IDs in which that term appears.
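A minimal Python sketch of building such a structure (the three documents are invented placeholders):

documents = {
    1: "i am a cow",
    2: "the cow gives milk",
    3: "today is tuesday",
}

# map each term to the set of document IDs that contain it
inverted_index = {}
for doc_id, text in documents.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(doc_id)

# print the terms in alphabetical order with their postings lists
for term in sorted(inverted_index):
    print(term, sorted(inverted_index[term]))

# answering "cow AND NOT tuesday" directly from the index
hits = inverted_index["cow"] - inverted_index.get("tuesday", set())
print(sorted(hits))  # [1, 2]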
N-grams
An N-gram is a contiguous sequence of ’n’ items drawn from a text in NLP. These items can be characters, words, or even syllables, depending on the granularity desired. The value of ’n’ determines the order of the N-gram.
Examples:
● Unigrams (1-grams): Single words, e.g., “cat,” “dog.”
● Bigrams (2-grams): Pairs of consecutive words, e.g., “deep learning.”
● Trigrams (3-grams): Triplets of consecutive words, e.g., “machine learning models.”
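A short Python sketch (added for illustration) that extracts word-level N-grams of any order from a sentence:

def ngrams(text, n):
    # slide a window of length n over the list of words
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I love natural language processing"
print(ngrams(sentence, 1))  # unigrams: ['I', 'love', ...]
print(ngrams(sentence, 2))  # bigrams: ['I love', 'love natural', ...]
print(ngrams(sentence, 3))  # trigrams: ['I love natural', ...]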
Applications of N-grams:
1. Language Modeling:
● N-grams are used to model the probability of word sequences in a language.
4. Information Retrieval:
● In information retrieval tasks, N-grams assist in matching and ranking documents against query terms.
5. Feature Extraction:
● N-grams serve as powerful features in text classification and sentiment analysis.
1. Speech Recognition:
● N-grams play a crucial role in modeling and recognizing spoken language in speech recognition systems.
2. Machine Translation:
● In machine translation, N-grams contribute to understanding and generating natural-sounding phrases, improving translation quality.
3. Text Prediction:
● N-gram language models are used to suggest the next word based on the context of the input sequence.
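As an illustration of that last point, here is a toy bigram-based next-word suggester in Python; the training text is a made-up placeholder:

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran to the door".split()  # placeholder text

# count which word follows which (bigram counts)
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def suggest(word):
    # return the word most often seen immediately after the given word
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(suggest("the"))  # 'cat', since 'cat' follows 'the' twice in the corpus
print(suggest("cat"))  # 'sat' ('sat' and 'ran' are tied; the first seen wins)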
In this way, the Bag of Words (BoW) method enables a social media platform to represent each piece of user feedback as a simple vector of word counts that can then be compared and analyzed.
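A compact Python sketch of the BoW idea (the two feedback snippets are invented placeholders):

from collections import Counter

feedback = [
    "great app love it",
    "app keeps crashing hate it",
]

# vocabulary: every distinct word across the texts, in a fixed order
vocab = sorted({word for text in feedback for word in text.split()})

# each text becomes a vector of word counts over that vocabulary
for text in feedback:
    counts = Counter(text.split())
    print([counts[word] for word in vocab])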
POS (Part-of-Speech) Tagging
POS tagging assigns a grammatical category to each word in a sentence. Common POS tags include:
● Noun (NN)
● Verb (VB)
● Adjective (JJ)
● Adverb (RB)
● Pronoun (PRP)
● Preposition (IN)
● Conjunction (CC)
● Determiner (DT)
● Interjection (UH)
————————————————————————
1. Syntactic Parsing:
● Application: Analyzing the grammatical structure of sentences.
● Role of POS Tagging: POS tags provide information about the syntactic role of each word, which guides the parser.
3. Information Retrieval:
● Application: Retrieving relevant documents or information in response to a query.
5. Machine Translation:
● Application: Translating text while respecting the grammatical structure of the source language.
6. Sentiment Analysis:
● Application: Determining the sentiment expressed in a piece of text, such as the opinion or attitude of its author.
8. Text-to-Speech Synthesis:
9. Speech Recognition:
10. Grammar Checking:
● Application: Detecting and correcting grammatical errors in written text.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = "NLTK is a leading platform for building Python programs to work with human language data."
print(nltk.pos_tag(nltk.word_tokenize(text)))
Output :
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'),
('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'),
('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]