NLP Module 3
LANGUAGE MODELLING
WORD EMBEDDING IN NLP
Used for representing words for text analysis in
the form of real-valued vectors.
It is defined as a numeric vector input that
allows words with similar meanings to have a
similar representation.
It approximates meaning and represents a word
in a lower-dimensional space.
These can be trained much faster than the
hand-built models that use graph embeddings
like WordNet.
GETTING FAMILIAR WITH THE TERMINOLOGY
Document
A document is a single text data point. For
Example, a review of a particular product by the
user.
Corpus
It is a collection of all the documents present in our
dataset.
Feature
Every unique word in the corpus is considered as a
feature.
FOR EXAMPLE
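As a small, hypothetical illustration of these terms in Python (the two reviews below are made up): each string is a document, the list is the corpus, and every unique word in the corpus is a feature.

# hypothetical corpus of two user reviews
corpus = [
    "Great phone, love the camera.",   # document 1: one user review
    "Battery life is great.",          # document 2: another user review
]
# every unique word in the corpus is a feature
features = sorted({word.strip(".,").lower() for doc in corpus for word in doc.split()})
print(features)
# ['battery', 'camera', 'great', 'is', 'life', 'love', 'phone', 'the']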
1. FREQUENCY-BASED OR STATISTICAL WORD EMBEDDING
Methods:
Label Encoding
One-Hot Encoding (OHE)
Count Vector
TF-IDF Vectorization
WHAT IS LABEL ENCODING?
Label Encoding is a popular encoding technique for handling
categorical variables: each label is assigned a unique integer,
typically based on alphabetical ordering.
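A minimal sketch of label encoding with scikit-learn (assuming scikit-learn is available); each unique word is mapped to an integer id based on alphabetical ordering of the labels.

from sklearn.preprocessing import LabelEncoder

words = ["he", "she", "lazy", "boy", "lazy"]
encoder = LabelEncoder()
ids = encoder.fit_transform(words)
print(list(encoder.classes_))  # ['boy', 'he', 'lazy', 'she']
print(ids)                     # [1 3 2 0 2]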
COUNT VECTOR
Let us understand this using a simple example.
D1: He is a lazy boy. She is also lazy.
D2: Neeraj is a lazy person.
The dictionary created may be a list of unique tokens (words)
in the corpus = ['He', 'She', 'lazy', 'boy', 'Neeraj', 'person']
Here, D = 2 and N = 6, so the count matrix is of size D X N.
The count matrix M of size 2 X 6 will be represented as:

        He  She  lazy  boy  Neeraj  person
D1       1    1     2    1       0       0
D2       0    0     1    0       1       1

Here, the rows correspond to the documents in the corpus and the columns
correspond to the tokens in the dictionary.
Now, a column can also be understood as the word vector for the corresponding
word in the matrix M. For example, the word vector for 'lazy' in the above
matrix is [2, 1], and so on.
The second row in the above matrix may be read as: D2 contains 'lazy' once,
'Neeraj' once and 'person' once.
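The same count matrix can be produced with scikit-learn's CountVectorizer; a minimal sketch, assuming scikit-learn is installed and fixing the vocabulary to the six-token dictionary above so the output matches the matrix shown.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "He is a lazy boy. She is also lazy.",   # D1
    "Neeraj is a lazy person.",              # D2
]

# fix the vocabulary to the dictionary used above (CountVectorizer lowercases input)
vectorizer = CountVectorizer(vocabulary=["he", "she", "lazy", "boy", "neeraj", "person"])
M = vectorizer.fit_transform(docs).toarray()

print(M)
# [[1 1 2 1 0 0]
#  [0 0 1 0 1 1]]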
As another example, consider the following text:
My name is XYZ. Firstly, I completed my B.E. in 2019 from
Gujarat Technological University. I like playing cricket and
reading books. Also, I am from Amreli, which is located in Gujarat.
Each sentence here would be represented in the same way, as a row of
word counts over the dictionary of unique tokens.
Problem: simple counts treat every occurrence equally, so common words
such as 'is', 'the' and 'a' dominate the representation even though they
tell us little about a particular document. The next method, TF-IDF,
addresses this.
TF-IDF VECTORIZATION
This is another frequency-based method, but it differs from
count vectorization in that it takes into account not just the
occurrence of a word in a single document but in the entire
corpus. So, what is the rationale behind this? Let us try to
understand.
Common words like ‘is’, ‘the’, ‘a’ etc. tend to appear
quite frequently in comparison to the words which
are important to a document.
For example, a document A on Lionel Messi is going
to contain more occurrences of the word “Messi” in
comparison to other documents. But common words
like “the” etc. are also going to be present in higher
frequency in almost every document.
Ideally, what we would want is to down weight
the common words occurring in almost all
documents and give more importance to words
that appear in a subset of documents.
TF-IDF works by penalizing these common words
by assigning them lower weights while giving
importance to words like Messi in a particular
document.
So, how exactly does TF-IDF work?
Consider a sample table which gives the count of terms (tokens/words)
in two documents. From the calculations below, Document 1 contains 8
terms in total ('This' appearing once and 'Messi' four times), while
Document 2 contains 5 terms ('This' appearing once and 'Messi' not at all).
Now, let us define a few terms related to TF-IDF.
TF = (Number of times term t appears in a
document)/(Number of terms in the document)
So, TF(This, Document1) = 1/8, since Document 1 contains 1 + 1 + 2 + 4 = 8 terms in total.
TF(This, Document2) = 1/5, since Document 2 contains 1 + 2 + 1 + 1 = 5 terms in total.
TF measures how important a word is within a document; for example,
a document about Messi should contain the word 'Messi' a large number of times.
IDF = log(N/n), where N is the number of
documents and n is the number of documents in which a
term t has appeared. (The logarithm here is base 10.)
So, IDF(This) = log(2/2) = 0.
So, how do we explain the reasoning behind IDF?
Ideally, if a word has appeared in all the documents,
then probably that word is not relevant to a
particular document. But if it has appeared in only a
subset of documents, then probably the word is of
some relevance to the documents it is present in.
Let us compute IDF for the word ‘Messi’.
IDF(Messi) = log(2/1) = 0.301.
Now, let us compare the TF-IDF for a common
word ‘This’ and a word ‘Messi’ which seems to be
of relevance to Document 1.
TF-IDF(This,Document1) = (1/8) * (0) = 0
TF-IDF(This, Document2) = (1/5) * (0) = 0
TF-IDF(Messi, Document1) = (4/8)*0.301 = 0.15
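A small sketch that reproduces the arithmetic above using the slide's definitions (TF = count / number of terms in the document, IDF = log10(N/n)); the per-document term counts are taken from the worked numbers, everything else is assumed for illustration.

import math

# term counts recoverable from the worked example above
counts = {
    "Document1": {"This": 1, "Messi": 4},  # Document 1 has 8 terms in total
    "Document2": {"This": 1, "Messi": 0},  # Document 2 has 5 terms in total
}
doc_lengths = {"Document1": 8, "Document2": 5}
N = len(counts)  # number of documents

def tf(term, doc):
    return counts[doc].get(term, 0) / doc_lengths[doc]

def idf(term):
    n = sum(1 for doc in counts if counts[doc].get(term, 0) > 0)
    return math.log10(N / n)

print(tf("This", "Document1") * idf("This"))    # (1/8) * 0     = 0.0
print(tf("Messi", "Document1") * idf("Messi"))  # (4/8) * 0.301 ≈ 0.15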
2. PREDICTION-BASED WORD EMBEDDING
Word2Vec
Skip Gram
CBOW
Word2Vec:
• Word2Vec is a popular natural language processing (NLP) technique for
generating word embeddings, which are vector representations of words in
a high-dimensional space.
• The fundamental idea behind Word2Vec is to learn dense vector
representations of words based on their context in a large corpus of text.
• There are two main approaches to Word2Vec: Continuous Bag of Words
(CBOW) and Skip-gram.
2. Skip-gram:
1. Skip-gram predicts the context words based on the current (centre) word.
2. The model tries to minimize the difference between the predicted
context words and the actual context words for a given word.
SKIP GRAM MODEL
The skip-gram model is a method for learning
word embeddings, which are continuous,
dense, and low-dimensional representations of
words in a vocabulary.
For example, consider the sentence: The dog fetched the ball.
ARCHITECTURE OF SKIP-GRAM MODEL
The architecture of the skip-gram model consists
of an input layer, an output layer, and a
hidden layer.
The input layer takes the target (centre) word, and
the output layer predicts the context words.
The hidden layer represents the embedding of the
input word learned during training.
The skip-gram model uses a feedforward neural
network with a single hidden layer, as shown in
the diagram below:
Input Layer --> Hidden Layer --> Output Layer
We can use TensorFlow with the Keras API to build and train
the model. A skip-gram generator can create training data for
the model in pairs of words (the target word and the context
word) and labels indicating whether the context word appears
within a fixed window size of the target word in the input text.
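A minimal sketch along these lines (not the original slides' code; the toy sentence, window size and embedding size are illustrative), using tf.keras.preprocessing.sequence.skipgrams to generate (target, context) pairs and a small two-input Keras model:

import tensorflow as tf
from tensorflow.keras import layers

text = "the dog fetched the ball"
tokens = text.split()
word2id = {w: i + 1 for i, w in enumerate(sorted(set(tokens)))}  # 0 reserved for padding
sequence = [word2id[w] for w in tokens]
vocab_size = len(word2id) + 1

# (target, context) pairs; label 1 = context word within the window, 0 = negative sample
pairs, labels = tf.keras.preprocessing.sequence.skipgrams(
    sequence, vocabulary_size=vocab_size, window_size=2, negative_samples=1.0)
targets = tf.constant([[p[0]] for p in pairs])
contexts = tf.constant([[p[1]] for p in pairs])
labels = tf.constant(labels, dtype=tf.float32)

embedding_dim = 8
target_in = layers.Input(shape=(1,))
context_in = layers.Input(shape=(1,))
embed = layers.Embedding(vocab_size, embedding_dim)          # shared embedding table
dot = layers.Dot(axes=-1)([layers.Flatten()(embed(target_in)),
                           layers.Flatten()(embed(context_in))])
out = layers.Dense(1, activation="sigmoid")(dot)

model = tf.keras.Model([target_in, context_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit([targets, contexts], labels, epochs=5, verbose=0)

word_vectors = embed.get_weights()[0]  # one learned embedding row per word id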
For example, in the sentence "The cat sat on the mat", the word 'sat'
will be given and we'll try to predict the words 'cat' and 'mat' at
positions -1 and +3 respectively, given that 'sat' is at position 0.
We do not predict common or stop words such as 'the'.
As we can see, w(t) is the target word given as input.
There is one hidden layer, which performs
the dot product between the weight matrix and the
input vector w(t).
PROBABILITY FUNCTION
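The probability the skip-gram model assigns to a context word given the centre word is the usual softmax over the vocabulary (this is the standard formulation, shown here since the slide's formula is not reproduced):

P(wO | wI) = exp(v'(wO) · v(wI)) / sum over w = 1..V of exp(v'(w) · v(wI))

where v(wI) is the input (centre-word) vector, v'(wO) is the output vector of the context word, and V is the vocabulary size.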
CBOW(CONTINUOUS BAG-OF-WORDS )
The continuous bag-of-words (CBOW) model is a
neural network for natural language processing tasks
such as language translation and text classification.
It predicts a target word based on the context of
the surrounding words and is trained on a large
dataset of text using an optimization algorithm such as
stochastic gradient descent.
Once trained, the CBOW model generates
numerical vectors, known as word embeddings,
which capture the semantics of words in a continuous
vector space and can be used in various NLP tasks.
It is often combined with other techniques and models,
such as the skip-gram model, and can be implemented
using libraries like Gensim in Python.
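A minimal sketch of this with Gensim (assuming the gensim package is installed; the tiny toy corpus and parameters are purely illustrative; sg=0 selects CBOW, sg=1 would select skip-gram):

from gensim.models import Word2Vec

sentences = [
    ["he", "is", "a", "lazy", "boy"],
    ["neeraj", "is", "a", "lazy", "person"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # CBOW

print(model.wv["lazy"][:5])           # first few dimensions of the word embedding
print(model.wv.most_similar("lazy"))  # nearest words in the embedding space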
The way CBOW works is that it predicts the
probability of a word given a context. A
context may be a single word or a group of words.
But for simplicity, let us take a single context
word and try to predict a single target word.
Suppose the vocabulary size is V = 10; a context word is then
one-hot encoded as a vector such as:
0 0 0 1 0 0 0 0 0 0
This one-hot representation is fed into a shallow neural
network with three layers: an input layer, a hidden layer
and an output layer.
The input layer and the target are both one-hot
encoded vectors of size [1 X V]. Here V = 10, as in the
above example.
In the multi-word case, the model takes 3 context words and
predicts the probability of a target word. The input can be
thought of as three one-hot encoded vectors in the input layer.
The steps remain the same, only the calculation of
hidden activation changes. Instead of just copying the
corresponding rows of the input-hidden weight
matrix to the hidden layer, an average is taken over
all the corresponding rows of the matrix.
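A small numpy sketch of this hidden-layer computation (the sizes are made up: V = 10 vocabulary entries, N = 4 hidden units, three context words):

import numpy as np

V, N = 10, 4
W_in = np.random.rand(V, N)     # input-hidden weight matrix
context_ids = [3, 5, 7]         # indices of the three one-hot context words

h_single = W_in[3]                        # one context word: copy the corresponding row
h_multi = W_in[context_ids].mean(axis=0)  # several context words: average the rows

print(h_single.shape, h_multi.shape)      # (4,) (4,)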
Neural Networks
In deep learning, all problems are generally classified into two types:
▪ Fixed topological structure: static data such as images, with use
cases such as image classification, handled by convolutional neural
networks (CNNs).
▪ Sequentially dependent data: sequences such as text, speech and time
series, handled by recurrent neural networks (RNNs).
Neural Networks
Chatbots
Sequential pattern identification
Image/handwriting detection
Video and audio classification
Sentiment analysis
Time series modeling in finance
Recurrent Neural Networks
RNNs have varied sets of use cases and can implement a set of multiple
smaller programs,
with each painting a separate picture on its own and
all learning in parallel,
to finally reveal the intricate effect of the collaboration of all such
small programs.
Recurrent Neural Networks-Applications
Text mining and Sentiment analysis can be carried out using an RNN
for Natural Language Processing (NLP).
Differences Between Feedforward and
Recurrent Neural Networks
The main limitation of a feedforward neural network is that it considers
only the current input and has no memory of earlier inputs, so it cannot
handle sequential data. An RNN, in contrast, takes decisions based on the
current and previous inputs and makes sure that the connections are built
across the hidden layers as well.
RNN
Recurrent Neural Network (RNN) is a type of neural
network where the output from the previous step is fed as input
to the current step.
The current state is calculated from the previous state and the current input:
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
How RNN works
The output is then calculated from the current state using the
output-layer weights:
yt = Why · ht
where:
yt -> output
Why -> weight at the output layer
Training through RNN
❖ A single time step of the input is provided to the network.
❖ Its current state is then calculated using the current input and the
previous state.
❖ The current ht becomes ht-1 for the next time step.
❖ One can go through as many time steps as the problem requires and join
the information from all the previous states.
❖ Once all the time steps are completed, the final current state is
used to calculate the output.
❖ The output is then compared to the actual output, i.e. the target
output, and the error is generated.
❖ The error is then back-propagated through the network to update the
weights, and hence the network (RNN) is trained.
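A minimal numpy sketch of the forward pass described by these steps (the sizes are arbitrary; back-propagation through time is left out for brevity):

import numpy as np

input_size, hidden_size, output_size, T = 3, 5, 2, 4
W_xh = 0.1 * np.random.randn(hidden_size, input_size)   # input  -> hidden weights
W_hh = 0.1 * np.random.randn(hidden_size, hidden_size)  # hidden -> hidden weights
W_hy = 0.1 * np.random.randn(output_size, hidden_size)  # hidden -> output weights

xs = [np.random.randn(input_size) for _ in range(T)]    # one input per time step
h = np.zeros(hidden_size)                               # initial state

for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h)   # current state from current input and previous state

y = W_hy @ h                             # output from the final state
target = np.zeros(output_size)
error = ((y - target) ** 2).sum()        # error against the target output
# this error would then be back-propagated through time to update the weights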
Basics of RNN - Activation Functions
Nonlinear activation functions: these are the most commonly used ones,
and they restrict the output to some range:
❖ Softmax
❖ Tanh
❖ Sigmoid: computationally expensive, causes the vanishing gradient
problem, and is not zero-centered
Basics of Recurrent Neural Networks
(Figure: RNN model architecture to compute the number of 1s in a
20-length sequence of binary digits.)
Types of Recurrent Neural Networks
1. One to One (a single input mapped to a single output, e.g. image classification)
2. One to Many (a single input mapped to a sequence, e.g. image captioning)
3. Many to One (a sequence mapped to a single output, e.g. sentiment analysis)
4. Many to Many (a sequence mapped to a sequence, e.g. machine translation)
Two Issues of Standard RNNs
❖ Vanishing Gradient Problem
This problem arises when the gradients shrink as they are propagated back
through the time steps, so the weights of the earlier steps receive almost
no updates and long-range dependencies become hard to learn.
❖ Exploding Gradient Problem
This problem arises when large error gradients accumulate, resulting in very large
updates to the neural network model weights during the training process.
Solution to Gradient Problem
Exploding gradients are commonly handled by clipping the gradients during
training, while the standard remedy for vanishing gradients is to use gated
architectures such as the Long Short-Term Memory (LSTM) network, introduced next.
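As one concrete example (not necessarily the slides' own list of fixes), gradient clipping can be enabled directly on a Keras optimizer, assuming TensorFlow/Keras:

import tensorflow as tf

# clipnorm caps the norm of each gradient, preventing very large weight updates
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)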
LSTMs in RNN
LSTMs also have a chain-like structure, but the repeating module has a
slightly different structure: instead of a single neural network layer,
there are four interacting layers communicating in a special way.
Workings of LSTMs in RNN
Step 1: Decide How Much Past Data It Should Remember
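For reference, this step corresponds to the forget gate of the standard LSTM formulation (standard notation, not necessarily the slide's):

ft = sigmoid(Wf · [ht-1, xt] + bf)

where ht-1 is the previous hidden state, xt is the current input, and ft decides how much of the old cell state to keep.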