How do Google Home, Siri and
Alexa understand me?
Natural Language Processing
Getting Started
Resources:
• [Link] um21/publication/secondary/Class10_Facilitator_
• [Link]
• [Link] um20/AI_Curriculum_Handbook.pdf
Agenda:
• NLP Concept
• How does NLP work?
Activity
Solve the puzzle
How many independent nations are there in the Asian continent?
There are 48 independent nations in Asia.
• How do we understand what others say or write?
• How does a computer understand what we say in our language?
• Let us experience it with the help of this AI Game:
Identify the mystery animal- it’s a voice experiment guessing game:
[Link]
[Link]
Mystery Animal
• The machine acts as an animal that it has picked at random, and the player gets 20 chances to guess that animal.
• The player can ask up to 20 yes/no questions, and the machine answers each with yes or no. The machine interprets the meaning of the questions with the help of NLP and answers accordingly.
20Q
• 20Q will read your mind by asking a few simple questions.
• The object you think of should be something that most people would know about, but not a proper noun or a specific person, place, or thing.
Activity 1
Ask Questions from students – Mystery Animal
• Were you able to guess the animal?
• If yes, in how many questions were you able to guess it? (students can make a table for tries and
number of questions)
• If no, how many times did you try playing this game?
• What according to you was the task of the machine?
• Were there any challenges that you faced while playing this game? If yes, list them down.
• What approach must one follow to win this game?
• If you play for a long time does the performance change?
• If you ask irrelevant questions (like "How is the weather?", i.e. adding noise), what happens to the performance?
• Your observation (any other)
Ask Questions from students – 20 Q
• Was the app able to guess the object?
• If yes, in how many questions was it able to guess?
• If no, how many times did you try playing this game?
• What according to you was the task of the machine?
• Were there any challenges that you faced while playing this game? If yes, list them down.
• If you play for a long time does the performance change?
• If you answer incorrectly, what will be the performance?
• Your observation (it shows responses / training contradictions detected)
Natural Language Processing
• Natural Language Processing, abbreviated as NLP, is a branch of Artificial Intelligence that
deals with the interaction between computers and humans using natural language.
Natural language refers to language that is spoken and written by people, and natural
language processing (NLP) attempts to extract information from the spoken and written
word using algorithms.
• The ultimate objective of NLP is to read, decipher, understand, and make sense of human
languages in a manner that is valuable. Example: a spam vs. ham (non-spam) email filter.
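As a toy illustration of the spam/ham idea, here is a sketch using a made-up keyword list; real spam filters learn such signals from data rather than using fixed keywords:

```python
# Toy spam/ham filter: flags a message as spam if it contains any
# word from a hypothetical keyword list. Real filters learn these
# signals from training data; this only illustrates the idea.
SPAM_WORDS = {"lottery", "winner", "free", "prize", "urgent"}

def classify(message):
    words = set(message.lower().split())
    return "spam" if words & SPAM_WORDS else "ham"

print(classify("You are the lucky winner of a free prize"))  # spam
print(classify("Meeting rescheduled to 3 pm tomorrow"))      # ham
```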
STEPS for any AI model
Problem Scoping
• To understand the problem to be solved and its business context
Data Acquisition
• To understand the action implied by a statement, we need to collect statement data so the machine can interpret the words/text people use and understand their meaning. Such data can be collected through various means:
1. Statements written or spoken by people
2. Databases available on the internet, etc.
Data Exploration
• Once the textual data has been collected, it needs to be processed and cleaned so that a simpler version can be sent to the machine. The text is therefore normalised through various steps and reduced to a minimal vocabulary, since the machine does not require grammatically correct statements but only their essence.
Modelling
• Once the text has been normalised, it is fed to an NLP-based AI model. Note that in NLP, the data must be pre-processed before it is fed to the machine.
Evaluation
• The trained model is then evaluated, and its accuracy is measured on the basis of the relevance of the answers the machine gives to the user's inputs. To understand the efficiency of the model, the answers suggested by the chatbot are compared with the actual answers; if they match accurately, the model is deployed.
• Mitsuku Bot [Link]
Mitsuku is an emotionally intelligent chatbot that converses with users in a very human way, with humour, empathy and even a little sass.
• CleverBot [Link]
At the annual event "The Loebner Prize", A.I. specialists from around the world pit their bots against a panel of judges, and the most human-like bot is the winner.
• Jabberwacky [Link]
Used for Cognitive Behaviour Therapy to understand the behaviour and mindset of people; therapists use it to treat patients.
• Haptik [Link]
Chatbots reduce customer-service calls, give responses quickly, and increase sales. Chatbots are used in banking, hospitality, education, food delivery, etc.
• Rose [Link] / [Link]
• Ochatbot [Link]
Activity 2
Discussion
• Which chatbot did you try? Name any one.
• What is the purpose of the chatbot?
• How was the interaction with the chatbot?
• Did the chat feel like talking to a human or a robot? Why do you think so?
• Do you feel that the chatbot has a certain personality? (sports loving/ news loving etc)
• Were the responses the same or different when you asked the same questions?
Conclusion
As you interact with more and more chatbots, you will realise that some of them are
scripted, or in other words traditional chatbots, while others are AI-powered and
have more knowledge. From this experience, we can understand that there are
2 types of chatbots around us: script-bots and smart-bots. Let us understand what
each of them means in detail:
Rule-based/Traditional Model vs AI Model
Example: Script-bot: Inklet, Story Speaker [Link]
Example: Smart-bot: Alexa, Cortana, Siri, Google Assistant [Link]
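A script-bot can be sketched in a few lines of Python; the phrases and replies below are hypothetical, chosen only to show how a fixed script differs from an NLP-driven smart-bot:

```python
# Minimal script-bot: replies come from a fixed script keyed on exact
# phrases, so anything outside the script falls through to a default
# reply. Smart-bots instead use NLP to interpret the user's intent.
SCRIPT = {
    "hi": "Hello! How can I help you?",
    "what are your timings": "We are open 9 am to 5 pm.",
    "bye": "Goodbye!",
}

def script_bot(user_input):
    # Look the message up in the script; unknown inputs get a canned reply.
    return SCRIPT.get(user_input.lower().strip(),
                      "Sorry, I did not understand that.")

print(script_bot("Hi"))                      # Hello! How can I help you?
print(script_bot("Do you open on Sundays?")) # Sorry, I did not understand that.
```

Even a slight rephrasing of a scripted question defeats the bot, which is exactly the limitation NLP-based smart-bots try to overcome.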
Activity 3
NLP is used in
• Sentiment Analysis – finding whether a text leans towards a positive or negative sentiment, e.g. "I love the new iPhone" and, a few lines later, "But sometimes it doesn't work well". Used for sentiment about products (Amazon), movies (Netflix), and food/restaurants (Yelp).
• Text Classification – categorizing text into various categories, e.g. spam email/SMS, technology, sports, fashion.
• Document Summarization – compressing a paragraph/document into a few words or sentences; paraphrasing.
• Parts-of-Speech Tagging – labelling each word with its part of speech; used, for example, in text-to-speech conversion.
• Text Translation – [Link]
• Chatbot conversation – reduces customer-service calls, gives responses quickly, increases sales.
• Virtual Assistants – Google Assistant, Cortana, Siri, Alexa.
Human Language Vs Computer Language
• Humans communicate through language which we process all the time. Even in the classroom, as the teacher delivers the
session, our brain is continuously processing everything and storing it in some place. Also, while this is happening, when
your friend whispers something, the focus of your brain automatically shifts from the teacher’s speech to your friend’s
conversation. So now, the brain is processing both the sounds but is prioritizing the one on which our interest lies ☺.
• The sound reaches the brain through a long channel. As a person speaks, the sound travels from their mouth to the listener's eardrum. The sound striking the eardrum is converted into nerve impulses, which are transported to the brain and processed. After processing the signal, the brain works out its meaning. If it is clear, the signal gets stored; otherwise, the listener asks the speaker for clarification. This is how humans process human languages.
• Computers understand the language of numbers. Everything sent to the machine has to be converted into numbers, and if a single mistake is made, the computer throws an error and does not process that part.
• Now, if we want the machine to understand our language, how should this happen? What are the possible difficulties a
machine would face in processing natural language? Let us take a look at some of them :
Challenges in understanding Natural Language by Machine
* Arrangement of words and meaning – "I like bananas" is not the same as "bananas like I".
* Multiple meanings of a word – "date" (the fruit) vs. "date" (an appointment): "I will have a date."
* Idioms and ambiguity – "It's raining cats and dogs."
* Anaphora resolution – "Ritika is my friend; she loves to read." Who is "she"?
* Grammar and morphology – even Google Translate sometimes struggles to convert text perfectly from one language to another.
* Perfect syntax with no meaning – "Chickens feed extravagantly while the moon drinks tea."
Making the task difficult
[Link]
• Text is messy and unstructured, while ML prefers structured, well-defined,
fixed-length inputs.
• Using NLP (the Bag-of-Words technique), we can convert variable-length texts
into fixed-length vectors.
• ML works with numerical data rather than textual data.
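A minimal sketch of this idea in Python, using two made-up sentences: each text, whatever its length, becomes a vector of counts over one shared vocabulary.

```python
# Bag-of-Words turns variable-length texts into fixed-length vectors:
# every document is represented by word counts over a shared vocabulary.
texts = ["I love my dog", "my dog loves my dog food"]

# Shared vocabulary: every distinct lowercase word, in sorted order.
vocab = sorted({w for t in texts for w in t.lower().split()})

# One count vector per text; all vectors have the same length as vocab.
vectors = [[t.lower().split().count(w) for w in vocab] for t in texts]

print(vocab)    # ['dog', 'food', 'i', 'love', 'loves', 'my']
print(vectors)  # [[1, 0, 1, 1, 0, 1], [2, 1, 0, 0, 1, 2]]
```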
How does NLP work?
• Data Processing – convert our language into numbers using text normalization, since the computer understands numbers.
• Text Normalization – collect the text from all the documents, i.e. the corpus.
• Sentence Segmentation – the corpus is divided into sentences.
• Tokenization – each sentence is further divided into tokens.
• Removing unnecessary tokens – stop words, special characters, prepositions.
• Converting text to a common case (usually lower case).
• Stemming – reducing the remaining words to root words by stripping affixes such as "-ing" and "-ed"; the stem may not be a meaningful word.
• Lemmatization – like stemming, but the resulting root word (lemma) always has a meaning.
• Bag of Words (BoW) – count the occurrence/frequency of each word and construct the vocabulary for the corpus. Steps to implement are as follows:
• Create the dictionary (vocabulary)
• Create a document vector for each document (Term Frequency)
• Create document vectors for all documents
• Compute TF-IDF (Term Frequency – Inverse Document Frequency)
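The normalization steps above can be sketched in Python; the stop-word list and the suffix-stripping "stemmer" below are simplified stand-ins for what real NLP libraries (such as NLTK) provide:

```python
# Sketch of the text-normalization steps: sentence segmentation,
# tokenization, lower-casing, stop-word removal, and naive stemming.
import re

STOP_WORDS = {"and", "are", "to", "a", "is", "the"}

def normalize(corpus):
    # Sentence segmentation: split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", corpus.strip())
    result = []
    for s in sentences:
        tokens = re.findall(r"[a-zA-Z]+", s)                   # tokenization
        tokens = [t.lower() for t in tokens]                   # common case
        tokens = [t for t in tokens if t not in STOP_WORDS]    # stop words
        tokens = [re.sub(r"(ing|ed)$", "", t) for t in tokens]  # crude stemming
        result.append(tokens)
    return result

print(normalize("Aman and Anil are stressed. Aman went to a therapist."))
# [['aman', 'anil', 'stress'], ['aman', 'went', 'therapist']]
```

Note the crude stemmer keeps "therapist" intact, whereas the worked example below truncates it to "therap"; real stemmers apply many more rules.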
Unplugged activity - step by step approach to implement
Step 1: Collecting data and pre-processing it.
Document 1: Aman and Anil are stressed.
Document 2: Aman went to a therapist.
Document 3: Anil went to download a health chatbot.
Here are three documents having one sentence each.
Corpus: Aman and Anil are stressed. Aman went to a therapist. Anil went to download a health chatbot.
Corpus divided into sentence- Sentence Segmentation
Sentence 1: Aman and Anil are stressed.
Sentence 2: Aman went to a therapist.
Sentence 3: Anil went to download a health chatbot.
Sentence divided into tokens - Tokenization
Sentence 1 with tokens: [Aman, and, Anil, are, stressed,.]
Sentence 2 with tokens: [Aman, went, to, a, therapist,.]
Sentence 3 with tokens: [Anil, went, to, download, a, health, chatbot,.]
Removing unnecessary tokens – stop words ("and", "are", "to", "a") and punctuation
• Sentence 1: [Aman, Anil, stressed]
• Sentence 2: [Aman, went, therapist]
• Sentence 3: [Anil, went, download, health, chatbot]
Converting text to common case- lower case
• Sentence 1: [aman, anil, stressed]
• Sentence 2: [aman, went, therapist]
• Sentence 3: [anil, went, download, health, chatbot]
Stemming-reducing remaining words/ verbs to root words
• Sentence 1: [aman, anil, stress]
• Sentence 2: [aman, went, therap]
• Sentence 3: [anil, went, download, health, chatbot]
Lemmatization – unlike stemming, the resulting root word (lemma) has a meaning
• Document 1: [aman, anil, stress]
• Document 2: [aman, went, therapy]
• Document 3: [anil, went, download, health, chatbot]
Bag of Words - frequency of each word and construct the vocabulary for the corpus
aman anil stress went therapy download health chatbot
Repeated words are written just once.
Create a document vector for each document
          aman  anil  stress  went  therapy  download  health  chatbot
Doc 1:      1     1      1      0       0        0        0        0
• Prepare for all documents (Term Frequency – TF of words)
                aman  anil  stress  went  therapy  download  health  chatbot
Sentence 1:       1     1      1      0       0        0        0        0
Sentence 2:       1     0      0      1       1        0        0        0
Sentence 3:       0     1      0      1       0        1        1        1
Doc frequency:    2     2      1      2       1        1        1        1
• Create Inverse Document Frequency (IDF)
IDF = total number of documents / document frequency
(numerator = total number of documents; denominator = document frequency)
          aman  anil  stress  went  therapy  download  health  chatbot
IDF:       3/2   3/2    3/1    3/2     3/1      3/1      3/1      3/1
TFIDF for any word (Term Frequency and Inverse Document Frequency):
TFIDF(W)= TF(W) *log(IDF(W))
            aman        anil        stress      went        therapy     download    health      chatbot
Sentence 1: 1×log(3/2)  1×log(3/2)  1×log(3/1)  0×log(3/2)  0×log(3/1)  0×log(3/1)  0×log(3/1)  0×log(3/1)
Sentence 2: 1×log(3/2)  0×log(3/2)  0×log(3/1)  1×log(3/2)  1×log(3/1)  0×log(3/1)  0×log(3/1)  0×log(3/1)
Sentence 3: 0×log(3/2)  1×log(3/2)  0×log(3/1)  1×log(3/2)  0×log(3/1)  1×log(3/1)  1×log(3/1)  1×log(3/1)
• The words have now been converted to numbers. Each value shows the importance of a word within a particular document (its TF-IDF weight).
            aman   anil   stress  went   therapy  download  health  chatbot
Sentence 1: 0.176  0.176  0.477   0      0        0         0       0
Sentence 2: 0.176  0      0       0.176  0.477    0         0       0
Sentence 3: 0      0.176  0       0.176  0        0.477     0.477   0.477
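The numbers in this table can be checked with a short Python script, using log base 10 as in the worked example:

```python
# Reproduce the TF-IDF table for the three lemmatized documents above.
import math

docs = [
    ["aman", "anil", "stress"],
    ["aman", "went", "therapy"],
    ["anil", "went", "download", "health", "chatbot"],
]
vocab = ["aman", "anil", "stress", "went", "therapy",
         "download", "health", "chatbot"]
N = len(docs)  # total number of documents

for doc in docs:
    row = []
    for w in vocab:
        tf = doc.count(w)                   # term frequency in this document
        df = sum(w in d for d in docs)      # document frequency
        row.append(round(tf * math.log10(N / df), 3))
    print(row)
# [0.176, 0.176, 0.477, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.176, 0.0, 0.0, 0.176, 0.477, 0.0, 0.0, 0.0]
# [0.0, 0.176, 0.0, 0.176, 0.0, 0.477, 0.477, 0.477]
```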
Example 2
Pre-process the given data:
• Document 1: Welcome to Great Learning, Now start learning.
• Document 2: Learning is a good practice.
2 Documents
• Make a Corpus:
Corpus
Welcome to Great Learning, Now start learning. Learning is
a good practice.
Sentence Segmentation
Sentence 1: Welcome to Great Learning, Now start learning.
Sentence 2: Learning is a good practice.
Sentence divided into tokens
Sentence 1 with tokens: [Welcome, to, Great, Learning,,, Now,
start, learning,. ]
Sentence 2 with tokens: [Learning, is, a, good, practice,.]
Removing unnecessary tokens – stop words ("to", "is", "a") and punctuation
• Sentence 1: [Welcome, Great, Learning, Now, start, learning]
(note: "Learning" and "learning" are still distinct tokens, since the case has not yet been normalised)
• Sentence 2: [Learning, good, practice]
Converting text to common case- lower case
• Sentence 1: [welcome, great, learning, now, start, learning]
• Sentence 2: [learning, good, practice]
Stemming-reducing remaining words/ verbs to root words
• Sentence 1: [welcome, great, learn, now, start, learn]
• Sentence 2: [learn, good, practice]
Lemmatization – unlike stemming, the resulting root word (lemma) has a meaning
Sentence 1: [welcome, great, learn, now, start, learn]
Sentence 2: [learn, good, practice]
Bag of Words - frequency of each word and construct the vocabulary for the corpus
welcome great learn now start good practice
Repeated words are written just once.
Create a document vector for each document
            welcome  great  learn  now  start  good  practice
Sentence 1:    1       1      2     1     1      0       0
Sentence 2:    0       0      1     0     0      1       1
Term Frequency
            welcome  great  learn  now  start  good  practice
Sentence 1:    1       1      2     1     1      0       0
Sentence 2:    0       0      1     0     0      1       1

Document Frequency (number of documents containing the word)
               1       1      2     1     1      1       1

Inverse Document Frequency = total number of documents / document frequency
              2/1     2/1    2/2   2/1   2/1    2/1     2/1

TFIDF of any word: TFIDF(W) = TF(W) × log(IDF(W))
            welcome     great       learn       now         start       good        practice
Sentence 1: 1×log(2/1)  1×log(2/1)  2×log(2/2)  1×log(2/1)  1×log(2/1)  0×log(2/1)  0×log(2/1)
Sentence 2: 0×log(2/1)  0×log(2/1)  1×log(2/2)  0×log(2/1)  0×log(2/1)  1×log(2/1)  1×log(2/1)
Thank you!
"The capacity to learn is a gift,
the ability to learn is a skill,
the willingness to learn is a choice!"
- Brian Herbert
Content taken is the property of individual organizations and are used here for reference purpose only.