• Natural language is a means for us to express our thoughts and ideas.
• Language is a mutually agreed upon set of protocols involving words/sounds that
we use to communicate with each other.
• In this era of digitization and computation, we are constantly interacting with
machines around us through various means, such as voice commands and typing
instructions in the form of words.
• NLP can be defined as a field of computer science that is concerned with enabling
computer algorithms to understand, analyze and generate natural languages.
• Most of us have interacted with Siri or Alexa at some point.
• Siri and Alexa combine techniques such as Speech to Text with a search
engine to do their magic.
• Speech to Text is an application of NLP.
Stages in a Comprehensive NLP System
Tokenization
Morphological Analysis
Syntactic Analysis
Semantic Analysis (lexical and compositional)
Pragmatics and Discourse Analysis
Knowledge-Based Reasoning
Text Generation
• NLP works at different levels, which means that machines process and understand
natural language at different levels.
• These levels are :
• Morphological level: This level deals with understanding word structure and word
information.
• Lexical level: This level deals with understanding the part of speech of the word.
• Syntactic level: This level deals with understanding the syntactic analysis of a sentence, or
parsing a sentence.
• Semantic level: This level deals with understanding the actual meaning of a sentence.
• Discourse level: This level deals with understanding the meaning of a sentence beyond just
the sentence level, that is, considering the context.
• Pragmatic level: This level deals with using real-world knowledge to understand the
sentence.
History of NLP
• NLP is a field that has emerged from various other fields such as AI, linguistics,
and data science.
• The idea emerged from the need for machine translation (MT) in the 1940s.
• The original language pair was English and Russian.
• Other languages, such as Chinese, were also taken up in the early 1960s.
• A bleak era for MT/NLP began in 1966 with the ALPAC report, which concluded
that research in the area was not making sufficient progress; as a result,
MT/NLP work nearly died out.
• Conditions improved again in the 1980s, when MT/NLP products started
delivering useful results to customers.
• After nearly dying out in the 1960s, NLP/MT got a new life when the idea and
need of Artificial Intelligence emerged. LUNAR, developed in the early 1970s by
W. A. Woods, could analyze, compare, and evaluate the chemical data on lunar
rock and soil composition that was accumulating from the Apollo moon
missions, and could answer related questions.
• In the 1980s, computational grammar became a very active field of research,
linked with reasoning about meaning and taking the user's beliefs and
intentions into account.
• In the 1990s, the pace of growth of NLP/MT increased. Grammars, tools, and
practical NLP/MT resources, along with parsers, became available.
• Research on core topics such as word sense disambiguation and statistically
oriented NLP gave new direction to work on the lexicon.
• This growth of NLP was joined by other essential topics such as statistical
language processing, information extraction, and automatic summarization.
• No discussion of the history of NLP is complete without mentioning ELIZA, a
chatbot program developed between 1964 and 1966 at the MIT Artificial
Intelligence Laboratory.
• It was created by Joseph Weizenbaum.
• Its best-known script, DOCTOR, simulated a Rogerian psychotherapist and used
pattern-matching rules to respond to users' statements.
• It was one of the few programs of its time capable of attempting the Turing
test.
• Previously, a traditional rule-based system was used for computations, in which
you had to explicitly write hardcoded rules.
• Today, computations on natural language are being done using ML and DL
techniques.
• Let’s say we have to extract the names of politicians from a set of political
news articles. If we want to apply rule-based grammar, we must manually craft
certain rules based on human understanding of language.
• As we can see, a rule-based system like this would not yield very accurate
results.
• One major disadvantage is that the same rule is not applicable in all cases,
given the complex and nuanced nature of most language.
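As a minimal sketch of such a hand-crafted rule, the hypothetical pattern below assumes a politician's name is one or more capitalized words immediately following a known title (the title list and function names are illustrative, not from the source):

```python
import re

# Hypothetical hand-crafted rule: a name is assumed to be one or more
# capitalized words immediately following a known political title.
TITLES = r"(?:Senator|President|Minister|Governor)"
NAME_RULE = re.compile(TITLES + r"\s+((?:[A-Z][a-z]+\s?)+)")

def extract_names(text):
    """Return candidate names matched by the title-based rule."""
    return [m.strip() for m in NAME_RULE.findall(text)]

article = "Senator Jane Smith met President John Doe yesterday."
print(extract_names(article))  # ['Jane Smith', 'John Doe']
```

Note how brittle this is: any name not preceded by one of the listed titles is silently missed, which is exactly the weakness described above.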
Basic Concepts
• Text corpus or corpora
• Paragraph
• Sentences
• Phrases and words
• N-grams
• Bag-of-words
Text Corpus or corpora
• The language data that all NLP tasks depend upon is called the text corpus or
simply corpus.
• A corpus is a large set of text data in a given language, such as English or
French.
• The corpus can consist of a single document or a bunch of documents.
• The source of the text corpus can be social network sites like Twitter, blog sites,
open discussion forums like Stack Overflow, books, and several others.
• In some tasks, like machine translation, we require a multilingual corpus.
• For example, we might need both the English and French translations of the same
document content for developing a machine translation model.
• For speech tasks, we would also need human voice recordings and the
corresponding transcribed corpus.
• For many NLP tasks, the corpus is split into chunks for further analysis.
• These chunks could be at the paragraph, sentence, or word level.
Paragraph
• A paragraph is the largest unit of text handled by an NLP task.
• Paragraph-level boundaries by themselves may not be of much use unless broken
down into sentences.
• Sometimes, though, a paragraph is treated as a context boundary.
• Tokenizers that can split a document into paragraphs are available in some of the
Python libraries.
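As a simple sketch (not a specific library's API), paragraphs in plain text can often be recovered by splitting on blank lines:

```python
import re

def split_paragraphs(document):
    """Split a document into paragraphs on blank lines.
    A naive sketch; library tokenizers are more robust."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

doc = "First paragraph.\n\nSecond paragraph,\nstill the same one.\n\nThird."
print(split_paragraphs(doc))
```

This assumes paragraphs are separated by at least one empty line, which holds for much plain-text data but not for all formats.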
Sentences
• Sentences are the next level of lexical unit of language data.
• A sentence encapsulates a complete meaning or thought and context.
• It is usually extracted from a paragraph based on boundaries determined by
punctuation such as the period.
• A sentence may also convey the opinion or sentiment expressed in it.
• In general, sentences consist of parts of speech (POS) such as nouns, verbs,
adjectives, and so on.
• There are tokenizers available to split paragraphs into sentences based on
punctuation.
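A minimal sketch of such a sentence tokenizer, splitting on sentence-final punctuation followed by whitespace (real tokenizers also handle abbreviations such as "Dr." and "e.g."):

```python
import re

def split_sentences(paragraph):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

text = "It rained all day. The match was cancelled! Will it rain tomorrow?"
print(split_sentences(text))
```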
Phrases and words
• Phrases are a group of consecutive words within a sentence that can convey a
specific meaning.
• For example, in the sentence "Tomorrow is going to be a rainy day", the part
"going to be a rainy day" expresses a specific thought.
• Some of the NLP tasks extract key phrases from sentences for search and retrieval
applications.
• The next smallest unit of text is the word.
• Common tokenizers split sentences into words based on delimiters such as spaces
and commas.
• One of the problems in NLP is ambiguity: the same word can carry different
meanings in different contexts.
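A minimal word-tokenizer sketch along these lines, using a regular expression to keep alphanumeric runs and drop punctuation (a simplification; library tokenizers treat contractions and hyphens more carefully):

```python
import re

def tokenize_words(sentence):
    """Split a sentence into word tokens, discarding punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

print(tokenize_words("Tomorrow is going to be a rainy day."))
```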
N-gram
• A sequence of characters or words forms an N-gram.
• For example, a character unigram consists of a single character.
• A character bigram consists of a sequence of two characters, and so on.
• Similarly, word N-grams consist of sequences of n words.
• In NLP, N-grams are used as features for tasks like text classification.
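Both word and character N-grams can be generated with the same sliding-window sketch:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox".split()
print(ngrams(words, 2))        # word bigrams
print(ngrams(list("abc"), 2))  # character bigrams
```

The same function works for characters or words because an N-gram is defined over any token sequence.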
Bag-of-words
• Bag-of-words, in contrast to N-grams, does not consider word order or sequence.
• It captures the word occurrence frequencies in the text corpus.
• Bag-of-words is also used as features in tasks like sentiment analysis and topic
identification.
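A minimal bag-of-words sketch using a frequency counter; note that word order is discarded and only counts remain:

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its word-frequency counts, ignoring word order."""
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
```

Because order is dropped, "the cat sat" and "sat the cat" produce the same bag, which is exactly the trade-off versus N-gram features.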
Applications
• Analyzing sentiment
• Recognizing named entities
• Linking entities
• Translating text
• Natural language interfaces
• Semantic Role Labeling
• Relation extraction
• SQL query generation, or semantic parsing
• Machine Comprehension
• Textual entailment
• Coreference resolution
• Searching
• Question answering and chatbots
• Converting text to voice
• Converting voice to text
• Speaker identification
• Spoken dialog systems
• Other applications