NATURAL LANGUAGE PROCESSING
MODULE 1 – ELECTIVE I
Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
WHAT IS NLP
NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence. It is the technology used by machines to understand, analyse, manipulate, and interpret human language. It helps developers organize knowledge for performing tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
INTRODUCTION TO NLP
Natural Language Processing (NLP) is a field that provides machines with the ability to understand natural human language.
Natural language processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.
Developers can use NLP to perform tasks like speech recognition, sentiment analysis, translation, grammar auto-correction while typing, and automated answer generation. NLP is a challenging field since it deals with human language, which is extremely diverse and can be spoken in many different ways. Developers use NLP algorithms to implement these functionalities.
Scopes & Applications of NLP
Voice Assistants like Google Assistant, Amazon Alexa, and Apple Siri: Voice
assistants have become quite popular with the advancement of technology. They use
voice recognition, NLP, and speech synthesis to communicate with human beings
successfully. Voice assistants can perform many tasks, like making calls,
answering questions, playing our favorite songs, and searching for something on the
internet. They have been around for many years now; Apple's Siri was
released with the iPhone 4S in 2011.
Customer Research: Many companies are turning to NLP to perform sentiment analysis,
which can provide an understanding of customers' buying habits, their likes, and whether their
comments are positive or negative. Valuable insights come from understanding
customers better, and the increase in social media usage has tremendously helped sentiment
analysis. Based on customers' habits, a business can make marketing and sales
decisions.
Autocomplete feature: When we search for a particular text in Google Search, we can
see the autocomplete feature working. This makes it easy for us since we don't have to
type everything; we can just select from the list of suggestions. NLP and the study of
languages make it possible for the server to provide these recommendations.
E-mail classification: Popular email providers use NLP algorithms to understand the
tone of each email and segregate our inbox accordingly. Many emails are automatically
sent to the Spam folder based on careful analysis by NLP algorithms. Automatic
segregation of emails saves people a lot of time and energy; with the immense
progress of technology, gone are the days when we had to manually scan through every
email. Popular email providers using such techniques include Gmail and Yahoo Mail.
Automated messenger bots: Many websites have chatbots that communicate efficiently with
users. Many food delivery operators, such as Domino's, provide automated chat options for
users to place their orders. With more and more experience, these chatbots provide increasingly
user-friendly communication.
Financial research: NLP analyzes people's comments and views about a particular
subject and provides valuable knowledge to financial traders and companies. It can be
used to track news and global events, and algorithms can use this information to improve
business profits.
Fake news detection: Fake news has become a significant problem across the world today,
with the increase in social media usage, and it causes unnecessary stress and worry among
people. NLP algorithms can analyze the language of an article and estimate whether it is
trustworthy. This is extremely helpful when the world is facing issues like a global
pandemic or a natural disaster such as a cyclone.
History of NLP
(1940-1960) - Focused on Machine Translation (MT):
Work on natural language processing began in the 1940s. 1948 - The first
recognisable NLP application was introduced at Birkbeck College, London.
1950s - There was a conflicting view between linguistics and
computer science. Chomsky published his first book, Syntactic Structures, and
claimed that language is generative in nature.
In 1957, Chomsky introduced the idea of Generative Grammar: rule-based
descriptions of syntactic structures.
(1960-1980) - Flavored with Artificial Intelligence (AI)
From 1960 to 1980, the key developments were the ATN and Case Grammar.
Augmented Transition Networks (ATN): An ATN is a parsing formalism that extends the
finite state machine. A plain finite state machine can recognize only regular languages; a
regular language is a formal language that can be defined by a regular expression or a finite
automaton (a small sketch follows). An ATN augments such networks with recursion and
registers so that it can handle the more complex structures of natural language.
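As a minimal, hedged sketch of the regular-language idea above (an illustration of finite-state pattern matching, not of a full ATN), the following Python snippet defines a regular language with a regular expression; the pattern and test strings are invented for illustration.

import re

# A regular language over {a, b}: one or more 'a's followed by one or more 'b's.
pattern = re.compile(r"^a+b+$")

for s in ["ab", "aaabb", "ba", "abba"]:
    # A string belongs to the language iff the whole string matches the pattern.
    print(s, "->", bool(pattern.match(s)))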
Case Grammar: Case grammar is a linguistic theory that analyzes the grammatical
structure of sentences by studying the semantic roles of words. It was developed in the
1960s by American linguist Charles J. Fillmore. Case grammar is based on the idea that each
verb has a specific number of deep cases, or semantic roles, that it requires to form a
sentence. For example, the verb "give" requires an agent, an object, and a beneficiary. In
general, grammatical case is a grammatical category that refers to inflections that indicate a
word's function in a sentence.
For example: "Neha broke the mirror with the hammer". In this example case grammar identify
Neha as an agent, mirror as a theme, and hammer as an instrument.
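Purely as a hedged illustration of deep cases, the sketch below hand-codes a role frame for this sentence in Python; it is a made-up data structure for teaching, not the output of any real semantic role labeler.

# Hand-built case-grammar frame for: "Neha broke the mirror with the hammer."
frame = {
    "verb": "break",
    "agent": "Neha",              # who performs the action
    "theme": "the mirror",        # what the action affects
    "instrument": "the hammer",   # what the action is performed with
}

# The same role slots could be filled for other sentences built on "break".
print(f'{frame["agent"]} {frame["verb"]}s {frame["theme"]} with {frame["instrument"]}.')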
From 1960 to 1980, the key systems invented were SHRDLU and LUNAR:
SHRDLU: SHRDLU is a program written by Terry Winograd in 1968-70. It allowed users to
communicate with the computer and move objects in a simulated world. It could handle instructions such as "pick
up the green ball" and also answer questions like "What is inside the black box?" The main
importance of SHRDLU is that it showed that syntax, semantics, and reasoning about the
world can be combined to produce a system that understands natural language.
Working of SHRDLU: SHRDLU controlled a virtual robot arm that operated above a table
with colored play blocks. The computer simulated the arm and its environment, and
displayed the arm's activities on a TV screen.
Users communicated with SHRDLU using a keyboard, and the computer's replies appeared as
subtitles on the TV screen.
LUNAR: LUNAR is the classic example of a natural language database interface system; it
used ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural
language expressions into database queries and handled 78% of requests without errors.
1980 – Current: Until the 1980s, natural language processing systems were based on complex
sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for language
processing.
1990s: In the early 1990s, NLP started growing faster and achieved good processing
accuracy, especially for English grammar. Around 1990, electronic text corpora were also introduced,
which provided a good resource for training and evaluating natural language programs.
Other factors include the availability of computers with fast CPUs and more memory.
The major factor behind the advancement of natural language processing was the Internet.
Modern NLP consists of various applications, like speech recognition, machine
translation, and machine text reading. Combining all these applications
allows artificial intelligence to gain knowledge of the world. Consider the example of
Amazon Alexa: you can ask Alexa a question, and it will reply to you.
Advantages of NLP
NLP helps users ask questions about any subject and get a direct response within
seconds.
NLP offers exact answers to a question; it does not return unnecessary or
unwanted information.
NLP helps computers communicate with humans in their own languages.
It is very time efficient.
Many companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.
Disadvantages of NLP
❑ NLP may not capture context.
❑ NLP can be unpredictable.
❑ NLP may require more keystrokes.
❑ NLP systems often cannot adapt to a new domain and have limited functionality;
a typical NLP system is built for a single, specific task only.
Components of NLP
There are two components of NLP: Natural Language Understanding (NLU) and
Natural Language Generation (NLG).
Natural Language Understanding (NLU): NLU involves transforming human language into a
machine-readable format. It helps the machine understand and analyze human language by
extracting information from large bodies of text, such as keywords, emotions, relations, and semantics.
Natural Language Generation (NLG): NLG acts as a translator that converts
computerized data into a natural language representation. It mainly involves text
planning, sentence planning, and text realization. NLU is harder than NLG.
Levels of NLP
Basically, there is a total of seven independent levels for understanding and extracting the meaning
of a text.
1. Phonology level
2. Morphology level
3. Lexical level
4. Syntactic level
5. Semantics level
6. Discourse level
7. Pragmatic level
1. Phonology Level
This level deals with pronunciation.
It deals with the interpretation of speech sounds across words.
2. Morphological Level
It deals with morphemes, the smallest units that convey meaning, including suffixes and prefixes.
Morphology studies how words are built from these smaller meaningful units.
E.g., the word 'rabbit' has a single morpheme, while 'rabbits' has two morphemes;
the suffix '-s' distinguishes the plural from the singular.
3. Lexical Level
This deals with the study of words with respect to their lexical meaning and part of
speech (POS).
It uses the lexicon, which is a collection of lexemes.
A lexeme is a basic unit of lexical meaning; it is an abstract unit of morphological
analysis.
4. Syntactic Level
This level deals with the grammar and structure of sentences.
It studies the proper relationships between words.
The POS-tagging output of lexical analysis can be used at the syntactic level to group
words into phrase and clause brackets.
5. Semantics Level
This level deals with the meaning of words and sentences.
There are two different approaches:
1) Syntax-driven semantic analysis
2) Semantic grammar
It is the study of the meaning of words as associated with grammatical structure.
6. Discourse Level
This level deals with the structure of different kinds of text.
There are 2 types of discourse processing:
1) Anaphora resolution
2) Discourse/text structure recognition
In anaphora resolution, pronouns and other referring expressions are resolved to the entities they refer to.
7. Pragmatic Level
This level deals with the use of real-world knowledge and an understanding of how this influences the
meaning of what is being communicated.
Pragmatics identifies the meaning of words and phrases based on how language is used to
communicate.
Phases of NLP
The process of analysis in NLP can be divided into five distinct phases:
Lexical Analysis, Syntactic Analysis, Semantic Analysis, Discourse Integration, and
Pragmatic Analysis. Each phase plays a crucial role in the overall understanding and
processing of natural language.
Phase 1: Lexical & Morphological Analysis
Tokenization: The lexical phase in Natural Language Processing (NLP) involves scanning
text and breaking it down into smaller units such as paragraphs, sentences, and words.
This process, known as tokenization, converts raw text into manageable units called tokens
or lexemes. Tokenization is essential for understanding and processing text at the word level.
In addition to tokenization, various data cleaning and feature extraction techniques are
applied (a minimal sketch follows the list below), including:
Lemmatization: Reducing words to their base or root form.
Stopwords Removal: Eliminating common words that do not carry significant meaning,
such as "and," "the," and "is."
Correcting Misspelled Words: Ensuring the text is free of spelling errors to maintain
accuracy.
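A minimal sketch of these lexical-phase steps using the NLTK library; this assumes nltk is installed and the punkt, stopwords, and wordnet resources have been downloaded, and the sample sentence is invented for illustration.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# First run only: nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "The cats are sitting on the mats."

# Tokenization: break raw text into word-level tokens.
tokens = nltk.word_tokenize(text.lower())

# Stopword removal: drop common words such as "the" and "are".
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stops]

# Lemmatization: reduce words to their base form ("cats" -> "cat").
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in content])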
Morphological analysis: Morphological analysis is another critical phase in NLP, focusing on
identifying morphemes, the smallest units of a word that carry meaning and cannot be further divided.
Understanding morphemes is vital for grasping the structure of words and their relationships.
Types of Morphemes
Free Morphemes: Text elements that carry meaning independently and make sense on their own. For
example, "bat" is a free morpheme.
Bound Morphemes: Elements that must be attached to free morphemes to convey meaning, as they cannot
stand alone. For instance, the suffix "-ing" is a bound morpheme, needing to be attached to a free morpheme
like "run" to form "running."
Morphological analysis is crucial in NLP for several reasons:
Understanding Word Structure: It helps in deciphering the composition of complex words.
Predicting Word Forms: It aids in anticipating different forms of a word based on its
morphemes.
Improving Accuracy: It enhances the accuracy of tasks such as part-of-speech tagging,
syntactic parsing, and machine translation.
By identifying and analyzing morphemes, the system can interpret text correctly at the most
fundamental level, laying the groundwork for more advanced NLP applications.
Phase 2: Syntactic Analysis ( Parsing )
Syntactic analysis, also known as parsing, is the second phase of Natural Language Processing
(NLP). This phase is essential for understanding the structure of a sentence and assessing its
grammatical correctness. It involves analyzing the relationships between words and ensuring
their logical consistency by comparing their arrangement against standard grammatical rules.
Role of parsing: Parsing examines the grammatical structure and relationships within a given
text. It assigns Parts-Of-Speech (POS) tags to each word, categorizing them as nouns, verbs,
adverbs, etc. This tagging is crucial for understanding how words relate to each other syntactically
and helps in avoiding ambiguity. Ambiguity arises when a text can be interpreted in multiple ways
due to words having various meanings. For example, the word "book" can be a noun (a physical
book) or a verb (the action of booking something), depending on the sentence context.
Examples of Syntax:
Consider the following sentences:
Correct Syntax: "John eats an apple."
Incorrect Syntax: "Apple eats John an."
Despite using the same words, only the first sentence is grammatically correct and makes sense. The
correct arrangement of words according to grammatical rules is what makes the sentence meaningful.
Assigning POS Tags
During parsing, each word in the sentence is assigned a POS tag to indicate its grammatical category.
Here's an example breakdown (a code sketch follows the tag list):
Sentence: "John eats an apple."
POS Tags:
John: Proper Noun (NNP)
eats: Verb (VBZ)
an: Determiner (DT)
apple: Noun (NN)
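A hedged sketch of the same tagging with NLTK's off-the-shelf POS tagger, assuming nltk and its punkt and averaged_perceptron_tagger resources are available; the tagger uses the Penn Treebank tag set shown above.

import nltk
# First run only: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("John eats an apple.")

# pos_tag returns (token, tag) pairs, e.g. ('John', 'NNP'), ('eats', 'VBZ').
for token, tag in nltk.pos_tag(tokens):
    print(token, "->", tag)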
Assigning POS tags correctly is crucial for understanding the sentence structure and
ensuring accurate interpretation of the text.
Importance of Syntactic Analysis: By analyzing and ensuring proper syntax, NLP systems can
better understand and generate human language. This analysis helps in various applications, such
as machine translation, sentiment analysis, and information retrieval, by providing a clear
structure and reducing ambiguity.
Phase 3: Semantic Analysis
Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on
extracting the meaning from text. Unlike syntactic analysis, which deals with
grammatical structure, semantic analysis is concerned with the literal and contextual
meaning of words, phrases, and sentences.
Semantic analysis aims to understand the dictionary definitions of words and their usage
in context. It determines whether the arrangement of words in a sentence makes logical
sense. This phase helps in finding context and logic by ensuring the semantic coherence of
sentences.
Key tasks in semantic analysis include:
Named Entity Recognition (NER): NER identifies and classifies entities within the text, such
as names of people, places, and organizations. These entities belong to predefined categories
and are crucial for understanding the text's content.
Word Sense Disambiguation (WSD): WSD determines the correct meaning of ambiguous
words based on context. For example, the word "bank" can refer to a financial institution or the
side of a river. WSD uses contextual clues to assign the appropriate meaning (a sketch follows).
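As a rough illustration of WSD, the sketch below applies NLTK's implementation of the classic Lesk algorithm to the "bank" example, assuming nltk and its wordnet resource are available; Lesk is a simple dictionary-overlap heuristic, so the sense it picks is only an approximation.

from nltk import word_tokenize
from nltk.wsd import lesk
# First run only: nltk.download("punkt"); nltk.download("wordnet")

contexts = [
    word_tokenize("I deposited money at the bank yesterday"),
    word_tokenize("We sat on the bank of the river"),
]

# lesk picks the WordNet sense whose gloss overlaps most with the context words.
for context in contexts:
    sense = lesk(context, "bank")
    print(" ".join(context), "->", sense, "-", sense.definition() if sense else "no sense found")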
Examples of Semantic Analysis
Syntactically Correct but Semantically Incorrect: "Apple eats a John."
This sentence is grammatically correct but does not make sense semantically. An apple
cannot eat a person, highlighting the importance of semantic analysis in ensuring
logical coherence.
Literal Interpretation: "What time is it?"
This phrase is interpreted literally as someone asking for the current time,
demonstrating how semantic analysis helps in understanding the intended meaning.
Phase 4: Discourse Integration
Discourse Integration is the fourth phase of Natural Language Processing (NLP). This phase
deals with comprehending the relationship between the current sentence and earlier sentences
or the larger context. Discourse integration is crucial for contextualizing text and understanding
the overall message conveyed.
Discourse integration examines how words, phrases, and sentences relate to each other within
a larger context. It assesses the impact a word or sentence has on the structure of a text and
how the combination of sentences affects the overall meaning. This phase helps in
understanding implicit references and the flow of information across sentences.
In conversations and texts, words and sentences often depend on preceding or following
sentences for their meaning. Understanding the context behind these words and sentences is
essential to accurately interpret their meaning.
Example of Discourse Integration
Consider the following examples:
Contextual Reference: "This is unfair!"
To understand what "this" refers to, we need to examine the preceding or following
sentences. Without context, the statement's meaning remains unclear.
Anaphora Resolution: "Taylor went to the store to buy some groceries. She realized she forgot
her wallet."
In this example, the pronoun "she" refers back to "Taylor" in the first sentence.
Understanding that "Taylor" is the antecedent of "she" is crucial for grasping the sentence's
meaning (a toy resolver is sketched below).
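A deliberately naive Python sketch of anaphora resolution, under the strong assumption that a pronoun refers to the most recent capitalized name; real systems use trained coreference models, so this toy heuristic only illustrates the idea on the Taylor example.

import re

text = "Taylor went to the store to buy some groceries. She realized she forgot her wallet."

last_name = None
links = []
for tok in text.split():
    word = re.sub(r"[^\w]", "", tok)          # strip punctuation
    if word.lower() in {"she", "her"}:
        if last_name:
            links.append((word, last_name))   # link pronoun to antecedent
    elif word.istitle():
        last_name = word                      # remember most recent candidate name

print(links)  # [('She', 'Taylor'), ('she', 'Taylor'), ('her', 'Taylor')]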
Phase 5: Pragmatic Analysis
Pragmatic Analysis is the fifth and final phase of Natural Language Processing (NLP),
focusing on interpreting the inferred meaning of a text beyond its literal content. Human
language is often complex and layered with underlying assumptions, implications, and
intentions that go beyond straightforward interpretation. This phase aims to grasp these
deeper meanings in communication.
Pragmatic analysis goes beyond the literal meanings examined in semantic analysis, aiming
to understand what the writer or speaker truly intends to convey. In natural language, words
and phrases can carry different meanings depending on context, tone, and the situation in
which they are used.
In human communication, people often do not say exactly what they mean. For instance, the word
"Hello" can have various interpretations depending on the tone and context in which it is spoken. It
could be a simple greeting, an expression of surprise, or even a signal of anger. Thus,
understanding the intended meaning behind words and sentences is crucial.
Examples of Pragmatic Analysis
Contextual Greeting: "Hello! What time is it?"
"Hello!" is more than just a greeting; it serves to establish contact.
"What time is it?" might be a straightforward request for the current time, but it could also imply concern
about being late.
Figurative Expression: "I'm falling for you."
The word "falling" literally means collapsing, but in this context, it means the speaker is expressing love
for someone.
Pragmatic analysis is essential for applications like sentiment analysis, conversational AI, and
advanced dialogue systems. By interpreting the deeper, inferred meanings of texts, NLP systems
can understand human emotions, intentions, and subtleties in communication, leading to more
accurate and human-like interactions.
Common Challenges of NLP
1. Language Differences: Human language is rich and intricate, and there are many
languages spoken by humans. Thousands of human languages are spoken around the world,
each with its own grammar, vocabulary, and cultural nuances. No one can understand all of
these languages, and the productivity of human language is high. There is
ambiguity in natural language, since the same words and phrases can have different meanings in
different contexts; this is a major challenge in the understanding of natural language. Natural
languages have complex syntactic structures and grammatical rules, covering
word order, verb conjugation, tense, aspect, and agreement. There is rich semantic content in
human language that allows speakers to convey a wide range of meanings through words and
sentences. Natural language is also pragmatic, meaning that how language is used in context serves
communication goals. Finally, human language evolves over time through processes such
as lexical change.
2. Training data : Training data is a curated collection of input-output pairs, where the input
represents the features or attributes of the data, and the output is the corresponding label or
target. Training data is composed of both the features (inputs) and their corresponding labels
(outputs). For NLP, features might include text data, and labels could be categories, sentiments,
or any other relevant annotations. It helps the model generalize patterns from the training set to
make predictions or classifications on new, previously unseen data.
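A small sketch of what such labelled training data can look like, with a train/test split via scikit-learn (assuming scikit-learn is installed); the texts and sentiment labels are invented for illustration.

from sklearn.model_selection import train_test_split

# Each example pairs an input text (the features) with an output label (the target).
texts = [
    "I loved this movie",
    "Absolutely terrible service",
    "The product works great",
    "Worst purchase I have ever made",
]
labels = ["positive", "negative", "positive", "negative"]

# Hold out part of the data to check generalization to unseen examples.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=0)
print(len(X_train), "training examples,", len(X_test), "test example(s)")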
3. Development Time and Resource Requirements: The development time and resource
requirements for Natural Language Processing (NLP) projects depend on various factors,
including task complexity, the size and quality of the data, the availability of existing tools and
libraries, and the expertise of the team involved.
Some key points are as follows:
Complexity of the task: Tasks such as text classification or sentiment analysis
may require less time than more complex tasks such as machine translation or question
answering.
Availability and quality of data: NLP models require high-quality
annotated data. Collecting, annotating, and preprocessing large text
datasets can be time-consuming and resource-intensive, especially for tasks that require
specialized domain knowledge or fine-grained annotations.
Selection of algorithms and development of models: It can be difficult to choose the
machine learning algorithm best suited to a given Natural Language Processing task.
Training and evaluation: Training requires powerful computational resources, including
specialized hardware (GPUs or TPUs), and time for iterating on training. It is also
important to evaluate the performance of the model with suitable metrics and
validation techniques to confirm the quality of the results.
4. Navigating phrasing ambiguities: Navigating phrasing ambiguities is crucial
because of the inherent complexity of human language. Phrasing ambiguity arises
when a phrase can be interpreted in multiple ways, leading to uncertainty about its
meaning. Here are some key points for navigating phrasing ambiguities in NLP:
Contextual understanding: Contextual information, like previous sentences, topic focus, or
conversational cues, can give valuable clues for resolving ambiguities.
Semantic analysis: The semantic content of the text is analyzed to determine meaning based on word senses,
lexical relationships, and semantic roles. Tools such as word sense disambiguation and semantic role labeling
can help resolve phrasing ambiguities.
Syntactic analysis: The syntactic structure of the sentence is analyzed to find possible interpretations
based on grammatical relationships and syntactic patterns.
Pragmatic analysis: Pragmatic factors, such as the speaker's intentions and implicatures, are used to infer
the meaning of a phrase. This analysis requires understanding the pragmatic context.
Statistical methods: Statistical methods and machine learning models are used to learn patterns from data and
make predictions about the input phrase.
5. Misspellings and Grammatical Errors: Overcoming misspellings and grammatical errors is a
basic challenge in NLP, as these forms of linguistic noise can impact the accuracy of
understanding and analysis. Here are some key points for handling misspellings and grammatical errors in
NLP:
Spell checking: Implement spell-check algorithms and dictionaries to find and correct misspelled words.
Text normalization: The text is normalized into a standard format, which may involve tasks such
as converting text to lowercase, removing punctuation and special characters, and expanding
contractions (see the sketch after this list).
Tokenization: The text is split into individual tokens with the help of tokenization techniques. This
helps identify and isolate misspelled words and grammatical errors, making them easier to
correct.
Language models: Language models trained on a large corpus can predict how likely a word or
phrase is in its context, flagging unlikely sequences as possible errors.
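A minimal plain-Python sketch of the normalization and tokenization steps above; the three-entry contraction table is a hand-made assumption, and a production system would use a fuller dictionary or a dedicated library.

import re
import string

# Tiny illustrative contraction table (a real system would use a fuller one).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def normalize(text):
    text = text.lower()                                  # case folding
    for short, full in CONTRACTIONS.items():             # expand contractions
        text = text.replace(short, full)
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # drop punctuation
    return text.split()                                  # whitespace tokenization

print(normalize("Don't worry, it's ONLY a test!!"))
# ['do', 'not', 'worry', 'it', 'is', 'only', 'a', 'test']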
6. Mitigating Innate Biases in NLP Algorithms: Mitigating innate biases in NLP
algorithms is a crucial step in ensuring fairness, equity, and inclusivity in natural language
processing applications. Here are some key points for mitigating biases in NLP algorithms:
Collection of data and annotation: It is very important to ensure that the training data used to
develop NLP algorithms is diverse, representative, and free from biases.
Detection and analysis of bias: Apply bias detection and analysis methods to the training data to
find biases based on demographic factors such as race, gender, and age.
Data preprocessing: Preprocessing the training data is an important way to mitigate
biases, for example by debiasing word embeddings, balancing class distributions, and augmenting
underrepresented samples.
Fair representation learning: Natural Language Processing models are trained to learn fair
representations that are invariant to protected attributes such as race or gender.
Auditing and evaluation of models: Natural language models are evaluated for fairness and
bias with the help of metrics and audits. NLP models are evaluated on diverse datasets, and
post-hoc analyses are performed to find and mitigate innate biases in NLP algorithms.
7. Words with Multiple Meanings: Words with multiple meanings pose a lexical challenge in
Natural Language Processing because of their ambiguity. Such words, known as polysemous or
homonymous words, have different meanings depending on the context
in which they are used. Here are some key points for addressing this lexical challenge in NLP:
Semantic analysis: Implement semantic analysis techniques to find the underlying meaning of the
word in various contexts. Word embeddings and semantic networks are semantic representations that
can capture the similarity and relatedness between different word senses.
Domain-specific knowledge: Domain knowledge is very valuable in Natural Language
Processing tasks, as it provides context and constraints for determining the
correct sense of a word.
Multi-word Expressions (MWEs): The meaning of the entire sentence or phrase is analyzed to
disambiguate a word with multiple meanings.
Knowledge graphs and ontologies: Apply knowledge graphs and ontologies to find the semantic
relationships between different word senses.
8. Addressing Multilingualism: It is very important to address language diversity and multilingualism
in Natural Language Processing to ensure that NLP systems can handle text data in multiple
languages effectively. Here are some key points for addressing language diversity and multilingualism:
Multilingual corpora: A multilingual corpus consists of text data in multiple languages; such corpora serve as valuable
resources for training NLP models and systems.
Cross-lingual transfer learning: These are techniques used to transfer knowledge learned from
one language to another.
Language identification: Design language identification models to automatically detect the language of a
given text.
Machine translation: Machine translation enables communication and information access across
language barriers and can be used as a preprocessing step for multilingual NLP tasks.
9. Reducing Uncertainty and False Positives in NLP: Reducing uncertainty
and false positives is a crucial task for improving the accuracy and reliability of
NLP models. Here are some key points for approaching it:
Probabilistic models: Use probabilistic models to quantify the uncertainty in predictions. Probabilistic
models such as Bayesian networks give probabilistic estimates of outputs, allowing uncertainty
quantification and better decision making.
Confidence scores: Confidence scores or probability estimates are calculated for NLP predictions to assess
the model's certainty about its output. Confidence scores help identify cases where the model is
uncertain or likely to produce false positives.
Threshold tuning: For classification tasks, decision thresholds are adjusted to balance
sensitivity (recall) against specificity. False positives in NLP can be reduced by setting appropriate
thresholds (see the sketch after this list).
Ensemble methods: Apply ensemble learning techniques to combine multiple models and reduce uncertainty.
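A hedged sketch of confidence scores and threshold tuning with scikit-learn (assuming it is installed); the tiny spam/ham dataset and the 0.7 threshold are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["win a free prize now", "meeting at noon", "free money click here", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# predict_proba gives a confidence score; flag spam only above a tuned threshold.
THRESHOLD = 0.7  # raising this reduces false positives at the cost of recall
for msg in ["free prize meeting", "see you at lunch"]:
    p_spam = clf.predict_proba(vec.transform([msg]))[0, 1]
    print(msg, "-> spam" if p_spam >= THRESHOLD else "-> ham", f"(p={p_spam:.2f})")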
10. Facilitating Continuous Conversations with NLP: Facilitating continuous conversations with
NLP involves developing systems that understand and respond to human language in real
time, enabling seamless interaction between users and machines. Implementing real-time natural
language processing pipelines gives the capability to analyse and interpret user input as it is received,
with algorithms and systems optimized for low-latency processing to ensure quick responses
to user queries and inputs. Understanding context enables systems to interpret user intent,
track conversation history, and generate relevant responses based on the ongoing dialogue.
Intent recognition algorithms are applied to find the underlying goals and intentions expressed by users in
their messages.
How to overcome NLP Challenges
Overcoming the challenges in NLP requires a combination of innovative technologies, domain experts,
and sound methodological approaches. Here are some key points for overcoming the challenges of NLP tasks:
Quantity and quality of data: High-quality, diverse data is needed to train NLP algorithms
effectively. Data augmentation, data synthesis, and crowdsourcing are techniques for addressing data scarcity
issues.
Ambiguity: The NLP algorithm should be trained to disambiguate words and phrases.
Out-of-vocabulary words: Techniques such as subword tokenization, character-level modeling, and
vocabulary expansion are implemented to handle out-of-vocabulary words.
Lack of annotated data: Techniques such as transfer learning and pre-training can be used to transfer
knowledge from large datasets to specific tasks with limited labeled data.
Machine learning in NLP
Machine Learning (ML) has revolutionized the field of Natural Language Processing
(NLP), allowing computers to understand, interpret, and generate human language more
effectively. By leveraging statistical techniques and algorithms, ML empowers NLP systems
to handle a vast array of complex language-related tasks.
Key ML techniques in NLP are:
Supervised learning
Unsupervised learning
Deep learning
What is supervised learning
Supervised learning is a type of machine learning where a model is trained on a labelled
dataset. This means each data point is paired with a correct output label. The model
learns to map input data to the correct output by analysing the training data and adjusting
its internal parameters.
Key Characteristics:
Labelled Data: The training data is explicitly labelled with the correct output.
Model Training: The model learns to recognize patterns and relationships between the
input data and the corresponding output labels.
Prediction: Once trained, the model can make predictions on new, unseen data.
Supervised Learning in NLP
Text Classification: Categorizing text documents into predefined classes (e.g.,
spam detection, sentiment analysis); a minimal sketch follows this list.
Named Entity Recognition (NER): Identifying and classifying named entities
(e.g., persons, organizations, locations).
Part-of-Speech Tagging: Assigning grammatical tags to words (e.g., noun, verb,
adjective).
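A minimal supervised text-classification sketch using scikit-learn (assuming it is installed); the labelled examples are invented, and a real application would train on a much larger dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labelled data: each text is paired with its correct output class.
train_texts = ["great movie, loved it", "awful plot and acting",
               "fantastic performance", "boring and far too long"]
train_labels = ["pos", "neg", "pos", "neg"]

# Pipeline: turn text into TF-IDF features, then fit a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Predict labels for new, unseen data.
print(model.predict(["what a great film", "so boring"]))  # likely ['pos', 'neg']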
What is Unsupervised learning
Unsupervised Learning is a type of machine learning where the model learns patterns
from unlabelled data. Unlike supervised learning, there are no predefined output labels.
The model must identify underlying structures and relationships within the data itself.
Key Characteristics:
Unlabelled Data: The training data is not explicitly labeled.
Pattern Discovery: The model seeks to identify patterns, clusters, or anomalies in the
data.
Self-Organization: The model learns to group similar data points together or to identify
outliers.
Unsupervised Learning in NLP
Topic Modeling: Discovering abstract topics within a collection of documents; a minimal sketch follows this list.
Word Embedding: Representing words as numerical vectors, capturing semantic
and syntactic relationships.
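A hedged sketch of topic modeling with scikit-learn's LatentDirichletAllocation (assuming scikit-learn is installed); the toy corpus and the choice of two topics are illustrative assumptions.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets closed", "investors traded shares and bonds"]

# No labels are given: LDA discovers topic structure from word co-occurrence alone.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top words for each discovered topic.
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-3:][::-1]]
    print(f"topic {i}:", top)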
What is Deep learning
Deep Learning is a subset of machine learning that utilizes artificial neural networks with
multiple layers to learn complex patterns from data.
It's inspired by the structure and function of the human brain, where neurons are
interconnected to process information.
Key Characteristics:
Artificial Neural Networks: Deep learning models are composed of interconnected layers
of artificial neurons.
Hierarchical Learning: Information is processed in multiple layers, with each layer
learning increasingly complex features.
Feature Learning: Deep learning models can automatically learn relevant features from raw
data, reducing the need for manual feature engineering.
Common Deep Learning Architectures:
Convolutional Neural Networks (CNNs): Primarily used for image and video analysis.
Recurrent Neural Networks (RNNs): Designed for sequential data, such as text and time series
data.
Long Short-Term Memory (LSTM) Networks: A type of RNN capable of learning long-term
dependencies.
Transformer Networks: A powerful architecture that leverages self-attention mechanisms for
tasks like machine translation and text generation.
Deep learning in NLP
Recurrent Neural Networks (RNNs): Processing sequential data, such as text, by
considering the order of words; a minimal sketch follows this list.
Long Short-Term Memory (LSTM) Networks: A type of RNN capable of learning long-
term dependencies.
Transformer Models: Powerful models that leverage self-attention mechanisms to capture
complex relationships between words.
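A minimal PyTorch sketch of an LSTM-based text classifier (assuming torch is installed); the vocabulary size, layer dimensions, and the random batch of token ids are illustrative assumptions, and a real model would be trained on properly encoded text.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # token ids -> dense vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)      # final hidden state -> class logits

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)       # h_n holds the last hidden state per layer
        return self.fc(h_n[-1])          # (batch, num_classes)

model = LSTMClassifier()
batch = torch.randint(0, 1000, (4, 12))  # 4 fake sentences of 12 token ids each
print(model(batch).shape)                # torch.Size([4, 2])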
Applications of Machine learning in NLP
Sentiment Analysis: Determining the sentiment expressed in text (positive, negative, neutral).
Machine Translation: Translating text from one language to another.
Text Summarization: Condensing long texts into shorter versions.
Chatbots and Virtual Assistants: Developing conversational agents that can interact with
users.
Information Extraction: Extracting structured information from unstructured text.
Text Generation: Generating human-quality text, such as articles or code.

NLP Msc Computer science S2 Kerala University

  • 1.
    NATURAL LANGUAGE PROCESSING MODULE 1– ELECTIVE I Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 2.
    WHAT IS NLP NLPstands for Natural Language Processing, which is a part of Computer Science, Human language, and Artificial Intelligence. It is the technology that is used by machines to understand, analyse, manipulate, and interpret human's languages. It helps developers to organize knowledge for performing tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 3.
    NLP Prepared By VineethP, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 4.
    INTRODUCTION TO NLP NaturalLanguage Processing (NLP) is a field that provides machines with the ability to understand natural human language. Natural language processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data Developers can make use of NLP to perform tasks like speech recognition, sentiment analysis, translation, auto-correct of grammar while typing, and automated answer generation. NLP is a challenging field since it deals with human language, which is extremely diverse and can be spoken in a lot of ways. Developers make use of NLP Algorithms to implement functionalities Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 5.
    Scopes & Applicationsof NLP Voice Assistants like Google Assistant, Amazon Alexa, and Apple Siri: Voice Assistants have become quite popular with the advancement of technology. They use voice recognition, NLP, and speech synthesis to communicate with human beings successfully. Voice Assistants can perform a lot of tasks like making calls, answering questions, playing our favorite song, searching for something on the internet, etc. They have been around for many years now, with Apple's Siri was released with iPhone 4s in the year 2011. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 6.
    Scopes & Applicationsof NLP Customer Research: Many companies are turning to NLP to perform Sentiment Analysis that can provide an understanding of customer's buying habits, their likes, whether their comments are positive or negative, etc. Valuable insights come from understanding the customers better. The increase in social media usage has tremendously helped in Sentiment Analysis. Based on the customer's habits, the business can make marketing and sales decisions. Autocomplete feature: When we search for a particular text in Google search, we can see the autocomplete feature working. This makes it easy for us since we don't have to type everything; we can just select from the list of suggestions. NLP and the study of languages make it possible for the server to provide recommendations to us. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 7.
    Scopes & Applicationsof NLP E-mail classification: Popular Email providers use NLP algorithms to understand the tone of each email and segregate our inbox accordingly. Many emails are automatically sent into the Spam folder based on careful analysis by NLP algorithms. Automatic segregation of emails helps people save a lot of time and energy. With the immense progress of technology, gone are the days when we had to manually scan through every email. Popular email providers using such techniques are Gmail and Yahoo Mail. Automated messenger bots: Many websites have chatbots that efficiently communicate with users. Many food delivery operators like Dominos, etc. provide automated chat options to users to place their orders. With more and more experience, these chatbots are providing user-friendly communication. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 8.
    Scopes & Applicationsof NLP Financial research: NLP analyzes people's comments and views about a particular subject and provides valuable knowledge to Financial traders and companies. It can be used to track news and global happenings. Algorithms can use the information to improve the profits of businesses. Fake news detection: Fake news has become a significant problem across the world today, with an increase in social media usage. Fake news has been looked upon as a considerable issue and causes unnecessary stress and worry among people. NLP Algorithms can analyze the language and detect if it is trustworthy or not. This is extremely helpful in times when the world is facing issues like a global pandemic or a natural disaster like a cyclone. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 9.
    History of NLP (1940-1960)- Focused on Machine Translation (MT): The Natural Languages Processing started in the year 1940s: 1948 - In the Year 1948, the first recognisable NLP application was introduced in Birkbeck College, London. 1950s - In the Year 1950s, there was a conflicting view between linguistics and computer science. Now, Chomsky developed his first book syntactic structures and claimed that language is generative in nature. In 1957, Chomsky also introduced the idea of Generative Grammar, which is rule based descriptions of syntactic structures. (1960-1980) - Flavored with Artificial Intelligence (AI) In the year 1960 to 1980, the key developments were ATN and Case Grammar Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 10.
    History of NLP AugmentedTransition Networks ( ATN ): ATN is a finite state machine that is capable of recognizing regular languages. A regular language is a formal language that can be defined by a regular expression or a finite automaton. Case Grammar: Case grammar is a linguistic theory that analyzes the grammatical structure of sentences by studying the semantic roles of words. It was developed in the 1960s by American linguist Charles J. Fillmore. Case grammar is based on the idea that each verb has a specific number of deep cases, or semantic roles, that it requires to form a sentence. For example, the verb "give" requires an agent, an object, and a beneficiary. In general, grammatical case is a grammatical category that refers to inflections that indicate a word's function in a sentence. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 11.
    History of NLP Forexample: "Neha broke the mirror with the hammer". In this example case grammar identify Neha as an agent, mirror as a theme, and hammer as an instrument. In the year 1960 to 1980, key systems were: SHRDLU and LUNAR were invented : SHRDLU: SHRDLU is a program written by Terry Winograd in 1968-70. It helps users to communicate with the computer and moving objects. It can handle instructions such as "pick up the green boll" and also answer the questions like "What is inside the black box." The main importance of SHRDLU is that it shows those syntax, semantics, and reasoning about the world that can be combined to produce a system that understands a natural language. Working of SHRDLU: SHRDLU controlled a virtual robot arm that operated above a table with colored play blocks. The computer simulated the arm and its environment, and displayed the arm's activities on a TV screen. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 12.
    History of NLP Userscommunicated with SHRDLU using a keyboard, and the computer's replies appeared as subtitles on the TV screen. LUNAR: LUNAR is the classic example of a Natural Language database interface system that is used ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural language expressions into database queries and handle 78% of requests without errors. 1980 – Current: Till the year 1980, natural language processing systems were based on complex sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for language processing. 1990s: In the beginning of the year 1990s, NLP started growing faster and achieved good process accuracy, especially in English Grammar. In 1990 also, an electronic text introduced, which provided a good resource for training and examining natural language programs. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 13.
    History of NLP Otherfactors may include the availability of computers with fast CPUs and more memory. The major factor behind the advancement of natural language processing was the Internet. Now, modern NLP consists of various applications, like speech recognition, machine translation, and machine text reading. When we combine all these applications then it allows the artificial intelligence to gain knowledge of the world. Let's consider the example of AMAZON ALEXA, using this robot you can ask the question to Alexa, and it will reply to you. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 14.
    Advantages of NLP NLPhelps users to ask questions about any subject and get a direct response within seconds. NLP offers exact answers to the question means it does not offer unnecessary and unwanted information. NLP helps computers to communicate with humans in their languages. It is very time efficient. Most of the companies use NLP to improve the efficiency of documentation processes, accuracy of documentation, and identify the information from large databases. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 15.
    Disadvantages of NLP ❑NLP may not show context. ❑ NLP is unpredictable ❑ NLP may require more keystrokes. ❑ NLP is unable to adapt to the new domain, and it has a limited function that's why NLP is built for a single and specific task only Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 16.
    Components of NLP Thereare two components of NLP, Natural Language Understanding (NLU)and Natural Language Understanding (NLU).: Natural Language Understanding (NLU) which involves transforming human language into a machine-readable format. It helps the machine to understand and analyze human language by extracting the text from large data such as keywords, emotions, relations, and semantics. Natural Language Generation (NLG): NLG acts as a translator that converts the computerized data into natural language representation. It mainly involves Text planning, Sentence planning, and Text realization. The NLU is harder than NLG Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 17.
    Levels of NLP Basically,there is a total of seven independent levels to understand and extract the meaning from a text. 1. Phonology level 2. Morphology level 3. Lexical level 4. Syntactic level 5. Semantics level 6. Disclosure level 7. Pragmatic level Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 18.
    Levels of NLP 1.Phonology Level At this level basically, it deals with pronunciation. It deals with the interpretation of speech sound across words 2. Morphological Level It deals with the smallest words that convey meaning and suffixes and prefixes. Morphemes mean studying the words that are built from smaller meanings. E.g So the rabbit word has single morphemes while the rabbits have two morphemes. The ‘s’ denotes the singular and plural concepts. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 19.
    Levels of NLP 3.Lexical Level This deals with the study at the level of words with respect to their lexical meaning and Part of speech (POS) It uses the lexicon that is collection of lexemes. A Lexeme is a basic unit of lexical meaning which is an abstract unit of morphological analysis 4. Syntactic Level This level deals with the grammar and structure of sentences It studies the proper relationships between the words. The POS tagging output of lexical analysis can be used at the syntactic level of two group words into the phrase and clause brackets. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 20.
    Levels of NLP 5.Semantics Level This level deals with the meaning of words and sentences There are different two approaches : 1) Syntax driven semantic analysis 2) Semantic grammar. Its a study of meaning of words that are associated with grammatical structure. 6. Discosure Level This level deals with the structure of different kinds of text. There are 2 types of discourse : 1) Anaphora resolution 2) Discourse/ text structure recognition Words are replaced in anaphora resolution. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 21.
    Levels of NLP 7.Pragmatic Level This level deals with the use real world knowledge and understanding of how this influences the meaning of what is being communicated. Pragmatics identifies the meaning of words and phrases based on how language is used to communicate. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 22.
    Phases of NLP Theprocess of analysis in NLP can be divided into five distinct phases which are Lexical Analysis, Syntactic Analysis, Semantic Analysis, Discourse Integration, and Pragmatic Analysis. Each phase plays a crucial role in the overall understanding and processing of natural language. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 23.
    Phase 1: Lexical& Morphological Analysis Tokenization: The lexical phase in Natural Language Processing (NLP) involves scanning text and breaking it down into smaller units such as paragraphs, sentences, and words. This process, known as tokenization, converts raw text into manageable units called tokens or lexemes. Tokenization is essential for understanding and processing text at the word level. In addition to tokenization, various data cleaning and feature extraction techniques are applied, including: Lemmatization: Reducing words to their base or root form. Stopwords Removal: Eliminating common words that do not carry significant meaning, such as "and," "the," and "is." Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 24.
    Phase 1: Lexical& Morphological Analysis Correcting Misspelled Words: Ensuring the text is free of spelling errors to maintain accuracy. Morphological analysis: Morphological analysis is another critical phase in NLP, focusing on identifying morphemes, the smallest units of a word that carry meaning and cannot be further divided. Understanding morphemes is vital for grasping the structure of words and their relationships. Types of Morphemes Free Morphemes: Text elements that carry meaning independently and make sense on their own. For example, "bat" is a free morpheme. Bound Morphemes: Elements that must be attached to free morphemes to convey meaning, as they cannot stand alone. For instance, the suffix "-ing" is a bound morpheme, needing to be attached to a free morpheme like "run" to form "running." Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 25.
    Phase 1: LexicalAnd Morphological Analysis Morphological analysis is crucial in NLP for several reasons: Understanding Word Structure: It helps in deciphering the composition of complex words. Predicting Word Forms: It aids in anticipating different forms of a word based on its morphemes. Improving Accuracy: It enhances the accuracy of tasks such as part-of-speech tagging, syntactic parsing, and machine translation. By identifying and analyzing morphemes, the system can interpret text correctly at the most fundamental level, laying the groundwork for more advanced NLP applications. Prepared By Vineeth P, Asst. Professor, CNC, Maranalloor, Trivandrum
  • 26.
Phase 2: Syntactic Analysis (Parsing)
Syntactic analysis, also known as parsing, is the second phase of Natural Language Processing (NLP). This phase is essential for understanding the structure of a sentence and assessing its grammatical correctness. It involves analyzing the relationships between words and ensuring their logical consistency by comparing their arrangement against standard grammatical rules.
Role of Parsing: Parsing examines the grammatical structure and relationships within a given text. It assigns Part-of-Speech (POS) tags to each word, categorizing them as nouns, verbs, adverbs, etc. This tagging is crucial for understanding how words relate to each other syntactically and helps in avoiding ambiguity. Ambiguity arises when a text can be interpreted in multiple ways because words have various meanings. For example, the word "book" can be a noun (a physical book) or a verb (the action of booking something), depending on the sentence context.
Phase 2: Syntactic Analysis (Parsing)
Examples of Syntax: Consider the following sentences:
Correct syntax: "John eats an apple."
Incorrect syntax: "Apple eats John an."
Despite using the same words, only the first sentence is grammatically correct and makes sense. The correct arrangement of words according to grammatical rules is what makes the sentence meaningful.
Assigning POS Tags: During parsing, each word in the sentence is assigned a POS tag to indicate its grammatical category. Here is an example breakdown:
Sentence: "John eats an apple."
POS tags:
John: Proper Noun (NNP)
eats: Verb (VBZ)
Phase 2: Syntactic Analysis (Parsing)
an: Determiner (DT)
apple: Noun (NN)
Assigning POS tags correctly is crucial for understanding the sentence structure and ensuring accurate interpretation of the text.
Importance of Syntactic Analysis: By analyzing and ensuring proper syntax, NLP systems can better understand and generate human language. This analysis helps in various applications, such as machine translation, sentiment analysis, and information retrieval, by providing a clear structure and reducing ambiguity. A minimal tagging sketch follows.
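A minimal POS-tagging sketch for the example sentence, assuming NLTK with its punkt and averaged-perceptron-tagger resources downloaded.

    import nltk

    tokens = nltk.word_tokenize("John eats an apple.")
    print(nltk.pos_tag(tokens))                        # assign a POS tag to each token
    # [('John', 'NNP'), ('eats', 'VBZ'), ('an', 'DT'), ('apple', 'NN'), ('.', '.')]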
Phase 3: Semantic Analysis
Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on extracting the meaning from text. Unlike syntactic analysis, which deals with grammatical structure, semantic analysis is concerned with the literal and contextual meaning of words, phrases, and sentences.
Semantic analysis aims to understand the dictionary definitions of words and their usage in context. It determines whether the arrangement of words in a sentence makes logical sense. This phase helps in finding context and logic by ensuring the semantic coherence of sentences.
Phase 3: Semantic Analysis
Key tasks in semantic analysis include:
Named Entity Recognition (NER): NER identifies and classifies entities within the text, such as names of people, places, and organizations. These entities belong to predefined categories and are crucial for understanding the text's content.
Word Sense Disambiguation (WSD): WSD determines the correct meaning of ambiguous words based on context. For example, the word "bank" can refer to a financial institution or the side of a river. WSD uses contextual clues to assign the appropriate meaning. A minimal WSD sketch follows.
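A minimal WSD sketch using NLTK's implementation of the classic Lesk algorithm (the choice of algorithm is an assumption; the slides do not prescribe one). It picks the WordNet sense of "bank" whose definition best overlaps each context, assuming the wordnet corpus is downloaded.

    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    money = word_tokenize("I deposited my salary at the bank")
    water = word_tokenize("We sat on the grassy bank of the river")
    print(lesk(money, "bank", "n"))   # expected: a financial-institution sense
    print(lesk(water, "bank", "n"))   # expected: a sloping-land / river-bank sense
    # Lesk is heuristic, so the chosen synsets depend on definition overlap.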
Phase 3: Semantic Analysis
Examples of Semantic Analysis:
Syntactically correct but semantically incorrect: "Apple eats a John." This sentence is grammatically correct but does not make sense semantically. An apple cannot eat a person, highlighting the importance of semantic analysis in ensuring logical coherence.
Literal interpretation: "What time is it?" This phrase is interpreted literally as someone asking for the current time, demonstrating how semantic analysis helps in understanding the intended meaning.
Phase 4: Discourse Integration
Discourse Integration is the fourth phase of Natural Language Processing (NLP). This phase deals with comprehending the relationship between the current sentence and earlier sentences or the larger context. Discourse integration is crucial for contextualizing text and understanding the overall message conveyed.
Discourse integration examines how words, phrases, and sentences relate to each other within a larger context. It assesses the impact a word or sentence has on the structure of a text and how the combination of sentences affects the overall meaning. This phase helps in understanding implicit references and the flow of information across sentences.
Phase 4: Discourse Integration
In conversations and texts, words and sentences often depend on preceding or following sentences for their meaning. Understanding the context behind these words and sentences is essential to accurately interpret their meaning.
Examples of Discourse Integration:
Contextual reference: "This is unfair!" To understand what "this" refers to, we need to examine the preceding or following sentences. Without context, the statement's meaning remains unclear.
Anaphora resolution: "Taylor went to the store to buy some groceries. She realized she forgot her wallet." In this example, the pronoun "she" refers back to "Taylor" in the first sentence. Understanding that "Taylor" is the antecedent of "she" is crucial for grasping the sentence's meaning. A toy resolution sketch follows.
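A toy sketch of the idea behind anaphora resolution, not a real coreference resolver: link each pronoun to the most recent preceding proper noun, assuming NLTK for tokenization and tagging.

    import nltk

    text = ("Taylor went to the store to buy some groceries. "
            "She realized she forgot her wallet.")
    antecedent = None
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        if tag == "NNP":
            antecedent = word                          # remember the latest proper noun
        elif word.lower() in {"she", "he", "her", "his"} and antecedent:
            print(f"{word!r} -> {antecedent!r}")       # naive pronoun-to-antecedent link
    # prints 'She' -> 'Taylor', 'she' -> 'Taylor', 'her' -> 'Taylor'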
Phase 5: Pragmatic Analysis
Pragmatic Analysis is the fifth and final phase of Natural Language Processing (NLP), focusing on interpreting the inferred meaning of a text beyond its literal content. Human language is often complex and layered with underlying assumptions, implications, and intentions that go beyond straightforward interpretation. This phase aims to grasp these deeper meanings in communication.
Pragmatic analysis goes beyond the literal meanings examined in semantic analysis, aiming to understand what the writer or speaker truly intends to convey. In natural language, words and phrases can carry different meanings depending on context, tone, and the situation in which they are used.
Phase 5: Pragmatic Analysis
In human communication, people often do not say exactly what they mean. For instance, the word "Hello" can have various interpretations depending on the tone and context in which it is spoken. It could be a simple greeting, an expression of surprise, or even a signal of anger. Thus, understanding the intended meaning behind words and sentences is crucial.
Examples of Pragmatic Analysis:
Contextual greeting: "Hello! What time is it?" "Hello!" is more than just a greeting; it serves to establish contact. "What time is it?" might be a straightforward request for the current time, but it could also imply concern about being late.
Figurative expression: "I'm falling for you." The word "falling" literally means collapsing, but in this context, it means the speaker is expressing love for someone.
Phase 5: Pragmatic Analysis
Pragmatic analysis is essential for applications like sentiment analysis, conversational AI, and advanced dialogue systems. By interpreting the deeper, inferred meanings of texts, NLP systems can understand human emotions, intentions, and subtleties in communication, leading to more accurate and human-like interactions.
Common Challenges of NLP
1. Language Differences: Human language is rich and intricate, and thousands of languages are spoken around the world, each with its own grammar, vocabulary, and cultural nuances. No single person understands them all, and the productivity of human language is high. Natural language is ambiguous: the same words and phrases can have different meanings in different contexts, which is a major challenge in understanding it. Natural languages also have complex syntactic structures and grammatical rules, covering word order, verb conjugation, tense, aspect, and agreement. Human language carries rich semantic content, allowing speakers to convey a wide range of meanings through words and sentences, and it is pragmatic: how language is used in context serves communication goals. Finally, human language evolves over time through processes such as lexical change.
Common Challenges of NLP
2. Training Data: Training data is a curated collection of input-output pairs, where the input represents the features or attributes of the data and the output is the corresponding label or target. For NLP, features might include text data, and labels could be categories, sentiments, or any other relevant annotations. Good training data helps the model generalize patterns from the training set to make predictions or classifications on new, previously unseen data.
3. Development Time and Resource Requirements: The development time and resource requirements of Natural Language Processing (NLP) projects depend on various factors, including task complexity, the size and quality of the data, the availability of existing tools and libraries, and the expertise of the team involved.
Common Challenges of NLP
Some key points are as follows:
Complexity of the task: Tasks such as text classification or sentiment analysis may require less time than more complex tasks such as machine translation or question answering.
Availability and quality of data: NLP models require high-quality annotated data. Collecting, annotating, and preprocessing large text datasets can be time-consuming and resource-intensive, especially for tasks that require specialized domain knowledge or fine-grained annotations.
Algorithm selection and model development: Choosing the machine learning algorithms best suited to a given NLP task is difficult.
Training and evaluation: Training requires substantial computational resources, including powerful hardware (GPUs or TPUs) and time for iterating over the algorithms. It is also important to evaluate the model's performance with suitable metrics and validation techniques to confirm the quality of the results.
Common Challenges of NLP
4. Navigating Phrasing Ambiguities: Navigating phrasing ambiguities is crucial because of the inherent complexity of human language. Phrasing ambiguities arise when a phrase can be interpreted in multiple ways, leading to uncertainty about its meaning. Here are some key points for navigating phrasing ambiguities in NLP:
Contextual understanding: Contextual information such as previous sentences, topic focus, or conversational cues can give valuable clues for resolving ambiguities.
Semantic analysis: The text is analyzed to find meaning based on word senses, lexical relationships, and semantic roles. Tools such as word sense disambiguation and semantic role labeling can help resolve phrasing ambiguities.
Syntactic analysis: The syntactic structure of the sentence is analyzed to find the possible interpretations based on grammatical relationships and syntactic patterns.
Common Challenges of NLP
Pragmatic analysis: Pragmatic factors such as the speaker's intentions and implicatures are used to infer the meaning of a phrase. This analysis involves understanding the pragmatic context.
Statistical methods: Statistical methods and machine learning models are used to learn patterns from data and make predictions about the input phrase.
5. Misspellings and Grammatical Errors: Overcoming misspellings and grammatical errors is a basic challenge in NLP, as they are forms of linguistic noise that can reduce the accuracy of understanding and analysis. Here are some key points for handling misspellings and grammatical errors in NLP:
Spell checking: Implement spell-check algorithms and dictionaries to find and correct misspelled words.
Text normalization: The text is normalized by converting it into a standard format, which may involve tasks such as lowercasing, removing punctuation and special characters, and expanding contractions. A minimal normalization sketch follows.
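A minimal normalization sketch using only the Python standard library; the single contraction rule and the sample sentence are illustrative, and production systems layer dictionary-based spell checking on top of steps like these.

    import re
    import string

    def normalize(text: str) -> str:
        text = text.lower()                                        # case folding
        text = re.sub(r"\bcan't\b", "cannot", text)                # expand one contraction (illustrative)
        text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
        return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace

    print(normalize("I CAN'T  believe it!!!"))                     # -> "i cannot believe it"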
Common Challenges of NLP
Tokenization: The text is split into individual tokens with the help of tokenization techniques. This makes it easier to identify and isolate misspelled words and grammatical errors, and hence to correct the phrase.
Language models: Language models trained on a large corpus of data can predict how likely a word or phrase is to be correct based on its context.
6. Mitigating Innate Biases in NLP Algorithms: Mitigating innate biases in NLP algorithms is a crucial step toward ensuring fairness, equity, and inclusivity in natural language processing applications. Here are some key points for mitigating biases in NLP algorithms:
Data collection and annotation: It is very important to ensure that the training data used to develop NLP algorithms is diverse, representative, and free from biases.
Common Challenges of NLP
Bias detection and analysis: Apply bias detection and analysis methods to the training data to find biases based on demographic factors such as race, gender, and age.
Data preprocessing: Preprocessing the training data is an important way to mitigate biases, for example by debiasing word embeddings, balancing class distributions, and augmenting underrepresented samples.
Fair representation learning: NLP models are trained to learn fair representations that are invariant to protected attributes such as race or gender.
Model auditing and evaluation: NLP models are evaluated for fairness and bias with the help of metrics and audits. Models are evaluated on diverse datasets, and post-hoc analyses are performed to find and mitigate innate biases in NLP algorithms.
Common Challenges of NLP
7. Words with Multiple Meanings: Words with multiple meanings pose a lexical challenge in Natural Language Processing because of their ambiguity. Such words, known as polysemous or homonymous words, have different meanings depending on the context in which they are used. Here are some key points for addressing this lexical challenge in NLP:
Semantic analysis: Implement semantic analysis techniques to find the underlying meaning of a word in various contexts. Semantic representations such as word embeddings or semantic networks can capture the similarity and relatedness between different word senses.
Domain-specific knowledge: Domain-specific knowledge is very important in NLP tasks, as it can provide valuable context and constraints for determining the correct sense of a word.
Common Challenges of NLP
Multi-word Expressions (MWEs): The meaning of the entire sentence or phrase is analyzed to disambiguate a word with multiple meanings.
Knowledge graphs and ontologies: Apply knowledge graphs and ontologies to find the semantic relationships between words in different contexts.
8. Addressing Multilingualism: It is very important to address language diversity and multilingualism in Natural Language Processing to ensure that NLP systems can handle text data in multiple languages effectively. Here are some key points for addressing language diversity and multilingualism:
Multilingual corpora: Multilingual corpora consist of text data in various languages and serve as valuable resources for training NLP models and systems.
Cross-lingual transfer learning: These techniques transfer knowledge learned from one language to another.
Common Challenges of NLP
Language identification: Design language identification models to automatically detect the language of a given text (a minimal sketch follows at the end of this slide).
Machine translation: Machine translation enables communication and information access across language barriers and can serve as a preprocessing step for multilingual NLP tasks.
9. Reducing Uncertainty and False Positives in NLP: Reducing uncertainty and false positives is a crucial task in Natural Language Processing (NLP), as it improves the accuracy and reliability of NLP models. Here are some key points for approaching the solution:
Probabilistic models: Use probabilistic models to quantify the uncertainty in predictions. Probabilistic models such as Bayesian networks give probabilistic estimates of outputs, allowing uncertainty quantification and better decision-making.
Confidence scores: Confidence scores or probability estimates are calculated for NLP predictions to assess the certainty of the model's output. Confidence scores help identify cases where the model is uncertain or likely to produce false positives.
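A minimal language-identification sketch, assuming the third-party langdetect package (pip install langdetect); detection is probabilistic, so results can vary on very short inputs.

    from langdetect import detect

    samples = [
        "Natural language processing is fascinating.",
        "El procesamiento del lenguaje natural es fascinante.",
        "Le traitement automatique du langage naturel est fascinant.",
    ]
    for text in samples:
        print(detect(text), "->", text)                # expected: 'en', 'es', 'fr'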
Common Challenges of NLP
Threshold tuning: For classification tasks, decision thresholds are adjusted to balance sensitivity (recall) against specificity. False positives in NLP can be reduced by setting appropriate thresholds, as in the sketch following this slide.
Ensemble methods: Apply ensemble learning techniques that combine multiple models to reduce uncertainty.
10. Facilitating Continuous Conversations with NLP: Facilitating continuous conversations with NLP involves developing systems that understand and respond to human language in real time, enabling seamless interaction between users and machines. Real-time natural language processing pipelines make it possible to analyse and interpret user input as it is received; algorithms and systems are optimized for low-latency processing to ensure quick responses to user queries and inputs. Contextual understanding enables systems to interpret user intent, track conversation history, and generate relevant responses based on the ongoing dialogue. Intent recognition algorithms are applied to find the underlying goals and intentions expressed by users in their messages.
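A minimal sketch of confidence scores and threshold tuning with scikit-learn (an assumed library); the tiny dataset and the 0.7 threshold are illustrative only.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["free prize, claim now", "meeting at 10am",
             "win cash today", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]                              # 1 = spam, 0 = not spam

    vec = TfidfVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

    prob = clf.predict_proba(vec.transform(["claim your free cash"]))[0, 1]
    THRESHOLD = 0.7                                    # above 0.5 trades recall for fewer false positives
    print(f"spam probability {prob:.2f} -> flagged: {prob >= THRESHOLD}")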
How to Overcome NLP Challenges
Overcoming the challenges in NLP requires a combination of innovative technologies, domain expertise, and sound methodological approaches. Here are some key points for overcoming the challenges of NLP tasks:
Quantity and quality of data: High-quality, diverse data is needed to train NLP algorithms effectively. Data augmentation, data synthesis, and crowdsourcing are techniques for addressing data scarcity issues.
Ambiguity: The NLP algorithm should be trained to disambiguate words and phrases.
Out-of-vocabulary words: Techniques such as subword tokenization, character-level modeling, and vocabulary expansion are implemented to handle out-of-vocabulary words.
Lack of annotated data: Techniques such as transfer learning and pre-training can be used to transfer knowledge from large datasets to specific tasks with limited labeled data.
Machine Learning in NLP
Machine Learning (ML) has revolutionized the field of Natural Language Processing (NLP), allowing computers to understand, interpret, and generate human language more effectively. By leveraging statistical techniques and algorithms, ML empowers NLP systems to handle a vast array of complex language-related tasks.
Key ML techniques in NLP are:
Supervised learning
Unsupervised learning
Deep learning
What is Supervised Learning
Supervised learning is a type of machine learning where a model is trained on a labelled dataset. This means each data point is paired with a correct output label. The model learns to map input data to the correct output by analysing the training data and adjusting its internal parameters.
Key Characteristics:
Labelled data: The training data is explicitly labelled with the correct output.
Model training: The model learns to recognize patterns and relationships between the input data and the corresponding output labels.
Prediction: Once trained, the model can make predictions on new, unseen data.
Supervised Learning in NLP
Text classification: Categorizing text documents into predefined classes (e.g., spam detection, sentiment analysis).
Named Entity Recognition (NER): Identifying and classifying named entities (e.g., persons, organizations, locations).
Part-of-speech tagging: Assigning grammatical tags to words (e.g., noun, verb, adjective).
A minimal classification sketch follows.
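A minimal supervised text-classification sketch with scikit-learn (an assumed library); the four-sentence sentiment dataset is illustrative only.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = ["I love this movie", "great film",
                   "terrible acting", "worst plot ever"]
    train_labels = ["pos", "pos", "neg", "neg"]        # labelled data: each input paired with an output

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)               # learn a mapping from text to label

    print(model.predict(["what a great movie"]))       # prediction on unseen data -> ['pos']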
What is Unsupervised Learning
Unsupervised learning is a type of machine learning where the model learns patterns from unlabelled data. Unlike supervised learning, there are no predefined output labels. The model must identify underlying structures and relationships within the data itself.
Key Characteristics:
Unlabelled data: The training data is not explicitly labelled.
Pattern discovery: The model seeks to identify patterns, clusters, or anomalies in the data.
Self-organization: The model learns to group similar data points together or to identify outliers.
Unsupervised Learning in NLP
Topic modeling: Discovering abstract topics within a collection of documents.
Word embedding: Representing words as numerical vectors, capturing semantic and syntactic relationships.
A minimal topic-modeling sketch follows.
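A minimal topic-modeling sketch using LDA in scikit-learn (an assumed library); the corpus and the number of topics are illustrative.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the team scored a late goal", "the match ended in a draw",
            "stocks fell on the open market", "investors traded shares all day"]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)  # no labels used

    terms = vec.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        print(f"topic {i}:", [terms[j] for j in topic.argsort()[-3:]])      # top words per topic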
What is Deep Learning
Deep learning is a subset of machine learning that utilizes artificial neural networks with multiple layers to learn complex patterns from data. It is inspired by the structure and function of the human brain, where neurons are interconnected to process information.
Key Characteristics:
Artificial neural networks: Deep learning models are composed of interconnected layers of artificial neurons.
Hierarchical learning: Information is processed in multiple layers, with each layer learning increasingly complex features.
What is Deep Learning
Feature learning: Deep learning models can automatically learn relevant features from raw data, reducing the need for manual feature engineering.
Common Deep Learning Architectures:
Convolutional Neural Networks (CNNs): Primarily used for image and video analysis.
Recurrent Neural Networks (RNNs): Designed for sequential data, such as text and time series data.
Long Short-Term Memory (LSTM) Networks: A type of RNN capable of learning long-term dependencies.
Transformer Networks: A powerful architecture that leverages self-attention mechanisms for tasks like machine translation and text generation.
Deep Learning in NLP
Recurrent Neural Networks (RNNs): Processing sequential data, such as text, by considering the order of words.
Long Short-Term Memory (LSTM) Networks: A type of RNN capable of learning long-term dependencies.
Transformer models: Powerful models that leverage self-attention mechanisms to capture complex relationships between words.
A minimal LSTM sketch follows.
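A minimal sketch of an LSTM text classifier in PyTorch (an assumed framework); the vocabulary size, dimensions, and random input are illustrative only.

    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)   # token ids -> dense vectors
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_classes)       # last hidden state -> class scores

        def forward(self, token_ids):
            embedded = self.embed(token_ids)                   # (batch, seq_len, embed_dim)
            _, (hidden, _) = self.lstm(embedded)               # hidden: (1, batch, hidden_dim)
            return self.fc(hidden[-1])                         # logits, shape (batch, num_classes)

    model = LSTMClassifier()
    dummy_ids = torch.randint(0, 1000, (1, 12))                # one sequence of 12 token ids
    print(model(dummy_ids).shape)                              # torch.Size([1, 2])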
Applications of Machine Learning in NLP
Sentiment analysis: Determining the sentiment expressed in text (positive, negative, neutral).
Machine translation: Translating text from one language to another.
Text summarization: Condensing long texts into shorter versions.
Chatbots and virtual assistants: Developing conversational agents that can interact with users.
Information extraction: Extracting structured information from unstructured text.
Text generation: Generating human-quality text, such as articles or code.