
1.Chapter1 Introduction Chapter2 LanguageCharacteristics

Uploaded by

Minh Mai Ngọc
Copyright
© © All Rights Reserved

Natural Language Processing

AC3110E

Chapter 1: Introduction

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Human language

• Language = Speech + Text


• Speech: the oldest and most natural mode of language communication
• Text: most of human knowledge and complex information is maintained and
transmitted in written texts
• Language is the center of human intelligence

Natural language processing (NLP)

• NLP: a branch of Computer Science and Artificial Intelligence
• evolved from Computational Linguistics (linguistics and computer science)
• combines computational linguistics, machine learning, and deep learning models to process human language.
• Give computers the ability to understand and use text and spoken words in the same way human beings can
• Enable computers to process human language in the form of text or voice data
• Enable computers to ‘understand’ the full meaning of that language, including the speaker’s or writer’s intent and sentiment.
• Enable computers to perform useful tasks with the natural languages humans use

Image src: Freepik
Examples of NLP Applications

... and many more

src: Internet
NLP applications

Domains:
• Banking & Financial Services Industry
• IT and ITeS
• Retail and eCommerce
• Healthcare and Life Sciences
• Transportation & Logistics
• Government & Public Sector
• Media & Entertainment
• Manufacturing, Education, Automotive, etc...

https://www.statista.com/statistics/607891/worldwide-natural-language-processing-market-revenues/
NLP Components

• Natural Language Understanding (NLU)
• Uses analysis of text and speech to determine the meaning of a sentence
• Maps the given natural-language input into useful representations
• Covers different aspects of the language: context, semantics, syntax, intent, and sentiment of the text, phonology, intonation, etc.
• Natural Language Generation (NLG)
• Produces meaningful phrases and sentences in the form of natural language: text and speech
• Includes tasks such as machine translation, text summarization, essay generation, speech synthesis, etc., which aim to produce coherent and fluent text and speech
Technologies in NLP

Speech processing:
• Speech synthesis
• Speech recognition
• Speaker recognition
• Speaker verification
• Speech encoding
• etc.

Text processing:
• Language modeling
• Text Summarization
• Text classification
• Text Retrieval
• Text Data Mining
• Question Answering
• Report Generation
• Translation Technologies
• Dialogue systems
• etc.

• Dictionaries
• Morphological and syntactic grammars
• Rules for semantic interpretation
• Pronunciation
• Intonation

THIS COURSE!
• Link to concepts and tasks in the real world

Image src: Hans Uszkoreit, Language Technology, DFKI


Why is NLP Hard?

The boy saw the man on the mountain with a telescope

Ambiguity! Who is on the mountain, and who has the telescope?

Artificial Intelligence Notes Unit 4, CSVTU
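
One way to make the ambiguity of this sentence concrete is to count how many parse trees a grammar allows for it. The sketch below is only an illustration under stated assumptions: the tiny grammar, the lexicon and the function name `count_parses` are invented for this example and are not part of the course material. It fills a CKY-style chart and counts derivations instead of building trees.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption made for this sketch).
# Each rule is (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # a prepositional phrase can modify the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # ... or it can modify a noun phrase
    ("PP", "P", "NP"),
]

# Toy lexicon: one part of speech per word (also an assumption).
LEXICON = {
    "the": "Det", "a": "Det",
    "boy": "N", "man": "N", "mountain": "N", "telescope": "N",
    "saw": "V",
    "on": "P", "with": "P",
}


def count_parses(sentence):
    """Count the parse trees of `sentence` under the toy grammar (CKY chart)."""
    words = sentence.lower().split()
    n = len(words)
    # chart[i][j][X] = number of ways category X derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1][LEXICON[word]] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n]["S"]


print(count_parses("The boy saw the man on the mountain with a telescope"))
```

Under this toy grammar the sentence has 5 parse trees, one for each consistent way of attaching "on the mountain" and "with a telescope" to the verb, the man or the mountain. Without the two prepositional phrases ("The boy saw the man") there is exactly one.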


Why is NLP Hard?

• Phonetic ambiguity
• Lexical ambiguity
• Is “board” a noun or a verb?
• Syntax-level ambiguity
• “He lifted the beetle with red cap.” − Did he use the cap to lift the beetle, or did he lift a beetle that had a red cap?
• Semantic-level ambiguity
• “I walked to the bank” – a commercial bank or the bank of a river?
• Discourse ambiguity
• “Rima went to Gauri. She said, I am tired.” − Exactly who is tired?
• One input can have several different meanings
• Many inputs can mean the same thing
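
The "bank" example can be explored with a simplified Lesk-style heuristic: choose the sense whose dictionary gloss shares the most words with the surrounding context. This is a minimal sketch, not a production method. The two glosses, the stopword list and the function names are hypothetical and written only for this illustration.

```python
import re

# Hypothetical mini sense inventory with hand-written glosses.
SENSES = {
    "bank": {
        "financial institution":
            "an institution that accepts deposits of money and makes loans",
        "river bank":
            "the sloping land or mud along the side of a river or lake",
    }
}

# Small hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "that", "i", "my"}


def content_words(text):
    """Lower-case the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense with the largest gloss/context overlap, or None."""
    context_words = content_words(context) - {word}
    scores = {
        sense: len(context_words & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    best = max(scores, key=scores.get)
    # With no overlap at all, the context gives no evidence either way.
    return best if scores[best] > 0 else None


print(simplified_lesk("bank", "I walked to the bank to deposit money"))
print(simplified_lesk("bank", "I walked to the bank of the river"))
print(simplified_lesk("bank", "I walked to the bank"))
```

The first two calls pick the financial and the river sense respectively, because "money" and "river" appear in the matching glosses. The bare sentence from the slide returns `None`: nothing in it favours either sense, which is exactly the ambiguity being described.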

Why is NLP Hard?

• Grammar and usage exceptions, variations in sentence structure


• Synonyms
• Homonyms, homophones
• Sarcasm, idioms, metaphors
• Misspelled or misused words, mispronunciations or different accents
• Colloquialisms and slang, informal phrases, lingo
• Domain-specific language, Low-resource languages
• Need more research and development

NLP Terminology

• Phonology, Phonetics, Phonemics, Prosodics − the study of how sounds are organized systematically
• Morpheme − the primitive unit of meaning in a language
• Morphology − the shape and structure of words: the study of how words are constructed from primitive meaningful units
• Lexical analysis − segmenting text into words
• Syntax − the rules for arranging words in a sentence; the structural role of words in the sentence
• Semantics − the meaning of words in a sentence
• Discourse − meaning across sentences: how the immediately preceding sentence can affect the interpretation of the next sentence
• Pragmatics − meaning through speaker intent: using and understanding sentences in different situations, and how the situation affects the interpretation of the sentence
Levels of analysis in natural language processing

Input: a text sentence, “I want to print Ali’s .init file”

Levels of analysis, from lowest to highest:
• Lexical/Morphological analysis
• Syntactic analysis
• Semantic analysis
• Discourse integration
• Pragmatic analysis

Output: run “lpr /ali/stuff.init”
Levels of analysis in natural language processing

• Lexical or Morphological Analysis
• Identify and analyze the structure of words; divide the whole chunk of text into paragraphs, sentences, words, morphemes, etc.
• Tasks: Lemmatization, Morphological segmentation, Part-of-speech tagging, Stemming, etc.
• Syntactic Analysis
• Analyze the words in the sentence for grammar and arrange them in a manner that shows the relationships among the words.
• Tasks: Grammar induction, Sentence breaking, Parsing, etc.
• Semantic Analysis
• Draw the exact meaning or the dictionary meaning from the text. Map syntactic structures to objects in the task domain, and check for logic between entities, words, phrases and sentences in the text.
• Tasks: Lexical semantics, Distributional semantics, Terminology extraction, Word-sense disambiguation, Entity linking, Relationship extraction, Semantic parsing, Semantic role labelling, etc.
• Discourse Integration
• The meaning of any sentence depends upon the meaning of the sentences just before it.
• Tasks: Co-reference resolution, Discourse analysis, Argument mining
• Pragmatic Analysis
• Find out what the text actually means in its context of use.
• Involves deriving aspects of language which require real-world knowledge plus an understanding of the meaning of the words and the context of application.
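
As a rough illustration of the lowest level (dividing text into sentences and words), here is a minimal regular-expression sketch. The two patterns and the function names are assumptions chosen for this example; real tokenizers handle many more cases such as abbreviations, numbers and URLs.

```python
import re

# Split after ., ! or ? when followed by whitespace (a deliberately naive rule).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# A token is a word (optionally with an apostrophe part, e.g. Ali's)
# or a single punctuation character.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def split_sentences(text):
    """Break a text into sentences using the naive end-of-sentence rule."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Break one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


text = "I want to print Ali's .init file. Please do it now!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Note that this tokenizer splits the file name `.init` into `.` and `init`. Even the lexical level needs knowledge about the domain (here, that file names may start with a dot), which is one reason the higher levels of analysis cannot simply be bolted on afterwards.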

Approaches of NLP

• Symbolic NLP (1950s – early 1990s)
• based on collections of rules: complex sets of hand-written rules
• the computer emulates NLP tasks by applying those rules to the data it confronts
• hand-coding of a set of rules for manipulating symbols, coupled with a dictionary lookup
• Statistical NLP (1990s–2010s)
• introduction of machine learning algorithms for language processing
• driven by the increase in computational power and the enormous amount of data available
• Neural NLP (present)
• since the 2010s, deep neural network-style machine learning methods (featuring many hidden layers) have become widespread
• they achieve state-of-the-art results in many natural language tasks
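
The difference between the symbolic and the statistical approach can be sketched on a tiny part-of-speech tagging task. Everything below is a toy assumption made for illustration: the dictionary, the suffix rules, the three-sentence tagged corpus and the function names. The first tagger applies hand-written rules plus a dictionary lookup; the second learns the most frequent tag of each word from tagged data.

```python
from collections import Counter, defaultdict

# Symbolic approach: a hand-written dictionary plus ordered suffix rules.
DICTIONARY = {"the": "DET", "a": "DET", "board": "NOUN", "meeting": "NOUN"}
SUFFIX_RULES = [("ing", "VERB"), ("ed", "VERB"), ("ly", "ADV"), ("s", "NOUN")]


def rule_based_tag(word):
    """Tag a word by dictionary lookup, then suffix rules, then a default."""
    word = word.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    for suffix, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            return tag
    return "NOUN"  # default guess when no rule applies


# Statistical approach: learn the most frequent tag per word from data.
def train_unigram_tagger(tagged_sentences):
    """Return a dict mapping each seen word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# Hypothetical hand-tagged corpus, far too small for real use.
TOY_CORPUS = [
    [("they", "PRON"), ("board", "VERB"), ("the", "DET"), ("plane", "NOUN")],
    [("we", "PRON"), ("board", "VERB"), ("a", "DET"), ("bus", "NOUN")],
    [("the", "DET"), ("board", "NOUN"), ("met", "VERB")],
]

model = train_unigram_tagger(TOY_CORPUS)
print(rule_based_tag("board"), model["board"])
```

The rule-based tagger always answers NOUN for "board" because a person wrote that into the dictionary, while the learned tagger answers VERB because that is what this particular data shows most often. Neither looks at context, so neither resolves the ambiguity in a given sentence; that is where richer statistical and neural models come in.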

Objectives of the course

• Understand the fundamentals and approaches of natural language processing, with a focus on the text of a language
• Learn how the features of a language are represented and the basic text processing methods: word/sentence segmentation, Part-of-Speech tagging, semantic parsing, named entity recognition, etc.
• Learn basic problems in language processing: text classification, information extraction, machine translation, etc.
• Learn techniques and tools that can be used to develop real-world NLP applications.

Chapter 2: Language Characteristics
2.1 Languages and Dialects

• How many languages are there in the world?
• the exact number is not known!
• 7,168 languages (as of 2023, ethnologue.com)
• but the number changes over time!
• Language distribution: (figure)
2.1 Languages and Dialects

• How many languages are endangered?


• 3,045 languages are endangered (over 40% of languages) (as of 2023, ethnologue.com)
2.1 Languages and Dialects

• Languages
• Different languages are not mutually intelligible
• Need to be explicitly learned
• Dialects
• regional variant of a language
• involves modifications at the lexical and grammatical levels
• Dialects of the same language are assumed to be mutually intelligible
• Accent
• regional variant affecting only the pronunciation

2.2 Linguistic Description and Classification

• Language family: a group of languages having a common ancestor, or protolanguage
• languages in a family “can” share some vocabulary and grammar rules.
• a family can be further divided into branches, groups, and sub-groups
• ~142 different language families

The Indo-European family

https://www.theguardian.com/education/gallery/2015/jan/23/a-language-family-tree-in-pictures
Vietnamese language

https://en.wikipedia.org/wiki/Austroasiatic_languages
Vietnamese language

• 54 ethnic groups
• Vietnamese (Kinh ethnic group): 86%
• 5 ethnic groups < 1000 people

Image src: r/MapPorn
2.2 Linguistic Description and Classification

• Phonetics, Phonology, Tone and Prosody: sound structure of a language
• cat: 3 phonemes (/k/a/t/)
• knock: 3 phonemes (/n/o/k/)
• cream: 4 phonemes (/k/r/ē/m/)
• shadow: 4 phonemes (/sh/a/d/ō/)
2.2 Linguistic Description and Classification

• Word formation in a language (Morphology)
• Morphemes: the smallest meaningful units of language, which cannot be broken into smaller parts
• Free morphemes: can occur on their own
• Bound morphemes (prefix or suffix): have to attach to other morphemes (-ly, -er, -ism, -ness, etc.)
• Example: unforgettable
• 12 phonemes: / ˌənfərˈɡedəb(ə)l /
• 3 morphemes: un (prefix) + forget (root) + able (suffix)
• 3 main types of word formation:
• Compounding: tea + pot => teapot
• Derivation: inflate (verb) => inflatable (adjective)
• Inflection: oscillate + s (changes the grammatical features of the word) => oscillates
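
The examples above can be reproduced with a small dictionary-based morpheme segmenter. This is a sketch under strong assumptions: the root list, the affix lists and the two spelling repairs (a doubled final consonant and a dropped final "e") are hypothetical and cover only these examples, not English morphology in general.

```python
# Hypothetical toy lexicon covering only the examples in the text.
ROOTS = {"forget", "inflate", "oscillate", "tea", "pot"}
PREFIXES = ["un"]
SUFFIXES = ["able", "s"]


def find_root(stem):
    """Return the root behind a stem, undoing two common spelling changes."""
    if stem in ROOTS:
        return stem
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[:-1] in ROOTS:
        return stem[:-1]      # doubled consonant: forgett -> forget
    if stem + "e" in ROOTS:
        return stem + "e"     # dropped final e: inflat -> inflate
    return None


def segment(word):
    """Split a word into morphemes: prefix/root/suffix, or two compounded roots."""
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            root = find_root(stem)
            if root:
                return [m for m in (prefix, root, suffix) if m]
    # Compounding: try to split the word into two known roots.
    for i in range(1, len(word)):
        if word[:i] in ROOTS and word[i:] in ROOTS:
            return [word[:i], word[i:]]
    return [word]  # unknown word: leave it unsegmented


for w in ["unforgettable", "inflatable", "oscillates", "teapot"]:
    print(w, "->", "-".join(segment(w)))
```

The spelling repairs are the interesting part: the written form "unforgettable" contains "forgett", not "forget", so a segmenter that only cuts strings at affix boundaries would miss the root.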

2.2 Linguistic Description and Classification

The degree of morphological complexity of a language


• Isolating languages
• “have no morphology”
• grammatical features are expressed by simple juxtaposition of invariant morphemes.

• Synthetic language - Agglutinative
• combines several morphemes per word in a linear sequence
• evlerinden in Turkish (from their house)
• Synthetic language - Fusional or Inflected
• contains several morphemes per word, but not in linear combination
• inflectional morpheme: go -> went
• ...
2.2 Linguistic Description and Classification

• Sentence structures
• The relative ordering of subject (S), verb (V), and object (O).
• The six resulting possible word orders: SOV, SVO, VSO, VOS, OVS, and OSV
• Most common: SOV, SVO
• Many languages are not limited to just one of these types but allow several different word orders
• Phrasal structure
• modifiers (adjectives or relative clauses) typically either precede or follow the head words they modify
2.3 Writing Systems

• Orthography

Image source: en.wikipedia.org


2.3 Writing Systems

https://en.wikipedia.org/wiki/List_of_writing_systems
2.3 Writing Systems

• Writing system: set of notational symbols along with usage conventions
• Grapheme: basic unit of a writing system
• Logographic system: each grapheme stands for a meaningful unit, word or morpheme
• Chinese characters (Hànzì) in Mandarin Chinese, Kanji in Japanese, Hanja in Korean, Hán tự in Vietnamese.
• Phonographic system: graphemes represent sound units (phonemes)
• Alphabetic writing system
• laugh /lɑːf/, bright /braɪt/, ghost /ɡəʊst/
• believe /bɪˈliːv/, people /ˈpiːpᵊl/, tree /triː/

https://wals.info/chapter/141
2.3 Writing Systems

• Consonantal/segmental writing systems: graphemes represent only consonants (and possibly long vowels) but omit short vowels.
• In Arabic, only consonants and the long vowels are represented by the basic letters, whereas short vowels and certain other types of phonetic information are indicated by diacritics
• Example: “I like to travel to Cairo”, written with diacritics and without diacritics
• Syllabic writing systems: each grapheme stands for an entire syllable
• Japanese Hiragana system
• Featural writing system: based upon the articulatory features that underlie the phoneme (such as voicing and place of articulation)
• Korean Hangul system

https://wals.info/chapter/141
https://www.britannica.com
2.3 Writing Systems

Schultz, Tanja, and Katrin Kirchhoff, eds. Multilingual speech processing. Elsevier, 2006
2.3 Writing Systems

• Digital Representation and Input Methods


• Romanization
• Universal character encoding standard: Unicode
• Grapheme-to-Phoneme Mapping
• Linguistic Data Resources
• Linguistic Data Consortium (LDC)
• European Language Resources Association (ELRA)
• COCOSDA
• OLAC
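
One practical consequence of Unicode for text processing is that the same visible character can be stored in more than one way. The sketch below uses Python's standard `unicodedata` module to compare the composed (NFC) and decomposed (NFD) forms of a Vietnamese letter, and to strip diacritics, which is a crude way of matching text typed without accents. The helper name `strip_diacritics` is an assumption made for this example.

```python
import unicodedata

# The Vietnamese letter "o with circumflex and dot below" (U+1ED9).
composed = unicodedata.normalize("NFC", "\u1ed9")
decomposed = unicodedata.normalize("NFD", "\u1ed9")

# Same visible letter, different code point sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed_text = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(
        ch for ch in decomposed_text if unicodedata.category(ch) != "Mn"
    )


print(strip_diacritics("tôi ăn một cái bánh to"))
```

Comparing strings without normalizing them first is a common source of silent mismatches in dictionaries and search. Note also that diacritic stripping is lossy and incomplete: a letter such as đ has no decomposition and passes through unchanged, and in Vietnamese the removed marks distinguish different words.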

Example: Vietnamese, Japanese and French compared

Vietnamese:
• Language family: Austro-Asiatic, Mon-Khmer branch; isolating language
• Word formation: syllables are separated by white space; a word is made of one, two or more syllables
• Sentence structure: subject–verb–object; word order expresses grammatical relationships; nouns have no grammatical number or gender
• Phrasal structure: modified then modifier
• Examples:
+ tôi ăn một cái bánh to = I eat one CLASSIFIER cake big (I eat a big cake)
+ tôi đang ăn (I’m eating)
+ tôi đã ăn (I ate)
+ tôi sẽ ăn (I will eat)
+ đi ăn (go for eating) vs. ăn đi (let’s eat)

Japanese:
• Language family: Japonic; agglutinative language
• Word formation: words consist of multiple linear morphemes; no space between words
• Sentence structure: subject–object–verb; verbs are conjugated primarily for tense and passive voice
• Phrasal structure: modifier then modified
• Example: 日本の家電製品は世界で有名です。 (Japanese electronic products are famous worldwide)
日本 (Japan) の (of) 家電 (electronic) 製品 (product) は (topic marker) 世界で (worldwide) 有名 (famous) です (is)

French:
• Language family: Indo-European, Romance language branch; inflected language
• Word formation: a word has many forms to express grammatical categories such as tense, number, gender, etc.
• Sentence structure: subject–verb–object; verbs and nouns are inflected to agree in gender and number
• Phrasal structure: modified then modifier, with a lot of irregular cases
• Examples:
+ J’ai un produit naturel (I have a natural product)
+ Cette peinture a beaucoup de couleurs naturelles (This painting has a lot of natural colors)
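
Because Vietnamese separates syllables (not words) with white space, a first processing step is usually to group syllables into words. A classic baseline is greedy longest matching against a word list, sketched below. The four-entry lexicon, the three-syllable limit and the function name `segment_words` are assumptions for illustration only; practical segmenters use large dictionaries or trained models.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that lexicon and input compare reliably."""
    return unicodedata.normalize("NFC", text)


# Hypothetical mini lexicon of multi-syllable Vietnamese words.
LEXICON = {nfc(w) for w in ["sinh viên", "xử lý", "ngôn ngữ", "tự nhiên"]}
MAX_WORD_SYLLABLES = 3


def segment_words(sentence, lexicon=LEXICON, max_len=MAX_WORD_SYLLABLES):
    """Greedy longest-match segmentation of a syllable sequence into words."""
    syllables = nfc(sentence).split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            # A single syllable is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# "Students study natural language processing"
print(segment_words("sinh viên học xử lý ngôn ngữ tự nhiên"))
```

With this lexicon the eight syllables are grouped into five words, with "học" left as a single-syllable word. The same longest-match idea is a common baseline for Japanese, where there are no spaces at all and the units being grouped are characters. Greedy matching can go wrong when a shorter word followed by another word is the right reading.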
• End of chapter 2
