
Chapter 1 + 2

Date : 24-Dec-2024
Time : 04:51 pm

Tags: language_processing

Chapter 1

Text normalization => converting text to a more convenient, standard form

- tokenization => separating running text into words (e.g. splitting on whitespace)
- sentence segmentation => splitting running text into sentences

2.1 Regular Expressions

Regex patterns are case-sensitive

Basic patterns:

- /character/ => matches the literal character or sequence of characters between the slashes
- [wW] => disjunction; matches w or W (a case-insensitive match for that letter)
- [1-5] => range; matches any one character in the range
- [^a] => negation; matches any single character (including special characters) except a; ^ negates only when it is the first character after [
- [^a-z] => matches any single character that is not a lowercase letter
- ? => optional; the preceding character or nothing; /an?/ => a or an
- * => Kleene star; 0 or more occurrences of the previous character; /an*/ => a, an, annnnnn
- + => Kleene plus; 1 or more occurrences of the previous character or range; /[0-9]+/ => at least one digit
- . => wildcard; matches any single character; /.*/ => any number of any characters
Anchors:

- ^ => matches at the start of a line
- $ => matches at the end of a line
- \b => matches a word boundary
- \B => matches a non-boundary

For \b, a word is a sequence of digits, underscores, or letters:

- /\b99\b/ matches 99 on its own ("there are 99 bottles")
- it does not match the 99 in 299 (the 9 follows a digit, so there is no boundary)
- it does match the 99 in $99 ($ is not a digit, underscore, or letter)
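These patterns can be tried directly with Python's re module; a minimal sketch (Python drops the enclosing slashes):

```python
import re

# Disjunction and ranges
print(re.findall(r"[wW]oodchuck", "Woodchuck saw a woodchuck"))  # ['Woodchuck', 'woodchuck']
print(re.findall(r"[0-9]+", "room 101, floor 3"))                # ['101', '3']

# Optionality and Kleene star
print(re.search(r"colou?r", "color").group(0))   # 'color'
print(re.search(r"an*", "annnn").group(0))       # 'annnn'

# Word boundaries: \b99\b matches in "$99" but not in "299"
print(bool(re.search(r"\b99\b", "$99")), bool(re.search(r"\b99\b", "299")))  # True False
```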

2.1.2 Disjunction, Grouping and Precedence

- | => pipe (disjunction); /cat|dog/ => either cat or dog
- () => grouping and precedence; /pupp(y|ies)/ => puppy or puppies
- {n} => exact number of occurrences of the previous character or expression; /a{2}/ => aa
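A quick sketch of these operators in Python:

```python
import re

# Disjunction with | and grouping with ()
print(re.findall(r"cat|dog", "the cat chased the dog"))     # ['cat', 'dog']
print(re.search(r"pupp(y|ies)", "three puppies").group(0))  # 'puppies'

# {n}: exactly n occurrences of the previous character
print(bool(re.fullmatch(r"a{2}", "aa")))   # True
print(bool(re.fullmatch(r"a{2}", "aaa")))  # False
```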

Operator precedence (highest to lowest): parentheses (), counters (* + ? {}), sequences and anchors, disjunction (|)

2.1.4 More operators

Common aliases:

- \d => any digit; \D => any non-digit
- \w => any word character (digit, underscore, or letter); \W => any non-word character
- \s => whitespace (space, tab); \S => any non-whitespace character

Example - match mentions of an amount of memory in gigabytes:

\b[0-9]+(\.[0-9]+)? *((GB|gb)|[Gg]igabytes)\b

(one or more digits, an optional fractional part, optional spaces, then GB, gb, Gigabytes, or gigabytes, bounded on both sides)
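A quick sanity check of that pattern, dropped as-is into Python's syntax:

```python
import re

# The disk-space pattern above, in Python syntax
pattern = r"\b[0-9]+(\.[0-9]+)? *((GB|gb)|[Gg]igabytes)\b"

for text in ["I need 1.5 GB of space", "about 32 Gigabytes", "a 5gb stick", "rugby match"]:
    m = re.search(pattern, text)
    print(text, "=>", m.group(0) if m else "no match")
```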

2.1.6 Substitution, Capture groups and ELIZA

Substitution => s/pattern/replacement/; \1 in the replacement refers back to the first capture group

Backreference example:

/the (.*)er it is, the \1er it will be/

\1 must repeat whatever the first capture group (.*) matched, so this matches "The bigger it is, the bigger it will be" but not "The bigger it is, the faster it will be".

Capture group => use of parentheses to store a pattern in a numbered register (\1, \2, ...)

Non-capture group => add ?: after the open parenthesis, (?: pattern ), to group without storing

/(?:some|a few) (people|cats) like some \1/

- some cats like some cats ✅
- some people like some cats ❌ (\1 must repeat exactly what (people|cats) matched)
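A minimal sketch of backreferences and substitution in Python (re.sub takes \1 in the replacement):

```python
import re

# Backreference in a match: \1 must repeat what the first group captured
pattern = r"the (.*)er it is, the \1er it will be"
print(bool(re.search(pattern, "the bigger it is, the bigger it will be")))  # True
print(bool(re.search(pattern, "the bigger it is, the faster it will be")))  # False

# Substitution with a capture group: wrap every number in angle brackets
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))  # 'the <35> boxes'
```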
2.1.7 Lookahead assertions

- (?= pattern) => zero-width positive lookahead; succeeds if pattern matches ahead, without consuming any text
- (?! pattern) => zero-width negative lookahead; succeeds if pattern does not match ahead

2.2 Words

Disfluencies:

- fragment => a broken-off word, e.g. the main- in "main- mainly"
- filler or filled pause => uh, ah, umm

Herdan's Law (also Heaps' Law) => vocabulary size grows with corpus size: |V| = kN^β, where N is the number of tokens and 0 < β < 1

Lemma => set of lexical forms having the same stem, the same major part of speech, and the same word sense
e.g. cat and cats are forms of the same lemma

2.3 Corpora

Code switching => using multiple languages in a single communicative act

2.4 Simple Unix tools for word tokenization

Text normalization process:

1. Word tokenization
2. Word format normalization
3. Sentence segmentation
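The classic Unix pipeline for step 1 plus counting is tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c; a rough Python stand-in:

```python
import re
from collections import Counter

text = "The cat sat on the mat and the cat slept"

# tr -sc 'A-Za-z' '\n'  =>  split into alphabetic tokens
tokens = re.findall(r"[A-Za-z]+", text.lower())

# sort | uniq -c  =>  count each word type
for word, n in Counter(tokens).most_common(3):
    print(n, word)  # 3 the / 2 cat / then one of the 1-count words
```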

2.5 Word and subword tokenization

Top-down tokenization => rule-based tokenization

Bottom-up tokenization => subword tokens induced from data (e.g. byte-pair encoding)

Clitic => a part of a word that can't stand on its own and occurs only attached to another word
e.g. the 're in we're, the 'm in I'm

Penn Treebank Tokenization:

- separates out clitics (doesn't => does n't)
- keeps hyphenated words together
- separates out all punctuation
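One way to try Penn Treebank-style tokenization is NLTK's TreebankWordTokenizer (assumes nltk is installed; the exact token list can vary by version):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("The San Francisco-based restaurant doesn't charge $10.")
print(tokens)
# Clitics and punctuation split apart ('does', "n't", '$', '10', '.'),
# while the hyphenated 'Francisco-based' stays together.
```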

2.5.2 Byte-pair Encoding ( Bottom-up tokenization )

A morpheme is the smallest meaning-bearing unit of a language.
e.g. unwashable => un-, wash, and -able

A subword tokenization scheme has two parts:

- token learner => induces a vocabulary from a training corpus
- token segmenter => segments new text using that vocabulary

Algorithms:

1. Byte-pair encoding (BPE) - see the sketch after this list
   - start from a vocabulary of all individual characters
   - repeatedly count all pairs of adjacent symbols and merge the most frequent pair into a new symbol, k times
   - vocab => all initial characters + k merged symbols
2. Unigram language modeling
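A minimal sketch of the BPE token learner; following the textbook's convention, "_" marks end of word, and the toy corpus is the textbook-style low/newer/wider example:

```python
from collections import Counter

def learn_bpe(word_counts: dict[str, int], k: int) -> list[tuple[str, str]]:
    # Represent each corpus word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("_",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(k):
        # Count all pairs of adjacent symbols across the corpus.
        pairs = Counter()
        for symbols, n in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Merge every occurrence of the best pair into one new symbol.
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, n in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = n
        vocab = new_vocab
    return merges

# Frequent subwords like er_ and new emerge after a few merges:
print(learn_bpe({"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}, k=4))
# [('e', 'r'), ('er', '_'), ('n', 'e'), ('ne', 'w')]
```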

2.6 Word Normalization, Lemmatization and Stemming

Normalization => task of putting words and tokens into a standard format

2.6.1 Lemmatization

Lemmatization => task of determining whether two words have the same root despite their surface differences
e.g. am, are, and is share the lemma be

Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.

- stem => the central morpheme, supplying the main meaning
- affixes => add additional meanings of various kinds

Porter Stemmer [1] => a widely used rule-based stemmer (a cascade of suffix-rewriting rules)
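A quick look at Porter stemming via NLTK's implementation (assumes nltk is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cats", "operating", "operates", "operation"]:
    print(word, "=>", stemmer.stem(word))
# 'cats' => 'cat'; the operat- forms all collapse to the same stem ('oper'),
# showing that stemming chops affixes more crudely than lemmatization.
```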

2.7 Sentence Segmentation

Sentence boundaries usually come from punctuation (. ! ?); the period is ambiguous between a sentence boundary and an abbreviation marker (Mr., Inc.), so tokenization and segmentation are often done jointly.
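A sketch using NLTK's punkt models (an assumption: the models have already been fetched with nltk.download):

```python
from nltk.tokenize import sent_tokenize

text = "Mr. Smith went to Washington. He arrived at 5 p.m. and left quickly."
print(sent_tokenize(text))
# The periods in 'Mr.' and 'p.m.' mark abbreviations, not sentence
# boundaries, so this should yield two sentences.
```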

2.8 Minimum Edit Distance (minimum number of editing operations needed to transform one string into another)

Coreference => deciding whether two strings refer to the same entity
e.g. Stanford President Marc ~ Stanford University President Marc

Levenshtein distance weighting:

- insertion / deletion => cost 1
- substitution => cost 2 (1 insertion + 1 deletion)

2.8.1 The Minimum Edit Distance Algorithm

- Levenshtein algorithm => dynamic programming: fill a table D[i][j] of distances between prefixes (see the sketch below)
- grep => Global Regular Expression Print
- Backtracing => keep pointers while filling the table to recover the alignment, not just the distance
- Weighted minimum edit distance => allow per-character operation costs
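A minimal sketch of the algorithm with the Levenshtein costs above (insertion/deletion 1, substitution 2), filled in bottom-up:

```python
def min_edit_distance(source: str, target: str) -> int:
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # delete all of source[:i]
    for j in range(1, m + 1):
        D[0][j] = j  # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,               # deletion
                          D[i][j - 1] + 1,               # insertion
                          D[i - 1][j - 1] + sub_cost)    # substitution / copy
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8, the textbook example
```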

References

1. Porter stemmer: https://tartarus.org/martin/PorterStemmer/
2. CS124: https://www.youtube.com/playlist?list=PLaZQkZp6WhWyvdiP49JG-rjyTPck_hvEu
