
Chapter 1 + 2

Date : 24-Dec-2024
Time : 04:51 pm

Tags: language_processing

Chapter 1

Text normalization => converting text to a more convenient, standard form

- tokenization => separating running text into words (e.g. splitting on whitespace)
- sentence segmentation => splitting running text into sentences

2.1 Regular Expressions

Regex patterns are case-sensitive

Basic patterns:

- /character/ => matches the literal character or sequence of characters between the slashes
- [wW] => disjunction; matches w or W (a case-insensitive match for that letter)
- [1-5] => range; matches any one character in the range
- [^a] => negation; matches any single character (including special characters) except a; ^ negates only when it is the first character after [
- [^a-z] => matches any single character that is not a lowercase letter
- ? => optional; the preceding character or nothing; /an?/ => a or an
- * => Kleene star; 0 or more occurrences of the previous character; /an*/ => a, an, annnnnn
- + => Kleene plus; 1 or more occurrences of the previous character or range; /[0-9]+/ => at least one digit
- . => wildcard; matches any single character; /.*/ => any number of any characters
Anchors:

- ^ => matches at the start of a line
- $ => matches at the end of a line
- \b => matches a word boundary
- \B => matches a non-boundary

For \b, a word is a sequence of digits, underscores, or letters:

- /\b99\b/ matches 99 on its own ("there are 99 bottles")
- it does not match the 99 in 299 (the 9 follows a digit, so there is no boundary)
- it does match the 99 in $99 ($ is not a digit, underscore, or letter)
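These patterns can be tried directly with Python's re module; a minimal sketch (Python drops the enclosing slashes):

```python
import re

# Disjunction and ranges
print(re.findall(r"[wW]oodchuck", "Woodchuck saw a woodchuck"))  # ['Woodchuck', 'woodchuck']
print(re.findall(r"[0-9]+", "room 101, floor 3"))                # ['101', '3']

# Optionality and Kleene star
print(re.search(r"colou?r", "color").group(0))   # 'color'
print(re.search(r"an*", "annnn").group(0))       # 'annnn'

# Word boundaries: \b99\b matches in "$99" but not in "299"
print(bool(re.search(r"\b99\b", "$99")), bool(re.search(r"\b99\b", "299")))  # True False
```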

2.1.2 Disjunction, Grouping and Precedence

- | => pipe (disjunction); /cat|dog/ => either cat or dog
- () => grouping and precedence; /pupp(y|ies)/ => puppy or puppies
- {n} => exact number of occurrences of the previous character or expression; /a{2}/ => aa
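A quick sketch of these operators in Python:

```python
import re

# Disjunction with | and grouping with ()
print(re.findall(r"cat|dog", "the cat chased the dog"))     # ['cat', 'dog']
print(re.search(r"pupp(y|ies)", "three puppies").group(0))  # 'puppies'

# {n}: exactly n occurrences of the previous character
print(bool(re.fullmatch(r"a{2}", "aa")))   # True
print(bool(re.fullmatch(r"a{2}", "aaa")))  # False
```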

Operator precedence (highest to lowest): parentheses (), counters (* + ? {}), sequences and anchors, disjunction (|)

2.1.4 More operators

Common aliases:

- \d => any digit; \D => any non-digit
- \w => any word character (digit, underscore, or letter); \W => any non-word character
- \s => whitespace (space, tab); \S => any non-whitespace character

Example - match mentions of an amount of memory in gigabytes:

\b[0-9]+(\.[0-9]+)? *((GB|gb)|[Gg]igabytes)\b

(one or more digits, an optional fractional part, optional spaces, then GB, gb, Gigabytes, or gigabytes, bounded on both sides)
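A quick sanity check of that pattern, dropped as-is into Python's syntax:

```python
import re

# The disk-space pattern above, in Python syntax
pattern = r"\b[0-9]+(\.[0-9]+)? *((GB|gb)|[Gg]igabytes)\b"

for text in ["I need 1.5 GB of space", "about 32 Gigabytes", "a 5gb stick", "rugby match"]:
    m = re.search(pattern, text)
    print(text, "=>", m.group(0) if m else "no match")
```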

2.1.6 Substitution, Capture groups and ELIZA

Substitution => s/pattern/replacement/; \1 in the replacement refers back to the first capture group

Backreference example:

/the (.*)er it is, the \1er it will be/

\1 must repeat whatever the first capture group (.*) matched, so this matches "The bigger it is, the bigger it will be" but not "The bigger it is, the faster it will be".

Capture group => use of parentheses to store a pattern in a numbered register (\1, \2, ...)

Non-capture group => add ?: after the open parenthesis, (?: pattern ), to group without storing

/(?:some|a few) (people|cats) like some \1/

- some cats like some cats ✅
- some people like some cats ❌ (\1 must repeat exactly what (people|cats) matched)
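A minimal sketch of backreferences and substitution in Python (re.sub takes \1 in the replacement):

```python
import re

# Backreference in a match: \1 must repeat what the first group captured
pattern = r"the (.*)er it is, the \1er it will be"
print(bool(re.search(pattern, "the bigger it is, the bigger it will be")))  # True
print(bool(re.search(pattern, "the bigger it is, the faster it will be")))  # False

# Substitution with a capture group: wrap every number in angle brackets
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))  # 'the <35> boxes'
```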
2.1.7 Lookahead assertions

- (?= pattern) => zero-width positive lookahead; succeeds if pattern matches ahead, without consuming any text
- (?! pattern) => zero-width negative lookahead; succeeds if pattern does not match ahead

2.2 Words

Disfluencies:

- fragment => a broken-off word, e.g. the main- in "main- mainly"
- filler or filled pause => uh, ah, umm

Herdan's Law (also Heaps' Law) => vocabulary size grows with corpus size: |V| = kN^β, where N is the number of tokens and 0 < β < 1

Lemma => set of lexical forms having the same stem, the same major part of speech, and the same word sense
e.g. cat and cats are forms of the same lemma

2.3 Corpora

Code switching => using multiple languages in a single communicative act

2.4 Simple Unix tools for word tokenization

Text normalization process:

1. Word tokenization
2. Word format normalization
3. Sentence segmentation
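The classic Unix pipeline for step 1 plus counting is tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c; a rough Python stand-in:

```python
import re
from collections import Counter

text = "The cat sat on the mat and the cat slept"

# tr -sc 'A-Za-z' '\n'  =>  split into alphabetic tokens
tokens = re.findall(r"[A-Za-z]+", text.lower())

# sort | uniq -c  =>  count each word type
for word, n in Counter(tokens).most_common(3):
    print(n, word)  # 3 the / 2 cat / then one of the 1-count words
```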

2.5 Word and subword tokenization

Top-down tokenization => rule-based tokenization

Bottom-up tokenization => subword tokens induced from data (e.g. byte-pair encoding)

Clitic => a part of a word that can't stand on its own and occurs only attached to another word
e.g. the 're in we're, the 'm in I'm

Penn Treebank Tokenization:

- separates out clitics (doesn't => does n't)
- keeps hyphenated words together
- separates out all punctuation
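One way to try Penn Treebank-style tokenization is NLTK's TreebankWordTokenizer (assumes nltk is installed; the exact token list can vary by version):

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("The San Francisco-based restaurant doesn't charge $10.")
print(tokens)
# Clitics and punctuation split apart ('does', "n't", '$', '10', '.'),
# while the hyphenated 'Francisco-based' stays together.
```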

2.5.2 Byte-pair Encoding ( Bottom-up tokenization )

A morpheme is the smallest meaning-bearing unit of a language.
e.g. unwashable => un-, wash, and -able

A subword tokenization scheme has two parts:

- token learner => induces a vocabulary from a training corpus
- token segmenter => segments new text using that vocabulary

Algorithms:

1. Byte-pair encoding (BPE) - see the sketch after this list
   - start from a vocabulary of all individual characters
   - repeatedly count all pairs of adjacent symbols and merge the most frequent pair into a new symbol, k times
   - vocab => all initial characters + k merged symbols
2. Unigram language modeling
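A minimal sketch of the BPE token learner; following the textbook's convention, "_" marks end of word, and the toy corpus is the textbook-style low/newer/wider example:

```python
from collections import Counter

def learn_bpe(word_counts: dict[str, int], k: int) -> list[tuple[str, str]]:
    # Represent each corpus word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("_",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(k):
        # Count all pairs of adjacent symbols across the corpus.
        pairs = Counter()
        for symbols, n in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Merge every occurrence of the best pair into one new symbol.
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, n in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = n
        vocab = new_vocab
    return merges

# Frequent subwords like er_ and new emerge after a few merges:
print(learn_bpe({"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}, k=4))
# [('e', 'r'), ('er', '_'), ('n', 'e'), ('ne', 'w')]
```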

2.6 Word Normalization, Lemmatization and Stemming

Normalization => task of putting words and tokens into a standard format

2.6.1 Lemmatization

Lemmatization => task of determining whether two words have the same root despite their surface differences
e.g. am, are, and is share the lemma be

Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.

- stem => the central morpheme, supplying the main meaning
- affixes => add additional meanings of various kinds

Porter Stemmer [1] => a widely used rule-based stemmer (a cascade of suffix-rewriting rules)
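A quick look at Porter stemming via NLTK's implementation (assumes nltk is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cats", "operating", "operates", "operation"]:
    print(word, "=>", stemmer.stem(word))
# 'cats' => 'cat'; the operat- forms all collapse to the same stem ('oper'),
# showing that stemming chops affixes more crudely than lemmatization.
```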

2.7 Sentence Segmentation

Sentence boundaries usually come from punctuation (. ! ?); the period is ambiguous between a sentence boundary and an abbreviation marker (Mr., Inc.), so tokenization and segmentation are often done jointly.
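A sketch using NLTK's punkt models (an assumption: the models have already been fetched with nltk.download):

```python
from nltk.tokenize import sent_tokenize

text = "Mr. Smith went to Washington. He arrived at 5 p.m. and left quickly."
print(sent_tokenize(text))
# The periods in 'Mr.' and 'p.m.' mark abbreviations, not sentence
# boundaries, so this should yield two sentences.
```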

2.8 Minimum Edit Distance (minimum number of editing operations needed to transform one string into another)

Coreference => deciding whether two strings refer to the same entity
e.g. Stanford President Marc ~ Stanford University President Marc

Levenshtein distance weighting:

- insertion / deletion => cost 1
- substitution => cost 2 (1 insertion + 1 deletion)

2.8.1 The Minimum Edit Distance Algorithm

- Levenshtein algorithm => dynamic programming: fill a table D[i][j] of distances between prefixes (see the sketch below)
- grep => Global Regular Expression Print
- Backtracing => keep pointers while filling the table to recover the alignment, not just the distance
- Weighted minimum edit distance => allow per-character operation costs
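A minimal sketch of the algorithm with the Levenshtein costs above (insertion/deletion 1, substitution 2), filled in bottom-up:

```python
def min_edit_distance(source: str, target: str) -> int:
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # delete all of source[:i]
    for j in range(1, m + 1):
        D[0][j] = j  # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,               # deletion
                          D[i][j - 1] + 1,               # insertion
                          D[i - 1][j - 1] + sub_cost)    # substitution / copy
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8, the textbook example
```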

References

1. Porter stemmer: https://tartarus.org/martin/PorterStemmer/
2. CS124: https://www.youtube.com/playlist?list=PLaZQkZp6WhWyvdiP49JG-rjyTPck_hvEu
