Chapter 1 + 2
Date : 24-Dec-2024
Time : 04:51 pm
Tags: language_processing
Chapter 1
Text normalization
Pattern Description
/character/ matches a single character or the sequence of characters between the slashes
[wW] disjunction - matches either w or W (case-insensitive for that one letter)
[1-5] range - any one character in the range
[^a] negation - any single character (including special characters) except a;
^ means negation only when it is the first character after [
[^a-z] => any single character that is not a lowercase letter
? optional - the preceding character or nothing
/an?/ => a or an
* Kleene * (pronounced "cleany star")
0 or more occurrences of the previous character
/an*/ => a, an, annnnnn
+ Kleene +
1 or more occurrences of the previous character or range of characters
/[0-9]+/ => at least one digit
. wildcard
any single character
/.*/ => any number of any characters
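A minimal sketch of the patterns above, using Python's re module. The sample strings are made up for illustration, and the surrounding slashes of the notation are dropped because Python takes the pattern as a plain string.

```python
import re

# Disjunction: [wW] accepts either case of the first letter.
disjunction = re.findall(r"[wW]oodchuck", "Woodchuck and woodchuck")

# Range: [1-5] matches exactly one character between 1 and 5.
digit_range = re.findall(r"[1-5]", "a1b7c5")

# Negation: [^a-z] matches any one character that is not a lowercase letter.
negation = re.findall(r"[^a-z]", "ab!C")

# Optional: n? means "one n or nothing".
optional = re.findall(r"an?", "a an")

# Kleene star: zero or more n after the a.
star = [w for w in ["a", "an", "annnn", "b"] if re.fullmatch(r"an*", w)]

# Kleene plus: one or more digits.
plus = re.findall(r"[0-9]+", "room 42, floor 7")

# Wildcard with star: any number of any characters (newlines excluded by default).
wildcard = re.fullmatch(r".*", "any characters at all") is not None

print(disjunction, digit_range, negation, optional, star, plus, wildcard)
```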
Pattern Description
anchors
Pattern Description
| pipe symbol - disjunction between whole patterns
/cat|dog/ => either cat or dog
() precedence - makes an operator apply to the whole group
/pupp(y|ies)/ => puppy or puppies
{n} exactly n occurrences of the previous character or expression
/a{2}/ => aa
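The following sketch tries these three operators in Python's re module. The word lists are invented for illustration. The ungrouped pattern is included only to show why the parentheses matter.

```python
import re

words = ["puppy", "puppies", "pupp", "ies"]

# Pipe: either alternative may match.
either = re.findall(r"cat|dog", "a cat chased a dog")

# Parentheses limit the disjunction to the suffix.
grouped = [w for w in words if re.fullmatch(r"pupp(y|ies)", w)]

# Without parentheses the pipe splits the whole pattern: "puppy" OR "ies".
ungrouped = [w for w in words if re.fullmatch(r"puppy|ies", w)]

# Counter: exactly two occurrences of a.
exact = [w for w in ["a", "aa", "aaa"] if re.fullmatch(r"a{2}", w)]

print(either, grouped, ungrouped, exact)
```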
Operator precedence
Pattern Description
/the (.*)er it is, the \1er it will be/
Substitution / capture group => (.*) stores the text it matched; \1 refers back to that stored text, so it must match the same string again
> The bigger it is, the bigger it will be.
Non-capturing group => put the special command ?: after the open parenthesis, as in
(?: pattern); the parentheses then only group and do not store the match in a numbered register
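A short sketch of capture groups, backreferences, and non-capturing groups in Python's re module. The [Tt] at the start of the first pattern is an assumption added so that the capitalized example sentence matches; the other sentences are made-up test strings.

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
pattern = re.compile(r"[Tt]he (.*)er it is, the \1er it will be")
match = pattern.search("The bigger it is, the bigger it will be.")
captured = match.group(1)
mismatch = pattern.search("The bigger it is, the faster it will be.")

# Substitution: \1 in the replacement inserts the captured text.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# With a capturing first group, \1 refers to "some" or "a few".
capturing = re.compile(r"(some|a few) (people|cats) like some \1")
# With (?: ...) the first group is not numbered, so \1 is (people|cats).
non_capturing = re.compile(r"(?:some|a few) (people|cats) like some \1")

sentence = "some cats like some cats"
print(captured, mismatch, bracketed)
print(capturing.fullmatch(sentence), non_capturing.fullmatch(sentence).groups())
```

Whether a group is numbered changes what \1 means, which is why the two compiled patterns accept different sentences.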
2.2 Words
Disfluencies => fragments (broken-off words) and fillers such as uh and um in spoken utterances
Lemma => set of lexical forms having the same stem, the same major part of speech,
and the same word sense
e.g. cat and cats are two wordforms of the same lemma, cat
2.3 Corpora
1. Word tokenization
2. Word format normalization
3. Sentence segmentation
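The three steps can be sketched with plain regular expressions. This is a deliberately naive illustration on a made-up text: the sentence splitter would break on abbreviations such as "Mr.", and normalization here is only case folding.

```python
import re

text = "The cat sat. The cats sat, too! Did they?"

# Sentence segmentation: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: runs of word characters, or single punctuation marks.
tokens = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

# Word format normalization: lowercase every token.
normalized = [[t.lower() for t in sent] for sent in tokens]

print(sentences)
print(tokens[1])
print(normalized[0])
```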
Tokenization schemes:
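token learner => takes a raw training corpus and induces a vocabulary (a set of tokens)
token segmenter => takes a raw test sentence and splits it into tokens from that vocabulary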
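The notes do not name an algorithm here. Assuming the two parts refer to byte-pair encoding (BPE), a minimal sketch could look like the following. The function names learn_bpe, merge_pair, and segment, the "_" end-of-word marker, and the toy corpus are all choices made for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus_words, num_merges):
    """Token learner: return the ordered list of merges learned from the corpus."""
    # Each word starts as single characters plus the end-of-word marker "_".
    vocab = Counter(tuple(word) + ("_",) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Token segmenter: apply the learned merges, in order, to a new word."""
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3 + ["new"] * 2
merges = learn_bpe(corpus, 8)
print(merges)
print(segment("newer", merges), segment("lower", merges))
```

The segmenter replays the merges in the order they were learned, so a word never seen in training ("lower" here) is still split into known subword units.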
Algorithms:
2.6.1 Lemmatization
Lemmatization => task of determining that two words have the same root despite
their surface differences
e.g. am, are, is => be
Morphology is the study of the way words are built up from smaller meaning-
bearing units called morphemes.
- stem => central morpheme, supplying the main meaning
- affixes => add additional meanings of various kinds
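As a toy illustration only, the idea of mapping surface forms to a shared lemma can be sketched with a small exception table plus crude suffix stripping. The names IRREGULAR, toy_lemma, and same_lemma and the suffix rules are hypothetical; a real lemmatizer relies on full morphological analysis.

```python
# Hypothetical exception table for irregular forms.
IRREGULAR = {"am": "be", "are": "be", "is": "be"}


def toy_lemma(word):
    """Return a rough lemma: look up irregular forms, else strip a plural affix."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # puppies -> puppy
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # cats -> cat
    return word


def same_lemma(a, b):
    """Two surface forms count as the same word if their toy lemmas agree."""
    return toy_lemma(a) == toy_lemma(b)


print(toy_lemma("cats"), toy_lemma("puppies"), toy_lemma("are"))
print(same_lemma("am", "are"), same_lemma("cat", "dogs"))
```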
Levenshtein algorithm
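A minimal dynamic-programming sketch of minimum edit distance. The notes do not fix the operation costs, so this version assumes insertion and deletion cost 1 and takes the substitution cost as a parameter (1 for plain Levenshtein distance; some textbook variants use 2). The function name min_edit_distance is chosen for this example.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Return the minimum cost of editing source into target."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i              # delete every source character
    for j in range(1, m + 1):
        dist[0][j] = j              # insert every target character
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or copy
            )
    return dist[n][m]


print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```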
grep => Global Regular Expression Print (from the ed command g/re/p)
Backtracing
Weighted Minimum Edit Distance
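Backtracing and weighting can be sketched together: each cell remembers which operation produced it, and the costs come from functions instead of constants. The name align, the default costs (insert 1, delete 1, substitute 2), and the cheap_sub weighting below are assumptions for illustration, not values taken from the notes.

```python
def align(source, target,
          ins_cost=lambda c: 1,
          del_cost=lambda c: 1,
          sub_cost=lambda a, b: 0 if a == b else 2):
    """Return (cost, alignment); alignment pairs use None to mark a gap."""
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost(source[i - 1])
        back[i][0] = "del"
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost(target[j - 1])
        back[0][j] = "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]), "sub"),
                (dist[i - 1][j] + del_cost(source[i - 1]), "del"),
                (dist[i][j - 1] + ins_cost(target[j - 1]), "ins"),
            ]
            # Keep the cheapest option and remember where it came from.
            dist[i][j], back[i][j] = min(options, key=lambda o: o[0])
    # Backtrace: follow the stored pointers from the final cell to the origin.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "sub":
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move == "del":
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


cost, pairs = align("intention", "execution")
print(cost)
print(pairs)

# Hypothetical weighting: any mismatch costs only 0.5.
cheap_sub = lambda a, b: 0 if a == b else 0.5
print(align("cat", "cut", sub_cost=cheap_sub)[0])
```

Several alignments can share the minimum cost; this sketch returns whichever one the tie-breaking in min happens to pick.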
References