Lecture 2: Regular Expressions, Text Normalization
Lecture Objectives:
Regular Expression Application
• In a Web search engine, the units of text matched might be entire documents or Web pages
• In a word processor, they might be individual words or lines of a document
Regular Expression Guide
Pattern Definition
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
+? Repeats a character one or more times (non-greedy)
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
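A quick Python session exercises several of the patterns in the guide above (the sample line is made up for illustration):

```python
import re

# Made-up sample line to exercise the patterns above.
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# \S+@\S+ : runs of non-whitespace joined by '@' (a rough address pattern)
print(re.findall(r'\S+@\S+', line))      # ['stephen.marquard@uct.ac.za']

# ^From (\S+) : anchored at the start; the parentheses extract only the address
print(re.findall(r'^From (\S+)', line))  # ['stephen.marquard@uct.ac.za']
```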
Regular Expressions: Anchors ^ $
• Anchors are special characters that anchor regular expressions to particular
places in a string.
• The caret ^ matches the start of a string.
– The regular expression ^The matches the word The only at the start of a string.
• The dollar sign $ matches the end of a line.
Regular Expression Matches
.$ any character at the end of a string
\.$ dot character at the end of a string
^[A-Z] any uppercase character at the beginning of a string
^[A-Z][^\.]*is[^\.]*\.$ a string that starts with a capital letter, contains "is", and ends with "."
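These anchor behaviours can be checked directly with Python's re module (the test strings are illustrative):

```python
import re

# ^The matches 'The' only at the start of the string
print(bool(re.search(r'^The', 'The end.')))     # True
print(bool(re.search(r'^The', 'At The end.')))  # False

# \.$ : a literal dot at the end of the string
print(bool(re.search(r'\.$', 'The end.')))      # True

# Starts with a capital, contains "is", ends with "."
print(bool(re.search(r'^[A-Z][^\.]*is[^\.]*\.$', 'This course is fun.')))  # True
```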
Regular Expressions: Disjunctions
Negations in []
• The square braces can also be used to specify what a single character cannot be, by use of the caret ^.
• If the caret ^ is the first symbol after the open square brace [, the resulting pattern is negated.
Regular Expression Matches
[^A-Z] Not an upper case letter
[^a-z] Not a lower case letter
[^Ss] Neither ‘S’ nor ‘s’
[^e^] Neither e nor ^
a^b The pattern a^b
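The negation patterns behave as follows in Python. One caveat: the table's a^b shorthand is textbook notation; in Python's re a caret outside square brackets is an anchor, so it must be escaped to match literally:

```python
import re

print(re.findall(r'[^A-Z]', 'aBcD'))  # ['a', 'c'] : everything but uppercase letters
print(re.findall(r'[^Ss]', 'Sales'))  # ['a', 'l', 'e'] : neither 'S' nor 's'

# Outside brackets, ^ is an anchor in Python, so escape it to match literally:
print(re.findall(r'a\^b', 'xa^by'))   # ['a^b']
```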
Regular Expressions: {} . ?
• {m,n} causes the resulting RE to match from m to n repetitions of the
preceding RE.
• {m} specifies that exactly m copies of the previous RE should be matched
• The question mark ? marks optionality of the previous expression.
Regular Expression Matches
woodchucks? woodchuck or woodchucks
colou?r color or colour
(a|b)?c ac, bc, c
(ba){2,3} baba, bababa
• A wildcard expression dot . matches any single character (except a newline).
Regular Expression Matches
beg.n begin, begun, begxn, …
a.*b any string that starts with a and ends with b
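The counting and wildcard operators can be verified with re.findall, which returns every non-overlapping match (the test strings are illustrative):

```python
import re

print(re.findall(r'woodchucks?', 'a woodchuck saw woodchucks'))  # ['woodchuck', 'woodchucks']
print(re.findall(r'colou?r', 'color and colour'))                # ['color', 'colour']
print(re.findall(r'(?:ba){2,3}', 'bababa'))                      # ['bababa'] : greedy, takes 3 copies
print(re.findall(r'beg.n', 'begin begun begxn'))                 # ['begin', 'begun', 'begxn']
```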
Regular Expression: Writing Patterns
Online regex testing apps: PyRegex, Pythex
•\S+@\S+
•[A-Z][a-zA-Z]+
•[A-Z][a-z]*
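Applied to a made-up sample sentence (the email address is illustrative only), the three patterns behave quite differently; note how [A-Z][a-z]* breaks "NLP" into single letters:

```python
import re

# Made-up sample text; the address is illustrative only.
text = "Email Ahmed.Khan@nu.edu.pk about the NLP meeting"

print(re.findall(r'\S+@\S+', text))         # ['Ahmed.Khan@nu.edu.pk']
print(re.findall(r'[A-Z][a-zA-Z]+', text))  # ['Email', 'Ahmed', 'Khan', 'NLP']
print(re.findall(r'[A-Z][a-z]*', text))     # ['Email', 'Ahmed', 'Khan', 'N', 'L', 'P']
```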
Regular Expression: Writing Patterns in PYTHON
>>> import re
>>> text1 = 'We will be moving next month to earn PKR 20000'
>>> w = re.findall(r'[0-9]+', text1)
>>> print(w)
['20000']
Exercise: what does this pattern match?
[a-zA-Z0-9]* (\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Books: Regular Expression
Information extraction by using REs

import nltk
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)        # verbose mode: whitespace and comments in the pattern are ignored
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
print(text)
print(nltk.regexp_tokenize(text, pattern))
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
NLP pipeline: Text cleaning

def main(text):
    text2 = [NLP_pipeline(x) for x in text]
    while "" in text2:
        text2.remove("")
    return text2
Text cleaning: sending text

import pandas as pd
import cleanfile as cf  # Step 1: code to clean the raw text

def cleaning(dfilename):  # receives raw data CSV
    df = pd.read_csv(dfilename, encoding='mac_roman')  # or "utf-16"
    text1 = df['Text']
    cleantext = cf.main(text1)  # applying cleaning
    Ctext = cleantext
    return Ctext
Text cleaning: Calling pipeline()

# Finding keywords example
def NLP_pipeline(inputext):
    return Make_Lemmatize(' '.join(
        Remove_RUEstopwords(Remove_UselessWords(Remove_Repeatdc(
            Remove_Emoji(Remove_Whitespaces(Remove_Punctuation(
                Remove_Otags(Remove_HtmlTags(Make_LowerCase(inputext)))))))))))
Text cleaning: Calling pipeline()

# Example function: ignoring non-ASCII chars
import string

def Remove_Nonascii(a_str):
    ascii_chars = set(string.printable)
    return ''.join(filter(lambda x: x in ascii_chars, a_str))
Text cleaning: Calling pipeline()

# Example function: removing other tags
def Remove_Otags(text1):
    text1 = " ".join([word for word in text1.split()
                      if 'http' not in word and '@' not in word
                      and '<' not in word and '#' not in word])
    # text1 = re.sub('[!@#$:).;,?&]', '', text1.lower())
    # text1 = re.sub(' ', ' ', text1)
    # text1 = Remove_Nonascii(text1)
    return text1
Text Normalization
• Text normalization is the process of
converting text into a standard form so that it
can be more easily processed by computers.
• Why is text normalization important?
  o It helps computers understand text.
  o It helps make text-to-speech (TTS) systems more accurate.
  o It helps improve the accuracy and efficiency of automated systems.
NLTK: Natural Language Toolkit
• Use your own text

import nltk
text1 = input("Enter some text: ")
words = nltk.word_tokenize(text1)
print(words)
print(len(words))
print("You typed", len(nltk.word_tokenize(text1)), "words.")

>> Enter some text: Natural Languge Processing
['Natural', 'Languge', 'Processing']
3
You typed 3 words.
How many words?
• My father, walking along a river looking at sky said these words.
• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
– No of Tokens =14
– No of types =?
Words Normalization
• Lemmatization
  – Represent all words as their shared root.
  – The goal is to remove inflections and map a word to its root form.
    am, are, is → be
    car, cars, car's, cars' → car
  – Lemmatized: yaariyan
  – Stemmed: yaariyan
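The contrast can be sketched in plain Python. The lemma dictionary and the suffix-stripping rule below are toy illustrations, not real lemmatizer internals; note that both leave an out-of-vocabulary word like "yaariyan" unchanged, matching the example above:

```python
# Toy lemma dictionary mapping inflected forms to their shared root.
LEMMAS = {'am': 'be', 'are': 'be', 'is': 'be',
          'cars': 'car', "car's": 'car', "cars'": 'car'}

def lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

def stem(word):
    # Crude suffix stripping in the spirit of a Porter-style stemmer.
    for suffix in ("'s", "s'", 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

print([lemmatize(w) for w in ['am', 'are', 'is']])  # ['be', 'be', 'be']
print([stem(w) for w in ['car', 'cars', "car's"]])  # ['car', 'car', 'car']
print(lemmatize('yaariyan'), stem('yaariyan'))      # both unchanged
```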
Sentence Segmentation
• !, ? are relatively unambiguous but “.” is quite
ambiguous
– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3
• Common algorithm: decide whether a period (.) is part of a word or is a sentence-boundary marker.
– An abbreviation dictionary can help
• Sentence segmentation can then often be
done by rules based on this tokenization.
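The rule described above can be sketched as a small splitter: treat ., !, and ? as boundaries, except when a period ends a known abbreviation or sits inside a number. The abbreviation set and example sentence are illustrative:

```python
import re

# Small abbreviation dictionary (illustrative; a real one would be much larger).
ABBREVS = {'dr.', 'mr.', 'mrs.', 'inc.', 'u.s.'}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r'[.!?]', text):
        end = m.end()
        words = text[start:end].split()
        token = words[-1].lower() if words else ''
        if m.group() == '.':
            # Not a boundary if the period ends a known abbreviation...
            if token in ABBREVS:
                continue
            # ...or sits inside a number like 4.3
            if end < len(text) and text[end].isdigit():
                continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Khan earns $4.3 million. He works at XYZ Inc. in Lahore!"))
# ['Dr. Khan earns $4.3 million.', 'He works at XYZ Inc. in Lahore!']
```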
How similar are two strings?
Which is closest to "Karachi"?
• Kirachi
• Karachu
• Kerrach
• Kararachi
• Resulting alignment of two sequences:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
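"Closest" can be made precise with minimum edit distance. A standard dynamic-programming sketch, with unit cost for insertion, deletion, and substitution (the same family of algorithms that produces alignments like the one above):

```python
def edit_distance(s, t):
    # Classic minimum edit distance: d[i][j] is the cost of turning
    # the first i characters of s into the first j characters of t.
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

for cand in ['Kirachi', 'Karachu', 'Kerrach', 'Kararachi']:
    print(cand, edit_distance('Karachi', cand))
```

Kirachi and Karachu are each one substitution away, so they tie as the closest candidates.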
True Positive (TP): number of correct predictions when the actual class is positive.
True Negative (TN): number of correct predictions when the actual class is negative.
False Positive (FP): number of incorrect predictions when the actual class is negative, i.e., predicted positive (Type I error).
False Negative (FN): number of incorrect predictions when the actual class is positive, i.e., predicted negative (Type II error).
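From these four counts the usual evaluation metrics follow directly; the confusion-matrix counts below are hypothetical, for illustration only:

```python
def metrics(tp, tn, fp, fn):
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp)
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn)
    # Accuracy: overall fraction of correct predictions.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# Hypothetical counts for illustration.
p, r, a = metrics(tp=40, tn=45, fp=5, fn=10)
print(round(p, 3), round(r, 3), round(a, 3))  # 0.889 0.8 0.85
```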
Example
Summary
Basic text processing includes
• Word tokenization
• Ordering
• Removal of unnecessary information like useless words
• The results can be used for categorization of text.
Test Data
Hahhahhaha
Ye bat to manany wali h ap ki. (Roman Urdu: "This point of yours is worth accepting.")
Very well said, Irshad ;);D
Seventy-seven days of friendship
Beautiful humanity.
My son’s U.S. friends have left Paris to be with their
families. We had a long talk and decided that he'd
stay in Paris to finish his semester. He doesn't want
further disruption in his studies that have already
moved from regular classes to online ones.
References
• Wikipedia
• Prof. Jason Eisner, Natural Language Processing course, Johns Hopkins University
• web.stanford.edu
• Levity.ai