
Lecture 2: Regular Expressions, Text Normalization

Lecture Objectives:

• Students will be able to understand NLP tasks
• Students will be able to understand parsing algorithms

CSC-441: Natural Language Processing


What is NLP?
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Text Annotation Tasks

• Classification of individual word tokens


• Identify phrases
• Parsing
• Text Classification
• Semantic annotation
What is a Regular Expression?

• Each regular expression (RE) represents a set of strings having a certain pattern.
• In NLP, we can use REs to find strings having certain patterns in a given text.

Simple definition for regular expressions over an alphabet Σ:
• ε is a regular expression
• If a ∈ Σ, a is a regular expression
• Or: if E1 and E2 are REs, then E1 | E2 is a regular expression
• Concatenation: if E1 and E2 are REs, then E1E2 is a regular expression
• Kleene closure: if E is an RE, then E* is a regular expression
• Positive closure: if E is an RE, then E+ is a regular expression
Searching Strings with Regular Expressions

• How can we search for any of the following strings?


– woodchuck
– woodchucks
– Woodchuck
– Woodchucks
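All four variants can be matched at once with a character class for the initial letter plus an optional plural s; a minimal sketch using Python's re module (the sample sentence is my own):

```python
import re

text = "A woodchuck and two woodchucks met a Woodchuck; Woodchucks chuck wood."
# [wW] matches either case of the first letter; s? makes the plural optional
print(re.findall(r"[wW]oodchucks?", text))
# → ['woodchuck', 'woodchucks', 'Woodchuck', 'Woodchucks']
```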

Regular Expression Applications
• In a Web search engine, the matched strings might be entire documents or Web pages.
• In a word processor, they might be individual words or lines of a document.
• E.g., the UNIX grep command
• E.g., dir *.*

Regular Expression Guide
Pattern Definition
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
+? Repeats a character one or more times (non-greedy)
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
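The difference between the greedy and non-greedy repeaters in the table shows up clearly in a short Python example (the sample string is my own):

```python
import re

text = "<b>bold</b> and <i>italic</i>"
print(re.findall(r"<.+>", text))   # greedy: runs to the last '>' → one long match
print(re.findall(r"<.+?>", text))  # non-greedy: stops at the first '>' → each tag
```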
Regular Expressions:
Anchors ^ $
• Anchors are special characters that anchor regular expressions to particular
places in a string.
• The caret ^ matches the start of a string.
– The regular expression ^The matches the word The only at the start of a string.
• The dollar sign $ matches the end of a line.
Regular Expression Matches
.$ any character at the end of a string
\.$ dot character at the end of a string
^[A-Z] any uppercase character at the
beginning of a string
^[A-Z][^\.]*is[^\.]*\.$ a string that contains “is”, starts with a
capital letter, and ends with “.”
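The last pattern in the table can be checked with re.search; the test sentences below are my own:

```python
import re

pattern = r"^[A-Z][^\.]*is[^\.]*\.$"
print(bool(re.search(pattern, "This course is fun.")))  # True: capital start, "is", final "."
print(bool(re.search(pattern, "this course is fun.")))  # False: no capital at the start
```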
Regular Expressions: Disjunctions and
Negations in []
• Negations in []:
– The square braces can also be used to specify what a single character cannot be,
by use of
the caret ^.
– If the caret ^ is the first symbol after the open square brace [, the resulting pattern
is negated.
Regular Expression Matches
[^A-Z] Not an upper case letter
[^a-z] Not a lower case letter
[^Ss] Neither ‘S’ nor ‘s’
[^e^] Neither e nor ^
a^b The pattern a^b
Regular Expressions: {} . ?
• {m,n} causes the resulting RE to match from m to n repetitions of the
preceding RE.
• {m} specifies that exactly m copies of the previous RE should be matched
• The question mark ? marks optionality of the previous expression.
Regular Expression Matches
woodchucks? woodchuck or woodchucks
colou?r color or colour
(a|b)?c ac, bc, c
(ba){2,3} baba, bababa
• A wildcard expression dot . matches any single character (except a carriage
return).
Regular Expression Matches
beg.n begin, begun, begxn, …
a.*b any string starts with a and ends with
b
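The ? and {m,n} rows above can be verified with re.fullmatch, which tests whether the whole string fits the pattern:

```python
import re

print(bool(re.fullmatch(r"colou?r", "color")))    # True: the u is optional
print(bool(re.fullmatch(r"colou?r", "colour")))   # True
print(bool(re.fullmatch(r"(ba){2,3}", "baba")))   # True: 2 repetitions
print(bool(re.fullmatch(r"(ba){2,3}", "ba")))     # False: only 1 repetition
```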
Regular Expression: Writing
Patterns
Online regex finding Apps: PyRegex, Pythex

Pakistan, China and Iran are friends from years.


asiasamreen.bukc@bahria.edu.pk sent on Sat Jan 5 09:14:16 2008

What will be the results for:


(?:\@|http?\://|https?\://|www)\S+

•\S+@\S+
•[A-Z][a-zA-Z]+
•[A-Z][a-z]*
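To see what these patterns return, they can be run through re.findall; I have joined the slide's two sample lines into one string here:

```python
import re

text = ("Pakistan, China and Iran are friends from years. "
        "asiasamreen.bukc@bahria.edu.pk sent on Sat Jan 5 09:14:16 2008")
print(re.findall(r"\S+@\S+", text))         # ['asiasamreen.bukc@bahria.edu.pk']
print(re.findall(r"[A-Z][a-zA-Z]+", text))  # ['Pakistan', 'China', 'Iran', 'Sat', 'Jan']
print(re.findall(r"[A-Z][a-z]*", text))     # same matches in this text
```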
Regular Expression: Writing
Patterns in PYTHON

>>> import re
>>> text1 = 'We will be moving next month to earn PKR 20000'
>>> w = re.findall(r'[0-9]+', text1)
>>> print(w)
['20000']

Exercise: what will the following pattern match?
[a-zA-Z0-9]* (\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Books : Regular Expression

Information extraction by using
REs
import nltk
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)        # set flag to allow verbose regexps
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
print(text)
print(nltk.regexp_tokenize(text, pattern))
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
NLP pipeline: Text cleaning
def main(text):
    text2 = [NLP_pipeline(x) for x in text]
    while "" in text2:
        text2.remove("")
    return text2
Text cleaning: sending text
import cleanfile as cf  # Step 1: code to clean the raw text

def cleaning(dfilename):  # receives the raw-data CSV
    df = pd.read_csv(dfilename, encoding='mac_roman')  # or "utf-16"
    text1 = df['Text']
    cleantext = cf.main(text1)  # applying cleaning
    Ctext = cleantext
    return Ctext
Text cleaning: Calling pipeline()
# Finding Keywords example
def NLP_pipeline(inputext):
    return Make_Lemmatize(' '.join(
        Remove_RUEstopwords(
            Remove_UselessWords(
                Remove_Repeatdc(
                    Remove_Emoji(
                        Remove_Whitespaces(
                            Remove_Punctuation(
                                Remove_Otags(
                                    Remove_HtmlTags(
                                        Make_LowerCase(inputext)))))))))))
Text cleaning: Example functions
# Example function: ignoring non-ASCII chars
import string

def Remove_Nonascii(a_str):
    ascii_chars = set(string.printable)
    return ''.join(filter(lambda x: x in ascii_chars, a_str))
Text cleaning: Example functions
# Example function: removing other tags
def Remove_Otags(text1):
    text1 = " ".join([word for word in text1.split()
                      if 'http' not in word and '@' not in word
                      and '<' not in word and '#' not in word])
    # text1 = re.sub('[!@#$:).;,?&]', '', text1.lower())
    # text1 = Remove_Nonascii(text1)
    return text1
Text Normalization
• Text normalization is the process of
converting text into a standard form so that it
can be more easily processed by computers.
• Why is text normalization important?
  o It helps computers understand text.
  o It helps make text-to-speech (TTS) systems more accurate.
  o It helps improve the accuracy and efficiency of automated systems.
nltk: Natural Language Toolkit
• Use your own text
import nltk
text1 = input("Enter some text: ")
words = nltk.word_tokenize(text1)
print(words)
print(len(words))
print("You typed", len(nltk.word_tokenize(text1)), "words.")
>>Enter some text: Natural Languge Processing
['Natural', 'Languge', 'Processing']
3
You typed 3 words.
How many words?
• My father , walking along a river looking at
sky said these words.
• Type: an element of the vocabulary.
• Token: recognized word.
• How many?
– No of Tokens =14
– No of types =?
Words Normalization
• Lemmatization
  – Represent all words as their shared root
  – The goal is to remove inflections and map a word to its root form:
      am, are, is → be
      car, cars, car's, cars' → car
• Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
Lemmatization is done by
Morphological Parsing
• Morphemes:
– The small meaningful units that make up words
– Stems: The core meaning-bearing units
– Affixes: Parts that adhere to stems, often with
grammatical functions
• Morphological Parsers:
– Parse cats into two morphemes cat and
s
– Parse connected into connect and ed
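A toy affix-stripper illustrates the idea; the suffix list and the length check below are my own simplifications, not a real morphological parser:

```python
# Toy morphological parser: split a word into (stem, affix).
# Assumption: only a few suffixes, tried in order; real parsers use full lexicons.
SUFFIXES = ["ed", "ing", "s"]

def parse(word):
    for suf in SUFFIXES:
        # require a stem of at least 3 characters to avoid over-stripping
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return (word[:-len(suf)], suf)
    return (word, "")

print(parse("cats"))       # ('cat', 's')
print(parse("connected"))  # ('connect', 'ed')
```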
Words Normalization
• Stemming: chop off word endings to get the root
  – Stemming uses a crude heuristic process that chops off the ends of words, such as This → Thi, Accurate → Accur
Lemmatization and stemming
import nltk
text1 =input("Enter some text: ")
words=nltk.word_tokenize(text1)
print(words)
print(len(words))
print ("You typed", len(nltk.word_tokenize(text1)),
"words.")
lemma = nltk.wordnet.WordNetLemmatizer()
print ("Lemmatized: ",lemma.lemmatize('article'))
print ("Lemmatized: ",lemma.lemmatize('leaves'))
sno = nltk.stem.SnowballStemmer('english')
print("Stemmed: ",sno.stem('article'))
print("Stemmed: ",sno.stem('leaves'))
output
• “article” Lemmatized: article
• “leaves” Lemmatized: leaf
• “article” Stemmed: articl
• “leaves” Stemmed: leav
Problem???
• What about Roman Urdu wording
– print("lemmatized:", lemma.lemmatize('yaariyan'))
– print("Stemmed: ", sno.stem('Yaariyan'))

lemmatized: yaariyan
Stemmed: yaariyan
Sentence Segmentation
• !, ? are relatively unambiguous but “.” is quite
ambiguous
– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3
• Common Algorithm: decide whether a (.) is
part of the word or is a sentence-boundary
marker.
– An abbreviation dictionary can help
• Sentence segmentation can then often be
done by rules based on this tokenization.
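The common algorithm above can be sketched as a simple rule-based segmenter; the abbreviation dictionary and the number check here are my own toy choices, not a production method:

```python
import re

# Toy abbreviation dictionary (assumption: just a few entries for illustration)
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Inc.", "etc."}

def segment(text):
    """Split on '!', '?', and on '.' unless it ends an abbreviation or a number."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        is_boundary = (
            tok.endswith(("!", "?"))
            or (tok.endswith(".")
                and tok not in ABBREVIATIONS
                and not re.fullmatch(r"[\d.]+%?\.", tok))  # e.g. "4.3." stays inside
        )
        if is_boundary:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(segment("Dr. Smith paid 4.3 dollars. Is that right? Yes!"))
# → ['Dr. Smith paid 4.3 dollars.', 'Is that right?', 'Yes!']
```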
How similar are two strings?

• Spell correction
  – The user typed “Karachi”. Which is closest?
    • Kirachi
    • Karachu
    • Kerrach
    • Kararachi
• Computational Biology
  – Align two sequences of nucleotides:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  – Resulting alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Also used for Machine Translation, Information Extraction, Speech Recognition
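The similarity question above is usually answered with minimum edit distance; a minimal dynamic-programming sketch (the candidate list comes from the slide):

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

for cand in ["Kirachi", "Karachu", "Kerrach", "Kararachi"]:
    print(cand, edit_distance("Karachi", cand))
```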
Annotated Text Corpora
• Many text corpora contain linguistic
annotations, representing part-of-speech
tags, named entities, syntactic structures,
semantic roles, and so forth.
Text Classification
• Given:
– A representation of a document d
• Issue: how to represent text documents.
• Usually some type of high-dimensional space – bag of words
– A fixed set of classes:
C = {c1, c2,…, cJ}
• Determine:
– The category of d: γ(d) ∈ C, where γ(d) is a
classification function
– We want to build classification functions (“classifiers”).
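A classification function γ(d) over a bag-of-words representation can be sketched in a few lines; the training snippets and the overlap scoring below are my own toy choices, not the lecture's actual classifier:

```python
from collections import Counter

# Toy training data: one document per class (assumption: illustrative only)
train = {
    "spam": "win cash now win prize",
    "ham":  "meeting agenda for the project review",
}
# Bag-of-words representation of each class
class_bags = {c: Counter(doc.split()) for c, doc in train.items()}

def gamma(d):
    """Return the class whose bag of words overlaps most with document d."""
    words = Counter(d.split())
    return max(class_bags, key=lambda c: sum((words & class_bags[c]).values()))

print(gamma("win a cash prize now"))   # → spam
print(gamma("project meeting agenda")) # → ham
```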
Text Categorization
• Is it spam?
• Is it Urdu?
• Is it interesting to this user?
– News filtering
– Helpdesk routing
• Is it interesting to this NLP program?
– e.g., should my calendar system try to interpret this
email as an appointment (using info. extraction)?
• Where should it go in the directory?
– Yahoo! / Open Directory / digital libraries
– Which mail folder? (work, friends, junk, urgent ...)
Measuring Performance
• Precision =
good messages kept/
all messages kept
• Recall =
good messages kept/
all good messages

True Positive (TP) : number of correct predictions when the actual class is positive.
True Negative (TN) :number of correct predictions when the actual class is negative.
False Positive (FP) : number of incorrect predictions when the actual class is positive,
(Type I Error).
False Negative (FN) :number of incorrect predictions when the actual class is negative
(Type II Error). 34
Example
Summary
Basic text processing includes:

• Word tokenization
• Ordering
• Removal of unnecessary information, such as useless words
• These steps can be used for categorization of text.

Test Data
Hahhahhaha
Ye bat to manany wali h ap ki.
Very well said, Irshad ;);D
Seventy-seven days of friendship
Beautiful humanity.
My son’s U.S. friends have left Paris to be with their
families. We had a long talk and decided that he'd
stay in Paris to finish his semester. He doesn't want
further disruption in his studies that have already
moved from regular classes to online ones.
References
• Wikipedia.com
• Prof. Jason Eisner, Natural Language Processing, Johns Hopkins University
• web.stanford.edu
• Levity.ai
