Lecture 2: Regular Expressions, Text Normalization
Lecture Objectives:
Regular Expression Application
• In a Web search engine, the units of text matched might be entire documents or Web pages
• In a word processor, they might be individual words or lines of a document
Regular Expression Guide
Pattern Definition
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
+? Repeats a character one or more times (non-greedy)
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
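A quick Python session exercises several of the patterns in the guide above (the sample line is made up for illustration):

```python
import re

# Made-up sample line to exercise the patterns above.
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# \S+@\S+ : runs of non-whitespace joined by '@' (a rough address pattern)
print(re.findall(r'\S+@\S+', line))      # ['stephen.marquard@uct.ac.za']

# ^From (\S+) : anchored at the start; the parentheses extract only the address
print(re.findall(r'^From (\S+)', line))  # ['stephen.marquard@uct.ac.za']
```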
Regular Expressions: Anchors ^ $
• Anchors are special characters that anchor regular expressions to particular
places in a string.
• The caret ^ matches the start of a string.
– The regular expression ^The matches the word The only at the start of a string.
• The dollar sign $ matches the end of a line.
Regular Expression Matches
.$ any character at the end of a string
\.$ dot character at the end of a string
^[A-Z] any uppercase character at the beginning of a string
^[A-Z][^\.]*is[^\.]*\.$ a string that starts with a capital letter, contains "is", and ends with "."
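These anchor behaviours can be checked directly with Python's re module (the test strings are illustrative):

```python
import re

# ^The matches 'The' only at the start of the string
print(bool(re.search(r'^The', 'The end.')))     # True
print(bool(re.search(r'^The', 'At The end.')))  # False

# \.$ : a literal dot at the end of the string
print(bool(re.search(r'\.$', 'The end.')))      # True

# Starts with a capital, contains "is", ends with "."
print(bool(re.search(r'^[A-Z][^\.]*is[^\.]*\.$', 'This course is fun.')))  # True
```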
Regular Expressions: Disjunctions
Negations in []
• The square braces can also be used to specify what a single character cannot be, by use of the caret ^.
• If the caret ^ is the first symbol after the open square brace [, the resulting pattern is negated.
Regular Expression Matches
[^A-Z] Not an upper case letter
[^a-z] Not a lower case letter
[^Ss] Neither ‘S’ nor ‘s’
[^e^] Neither e nor ^
a^b The pattern a^b
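The negation patterns behave as follows in Python. One caveat: the table's a^b shorthand is textbook notation; in Python's re a caret outside square brackets is an anchor, so it must be escaped to match literally:

```python
import re

print(re.findall(r'[^A-Z]', 'aBcD'))  # ['a', 'c'] : everything but uppercase letters
print(re.findall(r'[^Ss]', 'Sales'))  # ['a', 'l', 'e'] : neither 'S' nor 's'

# Outside brackets, ^ is an anchor in Python, so escape it to match literally:
print(re.findall(r'a\^b', 'xa^by'))   # ['a^b']
```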
Regular Expressions: {} . ?
• {m,n} causes the resulting RE to match from m to n repetitions of the
preceding RE.
• {m} specifies that exactly m copies of the previous RE should be matched
• The question mark ? marks optionality of the previous expression.
Regular Expression Matches
woodchucks? woodchuck or woodchucks
colou?r color or colour
(a|b)?c ac, bc, c
(ba){2,3} baba, bababa
• A wildcard expression dot . matches any single character (except a newline).
Regular Expression Matches
beg.n begin, begun, begxn, …
a.*b any string that starts with a and ends with b
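The counting and wildcard operators can be verified with re.findall, which returns every non-overlapping match (the test strings are illustrative):

```python
import re

print(re.findall(r'woodchucks?', 'a woodchuck saw woodchucks'))  # ['woodchuck', 'woodchucks']
print(re.findall(r'colou?r', 'color and colour'))                # ['color', 'colour']
print(re.findall(r'(?:ba){2,3}', 'bababa'))                      # ['bababa'] : greedy, takes 3 copies
print(re.findall(r'beg.n', 'begin begun begxn'))                 # ['begin', 'begun', 'begxn']
```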
Regular Expression: Writing Patterns
Online regex testing apps: PyRegex, Pythex
•\S+@\S+
•[A-Z][a-zA-Z]+
•[A-Z][a-z]*
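Applied to a made-up sample sentence (the email address is illustrative only), the three patterns behave quite differently; note how [A-Z][a-z]* breaks "NLP" into single letters:

```python
import re

# Made-up sample text; the address is illustrative only.
text = "Email Ahmed.Khan@nu.edu.pk about the NLP meeting"

print(re.findall(r'\S+@\S+', text))         # ['Ahmed.Khan@nu.edu.pk']
print(re.findall(r'[A-Z][a-zA-Z]+', text))  # ['Email', 'Ahmed', 'Khan', 'NLP']
print(re.findall(r'[A-Z][a-z]*', text))     # ['Email', 'Ahmed', 'Khan', 'N', 'L', 'P']
```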
Regular Expression: Writing Patterns in PYTHON
>>> import re
>>> text1 = 'We will be moving next month to earn PKR 20000'
>>> w = re.findall(r'[0-9]+', text1)
>>> print(w)
['20000']
Exercise: what does this pattern match?
[a-zA-Z0-9]* (\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$
Books: Regular Expression
Information extraction by using REs

import nltk
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)        # verbose mode: whitespace and comments in the pattern are ignored
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():_`-]      # these are separate tokens; includes ], [
'''
print(text)
print(nltk.regexp_tokenize(text, pattern))
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
NLP pipeline: Text cleaning

def main(text):
    text2 = [NLP_pipeline(x) for x in text]
    while "" in text2:
        text2.remove("")
    return text2
Text cleaning: sending text

import pandas as pd
import cleanfile as cf  # Step 1: code to clean the raw text

def cleaning(dfilename):  # receives raw data CSV
    df = pd.read_csv(dfilename, encoding='mac_roman')  # or "utf-16"
    text1 = df['Text']
    cleantext = cf.main(text1)  # applying cleaning
    Ctext = cleantext
    return Ctext
Text cleaning: Calling pipeline()

# Finding keywords example
def NLP_pipeline(inputext):
    return Make_Lemmatize(' '.join(
        Remove_RUEstopwords(Remove_UselessWords(Remove_Repeatdc(
            Remove_Emoji(Remove_Whitespaces(Remove_Punctuation(
                Remove_Otags(Remove_HtmlTags(Make_LowerCase(inputext)))))))))))
Text cleaning: Calling pipeline()

# Example function: ignoring non-ASCII chars
import string

def Remove_Nonascii(a_str):
    ascii_chars = set(string.printable)
    return ''.join(filter(lambda x: x in ascii_chars, a_str))
Text cleaning: Calling pipeline()

# Example function: removing other tags
def Remove_Otags(text1):
    text1 = " ".join([word for word in text1.split()
                      if 'http' not in word and '@' not in word
                      and '<' not in word and '#' not in word])
    # text1 = re.sub('[!@#$:).;,?&]', '', text1.lower())
    # text1 = re.sub(' ', ' ', text1)
    # text1 = Remove_Nonascii(text1)
    return text1
Text Normalization
• Text normalization is the process of
converting text into a standard form so that it
can be more easily processed by computers.
• Why is text normalization important?
  o It helps computers understand text.
  o It helps make text-to-speech (TTS) systems more accurate.
  o It helps improve the accuracy and efficiency of automated systems.
NLTK: Natural Language Toolkit
• Use your own text

import nltk
text1 = input("Enter some text: ")
words = nltk.word_tokenize(text1)
print(words)
print(len(words))
print("You typed", len(nltk.word_tokenize(text1)), "words.")

>> Enter some text: Natural Languge Processing
['Natural', 'Languge', 'Processing']
3
You typed 3 words.
How many words?
• My father, walking along a river looking at sky said these words.
• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
– No of Tokens =14
– No of types =?
Words Normalization
• Lemmatization
  – Represent all words as their shared root.
  – The goal is to remove inflections and map a word to its root form.
    am, are, is → be
    car, cars, car's, cars' → car
  – Lemmatized: yaariyan
  – Stemmed: yaariyan
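The contrast can be sketched in plain Python. The lemma dictionary and the suffix-stripping rule below are toy illustrations, not real lemmatizer internals; note that both leave an out-of-vocabulary word like "yaariyan" unchanged, matching the example above:

```python
# Toy lemma dictionary mapping inflected forms to their shared root.
LEMMAS = {'am': 'be', 'are': 'be', 'is': 'be',
          'cars': 'car', "car's": 'car', "cars'": 'car'}

def lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

def stem(word):
    # Crude suffix stripping in the spirit of a Porter-style stemmer.
    for suffix in ("'s", "s'", 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

print([lemmatize(w) for w in ['am', 'are', 'is']])  # ['be', 'be', 'be']
print([stem(w) for w in ['car', 'cars', "car's"]])  # ['car', 'car', 'car']
print(lemmatize('yaariyan'), stem('yaariyan'))      # both unchanged
```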
Sentence Segmentation
• !, ? are relatively unambiguous but “.” is quite
ambiguous
– Sentence boundary
– Abbreviations like Inc. or Dr.
– Numbers like .02% or 4.3
• Common algorithm: decide whether a period (.) is part of a word or is a sentence-boundary marker.
– An abbreviation dictionary can help
• Sentence segmentation can then often be
done by rules based on this tokenization.
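The rule described above can be sketched as a small splitter: treat ., !, and ? as boundaries, except when a period ends a known abbreviation or sits inside a number. The abbreviation set and example sentence are illustrative:

```python
import re

# Small abbreviation dictionary (illustrative; a real one would be much larger).
ABBREVS = {'dr.', 'mr.', 'mrs.', 'inc.', 'u.s.'}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r'[.!?]', text):
        end = m.end()
        words = text[start:end].split()
        token = words[-1].lower() if words else ''
        if m.group() == '.':
            # Not a boundary if the period ends a known abbreviation...
            if token in ABBREVS:
                continue
            # ...or sits inside a number like 4.3
            if end < len(text) and text[end].isdigit():
                continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Khan earns $4.3 million. He works at XYZ Inc. in Lahore!"))
# ['Dr. Khan earns $4.3 million.', 'He works at XYZ Inc. in Lahore!']
```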
How similar are two strings?
Which is closest to "Karachi"?
• Kirachi
• Karachu
• Kerrach
• Kararachi
• Resulting alignment of two sequences:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
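"Closest" can be made precise with minimum edit distance. A standard dynamic-programming sketch, with unit cost for insertion, deletion, and substitution (the same family of algorithms that produces alignments like the one above):

```python
def edit_distance(s, t):
    # Classic minimum edit distance: d[i][j] is the cost of turning
    # the first i characters of s into the first j characters of t.
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

for cand in ['Kirachi', 'Karachu', 'Kerrach', 'Kararachi']:
    print(cand, edit_distance('Karachi', cand))
```

Kirachi and Karachu are each one substitution away, so they tie as the closest candidates.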
True Positive (TP): number of correct predictions when the actual class is positive.
True Negative (TN): number of correct predictions when the actual class is negative.
False Positive (FP): number of incorrect predictions when the actual class is negative, i.e., predicted positive (Type I error).
False Negative (FN): number of incorrect predictions when the actual class is positive, i.e., predicted negative (Type II error).
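From these four counts the usual evaluation metrics follow directly; the confusion-matrix counts below are hypothetical, for illustration only:

```python
def metrics(tp, tn, fp, fn):
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp)
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn)
    # Accuracy: overall fraction of correct predictions.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# Hypothetical counts for illustration.
p, r, a = metrics(tp=40, tn=45, fp=5, fn=10)
print(round(p, 3), round(r, 3), round(a, 3))  # 0.889 0.8 0.85
```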
Example
Summary
Basic text processing includes
• Word tokenization
• Ordering
• Removal of unnecessary information like useless words
• The results can be used for categorization of text.
Test Data
Hahhahhaha
Ye bat to manany wali h ap ki. (Roman Urdu: "This point of yours is worth accepting.")
Very well said, Irshad ;);D
Seventy-seven days of friendship
Beautiful humanity.
My son’s U.S. friends have left Paris to be with their
families. We had a long talk and decided that he'd
stay in Paris to finish his semester. He doesn't want
further disruption in his studies that have already
moved from regular classes to online ones.
References
• Wikipedia
• Prof. Jason Eisner, Natural Language Processing course, Johns Hopkins University
• web.stanford.edu
• Levity.ai