0% found this document useful (0 votes)

61 views

Chapter2 PDF

The document provides an introduction to natural language processing techniques including bag-of-words modeling and text preprocessing using Python libraries like NLTK and Gensim. It explains how bag-of-words works by tokenizing text and counting token frequencies. It also demonstrates how to preprocess text by removing stopwords and punctuation. Finally, it introduces Gensim and how it can be used to create word vectors and compute TF-IDF scores to identify important words in documents.

Uploaded by

NourheneMbarek

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

61 views

Chapter2 PDF

Uploaded by

NourheneMbarek

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Word counts with bag-

of-words

Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python

Bag-of-words
Basic method for finding topics in a text
Need to first create tokens using tokenization
... and then count up all the tokens
The more frequent a word, the more important it might be
Can be a great way to determine the significant words in a text
DataCamp Introduction to Natural Language Processing in Python

Bag-of-words example

Text: "The cat is in the box. The cat likes the box. The box is over the
cat."

Bag of words (stripped punctuation):

"The": 3, "box": 3
"cat": 3, "the": 3
"is": 2
"in": 1, "likes": 1, "over": 1
DataCamp Introduction to Natural Language Processing in Python

Bag-of-words in Python
In [1]: from nltk.tokenize import word_tokenize

In [2]: from collections import Counter

In [3]: Counter(word_tokenize(
"""The cat is in the box. The cat likes the box.
The box is over the cat."""))
Out[3]:
Counter({'.': 3,
'The': 3,
'box': 3,
'cat': 3,
'in': 1,
...
'the': 3})

In [4]: counter.most_common(2)
Out[4]: [('The', 3), ('box', 3)]
DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Let's practice!
DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Simple text
preprocessing

Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python

Why preprocess?
Helps make for better input data
When performing machine learning or other statistical methods
Examples:
Tokenization to create a bag of words
Lowercasing words
Lemmatization/Stemming
Shorten words to their root stems
Removing stop words, punctuation, or unwanted tokens
Good to experiment with diﬀerent approaches
DataCamp Introduction to Natural Language Processing in Python

Preprocessing example

Input text: Cats, dogs and birds are common pets. So are ﬁsh.

Output tokens: cat, dog, bird, common, pet, ﬁsh

DataCamp Introduction to Natural Language Processing in Python

Text preprocessing with Python

In [1]: from ntlk.corpus import stopwords

In [2]: text = """The cat is in the box. The cat likes the box.
The box is over the cat."""

In [3]: tokens = [w for w in word_tokenize(text.lower())

if w.isalpha()]

In [4]: no_stops = [t for t in tokens

if t not in stopwords.words('english')]

In [5]: Counter(no_stops).most_common(2)
Out[5]: [('cat', 3), ('box', 3)]
DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Let's practice!
DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Introduction to gensim

Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python

What is gensim?
Popular open-source NLP library
Uses top academic models to perform complex tasks
Building document or word vectors
Performing topic identiﬁcation and document comparison
DataCamp Introduction to Natural Language Processing in Python

What is a word vector?

DataCamp Introduction to Natural Language Processing in Python

Gensim Example

(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)
DataCamp Introduction to Natural Language Processing in Python

Creating a gensim dictionary

In [1]: from gensim.corpora.dictionary import Dictionary

In [2]: from nltk.tokenize import word_tokenize

In [3]: my_documents = ['The movie was about a spaceship and aliens.',

...: 'I really liked the movie!',
...: 'Awesome action scenes, but boring characters.',
...: 'The movie was awful! I hate alien films.',
...: 'Space is cool! I liked the movie.',
...: 'More space films, please!',]

In [4]: tokenized_docs = [word_tokenize(doc.lower())

...: for doc in my_documents]

In [5]: dictionary = Dictionary(tokenized_docs)

In [6]: dictionary.token2id
Out[6]:
{'!': 11,
',': 17,
'.': 7,
'a': 2,
'about': 4,
...
DataCamp Introduction to Natural Language Processing in Python

Creating a gensim corpus

In [7]: corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

In [8]: corpus
Out[8]:
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...
]

gensim models can be easily saved, updated, and reused

Our dictionary can also be updated

This more advanced and feature rich bag-of-words can be used in
future exercises
DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Let's practice!
DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Tf-idf with gensim

Katharine Jarmul
Founder, kjamistan
DataCamp Introduction to Natural Language Processing in Python

What is tf-idf?
Term frequency - inverse document frequency
Allows you to determine the most important words in each document
Each corpus may have shared words beyond just stopwords
These words should be down-weighted in importance
Example from astronomy: "Sky"
Ensures most common words don't show up as key words
Keeps document speciﬁc frequent words weighted high
DataCamp Introduction to Natural Language Processing in Python

Tf-idf formula

N
wi,j = tfi,j ∗ log( )
dfi

wi,j = tf-idf weight for token i in document j

tfi,j = number of occurences of token i in document j

dfi = number of documents that contain token i

N = total number of documents
DataCamp Introduction to Natural Language Processing in Python

Tf-idf with gensim

In [10]: from gensim.models.tfidfmodel import TfidfModel

In [11]: tfidf = TfidfModel(corpus)

In [12]: tfidf[corpus[1]]
Out[12]:
[(0, 0.1746298276735174),
(1, 0.1746298276735174),
(9, 0.29853166221463673),
(10, 0.7716931521027908),
...
]
DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Let's practice!

Chapter 2
No ratings yet
Chapter 2
22 pages
Final Summary NLP
No ratings yet
Final Summary NLP
446 pages
Named Entity Recognition: Katharine Jarmul
No ratings yet
Named Entity Recognition: Katharine Jarmul
17 pages
Introduction To Regular Expressions: Katharine Jarmul
No ratings yet
Introduction To Regular Expressions: Katharine Jarmul
31 pages
Chapter1 NLP
No ratings yet
Chapter1 NLP
31 pages
NLP 9
No ratings yet
NLP 9
44 pages
NLP_DeepNLP
No ratings yet
NLP_DeepNLP
61 pages
Motivation Video: Mitsuku Vs Cleverbot - AI (Artificial Intelligence)
No ratings yet
Motivation Video: Mitsuku Vs Cleverbot - AI (Artificial Intelligence)
45 pages
Natural Language Processing (NLP)
No ratings yet
Natural Language Processing (NLP)
5 pages
Classifying Fake News Using Supervised Learning With NLP: Katharine Jarmul
No ratings yet
Classifying Fake News Using Supervised Learning With NLP: Katharine Jarmul
20 pages
Week 6: Introduction To Natural Language Processing
No ratings yet
Week 6: Introduction To Natural Language Processing
18 pages
Introduction to Natural Language Processing
No ratings yet
Introduction to Natural Language Processing
31 pages
Session 1
No ratings yet
Session 1
60 pages
NLP Lecture2 Text Pre Processing
No ratings yet
NLP Lecture2 Text Pre Processing
54 pages
NLP
No ratings yet
NLP
13 pages
Natural Language Processing manual
No ratings yet
Natural Language Processing manual
39 pages
A Tutorial On: Linguistic Data Analysis
No ratings yet
A Tutorial On: Linguistic Data Analysis
99 pages
Natural Language Process (NLP)
No ratings yet
Natural Language Process (NLP)
29 pages
5 BASIC TEXT PROCESSING
No ratings yet
5 BASIC TEXT PROCESSING
6 pages
Introduction To NLP
No ratings yet
Introduction To NLP
50 pages
Natural Language Processing Dossier 20231110 141736 0000
No ratings yet
Natural Language Processing Dossier 20231110 141736 0000
114 pages
NLP Lab Manual-1
No ratings yet
NLP Lab Manual-1
18 pages
Introduction To NLP
No ratings yet
Introduction To NLP
68 pages
Intro To NLP: Natural Language Toolkit
No ratings yet
Intro To NLP: Natural Language Toolkit
11 pages
NLP PDF
No ratings yet
NLP PDF
25 pages
Natural Language Processing Revision Notes
No ratings yet
Natural Language Processing Revision Notes
4 pages
NLP Manual (1-12) 1
No ratings yet
NLP Manual (1-12) 1
56 pages
A Beginner's Guide To Natural Language Processing - IBM Developer
No ratings yet
A Beginner's Guide To Natural Language Processing - IBM Developer
9 pages
Minorproject Ishant
No ratings yet
Minorproject Ishant
18 pages
NLP Manual (1-12)
No ratings yet
NLP Manual (1-12)
54 pages
Natural Language Processing: Some Screenshots Are Taken From NLP Course by Jufrasky - Used Only For Educational Purpose
No ratings yet
Natural Language Processing: Some Screenshots Are Taken From NLP Course by Jufrasky - Used Only For Educational Purpose
44 pages
NLP Lab Manual
No ratings yet
NLP Lab Manual
25 pages
Natural Language Processing With Python
100% (1)
Natural Language Processing With Python
504 pages
Rajeev Mishra 20 SCSE1180087
No ratings yet
Rajeev Mishra 20 SCSE1180087
29 pages
Understanding Language Model
No ratings yet
Understanding Language Model
5 pages
Python NLP
No ratings yet
Python NLP
15 pages
Natural Language Processing With Python's NLTK Package – Real Python
No ratings yet
Natural Language Processing With Python's NLTK Package – Real Python
27 pages
Module 3
No ratings yet
Module 3
40 pages
Big Data Finance t8 1 Choi Neoma NLP 2024
No ratings yet
Big Data Finance t8 1 Choi Neoma NLP 2024
13 pages
UBC Summer School in NLP - VSP 2019 Lecture 8
No ratings yet
UBC Summer School in NLP - VSP 2019 Lecture 8
27 pages
Ram Chandra Padwal - Pratical Guide To NLTK For Data Science
No ratings yet
Ram Chandra Padwal - Pratical Guide To NLTK For Data Science
37 pages
Text Processing, Tokenization & Characteristics
No ratings yet
Text Processing, Tokenization & Characteristics
89 pages
NLP Manual (1-12)
No ratings yet
NLP Manual (1-12)
55 pages
SK NLP Practical (FS)
No ratings yet
SK NLP Practical (FS)
22 pages
C10_AI_UNIT 3_NLP_ HALF YEARLY
No ratings yet
C10_AI_UNIT 3_NLP_ HALF YEARLY
37 pages
Natural Language Processing (NLP) & Computational Linguistics
No ratings yet
Natural Language Processing (NLP) & Computational Linguistics
60 pages
Module 05 - Learners Guide
No ratings yet
Module 05 - Learners Guide
31 pages
Natural Language Processing
No ratings yet
Natural Language Processing
12 pages
text-processing
No ratings yet
text-processing
114 pages
Building A Simple Chatbot From Scratch in Python1
No ratings yet
Building A Simple Chatbot From Scratch in Python1
8 pages
NLP Programming
No ratings yet
NLP Programming
39 pages
Dsbdal A7
No ratings yet
Dsbdal A7
65 pages
AP for NLP-LO1
No ratings yet
AP for NLP-LO1
61 pages
DLT Unit-5
No ratings yet
DLT Unit-5
48 pages
slides_lec1-3
No ratings yet
slides_lec1-3
225 pages
Mastering Python: A Comprehensive Guide for Beginners and Experts
From Everand
Mastering Python: A Comprehensive Guide for Beginners and Experts
janya lo
No ratings yet
1 Introduction
No ratings yet
1 Introduction
99 pages
Natural Language Toolkit NLTK PDF
No ratings yet
Natural Language Toolkit NLTK PDF
23 pages
Fast Tokenizers' Special Powers - Hugging Face NLP Course
No ratings yet
Fast Tokenizers' Special Powers - Hugging Face NLP Course
23 pages
Unit I Inroduction
No ratings yet
Unit I Inroduction
52 pages
Logistic Regression and Regularization: Michael (Mike) Gelbart
No ratings yet
Logistic Regression and Regularization: Michael (Mike) Gelbart
19 pages
Explorer Award Exam: IA Pour L'ingénieur 3DNI ISITCOM 2019-2020
No ratings yet
Explorer Award Exam: IA Pour L'ingénieur 3DNI ISITCOM 2019-2020
12 pages
SAAI1-AI Analyst 2019-Course Guide 1
No ratings yet
SAAI1-AI Analyst 2019-Course Guide 1
166 pages
Linear Classi Ers: Prediction Equations: Michael (Mike) Gelbart
No ratings yet
Linear Classi Ers: Prediction Equations: Michael (Mike) Gelbart
22 pages
Welcome To The Course!: Michael (Mike) Gelbart
No ratings yet
Welcome To The Course!: Michael (Mike) Gelbart
17 pages
Introduction To Databases in Python: Calculating Values Ina Query
No ratings yet
Introduction To Databases in Python: Calculating Values Ina Query
30 pages
Supervised Learning With Scikit-Learn: Introduction To Regression
No ratings yet
Supervised Learning With Scikit-Learn: Introduction To Regression
31 pages
Supervised Learning With Scikit-Learn: Preprocessing Data
No ratings yet
Supervised Learning With Scikit-Learn: Preprocessing Data
32 pages
Cleaning Data in Python: Pu!ing It All Together
No ratings yet
Cleaning Data in Python: Pu!ing It All Together
14 pages
Introduction To Databases in Python: Filtering and Targeting Data
No ratings yet
Introduction To Databases in Python: Filtering and Targeting Data
32 pages
Introduction To Databases in Python: Creating Databases and Tables
No ratings yet
Introduction To Databases in Python: Creating Databases and Tables
31 pages
Importing Data in Python I: Introduction To Relational Databases
No ratings yet
Importing Data in Python I: Introduction To Relational Databases
33 pages
ch4 Slides PDF
No ratings yet
ch4 Slides PDF
44 pages
Cleaning Data in Python
No ratings yet
Cleaning Data in Python
24 pages
Cleaning Data in Python
No ratings yet
Cleaning Data in Python
26 pages
Data Mining Cheat Sheet
No ratings yet
Data Mining Cheat Sheet
6 pages
NLPPAP
No ratings yet
NLPPAP
8 pages
Email Hijack How Hackers Break in
No ratings yet
Email Hijack How Hackers Break in
33 pages
Shadow City 2065
100% (1)
Shadow City 2065
44 pages
RSM470 H1S Winter 2025 Course Outline 2025-01-06
No ratings yet
RSM470 H1S Winter 2025 Course Outline 2025-01-06
7 pages
Quiz 1 Quiz 1: Congratulations! You Passed!
No ratings yet
Quiz 1 Quiz 1: Congratulations! You Passed!
1 page
Audio Generation With Diffusion Models
No ratings yet
Audio Generation With Diffusion Models
16 pages
State of AI Report - 2024 ONLINE
No ratings yet
State of AI Report - 2024 ONLINE
213 pages
Unit 5 CNN
No ratings yet
Unit 5 CNN
151 pages
Future of Jobs Report 2023.docx - Summary
No ratings yet
Future of Jobs Report 2023.docx - Summary
13 pages
Management Information System: Dr. Anand Vyas
No ratings yet
Management Information System: Dr. Anand Vyas
10 pages
Practical Ontologies For Information Professionals
No ratings yet
Practical Ontologies For Information Professionals
192 pages
Researchpaper
No ratings yet
Researchpaper
8 pages
2024 Trends to Watch Cloud Computing PDF
No ratings yet
2024 Trends to Watch Cloud Computing PDF
30 pages
AI Write An Essay For Me
No ratings yet
AI Write An Essay For Me
27 pages
Ebook 5 Best Practices Rpa
No ratings yet
Ebook 5 Best Practices Rpa
12 pages
teravm-ai-rsg-data-sheets-en
No ratings yet
teravm-ai-rsg-data-sheets-en
4 pages
Steps For Training A Recurrent Neural Network: Advantages
No ratings yet
Steps For Training A Recurrent Neural Network: Advantages
13 pages
PPT-DNE
No ratings yet
PPT-DNE
28 pages
Building A Database-Driven Chatbot With LangChain and OpenAI - A Practical Approach (Part 1, Warm-Up) - by Mathews Pious - Aug, 2024 - GoPenAI
No ratings yet
Building A Database-Driven Chatbot With LangChain and OpenAI - A Practical Approach (Part 1, Warm-Up) - by Mathews Pious - Aug, 2024 - GoPenAI
17 pages
Hopfield Network
No ratings yet
Hopfield Network
14 pages
Educating Farmersin Artificial Intelligence AIand Io T
No ratings yet
Educating Farmersin Artificial Intelligence AIand Io T
7 pages
L2 数量课件
No ratings yet
L2 数量课件
196 pages
The AI Maturity Model - How To Move The Needle of Digital Transformation Towards An AI-Driven Company
No ratings yet
The AI Maturity Model - How To Move The Needle of Digital Transformation Towards An AI-Driven Company
9 pages
Robotics Engineering: ME 406T Unit-1 Robot Fundamentals Lecture-1
100% (1)
Robotics Engineering: ME 406T Unit-1 Robot Fundamentals Lecture-1
29 pages
Robotics 2035 AH
No ratings yet
Robotics 2035 AH
102 pages
Battle Realms - World Master
100% (2)
Battle Realms - World Master
12 pages
2024 MTH058 Lecture08 N ShotLearning
No ratings yet
2024 MTH058 Lecture08 N ShotLearning
39 pages
1) What Is A Blockchain?
No ratings yet
1) What Is A Blockchain?
33 pages
Artificial Intelligence: Fast Facts
No ratings yet
Artificial Intelligence: Fast Facts
2 pages
Depression Detect PPT 1
No ratings yet
Depression Detect PPT 1
19 pages

Chapter2 PDF

Uploaded by

Chapter2 PDF

Uploaded by

DataCamp Introduction to Natural Language Processing in Python

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Word counts with bag-

Bag of words (stripped punctuation):

In [2]: from collections import Counter

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Output tokens: cat, dog, bird, common, pet, ﬁsh

Text preprocessing with Python

In [3]: tokens = [w for w in word_tokenize(text.lower())

In [4]: no_stops = [t for t in tokens

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

What is a word vector?

Creating a gensim dictionary

In [2]: from nltk.tokenize import word_tokenize

In [3]: my_documents = ['The movie was about a spaceship and aliens.',

In [4]: tokenized_docs = [word_tokenize(doc.lower())

In [5]: dictionary = Dictionary(tokenized_docs)

Creating a gensim corpus

gensim models can be easily saved, updated, and reused

Our dictionary can also be updated

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

Tf-idf with gensim

Tf-idf with gensim

In [11]: tfidf = TfidfModel(corpus)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON

You might also like