
Extra-Feature-NLP

March 21, 2024

[ ]: # One Hot Encoding

[1]: import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
categories = ['teacher', 'nurse', 'police', 'doctor']

# Convert the categorical data into a DataFrame
data = pd.DataFrame({'Category': categories})

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, dtype=int)

# Fit and transform the categorical data
encoded_data = encoder.fit_transform(data)

# Convert the encoded data to a DataFrame
# (the encoder orders categories alphabetically, so take the column names from it)
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out())

# Print the encoded DataFrame
encoded_df.head()

[1]:    Category_doctor  Category_nurse  Category_police  Category_teacher
0                     0               0                0                 1
1                     0               1                0                 0
2                     0               0                1                 0
3                     1               0                0                 0
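
Once fitted, the same encoder can be reused on new records so they line up with the columns learned above. A minimal sketch (the sample values below are illustrative; by default an unseen category would raise an error unless the encoder is created with handle_unknown='ignore'):

[ ]: # Encode new records with the already-fitted encoder
new_data = pd.DataFrame({'Category': ['nurse', 'doctor']})
encoder.transform(new_data)
# Expected: columns in alphabetical order (doctor, nurse, police, teacher), e.g.
# array([[0, 1, 0, 0],
#        [1, 0, 0, 0]])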

[2]: # Count Vectorization

[3]: # Bag Of Words (BOW):

[4]: # It creates a vocabulary of unique words from the corpus and represents each document as a vector of word frequencies.

[5]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
bow_vectors = vectorizer.fit_transform(data['Text'])

# Convert the BOW vectors to a DataFrame
bow_df = pd.DataFrame(bow_vectors.toarray(),
                      columns=vectorizer.get_feature_names_out())

# Print the BOW DataFrame
bow_df.head()

[5]:    and  document  first  is  one  second  the  third  this
0         0         1      1   1    0       0    1      0     1
1         0         2      0   1    0       1    1      0     1
2         1         0      0   1    1       0    1      1     1
3         0         1      1   1    0       0    1      0     1
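
The fitted vectorizer can then turn any new text into counts over the same vocabulary; words it has never seen are simply ignored. A minimal sketch (the sentence below is illustrative):

[ ]: # Vectorize a new document against the fitted vocabulary
new_doc = ["This unseen word is ignored, but this and document are counted."]
vectorizer.transform(new_doc).toarray()
# Expected counts over (and, document, first, is, one, second, the, third, this):
# array([[1, 1, 0, 1, 0, 0, 0, 0, 2]])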

[6]: # N-gram features

[7]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer with the desired n-gram range (bigrams and trigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))

# Fit and transform the text data
ngram_vectors = ngram_vectorizer.fit_transform(data['Text'])

# Convert the N-gram vectors to a DataFrame
ngram_df = pd.DataFrame(ngram_vectors.toarray(),
                        columns=ngram_vectorizer.get_feature_names_out())

# Print the N-gram DataFrame
ngram_df.head()

[7]:    and this  and this is  document is  document is the  first document  \
0              0            0            0                0               1
1              0            0            1                1               0
2              1            1            0                0               0
3              0            0            0                0               1

        is the  is the first  is the second  is the third  is this  …  \
0            1             1              0             0        0  …
1            1             0              1             0        0  …
2            1             0              0             1        0  …
3            0             0              0             0        1  …

        the second document  the third  the third one  third one  this document  \
0                         0          0              0          0              0
1                         1          0              0          0              1
2                         0          1              1          1              0
3                         0          0              0          0              0

        this document is  this is  this is the  this the  this the first
0                      0        1            1         0               0
1                      1        0            0         0               0
2                      0        1            1         0               0
3                      0        0            0         1               1

[4 rows x 25 columns]
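
To see exactly which n-grams a given sentence is split into, the fitted vectorizer's analyzer can be called directly. A minimal sketch using the ngram_vectorizer from the cell above:

[ ]: # Inspect the 2- and 3-grams extracted from a single sentence
analyzer = ngram_vectorizer.build_analyzer()
analyzer("This is the first document.")
# Expected: ['this is', 'is the', 'the first', 'first document',
#            'this is the', 'is the first', 'the first document']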

[8]: # TF-IDF Vectorizer:

[9]: # TF (Term Frequency) is the frequency of a term in a document, i.e. the number of times that term occurs in the document.

[10]: # IDF (Inverse Document Frequency) measures how informative a term is across the corpus: terms that appear in many documents are down-weighted, while rarer terms get a higher weight.

[11]: import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_vectors = vectorizer.fit_transform(data['Text'])

# Convert the TF-IDF vectors to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(),
                        columns=vectorizer.get_feature_names_out())

# Print the TF-IDF DataFrame
tfidf_df.head()

[11]:        and  document     first        is       one    second       the  \
0       0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085
1       0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089
2       0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104
3       0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085

           third      this
0       0.000000  0.384085
1       0.000000  0.281089
2       0.511849  0.267104
3       0.000000  0.384085
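
As a sanity check, the weight of "document" in the first row can be reproduced by hand. This is a minimal sketch assuming scikit-learn's default TfidfVectorizer settings (smooth_idf=True, sublinear_tf=False, L2 normalization); the term counts and document frequencies are read off the four example documents above.

[ ]: import numpy as np

# Term counts in document 0 and document frequencies across the 4 documents
counts = {'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}
df = {'document': 3, 'first': 2, 'is': 4, 'the': 4, 'this': 4}
n_docs = 4

# Smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {t: np.log((1 + n_docs) / (1 + df[t])) + 1 for t in counts}
raw = np.array([counts[t] * idf[t] for t in counts])

# L2-normalize the document vector and pick out "document" (first entry)
weights = raw / np.linalg.norm(raw)
weights[0]
# Expected: ~0.4698, matching the first row of the table above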

[ ]: # Word Embedding

[12]: # FastText

[15]: # It learns word embeddings using the Skip-gram or Continuous Bag-of-Words (CBOW) architecture, making it effective for various natural language processing tasks.

[ ]: # FastText builds word vectors from character n-grams, so it can handle out-of-vocabulary words and capture morphological and semantic similarities, even for rare or unseen words.

[16]: import pandas as pd
from gensim.models import FastText

# Training data
sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"]]

# Training the FastText model
model_fasttext = FastText(sentences, min_count=1, window=5, vector_size=100)

# Accessing the word vectors
word_vectors = model_fasttext.wv

# Creating a DataFrame of the word vectors (one row per vocabulary word)
word_vectors_df = pd.DataFrame(word_vectors.vectors, index=word_vectors.index_to_key)

# Displaying the word vectors DataFrame
word_vectors_df.head(10)

[16]: 0 1 2 3 4 5 6 \
I -0.003053 0.001144 -0.001130 0.004910 -0.003084 -0.007648 0.007188
fruits -0.001457 0.001947 0.001137 -0.001536 -0.001588 -0.001997 -0.002027
eating 0.000412 0.001230 -0.002208 0.000289 0.001082 0.000401 0.001171
enjoy -0.001593 0.000200 0.000983 -0.001493 -0.000503 0.001380 0.001440
apples -0.000257 -0.000776 -0.000108 -0.001688 0.002155 -0.001124 0.002533
like 0.001024 -0.003016 0.001939 -0.001192 -0.003485 -0.001892 0.001637

7 8 9 … 90 91 92 \
I 0.007860 -0.001688 -0.002615 … 0.005416 0.001654 0.002986
fruits 0.002295 0.002176 -0.001157 … 0.000342 0.000272 -0.001761
eating -0.000369 -0.000706 0.002063 … -0.002273 0.001385 0.001710
enjoy -0.002292 -0.000112 -0.001617 … -0.003175 -0.001866 0.000952
apples 0.000522 0.000874 -0.000778 … 0.001021 0.000565 -0.001394
like -0.000633 -0.001284 0.001069 … -0.000179 0.002047 -0.000875

93 94 95 96 97 98 99
I 0.002967 0.007579 -0.002151 -0.003800 0.001423 0.001112 -0.000259
fruits -0.001308 -0.000937 -0.000236 -0.000219 -0.000568 -0.003610 -0.001075
eating -0.000360 -0.000841 0.002985 0.000116 -0.000775 -0.000186 0.001993
enjoy -0.002678 0.002496 -0.000418 -0.002535 -0.002113 -0.001011 0.000997
apples -0.000912 0.001105 -0.000151 0.001271 0.001879 0.001152 -0.000260
like -0.000740 0.002278 0.000509 0.001111 -0.001301 0.000404 0.001636

[6 rows x 100 columns]
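
Because the vectors are assembled from character n-grams, the trained model can also produce a vector for a word it never saw during training. A minimal sketch (the words queried below are illustrative, and with such a tiny corpus the similarity scores are essentially noise):

[ ]: # "apple" (singular) never appears in the training sentences, but FastText
# can still build a vector for it from its character n-grams
oov_vector = model_fasttext.wv['apple']
print(oov_vector.shape)
# Expected: (100,)

# Nearest neighbours of an in-vocabulary word (scores are not meaningful on a toy corpus)
print(model_fasttext.wv.most_similar('apples', topn=3))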
