
Extra-Feature-NLP

March 21, 2024

[ ]: # One Hot Encoding

[1]: import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example categorical data
categories = ['teacher', 'nurse', 'police', 'doctor']

# Convert the categorical data into a DataFrame
data = pd.DataFrame({'Category': categories})

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, dtype=int)

# Fit and transform the categorical data
encoded_data = encoder.fit_transform(data)

# Convert the encoded data to a DataFrame
# (the encoder orders categories alphabetically, so take the column names from it)
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out())

# Print the encoded DataFrame
encoded_df.head()

[1]:    Category_doctor  Category_nurse  Category_police  Category_teacher
0                     0               0                0                 1
1                     0               1                0                 0
2                     0               0                1                 0
3                     1               0                0                 0
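
Once fitted, the same encoder can be reused on new records so they line up with the columns learned above. A minimal sketch (the sample values below are illustrative; by default an unseen category would raise an error unless the encoder is created with handle_unknown='ignore'):

[ ]: # Encode new records with the already-fitted encoder
new_data = pd.DataFrame({'Category': ['nurse', 'doctor']})
encoder.transform(new_data)
# Expected: columns in alphabetical order (doctor, nurse, police, teacher), e.g.
# array([[0, 1, 0, 0],
#        [1, 0, 0, 0]])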

[2]: # Count Vectorization

[3]: # Bag Of Words (BOW):

[4]: # It creates a vocabulary of unique words from the corpus and represents each document as a vector of word frequencies.

[5]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the text data
bow_vectors = vectorizer.fit_transform(data['Text'])

# Convert the BOW vectors to a DataFrame
bow_df = pd.DataFrame(bow_vectors.toarray(),
                      columns=vectorizer.get_feature_names_out())

# Print the BOW DataFrame
bow_df.head()

[5]:    and  document  first  is  one  second  the  third  this
0         0         1      1   1    0       0    1      0     1
1         0         2      0   1    0       1    1      0     1
2         1         0      0   1    1       0    1      1     1
3         0         1      1   1    0       0    1      0     1
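
The fitted vectorizer can then turn any new text into counts over the same vocabulary; words it has never seen are simply ignored. A minimal sketch (the sentence below is illustrative):

[ ]: # Vectorize a new document against the fitted vocabulary
new_doc = ["This unseen word is ignored, but this and document are counted."]
vectorizer.transform(new_doc).toarray()
# Expected counts over (and, document, first, is, one, second, the, third, this):
# array([[1, 1, 0, 1, 0, 0, 0, 0, 2]])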

[6]: # N-gram features

[7]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the CountVectorizer with the desired n-gram range (bigrams and trigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))

# Fit and transform the text data
ngram_vectors = ngram_vectorizer.fit_transform(data['Text'])

# Convert the N-gram vectors to a DataFrame
ngram_df = pd.DataFrame(ngram_vectors.toarray(),
                        columns=ngram_vectorizer.get_feature_names_out())

# Print the N-gram DataFrame
ngram_df.head()

[7]:    and this  and this is  document is  document is the  first document  \
0              0            0            0                0               1
1              0            0            1                1               0
2              1            1            0                0               0
3              0            0            0                0               1

        is the  is the first  is the second  is the third  is this  …  \
0            1             1              0             0        0  …
1            1             0              1             0        0  …
2            1             0              0             1        0  …
3            0             0              0             0        1  …

        the second document  the third  the third one  third one  this document  \
0                         0          0              0          0              0
1                         1          0              0          0              1
2                         0          1              1          1              0
3                         0          0              0          0              0

        this document is  this is  this is the  this the  this the first
0                      0        1            1         0               0
1                      1        0            0         0               0
2                      0        1            1         0               0
3                      0        0            0         1               1

[4 rows x 25 columns]
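
To see exactly which n-grams a given sentence is split into, the fitted vectorizer's analyzer can be called directly. A minimal sketch using the ngram_vectorizer from the cell above:

[ ]: # Inspect the 2- and 3-grams extracted from a single sentence
analyzer = ngram_vectorizer.build_analyzer()
analyzer("This is the first document.")
# Expected: ['this is', 'is the', 'the first', 'first document',
#            'this is the', 'is the first', 'the first document']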

[8]: # TF-IDF Vectorizer:

[9]: # TF (Term Frequency) is the frequency of a term in a document, i.e. the number of times that term occurs in the document.

[10]: # IDF (Inverse Document Frequency) measures how informative a term is across the corpus: terms that appear in many documents are down-weighted, while rarer terms get a higher weight.

[11]: import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example text data
documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

# Convert the text data into a DataFrame
data = pd.DataFrame({'Text': documents})

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_vectors = vectorizer.fit_transform(data['Text'])

# Convert the TF-IDF vectors to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectors.toarray(),
                        columns=vectorizer.get_feature_names_out())

# Print the TF-IDF DataFrame
tfidf_df.head()

[11]:        and  document     first        is       one    second       the  \
0       0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085
1       0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089
2       0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104
3       0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085

           third      this
0       0.000000  0.384085
1       0.000000  0.281089
2       0.511849  0.267104
3       0.000000  0.384085
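
As a sanity check, the weight of "document" in the first row can be reproduced by hand. This is a minimal sketch assuming scikit-learn's default TfidfVectorizer settings (smooth_idf=True, sublinear_tf=False, L2 normalization); the term counts and document frequencies are read off the four example documents above.

[ ]: import numpy as np

# Term counts in document 0 and document frequencies across the 4 documents
counts = {'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}
df = {'document': 3, 'first': 2, 'is': 4, 'the': 4, 'this': 4}
n_docs = 4

# Smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {t: np.log((1 + n_docs) / (1 + df[t])) + 1 for t in counts}
raw = np.array([counts[t] * idf[t] for t in counts])

# L2-normalize the document vector and pick out "document" (first entry)
weights = raw / np.linalg.norm(raw)
weights[0]
# Expected: ~0.4698, matching the first row of the table above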

[ ]: # Word Embedding

[12]: # FastText

[15]: # It learns word embeddings using the Skip-gram or Continuous Bag-of-Words (CBOW) architecture, making it effective for various natural language processing tasks.

[ ]: # FastText builds word vectors from character n-grams, so it can handle out-of-vocabulary words and capture morphological and semantic similarities, even for rare or unseen words.

[16]: import pandas as pd
from gensim.models import FastText

# Training data
sentences = [["I", "like", "apples"],
             ["I", "enjoy", "eating", "fruits"]]

# Training the FastText model
model_fasttext = FastText(sentences, min_count=1, window=5, vector_size=100)

# Accessing the word vectors
word_vectors = model_fasttext.wv

# Creating a DataFrame of the word vectors (one row per vocabulary word)
word_vectors_df = pd.DataFrame(word_vectors.vectors, index=word_vectors.index_to_key)

# Displaying the word vectors DataFrame
word_vectors_df.head(10)

[16]: 0 1 2 3 4 5 6 \
I -0.003053 0.001144 -0.001130 0.004910 -0.003084 -0.007648 0.007188
fruits -0.001457 0.001947 0.001137 -0.001536 -0.001588 -0.001997 -0.002027
eating 0.000412 0.001230 -0.002208 0.000289 0.001082 0.000401 0.001171
enjoy -0.001593 0.000200 0.000983 -0.001493 -0.000503 0.001380 0.001440
apples -0.000257 -0.000776 -0.000108 -0.001688 0.002155 -0.001124 0.002533
like 0.001024 -0.003016 0.001939 -0.001192 -0.003485 -0.001892 0.001637

7 8 9 … 90 91 92 \
I 0.007860 -0.001688 -0.002615 … 0.005416 0.001654 0.002986
fruits 0.002295 0.002176 -0.001157 … 0.000342 0.000272 -0.001761
eating -0.000369 -0.000706 0.002063 … -0.002273 0.001385 0.001710
enjoy -0.002292 -0.000112 -0.001617 … -0.003175 -0.001866 0.000952
apples 0.000522 0.000874 -0.000778 … 0.001021 0.000565 -0.001394
like -0.000633 -0.001284 0.001069 … -0.000179 0.002047 -0.000875

93 94 95 96 97 98 99
I 0.002967 0.007579 -0.002151 -0.003800 0.001423 0.001112 -0.000259
fruits -0.001308 -0.000937 -0.000236 -0.000219 -0.000568 -0.003610 -0.001075
eating -0.000360 -0.000841 0.002985 0.000116 -0.000775 -0.000186 0.001993
enjoy -0.002678 0.002496 -0.000418 -0.002535 -0.002113 -0.001011 0.000997
apples -0.000912 0.001105 -0.000151 0.001271 0.001879 0.001152 -0.000260
like -0.000740 0.002278 0.000509 0.001111 -0.001301 0.000404 0.001636

[6 rows x 100 columns]
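
Because the vectors are assembled from character n-grams, the trained model can also produce a vector for a word it never saw during training. A minimal sketch (the words queried below are illustrative, and with such a tiny corpus the similarity scores are essentially noise):

[ ]: # "apple" (singular) never appears in the training sentences, but FastText
# can still build a vector for it from its character n-grams
oov_vector = model_fasttext.wv['apple']
print(oov_vector.shape)
# Expected: (100,)

# Nearest neighbours of an in-vocabulary word (scores are not meaningful on a toy corpus)
print(model_fasttext.wv.most_similar('apples', topn=3))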
