
Unit-2

Unstructured Text Analysis & Chatbot Development
Unstructured Text Analysis
TextBlob is a Python library for processing textual data. It provides a simple API for diving into common
natural language processing (NLP) tasks, such as part-of-speech tagging, noun phrase extraction, sentiment
analysis, classification, translation, and more.
To install TextBlob, you can use pip:
pip install textblob
You will also need to download the necessary NLTK corpora:
python -m textblob.download_corpora
Key functionalities provided by TextBlob:
1. Creating a TextBlob Object:
from textblob import TextBlob
text = "TextBlob is amazingly simple to use. What great fun!"
blob = TextBlob(text)
2. Part-of-Speech Tagging:
print(blob.tags) # [('TextBlob', 'NNP'), ('is', 'VBZ'), ('amazingly', 'RB'), ...]
3. Noun Phrase Extraction:
print(blob.noun_phrases) # WordList(['textblob', 'great fun'])
4. Sentiment Analysis:
TextBlob's sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within
the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
print(blob.sentiment)
# Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
5. Tokenization
You can break a TextBlob into sentences or words:
print(blob.sentences)
# [Sentence("TextBlob is amazingly simple to use."), Sentence("What great fun!")]
print(blob.words)
# WordList(['TextBlob', 'is', 'amazingly', 'simple', 'to', 'use', 'What', 'great', 'fun'])
6. Word Inflection and Lemmatization
from textblob import Word
w = Word("running")
print(w.lemmatize("v")) # 'run'
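The heading also mentions word inflection; as a small illustrative addition (not in the original snippet), Word objects can be pluralized and singularized:
print(Word("apple").pluralize()) # 'apples'
print(Word("apples").singularize()) # 'apple'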
7. Spelling Correction
blob = TextBlob("I havv goood speling!")
print(blob.correct()) # 'I have good spelling!'
8. Translation and Language Detection
You can use TextBlob to translate text between languages and detect the language of a text. (Note: these methods call the Google Translate web API, and recent TextBlob releases have deprecated and removed them, so a dedicated translation library may be needed.)
blob = TextBlob("Simple is better than complex.")
print(blob.translate(to="es")) # 'Simple es mejor que complejo.'
9. WordNet Integration
TextBlob integrates with WordNet, a lexical database for the English language, which can be used for synonyms and antonyms.
from textblob.wordnet import Synset
syn = Synset('ship.n.01')
print(syn.hypernyms()) # [Synset('vessel.n.02')]
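Synonym information can also be looked up directly on a Word object (an illustrative addition; the exact output depends on the installed WordNet data):
from textblob import Word
w = Word("octopus")
print(w.synsets) # e.g. [Synset('octopus.n.01'), Synset('octopus.n.02')]
print(w.definitions) # the gloss (definition) of each synset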
Example: Sentiment Analysis
Here is a full example demonstrating how to use TextBlob for sentiment analysis:

from textblob import TextBlob

text = "I love this library. It's so simple to use! However, sometimes it can be a bit slow."
blob = TextBlob(text)

# Analyze sentiment
for sentence in blob.sentences:
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentence.sentiment}")

Output:
Sentence: I love this library.
Sentiment: Sentiment(polarity=0.5, subjectivity=0.6)
Sentence: It's so simple to use!
Sentiment: Sentiment(polarity=0.375, subjectivity=0.75)
Sentence: However, sometimes it can be a bit slow.
Sentiment: Sentiment(polarity=-0.15000000000000002, subjectivity=0.5333333333333333)
Text Classification using Naive Bayes
Naive Bayes is a simple yet powerful classification algorithm based on Bayes' theorem. It is particularly effective for text
classification tasks such as spam detection and sentiment analysis, and scikit-learn provides a ready-made implementation in
MultinomialNB.
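For reference, Bayes' theorem combined with the "naive" conditional-independence assumption gives, for a class $c$ and word features $x_1, \dots, x_n$:

$$P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

The classifier simply predicts the class with the largest score.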
Here's a step-by-step guide to implementing Naive Bayes for text classification using Python and the scikit-learn library:

Step 1: Import Libraries


import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Load Dataset
For demonstration purposes, let's use a simple dataset. You can replace this with any text dataset you have.
data = {
'text': ["I love this movie", "I hate this movie", "This was an amazing experience", "This was a terrible experience"],
'label': ["positive", "negative", "positive", "negative"]
}
df = pd.DataFrame(data)
Step 3: Preprocess Data
Convert text data to feature vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])
y = df['label']
Step 4: Split Data
Split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 5: Train Naive Bayes Classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
Step 6: Make Predictions and Evaluate
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
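After training, the same fitted vectorizer is reused (transform, not fit_transform) to classify new, unseen text. A small illustrative addition with made-up example sentences:
new_texts = ["What a wonderful movie", "This was a terrible film"]
new_X = vectorizer.transform(new_texts)  # reuse the vocabulary learned during training
print(classifier.predict(new_X))         # e.g. ['positive' 'negative']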
Noun Phrase Extraction
Noun phrase extraction involves identifying noun phrases in text, which is useful for many NLP tasks. We can use libraries like
NLTK or spaCy for this purpose: NLTK via chunking with a regular-expression grammar, and spaCy via its noun_chunks attribute.
Here's how to do it with both:
Using NLTK
import nltk
from nltk import word_tokenize, pos_tag
from nltk.chunk import RegexpParser
# Sample text
text = "Natural language processing is a field of artificial intelligence."
# Tokenize and POS tagging
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
# Define a chunk grammar
grammar = "NP: {<DT>?<JJ>*<NN>}"
# Create a chunk parser
chunk_parser = RegexpParser(grammar)
# Parse the text
tree = chunk_parser.parse(tagged)
# Extract noun phrases
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        print(' '.join(word for word, tag in subtree.leaves()))
Using spaCy
import spacy
# Load spaCy model
nlp = spacy.load('en_core_web_sm')
# Sample text
text = "Natural language processing is a field of artificial intelligence."
# Process text with spaCy
doc = nlp(text)
# Extract noun phrases
for np in doc.noun_chunks:
    print(np.text)
TextBlob for Data Cleaning, Tokenization, etc.
Data cleaning is a crucial step in preparing text data for analysis and involves several tasks such as removing noise, correcting
errors, and standardizing text. TextBlob provides simple and effective tools for some of these tasks. Here are some common
data cleaning tasks using TextBlob and other Python libraries:
1. Lowercasing
Convert all text to lowercase to ensure consistency.
from textblob import TextBlob
text = "This is an Example of Text with Mixed CASE."
blob = TextBlob(text.lower())
print(blob)
2. Removing Punctuation
Remove punctuation to focus on the words.
import string
text = "This is an example, with punctuation!"
blob = TextBlob(text)
cleaned_text = ''.join([char for char in blob.raw if char not in string.punctuation])
print(cleaned_text)
3. Correcting Spelling
Correct spelling mistakes.
text = "I havv a speling mistakke."
blob = TextBlob(text)
corrected_text = blob.correct()
print(corrected_text)
4. Removing Stopwords
Remove common stopwords that do not contribute much meaning.
from textblob import TextBlob
from nltk.corpus import stopwords
# Ensure you have downloaded the stopwords corpus
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
text = "This is a sample sentence, showing off the stop words filtration."
blob = TextBlob(text)
filtered_words = [word for word in blob.words if word.lower() not in stop_words]
print(' '.join(filtered_words))

5. Tokenization
Split text into words and sentences.
blob = TextBlob("TextBlob is a great tool. It makes NLP tasks simple.")
words = blob.words
sentences = blob.sentences
print("Words:", words)
print("Sentences:", sentences)

6. Lemmatization
Reduce words to their base or root form.
from textblob import Word
words = ["running", "jumps", "easily", "fairly"]
lemmatized_words = [Word(word).lemmatize() for word in words]
print(lemmatized_words)
7. Removing Non-Alphanumeric Characters
Remove characters that are not letters or numbers.
import re
text = "This is a sample sentence with numbers 123 and symbols #!@."
blob = TextBlob(text)
cleaned_text = re.sub(r'\W+', ' ', blob.raw)
print(cleaned_text)

8. Removing Extra Whitespace


Remove extra spaces and newlines.
text = "This is a sample text with extra spaces."
blob = TextBlob(text)
cleaned_text = ' '.join(blob.words)
print(cleaned_text)

9. Stemming
Reduce words to their stem or root form (less sophisticated than lemmatization).
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "jumps", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
Combining Data Cleaning Steps
Combining multiple data cleaning steps into a single process.
from textblob import TextBlob
from nltk.corpus import stopwords
import string
import re
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def clean_text(text):
    # Lowercasing
    text = text.lower()
    # Removing punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Removing stopwords
    words = TextBlob(text).words
    filtered_words = [word for word in words if word not in stop_words]
    # Removing non-alphanumeric characters
    cleaned_text = re.sub(r'\W+', ' ', ' '.join(filtered_words))
    # Removing extra whitespace
    cleaned_text = ' '.join(cleaned_text.split())
    return cleaned_text
text = "This is a sample TEXT with punctuation, numbers 123, and stopwords!"
cleaned_text = clean_text(text)
print(cleaned_text)
TextBlob is a versatile tool for a variety of NLP tasks:

Data Cleaning: Correct spelling in multiple sentences.

Tokenization: Tokenize complex text into words and sentences.

POS Tagging: Tag words in a complex sentence with their parts of speech.

Noun Phrase Extraction: Extract noun phrases from a paragraph.

Sentiment Analysis: Analyze the sentiment of multiple sentences.

Translation and Language Detection: Translate and detect the language of multiple texts.

Text Classification: Train and use a classifier with more extensive data (see the sketch after this list).

Basic NLP Tasks: Pluralize and singularize words.
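TextBlob also ships a simple built-in classifier. A minimal sketch, assuming a tiny hand-made training set (the example sentences are illustrative, not from the original slides):
from textblob.classifiers import NaiveBayesClassifier
train = [
    ("I love this product", "pos"),
    ("This is a great library", "pos"),
    ("I hate waiting", "neg"),
    ("This was a terrible experience", "neg"),
]
cl = NaiveBayesClassifier(train)
print(cl.classify("I really love it"))  # expected: 'pos'
print(cl.accuracy([("What a great day", "pos"), ("This is awful", "neg")]))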
Introduction to Transformers
Transformers are a type of model architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. in
2017. They have revolutionized the field of natural language processing (NLP) by enabling highly efficient training and
achieving state-of-the-art results in various tasks. Transformers rely on a mechanism called self-attention, which allows
them to consider the entire input sequence when making predictions, capturing context and long-range dependencies more
effectively than previous architectures. The Hugging Face Transformers library provides an easy-to-use interface for
applying these models to a wide range of NLP tasks, from text classification to machine translation and beyond, and by
fine-tuning pre-trained models on specific datasets users can achieve state-of-the-art performance.

Hugging Face Transformers is a popular open-source library that provides easy access to a vast array of pre-trained
models for natural language processing (NLP) tasks. It allows you to quickly use these models for inference or fine-tune
them on your own datasets for specific tasks, like text classification, question answering, or text generation. The library
supports various transformer-based models, including BERT, GPT, RoBERTa, and many others, making it a versatile tool
for NLP practitioners.
Key Concepts
Self-Attention: The self-attention mechanism enables the model to weigh the importance of different words in a sentence when
encoding a word. This allows the model to capture context and dependencies between words, regardless of their distance in the
sequence (see the numerical sketch after this list).
Positional Encoding: Since Transformers do not have a built-in notion of the order of words (unlike RNNs or LSTMs), positional
encodings are added to the input embeddings to provide information about the position of each word in the sequence.
Multi-Head Attention: Multi-head attention allows the model to focus on different parts of the sentence simultaneously,
improving its ability to capture various aspects of the context.
Feed-Forward Networks: Each position in the sequence is processed independently by a feed-forward neural network, adding
non-linearity and complexity to the model.
Layer Normalization: Normalization layers are used to stabilize and speed up training by normalizing the mean and variance of
the activations.
Encoder-Decoder Architecture: The original Transformer architecture consists of an encoder and a decoder, making it suitable
for sequence-to-sequence tasks such as translation. The encoder processes the input sequence, and the decoder generates the output
sequence, attending to the encoder's output.
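To make the self-attention computation concrete, here is a minimal NumPy sketch of scaled dot-product attention (an illustrative addition; the function name, shapes, and values are arbitrary, and this is not the library's implementation):
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys and values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of every position with every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax: attention weights
    return weights @ V                                         # each output is a weighted mix of all values

# Toy example: 3 tokens with 4-dimensional representations
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)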
Practical Implementation
The Hugging Face Transformers library provides a comprehensive implementation of various Transformer models, making it easy to use
them for different NLP tasks. Here's an introduction to using the library:
Installation
Install the transformers library:
pip install transformers
Loading a Pre-trained Model
Here’s how to load a pre-trained BERT model and tokenizer for a simple text classification task:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Example text
text = "Transformers are a groundbreaking innovation in NLP."
# Tokenize the input text
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
# Make predictions
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
print(f"Predicted class: {predictions.item()}")
Fine-Tuning a Pre-trained Model
Fine-tuning a pre-trained Transformer model on your specific dataset involves training the model on labeled examples. Here’s a basic example of how to fine-
tune BERT on a text classification dataset:
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
# Load dataset (example: IMDB)
dataset = load_dataset('imdb')
train_dataset = dataset['train'].map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length'), batched=True)
test_dataset = dataset['test'].map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length'), batched=True)
# Define training arguments
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
weight_decay=0.01,
)
# Initialize Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=test_dataset,
)
# Train the model
trainer.train()
Common Transformer Models
BERT (Bidirectional Encoder Representations from Transformers): Designed for pre-training deep bidirectional
representations by jointly conditioning on both left and right context in all layers.
GPT (Generative Pre-trained Transformer): An autoregressive model designed for generating text and fine-tuning on
various downstream tasks.
T5 (Text-To-Text Transfer Transformer): Converts all NLP tasks into a text-to-text format, simplifying the input-
output interface.
RoBERTa (Robustly Optimized BERT Approach): An optimized version of BERT with improved training strategies.
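As a quick illustration of the text-to-text idea, the library's pipeline API can run a small T5 checkpoint (the model name and prompt below are one possible choice, not prescribed by the slides):
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: Transformers are very useful."))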
Applications of Transformers
Text Classification: Sentiment analysis, spam detection, etc.
Named Entity Recognition (NER): Identifying entities like names, dates, and locations in text.
Machine Translation: Translating text from one language to another.
Text Generation: Generating coherent and contextually relevant text.
Question Answering: Answering questions based on context from a given passage.
Summarization: Creating concise summaries of long documents.
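Many of these applications are available out of the box through the pipeline API. For example, extractive question answering (a default model is downloaded automatically; the passage below is illustrative):
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(question="What do Transformers rely on?",
            context="Transformers rely on a mechanism called self-attention to model dependencies between words.")
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'self-attention'}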
DistilBERT

DistilBERT, a smaller and faster version of BERT, is widely used for various natural language processing (NLP)
tasks, including text classification and sentiment analysis. Below, I'll provide an overview and examples of how to
use DistilBERT for these tasks using the Hugging Face Transformers library.
Text Classification with DistilBERT
Text classification involves categorizing text into predefined labels. Here's how you can use DistilBERT for this task:
1. Installation
Install the necessary libraries:
pip install transformers datasets
2. Loading the Model and Tokenizer
Load the DistilBERT model and tokenizer from Hugging Face:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
# Load your dataset
dataset = load_dataset('csv', data_files='path/to/your/dataset.csv')
3. Tokenize the Dataset
Tokenize the dataset to prepare it for training:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
4. Define Training Arguments and Trainer
Set up the training arguments and the Trainer:
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy='epoch',
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
# Note: a CSV loaded with load_dataset('csv', ...) has only a 'train' split;
# create a test split first, e.g. dataset = dataset['train'].train_test_split(test_size=0.2)
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['test'],
)
DistilBERT is a versatile model that can be effectively used for text classification and sentiment analysis tasks. Whether you use pre-trained models directly or fine-tune them on your datasets, the Hugging Face Transformers library provides robust tools to streamline the process.

5. Train the Model


Train the model:
trainer.train()
Sentiment Analysis with DistilBERT
Sentiment analysis involves determining the sentiment expressed in a text (e.g., positive, negative, neutral). You can use a pre-trained
DistilBERT model fine-tuned on a sentiment analysis dataset like SST-2.
1. Using a Pre-trained Pipeline
For quick sentiment analysis, you can use the pre-trained pipeline:
from transformers import pipeline
# Load pre-trained DistilBERT sentiment analysis pipeline
classifier = pipeline('sentiment-analysis')
# Sample text
text = "I love using Hugging Face's Transformers library!"
# Perform sentiment analysis
result = classifier(text)
print(result)
2. Fine-Tuning DistilBERT for Sentiment Analysis
If you need a custom sentiment analysis model, fine-tuning might be necessary:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
# Load your dataset (e.g., from a CSV file)
dataset = load_dataset('csv', data_files='path/to/your/sentiment_dataset.csv')
# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Define training arguments
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy='epoch',
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
num_train_epochs=3,
weight_decay=0.01,
)
# Define Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['test'],
)
# Train the model
trainer.train()
Hugging Face Transformers library: Hugging Face Transformers is a powerful library for working with
transformer-based models in NLP, offering a wide range of models, an easy-to-use API, and extensive
community support.
Model Support: Hugging Face Transformers supports a wide range of transformer-based models, including BERT, GPT, RoBERTa,
DistilBERT, and many others. These models can be used for various NLP tasks such as text classification, sequence labeling, text
generation, and more.

Model Hub: The library provides a model hub (https://huggingface.co/models) where you can discover and download pre-trained models
and tokenizer files for your specific task. This makes it easy to access state-of-the-art models and use them in your projects.

Fine-Tuning: One of the key features of Hugging Face Transformers is its support for fine-tuning pre-trained models on custom datasets.
This allows you to adapt a pre-trained model to perform well on your specific task or domain.

Easy-to-Use API: The library provides a simple and intuitive API for working with pre-trained models. You can easily load a model,
tokenize text input, and perform inference using just a few lines of code (see the sketch after this list).

Community and Resources: Hugging Face has a large and active community of developers working on NLP projects. They provide
extensive documentation, tutorials, and example code to help you get started with the library.
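As an illustration of the API, here is a minimal inference sketch using the Auto classes (the checkpoint name is one example from the model hub, not mandated by the slides):
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("Hugging Face makes NLP easy!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])  # e.g. 'POSITIVE'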
