Unit2 Full
text = "I love this library. It's so simple to use! However, sometimes it can be a bit slow."
blob = TextBlob(text)
# Analyze sentiment
for sentence in blob.sentences:
print(f"Sentence: {sentence}")
print(f"Sentiment: {sentence.sentiment}")
5. Tokenization
Split text into words and sentences.
blob = TextBlob("TextBlob is a great tool. It makes NLP tasks simple.")
words = blob.words
sentences = blob.sentences
print("Words:", words)
print("Sentences:", sentences)
6. Lemmatization
Reduce words to their base or root form.
from textblob import Word
words = ["running", "jumps", "easily", "fairly"]
lemmatized_words = [Word(word).lemmatize() for word in words]
print(lemmatized_words)
7. Removing Non-Alphanumeric Characters
Remove characters that are not letters or numbers.
import re
text = "This is a sample sentence with numbers 123 and symbols #!@."
blob = TextBlob(text)
cleaned_text = re.sub(r'\W+', ' ', blob.raw)
print(cleaned_text)
9. Stemming
Reduce words to their stem or root form (less sophisticated than lemmatization).
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "jumps", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
Combining Data Cleaning Steps
Combining multiple data cleaning steps into a single process.
from textblob import TextBlob
from nltk.corpus import stopwords
import string
import re
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
def clean_text(text):
    # Lowercasing
    text = text.lower()
    # Removing punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Removing stopwords
    words = TextBlob(text).words
    filtered_words = [word for word in words if word not in stop_words]
    # Removing non-alphanumeric characters
    cleaned_text = re.sub(r'\W+', ' ', ' '.join(filtered_words))
    # Removing extra whitespace
    cleaned_text = ' '.join(cleaned_text.split())
    return cleaned_text
text = "This is a sample TEXT with punctuation, numbers 123, and stopwords!"
cleaned_text = clean_text(text)
TextBlob is a versatile tool for a variety of NLP tasks; a short code sketch covering several of these follows the list below:
Data Cleaning: Correct spelling in multiple sentences.
Tokenization: Tokenize complex text into words and sentences.
POS Tagging: Tag words in a complex sentence with their parts of speech.
Noun Phrase Extraction: Extract noun phrases from a paragraph.
Sentiment Analysis: Analyze the sentiment of multiple sentences.
Translation and Language Detection: Translate and detect the language of multiple texts (note that TextBlob's translation and language-detection helpers are deprecated in newer releases, since they relied on the Google Translate API).
Text Classification: Train and use a classifier with more extensive data.
Basic NLP Tasks: Pluralize and singularize words.
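Several of the tasks in the list above (POS tagging, noun phrase extraction, pluralization and singularization) have not yet appeared in code. The following is a minimal sketch, assuming the TextBlob corpora have already been downloaded with python -m textblob.download_corpora; the example sentence and words are arbitrary.
from textblob import TextBlob, Word
blob = TextBlob("The quick brown fox jumps over the lazy dog near the old wooden fence.")
# Part-of-speech tags as (word, tag) pairs
print(blob.tags)
# Noun phrases detected in the sentence
print(blob.noun_phrases)
# Pluralize and singularize individual words
print(Word("cat").pluralize())      # expected: 'cats'
print(Word("dogs").singularize())   # expected: 'dog'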
Introduction to Transformers
Transformers are a model architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. They have revolutionized natural language processing (NLP) by enabling highly efficient training and achieving state-of-the-art results across a wide range of tasks. Transformers rely on a mechanism called self-attention, which lets the model consider the entire input sequence when making a prediction and capture context and dependencies between words more effectively than previous architectures.
Hugging Face Transformers is a popular open-source library that provides easy access to a vast array of pre-trained Transformer models for NLP. It lets you quickly use these models for inference or fine-tune them on your own datasets for specific tasks such as text classification, question answering, machine translation, or text generation; fine-tuning a pre-trained model on a specific dataset is often enough to reach state-of-the-art performance. The library supports many transformer-based models, including BERT, GPT, RoBERTa, and T5, making it a versatile tool for NLP practitioners.
Key Concepts
Self-Attention: The self-attention mechanism enables the model to weigh the importance of different words in a sentence when
encoding a word. This allows the model to capture context and dependencies between words, regardless of their distance in the
sequence (a minimal code sketch appears after this list of concepts).
Positional Encoding: Since Transformers do not have a built-in notion of the order of words (unlike RNNs or LSTMs), positional
encodings are added to the input embeddings to provide information about the position of each word in the sequence.
Multi-Head Attention: Multi-head attention allows the model to focus on different parts of the sentence simultaneously,
improving its ability to capture various aspects of the context.
Feed-Forward Networks: Each position in the sequence is processed independently by a feed-forward neural network, adding
non-linearity and complexity to the model.
Layer Normalization: Normalization layers are used to stabilize and speed up training by normalizing the mean and variance of
the activations.
Encoder-Decoder Architecture: The original Transformer architecture consists of an encoder and a decoder, making it suitable
for sequence-to-sequence tasks such as translation. The encoder processes the input sequence, and the decoder generates the output
sequence, attending to the encoder's output.
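To make the self-attention and positional-encoding descriptions above concrete, here is a minimal PyTorch sketch (an illustration of the formulas from "Attention Is All You Need", not the internals of any particular library; the toy tensor sizes are arbitrary):
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Each output row is a weighted average of the rows of V, with weights
    # derived from how well the corresponding query matches each key.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ V

def sinusoidal_positional_encoding(seq_len, d_model):
    # Fixed sine/cosine encodings added to token embeddings so the model
    # can distinguish positions (the formulation of Vaswani et al., 2017).
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-torch.log(torch.tensor(10000.0)) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Toy example: a "sequence" of 4 token embeddings of size 8
x = torch.randn(4, 8) + sinusoidal_positional_encoding(4, 8)
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(out.shape)   # torch.Size([4, 8])
Multi-head attention simply runs several such attention computations in parallel on learned projections of Q, K, and V and concatenates the results.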
Practical Implementation
The Hugging Face Transformers library provides a comprehensive implementation of various Transformer models, making it easy to use
them for different NLP tasks. Here's an introduction to using the library:
Installation
Install the transformers library (the examples below also require PyTorch):
pip install transformers torch
Loading a Pre-trained Model
Here’s how to load a pre-trained BERT model and tokenizer for a simple text classification task:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Example text
text = "Transformers are a groundbreaking innovation in NLP."
# Tokenize the input text
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
# Make predictions (note: the classification head of this model is newly
# initialized, so the predicted class is only meaningful after fine-tuning)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
print(f"Predicted class: {predictions.item()}")
Fine-Tuning a Pre-trained Model
Fine-tuning a pre-trained Transformer model on your specific dataset involves training the model on labeled examples. Here's a basic example of how to fine-tune BERT on a text classification dataset:
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
# Load dataset (example: IMDB)
dataset = load_dataset('imdb')
train_dataset = dataset['train'].map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length'), batched=True)
test_dataset = dataset['test'].map(lambda e: tokenizer(e['text'], truncation=True, padding='max_length'), batched=True)
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
# Train the model
trainer.train()
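After training, the same Trainer object can report metrics on the held-out split. A minimal follow-up (only the evaluation loss is reported unless a compute_metrics function is passed to the Trainer):
# Evaluate on the test split and print the resulting metrics
eval_results = trainer.evaluate()
print(eval_results)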
Common Transformer Models
BERT (Bidirectional Encoder Representations from Transformers): Designed for pre-training deep bidirectional
representations by jointly conditioning on both left and right context in all layers.
GPT (Generative Pre-trained Transformer): An autoregressive model designed for generating text and fine-tuning on
various downstream tasks.
T5 (Text-To-Text Transfer Transformer): Converts all NLP tasks into a text-to-text format, simplifying the input-
output interface.
RoBERTa (Robustly Optimized BERT Approach): An optimized version of BERT with improved training strategies.
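All of these architectures can be loaded through the same Auto* classes. The sketch below is illustrative only: it downloads each tokenizer and base encoder by its standard Hugging Face Hub identifier, which requires an internet connection on first use.
from transformers import AutoTokenizer, AutoModel
# The same two lines work for any architecture hosted on the Hub
for checkpoint in ["bert-base-uncased", "gpt2", "roberta-base", "t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    print(checkpoint, "->", model.config.model_type)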
Applications of Transformers
Text Classification: Sentiment analysis, spam detection, etc.
Named Entity Recognition (NER): Identifying entities like names, dates, and locations in text.
Machine Translation: Translating text from one language to another.
Text Generation: Generating coherent and contextually relevant text.
Question Answering: Answering questions based on context from a given passage.
Summarization: Creating concise summaries of long documents.
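The highest-level entry point for most of these applications is the pipeline API. A brief sketch, assuming the default checkpoint for each task is acceptable (the models are downloaded automatically on first use):
from transformers import pipeline
# Sentiment analysis / text classification
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP tasks much easier."))
# Question answering over a short context
qa = pipeline("question-answering")
print(qa(question="What mechanism do Transformers rely on?",
         context="Transformers rely on self-attention to model long-range dependencies in text."))
# Summarization of a longer passage
summarizer = pipeline("summarization")
print(summarizer("Transformers rely on self-attention to model long-range dependencies in text. "
                 "They power state-of-the-art systems for translation, question answering, summarization, and more.",
                 max_length=30, min_length=5))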
DistilBERT
DistilBERT, a smaller and faster version of BERT, is widely used for various natural language processing (NLP)
tasks, including text classification and sentiment analysis. Below, I'll provide an overview and examples of how to
use DistilBERT for these tasks using the Hugging Face Transformers library.
Text Classification with DistilBERT
Text classification involves categorizing text into predefined labels. Here's how you can use DistilBERT for this task:
1. Installation
Install the necessary libraries:
pip install transformers datasets
2. Loading the Model and Tokenizer
Load the DistilBERT model and tokenizer from Hugging Face:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
# Load your dataset (assumed to have 'text' and 'label' columns and
# 'train'/'test' splits, which the Trainer setup below relies on)
dataset = load_dataset('csv', data_files='path/to/your/dataset.csv')
3. Tokenize the Dataset
Tokenize the dataset to prepare it for training:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
4. Define Training Arguments and Trainer
Set up the training arguments and the Trainer:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)
DistilBERT is a versatile model that can be effectively used for text classification and sentiment analysis tasks. Whether you use pre-trained models directly or fine-tune them on your datasets, the Hugging Face Transformers library provides robust tools to streamline the process.
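The step above stops after constructing the Trainer; as in the BERT example earlier, training is launched with trainer.train(). For sentiment analysis without any fine-tuning, a DistilBERT checkpoint already fine-tuned on SST-2 can be used directly (the checkpoint name below is a commonly used public model on the Hugging Face Hub):
# Launch fine-tuning (mirrors the earlier BERT example)
trainer.train()
# Sentiment analysis with a ready-made fine-tuned DistilBERT checkpoint
from transformers import pipeline
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
print(sentiment("I really enjoy working with DistilBERT!"))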
Model Hub: The library provides a model hub (https://huggingface.co/models) where you can discover and download pre-trained models
and tokenizer files for your specific task. This makes it easy to access state-of-the-art models and use them in your projects.
Fine-Tuning: One of the key features of Hugging Face Transformers is its support for fine-tuning pre-trained models on custom datasets.
This allows you to adapt a pre-trained model to perform well on your specific task or domain.
Easy-to-Use API: The library provides a simple and intuitive API for working with pre-trained models. You can easily load a model,
tokenize text input, and perform inference using just a few lines of code.
Community and Resources: Hugging Face has a large and active community of developers working on NLP projects. They provide
extensive documentation, tutorials, and example code to help you get started with the library.
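As an illustration of the "few lines of code" point above, here is a minimal hub-to-inference sketch; the checkpoint name is one widely used sentiment model and can be swapped for any other model discovered on the hub:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Any checkpoint name from https://huggingface.co/models can be used here
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
inputs = tokenizer("Hugging Face makes transformer models easy to use.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted])   # e.g. 'POSITIVE'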