
Sentiment Classification Using BERT

Last Updated : 21 Jan, 2025

BERT stands for Bidirectional Encoder Representations from Transformers and was proposed by researchers at Google AI Language in 2018. Although its original aim was to improve the understanding of the meaning of queries in Google Search, BERT has become one of the most important and versatile architectures for natural language tasks, producing state-of-the-art results on sentence-pair classification, question answering, and more.

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a powerful technique for natural language processing that can improve how well computers comprehend human language. The foundation of BERT is the idea of exploiting bidirectional context to acquire complex and insightful word and phrase representations. By simultaneously examining both sides of a word’s context, BERT can capture a word’s whole meaning in its context, in contrast to earlier models that only considered the left or right context of a word. This enables BERT to deal with ambiguous and complex linguistic phenomena including polysemy, co-reference, and long-distance relationships.
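
To make the idea of context-dependent representations concrete, here is a minimal sketch that runs two sentences containing the word “bank” through the plain pre-trained encoder and compares the two contextual vectors of “bank”. It uses TFBertModel with the standard bert-base-uncased checkpoint (an assumption for illustration; the rest of this post uses the classification variant of the model).

Python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

# Load the plain encoder (no classification head)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoder = TFBertModel.from_pretrained('bert-base-uncased')

sentences = ["He sat on the bank of the river.",
             "She deposited the cheque at the bank."]
enc = tokenizer(sentences, padding=True, return_tensors='tf')
outputs = encoder(enc)  # last_hidden_state has shape (batch, seq_len, 768)

# Pull out the contextual vector of the token "bank" in each sentence
bank_id = tokenizer.convert_tokens_to_ids('bank')
vectors = []
for i in range(len(sentences)):
    position = int(tf.argmax(tf.cast(enc['input_ids'][i] == bank_id, tf.int32)))
    vectors.append(outputs.last_hidden_state[i, position])

# The two vectors differ because each one reflects its own sentence context
similarity = -tf.keras.losses.cosine_similarity(vectors[0], vectors[1])
print('Cosine similarity between the two "bank" vectors:', float(similarity))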

The paper also proposed task-specific architectures built on top of BERT. In this post, we will use BERT for a sentiment classification task, specifically the single-sentence classification architecture used for the CoLA (Corpus of Linguistic Acceptability) binary classification task.


Single Sentence Classification Task

The paper proposed two versions of BERT:

  • BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
  • BERT (LARGE): 24 layers of encoder stack with 24 bidirectional self-attention heads and 1024 hidden units.

For the TensorFlow implementation, Google provides both BERT BASE and BERT LARGE in two variants: Uncased and Cased. In the uncased version, all letters are lowercased before WordPiece tokenization.
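
As a quick illustration of the difference (assuming the standard Hugging Face checkpoints bert-base-uncased and bert-base-cased), the sketch below tokenizes the same sentence with both variants.

Python
from transformers import BertTokenizer

uncased_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
cased_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

text = "BERT Handles Casing Differently"
print(uncased_tokenizer.tokenize(text))  # letters are lowercased before WordPiece
print(cased_tokenizer.tokenize(text))    # original casing is preserved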

Sentiment Classification Using BERT:

Step 1: Import the necessary libraries

Python
import os
import shutil
import tarfile
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
import pandas as pd
from bs4 import BeautifulSoup
import re
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Step 2: Load the dataset

Python
# Get the current working directory
current_folder = os.getcwd()

dataset = tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    cache_dir=current_folder,
    extract=True)

Output:

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84125825/84125825 [==============================] - 12s 0us/step

Check the dataset folder

Python
dataset_path = os.path.dirname(dataset)
# Check the dataset
os.listdir(dataset_path)

Output:

['aclImdb.tar.gz', 'aclImdb']

Check the ‘aclImdb’ directory

Python
# Dataset directory
dataset_dir = os.path.join(dataset_path, 'aclImdb')

# Check the Dataset directory
os.listdir(dataset_dir)

Output:

['README', 'test', 'imdb.vocab', 'imdbEr.txt', 'train']

Check the ‘Train’ dataset folder

Python
train_dir = os.path.join(dataset_dir,'train')
os.listdir(train_dir)

Output:

['urls_pos.txt',
'urls_neg.txt',
'labeledBow.feat',
'neg',
'unsup',
'unsupBow.feat',
'urls_unsup.txt',
'pos']

Read the files in the ‘train’ directory

Python
for file in os.listdir(train_dir):
    file_path = os.path.join(train_dir, file)
    # Check if it's a file (not a directory)
    if os.path.isfile(file_path): 
        with open(file_path, 'r', encoding='utf-8') as f:
            first_value = f.readline().strip()
            print(f"{file}: {first_value}")
    else:
        print(f"{file}: {file_path}")

Output:

urls_pos.txt: http://www.imdb.com/title/tt0453418/usercomments
urls_neg.txt: http://www.imdb.com/title/tt0064354/usercomments
labeledBow.feat: 9 0:9 1:1 2:4 3:4 4:6 5:4 6:2 7:2 8:4 10:4 12:2 26:1 27:1 28:1 29:2 32:1 41:1 45:1 47:1 50:1 54:2 57:1 59:1 63:2 64:1 66:1 68:2 70:1 72:1 78:1 100:1 106:1 116:1 122:1 125:1 136:1 140:1 142:1 150:1 167:1 183:1 201:1 207:1 208:1 213:1 217:1 230:1 255:1 321:5 343:1 357:1 370:1 390:2 468:1 514:1 571:1 619:1 671:1 766:1 877:1 1057:1 1179:1 1192:1 1402:2 1416:1 1477:2 1940:1 1941:1 2096:1 2243:1 2285:1 2379:1 2934:1 2938:1 3520:1 3647:1 4938:1 5138:4 5715:1 5726:1 5731:1 5812:1 8319:1 8567:1 10480:1 14239:1 20604:1 22409:4 24551:1 47304:1
neg: /content/datasets/aclImdb/train/neg
unsup: /content/datasets/aclImdb/train/unsup
unsupBow.feat: 0 0:8 1:6 3:5 4:2 5:1 7:1 8:5 9:2 10:1 11:2 13:3 16:1 17:1 18:1 19:1 22:3 24:1 26:3 28:1 30:1 31:1 35:2 36:1 39:2 40:1 41:2 46:2 47:1 48:1 52:1 63:1 67:1 68:1 74:1 81:1 83:1 87:1 104:1 105:1 112:1 117:1 131:1 151:1 155:1 170:1 198:1 225:1 226:1 288:2 291:1 320:1 331:1 342:1 364:1 374:1 384:2 385:1 407:1 437:1 441:1 465:1 468:1 470:1 519:1 595:1 615:1 650:1 692:1 851:1 937:1 940:1 1100:1 1264:1 1297:1 1317:1 1514:1 1728:1 1793:1 1948:1 2088:1 2257:1 2358:1 2584:2 2645:1 2735:1 3050:1 4297:1 5385:1 5858:1 7382:1 7767:1 7773:1 9306:1 10413:1 11881:1 15907:1 18613:1 18877:1 25479:1
urls_unsup.txt: http://www.imdb.com/title/tt0018515/usercomments
pos: /content/datasets/aclImdb/train/pos

Load the movie reviews and convert them into a pandas DataFrame with their respective sentiments

Here, 0 means Negative and 1 means Positive.

Python
def load_dataset(directory):
    data = {"sentence": [], "sentiment": []}
    for file_name in os.listdir(directory):
        print(file_name)
        if file_name == 'pos':
            positive_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(positive_dir):
                text = os.path.join(positive_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(1)
        elif file_name == 'neg':
            negative_dir = os.path.join(directory, file_name)
            for text_file in os.listdir(negative_dir):
                text = os.path.join(negative_dir, text_file)
                with open(text, "r", encoding="utf-8") as f:
                    data["sentence"].append(f.read())
                    data["sentiment"].append(0)
            
    return pd.DataFrame.from_dict(data)

Load the training dataset

Python
# Load the dataset from the train_dir
train_df = load_dataset(train_dir)
print(train_df.head())

Output:

urls_pos.txt
urls_neg.txt
labeledBow.feat
neg
unsup
unsupBow.feat
urls_unsup.txt
pos
                                            sentence  sentiment
0  When I rented this movie, I had very low expec...          0
1  'Major Payne' is a film about a major who make...          0
2  I'd been following this films progress for qui...          0
3  Although the beginning suggests All Quiet on t...          0
4  Cabin Fever is the first feature film directed...          0

Load the test dataset

Python
test_dir = os.path.join(dataset_dir,'test')

# Load the dataset from the test_dir
test_df = load_dataset(test_dir)
print(test_df.head())

Output:

urls_pos.txt
urls_neg.txt
labeledBow.feat
neg
pos
                                            sentence  sentiment
0  The movie is nothing extraordinary. As a matte...          0
1  Rented the video with a lot of expectations, b...          0
2  The first time I saw a commercial for this sho...          0
3  We can conclude that there are 10 types of peo...          0
4  I seem to remember a lot of hype about this mo...          0

Step 3: Preprocessing

Python
sentiment_counts = train_df['sentiment'].value_counts()

# Map the numeric labels to readable names for the x-axis and the colour legend
sentiment_names = sentiment_counts.index.map({0: 'Negative', 1: 'Positive'})

fig = px.bar(x=sentiment_names,
             y=sentiment_counts.values,
             color=sentiment_names,
             color_discrete_sequence=px.colors.qualitative.Dark24)

fig.update_layout(title='Sentiments Counts',
                  xaxis_title='Sentiment',
                  yaxis_title='Counts',
                  template='plotly_dark')

# Show the bar chart
fig.show()
pyo.plot(fig, filename='Sentiments Counts.html', auto_open=True)

Output:


Sentiment Counts

Text Cleaning

Python
def text_cleaning(text):
    # Strip HTML tags (e.g. <br />) from the raw review
    soup = BeautifulSoup(text, "html.parser")
    # Remove any text enclosed in square brackets
    text = re.sub(r'\[[^]]*\]', '', soup.get_text())
    # Keep only letters, digits, whitespace, commas and apostrophes
    pattern = r"[^a-zA-Z0-9\s,']"
    text = re.sub(pattern, '', text)
    return text
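
Before applying it to the whole dataset, a quick sanity check with a made-up review (the sample string below is purely illustrative) shows that HTML tags, bracketed text, and stray punctuation are stripped:

Python
# Hypothetical sample review, just to illustrate what text_cleaning removes
sample = "<br />I loved it! [contains spoilers] A solid 9/10, would watch again."
print(text_cleaning(sample))
# The <br /> tag, the bracketed note and characters such as '!' and '/' are gone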

Apply text_cleaning to the train and test datasets

Python
# Train dataset
train_df['Cleaned_sentence'] = train_df['sentence'].apply(text_cleaning)
# Test dataset
test_df['Cleaned_sentence'] = test_df['sentence'].apply(text_cleaning)

Plot reviews as word clouds

Python
# Function to generate word cloud
def generate_wordcloud(text,Title):
    all_text = " ".join(text)
    wordcloud = WordCloud(width=800, 
                          height=400,
                          stopwords=set(STOPWORDS), 
                          background_color='black').generate(all_text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(Title)
    plt.show()

Positive Reviews

Python
positive = train_df[train_df['sentiment']==1]['Cleaned_sentence'].tolist()
generate_wordcloud(positive,'Positive Review')

Output:


Positive Reviews WordCloud

Negative Reviews

Python
negative = train_df[train_df['sentiment']==0]['Cleaned_sentence'].tolist()
generate_wordcloud(negative,'Negative Review')

Output:


Negative Reviews WordCloud

Separate the input text and the target sentiment for both the train and test sets

Python
# Training data
#Reviews = "[CLS] " +train_df['Cleaned_sentence'] + "[SEP]"
Reviews = train_df['Cleaned_sentence']
Target = train_df['sentiment']

# Test data
#test_reviews =  "[CLS] " +test_df['Cleaned_sentence'] + "[SEP]"
test_reviews = test_df['Cleaned_sentence']
test_targets = test_df['sentiment']

Split the test data into validation and test sets

Python
x_val, x_test, y_val, y_test = train_test_split(test_reviews,
                                                    test_targets,
                                                    test_size=0.5, 
                                                    stratify = test_targets)

Step 4: Tokenization & Encoding

BERT tokenization converts the raw text into the numerical inputs that can be fed into the BERT model. It tokenizes the text and performs some preprocessing to prepare it for the model’s input format. Let’s understand some of the key features of BERT tokenization.

  • The BERT tokenizer splits words into subwords, or wordpieces. For example, the word “geeksforgeeks” can be split into pieces such as “geeks”, “##for”, and “##geeks”. The “##” prefix indicates that the subword is a continuation of the previous one. This reduces the vocabulary size and helps the model deal with rare or unknown words.
  • The BERT tokenizer adds special tokens like [CLS], [SEP], and [MASK] to the sequence. These tokens have special meanings:
    • [CLS] is used for classification and represents the entire input in the case of sentiment analysis,
    • [SEP] is used as a separator, i.e. to mark the boundaries between different sentences or segments,
    • [MASK] is used for masking, i.e. to hide some tokens from the model during pre-training.
  • The BERT tokenizer returns these components as outputs (see the short sketch after this list):
    • input_ids: the numerical identifiers of the vocabulary tokens,
    • token_type_ids: identifies which segment or sentence each token belongs to,
    • attention_mask: flags that tell the model which tokens to pay attention to and which to disregard.
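
Below is a short, standalone sketch (separate from the training pipeline that follows, and assuming the same bert-base-uncased tokenizer loaded in the next step) showing what the tokenizer produces for a single sentence; the exact subword split depends on the checkpoint’s WordPiece vocabulary.

Python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Subword split; the exact pieces depend on the WordPiece vocabulary
print(tokenizer.tokenize("geeksforgeeks is great"))

# Encode a single sentence, padding it to a fixed length of 12 tokens
enc = tokenizer.encode_plus("geeksforgeeks is great",
                            padding='max_length',
                            truncation=True,
                            max_length=12,
                            return_tensors='tf')
print(enc['input_ids'])       # starts with [CLS] (id 101) and contains [SEP] (id 102)
print(enc['token_type_ids'])  # all zeros, since there is only one segment
print(enc['attention_mask'])  # 1 for real tokens, 0 for [PAD] positions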

Load the pre-trained BERT tokenizer

Python
#Tokenize and encode the data using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Apply the BERT tokenization in training, testing and validation dataset

Python
max_len= 128
# Tokenize and encode the sentences
X_train_encoded = tokenizer.batch_encode_plus(Reviews.tolist(),
                                              padding=True, 
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')

X_val_encoded = tokenizer.batch_encode_plus(x_val.tolist(), 
                                              padding=True, 
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')

X_test_encoded = tokenizer.batch_encode_plus(x_test.tolist(), 
                                              padding=True, 
                                              truncation=True,
                                              max_length = max_len,
                                              return_tensors='tf')

Check the encoded dataset

Python
k = 0
print('Training Comments -->>',Reviews[k])
print('\nInput Ids -->>\n',X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->>\n',tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->>\n',X_train_encoded['attention_mask'][k])
print('\nLabels -->>',Target[k])

Output:

Training Comments -->> When I rented this movie, I had very low expectationsbut when I saw it, I realized that the movie was less a lot less than what I expected The actors were bad the doctor's wife was one of the worst, the story was so stupidit could work for a Disney movie except for the murders, but this one is not a comedy, it is a laughable masterpiece of stupidity The title is well chosen except for one thing they could add stupid movie after Dead Husbands I give it 0 and a half out of 5

Input Ids -->>
tf.Tensor(
[ 101 2043 1045 12524 2023 3185 1010 1045 2018 2200 2659 10908
8569 2102 2043 1045 2387 2009 1010 1045 3651 2008 1996 3185
2001 2625 1037 2843 2625 2084 2054 1045 3517 1996 5889 2020
2919 1996 3460 1005 1055 2564 2001 2028 1997 1996 5409 1010
1996 2466 2001 2061 5236 4183 2071 2147 2005 1037 6373 3185
3272 2005 1996 9916 1010 2021 2023 2028 2003 2025 1037 4038
1010 2009 2003 1037 4756 3085 17743 1997 28072 1996 2516 2003
2092 4217 3272 2005 2028 2518 2027 2071 5587 5236 3185 2044
2757 19089 1045 2507 2009 1014 1998 1037 2431 2041 1997 1019
102 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)

Decoded Ids -->>
[CLS] when i rented this movie, i had very low expectationsbut when i saw it, i realized that the movie was less a lot less than what i expected the actors were bad the doctor's wife was one of the worst, the story was so stupidit could work for a disney movie except for the murders, but this one is not a comedy, it is a laughable masterpiece of stupidity the title is well chosen except for one thing they could add stupid movie after dead husbands i give it 0 and a half out of 5 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Attention Mask -->>
tf.Tensor(
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(128,), dtype=int32)

Labels -->> 0

Step 5: Build the classification model

Load the model

Python
# Initialize the model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Output:

model.safetensors: 100%  ------------------ 440M/440M [00:07<00:00, 114MB/s]
All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

As the warning indicates, the classification head ('classifier.weight' and 'classifier.bias') is newly initialized, so the model needs to be fine-tuned on our downstream task. Only if a checkpoint had already been fine-tuned on a task similar to ours could TFBertForSequenceClassification be used for predictions without further training.

Compile the model

Python
# Compile the model with an appropriate optimizer, loss function, and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

Train the model

Python
# Step 5: Train the model
history = model.fit(
    [X_train_encoded['input_ids'], X_train_encoded['token_type_ids'], X_train_encoded['attention_mask']],
    Target,
    validation_data=(
      [X_val_encoded['input_ids'], X_val_encoded['token_type_ids'], X_val_encoded['attention_mask']],y_val),
    batch_size=32,
    epochs=3
)

Output:

Epoch 1/3
782/782 [==============================] - 808s 980ms/step - loss: 0.3348 - accuracy: 0.8480 - val_loss: 0.2891 - val_accuracy: 0.8764
Epoch 2/3
782/782 [==============================] - 765s 979ms/step - loss: 0.1963 - accuracy: 0.9238 - val_loss: 0.2984 - val_accuracy: 0.8906
Epoch 3/3
782/782 [==============================] - 764s 978ms/step - loss: 0.1007 - accuracy: 0.9632 - val_loss: 0.3652 - val_accuracy: 0.8816

Step 6: Evaluate the model

Python
#Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
    [X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']],
    y_test
)
print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')

Output:

391/391 [==============================] - 106s 271ms/step - loss: 0.3560 - accuracy: 0.8798
Test loss: 0.3560144007205963, Test accuracy: 0.8797600269317627

Save the model and tokenizer to the local folder

Python
path = '/content'
# Save tokenizer
tokenizer.save_pretrained(path +'/Tokenizer')

# Save model
model.save_pretrained(path +'/Model')

# This code is modified by Susobhan Akhuli

Load the model and tokenizer from the local folder

Python
# Load tokenizer
bert_tokenizer = BertTokenizer.from_pretrained(path +'/Tokenizer')

# Load model
bert_model = TFBertForSequenceClassification.from_pretrained(path +'/Model')

Predict the sentiment of the test dataset

Python
pred = bert_model.predict(
    [X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']])

# pred is of type TFSequenceClassifierOutput
logits = pred.logits

# Use argmax along the appropriate axis to get the predicted labels
pred_labels = tf.argmax(logits, axis=1)

# Convert the predicted labels to a NumPy array
pred_labels = pred_labels.numpy()

label = {
    1: 'positive',
    0: 'Negative'
}

# Map the predicted labels to their corresponding strings using the label dictionary
pred_labels = [label[i] for i in pred_labels]
Actual = [label[i] for i in y_test]

print('Predicted Label :', pred_labels[:10])
print('Actual Label    :', Actual[:10])

Output:

391/391 [==============================] - 108s 270ms/step
Predicted Label : ['positive', 'positive', 'Negative', 'Negative', 'Negative', 'positive', 'Negative', 'positive', 'Negative', 'Negative']
Actual Label : ['positive', 'Negative', 'Negative', 'Negative', 'Negative', 'positive', 'Negative', 'positive', 'Negative', 'Negative']

Classification Report

Python
print("Classification Report: \n", classification_report(Actual, pred_labels))

Output:

Classification Report: 
              precision    recall  f1-score   support

    Negative       0.87      0.90      0.88      6250
    positive       0.90      0.86      0.88      6250

    accuracy                           0.88     12500
   macro avg       0.88      0.88      0.88     12500
weighted avg       0.88      0.88      0.88     12500

Step 7: Prediction with user inputs

Python
def Get_sentiment(Review, Tokenizer=bert_tokenizer, Model=bert_model):
    # Convert Review to a list if it's not already a list
    if not isinstance(Review, list):
        Review = [Review]

    Input_ids, Token_type_ids, Attention_mask = Tokenizer.batch_encode_plus(Review,
                                                                             padding=True,
                                                                             truncation=True,
                                                                             max_length=128,
                                                                             return_tensors='tf').values()
    prediction = Model.predict([Input_ids, Token_type_ids, Attention_mask])

    # Use argmax along the appropriate axis to get the predicted labels
    pred_labels = tf.argmax(prediction.logits, axis=1)

    # Convert the TensorFlow tensor to a NumPy array and then to a list to get the predicted sentiment labels
    pred_labels = [label[i] for i in pred_labels.numpy().tolist()]
    return pred_labels

Let’s predict with our own review

Python
Review ='''Bahubali is a blockbuster Indian movie that was released in 2015. 
It is the first part of a two-part epic saga that tells the story of a legendary hero who fights for his kingdom and his love. 
The movie has received rave reviews from critics and audiences alike for its stunning visuals, 
spectacular action scenes, and captivating storyline.'''
Get_sentiment(Review)

Output:

1/1 [==============================] - 3s 3s/step
['positive']

You can download the source code: Sentiment Classification Using BERT


