Sentiment analysis, also known as opinion mining, computationally identifies and categorizes opinions expressed in text data. It involves analyzing the polarity (positive, negative or neutral) of textual content to gauge the sentiment or attitude of the author. In the context of customer reviews, sentiment analysis helps businesses understand how customers perceive their products or services. In this article, we delve into the world of sentiment analysis for customer reviews using the R Programming Language.
Understanding the Dataset
The dataset used in this project contains TripAdvisor Hotel Reviews where each row represents a customer review. The dataset includes the following key columns:
- S.No.: Serial number of the review.
- Review: The actual customer review text.
- Rating: Customer rating (usually on a scale of 1 to 5, representing their experience).
We will focus on the Review column, which contains the textual data we need to analyze for sentiment. The Rating column can serve as an additional reference to check how well our sentiment scores match the numerical ratings.
You can download the dataset from here: TripAdvisor
1. Installing and Loading Required Packages
We need to install and load the required R packages.
- tm: Provides text mining functions
- SnowballC: Implements stemming for text data
- syuzhet: Provides sentiment analysis functions
- tidyverse: A collection of packages for data manipulation
- wordcloud: Used for visualizing word frequencies
- ggplot2: Used for data visualization
install.packages(c("tm", "SnowballC", "syuzhet", "tidyverse", "wordcloud", "ggplot2"))
library(tm)
library(SnowballC)
library(syuzhet)
library(tidyverse)
library(wordcloud)
library(ggplot2)
2. Loading the Dataset
Next, we will load the CSV file containing the reviews. The str() function will display the structure of the dataframe, showing the data types of each column and a preview of the data.
data <- read.csv("/content/tripadvisor.csv", header = TRUE)
str(data)
Output:

3. Creating and Inspecting the Corpus
We convert the review text to a character vector and create a corpus for text processing.
corpus <- iconv(data$Review, to = "UTF-8", sub = "byte")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])
Output:

4. Cleaning the Corpus
We will clean the text by converting it to lowercase, removing punctuation, numbers, stopwords, extra whitespaces and applying stemming.
cleaned_corpus <- tm_map(corpus, content_transformer(tolower))
cleaned_corpus <- tm_map(cleaned_corpus, removePunctuation)
cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords('english'))
cleaned_corpus <- tm_map(cleaned_corpus, stripWhitespace)
cleaned_corpus <- tm_map(cleaned_corpus, stemDocument)
inspect(cleaned_corpus[1:5])
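Beyond the standard English stopwords, review corpora often contain domain words (for example "hotel" or "room") that dominate counts without carrying sentiment. A sketch of extending the stopword list; the extra terms below are illustrative choices, not part of the original pipeline:

```r
# Extend the default English stopword list with domain-specific terms
# (the terms below are illustrative; choose them based on your corpus)
custom_stops <- c(stopwords("english"), "hotel", "room", "stay")
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, custom_stops)
```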
Output:

5. Sampling the Data
We will sample a subset of reviews to make the analysis more manageable.
set.seed(123)
sampled_reviews <- sample(data$Review, 200)
sampled_corpus <- Corpus(VectorSource(iconv(sampled_reviews, to = "UTF-8", sub = "byte")))
6. Cleaning the Sampled Corpus
We will now clean the sampled corpus in the same way we cleaned the full corpus.
cleaned_sampled_corpus <- tm_map(sampled_corpus, content_transformer(tolower))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removePunctuation)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeNumbers)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeWords, stopwords('english'))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stripWhitespace)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stemDocument)
7. Creating a Sparse Term-Document Matrix
We create a Term-Document Matrix (TDM) from the sampled corpus using TF-IDF weighting, which down-weights terms that appear in almost every review.
tdm_sparse <- TermDocumentMatrix(cleaned_sampled_corpus, control = list(weighting = weightTfIdf))
tdm_m_sparse <- as.matrix(tdm_sparse)
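The matrix above is sparse only in how it is stored; tm's removeSparseTerms() can additionally prune rare terms to cut memory use. A sketch with a hypothetical threshold (tdm_pruned is a name introduced here; 0.99 keeps terms that are missing from at most 99% of documents):

```r
# Drop terms that are absent from more than 99% of documents
# (threshold is illustrative; tune it to your corpus size)
tdm_pruned <- removeSparseTerms(tdm_sparse, sparse = 0.99)
tdm_pruned
```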
8. Analyzing Term Frequencies
We aggregate each term's weight across documents (with TF-IDF weighting this reflects importance rather than raw counts) and display the top terms.
term_freq <- rowSums(tdm_m_sparse)
term_freq_sorted <- sort(term_freq, decreasing = TRUE)
tdm_d_sparse <- data.frame(word = names(term_freq_sorted), freq = term_freq_sorted)
head(tdm_d_sparse, 5)
Output:

9. Performing Sentiment Analysis
We use three different methods (syuzhet, bing, afinn) to perform sentiment analysis on the text data.
text <- iconv(data$Review, to = "UTF-8", sub = "byte")
syuzhet_vector <- get_sentiment(text, method = "syuzhet")
cat("Syuzhet method:", head(syuzhet_vector), "\n")
bing_vector <- get_sentiment(text, method = "bing")
cat("Bing method:", head(bing_vector), "\n")
afinn_vector <- get_sentiment(text, method = "afinn")
cat("Afinn method:", head(afinn_vector), "\n")
Output:

10. Comparing Sentiment Methods
We compare the sign (positive, negative or neutral) of the first few scores from the three methods to see whether they agree on the direction of sentiment.
rbind(
sign(head(syuzhet_vector)),
sign(head(bing_vector)),
sign(head(afinn_vector))
)
Output:

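Since the dataset also carries a numerical Rating column, one way to sanity-check the lexicons is to correlate each method's scores with the star ratings. A sketch, assuming the vectors computed above are still in the workspace:

```r
# Correlate each lexicon's scores with the 1-5 star ratings;
# a clearly positive correlation suggests the scores track customer ratings
cor(syuzhet_vector, data$Rating, use = "complete.obs")
cor(bing_vector,    data$Rating, use = "complete.obs")
cor(afinn_vector,   data$Rating, use = "complete.obs")
```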
Visualization of Sentiment Analysis for Customer Reviews in R
We will now visualize the sentiment analysis results using different methods, including a Word Cloud, Sentiment Histogram, Emotion Bar Plot and Pie Chart of Sentiment Distribution.
1. Word Cloud
We create a word cloud to visualize the most frequent terms in the reviews. A word cloud provides a quick and intuitive way to visualize the most common words in a text corpus, making it easier to identify patterns and trends.
wordcloud(words = tdm_d_sparse$word, freq = tdm_d_sparse$freq,
min.freq = 5, max.words = 100, colors = brewer.pal(8, "Dark2"))
Output:

Words with higher frequencies appear larger and more prominent in the word cloud. The colors are drawn from the specified palette to help distinguish individual words.
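wordcloud() also accepts layout options beyond those used above; a sketch with non-default settings (the scale values are illustrative):

```r
# Place the most frequent words at the center and control the size range
wordcloud(words = tdm_d_sparse$word, freq = tdm_d_sparse$freq,
          min.freq = 5, max.words = 100,
          random.order = FALSE,   # plot most frequent words first (center)
          scale = c(3, 0.5),      # largest and smallest word sizes
          colors = brewer.pal(8, "Dark2"))
```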
2. Sentiment Histogram
We create a histogram to visualize the distribution of sentiment scores using the Syuzhet method. A histogram allows for a quick assessment of the overall sentiment distribution within the sampled text data.
text_sampled <- iconv(sampled_reviews, to = "UTF-8", sub = "byte")
syuzhet_vector_sampled <- get_sentiment(text_sampled, method = "syuzhet")
ggplot(data.frame(syuzhet_vector_sampled), aes(x = syuzhet_vector_sampled)) +
geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
labs(title = "Sentiment Distribution using Syuzhet Method (Sampled Data)",
x = "Sentiment Score", y = "Frequency") +
theme_minimal()
Output:

Each bar in the histogram represents a range of sentiment scores and the height of the bar indicates the frequency of occurrence of sentiment scores within that range.
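The continuous scores can also be bucketed into discrete labels for reporting; a sketch using base R's cut() (the breakpoints are an illustrative choice that treats scores near zero as neutral):

```r
# Bucket continuous sentiment scores into three discrete labels
sentiment_label <- cut(syuzhet_vector_sampled,
                       breaks = c(-Inf, -0.01, 0.01, Inf),
                       labels = c("Negative", "Neutral", "Positive"))
table(sentiment_label)
```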
3. Bar Plot of Emotions
We will use the ggplot2 package to create a bar plot of emotions, with sentiment scores categorized into different emotions using the NRC lexicon.
nrc_sampled <- get_nrc_sentiment(text_sampled)
nrct_sampled <- data.frame(t(nrc_sampled))
nrcs_sampled <- data.frame(rowSums(nrct_sampled))
nrcs_sampled <- cbind("sentiment" = rownames(nrcs_sampled), nrcs_sampled)
rownames(nrcs_sampled) <- NULL
names(nrcs_sampled)[2] <- "frequency"
nrcs_sampled <- nrcs_sampled %>% mutate(percent = frequency/sum(frequency))
nrcs2_sampled <- nrcs_sampled[1:8, ]
colnames(nrcs2_sampled)[1] <- "emotion"
ggplot(nrcs2_sampled, aes(x = reorder(emotion, -frequency), y = frequency,
fill = emotion)) +
geom_bar(stat = "identity") +
labs(title = "Emotion Distribution (Sampled Data)", x = "Emotion", y = "Frequency") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
Output:

The bar plot shows the distribution of emotions based on sentiment analysis using the NRC lexicon on the sampled dataset. Each bar represents a different emotion and the height of the bar indicates the frequency of that emotion within the text data. The colors of the bars are determined by the specified color palette, allowing for easy visualization of different emotions.
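The percent column computed above is not used in the plot; it can drive a proportion view instead of raw counts. A sketch, assuming nrcs2_sampled from the code above:

```r
# Plot emotion shares (proportions) rather than raw frequencies
ggplot(nrcs2_sampled, aes(x = reorder(emotion, -percent), y = percent,
                          fill = emotion)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Emotion Share (Sampled Data)", x = "Emotion", y = "Share") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
```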
4. Bar Plot of Most Popular Words
Creating a bar plot of the most popular words in a text dataset involves visualizing the frequency distribution of words within the corpus. This visualization helps in identifying the most common words in the text data.
tdm_d_sparse <- tdm_d_sparse[1:10, ]
tdm_d_sparse$word <- reorder(tdm_d_sparse$word, tdm_d_sparse$freq)
ggplot(tdm_d_sparse, aes(x = word, y = freq, fill = word)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Most Popular Words", x = "Word", y = "Frequency") +
theme_minimal()
Output:

The horizontal bar plot shows the frequency of the top 10 most popular words in the text data. Each bar represents a word and the length of the bar indicates the frequency of that word in the dataset. The colors of the bars are determined by the words themselves, providing visual differentiation between them.
5. Pie Chart of Sentiment Distribution
Creating a pie chart of sentiment distribution involves visualizing the proportion of different sentiment categories within a dataset.
library(ggplot2)
library(RColorBrewer)
sentiment_df <- data.frame(
sentiment = c("Positive", "Negative", "Neutral"),
count = c(sum(syuzhet_vector_sampled > 0), sum(syuzhet_vector_sampled < 0),
sum(syuzhet_vector_sampled == 0))
)
ggplot(sentiment_df, aes(x = "", y = count, fill = sentiment)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Sentiment Distribution", x = "", y = "") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
Output:

The pie chart shows the distribution of sentiment categories within the dataset. Each segment of the pie chart represents a sentiment category ("Positive", "Negative", "Neutral") and the size of each segment corresponds to the count of that sentiment category in the dataset. The colors of the segments are determined by the specified color palette, allowing for easy differentiation between sentiment categories.
Conclusion
From our analysis, we can see that the majority of customers had a positive experience at the hotels they reviewed on TripAdvisor, with trust, joy and anticipation being the emotions expressed most often.