Sentiment analysis, also known as opinion mining, computationally identifies and categorizes opinions expressed in text data. It involves analyzing the polarity (positive, negative or neutral) of textual content to gauge the sentiment or attitude of the author. In the context of customer reviews, sentiment analysis helps businesses understand how customers perceive their products or services. In this article, we delve into the world of sentiment analysis for customer reviews using the R Programming Language.
Understanding the Dataset
The dataset used in this project contains TripAdvisor Hotel Reviews where each row represents a customer review. The dataset includes the following key columns:
- S.No.: Serial number of the review.
- Review: The actual customer review text.
- Rating: Customer rating (usually on a scale of 1 to 5, representing their experience).
We will focus on the Review column, which contains the textual data we need to analyze for sentiment. The Rating column can serve as an additional reference to check how well our sentiment scores match the numerical ratings.
You can download the dataset from here: TripAdvisor
1. Installing and Loading Required Packages
We need to install and load the required R packages.
- tm: Provides text mining functions
- SnowballC: Implements stemming for text data
- syuzhet: Provides sentiment analysis functions
- tidyverse: A collection of packages for data manipulation
- wordcloud: Used for visualizing word frequencies
- ggplot2: Used for data visualization
install.packages(c("tm", "SnowballC", "syuzhet", "tidyverse", "wordcloud", "ggplot2"))
library(tm)
library(SnowballC)
library(syuzhet)
library(tidyverse)
library(wordcloud)
library(ggplot2)
2. Loading the Dataset
Next, we will load the CSV file containing the reviews. The str() function will display the structure of the dataframe, showing the data types of each column and a preview of the data.
data <- read.csv("/content/tripadvisor.csv", header = TRUE)
str(data)
Output:

3. Creating and Inspecting the Corpus
We convert the review text to a character vector and create a corpus for text processing.
corpus <- iconv(data$Review, to = "UTF-8", sub = "byte")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])
Output:

4. Cleaning the Corpus
We will clean the text by converting it to lowercase, removing punctuation, numbers, stopwords, extra whitespaces and applying stemming.
cleaned_corpus <- tm_map(corpus, content_transformer(tolower))
cleaned_corpus <- tm_map(cleaned_corpus, removePunctuation)
cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords('english'))
cleaned_corpus <- tm_map(cleaned_corpus, stripWhitespace)
cleaned_corpus <- tm_map(cleaned_corpus, stemDocument)
inspect(cleaned_corpus[1:5])
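Beyond the standard English stopwords, review corpora often contain domain words (for example "hotel" or "room") that dominate counts without carrying sentiment. A sketch of extending the stopword list; the extra terms below are illustrative choices, not part of the original pipeline:

```r
# Extend the default English stopword list with domain-specific terms
# (the terms below are illustrative; choose them based on your corpus)
custom_stops <- c(stopwords("english"), "hotel", "room", "stay")
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, custom_stops)
```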
Output:

5. Sampling the Data
We will sample a subset of reviews to make the analysis more manageable.
set.seed(123)
sampled_reviews <- sample(data$Review, 200)
sampled_corpus <- Corpus(VectorSource(iconv(sampled_reviews, to = "UTF-8", sub = "byte")))
6. Cleaning the Sampled Corpus
We will now clean the sampled corpus in the same way we cleaned the full corpus.
cleaned_sampled_corpus <- tm_map(sampled_corpus, content_transformer(tolower))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removePunctuation)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeNumbers)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeWords, stopwords('english'))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stripWhitespace)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stemDocument)
7. Creating a Sparse Term-Document Matrix
We create a Term-Document Matrix (TDM) from the sampled corpus using TF-IDF weighting, which down-weights terms that appear in almost every review.
tdm_sparse <- TermDocumentMatrix(cleaned_sampled_corpus, control = list(weighting = weightTfIdf))
tdm_m_sparse <- as.matrix(tdm_sparse)
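The matrix above is sparse only in how it is stored; tm's removeSparseTerms() can additionally prune rare terms to cut memory use. A sketch with a hypothetical threshold (tdm_pruned is a name introduced here; 0.99 keeps terms that are missing from at most 99% of documents):

```r
# Drop terms that are absent from more than 99% of documents
# (threshold is illustrative; tune it to your corpus size)
tdm_pruned <- removeSparseTerms(tdm_sparse, sparse = 0.99)
tdm_pruned
```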
8. Analyzing Term Frequencies
We aggregate each term's weight across documents (with TF-IDF weighting this reflects importance rather than raw counts) and display the top terms.
term_freq <- rowSums(tdm_m_sparse)
term_freq_sorted <- sort(term_freq, decreasing = TRUE)
tdm_d_sparse <- data.frame(word = names(term_freq_sorted), freq = term_freq_sorted)
head(tdm_d_sparse, 5)
Output:

9. Performing Sentiment Analysis
We use three different methods (syuzhet, bing, afinn) to perform sentiment analysis on the text data.
text <- iconv(data$Review, to = "UTF-8", sub = "byte")
syuzhet_vector <- get_sentiment(text, method = "syuzhet")
cat("Syuzhet method:", head(syuzhet_vector), "\n")
bing_vector <- get_sentiment(text, method = "bing")
cat("Bing method:", head(bing_vector), "\n")
afinn_vector <- get_sentiment(text, method = "afinn")
cat("Afinn method:", head(afinn_vector), "\n")
Output:

10. Comparing Sentiment Methods
We compare the sign (positive, negative or neutral) of the first few scores from the three methods to see whether they agree on the direction of sentiment.
rbind(
sign(head(syuzhet_vector)),
sign(head(bing_vector)),
sign(head(afinn_vector))
)
Output:

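Since the dataset also carries a numerical Rating column, one way to sanity-check the lexicons is to correlate each method's scores with the star ratings. A sketch, assuming the vectors computed above are still in the workspace:

```r
# Correlate each lexicon's scores with the 1-5 star ratings;
# a clearly positive correlation suggests the scores track customer ratings
cor(syuzhet_vector, data$Rating, use = "complete.obs")
cor(bing_vector,    data$Rating, use = "complete.obs")
cor(afinn_vector,   data$Rating, use = "complete.obs")
```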
Visualization of Sentiment Analysis for Customer Reviews in R
We will now visualize the sentiment analysis results using different methods, including a Word Cloud, Sentiment Histogram, Emotion Bar Plot and Pie Chart of Sentiment Distribution.
1. Word Cloud
We create a word cloud to visualize the most frequent terms in the reviews. A word cloud provides a quick and intuitive way to visualize the most common words in a text corpus, making it easier to identify patterns and trends.
wordcloud(words = tdm_d_sparse$word, freq = tdm_d_sparse$freq,
min.freq = 5, max.words = 100, colors = brewer.pal(8, "Dark2"))
Output:

Words with higher frequencies appear larger and more prominent in the word cloud. The colors are drawn from the specified palette to help distinguish individual words.
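wordcloud() also accepts layout options beyond those used above; a sketch with non-default settings (the scale values are illustrative):

```r
# Place the most frequent words at the center and control the size range
wordcloud(words = tdm_d_sparse$word, freq = tdm_d_sparse$freq,
          min.freq = 5, max.words = 100,
          random.order = FALSE,   # plot most frequent words first (center)
          scale = c(3, 0.5),      # largest and smallest word sizes
          colors = brewer.pal(8, "Dark2"))
```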
2. Sentiment Histogram
We create a histogram to visualize the distribution of sentiment scores using the Syuzhet method. A histogram allows for a quick assessment of the overall sentiment distribution within the sampled text data.
text_sampled <- iconv(sampled_reviews, to = "UTF-8", sub = "byte")
syuzhet_vector_sampled <- get_sentiment(text_sampled, method = "syuzhet")
ggplot(data.frame(syuzhet_vector_sampled), aes(x = syuzhet_vector_sampled)) +
geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
labs(title = "Sentiment Distribution using Syuzhet Method (Sampled Data)",
x = "Sentiment Score", y = "Frequency") +
theme_minimal()
Output:

Each bar in the histogram represents a range of sentiment scores and the height of the bar indicates the frequency of occurrence of sentiment scores within that range.
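The continuous scores can also be bucketed into discrete labels for reporting; a sketch using base R's cut() (the breakpoints are an illustrative choice that treats scores near zero as neutral):

```r
# Bucket continuous sentiment scores into three discrete labels
sentiment_label <- cut(syuzhet_vector_sampled,
                       breaks = c(-Inf, -0.01, 0.01, Inf),
                       labels = c("Negative", "Neutral", "Positive"))
table(sentiment_label)
```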
3. Bar Plot of Emotions
We will use the ggplot2 package to create a bar plot of emotions, with sentiment scores categorized into different emotions using the NRC lexicon.
nrc_sampled <- get_nrc_sentiment(text_sampled)
nrct_sampled <- data.frame(t(nrc_sampled))
nrcs_sampled <- data.frame(rowSums(nrct_sampled))
nrcs_sampled <- cbind("sentiment" = rownames(nrcs_sampled), nrcs_sampled)
rownames(nrcs_sampled) <- NULL
names(nrcs_sampled)[2] <- "frequency"
nrcs_sampled <- nrcs_sampled %>% mutate(percent = frequency/sum(frequency))
nrcs2_sampled <- nrcs_sampled[1:8, ]
colnames(nrcs2_sampled)[1] <- "emotion"
ggplot(nrcs2_sampled, aes(x = reorder(emotion, -frequency), y = frequency,
fill = emotion)) +
geom_bar(stat = "identity") +
labs(title = "Emotion Distribution (Sampled Data)", x = "Emotion", y = "Frequency") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
Output:

The bar plot shows the distribution of emotions based on sentiment analysis using the NRC lexicon on the sampled dataset. Each bar represents a different emotion and the height of the bar indicates the frequency of that emotion within the text data. The colors of the bars are determined by the specified color palette, allowing for easy visualization of different emotions.
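The percent column computed above is not used in the plot; it can drive a proportion view instead of raw counts. A sketch, assuming nrcs2_sampled from the code above:

```r
# Plot emotion shares (proportions) rather than raw frequencies
ggplot(nrcs2_sampled, aes(x = reorder(emotion, -percent), y = percent,
                          fill = emotion)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Emotion Share (Sampled Data)", x = "Emotion", y = "Share") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
```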
4. Bar Plot of Most Popular Words
Creating a bar plot of the most popular words in a text dataset involves visualizing the frequency distribution of words within the corpus. This visualization helps in identifying the most common words in the text data.
tdm_d_sparse <- tdm_d_sparse[1:10, ]
tdm_d_sparse$word <- reorder(tdm_d_sparse$word, tdm_d_sparse$freq)
ggplot(tdm_d_sparse, aes(x = word, y = freq, fill = word)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Most Popular Words", x = "Word", y = "Frequency") +
theme_minimal()
Output:

The horizontal bar plot shows the frequency of the top 10 most popular words in the text data. Each bar represents a word and the length of the bar indicates the frequency of that word in the dataset. The colors of the bars are determined by the words themselves, providing visual differentiation between them.
5. Pie Chart of Sentiment Distribution
Creating a pie chart of sentiment distribution involves visualizing the proportion of different sentiment categories within a dataset.
library(ggplot2)
library(RColorBrewer)
sentiment_df <- data.frame(
sentiment = c("Positive", "Negative", "Neutral"),
count = c(sum(syuzhet_vector_sampled > 0), sum(syuzhet_vector_sampled < 0),
sum(syuzhet_vector_sampled == 0))
)
ggplot(sentiment_df, aes(x = "", y = count, fill = sentiment)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Sentiment Distribution", x = "", y = "") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
Output:

The pie chart shows the distribution of sentiment categories within the dataset. Each segment of the pie chart represents a sentiment category ("Positive", "Negative", "Neutral") and the size of each segment corresponds to the count of that sentiment category in the dataset. The colors of the segments are determined by the specified color palette, allowing for easy differentiation between sentiment categories.
Conclusion
From our analysis, we can see that the majority of customers had a positive experience at the hotels they reviewed on TripAdvisor, with trust, joy and anticipation being the emotions expressed most often.