
1504-Business Analytics [BCPH-DSC16-S6-CC4], Cover, January 18, 2025


BUSINESS ANALYTICS

[FOR LIMITED CIRCULATION]

Editor

Dr. Charu Gupta


Content Writer

Ms. Asha Yadav


Academic Coordinator

Deekshant Awasthi

Department of Distance and Continuing Education
E-mail: dceprinting@col.du.ac.in
commerce@col.du.ac.in

Published by:
Department of Distance and Continuing Education
Campus of Open Learning, School of Open Learning,
University of Delhi, Delhi-110007

Printed by:
School of Open Learning, University of Delhi

Reviewer
Ms. Aishwarya Anand Arora

Corrections/Modifications/Suggestions proposed by Statutory Body, DU/
Stakeholder/s in the Self Learning Material (SLM) will be incorporated in
the next edition. However, these corrections/modifications/suggestions will be
uploaded on the website https://sol.du.ac.in. Any feedback or suggestions may
be sent at the email- [email protected]

Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (11800 Copies, 2025)

Department of Distance & Continuing Education, Campus of Open Learning,


School of Open Learning, University of Delhi
Syllabus
Business Analytics

Syllabus Mapping

Unit - I: Introduction
Data and Data Science; Data analytics and data analysis, Classification of Analytics, Application of analytics in business, Types of data: nominal, ordinal, scale; Big Data and its characteristics, Applications of Big data. Challenges in data analytics.
Mapped to Lesson 1: Introduction to Data Science (Pages 1–18)

Unit - II: Data Preparation, Summarisation and Visualisation Using Spreadsheet
Data Preparation and Cleaning, Sort and filter, Conditional formatting, Text to Column, Removing Duplicates, Data Validation, identifying outliers in the data, covariance and correlation matrix, Moving Averages, Finding the missing value from data; Summarisation; Visualisation: scatter plots, line charts, histogram, etc., Pivot Tables, pivot charts and interactive dashboards.
Mapped to Lesson 2: Data Preparation, Summarisation and Visualisation Using Spreadsheet (Pages 19–50)

Unit - III: Getting Started with R
Introduction to R, Advantages of R, Installation of R Packages, Importing data from spreadsheet files, Commands and Syntax, Packages and Libraries, Data Structures in R - Vectors, Matrices, Arrays, Lists, Factors, Data Frames, Conditionals and Control Flows, Loops, Functions, and Apply family.
Mapped to Lesson 3: Getting Started with R (Pages 51–67) and Lesson 4: Data Structures in R (Pages 68–101)

Unit - IV: Descriptive Statistics Using R
Importing Data file; Data visualisation using charts: histograms, bar charts, box plots, line graphs, scatter plots, etc.; Data description: Measure of Central Tendency, Measure of Dispersion, Relationship between variables: Covariance, Correlation and coefficient of determination.
Mapped to Lesson 5: Descriptive Statistics Using R (Pages 102–118)

Unit - V: Predictive and Textual Analytics
Simple Linear Regression models; Confidence & Prediction intervals; Multiple Linear Regression; Interpretation of Regression Coefficients; heteroscedasticity; multi-collinearity. Basics of textual data analysis, significance, application, and challenges. Introduction to Textual Analysis using R. Methods and Techniques of textual analysis: Text Mining, Categorization and Sentiment Analysis.
Mapped to Lesson 6: Predictive and Textual Analytics (Pages 119–137)

Contents

PAGE
Lesson 1: Introduction to Data Science 1–18

Lesson 2: Data Preparation, Summarisation and Visualisation Using Spreadsheet 19–50

Lesson 3: Getting Started with R 51–67

Lesson 4: Data Structures in R 68–101

Lesson 5: Descriptive Statistics Using R 102–118

Lesson 6: Predictive and Textual Analytics 119–137

Glossary 139–140

LESSON 1
Introduction to Data Science
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Data and Its Types
1.4 Data Analytics and Data Analysis
1.5 Application of Analytics in Business
1.6 Big Data and Its Characteristics
1.7 Applications of Big Data
1.8 Challenges in Data Analytics
1.9 Summary
1.10 Answers to In-Text Questions
1.11 Self-Assessment Questions
1.12 References
1.13 Suggested Readings

1.1 Learning Objectives


After reading this lesson, students will be able to:
‹ Define key terms like data, data analytics, big data.
‹ Explain the differences between data analytics and data analysis.
‹ Classify different types of data.
‹ Explain the characteristics and applications of big data.

1.2 Introduction
Data Science is an interdisciplinary field that combines statistics, data
analysis, and machine learning to obtain meaningful insights and knowl-
edge from data. It is based on the processes of gathering, analysing data,
and making informed decisions by using the patterns that have evolved.
It is very versatile as it allows businesses and organizations to enhance
decision-making, perform predictive analyses, and discover hidden patterns
within datasets. Data science is applied in many spheres, including banks,
healthcare, manufacturing, and e-commerce, to serve critical applications
such as optimizing routes, forecasting revenues, creating targeted promo-
tional offers, and even predicting election outcomes.
A Data Scientist combines expertise in machine learning, statistics, programming (using tools like R), mathematics, and database management to work with raw data. This involves a systematic approach: asking the right questions to define a problem, gathering and cleaning data, standardizing it for analysis, finding trends, and presenting actionable insights in a clear and impactful manner. By using Data Science, organizations can tap into this capability and discover the full potential of their data, whether that means improving operational efficiency, enhancing customer experiences, or providing competitive advantage. Truly, the field is still growing and expanding its scope and application, making it quite in-demand in today’s data-driven world.
This lesson will serve as your foundation for understanding data and data analysis: how we can classify analytics and how it can be applied to various businesses. We will also understand which data falls into the category of big data, and the various applications and challenges that arise while dealing with big data.

1.3 Data and Its Types


Data is the raw material for extracting information, for example, num-
bers, text, observations or recordings. Data can be structured, i.e. they
are organized into predefined categories or concepts, such as lists, tables,
datasets, databases or spreadsheets. Data can also be unstructured, which
means they are not organized. Unstructured data, like paragraphs of text or

satellite images, need to be processed or parsed to become structured before any further work can be done on them. Data have meaning and value, but these are difficult to identify unless we process the data and apply statistical methods. Such methods are a way of summarizing the data so that the meaning becomes clear: statistical methods are applied to data to derive meaning or find relationships.
It’s important to understand different types of data in order to choose the
appropriate method for analysing data and presenting the results. Data
can be divided into two main categories: qualitative (or categorical) and
quantitative. Qualitative data can be further subdivided into nominal and
ordinal data. Quantitative data can be discrete or continuous and are also
known as numerical data.
Qualitative data represent characteristics, such as gender, languages spoken,
types of disease or clothing sizes. For example, the languages spoken by
a particular person could be Hindi, English, German and Spanish. The
categories are referred to as classes or classifications. Every possible value
for a characteristic should be in one and only one category. When the
categories have no inherent order, the data are called nominal. The data
values in this case are labels; examples include types of disease, gender,
ethnicity, types of pets, and eye color. Nominal data can be analyzed and
summarized using frequencies, proportions, percentages, cross-tabulations,
and the mode, and they can be visualized using pie charts and bar graphs.
Ordinal values represent categorical data that can be ordered. Ordinal
data are very similar to nominal data, but—as the name implies—order
is important. The categories follow some logical order, such as sizes cat-
egorized as small, medium and large. Similarly to nominal data, ordinal
data can be analyzed, summarized and visualized. However, ordinal data
can also be described using percentiles, medians and modes. If the ordinal
data are numeric, interquartile ranges can also be used.
Quantitative data, also called numerical data, can be either discrete or continuous. When the data values are distinct and separate, and can take on certain values only, they are called discrete data. Discrete data can only be counted, not measured; for example, population or the number of cars. Continuous data, on the other hand, represent measurements, not counts, like height, weight and distance.
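The four data types above can be illustrated in R, the language used later in this course. The sketch below uses made-up values for illustration:

```r
# Sketch with made-up values: the four data types in R.

# Nominal: categories with no inherent order -> unordered factor
eye_color <- factor(c("brown", "blue", "brown", "green"))
table(eye_color)                  # frequencies are the natural summary

# Ordinal: ordered categories -> ordered factor
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"), ordered = TRUE)
median(as.integer(size))          # medians/percentiles make sense here

# Discrete: counts that can only take whole-number values
cars_per_household <- c(0L, 1L, 2L, 1L)

# Continuous: measurements on a scale
height_cm <- c(162.5, 171.0, 158.3, 180.2)
mean(height_cm)                   # arithmetic is meaningful for quantitative data
```

Note how the choice of summary follows the data type: frequencies for nominal data, medians for ordinal data, and means only for quantitative data.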

Apart from types of data, quality is also an important aspect. We are exposed to data every day, for example, in news stories, weather reports and advertising, but how can we determine whether the data is of good quality or not? Quality is something that is important throughout the entire data journey. The six aspects of data quality are relevance, accuracy, timeliness, interpretability, coherence, and accessibility.
‹ Relevance: The relevance of data or statistical information reflects
the degree to which it meets the needs of data users. Some questions
that must be answered are, “does this information matter?” “Does
it fill an existing data gap?”
‹ Accuracy: Accurate data give a true reflection of reality; data that are not accurate do not support any fruitful decision and hence have no value.
‹ Timeliness: It is the delay between the time when the data are meaningful and when they are available to the user or decision maker. For example, the stock information of an e-commerce site needs to be updated and available as soon as an order is placed.
‹ Interpretability: Information that people cannot understand has no value and could even be misleading. To avoid such misunderstandings, data is accompanied by metadata, supplementary information or documentation that allows users to interpret the data properly.
‹ Coherence: It can be split into two concepts: consistency and
commonality. Consistency means using the same concepts, definitions
and methods over time. Commonality means using the same or
similar concepts, definitions and methods across different statistical
programs. If there is good consistency and good commonality, then
it is easier to compare results from different studies or track how
they stay the same or change over time.
‹ Accessibility: It is defined as how easy it is for people to find,
get, understand, and use data. When determining whether data
are accessible, make sure they are organized, available, accountable,
and interpretable.

1.4 Data Analytics and Data Analysis
Data analytics and data analysis are two terms frequently used interchangeably, but they have different meanings in the context of working with data and extracting useful insights. While both are relevant for data-driven decision-making, each has its own scope and purpose.
Data analytics refers to the whole process of examining datasets in or-
der to find trends, patterns, relationships, and other insights that might
help in the decision-making processes. There are various techniques and
processes through which data is analyzed and interpreted meaningfully.
Data analytics is usually applied to answer questions or solve problems or
predict outcomes. There are four main types of data analytics: descriptive,
diagnostic, predictive and prescriptive.
‹ Descriptive Analytics: This type of analytics focuses on summarizing
past data and describing what happened. It includes the use of
historical data to identify trends and patterns, often through statistical
measures like mean, median, and mode. Descriptive analytics answers
questions like, “What happened?” and provides insights into the
past performance of an entity or system, like how well a business
performed last year.
‹ Diagnostic Analytics: It is a step ahead, identifying the causes of trends or patterns found in the descriptive analytics phase. It addresses the question “Why did it happen?” by focusing on deeper analysis to understand the root causes of the observed data, like determining the factors behind a drop in subscribers of an Instagram account.
‹ Predictive Analytics: It makes predictions based on historical data using statistical models or machine learning algorithms. It answers the question, “What is likely to happen?”, analyzing trends, patterns, and relationships to anticipate future behaviours or outcomes. For example, predicting the sales of a new trend for the coming six months.
‹ Prescriptive Analytics: This type of analytics suggests possible actions and outcomes based on the analysis. It combines insights from all other types of analytics to answer “What should we do?” Prescriptive analytics recommends solutions for optimizing performance, minimizing risks, or maximizing profits. For example, it can examine past data on inventory levels, supplier performance, and shipping paths to suggest improved supply chain management.
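As a toy illustration of the contrast between descriptive and predictive analytics, the R sketch below uses made-up monthly sales figures and a deliberately simple linear trend as a stand-in for real predictive models:

```r
# Hypothetical monthly sales for six months
sales <- c(120, 135, 150, 160, 175, 190)
month <- 1:6

# Descriptive analytics: what happened?
mean(sales)      # average monthly sales
median(sales)

# Predictive analytics: what is likely to happen?
# Fit a linear trend and forecast month 7.
trend <- lm(sales ~ month)
predict(trend, newdata = data.frame(month = 7))
```

Real predictive models are usually far richer than a straight line, but the division of labour is the same: descriptive measures summarize the past, while a fitted model extrapolates to the future.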
Table 1.1 shows the various analytics types, their focus, and the tools used.

Table 1.1: Classification of Analytics

Type                     Focus                         Question Answered    Techniques/Methods
Descriptive Analytics    Historical Data Analysis      What happened?       Statistical analysis, reporting
Diagnostic Analytics     Cause Analysis                Why did it happen?   Correlation, regression
Predictive Analytics     Future Forecasting            What could happen?   Machine learning, forecasting
Prescriptive Analytics   Actionable Recommendations    What should we do?   Optimization, decision analysis
On the other hand, data analysis refers to inspecting, cleaning,
transforming, and modeling data to discover useful information,
draw conclusions, and support decision-making. It is mostly focused
on extracting insights and answering specific questions from data,
whether qualitative or quantitative. The process of data analysis
can be divided into a number of steps as follows:
‹ Data Collection: Data analysis begins with gathering data from different sources. These may include databases, spreadsheets, surveys, or data from other locations. The quality and relevance of the collected data determine the accuracy of the analysis.
‹ Data Cleaning: Raw data oftentimes contains missing values, duplicates, or inconsistencies. Data cleaning is the identification and resolution of such inaccuracies to ensure the reliability of the dataset. This can include removing or correcting erroneous data, filling in missing values, or standardizing formats.
‹ Data Exploration: After the data has been cleaned, its structure, distribution, and relationships are examined. Exploratory Data Analysis (EDA) applies statistical graphics and summary statistics to make inferences based on patterns, correlations, and outlying observations.
‹ Data Transformation: The steps of transforming data into a format
suitable for analysis involve normalizing values, aggregating data,
or creating new features to enhance the effectiveness of the model.
‹ Data Modeling: Here, statistical, mathematical, or machine learning
models are applied to the data for answering specific research
questions. Common techniques used include regression analysis,
classification algorithms, clustering, and time-series analysis.
‹ Data Interpretation: This is the last stage of data analysis wherein
results are interpreted and conclusions drawn. The process involves
grasping the meaning of the findings as well as using the insights
to support decision-making or make recommendations.
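The steps above can be sketched in a few lines of R. The data frame, column names and values below are made up purely for illustration:

```r
# Data collection: a tiny made-up dataset standing in for a real source
df <- data.frame(
  age    = c(25, 32, NA, 41, 38),
  income = c(30000, 45000, 52000, NA, 61000)
)

# Data cleaning: fill missing values with the column mean
df$age[is.na(df$age)]       <- mean(df$age, na.rm = TRUE)
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

# Data exploration: summary statistics and relationships
summary(df)
cor(df$age, df$income)

# Data transformation: normalise columns to z-scores
df_scaled <- as.data.frame(scale(df))

# Data modelling and interpretation: a simple regression as a stand-in
model <- lm(income ~ age, data = df)
coef(model)
```

In practice each step is far more involved (cleaning alone can dominate a project), but the sequence of collect, clean, explore, transform, model, and interpret stays the same.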
Hence, data analytics is the more comprehensive field, spanning the techniques, tools, and processes used in analyzing data, while data analysis is a subset of data analytics that typically encompasses specific tasks such as cleaning, transforming, and interpreting data. The key difference lies in their purpose: data analytics strives to identify hidden insights and patterns that can be used for business decisions and predictions, whereas data analysis is essentially about extracting meaning from the data to answer specific questions or solve problems. For example, in business, data analytics helps optimize marketing strategy, customer experience, and sales trend forecasts, while data analysis allows businesses to monitor their performance, spot inefficiencies, and make data-driven decisions.
Data analytics and data analysis both play crucial roles in the decision-making processes of every modern industry. Data analytics offers a holistic view to discover insights, whereas data analysis focuses on extracting practical conclusions from data. It is therefore essential to understand the difference and the application of both, to help organizations and people use the power of data effectively.

1.5 Application of Analytics in Business


As discussed in the previous section, data analytics is very important for retrieving meaning from data. Analytics is an essential tool that organizations of all sizes and industries rely on to make informed decisions, enhance processes, and gain a competitive edge in today’s fast-paced and data-driven business world. Businesses can understand their past and present by harnessing data through different forms of analytics, as well as forecast future trends, optimize operations, and create more personalized customer experiences. Let’s see some examples that will help you understand the underlying worth of analytics.
‹ One of the most powerful applications of analytics in business
is understanding customer behavior and preferences. Business
organizations can identify trends, predict needs, and personalize their
offerings to increase customer satisfaction and loyalty by analyzing
customer data. This personalized shopping experience boosts sales
and enhances customer engagement by making it easier for customers
to find what they need. Retailers, e-commerce platforms, and even
service industries like healthcare use analytics to segment customers,
predict their future needs, and tailor their marketing campaigns or
product recommendations accordingly.
‹ Analytics is also important for improving business processes. Analyses of inventory, supply chain logistics, and workforce performance can help tune business processes to lower costs and optimize efficiency. The world’s largest retailer, Walmart, makes use of predictive analytics to optimize inventory management: it considers the seasonality of demand for particular goods, thus ensuring that the right merchandise arrives at the stores in time and is not overstocked. The same approach can be applied across sectors like manufacturing, logistics, and even hospitality, where strong demand forecasting and resource optimization are key to operational success.
‹ Companies need to make wise financial decisions to stay ahead. Analytics can help forecast revenue, manage budgets, and assess financial risks so that companies can have a well-supported decision-making process regarding investments and expenses. For example, predictive analytics can help a bank assess the creditworthiness of someone applying for a loan. By analyzing a consumer’s financial history, spending patterns, or even social media activity, a bank can forecast the probability of loan repayment and charge the appropriate interest rate.

‹ Analytics helps marketing teams better understand the performance of campaigns, customer segmentations, and spend optimization.
Businesses can refine marketing strategies by analyzing customer
data and campaign performance as they improve targeting resulting
in more sales. Netflix makes use of customers’ view data to create
original content that is highly specific in line with user preferences.
Analytics are also used to determine which genres, actors, or themes
will move an audience, thus increasing customer satisfaction and
retention. Businesses track the effectiveness of digital advertising
through analytics, optimize email marketing campaigns, and even
decide what products to feature or at what price to discount based
on what the consumer would do.
‹ Analytics can also apply to Human Resources (HR) by helping
improve employee performance, reduce turnover, and optimize the
hiring process. Through data analytics on employee job performance, satisfaction surveys, and engagement metrics, HR teams are able to make decisions based on data, ultimately leading to a positive workplace culture.
‹ Effective supply chain management is crucial to maintaining smooth
operations and meeting customer demand. Analytics can help
businesses optimize their supply chains by forecasting demand,
managing supplier relationships, and minimizing disruptions.
‹ Analytics can also be a vital tool for developing new products or services. Analyzing customer feedback, industry trends, and market conditions can guide companies in designing products that match consumer wants and, ideally, gain considerable space in the market.
Hence, analytics is no longer a luxury; it is key to business success. Analytics gives businesses the tools to make smarter, data-driven decisions and consequently improve their efficiency, customer focus, and profitability. Whether this means optimizing operations, predicting market trends, or personalizing customer experiences, analytics makes business sustainability possible in the face of a continuously changing environment. As businesses collect ever more data, the opportunities for analytics-driven innovation and success continue to grow.

Whenever analytics is implemented across multiple business operation activities, companies will stay ahead, continuing to deliver value to customers and stakeholders.
IN-TEXT QUESTIONS
1. Data analysis focuses primarily on __________, while data analytics
involves using advanced tools and techniques to __________.
2. Which of the following is NOT one of the four main classifications
of analytics?
(a) Descriptive analytics
(b) Diagnostic analytics
(c) Predictive analytics
(d) Intuitive analytics
3. A company uses predictive analytics to forecast future sales
trends based on historical data. This is an example of __________
analytics.
4. Which of the following is an example of ordinal data?
(a) Customer ID numbers
(b) Ranking of employees (1st, 2nd, 3rd)
(c) Employee names
(d) Age of employees
5. The temperature measured in degrees Celsius is an example of
__________ data.

1.6 Big Data and Its Characteristics


In this section we discuss big data. The field of big data analytics is very popular nowadays because data can be seen everywhere. Let’s understand the term Big Data with a simple day-to-day example: imagine you are in a busy shopping mall during a big sale, like a Black Friday sale. Thousands of people are shopping, and every time someone buys something, data is generated, like what they bought, when they bought it, and how much they spent. Now, think about how many malls, stores, and websites around the world are generating similar information every second.
This huge amount of data is termed big data. Businesses use this data to understand the preferences of customers, predict trends, and even recommend products to you (like “You might also like.”). For example, when Netflix suggests shows based on what you have watched, it is using big data to make smart recommendations tailored just for you.
In the modern technological world, data is expanding far too quickly, and people frequently rely on it. Because of the rate at which data is expanding, it has become increasingly difficult to store it on any single server; the concept of Big Data came into the picture to handle, process, and analyze this huge amount of data. It is a collection of data that is large in volume (data is generated every day, so it grows exponentially with time) and difficult to store and manage, so the traditional methods used to store, manage and handle data proved to be inefficient.
Hence we can say Big data refers to extremely large datasets that are too
complex, vast, or fast-moving to be processed, stored, or analyzed using
traditional data processing methods. It is the accumulation, management,
and analysis of huge amounts of structured, semi-structured, and unstruc-
tured data to expose patterns, trends, and insights.
Consider some real-world examples of big data. On Instagram, many photos and videos are shared across the world every minute. Twitter generates billions of tweets per year, and each tweet can contain text, image, video or audio data. Gmail or Outlook can also serve as examples, as around a billion emails are sent every day and many of them contain attachments like text, video, photos etc. Banks, e-commerce, weather monitoring systems, CCTVs etc. all contribute to big data.
Let’s understand the characteristics of big data, often called the 5 V’s of big data:
‹ Volume: It is the huge amount of data generated, measured in terabytes, petabytes, or even exabytes. For example, the likes, comments and posts shared by billions of users of Facebook every day.

‹ Velocity: The speed at which data is generated, processed, and analyzed. Real-time processing is often necessary; for example, stock market systems process millions of transactions per second to give real-time updates.
‹ Variety: It refers to the different types of data: structured data is ready for modeling and analysis; unstructured data is scattered and cannot be used straight away; semi-structured data lies in between. For example, videos from YouTube, tweets etc. have different structures.
‹ Veracity: It refers to the accuracy, quality, and trustworthiness of
data. Data can be messy, incomplete, or misleading, hence we need
filtering before processing.
‹ Value: The insights and benefits that organizations are able to derive
from the analysis of big data. E-commerce companies use data to
recommend products, increasing sales and customer satisfaction.
These are the characteristics that distinguish big data from traditional data and emphasize its challenges and potential. Some of the key differences between big data and traditional data are shown in Table 1.2.

Table 1.2: Traditional vs. Big Data

Parameter          Traditional Data                              Big Data
Data Size          Limited (gigabytes to terabytes)              Vast (terabytes to petabytes or more)
Data Type          Primarily structured data (rows, columns)     Structured, semi-structured, and unstructured
Processing Speed   Batch processing; slower insights             Real-time or near-real-time processing
Storage            Relational databases (e.g., SQL)              Distributed systems (e.g., Hadoop, NoSQL)
Complexity         Manageable with traditional tools and methods Requires advanced tools and technologies
Technology         Uses tools like RDBMS (MySQL, Oracle)         Uses tools like Hadoop, Spark, NoSQL
Data Sources       Limited (e.g., business transactions, logs)   Multiple (e.g., social media, IoT, sensors)
Analysis Focus     Historical data analysis                      Predictive, real-time, and trend analysis
Scalability        Limited scalability                           Highly scalable using distributed systems
Cost of Analysis   Higher for large datasets                     Cost-efficient with modern big data tools

1.7 Applications of Big Data


Big data is present in almost every field, such as banking, healthcare, and entertainment. In this section we will see some major applications of big data along with examples.
‹ Healthcare: Big data is crucial in healthcare because it enables predictive analytics, which helps providers identify potential health risks and personalize treatments. Based on a patient’s medical history, genetic data, and lifestyle habits, healthcare providers can predict diseases and recommend measures to prevent them. IBM Watson Health uses big data to assist doctors in diagnosing and treating diseases like cancer. It processes massive datasets, including medical journals, patient records, and clinical trials, to provide evidence-based treatment options.
‹ Retail and E-commerce: Retailers leverage big data to analyze
customer behavior and preferences, enabling personalized shopping
experiences. This data-driven approach improves customer satisfaction
and boosts sales. For example, Amazon uses big data to recommend
products based on a user’s past purchases, browsing history, and
items in their cart. For instance, if you search for a smartphone,
Amazon may suggest accessories like cases or screen protectors.
‹ Finance and Banking: Big data analytics enables financial institutions
to identify fraud by monitoring transaction patterns and discovering
anomalies in real time. Such systems utilize machine learning for the
differentiation between normal and suspicious activities. PayPal uses
big data algorithms to monitor transactions for unusual activity. If a
user suddenly makes a purchase from an unfamiliar location or device,
the system flags it and alerts the user, preventing unauthorized access.


‹ Transportation: Transportation systems use big data to analyze traffic patterns, optimize routes, and reduce congestion, improving travel efficiency and safety. Google Maps collects data from GPS devices, sensors, and user inputs to provide real-time traffic updates. For example, if there’s an accident on a highway, Google Maps suggests alternative routes to avoid delays.
‹ Media and Entertainment: Streaming platforms use big data to
analyze viewing habits and preferences, ensuring users receive
personalized content recommendations. This enhances user engagement
and retention. Netflix studies what shows you watch, pause, or skip
and uses this data to recommend similar content. If you binge-watch
a crime series, Netflix might suggest other crime or thriller genres
that align with your interests.
‹ Education: Educational platforms use big data to track student
performance, identify learning gaps, and customize study materials.
This approach makes learning more effective and engaging. Platforms
like Khan Academy analyze user progress to recommend practice
exercises or videos. For instance, if a student struggles with
algebra, the platform suggests targeted lessons to strengthen their
understanding.
‹ Manufacturing: Big data allows manufacturers to keep track of the
performance of the equipment and predict potential failure, which
decreases downtime and maintenance costs. Sensors embedded in
machines collect data; the data is then analyzed to point out patterns
that signal a problem. General Electric monitors its engines and turbines
through big data. If it senses overheating or abnormal vibration in a
machine, it schedules maintenance before it breaks down.
‹ Government: Governments utilize big data to improve urban planning
and optimize resources, creating smarter and more sustainable cities.
For example, Singapore’s Smart Nation initiative uses data from IoT
devices to manage traffic, monitor air quality, and improve public
transport systems. For instance, smart traffic lights adjust their
timings based on real-time vehicle flow data, reducing congestion.
‹ Energy and Utilities: Energy providers use big data to analyze
consumption patterns, forecast demand, and improve energy efficiency.
This helps in reducing costs and promoting sustainability. Smart


meters in homes collect data on electricity usage and send it
to utility companies. For example, if peak usage occurs during
evenings, companies can adjust power distribution to avoid outages
and encourage off-peak usage.
‹ Social Media: Social media sites scan user-generated content such as posts, comments, and tweets to understand public opinion and trends. This is very important for businesses and policymakers. Twitter uses big data for sentiment analysis during elections, analyzing millions of tweets to identify the public’s opinion about candidates, which is useful for campaign strategies.
These examples illustrate how big data is revolutionizing industries by offering innovative solutions to complex challenges. Therefore, understanding and harnessing the power of big data is the need of the hour.

1.8 Challenges in Data Analytics


Although data analytics is overwhelming and powerful, it comes with challenges that organizations need to overcome before they can extract meaning from that power. Below are the major issues faced.
‹ Poor-quality data, such as outdated or completely inconsistent data, is likely to produce incorrect analysis and eventually wrong decisions. For example, if a company uses incorrect customer details, it might send irrelevant marketing messages, reducing customer trust.
‹ Integrating data from diverse sources, such as databases, IoT
devices, and social media, can be complex due to varying formats
and structures. This incompatibility between systems can result in
delays and incomplete analysis.
‹ The sheer size of big data makes storage, processing, and analysis
resource intensive. Handling such petabytes of data requires significant
computational power, which can strain infrastructure.
‹ Processing data in real-time for immediate insights is difficult, especially with high-velocity data streams, but delayed insights can


lead to missed opportunities in time-sensitive scenarios like fraud detection. Hence, processing in real time remains a challenge.
‹ Protecting sensitive data from breaches and ensuring compliance
with data protection regulations (e.g., GDPR, HIPAA) is critical.
Any data leaks can damage reputations and result in legal penalties.
‹ There is a shortage of qualified data scientists and analysts who
can work with advanced tools and techniques. Organizations may
struggle to derive insights from complex datasets, slowing decision-
making processes. Techniques like machine learning and predictive
analytics require advanced knowledge and computational resources.
Smaller organizations might find it challenging to adopt these
methods effectively.
These challenges must be overcome before an organization can realize the vast potential of data analytics and enjoy all its benefits. Investments in technology, skilled professionals, and ethical practices will allow organizations to turn obstacles into opportunities for growth.
IN-TEXT QUESTIONS
6. Which of the following is NOT one of the 5 V’s that define
the characteristics of Big Data?
(a) Volume
(b) Velocity
(c) Veracity
(d) Vulnerability
7. Big Data is used in healthcare to analyze patient data and predict
disease outbreaks. This application is an example of how Big
Data is used in __________.
8. The velocity characteristic of Big Data refers to __________.

1.9 Summary
This lesson introduced the basic concepts of data and its importance in
the field of data science and analytics. It distinguished between data
analysis and data analytics, and gave insight into the types of analytics:


descriptive, diagnostic, predictive, and prescriptive. The lesson highlighted
the importance of data types: nominal, ordinal, and scale, as well as the
characteristics of big data using the 5 V’s: Volume, Velocity, Variety, Veracity, and Value. The real-world applications of big data across various
industries were discussed, along with some of the current challenges in
data analytics, such as data integration, privacy, and the need for highly
competent professionals.

1.10 Answers to In-Text Questions


1. Examining and interpreting data, Discover patterns and insights
2. (d) Intuitive analytics
3. Predictive
4. (b) Ranking of employees (1st, 2nd, 3rd)
5. Scale
6. (d) Vulnerability
7. Healthcare
8. The speed at which data is generated and processed

1.11 Self-Assessment Questions


1. Define the following terms: data, data science, and data analytics.
2. Explain the classification of analytics with examples.
3. Differentiate between nominal, ordinal, and scale data with real-world
examples.
4. What are the characteristics of big data?
5. Discuss three applications of big data in business.
6. List and explain any four challenges in data analytics.

1.12 References
‹ Provost, F., & Fawcett, T. (2013). Data Science for Business.
O’Reilly Media.


‹ Sharda, R., Delen, D., & Turban, E. (2020). Analytics, Data Science,
and Artificial Intelligence: Systems for Decision Support. Pearson.

1.13 Suggested Readings


‹ Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive
analytics, and big data: A revolution that will transform supply chain
design and management. Journal of Business Logistics, 34(2), 77–84.
‹ Davenport, T. H., & Harris, J. G. (2017). Competing on Analytics:
The New Science of Winning. Harvard Business Review Press.



L E S S O N

2
Data Preparation,
Summarisation and
Visualisation Using
Spreadsheet
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
2.1 Learning Objectives
2.2 Data Preparation
2.3 Data Cleaning
2.4 Data Summarization
2.5 Data Sorting
2.6 Filtering Data
2.7 Conditional Formatting
2.8 Text to Column
2.9 Find and Remove Duplicates
2.10 Removing Duplicate Values
2.11 Data Validation
2.12 Identifying Outliers in Data
2.13 Covariance
2.14 Correlation Matrix
2.15 Moving Average
2.16 Finding Missing Values
2.17 Data Summarization
2.18 Data Visualization

2.19 Types of Data Visualizations in Excel
2.20 Pivot Tables
2.21 Pivot Chart
2.22 Interactive Dashboard
2.23 Summary
2.24 Answers to In-Text Questions
2.25 Self-Assessment Questions
2.26 References
2.27 Suggested Readings

2.1 Learning Objectives


After reading this chapter, students will be able to:
‹ Perform data preparation, cleaning, summarization, sorting, filtering, validation, and visualization.
‹ Find and remove duplicates.
‹ Calculate covariance and moving averages.
‹ Apply conditional formatting and correlational analysis.
‹ Create pivot charts, pivot tables, and interactive dashboards.

2.2 Data Preparation


Data preparation is one of the major processes in the pipeline of data
analysis and machine learning. This involves tasks such as data cleaning,
transforming, and arranging raw data in a format that enables effective
analysis or model training. Thus, the aim of data preparation is to ensure
good quality and consistency of data for specific tasks.
Some important steps in data preparation include:
Data Collection: This involves collecting raw data from a large number
of sources, such as databases, spreadsheets, APIs, or even sensors.
Data Cleaning: It involves detecting and correcting errors, inconsistencies,
and missing values in a dataset. Treatment of outliers, duplicate entries,
or irrelevant data points is essential in this stage.


Data Transformation: This refers to the conversion of data into a format or structure fit for analysis. It may involve the normalization or scaling of numeric values, encoding categorical variables, and aggregation or disaggregation of data.
Data Integration: This is the integration of data from different sources
into one dataset. It may involve table merging, dataset joining, or another
kind of data conflict resolution.
Data Reduction: This is a process for reducing either the size or the
complexity of the dataset, and it involves feature selection, dimensionality
reduction, and sampling, among others.
Data Formatting: Consistency in format, including standardized date
formats and variable naming conventions.
Data Splitting: Basically, it is the division of data into subsets, usually
training, validation, and test sets. These sets help a model builder to build
models with the data, tune their hyperparameters, and finally estimate
their performance.
Good data preparation is important in order for one to generate valid
and accurate insights; otherwise, if the data quality is low, meaningful
conclusions will not be obtained.
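Excel is the tool used in this lesson, but the splitting step described above is easy to illustrate in a few lines of code. The following Python sketch is only illustrative; the dataset and the 70/15/15 proportions are assumptions, not a prescribed split:

```python
import random

data = list(range(100))   # a hypothetical dataset of 100 records
random.seed(42)           # fixed seed so the split is reproducible
random.shuffle(data)      # shuffle to avoid ordering bias before slicing

n = len(data)
train = data[: n * 70 // 100]               # 70% for building models
val = data[n * 70 // 100 : n * 85 // 100]   # 15% for tuning hyperparameters
test = data[n * 85 // 100 :]                # 15% for estimating performance
```

Integer arithmetic keeps the subset sizes exact, and every record lands in exactly one subset.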

2.3 Data Cleaning


Data cleaning, also known as data cleansing or data scrubbing, is the
process of identifying, correcting, or removing inaccurate, incomplete,
or irrelevant data from a dataset. This step is crucial to ensure that the
data is of high quality, which is essential for accurate analysis, reporting,
and decision-making.

Key Steps in Data Cleaning:
Handling Missing Data: Detect missing values in the dataset, which
can appear as blanks, NA, null, or other placeholders. This data can be
handled by using the following strategies.
Removal: Delete rows or columns with missing values if they are not
critical.
Imputation: Replace missing values with estimates, such as the mean,
median, mode, or by using more advanced methods like predictive modeling.
Placeholder: Leave the missing values as they are, but flag them for
attention in future analysis.
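The imputation strategies above can be sketched in plain Python; the ages column and its values are hypothetical, and `None` stands in for a blank or NA cell:

```python
from statistics import mean, median

ages = [25, None, 30, 28, None, 30]  # None marks a missing value

known = [a for a in ages if a is not None]

# Mean imputation: replace each missing entry with the mean of known values
by_mean = [a if a is not None else mean(known) for a in ages]

# Median imputation: often preferred when the data contains outliers
by_median = [a if a is not None else median(known) for a in ages]
```

After either pass, the column contains no missing values and can be analyzed normally.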
Removing Duplicate Data: For this, identify duplicate records that may occur due to repeated data entry or merging datasets, and then delete the duplicate values to prevent skewed results in analysis.
Correcting Data Errors: To correct commonly occurring data errors,
identify and rectify the following issues:
Inconsistent Data: Fix inconsistencies in formatting (e.g., date formats, text case) or values (e.g., “NY” vs. “New York”).
Data Entry Errors: Identify and correct typographical errors or mis-entered data, such as incorrect numerical values or misspelled words.
Standardizing Data: Data is standardized by normalization or
transformation.
Normalization: Ensure that data is consistent in format, especially for
categorical data (e.g., “Male” vs. “M” or “1/1/2024” vs. “01-Jan-2024”).
Transformation: Convert data into a common scale or unit, such as
converting all weights to kilograms or all prices to a single currency.
Outlier Detection and Treatment: Detect outliers that fall outside the expected range of values. These could be due to errors or may require special attention. Then, decide whether to remove, correct, or leave outliers in the dataset. Sometimes outliers are valid and should be kept, but in other cases, they may need correction or exclusion.
Validating Data Accuracy: To validate the accuracy of the data, check it against reliable sources or business rules. Also,


ensure that the data is logically consistent, such as ensuring all transactions have corresponding dates.
Removing Irrelevant Data: This can be done by filtering the data, that is, by removing data that is not relevant to the analysis or that does not contribute useful information. This can include unnecessary columns, outdated records, or noise in the data.
Formatting and Structuring Data: This is done by ensuring that data
is in the correct format, such as consistent date formats or proper text
casing. Also, re-structure the data to meet the needs of the analysis, such
as pivoting tables or separating combined fields into distinct columns.
IN-TEXT QUESTIONS
1. What is the primary goal of data cleaning in a spreadsheet?
(a) To improve the appearance of the spreadsheet
(b) To remove inconsistencies and errors in the data
(c) To format data for printing
(d) To reduce the size of the spreadsheet
2. In data cleaning, what does “imputation” refer to?
(a) Removing unnecessary columns
(b) Filling in missing data with estimated values
(c) Filtering out irrelevant data
(d) Detecting outliers

2.4 Data Summarization


Data summarization is the process of condensing a large dataset into a smaller, presentable form for reporting, analysis, and further examination. It involves extracting central insights and patterns from data without losing vital information. It provides a quick overview of the structure and general features of the dataset, which facilitates further analysis and inference.


Key Techniques in Data Summarization:
Data summarization can be further divided into different categories as
given below:
Descriptive Statistics:
Measures of Central Tendency: Summarize data using the mean, median, and mode, which describe the middle of the distribution.
Measures of Dispersion: Describe the spread or variability in data using the range, variance, and standard deviation.
Percentiles and Quartiles: These provide insight into the distribution by indicating the relative standing of the data points.
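These descriptive measures can be computed with Python’s standard `statistics` module; the scores below are hypothetical, and note that `statistics.quantiles` uses an exclusive interpolation method by default, so its quartiles can differ slightly from Excel’s QUARTILE on the same data:

```python
from statistics import mean, median, mode, pstdev, quantiles

scores = [40, 45, 50, 50, 60, 70, 80]  # hypothetical marks

central = (mean(scores), median(scores), mode(scores))  # central tendency
data_range = max(scores) - min(scores)                  # dispersion: range
std_dev = pstdev(scores)                                # population standard deviation

# Quartiles Q1, Q2, Q3 describe the relative standing of data points
q1, q2, q3 = quantiles(scores, n=4)
```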
Data Aggregation: This involves combining many data points into summary values, for example, totaling sales data by month or averaging scores across different categories.
Data Grouping: The grouping of data into categories or segments and
summarizing each group in isolation. This can be done using techniques
like pivot tables which summarize data based on different dimensions.
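Aggregation and grouping amount to a pivot-table-style “group and summarize” operation. A minimal Python sketch (the region names and amounts are made up):

```python
sales = [("East", 100), ("West", 80), ("East", 50), ("West", 120)]

# Group rows by region and sum each group, like a simple pivot table
totals = {}
for region, amount in sales:
    totals[region] = totals.get(region, 0) + amount
```

Swapping the sum for a count or an average gives the other common pivot-table summaries.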
Visualization: Charts and graphs show trends and distributions of data using bar charts, histograms, pie charts, and line graphs. Box plots can be used to visualize the distribution, central value, and variability of the data, as well as possible outliers.
Dimensionality Reduction: Techniques like PCA or t-SNE reduce the number of variables in a dataset, keeping as much of the variability of the data as possible while summarizing it in lower dimensions.
Text Summarization: Methods for summarizing large documents or datasets, such as keyword extraction, topic modeling, or abstract generation.
Data Profiling: This provides information about the structure of a dataset, such as the count of missing values, the data types, or the distribution of categorical variables.

Why Summarize Data?


Simplifies Analysis: Summarization of data makes analysis and interpretation
of large datasets easy. It also helps to quickly identify patterns and trends.


Facilitates Decision-Making: These summaries present data in a way that helps stakeholders make informed decisions.
Improves Reporting: Summaries of data are used in most reports, dashboards, and presentations for effective communication. In other words, data summarization reduces complex datasets to their major parts so that analysts, stakeholders, and decision-makers can understand and act on the information.

2.5 Data Sorting


Sorting helps users to organize data in a specific order. You can sort a
text column in alphabetical order (A-Z or Z-A). We can sort a numerical
column from largest to smallest or smallest to largest. We can also sort a
date and time column from oldest to newest or newest to oldest. In this
section, we will see how data can be sorted in MS Excel.
Example: Sorting a Column in descending order.
Step 1: Select the data and use the shortcut key Ctrl + Shift + L.
Step 2: Click on the down arrow on the column. Select Largest to Smallest
or Z to A.
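The same orderings Excel offers can be sketched in Python with `sorted()`; the sample values below are hypothetical:

```python
sales = [320, 150, 480, 275]
descending = sorted(sales, reverse=True)      # Largest to Smallest

names = ["Ravi", "asha", "Meena"]
alphabetical = sorted(names, key=str.lower)   # A to Z, ignoring case
```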

2.6 Filtering Data


Filters are used to temporarily hide some of the data in a table. This helps
users to focus on the data that is important for the current task at hand.
Example: Filter a range of data
Step 1: Select the column on which to apply the filter.
Step 2: Select Data > Filter.
Step 3: Select the column header arrow.
Step 4: In case of text data, uncheck the values that you want to hide (as shown in the figure).


For filtering data on numeric values, you can even select a comparison,
like Between to see only the values that lie in a given range.

Step 5: Click on OK.
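The effect of a filter, keeping only the rows that satisfy a condition while the rest stay hidden, can be sketched in Python; the order amounts, city names, and the 100-400 range are illustrative only:

```python
orders = [120, 410, 95, 333, 560]

# A "Between" number filter: show only values inside the range
low, high = 100, 400
visible = [v for v in orders if low <= v <= high]

# A text filter: hide the unchecked value
cities = ["Delhi", "Mumbai", "Delhi", "Pune"]
shown = [c for c in cities if c != "Mumbai"]
```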


2.7 Conditional Formatting
Conditional Formatting allows users to fill cells with a certain color depending on a condition. This enhances data visualization and interpretation. It also helps in identifying patterns in data. Let us see how conditional formatting can be done in MS Excel.
Example: Highlight cells that have a value greater than 350.
Step 1: Select the range of cells on which conditional formatting has to be applied.
Step 2: On the Home tab, under the Styles group, click Conditional Formatting.
Step 3: Click Highlight Cells Rules > Greater Than....
Step 4: Enter the desired value and select the formatting style.
Step 5: Click OK.
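The rule above, highlighting cells greater than 350, is simply a predicate applied to each cell. A small Python sketch of the same idea (the values are made up):

```python
values = [120, 410, 350, 560, 275]
threshold = 350

# Mark each cell True/False depending on whether the rule applies,
# the way conditional formatting decides which cells to colour
flags = [v > threshold for v in values]
highlighted = [v for v, f in zip(values, flags) if f]
```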

2.8 Text to Column
Text to column feature is used to separate a single column data into mul-
tiple columns. This enhances readability of the data. For example, if a
column contains first name, last name and profession in a single column,
then this information can be separated in different columns. This allows
columns to have atomic values. Note that this separation is possible only
if multiple values are separated by the same delimiter in the cell. These
delimiters can be Comma, Semicolon, Space, or other characters. Let us
see how we can split data in MS Excel.
Step 1: Select the cell or column that contains the text to be split.
Step 2: Select Data > Text to Columns.
Step 3: In the Convert Text to Columns Wizard displayed on the screen,
select Delimited > Next.
Step 4: Select the Delimiters for your data.
Step 5: Select Next.
Step 6: Preview the split and select Finish.
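Splitting delimited text, as the wizard does, corresponds to a string split on the delimiter. A Python sketch (the names and the comma delimiter are hypothetical):

```python
# One cell holding first name, last name and profession, comma-delimited
cell = "Asha,Yadav,Professor"
first, last, profession = cell.split(",")

# Applying the same split to a whole column of cells
column = ["Asha,Yadav,Professor", "Ravi,Kumar,Analyst"]
split_rows = [row.split(",") for row in column]
```

Each cell becomes a list of atomic values, one per new column.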

2.9 Find and Remove Duplicates


Duplicate data is sometimes useful, but it often just makes the data harder
to understand. Finding, highlighting, and reviewing the duplicates before
removal is better than removing all the duplicates straightaway.

2.10 Removing Duplicate Values


Select the range of cells containing duplicate values that should be re-
moved. To do this in MS Excel,
Step 1: Select the data from which duplicate values have to be removed.
Step 2: Select Data > Remove Duplicates.
Step 3: Uncheck the columns to be purged to remove duplicate records.
Step 4: Click OK.
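Removing duplicates keeps the first occurrence of each record and drops later exact copies. A Python sketch of this behaviour (the sample rows are made up):

```python
rows = [
    ("Asha", "Delhi"),
    ("Ravi", "Mumbai"),
    ("Asha", "Delhi"),   # exact duplicate of the first row
]

seen, unique = set(), []
for row in rows:
    if row not in seen:   # keep only the first occurrence
        seen.add(row)
        unique.append(row)
```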


IN-TEXT QUESTIONS
3. What does Conditional Formatting allow you to do in a spreadsheet?
(a) Apply formulas automatically
(b) Highlight cells based on certain criteria
(c) Change data values based on formatting
(d) Sort data based on custom rules
4. To highlight only duplicate values in a range of data using
Conditional Formatting, which rule would you apply?
(a) Text that contains
(b) Top/Bottom Rules
(c) Highlight Cell Rules > Duplicate Values
(d) New Rule > Use a Formula

2.11 Data Validation


Excel is a powerful tool for data analysis, reporting, and decision-making.
But, the reliability of these activities depends on the accuracy and integrity


of the data. Data validation helps users control the input to ensure accuracy and consistency.
While validating data, specific criteria for accepting data in cell(s) are
set. This restricts users from entering invalid data. Thus, validating data not only enhances the accuracy, reliability, and integrity of data but also cuts the time spent manually checking and correcting data entries. In Excel, this
can be done using the steps given below:
Step 1: Select the Cells for Data Validation
Step 2: In the Data tab, click on Data Validation to open the Data Validation dialog box.
Step 3: In the Data Validation dialog box, under the Settings tab, define
the validation criteria:
Allow: Select the type of data. This data can be Whole Number, Decimal,
List (only values from a predefined list are allowed), Date, Time, Text Length
(only text of a certain length is allowed). The last option is Custom which
is used for more complex criteria and can be specified using a formula.
Data: Specify the condition (e.g., between, not between, equal to, not
equal to, etc.).
Minimum/Maximum: Enter the acceptable range or limits based on the
above selection. For example, to allow values between 100 and 1000,
select “Whole Number,” “between,” and then set the minimum to 100
and the maximum to 1000.
You can optionally configure an Input Message that will appear when the cell is selected. For this, click on the Input Message tab in the dialog box. Give a brief title for the input message box and enter the guidance text that will appear when someone selects the cell. The guidance text instructs the user on what type of data to enter.
Another optional feature in MS Excel is that you can customize the Error Alert. To do this, under the Error Alert tab, specify what should happen if a user enters invalid data:
Show Error Alert after Invalid Data is entered: Check this to enable
error alerts.
Style: Choose from Stop, Warning, or Information to indicate the severity
of the alert.


Title: Enter a title for the error message box.
Error Message: Type the message to be displayed. It must explain the
error and suggest ways to correct it.
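The validation rule configured above, a whole number between two limits plus an error message, boils down to a simple check. The Python sketch below is illustrative; the function name and the 100-1000 limits are assumptions matching the earlier example, not part of Excel itself:

```python
def validate_whole_number(value, minimum=100, maximum=1000):
    """Mimic a 'Whole Number between minimum and maximum' validation rule."""
    if not isinstance(value, int):
        return False, "Enter a whole number."
    if not (minimum <= value <= maximum):
        return False, f"Value must be between {minimum} and {maximum}."
    return True, ""

ok, _ = validate_whole_number(250)        # accepted
bad, message = validate_whole_number(50)  # rejected: below the minimum
```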

2.12 Identifying Outliers in Data


When analyzing, visualizing, and interpreting data, outliers, if present, impact the accuracy, reliability, and usability of the data. Therefore, it is very important to identify and minimize outliers in order to avoid the potential discrepancies they might cause.
Basically, an outlier is a data point or a set of values that are significantly
different from the average or expected range. Presence of outliers can
give detrimental results while forecasting certain crucial values. Thus, to
ensure the accuracy of the data reports, we need to identify the outliers,
calculate their impact and minimize them. To handle outliers in MS Excel,
follow the steps given below:
Review the Data: Errors can creep in data while entering or transferring
data. So, review the data to ensure there are no typos or other errors that
create inaccuracies. This can be done manually or by using automated
tools.
Sort the Data Values: We have already seen how data can be sorted in
MS Excel.


Analyze Data Values: After sorting the values, identify large data discrepancies and outliers to eliminate them. Such values can be straightaway deleted, but a better option is to remove only statistical anomalies.
Identify Data Quartiles: To calculate the outliers in the data, compute quartiles using Excel’s automated quartile formula, beginning with "=QUARTILE()" in an empty cell. After the left parenthesis, specify the first and last cells in your data range separated by a colon, followed by a comma and the quartile you want to define. For example, "=QUARTILE(A5:A50, 1)" finds the value in cells A5 to A50 that marks quartile 1 (the 25th percentile, or the value below which 25% of data points fall when arranged in increasing order), and "=QUARTILE(B2:B200, 3)" similarly returns the third quartile.
Define the Interquartile Range (IQR): IQR represents the expected
average range of the data (without outlier values). It is calculated by
subtracting the first quartile from the third quartile.
Calculate the Upper and Lower Bounds: Defining the upper and lower
bounds of the data allows identification of values that fall above the
upper bound or below the lower bound.
Calculate the upper bound of the data by multiplying the IQR by 1.5 and
adding the result to the third quartile: "= Q3 + (1.5 * IQR)".
Similarly, to find the lower bound of the data, multiply the IQR by 1.5
and subtract the result from the first quartile: "= Q1 - (1.5 * IQR)".
Remove the Outliers: After defining the upper and lower bounds of data,
review the data to identify values that are higher than the upper bound
or lower than the lower bound. These values are statistical outliers. So,
delete them for more accurate analysis or visualization reports.

2.13 Covariance
Covariance is a statistical measure of the joint variability of
two random variables, given two sets of data. To calculate the population
covariance in Excel, use the COVARIANCE.P function. The syntax is
=COVARIANCE.P(array1, array2), where
Array1 is the first range or array of numeric values.



DATA PREPARATION, SUMMARISATION AND VISUALISATION USING SPREADSHEET

Array2 is the second range or array of numeric values.


Note the following points:
‹ If the given arrays contain text or logical values, they are ignored
by the COVARIANCE in Excel function.
‹ The data should contain numbers, names, arrays, or references that are
numeric. If some cells do not contain numeric data, they are ignored.
‹ The data sets should be of the same size, with the same number
of data points.
‹ The data sets should be neither empty nor should the standard
deviation of their values be zero.
To find covariance in Excel and determine if there is any relation between
the two columns C and D, we can write =COVARIANCE.P(C1:C10,D1:D10).
Mathematically, population covariance is calculated as:
COV(X, Y) = Σ(x − x̄)(y − ȳ) / n
where x̄ and ȳ are the means of the two data sets and n is the number
of data points.
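Outside Excel, the same population covariance can be reproduced step by step. The short Python sketch below uses illustrative paired data (an assumption, not taken from the text) and applies the formula directly.

```python
# Illustrative paired data (hypothetical values).
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 7.0, 9.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Population covariance: average product of deviations from the means.
cov_p = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
print(cov_p)  # 7.0, the same value =COVARIANCE.P(...) would return for this data
```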

2.14 Correlation Matrix


A correlation matrix is a table that displays the correlation coefficients
for different variables. The matrix depicts the correlation between all the
possible pairs of values in a table. Such a table is very useful to summarize
a large dataset and to identify and visualize patterns in the given data.
A correlation matrix consists of rows and columns that show the correlation
coefficient between the variables. The correlation matrix is helpful in the
analysis of multiple linear regression models where several independent
variables are present.
Correlation is a statistical measure that describes the extent to which
two or more variables are related to each other. It indicates the strength
and direction of a relationship between variables. When two variables
are correlated, a change in one variable is associated with changes in
another—either positively or negatively.
‹ Positive Correlation: When values of two variables increase or
decrease together, they are said to be positively correlated. For


example, height and weight are positively correlated; as height
increases, weight tends to increase as well.
‹ Negative Correlation: When two values are negatively correlated, an
increase in one variable results in decline of the other. For example,
speed and time are negatively correlated. When speed increases it
takes less time to reach the destination.
Figure shows this concept graphically.

In Excel, the CORREL function returns the Pearson correlation coefficient
for two sets of values. Its syntax is CORREL(array1, array2),
where Array1 is the first range of values and Array2 is the second range
of values. The two arrays should have equal length.
Assuming we have a set of independent variables (x) in B2:B13 and
dependent variables (y) in C2:C13, our correlation coefficient formula
goes as follows:
=CORREL(B2:B13, C2:C13)
However, remember that:
‹ If cells in an array contain text, logical values or blanks, then they
are ignored.
‹ If the arrays are of different lengths, a #N/A error is returned.




‹ If either of the arrays is empty, or if the standard deviation of
their values equals zero, a #DIV/0! error is returned.
The PEARSON function in Excel does the same thing: it calculates
the Pearson Product Moment Correlation coefficient. The syntax of this
function is PEARSON(array1, array2), where Array1 is a range of
independent values and Array2 is a range of dependent values. Continuing
with the same data, we can write =PEARSON(B2:B13, C2:C13) as
shown in the figure.
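Under the hood, CORREL and PEARSON both compute r = COV(X, Y) / (σx · σy), the covariance scaled by the two standard deviations. As an illustration with invented x and y values (not from the text), the calculation can be reproduced in Python:

```python
from math import sqrt

# Hypothetical independent (x) and dependent (y) ranges.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation sum
sxx = sum((a - mx) ** 2 for a in x)                   # x deviation sum of squares
syy = sum((b - my) ** 2 for b in y)                   # y deviation sum of squares

# Pearson r: covariance scaled by the two standard deviations.
r = sxy / sqrt(sxx * syy)
print(round(r, 4))  # 0.7746
```

A value close to +1 or −1 indicates a strong relationship; a value near 0 indicates little linear relationship.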

Interpreting Correlation Analysis Results


In the correlation matrix, the coefficients are shown at the intersection
of rows and columns. Where the row and column refer to the same variable,
the value is 1. A negative coefficient indicates an inverse relationship
between the dependent and the independent variable, while a positive
coefficient indicates a direct relationship; the closer the magnitude is
to 1, the stronger the relationship.

2.15 Moving Average


Moving average, also known as rolling average, running average or moving
mean, is defined as a series of averages for different subsets of the
same data set. This measure is frequently used in statistics,
seasonally-adjusted economic analysis and weather forecasting to get
insights into underlying trends. For example, in stock trading, the
moving average gives the average value of a security over a given period
of time. Similarly, in




business, the moving average of sales for the last 3 months is calculated
to understand market trends. To forecast the weather, the moving average
of three-month temperatures is calculated.
We can compute different types of moving average: simple (or arithmetic),
exponential, variable, triangular, and weighted. In this section,
let us see how to calculate the simple moving average. In Excel, a simple
moving average is calculated using formulas and trendline options.
A simple moving average can be calculated using the AVERAGE function.
Given a list of average monthly temperatures in column B, the moving
average for the first 3 months can be calculated as =AVERAGE(B2:B4) or
=SUM(B2:B4)/3. To find subsequent averages, copy the formula to the
other rows.
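Copying the AVERAGE formula down the column amounts to sliding a fixed-size window over the series. A small Python sketch of a 3-period simple moving average, using hypothetical monthly temperatures (the values are invented for illustration):

```python
# Hypothetical monthly temperatures (standing in for column B).
temps = [18, 21, 25, 29, 33, 35, 34, 30]
window = 3

# Each entry averages the current month and the two before it,
# like copying =AVERAGE(B2:B4) down the column.
moving_avg = [
    sum(temps[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(temps))
]
print(moving_avg)
```

Note that a 3-period window produces two fewer averages than there are data points, since the first two months have no complete window.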

To visualize the moving average on a chart by drawing a trendline,
follow the steps given below:
Step 1: Click anywhere in the chart.
Step 2: On the Layout tab, in the Analysis group, select the trendline
option.
Step 3: Click the desired option.





IN-TEXT QUESTIONS
5. What is the main purpose of data validation in spreadsheets?
(a) To perform mathematical calculations on data
(b) To ensure that data entered meets specific criteria
(c) To visualize data using charts
(d) To automatically sort data
6. What is an outlier in a dataset?
(a) A value that is similar to other values
(b) A value that falls within the Interquartile Range (IQR)
(c) A value significantly different from other values in the
dataset
(d) A missing or blank value
7. Which statistical method can be used to detect outliers using
quartiles?
(a) Standard deviation
(b) Z-score
(c) Interquartile Range (IQR)
(d) Median




2.16 Finding Missing Values
Excel does not have a dedicated function to list missing values, but
identifying them is important for the following reasons:
Data Integrity which ensures that the dataset is complete.
Data Reconciliation that facilitates the reconciliation process (mostly
used in finance).
Quality Assurance to identify anomalies or data entry errors.
Efficient Analysis to perform accurate data analysis by spotting and
addressing gaps.
List Missing Values in Excel
To identify and list missing values in Excel, you can use the following
functions:
IF, ISNUMBER and MATCH Functions:
‹ IF: Returns one value if a condition is true and another if it’s false.
‹ ISNUMBER: Checks if a value is a number.
‹ MATCH: Searches for a value in a range and returns its relative
position.
Example: If column A is supposed to contain every value from 1 to 100,
the missing values can be identified by using the formula
= IF(ISNUMBER(MATCH(ROW(A1), A:A, 0)), "", ROW(A1))
Note that the syntax of the MATCH function is
MATCH(lookup_value, lookup_array, [match_type])
Where,
lookup_value is the value to be
matched in the lookup_array.
lookup_array is the range of cells
being searched.
match_type is optional. It can have




value -1, 0, or 1. The default value is 1. This argument specifies how
Excel matches lookup_value with the values in lookup_array.
Now, drag and apply the formula from B1 to B100. This will result in
column B displaying the missing values in the list.
Missing values can also be identified using the Filter feature on column
B to display only the missing numbers by excluding blank cells.
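The logic of the IF/ISNUMBER/MATCH formula - check each expected value and report it when no match is found - can be sketched in Python as follows (the data here is invented for illustration):

```python
# Hypothetical recorded values; the full set should be 1..100.
recorded = set(range(1, 101)) - {7, 42, 99}  # pretend 7, 42 and 99 were never entered

# Same idea as IF(ISNUMBER(MATCH(...)), "", ROW(...)): report every
# expected value that MATCH would fail to find in column A.
missing = [n for n in range(1, 101) if n not in recorded]
print(missing)  # [7, 42, 99]
```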

2.17 Data Summarization


Data summarization in Excel can be done in multiple ways, for example:
Using Descriptive Statistics: Given a list of values in column A, we
can use Excel functions to summarize the values.
SUM, AVERAGE, MEDIAN: Calculate the total, mean, and median of
a dataset.
Example: =SUM(A2:A100) sums all values in the range A2 to A100.
Example: =AVERAGE(A2:A100) calculates the average.
COUNT, COUNTA: Count the number of cells with numbers (COUNT) or with
any data (COUNTA).
Example: =COUNT(A2:A100) counts numeric entries.
STDEV.P, VAR.P: Calculate the standard deviation and variance of a
dataset.
Example: =STDEV.P(A2:A100) for the standard deviation.
MIN, MAX: Find the smallest and largest values.
Example: =MIN(A2:A100) and =MAX(A2:A100).
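For comparison, the same descriptive statistics can be computed outside Excel with Python's standard `statistics` module. The data below is invented, and each line notes the Excel function it mirrors:

```python
from statistics import mean, median, pstdev, pvariance

# Hypothetical values standing in for column A.
data = [4, 8, 8, 10, 15, 15, 20]

summary = {
    "SUM": sum(data),         # =SUM(A2:A100)
    "AVERAGE": mean(data),    # =AVERAGE(A2:A100)
    "MEDIAN": median(data),   # =MEDIAN(A2:A100)
    "COUNT": len(data),       # =COUNT(A2:A100)
    "STDEV.P": pstdev(data),  # =STDEV.P(A2:A100)
    "VAR.P": pvariance(data), # =VAR.P(A2:A100)
    "MIN": min(data),         # =MIN(A2:A100)
    "MAX": max(data),         # =MAX(A2:A100)
}
print(summary)
```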

2.18 Data Visualization


Data visualization helps users transform raw data into meaningful
visual stories that enable them to spot trends in data and communicate
complex information effectively.




Microsoft Excel provides different types of charts to visualize data in the
spreadsheet. To draw a chart, you need to follow the steps given below:
Step 1: Organize the data in rows and columns within the Excel sheet.
Every row and column should be labelled clearly to identify the data to
be visualized.
Step 2: Select the data by clicking and dragging mouse to highlight
the data to be visualized. In this selection, include the row and column
headers (as shown in the figure).

Step 3: Choose a chart type by clicking on the “Insert” tab. In the “Charts”
section, select the required chart option (Column, Line, Pie, Bar, Area,
Scatter, etc.) by clicking on the dropdown arrow below the chart type.




Step 4: Insert the chart. Once the desired chart type is selected, the
chart is automatically created and inserted in the worksheet. It can
then be clicked and dragged to change its position, or resized using
the sizing handles at the corners.
Step 5: Customize the chart. For this, click on the chart to select it. Now,
you would be able to see two additional tabs: “Design” and “Format”.
Use these tabs to customize the chart’s appearance, style and layout. Im-
portant information like chart title, axis labels, legend, data labels, etc.
can be added to enhance visualization and data interpretation.

Step 6: Edit the data (optional). In case you wish to make changes to
the data, simply edit it in the worksheet. Excel will automatically update
the chart to reflect the changes.

2.19 Types of Data Visualizations in Excel


Excel offers a variety of charts to visualize data. Some commonly used
charts are:
Column Chart: It displays data using vertical bars, where each bar
represents a category. A column chart is preferred when certain




values have to be compared across categories or to visualize trends
over time.
Bar Chart: It is similar to a column chart, but with horizontal instead
of vertical bars. Bar charts are usually used to compare values across
categories when the category names are long or there are several
categories.

Line Chart: The line chart plots data points and then connects these
points by lines. These lines show trends or change in values over time.
Line charts are widely used for continuous data like stock prices or
temperature measurements.

Pie Chart: A pie chart plots data as slices of a circle. The size of
each slice is proportional to the value it represents; that is, it
shows the proportion of each category within a whole.





Scatter Plot: A scatter plot displays data points on a Cartesian coordinate


system, with each axis representing a variable. These charts depict the
relationship between two variables and identify patterns or correlations.

Thus, these charts in Excel help users understand the composition,
distribution and overlap of data. After effectively visualizing data,
users can present meaningful and engaging stories to decision makers.

2.20 Pivot Tables


Pivot tables are an important feature of MS Excel that allow users to
quickly summarize large amounts of data, analyze numerical data in
detail, and answer unanticipated questions about the data. Such a table
is specifically designed to query data in a user-friendly and
interactive way.
For example, consider the dataset given in figure. There are 6 columns
and 213 rows.




To insert a pivot table, follow the steps given below:


Step 1: Click any single cell inside the data set.
Step 2: Click on the Insert tab, in the Tables group.
Step 3: Click on PivotTable.
Step 4: In the dialog box that appears, Excel automatically selects the
data, and the default location for the new pivot table is New Worksheet.
Step 5: Click on OK.
Step 6: Now, drag the fields. For example, to get the total amount
exported for each product, drag the Product field to the Rows area, the
Amount field to the Values area and the Country field to the Filters
area.
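Conceptually, dragging Product to Rows and Amount to Values asks for a group-and-sum. A plain-Python sketch of that summarization, using invented export records (the dataset in the figure is not reproduced here):

```python
from collections import defaultdict

# Invented export records: (product, country, amount).
rows = [
    ("Apple", "India", 100), ("Banana", "India", 40),
    ("Apple", "France", 60), ("Banana", "Chile", 30),
    ("Apple", "India", 50),
]

# Rows area = Product, Values area = Sum of Amount.
totals = defaultdict(int)
for product, country, amount in rows:
    totals[product] += amount
print(dict(totals))  # {'Apple': 210, 'Banana': 70}
```

A Filters-area field (here, Country) would simply restrict which records enter the loop before summing.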

2.21 Pivot Chart


Pivot Chart is a dynamic visualization tool that helps users summarize
and analyze large datasets. Trends and patterns can be easily identified
by pivot charts.
Pivot charts are used to present complex data in a clear, concise,
interactive and flexible manner. For example, with a pivot chart, users
can easily visualize how sales vary across different geographical
regions, product categories, or specific time periods.
To insert a pivot chart using data from the pivot table, follow the steps
given below:
Step 1: Click any cell inside the pivot table.
Step 2: On the PivotTable Analyze tab, click on PivotChart in the Tools group.




Step 3: Click OK on the Insert Chart dialog box.


The pivot chart will appear on the screen. Any changes made in the pivot
chart are immediately reflected in the pivot table and vice versa.

2.22 Interactive Dashboard


In Microsoft Excel, an interactive dashboard is usually a one-pager report
that allows business users to track and measure crucial business KPIs
and metrics under one roof. It combines charts, figures, and tables to
help users visualize complicated data in an easy-to-understand format. An
interactive dashboard can be created by following the steps given below:
Step 1: Define the Purpose of the Dashboard. For this, you must be clear
with answers for two questions - why is the dashboard being created and
for whom it is created. Different stakeholders or departments within the
organization want to analyze different facts and figures. For example, the
Chief Financial Officer (CFO) focuses on key financial metrics, while the
investor is interested in a summarized dashboard of all the departments.
So, it is important to understand the purpose of the dashboard and then
collect data around it for accurate and effective decision-making.
Step 2: Gather Data in the form of a table and then convert this table
into a pivot table. This is done by:
(i) Selecting the table.
(ii) In the Insert Tab, click on Pivot Table.
(iii) Click on OK and the Pivot Table will be inserted in a new sheet.




If you want 3 pivot charts on the interactive dashboard, then you must
have 3 pivot tables. So, you can simply duplicate the pivot table sheet
in the Excel workbook.
Step 3: Create Charts using the Pivot Table. For example:
The first chart will represent every product's monthly sales. For this
chart, we need 3 data fields - Sales, Product, and Month. In the pivot
table sheet, drag and drop the Month field into the rows area, the
Product field into the columns area, and the Sales field into the
values area.

Step 4: In the PivotTable Analyze group, click on PivotChart and select


a suitable chart from the chart drop-down.
Step 5: Click on Ok. The pivot chart will be created.
Once the chart is created, you can style it using formatting options.
Just click on the + sign beside the chart and tick the following items.

Chart Title to change the title of the chart.


Legend to enable, disable, or edit the legend.
Axes to edit horizontal axis and vertical axis of the chart.




Data Table to insert a table representing all values in the data table.

After formatting the chart, it can be moved to the Interactive dashboard


sheet. Repeat the same steps to create other pivot charts and place them
on the interactive dashboard.
Step 6: Add Interactive Features to the dashboard design. For this, select
any chart and click on PivotChart Analyze.
Step 7: Click on Insert Timeline. However, to insert a timeline on any
pivot chart, there must be a Date column in the data. Make sure that
the Date checkbox is ticked before you press OK.
Step 8: Like the timeline, a slicer can also be added to the interactive
dashboard. A slicer is just a fancy name for a filter. To add a slicer,
perform the following steps:
(i) Click on one of the pivot charts to activate it.
(ii) In the Insert tab, click on the Slicer option.
(iii) From the list of all the variables, perform slice and dice operations.
But before performing these operations, you need to connect the
Slicer to the Charts. To connect the slicer:
(a) Click on the slicer to activate it.
(b) In the Slicer tab, click on the Report Connections button.
(c) From the list of pivot tables, check all the boxes.





IN-TEXT QUESTIONS
8. Which of the following is the most suitable chart type for
displaying the proportion of different categories in a dataset?
(a) Line Chart
(b) Scatter Plot
(c) Pie Chart
(d) Histogram
9. Which of the following operations can you perform using a
pivot table?
(a) Filter data based on specific criteria
(b) Create complex formulas
(c) Sort data in a specific column
(d) All of the above
10. Which type of chart is commonly used in pivot charts to show
data changes over time?
(a) Bar Chart
(b) Pie Chart
(c) Line Chart
(d) Scatter Plot
11. What is a common benefit of using a dashboard for data analysis?
(a) It provides detailed data without summarization
(b) It allows for real-time monitoring of key metrics
(c) It removes the need for data visualization
(d) It only displays raw data without analysis




2.23 Summary
The aim of data preparation is to ensure good quality and consistency of
data for specific tasks. During data preparation, we need to detect
outliers that fall outside the expected range of values. These unexpected
values could be due to errors or may require special attention. We need
to decide whether to remove, correct, or leave outliers in the dataset.
Sometimes outliers are valid and should be kept, but in other cases they
may need correction or exclusion.
Moreover, to validate the accuracy of the data, check it against reliable
sources or business rules, and ensure that the data is logically
consistent - for example, that all transactions have corresponding dates.
Data summarization is done to transform a given large dataset into a
smaller form, usually presentable, for reporting, analysis, and further
examination. It involves extracting central insights and patterns from
data without losing vital information. Pivot tables are an important part
of MS Excel that allows users to quickly summarize large amounts of
data, analyze numerical data in detail, and answer unanticipated questions
about the data. Correspondingly, Pivot Chart is a dynamic visualization
tool that helps users summarize and analyze large datasets. Trends and
patterns can be easily identified by pivot charts.

2.24 Answers to In-Text Questions


1. (b) To remove inconsistencies and errors in the data
2. (b) Filling in missing data with estimated values
3. (b) Highlight cells based on certain criteria
4. (c) Highlight Cell Rules > Duplicate Values
5. (b) To ensure that data entered meets specific criteria
6. (c) A value significantly different from other values in the dataset
7. (c) Interquartile Range (IQR)
8. (c) Pie Chart
9. (d) All of the above
10. (c) Line Chart
11. (b) It allows for real-time monitoring of key metrics



2.25 Self-Assessment Questions
1. Give the steps to clean data.
2. Explain any five processes that are performed to prepare data for
analysis.
3. What are the different ways to summarize data? Give examples.
4. Why is it important to analyse outliers? How can this be done?
5. How will you determine how strongly two variables are related to
each other?
6. Explain the significance of pivot tables.

2.26 References
‹ McFedries, P. (2018). Excel data analysis for dummies (5th ed.).
John Wiley & Sons.
‹ Middleton, M. R. (2021). Data analysis using Microsoft Excel (5th
ed.). Cengage Learning.
‹ Alexander, M. (2016). Excel Power Pivot & Power Query for
dummies. John Wiley & Sons.
‹ Winston, W. L. (2019). Microsoft Excel data analysis and business
modeling (6th ed.). Microsoft Press.

2.27 Suggested Readings


‹ Chapman, S. J. (2021). Essential Excel 2019: A step-by-step guide
to data analysis, charts, and statistics. Independently Published.
‹ Grover, D., & Grover, D. (2019). Data analytics with Excel: A
complete beginner’s guide. BPB Publications.
‹ Stark, C., & Stark, D. (2020). Data analysis for business, economics,
and policy using Excel. Cambridge University Press.



L E S S O N

3
Getting Started with R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Installation
3.4 Importing Data from Spreadsheet Files
3.5 Commands and Syntax
3.6 Data Type
3.7 Operators
3.8 Functions
3.9 Summary
3.10 Answers to In-Text Questions
3.11 Self-Assessment Questions
3.12 References
3.13 Suggested Readings

3.1 Learning Objectives


By the end of this chapter, students should be able to:
‹ Define key concepts in R, such as packages, data structures, and basic commands.
‹ Explain and utilize the capabilities of R for handling and manipulating data.
‹ Understand and explain advantages of using R for data analysis.
‹ Differentiate and utilize different data types and operators.
‹ Write functions in R programming language.




3.2 Introduction
You have already studied the importance of data analytics, and in the
previous lesson we explored data preparation, summarisation and
visualisation with spreadsheets. In this chapter we introduce a popular
open-source programming language designed primarily for statistical
computing and data analysis: R programming (referred to as R henceforth).
Suppose a retail company, "ShopSmart", needs to analyse its daily sales.
Currently it uses Excel for basic data handling, but this becomes
challenging as the size of the data grows. Switching to R helps
"ShopSmart" analyse larger datasets seamlessly, create informative
visualizations, and understand customer purchasing trends, ultimately
leading to better marketing strategies and inventory management. R was
developed in the early 1990s by statisticians Ross Ihaka and Robert
Gentleman. It has become a widely accepted tool among data scientists,
researchers, and analysts. Compared with traditional tools like Excel,
R is more flexible and scalable for data analysis: more complex
computations and visual displays for large datasets can be performed
and handled by the programmer. R is used in many areas; in business and
commerce, it is used for tasks like market prediction, appraising
financial risk, optimizing investments, understanding customer
behaviour, sales forecasting, inventory control, analysing customer
feedback, segmenting customer data, and measuring campaign performance.
Thus, R is an important tool for data-driven decision-making in commerce
and gives commerce students the ability to process and interpret data
efficiently. Let us look at some of the advantages of R and why we
should prefer it for data analysis.
‹ Statistical software generally has very costly licenses, but R is
completely free to use, which makes it accessible to anyone interested
in learning data analysis without needing to invest money.
‹ R is a versatile statistical platform that provides a wide range
of data analysis techniques, enabling virtually any type of data
analytics to be performed efficiently, and it has state-of-the-art
graphics capabilities for visualization.
‹ Data is often gathered from a variety of sources, and analysing it
in one place has its own challenges. R can manage data from text
files, spreadsheets, databases, and web APIs, making it suitable
for almost any business environment.


GETTING STARTED WITH R

‹ R is compatible with a broad range of platforms, including Windows,
Unix, and macOS, so it can run on almost any computer you use.
‹ The R community, which provides extensive support to R programmers,
has developed thousands of packages extending R's capabilities into
specialized areas, such as quantmod for finance, ggplot2 for
visualization, and many packages for machine learning algorithms.
Despite the advantages discussed above, R can still be difficult to
learn at first. Since it has so many features, the documentation is
extensive and the help files can be overwhelming. Many functions come
from optional packages made by different contributors, so information
can be scattered and hard to find. Understanding everything that R can
do can be quite challenging. In this lesson we discuss the basics of R,
starting with installation (students are expected to follow the
installation steps).

3.3 Installation
To begin with R, students need to install both R (the base programming
language) and RStudio, an Integrated Development Environment (IDE) that
makes working with R much easier. RStudio provides a more user-friendly
interface than R's base interface, making coding, visualizing outputs,
and managing projects more straightforward. Follow the steps in
Table 3.1 to download R and RStudio.
Table 3.1: Installation of R and RStudio
For R
Step 1: Go to [CRAN (Comprehensive R Archive Network)] (https://
cran.r-project.org/).




Step 2: Choose your operating system (Windows, macOS, or Linux).

Download and run the installer.

R Interface




For RStudio


Visit [RStudio’s website] (https://www.rstudio.com/products/rstudio/
download/).

Choose the free version, “RStudio Desktop.”




Follow the installation prompts.

RStudio Interface

Understanding RStudio IDE


The RStudio display is divided into various panes (as shown in Figure 3.1);
these panes can be further customized as per your requirement. The most
important panes that you will see by default are described below:




‹ Source Editor Pane: In RStudio IDE, you can access the source
editor (marked as 1 in Figure 3.1) for R code. It is a text editor that
can be used for various forms of R code (shown in 2 of Figure 3.1),
such as standard R Script, R Markdown, R Notebook and R Sweave
etc. We can write and edit code here in the editor.
‹ Console Pane: This pane (shown as 3 in Figure 3.1) hosts the R
interpreter, where R code is processed. It shows the execution of
the code written in the editor and displays the results.
‹ Environment Pane: This pane can be used to access the variables
created in the current R session. The workspace, with all its
variables, can be exported to (and imported from) an R Data file
in this window.
‹ Output Pane: This pane contains the Files, Plots, Packages, Help,
Viewer, and Presentation tabs. The Files tab allows users to explore
files on the local storage system. The Plots tab displays all graphical
outputs produced by the R interpreter. In the Packages tab you can
view the packages installed in your RStudio and load them manually.
In the Help tab, documentation for various R functions can be
searched and viewed.

Figure 3.1: RStudio IDE


Packages: When you downloaded and installed R for the first time, you
installed the base R software, which contains most of the functions
that you will use frequently, like mean() and hist(). However,




if you want to access code or data written by other people, you can
do that as well using packages. As R has strong open-community support,
many R packages are available. An R package is a pre-written collection
of functions that performs certain tasks and extends R's capabilities.
In simple terms, it is a bundle of everything from functions to help
files, stored in one place called a package. In Figure 3.2 we have
installed the 'tidyverse' package in RStudio; it is a popular collection
of packages designed to make data science easier. It includes tools for
importing, tidying, transforming, and visualizing data.

Figure 3.2: Package Installation RStudio


Libraries: This is the directory where packages are stored on your computer.
In R, to import a package into your workspace, you use the function li-
brary(), making the package’s functions and datasets available for use.
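As a brief illustration (a minimal sketch; 'tidyverse' is used here only because the text mentions it):

```r
# Install a package once; it is downloaded from CRAN into your library
install.packages("tidyverse")

# Load it from the library in each new session
library(tidyverse)

# List the packages currently attached to the workspace
search()
```

Note the division of labour: install.packages() writes the package into the library directory, while library() attaches it to the current session.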

3.4 Importing Data from Spreadsheet Files


Importing data from spreadsheets is quite common in business analytics
because most business data is stored in such formats as Excel. Using R,
you can easily import spreadsheet data into your workspace with packages
like readxl and openxlsx. They support working with .xls and .xlsx files
even without installing Excel on your computer. The readxl package is
especially straightforward and effective. The core function, read_excel(),
reads data directly from a spreadsheet and loads it into an R data frame.




You can specify the sheet name, range of cells, and column types for
better control. Sample code is shown in code window 1.

Code Window 1
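Because the code window itself is an image in the printed material, a minimal sketch of the idea is given below; the file name, sheet name, and cell range are placeholders, not the textbook's actual example:

```r
library(readxl)

# Read the first sheet of a workbook into a data frame
sales <- read_excel("sales.xlsx")

# Optional arguments give finer control over what is read
sales_q1 <- read_excel("sales.xlsx",
                       sheet = "Q1",       # which sheet
                       range = "A1:C20",   # which cells
                       col_types = c("text", "numeric", "numeric"))
head(sales_q1)
```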
IN-TEXT QUESTIONS
1. What is the difference between a package and a library in R?
2. Which package is commonly used to import Excel files into R?

3.5 Commands and Syntax


We will start learning basic commands with the most basic program in any
programming language, “Hello World”. R commands can be written at the
command prompt (the interpreter) or in a script file, where complete
code can be written at once and then run. You can also run R code on
online compilers if you have not installed RStudio.
Note: All the codes in this SLM are written and executed on Google Colab.

Code Window 2
Comments are text written for the clarity of code; they help the reader
understand your code and are ignored by the interpreter during program
execution. A single-line comment is given using # at the beginning of
the statement. R does not support multi-line comments.
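A small sketch of how comments look in practice (the statements are our own examples):

```r
# This whole line is a comment and is ignored by the interpreter
print("Hello World")  # a comment can also follow a statement

# R has no multi-line comment syntax, so a longer
# explanation simply uses # at the start of each line
```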





Code Window 3
Variables act as containers that hold data or values, which can be used
and manipulated throughout your program. A variable in R is created
using the assignment operator <- or =. Variables make working with
data much easier by giving meaningful names to the values you want
to use. Variables in R are flexible—you don’t have to declare their
type explicitly. R automatically infers whether you’re storing a
number, text, or something else.

Code Window 4
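As an illustrative sketch (the variable names and values here are our own):

```r
# Create variables with <- (or =); no type declaration is needed
price <- 49.5          # R infers numeric
product <- "Notebook"  # R infers character
in_stock <- TRUE       # R infers logical

# class() shows the type R inferred
class(price)     # "numeric"
class(product)   # "character"
class(in_stock)  # "logical"
```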
There are certain rules to give valid variable names in R as discussed
below:
‹ A variable name can include letters (a-z, A-Z), digits (0-9), and the
dot (.) or underscore (_) but cannot start with a number.
‹ R is case-sensitive: var and Var are two different identifiers.
‹ Reserved keywords in R cannot be used as variable names.
‹ Any special character except underscore and dot is not allowed.
‹ Variable names starting with a dot are allowed, but the dot should
not be followed by a number. It is not advised to use a dot as the
starting character.
Some examples of valid and invalid variable names are shown in Table 3.2.




Table 3.2: Identifiers

Keywords: These are an integral part of R’s syntax and are used to
implement various functionalities in R. They are also called reserved
words and are predefined, having specific roles within the language. The
list of reserved words in R is quite comprehensive which can be accessed
by executing ‘help(reserved)’ or using ‘?reserved’. Table 3.3 shows the
list of reserved words in R.
Table 3.3: Reserved Words




3.6 Data Type
Unlike C or C++, in R we are not required to declare a variable with a
data type. R assigns a data type to a variable dynamically, depending
upon the value it has been initialized to. There are various data
types available in R; a list of data types is shown in Table 3.4.
Apart from these general data types, R also supports many flexible
data structures such as vectors, lists, arrays, and data frames,
which will be discussed in later lessons.
Table 3.4: Data Types in R

‹ R provides functions so that you can view the variables that are
currently defined in your R environment. The following functions
can be used to see the list of currently available variables:
‹ Use ls() to list all variables in the current environment.
‹ ls(pattern = “name”) will give the list of variables matching the
given pattern.
‹ Another function that can be used to display variables is objects().
‹ We can also remove variables from the R environment using the
following functions:
‹ rm(variable_name) removes a single variable.
‹ rm(var1, var2, var3) will remove the multiple variables mentioned
as arguments.
‹ rm(list = ls()) will remove all variables.




‹ rm(list = ls(pattern = “temp”)) will remove all variables matching
the given pattern.
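The listing and removal functions above can be sketched as follows (the variable names are our own):

```r
a <- 1; b <- 2; temp_x <- 3; temp_y <- 4

ls()                             # "a" "b" "temp_x" "temp_y"
ls(pattern = "temp")             # "temp_x" "temp_y"
objects()                        # same listing as ls()

rm(a)                            # remove one variable
rm(list = ls(pattern = "temp"))  # remove all variables matching a pattern
ls()                             # only "b" remains
```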

3.7 Operators
Operators are tools that help us perform various operations on data.
Whether we are doing basic calculations or more advanced logical
comparisons, operators tell R what action to take on the data. There are
various operators available in R; we will discuss them in this section.
Arithmetic Operators are the simplest and most frequently used
operators. They allow us to carry out simple math operations like
addition, subtraction, multiplication, and division. For example,
5 + 3 adds two numbers together and gives a result of 8. For 5 %% 2,
the remainder is calculated, which is 1. Advanced arithmetic is also
available, such as exponentiation through either the ^ or ** operator,
which enables you to raise a number to a power. These operators are not
restricted to single numbers: they also work element-wise on numeric
vectors, making computations easy even with very big datasets.
Table 3.5 shows the various arithmetic operators.
Table 3.5: Arithmetic Operators

Relational Operators are used to compare values and check for conditions
like equality, greater than, or less than. For instance, 5 > 3 checks whether
5 is greater than 3 and returns TRUE. Similarly, 5 == 3 checks for equality
and returns FALSE. They are also widely used in filtering or subsetting
data where you will find rows of a dataset that satisfy some condition. For
example, you could use age > 18 to find all rows in a dataset where the
age is above 18. Relational operations always return logical values (TRUE
or FALSE). Table 3.6 shows various relational operators.




Table 3.6: Relational Operators

Logical Operators let you combine or modify logical values. You can
use & to perform an AND operation, where it is only true if both the
conditions are satisfied. For instance, TRUE & FALSE is false. Likewise,
you use the | operator to obtain an OR operation, meaning that the result
is TRUE if at least one condition is satisfied. The ! operator negates a
logical value, turning TRUE into FALSE and vice versa. Logical operators
are especially useful when dealing with multiple conditions in your data.
For instance, age > 18 & gender == “Male” can filter male individuals
above the age of 18 in a dataset. Table 3.7 shows various logical operators.
Table 3.7: Logical Operators
Operator   Meaning                   Example           Result
&          AND (element-wise)        TRUE & FALSE      FALSE
&&         AND (single comparison)   TRUE && TRUE      TRUE
|          OR (element-wise)         TRUE | FALSE      TRUE
||         OR (single comparison)    FALSE || FALSE    FALSE
!          NOT (negation)            !TRUE             FALSE

Assignment Operators are used to store values in variables. The most
commonly used operator is <-, which assigns a value to a variable, like
x <- 10.
You can also use = for assignment, but <- is preferred in R because it
is clear and consistent with the syntax of the language. Interestingly,
R also allows the right assignment operator (->), which assigns the value
on its left to the variable on its right, such as 10 -> x. These operators are the workhorses
behind using variables, and you can easily manipulate data. Table 3.8
shows assignment operators.
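The four operator families above can be sketched together (the values are our own examples):

```r
x <- 10; y <- 3

# Arithmetic
x + y     # 13
x %% y    # 1    (remainder)
x ^ y     # 1000 (exponentiation; ** works too)

# Relational: always return TRUE or FALSE
x > y     # TRUE
x == y    # FALSE

# Logical: combine conditions
x > 5 & y > 5    # FALSE (both must hold)
x > 5 | y > 5    # TRUE  (one is enough)
!(x > 5)         # FALSE

# Assignment: left, equals, and right forms
z <- x * y
z = x * y
x * y -> z
```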




Table 3.8: Assignment Operators

3.8 Functions
In R, user-defined functions enable you to create reusable blocks of
code to perform specific tasks. Functions are especially useful in busi-
ness analytics for automating repetitive operations, performing custom
calculations, or implementing domain-specific logic. By defining your
own functions, you can encapsulate complex logic into simple, reusable
units, which improve the clarity and efficiency of your code.
In R, a function is defined using the keyword function(). Inputs are
specified as arguments, and you write the logic to work with these
inputs and generate the desired output. A well-crafted function has
three components: the name, which is a descriptive identifier for the
function; the arguments, which are variables passed into the function
for customization; and the body, the code block where the logic is
executed. Code window 5 shows an example of a function in R.

Code Window 5
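Since the code window is an image in the original, a comparable sketch is given below; the function name and logic are our own example, not the textbook's:

```r
# Name: discount_price; Arguments: price, rate; Body: the block in braces
discount_price <- function(price, rate = 0.10) {
  final <- price * (1 - rate)  # the logic
  return(final)
}

discount_price(200)        # 180 (default rate of 10%)
discount_price(200, 0.25)  # 150
```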



IN-TEXT QUESTION


3. How do you define a user-defined function in R?

3.9 Summary
This chapter has given an overview of working with R for data analysis,
including both foundational concepts and practical tools, to equip readers
with the essential skills. The Introduction to R looked at the basics of
this versatile programming language, its powerful features, and its
critical role in data analysis. The Installation section guided readers
through setting up R and RStudio to ensure a smooth start to coding.
Moving forward, the chapter covered Importing Data, explaining how to
read spreadsheet files into R using the readxl package, an important part
of data preparation. There was a focus on understanding Commands and
Syntax, particularly R's case sensitivity and the need for proper
formatting to avoid execution errors.
The Data Types section highlighted the primary types: numeric, character,
logical, and factors that form the basis of data handling and analysis in R.
The use of Operators, namely arithmetic, relational, logical, and
assignment operators, was then described in detail for data manipulation
and analysis. Finally, the chapter ended with a presentation of
Functions, both built-in and user-defined, which perform tasks with
minimal repetition and compute complex calculations efficiently.
Together, these topics set a strong foundation for using R in data
analysis.

3.10 Answers to In-Text Questions


1. A package is a collection of functions and data, while a library
is the location where installed packages are stored.
2. The readxl package is used to import Excel files.
3. A user-defined function is created using the function() keyword,
followed by a name, arguments, and the logic to return a result.




3.11 Self-Assessment Questions
1. How do you install and load a package in R? Provide a code example.
2. Explain the difference between numeric and character data types in
R.
3. Write a user-defined function in R that calculates the square of a
number.
4. Describe how to import a .csv file into R. Mention any required
functions or packages.
5. Identify two relational and two logical operators in R.

3.12 References
‹ Grolemund, G., & Wickham, H. (2016). R for Data Science: Import,
Tidy, Transform, Visualize, and Model Data. O’Reilly Media.
‹ Matloff, N. (2011). The Art of R Programming. No Starch Press.

3.13 Suggested Readings


‹ Verzani, J. (2014). Using R for Introductory Statistics. CRC Press.
‹ Grolemund, G. (2014). Hands-On Programming with R. O’Reilly
Media.



L E S S O N

4
Data Structures in R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Vectors
4.4 Matrices
4.5 Lists
4.6 Factors
4.7 Data Frames
4.8 Conditionals and Control Flows
4.9 Loops
4.10 Apply Family
4.11 Summary
4.12 Answers to In-Text Questions
4.13 Self-Assessment Questions
4.14 References
4.15 Suggested Readings

4.1 Learning Objectives


By the end of this chapter, you should be able to:
‹ Understand and work with vectors, matrices, arrays, lists, factors, and data frames
in R.
‹ Use conditionals and control flows to add logic to your programs.
‹ Implement loops for repeated code and enhance efficiency by using the apply family
of functions.



DATA STRUCTURES IN R

‹ Identify when and how to select among various data structures and
control mechanisms.
‹ Write cleaner, more efficient R code using the strength of functional
programming.

4.2 Introduction
In this lesson we will discuss data structures and their use to organize
and process data more efficiently. You may think of a data structure
as a blueprint that indicates how to arrange and store data. The design
of any structure is deliberate, as it allows data access and manipulation
in certain, structured ways. We use specialized methods or functions to
interact with these structures in programming and statistical software like
R. These tools are built for easier working with data of all shapes and
forms. R offers six key data structures to work with: Vectors, Matrices,
Arrays, Lists, Factors and Data frames.
Further, these can be divided into two categories: homogeneous and
heterogeneous structures. The first three (vectors, matrices, and arrays)
are like neat, organized boxes where everything is of the same type;
hence they are called homogeneous. On the other hand, the heterogeneous
structures are data frames and lists, which allow for greater
flexibility: they can accommodate elements of various types together.
A factor is a special data structure used for handling categorical data
(nominal or ordinal). In the subsequent sections we will discuss these
data structures.
A point to remember for those who are already familiar with programming:
R has no scalar types; in fact, numbers, strings, and other scalars are
vectors of length one.

4.3 Vectors
A vector is one of the basic data structures in the R programming
language. It is used to store multiple values of the same type (also
called mode). A vector is one-dimensional and can hold numeric,
character, logical, or other values, but all the values must have the
same mode. Vectors are fundamental to R; hence most operations are
performed on vectors. The various types of vectors are shown in
Table 4.1 below:




Table 4.1: Types of Vectors

Creating a Vector
You can create vectors using the c() function, which stands for combine
or concatenate. Also, vectors are stored contiguously in memory, just
like arrays in C; hence the size of a vector is determined at the time
of creation. Thus, any modification to the vector leads to reassignment
(internally, a new vector with the same name is created). Code to create
and display a few vectors is shown below in code window 1.

Code Window 1
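A minimal sketch of vector creation (the values are our own, not necessarily those of the original code window):

```r
v1 <- c(3, 23, 4, 7, 11)          # numeric vector
v2 <- c("red", "green", "blue")   # character vector
v3 <- c(TRUE, FALSE, TRUE)        # logical vector

print(v1)
print(v2)

# Mixing modes silently coerces to the most flexible one
c(1, "two", TRUE)   # all become character: "1" "two" "TRUE"
```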
Another point to note is that the c() function allows you to modify or
reassign an existing vector, as shown in code window 2.

Code Window 2




This will add 10 at the end of the vector v1, or at the 4th position, as instructed.
Vectors are useful for analysis as R allows us to use various operations
over them. In this section we will explore various operations that can
be used on vectors.
‹ Length: We can obtain the length of a vector using the length()
function. This can be used to iterate over a vector in loops.

This will return 5 as output.


‹ Indexing and Subset: We can use indexing to refer to a particular
element of a vector; we can also extract subsets using indexing.
Note that vector indexing starts from 1 instead of 0, and subset
ranges are inclusive.

This will give 3 and (3, 23, 4) as output. You can also give a
negative index to omit a value; for example, print(v1[-2]) will
output all values except the one at the second index.
‹ You can also filter vectors by applying logical expressions that
return TRUE/FALSE for each element; the output consists of the
elements for which the result is TRUE.

This will give output 30 40 50 and 20 30 40.


‹ Element-wise Operations: We can apply simple operations to all
the elements of a vector.

This will give (3 4 5 6 7), (2 4 6 8 10), (-1 0 1 2 3) as output


respectively. We can apply all operators like arithmetic, logical,




relational, etc. in an element-wise fashion. Various operators were
already discussed in lesson 3.
‹ Vectorized Functions: R offers many built-in functions which can
be applied to vectors as a whole (rather than element-wise) and
give cumulative output as shown in Table 4.2.
Table 4.2: Vectorized Functions

‹ Combining and Modifying Vectors: Apart from applying operations


on a single vector, we can also apply the given functions on two
or more vectors as shown in Table 4.3.
Table 4.3: Combined Vector Operations

Note: One important point when applying an operation to two vectors
is that such operations require both vectors to be of the same length.
In case of a length mismatch, R automatically recycles (repeats) the
shorter one until it is long enough to match the longer one, as shown
in code window 3.





Code Window 3
‹ Miscellaneous Functions: There are certain functions shown below
in Table 4.4 which can be used with vectors, as required.
Table 4.4: Miscellaneous Functions

Thus, vectors in R support a wide range of operations, from simple
arithmetic to advanced indexing and subsetting. Because operations in R
are vectorized, you can apply them directly to entire vectors, bypassing
explicit loops, which makes code much more efficient and concise.
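The operations discussed above can be sketched in one place (the values are our own):

```r
v <- c(10, 20, 30, 40, 50)

length(v)         # 5
v[2]              # 20        (indexing starts at 1)
v[2:4]            # 20 30 40  (inclusive range)
v[-1]             # everything except the first element
v[v > 25]         # 30 40 50  (logical filtering)
v + 1             # element-wise: 11 21 31 41 51
sum(v); mean(v)   # vectorized functions: 150 and 30
```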

4.4 Matrices
Since you have understood vectors and various operations that can be
applied on them, now, let’s talk about matrices. You can understand a
matrix as an enhanced vector: it’s really nothing but a vector with two
extra attributes; namely the number of rows and the number of columns.




As with vectors, matrices are also homogeneous. However, don’t mix up
one-row or one-column matrices with vectors; they are not the same.
Now, matrices are actually a special type of a broader concept in R called
arrays. While matrices have just two dimensions (rows and columns),
arrays can go further and have multiple dimensions. For instance, a
three-dimensional array has rows, columns, and layers, adding an extra
level of organization into your data. The reason that matrices are useful
in R is the vast array of operations that you can carry out on them. Many
of these operations are based upon what you already know about vectors,
such as subsetting and vectorization, but matrices extend these to two
dimensions. The added structure of rows and columns makes matrices ideal
for mathematical operations, data manipulation, and statistical modelling.
The various operations on matrices are discussed below:
‹ Creation: Matrices are generally created using matrix() function,
the data in matrices is stored in column major format by default.
The ‘nrow’ parameter specifies rows, and ‘ncol’ specifies columns.
We can use ‘byrow = TRUE’ to fill data row-wise in the matrix instead
of column-wise. Code to create a matrix using the matrix() function
and from vectors is shown below in code window 4.

Code Window 4
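A minimal sketch of matrix creation (our own values):

```r
# Filled column-wise by default
m <- matrix(1:6, nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE fills row-wise instead
m2 <- matrix(1:6, nrow = 2, byrow = TRUE)

# A vector becomes a matrix once dimensions are attached
v <- 1:6
dim(v) <- c(2, 3)
```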

‹ We can also add or delete rows and columns of matrices, as shown
below in code windows 5 and 6.





Code Window 5

Code Window 6

‹ R provides several operations for matrices, including addition,


multiplication, and scalar operations as shown in code window 7.





Code Window 7

As you may have noticed in code window 7, arithmetic (element-wise)
multiplication and matrix multiplication are two different operations.
Some of the other useful functions are rowSums() and colSums(), which
give the sums of rows/columns, and rowMeans() and colMeans(), which
give the means of rows/columns.
‹ Just like vectors, matrices support indexing and subsetting. You
can access specific elements, rows, or columns using indices, as
shown in code window 8.

Code Window 8

‹ You can also assign values to submatrices; for example,
mat[c(1,3),] <- matrix(c(1,1,8,12), nrow=2) assigns new values to
the first and third rows of the matrix. If you give a negative
index, it will exclude that element; for example, mat[-2,] will
omit the second row from the output.
‹ Matrix filtering is a powerful operation, just as for vectors; it
enables efficient subsetting and selection of data from a matrix



based on logical criteria. Some examples are shown below in code
window 9.

Code Window 9

‹ You can also give name to the rows and columns of a matrix using
the dimnames() function or by specifying them during the creation
of the matrix (as shown in code window 10).

Code Window 10




Arrays
‹ An array in R is a data structure that can store data in more than
one dimension; hence, in R, arrays are an extension of matrices.
While a matrix is constrained to two dimensions (rows and columns),
an array can have three or more dimensions. Arrays are useful for
organizing and manipulating data having more than two axes, such as
3D spatial data or multi-dimensional experimental results.
‹ An array can be created using the array() function with arguments
data, dimensions, and dimension names, as shown in code window 11.

Code Window 11
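A sketch of the array() call described above (the dimension names are our own):

```r
# 2 rows x 3 columns x 2 layers
arr <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(c("r1", "r2"),
                             c("c1", "c2", "c3"),
                             c("layer1", "layer2")))

arr[1, 2, 2]        # element at row 1, column 2 of the second layer: 9
arr[, , "layer1"]   # the whole first layer as a 2 x 3 matrix
```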

‹ Array elements can be accessed in the same manner as vectors or
matrices. We can also name the dimensions.

Code Window 12




‹ We can reshape array dimensions as shown.

Code Window 13

4.5 Lists
In R, a list is an amazingly flexible data structure: it can store any
kind of data together, such as numbers, characters, vectors, matrices,
and even other lists. This flexibility makes lists different from
vectors or matrices, which require elements to be of the same class.
A list is useful for organizing complex data where different types may
coexist. In R, lists are used frequently, not only for storing results
from statistical models but also in general for organizing
heterogeneous data:
‹ You create a list by using the “list()” function, and any of the
elements in the list are accessed using double square brackets “[[
]]”. So for instance, “list(42, “Hello”, c(1, 2, 3))” generates a list
that has an integer, a string, and a vector.





Code Window 14
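A minimal sketch of a list holding mixed types (the names and values are our own):

```r
person <- list(name = "Asha", age = 30, scores = c(85, 92, 78))

person[[1]]      # "Asha"   ([[ ]] extracts the element itself)
person$scores    # 85 92 78 ($ works with named elements)
person["age"]    # a one-element list, not the bare value
```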

‹ Indexing, subsetting, and accessing elements of a list are shown
in code window 15.

Code Window 15

‹ We can find the size of a list using length(); we can also add or
delete elements.





Code Window 16

4.6 Factors
Factors are another type of R object, created from a vector; a factor
stores the vector as well as a record of the distinct values in that
vector, called levels. Factors are majorly used for nominal or
categorical data.

Code Window 17
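Consistent with the description that follows (8 values, 3 levels), a sketch might look like this (the category labels are our own):

```r
# 8 values drawn from 3 distinct categories
fac <- factor(c("Low", "High", "Low", "Medium",
                "High", "Low", "Medium", "Low"))

print(fac)
levels(fac)   # "High" "Low" "Medium" (alphabetical by default)
table(fac)    # count of values at each level
```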

As shown in code window 17, factor fac has 8 values but only 3 distinct
levels. Levels are very useful, as shown in code window 18:





Code Window 18

Here in the code window, case 1 shows that we tried to assign a value
to factor index 2, and it was successfully done, as the value belonged
to a predefined level. In case 2, however, NA was assigned to index 2
of the factor instead of 15, because 15 was not present in the factor
levels. In case 3 we anticipated a new level which was not present in
the initial vector, but we supplied it in the factor definition. Thus,
illegal values cannot be assigned to factors.
‹ Two commonly used functions with factors are split() and by().
As the name suggests, the split() function is used to divide an object




(such as a vector, data frame, or list) into subsets based on a certain
grouping factor. It is particularly useful when you want to break
down your data into smaller groups according to a factor (like a
categorical variable).

Code Window 19

As shown in the code above, the vector data is split into groups A, B,
and C corresponding to their factor levels. The by() function, on the
other hand, is used to apply a function to subsets of a data object
that have been grouped by a factor. It is used in scenarios where you
want to perform operations like calculating the mean, sum, or other
statistical measures for each group, as shown in code window 20.





Code Window 20

Thus, the split() function in R splits an object into subsets based on
a factor and returns the grouped data without applying any function to
those subsets. On the other hand, the by() function is used when you
want to apply a function, such as mean or sum, to each group formed by
a factor. Both functions are essential for working with grouped data in
R, allowing users to organize and analyze data based on categorical
variables.
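The contrast can be sketched as follows (the data and grouping are our own example):

```r
marks <- c(10, 20, 30, 40, 50, 60)
grp   <- factor(c("A", "B", "A", "C", "B", "C"))

# split(): only regroups the data; no function is applied
split(marks, grp)   # $A: 10 30   $B: 20 50   $C: 40 60

# by(): applies a function (here mean) to each group
by(marks, grp, mean)   # group means: A 20, B 35, C 50
```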

4.7 Data Frames


A data frame is a two-dimensional, tabular data structure commonly used
for storing and manipulating data. It is very similar to a table or
spreadsheet, where each column can store data of a different type
(numeric, character, logical) and each row is an observation or record.
Data frames are flexible and allow easy access to subsets of data,
modification of values, and application of functions across columns or
rows. They are created using the data.frame() function and are the
default structure for most data analysis tasks in R, especially
statistical modeling, data visualization, and manipulation. An important
feature of data frames is that they keep the integrity of the data
intact. All columns within the data frame

are of the same length, and meaningful column names can be assigned
for easy interpretation and management of data.
‹ Data frame creation is shown in code window 21.

Code Window 21

As we can see in the code above, the data frame d is created from the
vectors name and age. The last parameter, stringsAsFactors, specifies
whether character vectors should be treated as factors; note that since
R 4.0.0 this parameter defaults to FALSE (in earlier versions it
defaulted to TRUE).
‹ Elements of a data frame can be accessed in several ways, depending
on whether you want to select columns, rows, or specific cells.

Code Window 22
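Since the code windows are reproduced as images, here is an illustrative sketch (with made-up names and ages) of creating a data frame and accessing its columns, rows, and individual cells:

```r
# stringsAsFactors is written out explicitly; since R 4.0.0 it defaults to FALSE
d <- data.frame(name = c("Asha", "Ravi", "Meena"),
                age  = c(25, 30, 28),
                stringsAsFactors = FALSE)

d$name       # a whole column, selected by name
d[1, ]       # the first row (all columns)
d[2, "age"]  # a single cell: row 2, column "age" -> 30
```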

‹ Subsets can be extracted from data frames based on row and column
selection, using logical conditions, or by using the subset() function,
as shown in code window 23.

Code Window 23
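A minimal sketch of subset extraction, using the same illustrative data frame as above:

```r
d <- data.frame(name = c("Asha", "Ravi", "Meena"), age = c(25, 30, 28))

subset(d, age > 26)                 # rows where the logical condition holds
subset(d, age > 26, select = name)  # same rows, but only the name column
d[d$age > 26, ]                     # equivalent row selection via logical indexing
</imports>
```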

‹ Data frames can handle missing values as well; NA (Not Available)
is used to represent missing or undefined data in R.

Code Window 24
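The following sketch (with illustrative values) shows how NA behaves and how it can be handled:

```r
age <- c(25, NA, 28)

is.na(age)               # FALSE TRUE FALSE: flags the missing entry
mean(age)                # NA: missing values propagate by default
mean(age, na.rm = TRUE)  # 26.5 after removing the NA
```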


‹ We can use rbind() or cbind() to combine two data frames row-wise
or column-wise: rbind() requires that both have the same columns,
while cbind() requires that both have the same number of rows. We can
also use the merge() function to combine two or more data frames by
matching rows based on common columns.

Code Window 25
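A small sketch of rbind() and merge(), with hypothetical id/name/score data:

```r
d1 <- data.frame(id = c(1, 2), name = c("Asha", "Ravi"))
d2 <- data.frame(id = 3, name = "Meena")
rbind(d1, d2)                # stacks rows; both data frames share the same columns

marks <- data.frame(id = c(1, 3), score = c(88, 91))
merge(d1, marks, by = "id")  # keeps only rows whose id appears in both frames
```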

IN-TEXT QUESTIONS


1. What is the primary difference between a matrix and an array
in R?
2. Write an R code snippet to create a 3x3 matrix.
3. How do lists differ from vectors in R?
4. What makes a data frame unique compared to a matrix?

4.8 Conditionals and Control Flows


Decision making refers to the process of choosing amongst several alter-
native actions or courses of action based on certain conditions or criteria.
It allows programs to make choices based on logical evaluations, which
is typically implemented with control structures like “if”, “else”, and
“switch”. These structures allow the program to execute different blocks
of code based on whether certain conditions are true or false, thus con-
trolling the flow of execution. Decision making is essential in creating
dynamic and responsive applications that can adapt to changing inputs
or situations, so the program behaves correctly under all circumstances.
The process of decision making can be depicted by the flow chart below:

Figure 4.1
There are three decision making constructs in R programming: if, if…
else, switch.

‹ The if statement in R is the simplest form of decision making.
It evaluates a condition; if that condition is TRUE, the code block
inside the if is executed, otherwise the block is skipped. Its syntax
is shown below:

As shown in code window 26, only the block where the Boolean condition
is true is executed.

Code Window 26
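A minimal sketch of an if statement (the value of x is illustrative):

```r
x <- 10
if (x > 5) {
  print("x is greater than 5")  # runs only because the condition is TRUE
}
```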

‹ If you want code that executes some instructions when a condition is
true and others when it is false, if..else can be used. For example:
if it is not raining I will go out, else I won’t. The structure of
if..else is shown below:

Code window 27 shows if..else code in R.

PAGE 89
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

Business Analytics.indd 89 10-Jan-25 3:52:02 PM


BUSINESS ANALYTICS

Notes

Code Window 27
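A sketch of the if..else construct, using an illustrative logical flag:

```r
raining <- TRUE
if (raining) {
  msg <- "Stay in"   # branch taken when the condition is TRUE
} else {
  msg <- "Go out"    # branch taken when the condition is FALSE
}
print(msg)
```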

‹ If we need to test multiple conditions, an if..else if..else ladder
can be used; the syntax is given below:

For example, if we need to print grades based on marks, the code
is given in code window 28.

Code Window 28
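A sketch of such a grade ladder; the mark value and the grade cut-offs here are hypothetical, not taken from the original code window:

```r
marks <- 72
if (marks >= 90) {
  grade <- "A"
} else if (marks >= 75) {
  grade <- "B"
} else if (marks >= 60) {
  grade <- "C"
} else {
  grade <- "D"
}
print(grade)  # "C" for 72 under these illustrative cut-offs
```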


‹ In addition to the if..else if..else ladder, a switch statement can
also be used. It lets you check whether a variable matches any value
from a list. Each possible value is called a “case,” and the variable
is compared against these cases to find a match. Switch statements can
be very straightforward and efficient for handling multiple conditions.
Some rules of switch in R are:
‹ If the variable being tested is not a character string, it is
coerced to an integer.
‹ You can have as many cases as you want; each case is written as
a name = value pair.
‹ If the variable is an integer between 1 and nargs() - 1 (the
maximum number of arguments), the matching case’s value is
evaluated, and its result is returned.
‹ If the variable is a character string, an exact match is sought
among the case names.
‹ If there are several matches, only the first one is used.
‹ There is no explicit default keyword; instead, if there is no
match but there is an unnamed case, its value is returned. If
there is more than one unnamed case, an error is raised.
Syntax and example of switch.

Code Window 29
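A minimal sketch of switch() on a character value; the day names are illustrative, and the final unnamed case acts as the default:

```r
day <- "sat"
kind <- switch(day,
               sat = "Weekend",
               sun = "Weekend",
               "Weekday")  # unnamed case: returned when nothing matches
print(kind)  # "Weekend"
```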
4.9 Loops
Like any other programming language, we have loops in R too. They
are basic constructs allowing a block of code to be executed repeatedly.
R implements several kinds of loops: for, while, and repeat. Each loop
type is suited for different tasks, depending on the kind of control flow
needed. We will discuss the code and syntax of each of these loops in
this section.
‹ For Loop: It is used to iterate over a sequence of elements (anything
iterable), such as a vector, list, or numeric sequence, using a loop
control variable. The code of the for loop is given in code window 30

Code Window 30
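A minimal sketch of a for loop over a vector (the vector values are illustrative):

```r
v <- c(2, 4, 6)
for (x in v) {
  print(x)  # each element of the vector is printed in turn
}
```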

The above code iterates over a vector and prints its elements one by
one. We can write code to iterate over other data structures in the
same manner.
‹ Like the for loop, the while loop also repeatedly executes a block of
code as long as its condition remains TRUE, but here the loop control
variable needs to be initialized outside the loop. The while code to
print the sum of 5 numbers is shown below in code window 31; the
iteration variable is incremented inside the loop.


Code Window 31
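A sketch of the while loop summing the first 5 numbers, as described in the text:

```r
i <- 1        # loop control variable initialized outside the loop
total <- 0
while (i <= 5) {
  total <- total + i
  i <- i + 1  # incremented inside the loop, or the loop would never end
}
print(total)  # 15
```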

‹ The third type of iterative statement, the repeat loop, iterates
indefinitely until explicitly stopped using a break statement.

Code Window 32
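A minimal sketch of a repeat loop (the stopping value of 3 is illustrative):

```r
count <- 1
repeat {
  print(count)
  count <- count + 1
  if (count > 3) break  # repeat has no built-in condition; break is mandatory
}
```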


‹ We can also have nested loops for complex operations where iterations
are needed at various levels. For example, if you want to print
columns for each row, nested code is shown in code window 33.

Code Window 33

Here, the outer loop takes each value of i, and for every single value of
i, the inner loop takes each value of j. This structure makes sure that for
each pair of values taken by i and j, one calculation is performed: the
product of i and j. The result of this calculation is then printed. This
pattern is typically applied when tasks require the calculation of tables,
pairwise comparisons, or generally any combinatorial operation involving
several variables.
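The nested structure described above can be sketched as a small multiplication table (the range 1:3 is illustrative):

```r
for (i in 1:3) {
  for (j in 1:3) {
    cat(i, "x", j, "=", i * j, "\n")  # one product for every (i, j) pair
  }
}
```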
‹ The next and break statements can be used to control a loop: next
skips the current iteration and moves to the next one, while break
terminates the loop entirely, as seen in the repeat loop. Code is given
in code window 34.


Code Window 34
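A sketch of next and break inside one loop (the skip and stop values are illustrative):

```r
for (i in 1:6) {
  if (i == 3) next   # skip this iteration entirely
  if (i == 5) break  # terminate the loop
  print(i)           # prints 1, 2, 4
}
```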

Thus, in R, loops are very helpful for automating repetitive tasks: the
for loop iterates over elements in a sequence, such as a vector or list,
executing a block of code for each element. The while loop continues to
execute as long as a specified condition is TRUE, which makes it a good
choice for tasks where the number of iterations is not known beforehand.
A repeat loop runs endlessly until stopped by a break statement, which
must be provided; it is ideal when the stopping condition is more complex
to express.
Although loops are very general, R’s vectorized operations and apply-family
functions are often much faster alternatives for handling large datasets
or simple operations, so they are generally preferred in most cases.
IN-TEXT QUESTION
5. Write an R code snippet using an if-else statement to check if
a number is even or odd.

4.10 Apply Family
The apply family in R includes functions like apply, lapply, sapply,
vapply, tapply, mapply, and rapply. It is a very useful and powerful
feature of R. These functions provide alternatives to loops for applying
functions across various data structures like vectors, matrices, arrays,
lists, factors, and data frames. They are generally more concise and can
improve code readability and performance, since explicit loops can be
slower than vectorized operations. In this section we will discuss these
functions one by one along with code.
‹ The apply() function operates on the margins of a matrix or array.
It applies a given function along the rows or columns of a matrix or
higher-dimensional array. The syntax is apply(X, MARGIN, FUN), where
X is the matrix or array, MARGIN specifies the dimension (1 for rows,
2 for columns), and FUN is the function to apply.

Code Window 35

In this code the apply() function is used on a 3 × 3 matrix to calculate
the sums of its rows and columns.
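The row/column sums described above can be sketched as follows (the matrix values are illustrative):

```r
m <- matrix(1:9, nrow = 3)  # columns are filled first: 1:3, 4:6, 7:9
apply(m, 1, sum)            # MARGIN = 1: row sums -> 12 15 18
apply(m, 2, sum)            # MARGIN = 2: column sums -> 6 15 24
```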
‹ lapply() is used to apply a function to each element of a list, and
it returns a list. The code for lapply() is given in code window 36.


Code Window 36

‹ The sapply() function works like lapply(), but it attempts to simplify
the output into a vector or matrix when possible.

Code Window 37


‹ vapply() is also like lapply() and sapply(), but it lets you specify
the expected output type for better reliability.

Code Window 38
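The differences between lapply(), sapply(), and vapply() can be sketched on one illustrative list:

```r
lst <- list(a = 1:5, b = 6:10)

lapply(lst, mean)                          # returns a list: $a 3, $b 8
sapply(lst, mean)                          # simplifies to a named vector
vapply(lst, mean, FUN.VALUE = numeric(1))  # same, but the result type is checked
```

FUN.VALUE = numeric(1) declares that each element's result must be a single numeric value; vapply() raises an error otherwise, which makes it safer in production code.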

‹ tapply() applies a function to subsets of a vector, defined by a
factor or a list of factors. It takes three input parameters: the data
vector, the factors to group by, and the function to apply.

Code Window 39

‹ mapply() can be used to apply a function to multiple arguments
(vectorized); the code is shown in code window 40.

Code Window 40
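A sketch of tapply() and mapply(), with illustrative marks/section data:

```r
marks <- c(70, 80, 90, 60)
section <- factor(c("X", "Y", "X", "Y"))
tapply(marks, section, mean)            # group means: X 80, Y 70

# mapply() walks several argument vectors in parallel
mapply(function(a, b) a + b, 1:3, 4:6)  # element-wise sums: 5 7 9
```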

‹ If you want to recursively apply a function to the elements of a list,
you can use rapply(); it can also be used to handle nested lists.


Code Window 41

In this code we pass a nested list to the rapply() function and x^2 is
applied to each element of the list. The classes argument restricts the
function to elements of specific classes, and the how argument controls
the return structure, like “unlist” for a vector or “replace” for a
nested list.
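A minimal sketch of rapply() on an illustrative nested list:

```r
nested <- list(a = 1:3, b = list(c = 4:5))

# square every integer leaf and flatten the result into one vector
rapply(nested, function(x) x^2, classes = "integer", how = "unlist")
# squared leaves: 1 4 9 16 25
```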
Table 4.5 shows various functions of apply family.

Table 4.5 Apply Family Functions

4.11 Summary
In this chapter we have covered some of the basic building blocks in R
that serve as the foundation for manipulating data and controlling pro-
grams. Vectors are one-dimensional arrays that hold elements of a similar
type, whereas matrices extend this concept to two dimensions, and arrays


generalize further to n-dimensions. Lists, however, are containers that can
hold elements of different types, making them very versatile. Factors are
utilized to represent categorical data in a statistically efficient manner.
Data frames, which are a hybrid structure that combines the features of
lists and matrices, are ideal for organizing tabular data. To add logic to
your programs, tools like “if”, “else”, and “switch” allow decision-mak-
ing capabilities. For repetitive operations, loops like “for”, “while”, and
“repeat” are necessary; however, the apply family of functions provides
more efficient alternatives, enabling concise and functional programming.
This chapter has laid a solid foundation for dealing with data, writing
efficient code, and solving complex programming problems in R.

4.12 Answers to In-Text Questions


1. A matrix is limited to two dimensions, while an array can have
more than two dimensions
2. mat <- matrix(1:9, nrow = 3, ncol = 3)
print(mat)
3. Lists can hold elements of different types, while vectors must have
elements of the same type
4. Data frames allow columns to have different types (e.g., numeric,
character), unlike matrices
5. num <- 4
if (num %% 2 == 0) {
print("Even")
} else {
print("Odd")
}

4.13 Self-Assessment Questions


1. What is the difference between a vector and a matrix in R? Provide
an example of each.


2. Explain how a list differs from a vector and give a practical example
of when you would use a list instead of a vector.
3. Describe the structure of a data frame and explain why it is particularly
useful for working with tabular data.
4. Write an R code snippet using an if-else statement to determine
whether a number is positive, negative, or zero.
5. What is the purpose of the apply family of functions, and how do
they improve code efficiency compared to traditional loops?

4.14 References
‹ Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly
Media.
‹ Matloff, N. (2011). The Art of R Programming. No Starch Press.
‹ Crawley, M. J. (2012). The R Book. Wiley.

4.15 Suggested Readings


‹ Garrett Grolemund’s Hands-On Programming with R for beginners
exploring R fundamentals.
‹ Hadley Wickham’s Advanced R for a deeper dive into R’s programming
capabilities.



L E S S O N

5
Descriptive Statistics
Using R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.3 Importing Data File
5.4 Data Visualisation Using Charts
5.5 Measure of Central Tendency
5.6 Measure of Dispersion
5.7 Relationship between Variables
5.8 Summary
5.9 Answers to In-Text Questions
5.10 Self-Assessment Questions
5.11 References
5.12 Suggested Readings

5.1 Learning Objectives


After reading this lesson, students will be able to:
‹ Write code to import data files in R.

‹ Create and interpret various types of charts such as histograms, bar charts, box plots,
line graphs, and scatter plots for data visualization.
‹ Describe data using measures of central tendency (mean, median, mode).

‹ Describe data measures of dispersion (range, variance, standard deviation, IQR).

‹ Analyse relationships between variables using statistical measures such as covariance,


correlation, and the coefficient of determination.


DESCRIPTIVE STATISTICS USING R

5.2 Introduction
Data analysis is an important skill in today’s data-driven world, allowing
people and organizations to extract meaningful insights from raw data.
This lesson focuses on equipping you with the essential tools and tech-
niques in R, a powerful statistical computing and visualization language,
to handle data effectively. We begin by learning how to import data
files, a fundamental step in data analysis. Whether working with CSV
files, Excel sheets, or other formats, importing data correctly ensures a
seamless workflow for subsequent analysis.
Now we move on to data visualization, an essential part of the process
of data exploration and communication. You will learn how to unveil
patterns, distributions, and relationships in your data using visual
representations like histograms, bar charts, box plots, line graphs, and
scatter plots. Visualization not only helps you understand complex datasets
but also lets you communicate results to others effectively. Descriptive
statistics are the basis of data analysis. You will look into measures of
central tendency (mean, median, mode), which summarize the central value
of a dataset, and measures of dispersion (range, variance, standard
deviation, interquartile range), which describe the variability or spread
of data. These measures give an overall idea of the characteristics of
the data.
Lastly, we discuss the relationships between variables using concepts such
as covariance, correlation, and the coefficient of determination (R²). With
these tools, we can express and interpret the nature of the relationship
between variables. This is the foundation on which predictive modelling
and decision-making are based. At the end of this lesson, you will have
gained theoretical knowledge and practical skills in the analysis and
interpretation of data using R. Whether you are a beginner or are
enhancing your data skills, it gives you a good foundation for working
with real-world datasets.

5.3 Importing Data File


You must by now be familiar with the power of R for data analysis; to
perform data analysis on any dataset effectively, we need data from
various sources. Therefore, importing data into R is one of the most
essential skills. Like any other programming tool, R


also supports a variety of file formats for data import, such as
spreadsheets, text files (.csv or .txt), databases (MySQL, SQLite, and
PostgreSQL), statistical software (SPSS, SAS, or STATA), and other
formats (JSON, XML, HTML, etc.).
‹ For importing data from a CSV file we use the function read.csv();
the syntax is read.csv(filepath, header, sep), where filepath specifies
the location of the file, header specifies whether the first row
contains column names (TRUE/FALSE), and sep is used to provide the
delimiter (like “,” for CSV).

Code Window 1
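Since the code window is an image, here is a self-contained sketch; the file name "students.csv" and its contents are illustrative, and the file is written first so the example runs on its own:

```r
# Write a small CSV first so the import step has something to read
write.csv(data.frame(name = c("Asha", "Ravi"), age = c(25, 30)),
          "students.csv", row.names = FALSE)

data <- read.csv("students.csv", header = TRUE, sep = ",")
head(data)  # displays the first rows of the imported data frame
```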
This will load the file into data, and head() will display the first six
rows by default.
‹ To import other formats, we need to load the desired package from the
library; code to read Excel and JSON files is shown below in code
window 2.

Code Window 2
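A hedged sketch of importing JSON with the jsonlite package (assumed to be installed; the file name and records are illustrative). An Excel import with the readxl package is shown as a comment since it needs a real .xlsx file:

```r
library(jsonlite)  # assumes the jsonlite package is installed

writeLines('[{"name": "Asha", "age": 25}, {"name": "Ravi", "age": 30}]',
           "students.json")
jdata <- fromJSON("students.json")  # parses the JSON array into a data frame
print(jdata)

# Excel files are read similarly with the readxl package:
# library(readxl); xdata <- read_excel("students.xlsx")
```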
‹ R can also interact with databases using packages like DBI and
RMySQL, as shown in code window 3.

104 PAGE
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

Business Analytics.indd 104 10-Jan-25 3:52:07 PM


DESCRIPTIVE STATISTICS USING R

Notes

Code Window 3
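A hedged sketch of the DBI workflow; since a live MySQL server is not available here, an in-memory SQLite database (via the RSQLite package, assumed installed) stands in for it, and the table and query are illustrative:

```r
library(DBI)  # assumes DBI and RSQLite are installed

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # stand-in for a MySQL connection

dbWriteTable(con, "students",
             data.frame(name = c("Asha", "Ravi"), age = c(25, 30)))
result <- dbGetQuery(con, "SELECT name FROM students WHERE age > 26")
print(result)  # rows matching the SQL condition
dbDisconnect(con)
```

With RMySQL, only the dbConnect() call changes (host, user, password, and database name are supplied); the dbGetQuery() pattern stays the same.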
Once you import data to R you can start doing data analysis using the
built-in functions or libraries.

5.4 Data Visualisation Using Charts


In this section we will learn about data visualization. R provides a rich
ecosystem of libraries (like ggplot2, plotly, lattice, cowplot), each
offering unique capabilities to create a variety of charts, plots, and
interactive visualizations. We will learn to plot various charts using the
ggplot2 library, which is widely used; for this purpose we will use the
mpg dataset. This built-in dataset, provided by the ggplot2 package,
contains information about the fuel efficiency and various characteristics
of cars. The columns of the mpg dataset are shown below in Table 5.1.

Table 5.1: Columns of mpg Dataset

The first five rows of mpg dataset are shown below in code window 4.


Code Window 4
(i) Histograms: We can use a histogram to visualize the distribution of
a single continuous variable by binning, i.e., dividing its values
into intervals (bins). It is useful for identifying patterns such as
skewness, spread, or unusual gaps. The code is given in code window 5.

Code Window 5
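The histogram described in the text (bins of width 2, steel blue fill, black outline, labelled axes) can be sketched like this, assuming the ggplot2 package is installed:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
  labs(x = "Highway Mileage (mpg)", y = "Count")
print(p)
```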


In this code we have created a histogram of highway mileage (hwy).
The mileage is grouped into bins of width 2, and the fill color of the
bars is set to steel blue with a black outline. The x and y axes are
labelled using labs().
(ii) Bar Chart: We use bar charts to represent categorical data; they
are ideal for comparing discrete groups as they can show counts or
proportions of each category. Code window 6 shows a bar chart.

Code Window 6
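A sketch of a bar chart of counts per car class from the mpg dataset (the fill color is illustrative), assuming ggplot2 is installed:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue") +           # geom_bar counts rows per category
  labs(x = "Car Class", y = "Count")
print(p)
```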
(iii) Box Plot: It summarizes the distribution of a continuous variable
by displaying the median, quartiles, and potential outliers. It is
useful when we need to compare across multiple groups. The code
is shown in code window 7.


Code Window 7
This code gives a box plot of highway mileage (hwy) for different
car classes (class). The function geom_boxplot() generates a box plot
for each class with statistical summaries. The fill color is set to
light green.
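The box plot just described can be sketched as follows, assuming ggplot2 is installed:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "lightgreen") +  # one box (median, quartiles, outliers) per class
  labs(x = "Car Class", y = "Highway Mileage")
print(p)
```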
(iv) Line Graphs: They are recommended when we want to analyse trends
over a continuous variable or to observe relationships. The code to
generate a line graph is shown in code window 8.

Code Window 8
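The mpg dataset has no time column, so a common illustrative choice (an assumption, not necessarily what the original code window plotted) is to trace average highway mileage against engine displacement:

```r
library(ggplot2)

# average highway mileage for each engine displacement value
avg <- aggregate(hwy ~ displ, data = mpg, FUN = mean)

p <- ggplot(avg, aes(x = displ, y = hwy)) +
  geom_line(color = "steelblue") +
  labs(x = "Engine Displacement (L)", y = "Average Highway Mileage")
print(p)
```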

(v) Scatter Plots: When we need to visualize the relationship between
two continuous variables we can use scatter plots; they are an ideal
choice for identifying trends, clusters, or correlations. Code
window 9 shows how to generate a scatter plot.

Code Window 9
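A sketch of a scatter plot of two continuous mpg variables (the choice of displ versus hwy is illustrative), assuming ggplot2 is installed:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue") +  # one point per car
  labs(x = "Engine Displacement (L)", y = "Highway Mileage")
print(p)
```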

IN-TEXT QUESTIONS
1. Which function in R is commonly used to load a CSV file?
2. What type of chart is used to visualize the frequency distribution
of data?

5.5 Measure of Central Tendency


Measures of central tendency are among the most common statistical tools
for analysing and summarising a dataset and understanding its
characteristics. They summarize a set of data by identifying a single
value that represents the entire distribution of the dataset. The three
most common measures of central tendency are the mean, median, and mode.
Each of these measures provides unique insights, and the decision of
which one to use depends on the nature of the data and the analysis that
you need to do.


(i) Mean: The arithmetic average or mean is calculated by the formula
below:
Mean (x̄) = (x₁ + x₂ + … + xₙ) / n
It is a widely used measure for quantitative data, although it is
sensitive to outliers, which can skew the result. Code window 10
shows how to compute the mean of a dataset.
(ii) Median: The median is the middle value in a sorted dataset; it is
the central value if the dataset has an odd number of observations,
and the average of the two central values if the number of
observations is even. Compared to the mean, the median is less
affected by outliers. The R code is shown in code window 10.

(iii) Mode: The mode represents the value that appears most frequently
in a dataset. A dataset can be unimodal (having one mode), multimodal
(having more than one mode), or have no mode at all if no value
repeats. The corresponding code to compute the mode is given in
code window 10.


Code Window 10
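A sketch of all three measures on an illustrative vector; base R has no built-in mode function for data, so a small helper (hypothetical name get_mode, which returns every most frequent value) is commonly written:

```r
x <- c(10, 20, 20, 30, 40)

mean(x)    # 24
median(x)  # 20

get_mode <- function(v) {
  tab <- table(v)                         # frequency of each value
  as.numeric(names(tab)[tab == max(tab)]) # value(s) with the highest count
}
get_mode(x)  # 20 appears most often
```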

5.6 Measure of Dispersion


As you have seen in previous section though measure of central tendency
may provide information about the central values, this information can-
not be used alone to deduce characteristics of data. We need measure of
dispersion as well to understand how spread out the data is. They help
describe the variability or spread of data points around the central value.
Common measures of dispersion are discussed below, and their code is
shown in code window 11.
(i) Range: The simplest measure of dispersion is the range, which is the
difference between the maximum and minimum values in a dataset.
Although the range is easy to calculate, it is sensitive to outliers.
The formula for the range is:
Formula:
Range = Maximum Value – Minimum Value


(ii) Variance: It measures deviation from the mean, i.e., how far each
data point is from the mean, on average. A higher variance indicates
greater variability in the data. Variance is expressed in squared
units, which makes it harder to interpret directly. The formula for
the sample variance (the form computed by R’s var()) is given below:
Formula:
s² = Σ(xᵢ − x̄)² / (n − 1)
(iii) Standard Deviation: Since variance is difficult to interpret, we
use the standard deviation, which is the square root of the variance,
hence providing a measure of dispersion in the same units as the
original data. A smaller standard deviation means the data points
are closer to the mean.
Formula:
s = √( Σ(xᵢ − x̄)² / (n − 1) )
(iv) Interquartile Range (IQR): It measures the spread of the middle
50% of the data values, indicating how spread out the middle half
of a dataset is. For better understanding, imagine you line up all
your data from smallest to largest (sorted). The IQR focuses on the
middle 50% of those numbers, ignoring the smallest 25% and the
largest 25% of the values. The formula is shown below:
Formula:
IQR = Q3 – Q1
Hence, measures of dispersion help us understand how consistent or varied
the data is; for example, two datasets with the same mean can have very
different standard deviations, indicating different levels of spread. We
can use these measures to add depth to our understanding of data,
complementing the central tendency measures. They are essential for
making informed decisions based on data variability.


Code Window 11
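The four measures of dispersion can be sketched on one illustrative vector; var() and sd() in R use the sample (n − 1) formulas:

```r
x <- c(4, 8, 6, 5, 3, 7)

max(x) - min(x)  # range: 5
var(x)           # sample variance: 3.5
sd(x)            # standard deviation: sqrt(3.5), about 1.87
IQR(x)           # interquartile range: 2.5
```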

5.7 Relationship between Variables


By now we have understood various measures of central tendency and
dispersion; understanding the relationship between variables is also a
key aspect of data analysis. It helps in identifying patterns, trends,
and associations in data that can guide decision-making and predictions.
Relationships between variables can take several forms: a positive
relationship means two variables increase together, for example, studying
time and exam scores often show a positive relationship. A negative
relationship is observed when an increase in one variable leads to a
decrease in the other, like

distance from a Wi-Fi router and internet speed. A nonlinear relationship
is seen when variables have a curved or complex relationship instead of a
straight-line pattern (we will study these in lesson 6). There is no
relationship between two variables when changes in one variable do not
affect the other; shoe size and IQ, for example, have no relationship.
In this section, we will discuss three key concepts for measuring
relationships: covariance, correlation, and the coefficient of
determination (R²). Both covariance and correlation measure the linear
dependency between a pair of random variables, also called bivariate
data. You have already studied correlation and covariance in lesson 2;
here we will explore them again with R programming code.
Covariance is a statistical measure that indicates how two variables
change together. It shows whether an increase in one variable is
accompanied by an increase in the other, or whether they move inversely.
The formula for covariance is given below:

Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Thus, covariance measures how the two variables vary together: a positive
covariance means the relationship is direct, with both variables
increasing and decreasing together, while a negative covariance means
that one increases as the other decreases. If the covariance value
approximates zero, then apparently there is no linear relationship.
However, covariance is hard to interpret because it is measured in units
that are the product of the variables’ units, so it is hard to compare or
standardize across datasets. An example is shown in code window 12 below:

Code Window 12
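A sketch of cov() on illustrative study-hours/scores data; R's cov() uses the sample (n − 1) formula:

```r
study_hours <- c(2, 4, 6, 8)
scores <- c(50, 60, 70, 80)

cov(study_hours, scores)  # positive (100/3, about 33.33): the variables rise together
```

Note that the magnitude (about 33.33 "hour-marks") is hard to interpret on its own, which is exactly the limitation correlation addresses.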

While covariance provides directionality, interpreting its magnitude is
challenging due to its dependence on units. This can be resolved by
normalizing covariance, i.e., computing correlation, resulting in a
dimensionless value that lies between −1 and 1. It is formally known as
the Pearson correlation coefficient (r), and its formula is shown below:

r = Cov(X, Y) / (sX · sY)

where sX and sY are the standard deviations of X and Y. When r = 1 there
is a perfect positive linear relationship, r = −1 indicates a perfect
negative linear relationship, and r = 0 means no linear relationship.
Correlation indicates not only the strength of the relationship but also
its direction. The R code for computing correlation is given in code
window 13:

Code Window 13
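The correlation computation can be sketched in base R as follows (the data vectors are illustrative, not the textbook's exact listing). `cor()` computes the Pearson coefficient by default, which equals the covariance normalized by both standard deviations.

```r
hours  <- c(2, 4, 6, 8, 10)
scores <- c(50, 60, 65, 80, 90)

# Pearson correlation coefficient: dimensionless, between -1 and 1
cor(hours, scores)

# Equivalent to normalizing the covariance by both standard deviations
cov(hours, scores) / (sd(hours) * sd(scores))
```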
Correlation provides valuable insights into the strength and direction of
the relationship between two variables. A strong positive correlation (0.7
≤ r ≤ 1) indicates that as one variable increases, the other variable also
increases significantly, demonstrating a robust linear relationship. On the
other hand, a weak or no correlation (r ≈ 0) implies that changes in one
variable do not reliably predict changes in the other, indicating no
practically important linear relationship between the variables.
Further, the coefficient of determination, denoted R², is used to explain
variance: it quantifies how well one variable predicts another. It is
especially useful in regression analysis to evaluate goodness-of-fit.
It is computed by squaring the correlation coefficient:
R² = r²



BUSINESS ANALYTICS

This value represents the proportion of variance in one variable (Y)
explained by the other (X). For instance, R² = 0.81 implies that 81% of
the variation in Y can be explained by X.
Values of R² can be interpreted directly: R² = 0 indicates no predictive
relationship between the variables, i.e. the independent variable (X)
does not explain any of the variation in the dependent variable (Y),
while R² = 1 signifies perfect prediction, with all the variation in Y
entirely explained by X.
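In R, R² for a pair of variables can be obtained by squaring `cor()`. The sketch below reuses illustrative hours/scores vectors (invented values, not from the textbook):

```r
hours  <- c(2, 4, 6, 8, 10)
scores <- c(50, 60, 65, 80, 90)

r  <- cor(hours, scores)
r2 <- r^2   # proportion of variance in scores explained by hours
r2
```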
IN-TEXT QUESTIONS
3. Which measure of central tendency is the middle value of a
sorted dataset?
4. What is the statistical term for the difference between the
maximum and minimum values in a dataset?
5. What term describes the strength and direction of the linear
relationship between two variables?
6. Which metric indicates how well one variable explains another
in regression analysis?

5.8 Summary
This lesson provides a foundation in data analysis with R, covering how
to import data, visualize information, describe data, and check for
dependence among variables. You should now be able to read your data
files and import various kinds of sources, such as CSV or XLS files,
into your program. The next area
of key importance is visualization, where we discussed several types of
charts that allow us to visualize our data. Histograms help to find dis-
tributions. Bar charts are useful for categorical data comparisons. Box
plots summarize data distributions by summarizing medians, quartiles,
and outliers. Line graphs capture trends over time, and scatter plots
show the relationships between two variables. These tools not only help
in analyzing the data but also make it easy to communicate findings.
Further, descriptive statistics is explained which helps to describe char-
acteristics of the dataset. Measures of central tendency-mean, median,

and mode - help summarize the center of the data, while measures of
dispersion - range, variance, standard deviation, and IQR - help explain
the variation or spread in the data. Together, these measures allow for
a full description of the dataset.
Lastly, we covered relationships between variables. You have learnt to
analyze how variables are connected through covariance and correlation,
which describe how two variables change together and the strength of the
relationship. The coefficient of determination, or R², quantifies exactly
the degree to which one variable predicts another, giving a more nuanced
understanding of how they interact. Mastering these concepts and tools
will enable you to import, visually present, describe, and analyze data as
needed in preparation for more advanced data analysis and decision-making.

5.9 Answers to In-Text Questions


1. read.csv
2. Histogram
3. Median
4. Range
5. Correlation
6. Coefficient of Determination

5.10 Self-Assessment Questions


1. Explain the process of importing a CSV file in R.
2. Create a histogram and a scatter plot using a dataset of your choice.
3. What is the difference between variance and standard deviation?
Provide examples.
4. How do covariance and correlation differ? Explain with an example.

5.11 References
• Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using
R. SAGE Publications.




• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An
introduction to statistical learning: With applications in R. Springer.
• Kabacoff, R. I. (2015). R in action: Data analysis and graphics
with R (2nd ed.). Manning Publications.

5.12 Suggested Readings


• Matloff, N. (2011). The art of R programming: A tour of statistical
software design. No Starch Press.
• R Core Team. (n.d.). The R project for statistical computing.
Retrieved from https://cran.r-project.org/
• Wickham, H., & Grolemund, G. (2017). R for data science: Import,
tidy, transform, visualize, and model data. O'Reilly Media.



LESSON 6

Predictive and Textual Analytics
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi

STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.3 Simple Linear Regression Models
6.4 Confidence and Prediction Intervals
6.5 Multiple Linear Regression
6.6 Interpretation of Regression Coefficients
6.7 Heteroscedasticity and Multi-Collinearity
6.8 Basics of Textual Data Analysis
6.9 Methods and Techniques of Textual Analysis
6.10 Summary
6.11 Answers to In-Text Questions
6.12 Self-Assessment Questions
6.13 References
6.14 Suggested Readings

6.1 Learning Objectives


After reading this lesson students will be able to:
• Explain the differences between simple and multiple linear regression.
• Describe the purpose of confidence and prediction intervals.
• Evaluate the impact of heteroscedasticity and multicollinearity on regression models.
• Define and analyze the performance of text mining techniques.


6.2 Introduction
In this lesson we will focus on two important analytics for business:
Predictive and textual, each having its own importance and techniques.
Predictive analytics, often referred to as advanced analytics, is closely
linked with business intelligence, giving organizations actionable insights
for better decision-making and planning. For instance, if an organization
wants to know how much profit it will make a few years from now based on
current trends in sales, customer demographics, and regional performance,
it can benefit from predictive analytics. It uses techniques such
as data mining and artificial intelligence to predict outcomes like future
profit or other factors that may be critical to the success of the organi-
zation. At its core, predictive analytics is a process that makes informed
predictions about future events based on historical data. Analysts and data
scientists apply statistical models, regression techniques, and machine
learning algorithms to identify trends in historical data, so businesses
can predict risks and trends and prepare for future events.
Predictive analytics is changing industries because it can facilitate da-
ta-driven decision-making and efficiency in operations. In marketing, it
aids businesses in understanding customer behavior, leads segmentation,
and targeting of high-value prospects. Retailers apply predictive analytics
in order to personalize shopping experiences, optimize pricing strategies,
and merchandise plans. Manufacturing uses predictive analytics to en-
hance the monitoring of machine performance, avoid equipment failures,
and smooth logistics. Fraud detection, credit scoring, churn analysis,
and risk assessment are some of the benefits for the financial sector. Its
applications in healthcare include personalized care, resource allocation,
and identification of high-risk patients for timely interventions. As tools
get better, predictive analytics continues revolutionizing industries with
smarter insights and better outcomes.
Textual analysis is the systematic examination and interpretation of textual
data in order to draw meaningful insight, patterns, and trends. There are
different techniques and methods involved in the processing of unstructured
text, including various forms of documents, social media posts, or reviews
by customers. The intention of textual analysis is for the transformation
of raw text to become accessible and useful as input in decision-making



PREDICTIVE AND TEXTUAL ANALYTICS

processes or for research. This can mean identifying hot words, categorizing
text topics, or analyzing the sentiment associated with a particular piece
of text. Textual analysis is the backbone of a number of fields, such as
marketing, social science, and business intelligence, for garnering
valuable insights from huge streams of unstructured text.

6.3 Simple Linear Regression Models


We have seen relationships between variables in lesson 5; simple linear
regression can be defined as a statistical learning method used to examine
or predict the quantitative relationship between two continuous variables:
an independent variable called the predictor (X) and a dependent variable
called the response (Y). This method helps us model the linear relationship
between the variables and make predictions, assuming that there is
approximately a linear relationship between X and Y. Mathematically, we
can write this linear relationship as:

Y = β0 + β1X + ε

where β0 and β1 are model coefficients - unknown constants representing
the intercept and slope - and ε is the error term or random noise.
We can build a simple linear regression model in R using the following steps:
• The first step is to prepare the data: ensure that the dataset is clean
and contains no missing values for the variables involved, and load it
into R using functions like read.csv() or read.table().
• Once the data is loaded, visualize it using a scatter plot.
• Fit the regression model using the lm() function, then use the summary()
function to understand the details of the model.
• Make predictions using the predict() function.
We have already studied the reading techniques in previous lessons. We
will focus on linear regression functions in this section.
The lm() function in R is a built-in function for fitting linear models,
both simple and multiple linear regression. It estimates the relationship

PAGE 121
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

Business Analytics.indd 121 10-Jan-25 3:52:13 PM


BUSINESS ANALYTICS

Notes between a dependent variable and one or more independent variables


using the method of least squares.
Let us consider the code given in code window 1, which applies linear
regression to the mtcars dataset.

Code Window 1
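The workflow described around this code window can be sketched as follows. It uses only the built-in mtcars dataset and the `lm()`, `abline()`, and `predict()` functions named in the text; the 150 hp value in the prediction step is an illustrative choice, not the textbook's.

```r
# Simple linear regression on the built-in mtcars dataset
data(mtcars)

# Step 1: visualize the raw relationship
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower (hp)", ylab = "Miles per gallon (mpg)",
     main = "mpg vs hp")

# Step 2: fit the model predicting mpg from hp
model <- lm(mpg ~ hp, data = mtcars)
summary(model)          # coefficients, R-squared, p-values

# Step 3: overlay the fitted regression line
abline(model, col = "red", lwd = 2)

# Step 4: predict mpg for a hypothetical car with 150 hp
predict(model, newdata = data.frame(hp = 150))

# Step 5: inspect residuals for leftover patterns
plot(fitted(model), residuals(model))
abline(h = 0, lty = 2)
```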
In code window 1 we have shown the code for linear regression on
built-in dataset “mtcars” available in R. The goal here is to predict how

changes in horsepower ('hp') influence miles per gallon ('mpg'). So, in this
example we will examine information regarding "mpg" and "hp". Since
the data is already prepared and formatted, we visualize the data using
scatter plot. This will help us analyse whether there is a pattern or trend
in the data that is worth exploring further.
Next, we try to fit a simple linear regression model by using the formula
“lm(mpg ~ hp, data = mtcars)”, this predicts mpg based on hp. After
fitting the model, we enhance our scatter plot by adding a regression line
using the abline() function to show the model’s predicted relationship. To
make the model useful, we use the predict() function to estimate mpg for
a specific horsepower value. This prediction tells us what fuel efficiency
we might expect for such a car. Finally, we examine the model’s accu-
racy by plotting its residuals—the differences between the observed and
predicted values. This step helps us check if there are any patterns in
the errors, which could indicate that the model is missing some critical
information.

6.4 Confidence and Prediction Intervals


In predictive analysis, confidence intervals and prediction intervals are
two critical tools that can be used to quantify the uncertainty surrounding
any statistical estimates or predictions. Both give a quantitative
indication of the ranges in which the true values are expected to lie,
yet they have different uses. They play an important role in interpreting and
evaluating the regression model. They provide insight into the accuracy
of parameter estimates and the range within which individual predictions
are likely to fall.
In case of simple linear regression, confidence interval (CI) is used to
estimate the range within which the actual mean of the dependent variable
(y) lies for a given value of the independent variable (x). For example,
assume that you have a fitted regression model to predict mileage (mpg)
of a car given its weight(wt). A 95% confidence interval indicates that
the mean mileage for cars whose weight is 3.0 lies between 18.5 and
20.0 mpg. Code for this is given in code window 2.




Code Window 2
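The confidence-interval computation described here can be sketched as below, using `lm(mpg ~ wt)` and `predict()` with `interval = "confidence"` exactly as the text states (in mtcars, `wt` is measured in units of 1000 lbs):

```r
# Fit mpg as a function of car weight (wt, in 1000 lbs)
model <- lm(mpg ~ wt, data = mtcars)

# 95% confidence interval for the MEAN mpg of cars weighing 3.0
predict(model, newdata = data.frame(wt = 3.0),
        interval = "confidence", level = 0.95)
```

The output gives three columns: the fitted value and the lower/upper bounds of the mean response.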
In the code above, a simple linear regression model is fitted to the mtcars
dataset using lm(mpg ~ wt), where mpg is the dependent variable, and
wt is the independent variable that is the weight of the car. The model
predicts how car weight influences fuel efficiency. Then we have used
the predict function to compute a confidence interval for the mean re-
sponse (mpg) for the car with weight = 3.0. To do this, specify interval
= “confidence” and level = 0.95. The function will compute the range
in which the average mileage for cars weighing 3.0 will fall, with 95%
confidence. This computation gives the lower and upper bound of the
mean mpg, thus making it possible to evaluate the precision and reliability
of the estimated mean value.
A prediction interval (PI) gives the interval in which the actual value
of the dependent variable (y) is likely to lie for a given x. Unlike CIs,
prediction intervals include the residual error variability (ε). Thus, a
prediction interval gives a range for an individual data point rather than
the mean response. For example, let's consider the same regression model
as above, a prediction interval for a car with weight 3.0 might indicate
that its mileage lies between 16.0 and 22.0 mpg, with 95% confidence.
R code for same is given in code window 3.

Code Window 3
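The prediction-interval version differs only in the `interval` argument, as sketched below; the wider bounds reflect the added residual variability:

```r
model <- lm(mpg ~ wt, data = mtcars)

# 95% prediction interval for an INDIVIDUAL car weighing 3.0:
# wider than the confidence interval for the same wt
predict(model, newdata = data.frame(wt = 3.0),
        interval = "prediction", level = 0.95)
```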



Hence, confidence intervals are narrower than prediction intervals, since
they only account for the uncertainty in the estimate of the mean of the
response variable. Prediction intervals are wider because they encompass
the variability of individual observations about the regression line as
well as the uncertainty in the mean.
The visualization code and output of confidence and prediction interval
is shown in code window 4.

Code Window 4
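A sketch of the visualization described below it, matching the colors the text mentions (blue regression line, green confidence band, red prediction band). Computing both intervals on a grid of weights and shading with `polygon()` is one way to produce such a plot; the grid size of 100 is an arbitrary choice.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Evaluate both intervals on a grid of weights
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))
ci <- predict(model, grid, interval = "confidence")
pi <- predict(model, grid, interval = "prediction")

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "mpg",
     main = "Confidence vs prediction intervals")
# Shaded bands: prediction interval (red, wider) and confidence interval (green)
polygon(c(grid$wt, rev(grid$wt)), c(pi[, "lwr"], rev(pi[, "upr"])),
        col = rgb(1, 0, 0, 0.2), border = NA)
polygon(c(grid$wt, rev(grid$wt)), c(ci[, "lwr"], rev(ci[, "upr"])),
        col = rgb(0, 1, 0, 0.3), border = NA)
abline(model, col = "blue", lwd = 2)
legend("topright",
       legend = c("Regression line", "Confidence interval", "Prediction interval"),
       col = c("blue", "green", "red"), lwd = c(2, 8, 8))
```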
In this code we have visualized car weight against mileage for all the
cars in the mtcars dataset. The plot function is used to make a scatter




plot of weight vs. mileage; the abline function plots the fitted regres-
sion line in blue with a width of 2, representing the linear relationship
between the two variables. Two shaded areas are added to the plot to
represent the confidence intervals and prediction intervals. The confidence
interval is represented by a green shaded area around the regression line
that indicates the range within which the mean predicted values lie for a
given car weight. Similarly, the prediction interval is represented as a red
shaded area, which is the range in which individual predictions for new
data points are likely to fall. Finally, a legend is added to the top-right
corner of the plot to distinguish the regression line, confidence interval,
and prediction interval. The legend uses different colors and labels to
make the plot easy to interpret, with blue for the regression line, green
for the confidence interval, and red for the prediction interval.
Thus, both intervals serve analysts, decision-makers, and researchers in
quantifying uncertainty, improving predictions, and drawing appropriate
conclusions from regression models.

6.5 Multiple Linear Regression


Multiple Linear Regression (MLR) is just an extension of simple linear
regression that models the relationship between two or more independent
variables and a dependent variable. In MLR, the dependent variable is
predicted using a linear combination of multiple independent variables.
This method can be very helpful when we want to understand the influ-
ence of several independent factors on a single outcome or target variable.
The mathematical equation for MLR is:

y = β0 + β1x1 + β2x2 + … + βnxn + ε

where y is the target variable, β0 is the intercept term representing the
value of y when all independent variables are 0, x1, x2, …, xn are the
independent variables with coefficients β1, β2, …, βn, and ε is the error term.
For an MLR model to give valid results we need to fulfil a few assump-
tions that are stated below:
• The relationship between the independent and dependent variables
must be linear.


• Observations must be independent.


• The variance of the errors must be constant across the ranges of the
independent variables, i.e. homoscedasticity.
• Residuals (error terms) must be normally distributed.
• Independent variables should not have strong mutual correlations,
i.e. no multicollinearity.
The R code is given in code window 5.

Code Window 5
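An MLR fit of the kind discussed here can be sketched in one line with `lm()`; the predictors wt, hp, and cyl match the car weight, horsepower, and cylinders mentioned in the surrounding text.

```r
# Multiple linear regression: mpg predicted from weight, horsepower, cylinders
mlr <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(mlr)   # coefficients with p-values, R-squared, F-statistic
```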
Once the model has been fitted, the interpretation of the coefficients is
of great importance. In each case, the coefficient reflects the change in
the dependent variable for a one-unit change in the respective indepen-
dent variable, with all other variables held constant. For example, if the
coefficient for car weight is -0.1, then an increase in car weight of one
unit would reduce mpg by 0.1, assuming horsepower and cylinders are
held constant.
To assess the performance of MLR, various statistics like R-squared, the
F-statistic, and p-values can be used. Multiple Linear Regression is a
powerful technique for modelling and understanding the relationship between a




dependent variable and multiple independent variables. By fitting a model


and interpreting the coefficients, we can make predictions and assess the
effect of each predictor. Proper model evaluation, diagnostics, and check-
ing assumptions are necessary to ensure the reliability and validity of
the results. Multiple linear regression is widely applied in various fields,
such as economics, finance, marketing, and engineering. It applies where
complex relations between variables need to be understood and predicted.

6.6 Interpretation of Regression Coefficients


When working with a regression model, the coefficients carry important
meaning: they determine how each independent variable moves the dependent
variable. In simple terms, they describe the relationship of the predictor
variables to the outcome we are trying to predict. Let us understand this
concept with the example of predicting car mileage from car weight;
consider the equation of linear regression given in section 6.3.
β0 would tell you the expected mileage when car weight is zero; in reality
a car's weight is never zero, so the intercept is not directly meaningful,
but it does describe the starting point of the model. β1 represents the
slope of the regression line, i.e. the change in y corresponding to every
one-unit change in x. It tells how much the mileage is expected to increase
or decrease for every additional unit of car weight. The direction of this
relationship is given by the positive or negative sign of the coefficient.
The multiple linear regression interpretation is more complex, but the
same idea is still there. Suppose we have a model with more than one
predictor, such as predicting car mileage using weight, horsepower, and
the number of cylinders. The model might look something like this:

mpg = β0 + β1(weight) + β2(horsepower) + β3(cylinders) + ε
The key point is that each coefficient reflects the effect of the corre-
sponding variable, but only after controlling for the influence of the other
variables in the model. Here, β1 represents the change in mileage with
respect to a change in car weight, holding the other variables (horsepower
and cylinders) constant. Similarly, β2 shows the change in mileage for each
additional unit of horsepower, again holding weight and cylinders constant.




Understanding these coefficients is important because it helps us interpret
the model in the context of the real-world problem we're trying to
solve. Are the results meaningful? Do they make sense in the context of
what we know about the subject matter? For example, in the car mileage
example, it would be reasonable to expect that heavier cars generally
have lower mileage, and so a negative coefficient for weight might align
with our expectations. This is also a critical point because coefficients
alone don’t tell everything. They have to be seen in the context of other
statistical tests, like p-values and confidence intervals, to determine their
reliability. A coefficient with a very high p-value may indicate that the
corresponding predictor may not be as important as we assumed, and its
influence on the dependent variable may not be statistically significant.
Ultimately, the interpretation of regression coefficients is about under-
standing how the variables in your model relate to each other, and how
changes in one predictor may influence the outcome. From there, you
can make more informed decisions, predictions, and conclusions through
careful examination of these relationships.

6.7 Heteroscedasticity and Multi-Collinearity


As explained in the previous section, when building regression models it
is important, beyond examining the relationship between the variables, to
ensure that certain assumptions of the model are met. Two of these
assumptions are homoscedasticity (contrasted below with heteroscedasticity)
and no multicollinearity. They matter because violations can undermine the
reliability of your model and lead to incorrect conclusions.
Heteroscedasticity is defined as the case when the variation of errors
(i.e. the differences between observed and predicted values) varies with
the levels of the independent variables. In simple words, it means that
whenever the value of the independent variable is changed, the spread
or dispersion of the residuals also varies.
Consider an example of predicting house prices from square footage.
Heteroscedasticity occurs if the error variance for predictions on larger
houses is greater than for smaller houses: the errors (residuals) for large
houses are spread further out than those for smaller houses, violating



the assumptions of linear regression. Error variances should be
approximately constant at all levels of the independent variables.
Heteroscedasticity affects the standard errors of the coefficients, biasing
test statistics. The model may be statistically significant when it shouldn’t
be, or vice versa. Residual plots help analysts detect heteroscedasticity.
If the residuals plot as a random cloud around zero, everything is fine.
If they fan out or form a pattern as the independent variable increases,
heteroscedasticity may be present. A potential method to address hetero-
scedasticity is to take a log transformation of the dependent variable, or
use weighted least squares regression in which more weight is placed
on observations with less variability. To detect heteroscedasticity in your
regression model, you can use residual plots; the code is shown in code
window 6.

Code Window 6
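A residual-plot check of the kind described can be sketched as follows, plotting residuals against fitted values for a simple mtcars model:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Residuals vs fitted values: a random cloud around zero is fine;
# a funnel shape suggests heteroscedasticity
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
```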




If you observe heteroscedasticity in the residual plot, one way to address
it is by applying a log transformation to the dependent variable. The code
is shown in code window 7.

Code Window 7
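The log-transformation remedy can be sketched by wrapping the dependent variable in `log()` inside the model formula and re-checking the residuals:

```r
# Log-transform the dependent variable to stabilize error variance
log_model <- lm(log(mpg) ~ wt, data = mtcars)

plot(fitted(log_model), residuals(log_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
```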
Multicollinearity occurs when two or more independent variables in a
regression model are highly correlated with each other. In simple terms,
the independent variables are stepping on each other's toes, giving
redundant information and making it impossible to separate the individual
effect of each variable on the dependent variable. For example, if we
attempt to forecast a person's income given his or her level of
education and years of work experience. Since there is a strong tendency
for people with higher levels of education to have more work experience,
the two variables can be highly correlated. This kind of multicollinearity
renders it difficult to distinguish the contribution of each variable inde-
pendently to the prediction of income. The regression coefficients can
become unstable and generate widely fluctuating estimates with very
minor changes in data thus resulting in an unreliable model.
Analysts commonly use the Variance Inflation Factor (VIF) as a measure
to detect multicollinearity for each independent variable. A high VIF,
usually above 10, means that the variable is highly correlated with at
least one of the other predictors in the model. In case of multicollinearity,




the easy solution is to drop one of the correlated variables or combine
them into a single, more meaningful variable - for instance, by calculating
an index or an average. The code is shown in code window 8.

Code Window 8
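A VIF check of the kind described can be sketched with the `vif()` function from the car package (an external package assumed to be installed; it is one common source of a VIF function, not necessarily the one the textbook uses):

```r
# install.packages("car")   # assumed available; provides vif()
library(car)

mlr <- lm(mpg ~ wt + hp + cyl, data = mtcars)
vif(mlr)   # values above ~10 flag problematic multicollinearity
```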
Both heteroscedasticity and multicollinearity can reduce the effectiveness
of a regression model. Heteroscedasticity interferes with the estimation
of the standard errors of the coefficients, which may lead to misleading
significance tests, while multicollinearity prevents the assessment of the
effects of each predictor. Knowledge of these issues and their detection
and resolution is crucial in developing more reliable and interpretable
regression models.
IN-TEXT QUESTIONS
1. What type of regression model uses one independent variable?
2. What interval is used to estimate the range within which the
true value of a parameter lies?
3. What term describes the condition when the variance of errors
is not constant?
4. What is the term for the relationship between predictors and
the outcome in regression?
5. What type of regression issue arises from high correlations
between predictors?

6.8 Basics of Textual Data Analysis


Textual analysis includes information extraction from text, such as emails,
reviews, or posts on social media. It is commonly applied in tasks such




as sentiment analysis, keyword extraction, and text classification. Basic


steps of text analysis are:
• Text Preprocessing: cleaning and preparing the text for analysis, for
example converting to lowercase and removing stop words (such as "the"
and "is"), punctuation, and special characters.
• Tokenization: dividing the text into smaller tokens or units such as
words or phrases.
• Text Representation: converting the text into a form ready to be
analysed, for instance a frequency count of words or a bag-of-words
representation.

6.9 Methods and Techniques of Textual Analysis


There are three methods of textual analysis: text mining, categorization,
and sentiment analysis. Text mining gives the fundamental tools for clean-
ing and extracting features from raw text; categorization helps classify
text into predefined categories; and sentiment analysis gives insights into
the emotional tone of the text. Together, these methods unlock valuable
insights from large volumes of unstructured text data, making it more
useful for business, research, and decision-making purposes.

6.9.1 Text Mining


The process of mining useful information and knowledge from unstructured
text is called text mining. It involves transforming text data into a struc-
tured format that can further be analyzed. Text mining allows discovery of
patterns, relationships, and trends within large collections of text: books,
articles, reviews, social media posts, and many more. Text mining com-
monly includes text preprocessing, tokenization, word frequency analysis,
and also more complex techniques such as topic modeling and clustering.
In text mining, the main objective is to draw insights and gain a deeper
understanding of the content. For example, we could apply text mining to the
analysis of customer reviews in order to understand common complaints,
frequently mentioned features, or overall satisfaction. R has an extensive
list of libraries such as tm, tidytext, and textTinyR that make the process
of text mining easier with functions for cleaning text, tokenization, and
much more. The code window 9 shows sample code for text mining.

PAGE 133
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

Business Analytics.indd 133 10-Jan-25 3:52:17 PM


BUSINESS ANALYTICS

Notes

Code Window 9
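A minimal sketch of such a text-mining workflow using the tm package, assuming the package is installed (the review texts are invented for illustration):

```r
library(tm)

# Sample reviews (illustrative)
reviews <- c("Great battery life and fast shipping",
             "Battery drains quickly, very disappointed",
             "Fast delivery but the battery is average")

# Build a corpus and preprocess it
corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Word-frequency analysis via a Document-Term Matrix
dtm  <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq)  # most frequently mentioned terms, e.g. "battery"
```

The same matrix can feed more advanced techniques such as topic modeling or clustering.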

6.9.2 Categorization
Categorization refers to the process of assigning text to predefined categories
or labels based on its content. This method is used in many applications,
including email filtering (spam vs. non-spam), document classification
(business, sports, tech), and sentiment analysis (positive, negative, neutral).
The idea is to map a given text document to one or more categories that
best represent the content. Techniques of categorization involve super-
vised learning models including Naive Bayes, Support Vector Machines
(SVM), and Logistic Regression. Such models require labeled training
data to learn how to classify new, unseen data. Once trained, the model
can predict the category of a new document based on the patterns learned
from the training set. In R, text categorization can be performed by creating
a Document-Term Matrix (DTM) and applying a classification model such as
Naive Bayes. Sample code is shown in Code Window 10.

Code Window 10
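A minimal sketch of DTM-based categorization with the tm and e1071 packages, assuming both are installed (the training texts and spam/ham labels are invented for illustration):

```r
library(tm)
library(e1071)  # provides naiveBayes()

# Labelled training documents (illustrative spam/ham examples)
texts  <- c("win a free prize now", "meeting agenda attached",
            "claim your free reward", "project report due friday")
labels <- factor(c("spam", "ham", "spam", "ham"))

# Build a Document-Term Matrix from the training corpus
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Convert term counts to presence/absence factors for Naive Bayes
to_factor <- function(x) factor(x > 0, levels = c(FALSE, TRUE),
                                labels = c("No", "Yes"))
train <- as.data.frame(lapply(as.data.frame(as.matrix(dtm)), to_factor))

# Train the classifier on the labelled data
model <- naiveBayes(train, labels)

# Classify a new document using the training vocabulary
new_dtm <- DocumentTermMatrix(VCorpus(VectorSource("free prize inside")),
                              control = list(dictionary = Terms(dtm)))
new_doc <- as.data.frame(lapply(as.data.frame(as.matrix(new_dtm)), to_factor))
predict(model, new_doc)  # "free" and "prize" occur only in spam examples
```

A realistic classifier would of course need far more training documents than this toy set.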



PREDICTIVE AND TEXTUAL ANALYTICS

6.9.3 Sentiment Analysis


Sentiment analysis is the process of determining the emotional tone or
sentiment behind a piece of text. The aim is to classify text as expressing
a positive, negative, or neutral sentiment. This technique is widely used
for analyzing customer feedback, product reviews, social media posts,
and other forms of text to gauge public opinion or sentiment about a
particular topic.
There are two main approaches to sentiment analysis:
Lexicon-based Methods: These use predefined dictionaries of words with
positive, negative, or neutral sentiments. Words in the text are matched
against the dictionary, and the overall sentiment is determined by the
frequency or intensity of positive and negative words.
Machine Learning-based Approaches: A model is trained on labelled text
data in which the sentiment is known in advance, and is then applied to
classify new text. Techniques such as Naive Bayes, Support Vector
Machines, and deep learning can be used here.
R provides libraries such as syuzhet, tidytext, and sentimentr to perform
sentiment analysis. Sample code is given in Code Window 11.

Code Window 11
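A minimal lexicon-based sketch using the syuzhet package, assuming it is installed (the feedback texts are invented for illustration):

```r
library(syuzhet)

# Sample customer feedback (illustrative)
feedback <- c("Absolutely love this product, works perfectly!",
              "Terrible experience, it broke after one day.",
              "The package arrived on Tuesday.")

# Lexicon-based scoring: each text gets a numeric sentiment score
scores <- get_sentiment(feedback, method = "syuzhet")

# Map numeric scores to sentiment labels: positive scores indicate
# positive sentiment, negative scores negative, values near zero neutral
labels <- ifelse(scores > 0, "positive",
                 ifelse(scores < 0, "negative", "neutral"))
data.frame(feedback, scores, labels)
```

A machine learning-based alternative would instead train a classifier (for example Naive Bayes, as in the categorization example) on texts whose sentiment labels are already known.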

IN-TEXT QUESTIONS
6. What technique is used to classify text into predefined categories?
7. What statistical tool helps identify sentiment in text?
8. What analysis method breaks down text into meaningful elements
for extraction?



6.10 Summary
This lesson introduces essential concepts in regression analysis, starting
from simple and multiple linear regression models. It shows the role of
confidence and prediction intervals in statistical prediction and teaches
how to interpret regression coefficients. It also addresses two common
challenges: heteroscedasticity and multicollinearity. The second half of
the lesson covers textual data analysis techniques such as text mining,
categorization, and sentiment analysis, giving an overview of how these
techniques can be applied to extract insights from unstructured text
data. The practical use of R throughout ensures students are equipped
not only to carry out statistical analyses but also to perform text
analysis.

6.11 Answers to In-Text Questions


1. Simple
2. Confidence
3. Heteroscedasticity
4. Coefficients
5. Multicollinearity
6. Categorization
7. Sentiment
8. Text Mining

6.12 Self-Assessment Questions


1. Explain the difference between a confidence interval and a prediction
interval in regression analysis.
2. Describe how multicollinearity can affect the results of a multiple
linear regression model.
3. How does sentiment analysis help in analyzing customer feedback?
4. Explain heteroscedasticity in a regression model.



6.13 References
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction
to statistical learning: With applications in R. Springer.
• Fox, J. (2016). Applied regression analysis and generalized linear
models (3rd ed.). Sage Publications.
• Silge, J., & Robinson, D. (2017). Text mining with R: A tidy
approach. O’Reilly Media.

6.14 Suggested Readings


• Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of
statistical learning: Data mining, inference, and prediction (2nd
ed.). Springer.
• Wickham, H., & Grolemund, G. (2017). R for data science: Import,
tidy, transform, visualize, and model data. O’Reilly Media.
• Provost, F., & Fawcett, T. (2013). Data science for business: What
you need to know about data mining and data-analytic thinking.
O’Reilly Media.



Glossary

Array: Generalized matrix with more than two dimensions.


Big Data: Large datasets that are too complex and challenging to be processed by using
traditional data tools.
Central Tendency: The central point of a dataset that can be computed using measures
such as mean, median, or mode.
Coefficient of Determination (R²): A measure of the proportion of variation in
the dependent variable that is explained by the independent variable(s).
Control Flow: Programming constructs that control the order of execution, such as
if-else statements and loops.
Correlation: A standardized measure of the strength and direction of a relationship be-
tween two variables.
Covariance: A measure of how two variables vary together.
Data Analytics: The process of analysing datasets to find patterns and gain meaningful
information.
Data Cleaning: The process that involves detecting and correcting errors, inconsistencies,
and missing values in a dataset. Treatment of outliers, duplicate entries, or irrelevant data
points is essential in this stage.
Data Collection: The process that involves collecting raw data from a large number of
sources, such as databases, spreadsheets, APIs, or even sensors.
Data Frame: Tabular data structure that combines features of a matrix and list.
Data Importing: The process of loading external data files into R.
Data Integration: Integration of data from different sources into one dataset. It may in-
volve table merging, dataset joining, or another kind of data conflict resolution.
Data Preparation: Process that involves tasks such as data cleaning, transforming, and
arranging raw data in a format that enables effective analysis or model training.
Data Reduction: The process of reducing the size or complexity of a dataset;
it involves techniques such as feature selection, dimensionality reduction, and sampling.

PAGE 139
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi

Business Analytics.indd 139 10-Jan-25 3:52:20 PM


Data Splitting: Division of data into subsets, usually training, validation,
and test sets. These sets help a model builder to build models with the
data, tune their hyperparameters, and finally estimate their performance.
Data Transformation: The process that refers to the conversion of data
into a format or structure fit for analysis.
Data: A collection of raw facts and figures gathered for reference or
analysis.
Dispersion: The spread of data that can be measured by range, variance,
standard deviation, or IQR.
Factor: Data structure to store categorical data.
Function: A reusable block of code that performs a specific task, such
as sum() or user-defined functions.
List: Flexible data structure allowing elements of different types.
Matrix: A two-dimensional data structure with all elements of the same
type ordered in rows and columns.
Multiple Linear Regression: An extension of simple linear regression that uses
multiple independent variables to predict a dependent variable.
Operator: A symbol or set of symbols used to perform operations like
addition (+) or comparisons (==).
Package: A collection of R functions, data, and documentation that extend
R’s functionality, like dplyr or ggplot2.
Qualitative Data: Descriptive, non-numeric data such as nominal and
ordinal data.
Quantitative Data: Data that can be measured numerically, such as continuous
or discrete data.
Regression Coefficients: Values that represent the relationship between
the predictor variables and the response variable in a regression model.
Simple Linear Regression: A statistical method that models the relationship
between one dependent variable and one independent variable.
Text Mining: The process of extracting useful information from unstruc-
tured text data.
Vector: An ordered sequence of elements of the same type, such as
numeric or character.


