Business Analytics Unit 1-5

Editor
Deekshant Awasthi
Department of Distance and Continuing Education
E-mail: dceprinting@col.du.ac.in, commerce@col.du.ac.in

Published by:
Department of Distance and Continuing Education
Campus of Open Learning, School of Open Learning,
University of Delhi, Delhi-110007

Printed by:
School of Open Learning, University of Delhi
BUSINESS ANALYTICS
Reviewer
Ms. Aishwarya Anand Arora
Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh,
New Delhi - 110026 (11800 Copies, 2025)
Syllabus Mapping
Unit I: Introduction
Lesson 1: Introduction to Data Science (Pages 1–18)
Data and Data Science; data analytics and data analysis; classification of analytics; application of analytics in business; types of data: nominal, ordinal, scale; Big Data and its characteristics; applications of Big Data; challenges in data analytics.

Unit II: Data Preparation, Summarisation and Visualisation Using Spreadsheet
Lesson 2: Data Preparation, Summarisation and Visualisation Using Spreadsheet (Pages 19–50)
Data preparation and cleaning; sort and filter; conditional formatting; Text to Column; removing duplicates; data validation; identifying outliers in the data; covariance and correlation matrix; moving averages; finding missing values in data; summarisation; visualisation: scatter plots, line charts, histograms, etc.; pivot tables, pivot charts and interactive dashboards.

Unit III: Getting Started with R
Lesson 3: Getting Started with R (Pages 51–67); Lesson 4: Data Structures in R (Pages 68–101)
Introduction to R; advantages of R; installation of R packages; importing data from spreadsheet files; commands and syntax; packages and libraries; data structures in R: vectors, matrices, arrays, lists, factors, data frames; conditionals and control flows; loops; functions; and the apply family.

Unit IV: Descriptive Statistics Using R
Lesson 5: Descriptive Statistics Using R (Pages 102–118)
Importing data files; data visualisation using charts: histograms, bar charts, box plots, line graphs, scatter plots, etc.; data description: measures of central tendency, measures of dispersion; relationships between variables: covariance, correlation and coefficient of determination.

Unit V: Predictive and Textual Analytics
Lesson 6: Predictive and Textual Analytics (Pages 119–137)
Simple linear regression models; confidence and prediction intervals; multiple linear regression; interpretation of regression coefficients; heteroscedasticity; multicollinearity. Basics of textual data analysis: significance, application, and challenges. Introduction to textual analysis using R. Methods and techniques of textual analysis: text mining, categorization and sentiment analysis.
1
Introduction to Data Science
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
1.1 Learning Objectives
1.2 Introduction
1.3 Data and Its Types
1.4 Data Analytics and Data Analysis
1.5 Application of Analytics in Business
1.6 Big Data and Its Characteristics
1.7 Applications of Big Data
1.8 Challenges in Data Analytics
1.9 Summary
1.10 Answers to In-Text Questions
1.11 Self-Assessment Questions
1.12 References
1.13 Suggested Readings
1.2 Introduction
Data Science is an interdisciplinary field that combines statistics, data analysis, and machine learning to obtain meaningful insights and knowledge from data. It is based on the processes of gathering and analysing data and making informed decisions using the patterns that emerge. It is very versatile, as it allows businesses and organizations to enhance decision-making, perform predictive analyses, and discover hidden patterns within datasets. Data science is applied in many spheres, including banking, healthcare, manufacturing, and e-commerce, to serve critical applications such as optimizing routes, forecasting revenues, creating targeted promotional offers, and even predicting election outcomes.
A data scientist combines expertise in machine learning, statistics, programming (using tools like R), mathematics, and database management to work with raw data. This brings together a systematic approach: asking the right questions to define a problem, gathering and cleaning data, standardizing it for analysis, finding trends, and presenting actionable insights in a clear and impactful manner. By using data science, organizations can tap into this capability and discover the full potential of their data, whether that means improving operational efficiency, enhancing customer experiences, or providing competitive advantage. The field is still growing and expanding its scope and application, making it very much in demand in today's data-driven world.
This lesson will serve as your foundation for understanding data and data analysis: how analytics can be classified and how it can be applied to various businesses. We will also see which data falls into the category of big data, along with the various applications of big data and the challenges that occur while dealing with it.
Apart from the types of data, quality is also an important aspect. We are exposed to data every day, for example in news stories, weather reports and advertising, but how can we determine whether the data is of good quality or not? Quality matters throughout the entire data journey. The six aspects of data quality are relevance, accuracy, timeliness, interpretability, coherence, and accessibility.
Relevance: The relevance of data or statistical information reflects the degree to which it meets the needs of data users. Some questions that must be answered are: "Does this information matter?" "Does it fill an existing data gap?"
Accuracy: Accurate data give a true reflection of reality; data that are not accurate do not support any fruitful decision and hence have no value.
Timeliness: This refers to when data becomes available to the user or decision maker, that is, the delay between the time when the data are meaningful and when they are available. For example, the stock information of an e-commerce site needs to be updated and available as soon as an order is placed.
Interpretability: An information that people can’t understand has
no value and could even be misleading. To avoid such misunderstandings,
data is followed by meta data which is supplementary information or
documentation that allows users to interpret the data properly.
Coherence: It can be split into two concepts: consistency and
commonality. Consistency means using the same concepts, definitions
and methods over time. Commonality means using the same or
similar concepts, definitions and methods across different statistical
programs. If there is good consistency and good commonality, then
it is easier to compare results from different studies or track how
they stay the same or change over time.
Accessibility: It is defined as how easy it is for people to find,
get, understand, and use data. When determining whether data
are accessible, make sure they are organized, available, accountable,
and interpretable.
1.4 Data Analytics and Data Analysis
Data analytics and data analysis are two terms frequently used interchangeably, but they have different meanings in the context of working with data and extracting useful insights. While both are relevant to data-based decision-making, each has its own scope and purpose.
Data analytics refers to the whole process of examining datasets in or-
der to find trends, patterns, relationships, and other insights that might
help in the decision-making processes. There are various techniques and
processes through which data is analyzed and interpreted meaningfully.
Data analytics is usually applied to answer questions or solve problems or
predict outcomes. There are four main types of data analytics: descriptive,
diagnostic, predictive and prescriptive.
Descriptive Analytics: This type of analytics focuses on summarizing
past data and describing what happened. It includes the use of
historical data to identify trends and patterns, often through statistical
measures like mean, median, and mode. Descriptive analytics answers
questions like, “What happened?” and provides insights into the
past performance of an entity or system, like how well a business
performed last year.
Diagnostic Analytics: It goes a step further, identifying the causes of the trends or patterns found in the descriptive analytics phase. It addresses "Why did it happen?" by focusing on deeper analysis to understand the root causes of the observed data, such as determining the factors behind a drop in subscribers of an Instagram account.
Predictive Analytics: It makes predictions based on historical data using statistical models and machine learning algorithms. It answers the question, "What is likely to happen?" by analyzing trends, patterns, and relationships to anticipate future behaviours or outcomes, for example, predicting the sales of a new product for the coming six months.
Prescriptive Analytics: This type of analytics suggests possible actions and outcomes based on the analysis. It combines insights from all the other types of analytics to answer "What should we do?"
Notes processes, and gain a competitive edge in today’s fast-paced and da-
ta-driven business world. Businesses can understand their past and present
by harnessing data from different forms of analytics as well as forecast
future trends, optimize operations, and create more personalized customer
experiences. Let’s see some examples that will help you understand the
underlying worth of analysis.
One of the most powerful applications of analytics in business
is understanding customer behavior and preferences. Business
organizations can identify trends, predict needs, and personalize their
offerings to increase customer satisfaction and loyalty by analyzing
customer data. This personalized shopping experience boosts sales
and enhances customer engagement by making it easier for customers
to find what they need. Retailers, e-commerce platforms, and even
service industries like healthcare use analytics to segment customers,
predict their future needs, and tailor their marketing campaigns or
product recommendations accordingly.
Analytics is also important for improving business processes. Analyses of inventory, supply chain logistics, and workforce performance can help tune business processes to lower costs and optimize efficiency. The world's largest retailer, Walmart, uses predictive analytics to optimize inventory management: it considers the seasonality of demand for particular goods, thus ensuring that the right merchandise arrives at the stores in time and that stores are not overstocked. The same approach can be applied across sectors like manufacturing, logistics, and even hospitality, where strong demand forecasting and resource optimization are key to operational success.
Companies need to make wise financial decisions to stay ahead. Analytics can help forecast revenue, manage budgets, and assess financial risks so that companies have a well-supported decision-making process regarding investments and expenses. For example, predictive analytics can help a bank assess the creditworthiness of someone applying for a loan. By analyzing a consumer's financial history, spending patterns, or even social media activity, a bank can forecast the probability of loan repayment and charge the appropriate interest rate.
1.6 Big Data and Its Characteristics
Many malls, stores, and websites around the world generate similar information every second. This huge amount of data is termed big data. Businesses use this data to understand the preferences of customers, predict trends, and even recommend products to you (like "You might also like..."). For example, when Netflix suggests shows based on what you have watched, it is using big data to make smart recommendations tailored just for you.
In the modern technological world, data is expanding far too quickly, and people increasingly rely on it. Because of the rate at which data is expanding, it has become increasingly difficult to store it on any single server; to store, process, and analyze this huge amount of data, the concept of Big Data came into the picture. Big data is a collection of data that is so large in volume, and grows so exponentially with time, that it becomes difficult to store and manage; the traditional methods used to store, manage, and handle data have proven inefficient for it.
Hence, we can say big data refers to extremely large datasets that are too complex, vast, or fast-moving to be processed, stored, or analyzed using traditional data processing methods. It is the accumulation, management, and analysis of huge amounts of structured, semi-structured, and unstructured data to expose patterns, trends, and insights.
Some real-world examples of big data: on Instagram, many photos and videos are shared across the world every minute. Twitter generates billions of tweets per year, and each tweet can contain textual data, images, video, or audio. Gmail and Outlook are also examples, as around a billion emails are sent every day, most containing attachments such as text, video, and photos. Banks, e-commerce platforms, weather monitoring systems, CCTV cameras, etc., all contribute to big data.
Let’s understand the characteristics of big data or we can say 5 V’s of
big data:
Volume: It is the huge data amount generated which is major in
terms of Terabytes, petabytes, or even exabytes For example, the
likes, comments and post shared by billions of users of Facebook
every day.
1.9 Summary
This lesson introduced the basic concepts of data and its importance in the field of data science and analytics. It distinguished between data analysis and data analytics, with insight into the types of analytics: descriptive, diagnostic, predictive, and prescriptive.
1.12 References
Provost, F., & Fawcett, T. (2013). Data Science for Business.
O’Reilly Media.
Notes Sharda, R., Delen, D., & Turban, E. (2020). Analytics, Data Science,
and Artificial Intelligence: Systems for Decision Support. Pearson.
2
Data Preparation, Summarisation and Visualisation Using Spreadsheet
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
2.1 Learning Objectives
2.2 Data Preparation
2.3 Data Cleaning
2.4 Data Summarization
2.5 Data Sorting
2.6 Filtering Data
2.7 Conditional Formatting
2.8 Text to Column
2.9 Find and Remove Duplicates
2.10 Removing Duplicate Values
2.11 Data Validation
2.12 Identifying Outliers in Data
2.13 Covariance
2.14 Correlation Matrix
2.15 Moving Average
2.16 Finding Missing Values
2.17 Data Summarization
2.18 Data Visualization
Data Transformation: This refers to the conversion of data into a format or structure fit for analysis. Typically, it involves the normalization or scaling of numeric values, encoding of categorical variables, and aggregation or disaggregation of data.
Data Integration: This is the integration of data from different sources
into one dataset. It may involve table merging, dataset joining, or another
kind of data conflict resolution.
Data Reduction: This is a process for reducing either the size or the
complexity of the dataset, and it involves feature selection, dimensionality
reduction, and sampling, among others.
Data Formatting: Consistency in format, including standardized date
formats and variable naming conventions.
Data Splitting: Basically, it is the division of data into subsets, usually
training, validation, and test sets. These sets help a model builder to build
models with the data, tune their hyperparameters, and finally estimate
their performance.
Good data preparation is essential for generating valid and accurate insights; if the data quality is low, meaningful conclusions cannot be obtained.
Validating Accuracy: Check the data against reliable sources or business rules, and ensure that the data is logically consistent, such as ensuring all transactions have corresponding dates.
Removing Irrelevant Data: This can be done by filtering data, that is, by removing data that is not relevant to the analysis or that does not contribute useful information. This can include unnecessary columns, outdated records, or noise in the data.
Formatting and Structuring Data: This is done by ensuring that data
is in the correct format, such as consistent date formats or proper text
casing. Also, re-structure the data to meet the needs of the analysis, such
as pivoting tables or separating combined fields into distinct columns.
IN-TEXT QUESTIONS
1. What is the primary goal of data cleaning in a spreadsheet?
(a) To improve the appearance of the spreadsheet
(b) To remove inconsistencies and errors in the data
(c) To format data for printing
(d) To reduce the size of the spreadsheet
2. In data cleaning, what does “imputation” refer to?
(a) Removing unnecessary columns
(b) Filling in missing data with estimated values
(c) Filtering out irrelevant data
(d) Detecting outliers
For filtering data on numeric values, you can even select a comparison, like Between, to see only the values that lie in a given range.
2.7 Conditional Formatting
Conditional Formatting allows users to fill cells with a certain colour depending on a condition. This enhances data visualization and its interpretation. It also helps in identifying patterns in the data. Let us see how conditional formatting can be done in MS Excel.
Example: Highlight cells that have a value greater than 350.
Step 1: Select the range of cells on which conditional formatting has to be applied.
Step 2: On the Home tab, under the Styles group, click Conditional Formatting.
Step 3: Click Highlight Cells Rules > Greater Than....
Step 4: Enter the desired value and select the formatting style.
Step 5: Click OK.
2.8 Text to Column
The Text to Columns feature is used to separate single-column data into multiple columns. This enhances the readability of the data. For example, if a column contains first name, last name and profession in a single cell, this information can be separated into different columns. This allows columns to have atomic values. Note that this separation is possible only if the multiple values are separated by the same delimiter in the cell. These delimiters can be a comma, semicolon, space, or other characters. Let us see how we can split data in MS Excel.
Step 1: Select the cell or column that contains the text to be split.
Step 2: Select Data > Text to Columns.
Step 3: In the Convert Text to Columns Wizard displayed on the screen,
select Delimited > Next.
Step 4: Select the Delimiters for your data.
Step 5: Select Next.
Step 6: Preview the split and select Finish.
IN-TEXT QUESTIONS
3. What does Conditional Formatting allow you to do in a spreadsheet?
(a) Apply formulas automatically
(b) Highlight cells based on certain criteria
(c) Change data values based on formatting
(d) Sort data based on custom rules
4. To highlight only duplicate values in a range of data using
Conditional Formatting, which rule would you apply?
(a) Text that contains
(b) Top/Bottom Rules
(c) Highlight Cell Rules > Duplicate Values
(d) New Rule > Use a Formula
2.11 Data Validation
Data validation helps users control the input to ensure accuracy and consistency. While validating data, specific criteria for accepting data in cell(s) are set. This restricts users from entering invalid data. Thus, validating data not only enhances the accuracy, reliability and integrity of the data, but it also cuts the time spent manually checking and correcting data entries. In Excel, this can be done using the steps given below:
Step 1: Select the Cells for Data Validation
Step 2: In the Data tab, click on Data Validation to open the Data Validation dialog box.
Step 3: In the Data Validation dialog box, under the Settings tab, define
the validation criteria:
Allow: Select the type of data. This can be Whole Number, Decimal, List (only values from a predefined list are allowed), Date, Time, or Text Length (only text of a certain length is allowed). The last option is Custom, which is used for more complex criteria that can be specified using a formula.
Data: Specify the condition (e.g., between, not between, equal to, not
equal to, etc.).
Minimum/Maximum: Enter the acceptable range or limits based on the
above selection. For example, to allow values between 100 and 1000,
select “Whole Number,” “between,” and then set the minimum to 100
and the maximum to 1000.
You can optionally configure an Input Message that will appear when the cell is selected. For this, click on the Input Message tab in the dialog box. Give a brief title for the input message box and enter the guidance text that will appear when someone selects the cell. The guidance text will instruct the user on what type of data to enter.
Another optional feature in MS Excel is customizing the Error Alert. To do this, under the Error Alert tab, specify what should happen if a user enters invalid data:
Show Error Alert after Invalid Data is entered: Check this to enable
error alerts.
Style: Choose from Stop, Warning, or Information to indicate the severity
of the alert.
2.12 Identifying Outliers in Data
Analyze Data Values: After sorting the values, identify large data discrepancies and outliers. Such values can be deleted straight away, but a better option is to remove only statistical anomalies.
Identify Data Quartiles: To find the outliers in the data, calculate quartiles using Excel's automated quartile formula, beginning with "=QUARTILE(" in an empty cell. After the left parenthesis, specify the first and last cells in your data range separated by a colon, followed by a comma and the quartile you want. For example, the formula "=QUARTILE(A5:A50, 1)" will compute, for the values in cells A5 to A50, quartile 1 (the 25th percentile, or the value below which 25% of data points fall when arranged in increasing order); "=QUARTILE(B2:B200, 3)" computes the third quartile.
Define the Interquartile Range (IQR): The IQR represents the expected range of the data (without outlier values). It is calculated by subtracting the first quartile from the third quartile.
Calculate the Upper and Lower Bounds: Defining the upper and lower bounds of the data allows identification of values that are higher than expected (above the upper bound) or lower than expected (below the lower bound).
Calculate the upper bound by multiplying the IQR by 1.5 and adding the result to the third quartile: "= Q3 + (1.5 * IQR)". Similarly, to find the lower bound, multiply the IQR by 1.5 and subtract the result from the first quartile: "= Q1 - (1.5 * IQR)".
Remove the Outliers: After defining the upper and lower bounds of the data, review the data to identify values that are higher than the upper bound or lower than the lower bound. These values are statistical outliers; delete them for more accurate analysis or visualization reports.
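For readers who move on to R in Lesson 3, the same quartile-based check can be sketched there as well; the vector of values below is an illustrative assumption, not data from this lesson:

# Illustrative values; one of them (110) is clearly unusual
x <- c(12, 15, 14, 13, 110, 16, 14, 12, 13, 15)
q1 <- quantile(x, 0.25)        # first quartile
q3 <- quantile(x, 0.75)        # third quartile
iqr <- q3 - q1                 # interquartile range
upper <- q3 + 1.5 * iqr        # upper bound
lower <- q1 - 1.5 * iqr        # lower bound
x[x > upper | x < lower]       # values flagged as outliers: 110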
2.13 Covariance
Covariance is a statistical function that calculates the joint variability of two random variables, given two sets of data. To calculate covariance in Excel, use the COVARIANCE.P function (population covariance). The syntax is =COVARIANCE.P(array1, array2), where
Array1 is a range or array of numeric values.
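In R, which Lesson 3 introduces, covariance and correlation are built in. Note that R's cov() returns the sample covariance, corresponding to Excel's COVARIANCE.S rather than COVARIANCE.P; the two vectors below are illustrative:

ad_spend <- c(10, 12, 15, 17, 20)      # illustrative paired observations
sales    <- c(100, 110, 125, 140, 160)
cov(ad_spend, sales)                   # sample covariance
cor(ad_spend, sales)                   # correlation, always between -1 and 1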
2.15 Moving Average
In business, the moving average of sales for the last 3 months is calculated to understand market trends. To forecast weather, the moving average of three-month temperatures is calculated.
We can compute different types of moving averages: simple (or arithmetic), exponential, variable, triangular, and weighted. In this section, let us see how to calculate the simple moving average. In Excel, the simple moving average is calculated using formulas and trendline options. A simple moving average can be calculated using the AVERAGE function. Given a list of average monthly temperatures in column B, the moving average for the first 3 months can be calculated as =AVERAGE(B2:B4) or =SUM(B2:B4)/3. To find subsequent averages, the formula can be copied into the following rows.
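The same trailing 3-month average can be sketched in R (introduced in Lesson 3); the temperature vector below stands in for column B above:

temps <- c(21, 23, 25, 27, 26, 24)                  # illustrative monthly temperatures
ma3 <- stats::filter(temps, rep(1/3, 3), sides = 1) # trailing 3-month moving average,
print(ma3)                                          # like =AVERAGE(B2:B4) copied down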
IN-TEXT QUESTIONS
5. What is the main purpose of data validation in spreadsheets?
(a) To perform mathematical calculations on data
(b) To ensure that data entered meets specific criteria
(c) To visualize data using charts
(d) To automatically sort data
6. What is an outlier in a dataset?
(a) A value that is similar to other values
(b) A value that falls within the Interquartile Range (IQR)
(c) A value significantly different from other values in the
dataset
(d) A missing or blank value
7. Which statistical method can be used to detect outliers using
quartiles?
(a) Standard deviation
(b) Z-score
(c) Interquartile Range (IQR)
(d) Median
2.16 Finding Missing Values
Excel does not have any particular function to list missing values, but finding them is important for the following reasons:
Data Integrity: ensures that the dataset is complete.
Data Reconciliation: facilitates the reconciliation process (mostly used in finance).
Quality Assurance: identifies anomalies or data entry errors.
Efficient Analysis: enables accurate data analysis by spotting and addressing gaps.
Listing Missing Values in Excel
To identify and list missing values in Excel, you can use the following
functions:
IF, ISNUMBER and MATCH Functions:
IF: Returns one value if a condition is true and another if it’s false.
ISNUMBER: Checks if a value is a number.
MATCH: Searches for a value in a range and returns its relative
position.
Example: If column A has a list of values in the range 1 to 100, then missing values in this data can be identified by using the formula
=IF(ISNUMBER(MATCH(ROW(A1), A:A, 0)), "", ROW(A1))
Note that the syntax of the MATCH function is
MATCH(lookup_value, lookup_array, [match_type])
where
lookup_value is the value to be matched in the lookup_array;
lookup_array is the range of cells being searched;
match_type is optional.
It can have the value -1, 0, or 1; the default value is 1. This argument specifies how Excel matches lookup_value with values in lookup_array.
Now, drag and apply the formula from B1 to B100. Column B will then display the missing values from the list. Missing values can also be isolated by using the Filter feature on column B to display only the missing numbers, excluding blank cells.
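In R (introduced in Lesson 3), the same check is a one-line set difference; the observed values below are illustrative:

observed <- c(1, 2, 4, 5, 7, 10)    # illustrative values from column A
expected <- 1:10                    # the full range the data should cover
setdiff(expected, observed)         # missing values: 3 6 8 9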
2.18 Data Visualization
Microsoft Excel provides different types of charts to visualize data in the spreadsheet. To draw a chart, you need to follow the steps given below:
Step 1: Organize the data in rows and columns within the Excel sheet. Every row and column should be labelled clearly to identify the data to be visualized.
Step 2: Select the data by clicking and dragging the mouse to highlight the data to be visualized. Include the row and column headers in this selection (as shown in the figure).
Step 3: Choose a chart type by clicking on the “Insert” tab. In the “Charts”
section, select the required chart option (Column, Line, Pie, Bar, Area,
Scatter, etc.) by clicking on the dropdown arrow below the chart type.
Step 4: Insert the chart. Once the desired chart is selected, it is automatically created and inserted in the worksheet. Now, it can be clicked and dragged to change its position, or resized using the sizing handles at the corners.
Step 5: Customize the chart. For this, click on the chart to select it. Now,
you would be able to see two additional tabs: “Design” and “Format”.
Use these tabs to customize the chart’s appearance, style and layout. Im-
portant information like chart title, axis labels, legend, data labels, etc.
can be added to enhance visualization and data interpretation.
Step 6: Edit the data (optional). In case you wish to make changes to
the data, simply edit it in the worksheet. Excel will automatically update
the chart to reflect the changes.
Line Chart: The line chart plots data points and then connects these
points by lines. These lines show trends or change in values over time.
Line charts are widely used for continuous data like stock prices or
temperature measurements.
Pie Chart: A pie chart plots data as slices of a circle. Size of each slice
is proportional to the value it represents. That is, it represents the pro-
portion of each category within a whole.
If you want 3 pivot charts on the interactive dashboard, then you must have 3 pivot tables. So, you can simply duplicate the pivot table sheet in the Excel workbook.
Step 3: Create charts using the pivot table. For example:
The first chart will represent every product's monthly sales. For this chart, we need 3 data fields: Sales, Product, and Month. In the pivot table sheet, drag and drop the Month field into the rows area, Product into the columns area, and Sales into the values area.
Data Table to insert a table representing all values in the data table.
IN-TEXT QUESTIONS
8. Which of the following is the most suitable chart type for
displaying the proportion of different categories in a dataset?
(a) Line Chart
(b) Scatter Plot
(c) Pie Chart
(d) Histogram
9. Which of the following operations can you perform using a
pivot table?
(a) Filter data based on specific criteria
(b) Create complex formulas
(c) Sort data in a specific column
(d) All of the above
10. Which type of chart is commonly used in pivot charts to show
data changes over time?
(a) Bar Chart
(b) Pie Chart
(c) Line Chart
(d) Scatter Plot
11. What is a common benefit of using a dashboard for data analysis?
(a) It provides detailed data without summarization
(b) It allows for real-time monitoring of key metrics
(c) It removes the need for data visualization
(d) It only displays raw data without analysis
2.23 Summary
The aim of data preparation is to ensure good quality and consistency of data for specific tasks. During data preparation, we need to detect outliers that fall outside the expected range of values. These unexpected values could be due to errors or may require special attention. We need to decide whether to remove, correct, or leave outliers in the dataset. Sometimes outliers are valid and should be kept, but in other cases, they may need correction or exclusion.
Moreover, to validate the accuracy of the data, check it against reliable sources or business rules. Also, ensure that the data is logically consistent, such as ensuring all transactions have corresponding dates.
Data summarization is done to transform a given large dataset into a smaller, usually presentable, form for reporting, analysis, and further examination. It involves extracting central insights and patterns from data without losing vital information. Pivot tables are an important feature of MS Excel that allow users to quickly summarize large amounts of data, analyze numerical data in detail, and answer unanticipated questions about the data. Correspondingly, a pivot chart is a dynamic visualization tool that helps users summarize and analyze large datasets. Trends and patterns can be easily identified with pivot charts.
2.25 Self-Assessment Questions
1. Give the steps to clean data.
2. Explain any five processes that are performed to prepare data for
analysis.
3. What are the different ways to summarize data? Give examples.
4. Why is it important to analyse outliers? How can this be done?
5. How will you determine how strongly two variables are related to
each other?
6. Explain the significance of pivot tables.
2.26 References
McFedries, P. (2018). Excel data analysis for dummies (5th ed.).
John Wiley & Sons.
Middleton, M. R. (2021). Data analysis using Microsoft Excel (5th
ed.). Cengage Learning.
Alexander, M. (2016). Excel Power Pivot & Power Query for
dummies. John Wiley & Sons.
Winston, W. L. (2019). Microsoft Excel data analysis and business
modeling (6th ed.). Microsoft Press.
3
Getting Started with R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
3.1 Learning Objectives
3.2 Introduction
3.3 Installation
3.4 Importing Data from Spreadsheet Files
3.5 Commands and Syntax
3.6 Data Type
3.7 Operators
3.8 Functions
3.9 Summary
3.10 Answers to In-Text Questions
3.11 Self-Assessment Questions
3.12 References
3.13 Suggested Readings
3.2 Introduction
You have already studied the importance of data analytics, and in the previous lesson we explored data preparation, summarisation and visualisation with spreadsheets. In this chapter we will introduce you to a popular open-source programming language designed primarily for statistical computing and data analysis: R programming (referred to as R henceforth). Consider a retail company, "ShopSmart," that needs to analyse its daily sales. Currently it uses Excel for basic data handling, but this becomes challenging as the size of the data increases. Switching to R helps "ShopSmart" analyse larger datasets seamlessly, create informative visualizations, and understand customer purchasing trends, ultimately leading to better marketing strategies and inventory management. R was developed in the early 1990s by statisticians Ross Ihaka and Robert Gentleman. It has become a commonly accepted tool among data scientists, researchers, and analysts. Compared with traditional tools like Excel, R is more flexible and scalable for data analysis. With R, more complex computations and visual displays for large datasets can be performed and handled by the programmer. R is used in a wide range of areas; in business and commerce, it is used for tasks like market prediction, financial risk appraisal, investment optimization, customer behaviour analysis, sales forecasting, inventory control, customer feedback analysis, customer data segmentation, and campaign performance measurement. Thus, R is an important tool for making data-driven decisions in commerce, and it gives commerce students the ability to process and interpret data efficiently. Let's understand some of the advantages of R and why we should prefer R for data analysis.
Statistical software generally has very costly licenses, but R is
completely free to use, which makes it accessible to anyone interested
in learning data analysis without needing to invest money.
R is a versatile statistical platform providing a wide range of data
analysis techniques, enabling virtually any type of data analytics
to be performed efficiently and having state-of-the-art graphics
capabilities for visualization.
Data is mostly gathered from a variety of sources, and analysing it in one place has its own challenges. R can manage data from a variety of sources, including text files, spreadsheets, databases, and web APIs, making it suitable for any business environment.
3.3 Installation
To begin with R, students need to install both R (the base programming
language) and RStudio, which is an Integrated Development Environment (IDE) that makes working with R much easier. RStudio provides a more user-friendly interface than R's base interface, making coding, visualizing outputs, and managing projects more straightforward. Follow the steps mentioned below in Table 3.1 to download R and RStudio.
Table 3.1: Installation of R and RStudio
For R
Step 1: Go to CRAN (Comprehensive R Archive Network): https://cran.r-project.org/.
R Interface
RStudio Interface
Source Editor Pane: In the RStudio IDE, you can access the source editor (marked as 1 in Figure 3.1) for R code. It is a text editor that can be used for various forms of R code (shown in 2 of Figure 3.1), such as standard R Script, R Markdown, R Notebook, R Sweave, etc. We can write and edit code here in the editor.
Console Pane: This pane (shown as 3 in Figure 3.1) hosts the R interpreter, where R code is processed. This pane shows the execution of the R code written in the editor, and the results are displayed here.
Environment Pane: This pane can be used to access the variables created in the current R session. The workspace containing all variables can be exported, as well as imported (from an existing file), as an R data file in the environment window.
Output Pane: This pane contains the Files, Plots, Packages, Help, Viewer, and Presentation tabs. Files allows users to explore files on the local storage system. Plots displays all graphical output produced by the R interpreter. In the Packages tab, you can view the packages installed in your RStudio and load them manually. In the Help tab, documentation for various R functions can be searched and viewed.
If you want to access code or data written by other people, you can do that as well using packages. As you already know, R has open community support; hence, many R packages are available. R packages are pre-written sets of functions that perform certain tasks and enhance R's capabilities. In simple terms, a package is a bundle of data, functions, and help pages stored in one place. In Figure 3.2 we have installed the 'tidyverse' package in RStudio; it is a popular collection of packages designed to make data science easier, including tools for importing, tidying, transforming, and visualizing data.
You can specify the sheet name, range of cells, and column types for better control. Sample code is shown in Code Window 1.
Code Window 1
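The code window is an image in the original. A minimal sketch of the kind of call it shows, assuming an illustrative file sales.xlsx with a sheet named Q1 (both names are assumptions):

# install.packages("readxl")   # run once if the package is not installed
library(readxl)
sales <- read_excel("sales.xlsx",
                    sheet = "Q1",                      # sheet name
                    range = "A1:D50",                  # range of cells
                    col_types = c("text", "numeric",
                                  "numeric", "date"))  # column types
head(sales)   # inspect the first few imported rows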
IN-TEXT QUESTIONS
1. What is the difference between a package and a library in R?
2. Which package is commonly used to import Excel files into R?
Code Window 2
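The contents of Code Window 2 are not reproduced in this copy. Given the surrounding discussion of importing data, a plausible sketch is reading a comma-separated file with base R (the file name is an assumption):

# Base R can read CSV files without any extra package
sales_csv <- read.csv("sales.csv", header = TRUE)
str(sales_csv)   # show the structure of the imported data frame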
3.5 Commands and Syntax
Comments are text written for the clarity of code; they help the reader understand your code and are ignored by the interpreter during program execution. A single-line comment is given using # at the beginning of the statement. R does not support multiline comments.
Code Window 3
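A sketch of comment usage consistent with the description above:

# This entire line is a comment and is ignored by the interpreter
total <- 250 + 130   # a comment can also follow a statement
# R has no multiline comment syntax; prefix each line with #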
Variables act as containers that hold data or values, which can be used and manipulated throughout your program. A variable in R is created using the assignment operator <- or =. Variables make working with data much easier by giving meaningful names to the values you want to use. Variables in R are flexible: you don't have to declare their type explicitly. R automatically understands whether you're storing a number, text, or something else.
Code Window 4
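A sketch of variable creation of the kind the code window illustrates; the names and values are illustrative:

revenue <- 50000         # a number, assigned with <-
company = "ShopSmart"    # text, assigned with =
in_profit <- TRUE        # a logical value; the type is inferred automatically
print(revenue)
print(company)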
There are certain rules for valid variable names in R, as discussed below:
A variable name can include letters (a-z, A-Z), digits (0-9), and the
dot (.) or underscore (_) but cannot start with a number.
R is case-sensitive: var and Var are two different identifiers.
Reserved keywords in R cannot be used as variable names.
No special character other than underscore and dot is allowed.
Variable names starting with a dot are allowed, but they should not
be followed by a number. It is not advised to use dot as starting
character.
Some examples of valid and invalid variable names are shown in Table 3.2.
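Table 3.2 is not reproduced in this copy; examples consistent with the rules above would be:
Valid: total_sales, sales.2024, Var1, .hidden
Invalid: 2sales (starts with a digit), sales-total (special character), TRUE (reserved word), .2way (dot followed by a digit)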
Keywords: These are integral part of R’s syntax, keywords are used to
implement various functionalities in R. These are also called reserved
words and are predefined having specific roles within the language. The
list of reserved words in R is quite comprehensive which can be accessed
by executing ‘help(reserved)’ or using ‘?reserved’. Table 3.3 shows the
list of reserved words in R.
Table 3.3: Reserved Words
3.6 Data Type
Unlike C or C++, we are not required to declare a variable with a data type in R. The data type of a variable is determined dynamically by the values it has been initialized to. There are various data types available in R; a list of data types is shown in Table 3.4. Apart from these general data types, R also supports a number of flexible data structures, such as vectors, lists, arrays, and data frames, which will be discussed in later lessons.
Table 3.4: Data Types in R
R provides functions to view the variables that are currently defined in your R environment. The following functions list the variables that are currently available:
Use ls() to list all variables in the current environment.
ls(pattern = "name") will give the list of variables matching the given pattern.
Another function that can be used to display variables is objects().
We can also remove variables from the R environment using the following functions:
rm(variable_name) removes a single variable.
rm(var1, var2, var3) will remove the multiple variables mentioned as arguments.
rm(list = ls()) will remove all variables.
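A short sketch of these environment functions in use:

a <- 1
avg <- 1.5
b <- 2
ls()                 # "a" "avg" "b"
ls(pattern = "a")    # variables whose names contain "a": "a" "avg"
rm(b)                # remove a single variable
rm(list = ls())      # remove all variables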
3.7 Operators
Operators are tools that help us perform various operations on data. Whether we are doing basic calculations or more advanced logical comparisons, operators tell R what action to take on the data. There are various operators available in R programming; we will discuss them in this section.
Arithmetic Operators are the simplest and most frequently used operators. They allow us to carry out simple math operations like addition, subtraction, multiplication, and division. For example, 5 + 3 adds two numbers together and gives a result of 8. For 5 %% 2, the remainder is calculated, which is 1. Advanced arithmetic is also available, such as exponentiation through either the ^ or ** operator, which enables you to raise a number to a power. These operators are not restricted to single numbers: they also work element-wise on numeric vectors, making computations easy even with very big datasets. Table 3.5 shows the various arithmetic operators.
Table 3.5: Arithmetic Operators
Relational Operators are used to compare values and check for conditions like equality, greater than, or less than. For instance, 5 > 3 checks whether 5 is greater than 3 and returns TRUE. Similarly, 5 == 3 checks for equality and returns FALSE. Relational operators are widely used in filtering or subsetting data, where you find the rows of a dataset that satisfy some condition. For example, you could use age > 18 to find all rows in a dataset where the age is above 18. Relational operations always return logical values (TRUE or FALSE). Table 3.6 shows the various relational operators.
Logical Operators let you combine or modify logical values. You can use & to perform an AND operation, which is TRUE only if both conditions are satisfied. For instance, TRUE & FALSE is FALSE. Likewise, you use the | operator for an OR operation, meaning that the result is TRUE if at least one condition is satisfied. The ! operator negates a logical value, turning TRUE into FALSE and vice versa. Logical operators are especially useful when dealing with multiple conditions in your data. For instance, age > 18 & gender == "Male" can filter male individuals above the age of 18 in a dataset. Table 3.7 shows the various logical operators.
Table 3.7: Logical Operators

Operator   Meaning                    Example           Result
&          AND (element-wise)         TRUE & FALSE      FALSE
&&         AND (single comparison)    TRUE && TRUE      TRUE
|          OR (element-wise)          TRUE | FALSE      TRUE
||         OR (single comparison)     FALSE || FALSE    FALSE
!          NOT (negation)             !TRUE             FALSE
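A short sketch pulling the three operator families together; the age vector is illustrative:

5 + 3                       # arithmetic: 8
5 %% 2                      # remainder: 1
2 ^ 3                       # exponentiation: 8
5 > 3                       # relational: TRUE
age <- c(15, 22, 18, 30)
age > 18                    # element-wise: FALSE TRUE FALSE TRUE
age[age > 18 & age < 40]    # logical AND used for filtering: 22 30
!TRUE                       # negation: FALSE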
3.8 Functions
In R, user-defined functions enable you to create reusable blocks of
code to perform specific tasks. Functions are especially useful in busi-
ness analytics for automating repetitive operations, performing custom
calculations, or implementing domain-specific logic. By defining your
own functions, you can encapsulate complex logic into simple, reusable
units, which improve the clarity and efficiency of your code.
In R, a function is defined with the keyword function(). Inputs are specified as arguments, and you write the logic that works with these inputs and generates the desired output. A well-crafted function has three components: a name, which is a descriptive identifier for the function; arguments, which are variables passed into the function for customization; and the body, the code block where the logic is executed. Code Window 5 shows an example of a function in R.
Code Window 5
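The code window is an image in the original; a sketch of a user-defined function with the three components named above (the discount example itself is an illustrative assumption):

# name: apply_discount; arguments: price, rate; body: the code in braces
apply_discount <- function(price, rate = 0.10) {
  discounted <- price * (1 - rate)
  return(discounted)
}
apply_discount(500)         # default rate of 10%: returns 450
apply_discount(500, 0.25)   # custom rate of 25%: returns 375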
3.9 Summary
This chapter has given an overview of working with R for data analysis,
including both foundational concepts and practical tools, to equip readers
with the essential skills. The Introduction to R looked at the basics of this
versatile programming language, powerful features, and its critical role
in data analysis. The Installation section guided readers through setting
up R and RStudio to ensure a smooth start to coding. Moving forward,
the chapter covered Importing Data, explaining how to read spreadsheet
files into R using the readxl package, an important part of data prepara-
tion. There was a focus on understanding Commands and Syntax, particularly R's case sensitivity and the need for proper formatting to avoid execution errors.
The Data Types section highlighted the primary types: numeric, character,
logical, and factors that form the basis of data handling and analysis in R.
Further details about the use of Operators, namely, arithmetic, relational,
logical, and special operators were described in detail for data manip-
ulation and analysis. Finally, the chapter ended with a presentation of
Functions, both built-in and user-defined, to perform tasks with minimal
repetition and compute complex calculations efficiently. Together, these
topics set the strong foundation for using R in data analysis.
3.11 Self-Assessment Questions
1. How do you install and load a package in R? Provide a code example.
2. Explain the difference between numeric and character data types in
R.
3. Write a user-defined function in R that calculates the square of a
number.
4. Describe how to import a .csv file into R. Mention any required
functions or packages.
5. Identify two relational and two logical operators in R.
3.12 References
Grolemund, G., & Wickham, H. (2016). R for Data Science: Import,
Tidy, Transform, Visualize, and Model Data. O’Reilly Media.
Matloff, N. (2011). The Art of R Programming. No Starch Press.
4
Data Structures in R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.3 Vectors
4.4 Matrices
4.5 Lists
4.6 Factors
4.7 Data Frames
4.8 Conditionals and Control Flows
4.9 Loops
4.10 Apply Family
4.11 Summary
4.12 Answers to In-Text Questions
4.13 Self-Assessment Questions
4.14 References
4.15 Suggested Readings
Identify when and how to select among various data structures and control mechanisms.
Write cleaner, more efficient R code using the strength of functional
programming.
4.2 Introduction
In this lesson we will discuss data structures and their use in organizing and processing data more efficiently. You may think of a data structure as a blueprint that indicates how to arrange and store data. The design of any structure is deliberate, as it allows data access and manipulation in certain structured ways. We use specialized methods or functions to interact with these structures in programming and statistical software like R. These tools are built to make working with data of all shapes and forms easier. R offers six key data structures to work with: vectors, matrices, arrays, lists, factors and data frames.
Further, these can be divided into two categories: homogeneous and heterogeneous structures. The first three (vectors, matrices, and arrays) are like neat, organized boxes where everything is of the same type; hence they are called homogeneous. On the other hand, the heterogeneous structures are data frames and lists, which allow for greater flexibility: they can accommodate elements of various types coexisting together. A factor is a special data structure used for handling categorical data (nominal or ordinal). In the subsequent sections we will discuss these data structures. A point to remember for those who are already familiar with programming: R has no scalar types; in fact, numbers, strings and other scalars are vectors of length one.
4.3 Vectors
The vector is one of the basic data structures in the R programming language. It is used to store multiple values of the same type, also called mode. It is one-dimensional and can hold numeric, character, logical or other values, but all the values must have the same mode. Vectors are fundamental to R; hence most operations are performed on vectors. Various types of vectors are shown in Table 4.1 below:
Creating a Vector
You can create vectors using the c() function, which stands for combine or concatenate. Vectors are stored contiguously in memory, just like arrays in C; hence the size of a vector is determined at the time of creation. Thus, any modification to a vector leads to reassignment (internally, a new vector with the same name is created). Code to create and display a few vectors is shown below in code window 1.
Code Window 1
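The original code window is an image; a minimal sketch of the kind of code it likely contained (names and values illustrative):
# Create vectors with c(); all elements of a vector share one mode
v1 <- c(3, 23, 4)              # numeric vector
v2 <- c("a", "b", "c")         # character vector
v3 <- c(TRUE, FALSE, TRUE)     # logical vector
print(v1)
print(v2)
print(v3)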
Another point to note is that the c() function allows you to modify or reassign an existing vector, as shown in code window 2.
Code Window 2
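A sketch of what code window 2 likely showed, based on the explanation that follows (append() is one standard way to insert at a given position):
v1 <- c(3, 23, 4)
v1 <- c(v1, 10)                   # adds 10 at the end: 3 23 4 10
v1 <- c(3, 23, 4)
v1 <- append(v1, 10, after = 3)   # inserts 10 at the 4th position
print(v1)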
This will add 10 to the vector v1 at the end or at the 4th position, as instructed. Vectors are useful for analysis, as R allows us to apply various operations to them. In this section we will explore the operations that can be used on vectors.
Length: We can obtain the length of a vector using the length() function; this can be used to iterate over a vector in loops.
Indexing: Elements are accessed by position. For a vector v1 <- c(3, 23, 4), length(v1) and print(v1) will give 3 and (3, 23, 4) as output. You can also give a negative index to omit a value: print(v1[-2]) will output all values except the element at the second index.
Filtering: You can also filter vectors by applying logical expressions that return TRUE/FALSE for each element; the output consists of the elements for which the value is TRUE.
Code Window 3
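A minimal sketch of the length, indexing and filtering operations described above (v1 reconstructed from the output mentioned in the text):
v1 <- c(3, 23, 4)
length(v1)      # number of elements: 3
print(v1)       # 3 23 4
print(v1[-2])   # negative index omits the second element: 3 4
v1 > 3          # logical expression per element: FALSE TRUE FALSE
v1[v1 > 3]      # filtering keeps elements where the value is TRUE: 23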
Miscellaneous Functions: There are certain functions shown below
in Table 4.4 which can be used with vectors, as required.
Table 4.4: Miscellaneous Functions
4.4 Matrices
Now that you have understood vectors and the various operations that can be applied to them, let's talk about matrices. You can understand a matrix as an enhanced vector: it's really nothing but a vector with two extra attributes, namely the number of rows and the number of columns.
As with vectors, matrices are also homogeneous. However, don't mix up one-row or one-column matrices with vectors; they are not the same.
Now, matrices are actually a special case of a broader concept in R called arrays. While matrices have just two dimensions (rows and columns), arrays can go further and have multiple dimensions. For instance, a three-dimensional array has rows, columns, and layers, adding an extra level of organization to your data. The reason that matrices are useful in R is the vast range of operations that you can carry out on them. Many of these operations build on what you already know about vectors, such as subsetting and vectorization, but extend these to two dimensions. The added structure of rows and columns makes matrices ideal for mathematical operations, data manipulation, and statistical modelling.
Creation: Matrices are generally created using matrix() function,
the data in matrices is stored in column major format by default.
The ‘nrow’ parameter specifies rows, and ‘ncol’ specifies columns.
We can use ‘byrow = TRUE’ to fill data row-wise in matrix instead
of column-wise. Code to create a matrix using the matrix() function and from vectors is shown below in code window 4.
Code Window 4
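A sketch of matrix creation consistent with the description above (values illustrative):
m1 <- matrix(1:6, nrow = 2, ncol = 3)        # filled column-wise by default
m2 <- matrix(1:6, nrow = 2, byrow = TRUE)    # byrow = TRUE fills row-wise
v  <- 1:6
m3 <- rbind(v[1:3], v[4:6])                  # building a matrix from vectors
print(m1); print(m2); print(m3)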
Code Window 5
Code Window 6
Code Window 7
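Code window 7 is an image; based on the discussion below, it likely contrasted element-wise and matrix multiplication, sketched here:
a <- matrix(1:4, nrow = 2)
b <- matrix(5:8, nrow = 2)
a * b          # arithmetic (element-wise) multiplication
a %*% b        # matrix multiplication
rowSums(a); colSums(a)     # sums of rows / columns
rowMeans(a); colMeans(a)   # means of rows / columns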
As you may have noticed in code window 7, arithmetic multiplication and matrix multiplication are two different operations. Some of the other useful functions are rowSums() and colSums(), which give the sums of rows/columns, and rowMeans() and colMeans(), which give the means of rows/columns.
Just like vectors, matrices support indexing and subsetting. You can access specific elements, rows, or columns using indices as shown in code window 8.
Code Window 8
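A minimal sketch of matrix indexing, as described above:
m <- matrix(1:9, nrow = 3)
m[2, 3]       # element in row 2, column 3
m[1, ]        # entire first row
m[, 2]        # entire second column
m[1:2, 2:3]   # a 2 x 2 sub-matrix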
Matrices can also be subset based on logical criteria. Some examples are shown below in code window 9.
Code Window 9
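A sketch of logical subsetting on matrices (values illustrative):
m <- matrix(1:9, nrow = 3)
m[m > 5]             # values greater than 5, returned as a vector
m[m %% 2 == 0] <- 0  # replace even values with 0
print(m)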
You can also give names to the rows and columns of a matrix using the dimnames() function or by specifying them during the creation of the matrix (as shown in code window 10).
Code Window 10
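A sketch of naming rows and columns, both at creation and afterwards:
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))  # names at creation
dimnames(m) <- list(c("row1", "row2"), c("col1", "col2"))   # or set them later
m["row1", "col2"]   # elements can then be indexed by name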
Arrays
An array in R is a data structure that can store data in more than one dimension; hence, in R, arrays are an extension of matrices. While a matrix is constrained to two dimensions, with rows and columns, an array can take three or more dimensions. Arrays are useful for organizing and manipulating data having more than two axes, such as 3D spatial data or multi-dimensional experimental results. An array can be created using the array() function with arguments for data, dimensions and dimension names, as shown in code window 11.
Code Window 11
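A sketch of array creation with the arguments named in the text:
arr <- array(1:24, dim = c(2, 3, 4),
             dimnames = list(c("r1", "r2"), c("c1", "c2", "c3"), NULL))
dim(arr)   # 2 rows, 3 columns, 4 layers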
Code Window 12
Code Window 13
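Code windows 12 and 13 are images; they presumably illustrated accessing array elements, sketched here:
arr <- array(1:24, dim = c(2, 3, 4))
arr[1, 2, 3]   # row 1, column 2, layer 3
arr[, , 2]     # the entire second layer (a 2 x 3 matrix)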
4.5 Lists
In R, a list is an amazingly flexible data structure: it can store any kind of data together (numbers, characters, vectors, matrices, and even other lists). This flexibility makes lists different from vectors or matrices, which require elements to be of the same class. A list is useful for organizing complex data where different types may coexist. In R, lists are used frequently, not only for storing results from statistical models but also in general for organizing heterogeneous data.
You create a list using the list() function, and elements of the list are accessed using double square brackets [[ ]]. So, for instance, list(42, "Hello", c(1, 2, 3)) generates a list that holds a number, a string, and a vector.
Code Window 14
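A sketch of list creation and access, as described above:
my_list <- list(42, "Hello", c(1, 2, 3))
my_list[[2]]   # access the second element: "Hello"
named <- list(id = 1, name = "Asha", scores = c(90, 85))
named$scores   # named elements can also be accessed with $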
Code Window 15
We can find the size of a list using length(); we can also add or delete elements.
Code Window 16
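A sketch of measuring, growing and shrinking a list:
my_list <- list(a = 1, b = "two")
length(my_list)        # number of elements: 2
my_list$c <- c(3, 4)   # add a new element
my_list$a <- NULL      # delete an element by assigning NULL
length(my_list)        # 2 again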
4.6 Factors
Factors are another type of R object, created from a vector; a factor stores the vector along with a record of the distinct values in that vector, called levels. Factors are mainly used for nominal or categorical data.
Code Window 17
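A sketch matching the description below (8 values, 3 distinct levels; values illustrative):
x <- c(5, 12, 13, 12, 5, 5, 13, 12)   # 8 values
fac <- factor(x)
print(fac)
levels(fac)    # the 3 distinct levels: "5" "12" "13"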
As shown in code window 17, the factor fac has 8 values but only 3 distinct levels. Levels are very useful, as shown in code window 18:
Code Window 18
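A sketch of the three cases described below (values illustrative):
fac <- factor(c(5, 12, 13, 12), levels = c(5, 12, 13))
fac[2] <- 5     # case 1: 5 is an existing level, assignment succeeds
fac[2] <- 15    # case 2: 15 is not a level, NA is assigned with a warning
fac2 <- factor(c(5, 12), levels = c(5, 12, 88))  # case 3: anticipate level 88
fac2[1] <- 88   # works because 88 was declared as a level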
In the code window, case 1 shows that we assigned a value to index 2 of the factor, and it succeeded because the value belonged to a predefined level. In case 2, however, NA was assigned to index 2 of the factor instead of 15, because 15 was not among the factor's levels. In case 3 we anticipated a new level that was not present in the initial vector, but supplied it in the factor definition. Thus, values outside the defined levels cannot be assigned to factors.
Two commonly used functions with factors are split() and by(). As the name suggests, the split() function is used to divide an object
(such as a vector, data frame, or list) into subsets based on a certain grouping factor. It is particularly useful when you want to break down your data into smaller groups according to a factor (like a categorical variable).
Code Window 19
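A sketch of split(), matching the A/B/C grouping described below:
data <- c(10, 20, 30, 40, 50, 60)
groups <- factor(c("A", "B", "C", "A", "B", "C"))
split(data, groups)   # $A: 10 40, $B: 20 50, $C: 30 60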
As shown in the code above, the vector data is split into groups A, B and C corresponding to their factor. The by() function, on the other hand, is used to apply a function to subsets of a data object that have been grouped by a factor. It is used in scenarios where you want to perform operations such as calculating the mean, sum, or other statistical measures for each group, as shown in code window 20.
Code Window 20
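A sketch of by() computing a group-wise statistic (values illustrative):
df <- data.frame(value = c(10, 20, 30, 40, 50, 60),
                 grp = factor(c("A", "B", "C", "A", "B", "C")))
by(df$value, df$grp, mean)   # mean of value within each group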
4.7 Data Frames
A data frame organizes tabular data as a collection of columns that may be of different types, provided all columns are of the same length, and meaningful column names can be assigned for easy interpretation and management of data.
Data frame creation is shown in code window 21.
Code Window 21
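A sketch of data frame creation (names and values illustrative):
students <- data.frame(
  name  = c("Asha", "Ravi", "Meena"),
  age   = c(21, 23, 22),
  marks = c(88, 75, 92)
)
str(students)   # 3 observations of 3 variables, mixed column types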
Code Window 22
Subsets can be extracted from data frames based on row and column selection, using logical conditions, or by using the subset() function, as shown in code window 23.
Code Window 23
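A sketch of the three subsetting styles named above, continuing the students example:
students[1, ]                       # first row
students[, c("name", "marks")]      # selected columns
students[students$marks > 80, ]     # rows matching a logical condition
subset(students, marks > 80, select = c(name, marks))  # subset() function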
Code Window 24
We can use rbind() or cbind() to combine two data frames row-wise or column-wise, provided they have the same number of columns in the case of rbind() and the same number of rows in the case of cbind(). We can also use the merge() function to combine two or more data frames by matching rows based on common columns.
Code Window 25
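A sketch of rbind(), cbind() and merge() (data illustrative):
df1 <- data.frame(id = 1:2, name = c("Asha", "Ravi"))
df2 <- data.frame(id = 3:4, name = c("Meena", "Arun"))
rbind(df1, df2)                  # stack rows; same columns required
cbind(df1, grade = c("A", "B"))  # add columns side by side; same row count
marks <- data.frame(id = c(1, 2, 3), score = c(88, 75, 92))
merge(df1, marks, by = "id")     # match rows on the common column id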
4.8 Conditionals and Control Flows
Figure 4.1
There are three decision-making constructs in R programming: if, if…else, and switch.
Code Window 26
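A sketch of the three constructs (values illustrative):
x <- -3
if (x > 0) {
  print("positive")
} else if (x < 0) {
  print("negative")
} else {
  print("zero")
}
day <- "Mon"
switch(day, Mon = "Start of week", Fri = "Weekend ahead", "Midweek")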
Code Window 27
Code Window 28
Code Window 29
4.9 Loops
Like any other programming language, we have loops in R too. They
are basic constructs allowing a block of code to be executed repeatedly.
R implements several kinds of loops: for, while, and repeat. Each loop
type is suited for different tasks, depending on the kind of control flow
needed. We will discuss the code and syntax of each of these loops in
this section.
For Loop: It is used to iterate over a sequence of elements (anything iterable), such as a vector, list, or numeric sequence, using a loop control variable. The code for a for loop is given in code window 30.
Code Window 30
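A sketch of the for loop described below:
v <- c(3, 23, 4)
for (x in v) {
  print(x)   # prints each element one by one
}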
The above code iterates over a vector and prints all its elements one by one. We can write code to iterate over other data structures in the same manner.
Like the for loop, the while loop repeatedly executes a block of code as long as its condition remains TRUE, but here the loop control variable must be initialized outside the loop. A while loop that prints the sum of 5 numbers is shown below in code window 31; the iteration variable is incremented inside the loop.
Code Window 31
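A sketch of the while loop summing 5 numbers, as described above:
i <- 1
total <- 0
while (i <= 5) {
  total <- total + i   # add the current number
  i <- i + 1           # increment the control variable inside the loop
}
print(total)           # 15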
Code Window 32
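Code window 32 presumably showed the repeat loop mentioned later in this section; a sketch:
i <- 1
repeat {
  print(i)
  i <- i + 1
  if (i > 3) break   # repeat has no condition; break must stop it
}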
We can also have nested loops for complex operations where iterations are needed at various levels. For example, if you want to process every column for each row, the nested code is shown in code window 33.
Code Window 33
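A sketch of the nested loop computing the product of i and j, as explained below:
for (i in 1:3) {
  for (j in 1:3) {
    print(paste(i, "x", j, "=", i * j))   # one product per (i, j) pair
  }
}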
Here, the outer loop takes each value of i, and for every single value of i, the inner loop takes each value of j. This structure makes sure that for each pair of values taken by i and j, one calculation is performed: the product of i and j. The result of this calculation is then printed. This pattern is typical for tasks such as calculating tables, pairwise comparisons, or any combinatorial operation involving several variables.
The next and break statements can be used to control a loop: next skips the current iteration and moves to the next one, while break terminates the loop entirely, as seen in the repeat loop. The code is given in code window 34.
Code Window 34
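A sketch of next and break inside a loop:
for (i in 1:6) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 5) break        # stop the loop entirely
  print(i)                # prints 1, 3, 5
}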
Thus, in R, loops are very helpful for automating repetitive tasks: the for loop iterates over elements in a sequence, such as vectors or lists, executing a block of code for each element. The while loop continues to execute as long as a specified condition is TRUE, which makes it a good choice for tasks where the number of iterations is not known beforehand. A repeat loop runs endlessly until stopped by a break statement, which must be provided; it is ideal when the stopping condition is more complex to express.
Although loops are very general, R's vectorized operations and apply-family functions are often much faster alternatives for large datasets or simple operations, so they are generally preferred in most cases.
IN-TEXT QUESTION
5. Write an R code snippet using an if-else statement to check if
a number is even or odd.
4.10 Apply Family
The apply family in R includes functions like apply, lapply, sapply, vapply, tapply, mapply, and rapply. It is a very useful and powerful feature of R. These functions provide alternatives to loops for applying functions across various data structures like vectors, matrices, arrays, lists, factors, and data frames. They are generally more concise and can improve code readability and performance, since explicit loops can be slower than vectorized operations. In this section we will discuss these functions one by one along with code.
The apply() function operates on the margins of a matrix or array: it applies a given function along the rows or columns of a matrix or higher-dimensional array. The syntax is apply(X, MARGIN, FUN), where X is a matrix or array, MARGIN specifies the dimension (1 for rows, 2 for columns), and FUN is the function to apply.
Code Window 35
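A sketch of apply() over both margins:
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)    # MARGIN = 1: sum of each row
apply(m, 2, mean)   # MARGIN = 2: mean of each column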
Code Window 36
Code Window 37
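Code windows 36 and 37 are images; they presumably illustrated lapply() and sapply(), sketched here:
lst <- list(a = 1:3, b = 4:6)
lapply(lst, mean)   # returns a list of results
sapply(lst, mean)   # simplifies the result to a named vector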
vapply() is also like lapply() and sapply(), but it lets you specify the expected output type for better reliability.
Code Window 38
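A sketch of vapply() with an explicit output template:
lst <- list(a = 1:3, b = 4:6)
vapply(lst, mean, FUN.VALUE = numeric(1))  # each result must be one numeric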
Code Window 39
Code Window 40
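Code windows 39 and 40 are images; they presumably illustrated tapply() and mapply(), sketched here:
scores <- c(10, 20, 30, 40)
grp <- factor(c("A", "B", "A", "B"))
tapply(scores, grp, mean)               # group-wise mean: A = 20, B = 30
mapply(function(x, y) x + y, 1:3, 4:6)  # apply over several vectors in parallel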
Code Window 41
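A sketch of rapply() on a nested list, matching the explanation below:
nested <- list(a = 1:3, b = list(c = 4:5, d = 6))
rapply(nested, function(x) x^2, how = "replace")  # keep the nested structure
rapply(nested, function(x) x^2, how = "unlist")   # flatten to a vector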
In this code we have given a nested list to the rapply() function, and x^2 is applied to each element of the list. The classes argument restricts the function to elements of specific classes, and the how argument controls the returned structure ("unlist" returns a vector, "replace" preserves the nested list).
Table 4.5 shows various functions of apply family.
4.11 Summary
In this chapter we have covered some of the basic building blocks in R that serve as the foundation for manipulating data and controlling programs. Vectors are one-dimensional structures that hold elements of the same type, whereas matrices extend this concept to two dimensions, and arrays
generalize further to n dimensions. Lists, however, are containers that can
hold elements of different types, making them very versatile. Factors are
utilized to represent categorical data in a statistically efficient manner.
Data frames, which are a hybrid structure that combines the features of
lists and matrices, are ideal for organizing tabular data. To add logic to
your programs, tools like “if”, “else”, and “switch” allow decision-mak-
ing capabilities. For repetitive operations, loops like “for”, “while”, and
“repeat” are necessary; however, the apply family of functions provides
more efficient alternatives, enabling concise and functional programming.
This chapter has laid a solid foundation for dealing with data, writing
efficient code, and solving complex programming problems in R.
4.13 Self-Assessment Questions
2. Explain how a list differs from a vector and give a practical example of when you would use a list instead of a vector.
3. Describe the structure of a data frame and explain why it is particularly
useful for working with tabular data.
4. Write an R code snippet using an if-else statement to determine
whether a number is positive, negative, or zero.
5. What is the purpose of the apply family of functions, and how do
they improve code efficiency compared to traditional loops?
4.14 References
Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly
Media.
Matloff, N. (2011). The Art of R Programming. No Starch Press.
Crawley, M. J. (2012). The R Book. Wiley.
5
Descriptive Statistics
Using R
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.3 Importing Data File
5.4 Data Visualisation Using Charts
5.5 Measure of Central Tendency
5.6 Measure of Dispersion
5.7 Relationship between Variables
5.8 Summary
5.9 Answers to In-Text Questions
5.10 Self-Assessment Questions
5.11 References
5.12 Suggested Readings
5.1 Learning Objectives
After studying this lesson, you should be able to:
Create and interpret various types of charts such as histograms, bar charts, box plots, line graphs, and scatter plots for data visualization.
Describe data using measures of central tendency (mean, median, mode).
5.2 Introduction
Data analysis is an important skill in today’s data-driven world, allowing
people and organizations to extract meaningful insights from raw data.
This lesson focuses on equipping you with the essential tools and tech-
niques in R, a powerful statistical computing and visualization language,
to handle data effectively. We begin by learning how to import data
files, a fundamental step in data analysis. Whether working with CSV
files, Excel sheets, or other formats, importing data correctly ensures a
seamless workflow for subsequent analysis.
Next we move on to data visualization, an essential part of data exploration and communication. You will learn how to reveal patterns, distributions, and relationships in your data using visual representations like histograms, bar charts, box plots, line graphs, and scatter plots. Visualization not only helps in understanding complex datasets but also in communicating results to others effectively. Descriptive statistics are the basis of data analysis. You will look into measures of central tendency (mean, median, mode), which summarize the central value of a dataset, and measures of dispersion (range, variance, standard deviation, interquartile range), which describe the variability or spread of data. These measures give an overall idea of the characteristics of the data.
Lastly, we discuss the relationships between variables using concepts such
as covariance, correlation, and the coefficient of determination (R²). With
these tools, we can express and interpret the nature of the relationship
between variables. This is the foundation on which predictive modelling
and decision-making are based. By the end of this lesson, you will have gained theoretical knowledge and practical skills in the analysis and interpretation of data using R. Whether you are a beginner or enhancing your existing skill set, it provides a good foundation for working with real-world datasets.
5.3 Importing Data File
R supports a variety of file formats for data import, such as spreadsheets, text files (.csv or .txt), databases (MySQL, SQLite, and PostgreSQL), statistical software (SPSS, SAS, or STATA) and other formats (JSON, XML, HTML, etc.).
For importing data from a CSV file we use the function read.csv(); the syntax is read.csv(filepath, header, sep), where filepath specifies the location of the file, the header parameter specifies whether the first row contains column names (TRUE/FALSE), and sep provides the delimiter (like "," for CSV).
Code Window 1
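A sketch of the import call described above; the file path is hypothetical:
# "C:/data/sales.csv" is an illustrative path; point this at your own file
data <- read.csv("C:/data/sales.csv", header = TRUE, sep = ",")
head(data)   # shows the first six rows by default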
This will load the file into data, and head() will display the first six rows by default.
To import other formats, we need to load the desired package from the library; code to read Excel and JSON files is shown below in code window 2.
Code Window 2
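A sketch assuming the readxl and jsonlite packages; file names are illustrative:
library(readxl)    # install.packages("readxl") if needed
xl <- read_excel("sales.xlsx", sheet = 1)
library(jsonlite)  # install.packages("jsonlite") if needed
js <- fromJSON("sales.json")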
R can also interact with databases using packages like DBI and RMySQL, as shown in code window 3.
Code Window 3
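A sketch of a database connection with DBI/RMySQL; credentials and table are hypothetical:
library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), dbname = "shop",
                 host = "localhost", user = "user", password = "pass")
df <- dbGetQuery(con, "SELECT * FROM sales")   # run SQL, get a data frame
dbDisconnect(con)                              # always close the connection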
Once you import data into R you can start doing data analysis using built-in functions or libraries.
5.4 Data Visualisation Using Charts
The examples in this section use the mpg dataset that ships with the ggplot2 package. The first five rows of the mpg dataset are shown below in code window 4.
Code Window 4
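A minimal sketch of inspecting the dataset:
library(ggplot2)   # the mpg dataset is provided by ggplot2
head(mpg, 5)       # first five rows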
(i) Histograms: We can use a histogram to visualize the distribution of a single continuous variable by binning, i.e., dividing the values into intervals or bins. It is useful for identifying patterns such as skewness, spread, or unusual gaps. The code is given in code window 5.
Code Window 5
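A sketch of a histogram of highway mileage; bin width and colours are illustrative:
library(ggplot2)
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Highway Mileage", x = "hwy", y = "count")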
(ii) Bar Charts: A bar chart compares the values or counts of a categorical variable across its categories. The code is shown in code window 6.
Code Window 6
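A sketch of a bar chart of car counts by class; the fill colour is illustrative:
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "orange") +
  labs(title = "Count of Cars by Class", x = "class", y = "count")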
(iii) Box Plot: It summarizes the distribution of a continuous variable
by displaying the median, quartiles, and potential outliers. It is
useful when we need to compare across multiple groups. The code
is shown in code window 7.
Code Window 7
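A sketch matching the description below (hwy by class, light green fill):
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Highway Mileage by Car Class", x = "class", y = "hwy")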
This code gives a box plot of highway mileage (hwy) for different car classes (class). The geom_boxplot() function generates a box plot for each class with statistical summaries. The fill color is set to light green.
(iv) Line Graphs: They are recommended when we want to analyse
trends over a continuous variable or to observe relationships. The
code to generate line graph is shown in code window 8.
Code Window 8
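A sketch of a line graph; the choice of averaging hwy over displ is illustrative:
# average highway mileage for each engine displacement (assumed example)
avg <- aggregate(hwy ~ displ, data = mpg, FUN = mean)
ggplot(avg, aes(x = displ, y = hwy)) +
  geom_line(colour = "blue") +
  labs(title = "Average Highway Mileage vs Engine Displacement")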
(v) Scatter Plots: When we need to visualize the relationship between two continuous variables, we can use scatter plots; they are an ideal choice for identifying trends, clusters, or correlations. Code window 9 shows how to generate a scatter plot.
Code Window 9
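A sketch of a scatter plot of two continuous variables; colour is illustrative:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "darkred") +
  labs(title = "Engine Displacement vs Highway Mileage")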
IN-TEXT QUESTIONS
1. Which function in R is commonly used to load a CSV file?
2. What type of chart is used to visualize the frequency distribution
of data?
5.5 Measure of Central Tendency
(i) Mean: The arithmetic average or mean is calculated by the formula below:
x̄ = (x1 + x2 + … + xn) / n
(ii) Median: The median is the middle value of a dataset sorted in ascending order; when the number of observations is even, it is the average of the two middle values.
(iii) Mode: The mode represents the value that appears most frequently in a dataset. A dataset can be unimodal (one mode), multimodal (more than one mode), or have no mode at all if no value repeats. The example below shows all three; the corresponding code to compute the mode of a dataset is given in code window 10.
Code Window 10
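R has no built-in mode function for data, so the code window likely defined one; a sketch:
get_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])  # all values with the highest count
}
get_mode(c(1, 2, 2, 3))      # unimodal: 2
get_mode(c(1, 1, 2, 2, 3))   # multimodal: 1 2
get_mode(c(1, 2, 3))         # no value repeats: every value ties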
5.6 Measure of Dispersion
(i) Range: The range is the difference between the maximum and minimum values in a dataset.
(ii) Variance: Variance measures deviation from the mean, i.e., how far each data point is from the mean, on average. A higher variance indicates greater variability in the data. Variance is expressed in squared units, which makes it harder to interpret directly. The formula for the sample variance is given below:
s² = Σ(xi − x̄)² / (n − 1)
Code Window 11
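A sketch of the dispersion measures in R (data illustrative; var() and sd() use n − 1):
x <- c(4, 8, 15, 16, 23, 42)
var(x)            # sample variance
sd(x)             # standard deviation
max(x) - min(x)   # range
IQR(x)            # interquartile range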
5.7 Relationship between Variables
Variables can be related in different ways. A linear relationship means that the variables change together in a straight-line pattern, either directly or inversely; an inverse example is distance from a Wi-Fi router and internet speed. Nonlinearity is shown when variables have a curved or complex relationship instead of a straight-line pattern (we will study these in Lesson 6). There is no relationship between two variables when changes in one variable do not affect the other; shoe size and IQ, for example, have no relationship. In this section, we'll discuss three key concepts for measuring relationships: covariance, correlation, and the coefficient of determination (R²). Both covariance and correlation are used to measure linear dependency between a pair of random variables, also called bivariate data. You have already studied correlation and covariance in Lesson 2; we will explore these two again with R programming code.
Covariance is a statistical measure that indicates how two variables change together. It shows whether an increase in one variable is accompanied by an increase in the other, or whether the two move inversely. The formula for the sample covariance is given below:
cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / (n − 1)
Thus, covariance measures how the two variables vary together: if the value of covariance is positive, the relationship is direct and the variables increase and decrease together, while a negative covariance means that one variable increases as the other decreases. If the covariance value is approximately zero, there is apparently no linear relationship. However, covariance is hard to interpret because it is measured in units that are the product of the variables' units, so it is hard to compare or standardize across datasets. An example is shown in code window 12 below:
Code Window 12
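A sketch of computing covariance in R (data illustrative):
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)
cov(x, y)   # positive value: the variables increase together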
Correlation standardizes covariance by dividing it by the product of the standard deviations of the two variables, giving the correlation coefficient r = cov(X, Y) / (sx · sy). When r = 1 there is a perfect positive linear relationship, while r = −1 indicates a perfect negative linear relationship and r = 0 means no linear relationship. Correlation indicates not only the strength of the relationship but also its direction. The R code for computing correlation is given in code window 13:
Code Window 13
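A sketch of computing correlation in R (data illustrative):
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)
cor(x, y)   # Pearson correlation; 1 here, a perfect positive relationship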
Correlation provides valuable insights into the strength and direction of the relationship between two variables. A strong positive correlation (0.7 ≤ r ≤ 1) indicates that as one variable increases, the other variable also increases significantly, demonstrating a robust linear relationship. On the other hand, a weak or no correlation (r ≈ 0) implies that changes in one variable do not reliably predict changes in the other, which indicates there is no practically important linear relation between the variables.
Further, the coefficient of determination is used for explaining variance; it is denoted by R² and quantifies how well one variable predicts another. It is especially useful in regression analysis to evaluate goodness-of-fit. It is computed by squaring the correlation coefficient, as shown in the formula:
R² = r²
This value represents the proportion of variance in one variable (Y) explained by the other (X). For instance, an R² = 0.81 implies 81% of the variation in Y can be explained by X.
Values of R² can be used to deduce the relationship: R² = 0 indicates no predictive relationship between the given variables, i.e., the independent variable (X) does not explain any of the variation in the dependent variable (Y), while R² = 1 signifies perfect prediction, indicating that all the variation in Y is entirely explained by X.
IN-TEXT QUESTIONS
3. Which measure of central tendency is the middle value of a
sorted dataset?
4. What is the statistical term for the difference between the
maximum and minimum values in a dataset?
5. What term describes the strength and direction of the linear
relationship between two variables?
6. Which metric indicates how well one variable explains another
in regression analysis?
5.8 Summary
This lesson provides a good foundation in using data analysis with R by
considering how you can import data, information visualization, describing
data, and correlating or checking for dependence among variables. This
will help you to read your data files and to successfully import various
kinds of sources, like CSVs or XLS, into your program. The next area
of key importance is visualization, where we discussed several types of
charts that allow us to visualize our data. Histograms help to find dis-
tributions. Bar charts are useful for categorical data comparisons. Box
plots summarize data distributions by summarizing medians, quartiles,
and outliers. Line graphs capture trends over time, and scatter plots
show the relationships between two variables. These tools not only help
in analyzing the data but also make it easy to communicate findings.
Further, descriptive statistics are explained, which help describe the characteristics of the dataset. Measures of central tendency (mean, median,
and mode) help summarize the center of the data, while measures of dispersion (range, variance, standard deviation, and IQR) help explain the variation or spread in the data. Together, these measures allow for a full description of the dataset.
Lastly, we covered relationships between variables. You have learnt to
analyze how variables are connected through covariance and correlation,
which describe how two variables change together and the strength of the
relationship. The coefficient of determination, or R², quantifies exactly
the degree to which one variable predicts another, giving a more nuanced
understanding of how they interact. Mastering these concepts and tools
will enable you to import, visually present, describe, and analyze data as
needed in preparation for more advanced data analysis and decision-making.
5.11 References
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using
R. SAGE Publications.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An
introduction to statistical learning: With applications in R. Springer.
Kabacoff, R. I. (2015). R in action: Data analysis and graphics
with R (2nd ed.). Manning Publications.
6
Predictive and Textual
Analytics
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.3 Simple Linear Regression Models
6.4 Confidence and Prediction Intervals
6.5 Multiple Linear Regression
6.6 Interpretation of Regression Coefficients
6.7 Heteroscedasticity and Multi-Collinearity
6.8 Basics of Textual Data Analysis
6.9 Methods and Techniques of Textual Analysis
6.10 Summary
6.11 Answers to In-Text Questions
6.12 Self-Assessment Questions
6.13 References
6.14 Suggested Readings
6.2 Introduction
In this lesson we will focus on two important kinds of analytics for business, predictive and textual, each with its own importance and techniques. Predictive analytics, often referred to as advanced analytics, is very closely linked with business intelligence, giving organizations actionable insights for better decision-making and planning. For instance, if an organization wants to know how much profit it will make a few years later based on current trends of sales, customer demographics, and regional performance, it can benefit from predictive analytics. It uses techniques such
as data mining and artificial intelligence to predict outcomes like future
profit or other factors that may be critical to the success of the organi-
zation. At its core, predictive analytics is a process that makes informed
predictions about future events based on historical data. Analysts and data
scientists apply statistical models, regression techniques, and machine
learning algorithms to identify trends in historical data, so businesses
can predict risks and trends and prepare for future events.
Predictive analytics is changing industries because it can facilitate da-
ta-driven decision-making and efficiency in operations. In marketing, it
aids businesses in understanding customer behavior, lead segmentation, and targeting of high-value prospects. Retailers apply predictive analytics
in order to personalize shopping experiences, optimize pricing strategies,
and merchandise plans. Manufacturing uses predictive analytics to en-
hance the monitoring of machine performance, avoid equipment failures,
and smooth logistics. Fraud detection, credit scoring, churn analysis,
and risk assessment are some of the benefits for the financial sector. Its
applications in healthcare include personalized care, resource allocation,
and identification of high-risk patients for timely interventions. As tools improve, predictive analytics continues to revolutionize industries with smarter insights and better outcomes.
Textual analysis is the systematic examination and interpretation of textual data in order to draw meaningful insights, patterns, and trends. It involves different techniques and methods for processing unstructured text, including documents of various forms, social media posts, and customer reviews. The intention of textual analysis is to transform raw text into something accessible and useful as input for decision-making
processes or for research. This can mean identifying hot words, categorizing text by topic, or analysing the sentiment associated with a particular piece of text. Textual analysis is the backbone of a number of fields, such as marketing, social science, and business intelligence, to name only a few, in garnering valuable insights from huge streams of unstructured text.
6.3 Simple Linear Regression Models
Simple linear regression models the relationship between a single independent variable x and a dependent variable y as y = β0 + β1x + ε, where β0 and β1 are model coefficients (unknown constants representing the intercept and slope) and ε is the error term or random noise.
We can build a simple linear regression model in R using the following steps:
The very first step is to prepare the data: we need to ensure that the dataset is clean and contains no missing values for the variables involved. We load the data into R using functions like read.csv() or read.table().
Once the data is loaded, we visualize it using a scatter plot.
After that we can fit the regression model using the lm() function and then use the summary() function to understand the details of the model.
We can also make predictions using the predict() function.
We have already studied the reading techniques in previous lessons. We
will focus on linear regression functions in this section.
The lm() function in R is a built-in function for fitting linear models, both simple and multiple linear regression. It estimates the relationship between the response and the predictors by least squares; the basic usage is lm(formula, data).
Code Window 1
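A sketch of the workflow described below (the hp value passed to predict() is illustrative):
# Simple linear regression of mpg on hp using the built-in mtcars data
plot(mtcars$hp, mtcars$mpg, main = "Horsepower vs Mileage")   # visualize first
model <- lm(mpg ~ hp, data = mtcars)                          # fit the model
summary(model)                                                # model details
abline(model, col = "blue")                                   # regression line
predict(model, newdata = data.frame(hp = 150))                # predicted mpg
plot(residuals(model), main = "Residuals")                    # check for patterns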
In code window 1 we have shown the code for linear regression on
built-in dataset “mtcars” available in R. The goal here is to predict how
changes in horsepower ‘hp’ influence miles per gallon ‘mpg’. So, in this example we will examine information regarding “mpg” and “hp”. Since the data is already prepared and formatted, we visualize it using a scatter plot. This will help us analyse whether there is a pattern or trend in the data that is worth exploring further.
Next, we try to fit a simple linear regression model by using the formula
“lm(mpg ~ hp, data = mtcars)”, this predicts mpg based on hp. After
fitting the model, we enhance our scatter plot by adding a regression line
using the abline() function to show the model’s predicted relationship. To
make the model useful, we use the predict() function to estimate mpg for
a specific horsepower value. This prediction tells us what fuel efficiency
we might expect for such a car. Finally, we examine the model's accuracy by plotting its residuals, the differences between the observed and predicted values. This step helps us check whether there are any patterns in the errors, which could indicate that the model is missing some critical information.
6.4 Confidence and Prediction Intervals
A confidence interval (CI) estimates the range within which the mean of the dependent variable is likely to lie for a given value of the predictor. The R code is shown in code window 2.
Code Window 2
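A sketch of the confidence interval computation described below:
model <- lm(mpg ~ wt, data = mtcars)
predict(model, newdata = data.frame(wt = 3.0),
        interval = "confidence", level = 0.95)  # fit, lwr, upr for the mean mpg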
In the code above, a simple linear regression model is fitted to the mtcars
dataset using lm(mpg ~ wt), where mpg is the dependent variable, and
wt is the independent variable that is the weight of the car. The model
predicts how car weight influences fuel efficiency. Then we have used
the predict function to compute a confidence interval for the mean re-
sponse (mpg) for the car with weight = 3.0. To do this, specify interval
= “confidence” and level = 0.95. The function will compute the range
in which the average mileage for cars weighing 3.0 will fall, with 95%
confidence. This computation gives the lower and upper bound of the
mean mpg, thus making it possible to evaluate the precision and reliability
of the estimated mean value.
A prediction interval (PI) predicts the interval where the actual value
of the dependent variable (y) is likely to lie for a given x. Unlike CIs, prediction intervals incorporate the residual error variability (σ²). Thus, a prediction interval gives a range for an individual data point rather than the
mean response. For example, let’s consider the same regression model
as above, a prediction interval for a car with weight 3.0 might indicate
that its mileage lies between 16.0 and 22.0 mpg, with 95% confidence.
R code for same is given in code window 3.
Code Window 3
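A sketch of the prediction interval for the same model:
predict(model, newdata = data.frame(wt = 3.0),
        interval = "prediction", level = 0.95)  # wider: covers an individual car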
Hence, confidence intervals are narrower in range than prediction intervals
since they only consider the uncertainty of the estimate of the mean of
response variable. Prediction intervals are larger because they encompass
the variability of individual observations about the regression line as well
as the uncertainty in the mean.
The visualization code and output of confidence and prediction interval
is shown in code window 4.
Code Window 4
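A sketch of the visualization described below; the polygon-shading approach and transparency values are assumptions, but the colours match the text:
model <- lm(mpg ~ wt, data = mtcars)
wt_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 100)
ci <- predict(model, data.frame(wt = wt_seq), interval = "confidence")
pi <- predict(model, data.frame(wt = wt_seq), interval = "prediction")
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Mileage")
polygon(c(wt_seq, rev(wt_seq)), c(pi[, "lwr"], rev(pi[, "upr"])),
        col = rgb(1, 0, 0, 0.2), border = NA)   # red prediction band
polygon(c(wt_seq, rev(wt_seq)), c(ci[, "lwr"], rev(ci[, "upr"])),
        col = rgb(0, 1, 0, 0.3), border = NA)   # green confidence band
abline(model, col = "blue", lwd = 2)            # fitted regression line
legend("topright", legend = c("Regression line", "Confidence", "Prediction"),
       col = c("blue", "green", "red"), lty = 1)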
In this code we have visualized car weight against mileage for all the
cars in the mtcars dataset. The plot function is used to make a scatter
plot of weight vs. mileage; the abline function draws the fitted regression line in blue with a line width of 2, representing the linear relationship between the two variables. Two shaded areas are added to the plot to
represent the confidence intervals and prediction intervals. The confidence
interval is represented by a green shaded area around the regression line
that indicates the range within which the mean predicted values lie for a
given car weight. Similarly, the prediction interval is represented as a red
shaded area, which is the range in which individual predictions for new
data points are likely to fall. Finally, a legend is added to the top-right
corner of the plot to distinguish the regression line, confidence interval,
and prediction interval. The legend uses different colors and labels to
make the plot easy to interpret, with blue for the regression line, green
for the confidence interval, and red for the prediction interval.
Thus, both intervals serve analysts, decision-makers, and researchers in quantifying uncertainty, improving predictions, and drawing appropriate conclusions from regression models.
6.5 Multiple Linear Regression
Multiple linear regression (MLR) extends the simple model to several predictors:
y = β0 + β1x1 + β2x2 + … + βnxn + ε
where y is the target variable and β0 is the intercept term that represents the value of y when all independent variables are 0. The multiple independent variables are represented by x1, x2, …, xn with coefficients β1, β2, …, βn, and ε is the error term.
For an MLR model to give valid results we need to fulfil a few assump-
tions that are stated below:
The relationship between the independent and dependent variables
must be linear.
Code Window 5
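A sketch of an MLR fit with the predictors discussed below (weight, horsepower, cylinders):
model_mlr <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model_mlr)   # coefficients, R-squared, F-statistic, p-values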
Once the model has been fitted, the interpretation of the coefficients is
of great importance. In each case, the coefficient reflects the change in
the dependent variable for a one-unit change in the respective indepen-
dent variable, with all other variables held constant. For example, if the
coefficient for car weight is -0.1, then an increase in car weight of one
unit would reduce mpg by 0.1, assuming horsepower and cylinders are
held constant.
To assess the performance of an MLR model, various statistics like R-squared, the F-statistic, and p-values can be used. Multiple linear regression is a powerful technique for modelling and understanding the relationship between a dependent variable and several independent variables.
6.6 Interpretation of Regression Coefficients
The key fact is that each coefficient reflects the effect of the corresponding variable, but only after controlling for the influence of the other variables in the model. Here, β1 represents the change in mileage with respect to a change in car weight, holding the other variables (horsepower and cylinders) constant. Similarly, β2 shows the change in mileage for each additional unit of horsepower, again holding weight and cylinders constant.
6.7 Heteroscedasticity and Multi-Collinearity
Heteroscedasticity arises when the variance of the error terms is not constant, violating the assumptions of linear regression. All error variances should be approximately constant at all independent variable levels.
Heteroscedasticity affects the standard errors of the coefficients, biasing
test statistics. The model may be statistically significant when it shouldn’t
be, or vice versa. Residual plots help analysts detect heteroscedasticity.
If the residuals plot as a random cloud around zero, everything is fine.
If they fan out or form a pattern as the independent variable increases,
heteroscedasticity may be present. A potential method to address hetero-
scedasticity is to take a log transformation of the dependent variable, or
use weighted least squares regression in which more weight is placed
on observations with less variability. To detect heteroscedasticity in your
regression model, you can use residual plots, the code is shown in code
window 6.
Code Window 6
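A sketch of the residual plot check described above:
model <- lm(mpg ~ wt, data = mtcars)
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")   # a fan shape suggests heteroscedasticity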
If you observe heteroscedasticity in the residual plot, one way to address it is by applying a log transformation to the dependent variable. The code is shown in code window 7.
Code Window 7
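A sketch of the log-transformation remedy:
model_log <- lm(log(mpg) ~ wt, data = mtcars)  # log-transformed response
plot(fitted(model_log), residuals(model_log))
abline(h = 0, col = "red")   # the spread should now be more even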
Multicollinearity occurs when two or more independent variables in a
regression model are highly correlated with each other. In simple terms,
the independent variables are stepping on each other's toes and giving redundant information, thereby making it impossible to separate the individual effect of each variable on the dependent variable. For example, suppose we attempt to forecast a person's income given his or her level of
education and years of work experience. Since there is a strong tendency
for people with higher levels of education to have more work experience,
the two variables can be highly correlated. This kind of multicollinearity
renders it difficult to distinguish the contribution of each variable inde-
pendently to the prediction of income. The regression coefficients can
become unstable and generate widely fluctuating estimates with very
minor changes in the data, thus resulting in an unreliable model.
Analysts commonly use the Variance Inflation Factor (VIF) as a measure
to detect multicollinearity for each independent variable. A high VIF,
usually above 10, means that the variable is highly correlated with at
least one of the other predictors in the model. In case of multicollinearity,
the easy solution is to delete one of the correlated variables or combine them into one more meaningful variable, for instance by calculating an index or an average. The code is shown in code window 8.
Code Window 8
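A sketch of a VIF check; the car package is an assumption, as the text names only the measure:
library(car)   # install.packages("car") if needed; provides vif()
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(model)     # values above 10 signal multicollinearity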
Both heteroscedasticity and multicollinearity can reduce the effectiveness
of a regression model. Heteroscedasticity interferes with the estimation
of the standard errors of the coefficients, which may lead to misleading
significance tests, while multicollinearity prevents the assessment of the
effects of each predictor. Knowledge of these issues and their detection
and resolution is crucial in developing more reliable and interpretable
regression models.
IN-TEXT QUESTIONS
1. What type of regression model uses one independent variable?
2. What interval is used to estimate the range within which the
true value of a parameter lies?
3. What term describes the condition when the variance of errors
is not constant?
4. What is the term for the relationship between predictors and
the outcome in regression?
5. What type of regression issue arises from high correlations
between predictors?
6.9.1 Text Mining
Text mining is the process of extracting useful information from unstructured text data; a typical workflow cleans the raw text and builds a document-term matrix for analysis.
Code Window 9
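The original code window is an image; a minimal sketch of a text-mining workflow, assuming the tm package (documents illustrative):
library(tm)   # install.packages("tm") if needed
docs <- c("Great product, love it", "Terrible service, very slow")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))   # normalize case
corpus <- tm_map(corpus, removePunctuation)              # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)   # rows: documents, columns: terms
inspect(dtm)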
6.9.2 Categorization
It refers to the process of assigning text into predefined categories or
labels based on the content. This method is applied in many applications,
including email filtering (spam vs. non-spam), document classification
(business, sports, tech), and sentiment analysis (positive, negative, neutral).
The idea is to map a given text document to one or more categories that
best represent the content. Techniques of categorization involve super-
vised learning models including Naive Bayes, Support Vector Machines
(SVM), and Logistic Regression. Such models require labeled training data to learn how to classify new, unseen data; once trained, the model can predict the category of a new document based on the patterns learned from the training set. In R, one can do text categorization by creating
a Document-Term Matrix (DTM) and using classification models such as
Naive Bayes. The sample code is shown in code window 10.
Code Window 10
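A sketch of the DTM-plus-Naive-Bayes approach described above, assuming the tm and e1071 packages (the toy spam/ham data is illustrative):
library(tm); library(e1071)
texts <- c("win cash prize now", "meeting at noon",
           "free prize offer", "lunch tomorrow")
labels <- factor(c("spam", "ham", "spam", "ham"))
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))
train <- as.data.frame(as.matrix(dtm))   # term counts as features
model <- naiveBayes(train, labels)       # fit the classifier
predict(model, train)                    # predictions on the training data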
6.9.3 Sentiment Analysis
Sentiment analysis determines whether a piece of text expresses a positive, negative, or neutral opinion. Sample code is shown in code window 11.
Code Window 11
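A minimal sketch of sentiment scoring, assuming the syuzhet package (reviews illustrative):
library(syuzhet)   # install.packages("syuzhet") if needed
reviews <- c("I love this phone", "Worst purchase ever", "It is okay")
get_sentiment(reviews)   # positive, negative and near-zero scores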
IN-TEXT QUESTIONS
6. What technique is used to classify text into predefined categories?
7. What statistical tool helps identify sentiment in text?
8. What analysis method breaks down text into meaningful elements
for extraction?
6.10 Summary
This lesson introduces essential concepts in regression analysis, starting
from simple and multiple linear regression models. It shows the role of
confidence and prediction intervals in statistical prediction and teaches
how to interpret regression coefficients. It also addresses two common
challenges: heteroscedasticity and multicollinearity. The second half of
the lesson covers textual data analysis techniques such as text mining, categorization, and sentiment analysis, providing an overview of how these techniques can be applied to extract insights from unstructured
text data. The practical use of R applications ensures students are well-
equipped to not only carry out statistical analyses but also to perform
text analysis.
6.13 References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). Introduction
to statistical learning with applications in R. Springer.
Fox, J. (2016). Applied regression analysis and generalized linear
models (3rd ed.). Sage Publications.
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy
approach. O’Reilly Media.
Glossary
Data Splitting: Division of data into subsets, usually training, validation, and test sets. These sets help a model builder build models with the data, tune their hyperparameters, and finally estimate their performance.
Data Transformation: The process that refers to the conversion of data
into a format or structure fit for analysis.
Data: It is a collection of raw facts and figures collected for reference or analysis.
Dispersion: The spread of data that can be measured by range, variance,
standard deviation, or IQR.
Factor: Data structure to store categorical data.
Function: A reusable block of code that performs a specific task, such
as sum() or user-defined functions.
List: Flexible data structure allowing elements of different types.
Matrix: A two-dimensional data structure with all elements of the same
type ordered in rows and columns.
Multiple Linear Regression: Extended simple linear regression that uses
multiple independent variables to predict a dependent variable.
Operator: A symbol or set of symbols used to perform operations like
addition (+) or comparisons (==).
Package: A collection of R functions, data, and documentation that extend
R’s functionality, like dplyr or ggplot2.
Qualitative Data: Descriptive data that defines characteristics like nom-
inal and ordinal data.
Quantitative Data: Data that can be measured like continuous or dis-
crete data.
Regression Coefficients: Values that represent the relationship between
the predictor variables and the response variable in a regression model.
Simple Linear Regression: A statistical method that models the relationship between one dependent variable and one independent variable.
Text Mining: The process of extracting useful information from unstruc-
tured text data.
Vector: An ordered sequence of elements of the same type, such as
numeric, character.