MBA
II Year III Sem
(Osmania University)
Latest 2021-22 Edition
BUSINESS
ANALYTICS
Study Manual
Short Questions and Answers
Multiple Choice Questions
Fill in the blanks
Solved Model Papers
Solved Previous Question Paper
Price : Rs. 199-00
- by -
Well Experienced Lecturer
Rahul Publications
Hyderabad. Ph : 66550071, 9391018098
BUSINESS
ANALYTICS
In spite of the many efforts taken to present this book without errors, some errors might have crept in. Therefore, we do not take any legal responsibility for such errors and omissions. However, if they are brought to our notice, they will be corrected in the next edition.
Price : Rs. 199-00
Unit - II : 39 - 92
Unit - III : 93 - 152

Topic Page No.
UNIT - II
2.3.1 Tables 51
UNIT - III
3.1 Trend Lines 93
3.2 Regression Analysis 95
3.2.1 Linear Regression 100
3.2.2 Multiple Regression 108
3.3 Forecasting Techniques 115
3.4 Data Mining 120
3.4.1 Definition of Data Mining 120
3.5 Approaches in Data Mining 125
3.6 Data Exploration and Reduction 128
3.7 Data Reduction 136
3.8 Data Classification 139
3.9 Data Association 141
3.10 Cause Effect Modeling 142
Short Answer Questions 145
Choose the Correct Answer 150
Fill in the blanks 152
UNIT - IV
4.1 Overview of Linear Optimization 153
4.2 Non-linear Programming and Integer Optimization 165
4.3 Cutting Plane Algorithm and Methods 169
4.4 Decision Analysis 173
4.4.1 Decision Making under Risk and Uncertainty 174
Short Answer Questions 181
Choose the Correct Answer 184
Fill in the blanks 186
UNIT - V
5.1 Programming Using R 187
5.2 R Package 195
5.3 Reading and Writing Data in R 198
5.4 R Functions 200
5.5 Control Statements 205
5.6 Frames and Subsets 216
5.7 Managing and Manipulating Data in R 225
Short Answer Questions 227
Choose the Correct Answer 232
Fill in the blanks 234
UNIT - I

BUSINESS ANALYTICS

Q1. Define Business Analytics.
Ans :
Business analytics is the practice of iterative, methodical exploration of an
organization’s data, with an emphasis on statistical analysis. Business analytics is
used by companies committed to data-driven decision-making.
Definition of business analytics
According to Schaer (2018) - “allows your business to make predictive analysis
rather than reacting to changes in data”.
According to Gabelli School of Business (2018)- “involves applying models,
methods, and tools to data, producing insights that lead to informed business decisions”
According to Wells (2008) - “the application of logic and mental processes to
find meaning in data”
According to Lynda (2018) - “allows us to learn from the past and make
better predictions for the future”.
Business analytics (BA) refers to the skills, technologies, practices for continuous
iterative exploration and investigation of past business performance to gain insight
and drive business planning. Business analytics focuses on developing new insights
and understanding of business performance based on data and statistical methods.
In contrast, business intelligence traditionally focuses on using a consistent set of
metrics to both measure past performance and guide business planning, which is also
based on data and statistical methods. Business analytics makes extensive use of
Self-service has become a major trend among business analytics tools. Users
now demand software that is easy to use and doesn’t require specialized training.
This has led to the rise of simple-to-use tools from companies such as Tableau
and Qlik, among others. These tools can be installed on a single computer for
small applications or in server environments for enterprise-wide deployments.
Once they are up and running, business analysts and others with less specialized
training can use them to generate reports, charts and web portals that track specific metrics in data sets.
Q3. Explain the types of Business Analytics.
Ans :
There are four types of business analytics:
(i) Prescriptive: This type of analysis reveals what actions should be taken.
This is the most valuable kind of analysis and usually results in rules and
recommendations for next steps.
(ii) Predictive: An analysis of likely scenarios of what might happen. The
deliverables are usually a predictive forecast.
(iii) Diagnostic: A look at past performance to determine what happened and
why. The result of the analysis is often an analytic dashboard.
(iv) Descriptive: What is happening now based on incoming data. To mine
the analytics, you typically use a real-time dashboard and/or email reports.
1. Prescriptive analytics : It is really valuable, but largely not used. Where big
data analytics in general sheds light on a subject, prescriptive analytics gives
you a laser-like focus to answer specific questions. For example, in the health
care industry, you can better manage the patient population by using prescriptive
analytics to measure the number of patients who are clinically obese, then add
filters for factors like diabetes and LDL cholesterol levels to determine where to
focus treatment. The same prescriptive model can be applied to almost any
industry target group or problem.
2. Predictive analytics : It uses big data to identify past patterns to predict the future. For example, some companies are using predictive analytics for sales lead scoring. Some companies have gone one step further and use predictive analytics for the entire sales process, analyzing lead source, number of communications, types of communications, social media, documents, CRM data, etc. Properly tuned predictive analytics can be used to support sales, marketing, or other types of complex forecasts.
3. Diagnostic analytics : They are used for discovery, or to determine why something happened. For example, for a social media marketing campaign, you can use diagnostic analytics to assess the number of posts, mentions, followers,
fans, page views, reviews, pins, etc. There can be thousands of online mentions
that can be distilled into a single view to see what worked in your past campaigns
and what didn’t.
4. Descriptive analytics : Descriptive analysis, or data mining, is at the bottom of the big data value chain, but it can be valuable for uncovering patterns that offer insight. A simple example of descriptive analytics would be assessing credit risk: using past financial performance to predict a customer's likely financial performance. Descriptive analytics can be useful in the sales cycle, for example, to categorize customers by their likely product preferences and sales cycle.
Q4. Explain the different models in Business Analytics?
Ans :
An analytical model is simply a mathematical equation that describes relationships among variables in a historical data set. The equation either estimates or classifies data values. In essence, a model draws a "line" through a set of data points that can be used to predict outcomes.

What is a business analysis model?
Simply put, a business analysis model outlines the steps a business takes to complete a specific process, such as ordering a product or onboarding a new hire. Process modeling (or mapping) is key to improving process efficiency, training, and even complying with industry regulations.
Because there are many different kinds of processes, organizations, and functions
within a business, BAs employ a variety of visual models to map and analyze data.
The following are the different models:
1. Activity diagrams
Activity diagrams are a type of UML behavioural diagram that describes what
needs to happen in a system. They are particularly useful for communicating
process and procedure to stakeholders from both the business and development
teams.
A business analyst might use an activity diagram to map the process of logging in to a website or completing a transaction like withdrawing or depositing money.
[Activity diagram figure: card validation (valid/invalid), a balance < amount check, and a Show balance step]
2. Feature mind maps

[Mind map figure: a central concept branching into details and examples]
3. Product roadmaps
Product (or feature) roadmaps outline the development and launches of a product
and its features. They are a focused analysis of a product’s evolution, which
helps developers and other stakeholders focus on initiatives that add direct value
to the user.
The beauty of product roadmaps lies in their flexibility and range of applications.
BAs can create different product roadmaps to illustrate different information,
including:
Feature releases
A defined product outline and schedule helps sales stay on the same page as the
developers so they can deliver accurate, updated information to their prospects and
clients. Because of their versatility and broad applications across teams and
organizations, product roadmaps are a core part of an analyst’s toolbox.
4. Organizational charts
[Figure: sample organizational chart for an insurance company - Shareholders, Inspection, Board and CEO at the top, with functions such as Finance, Accounting, Collections, HR, IT and Payroll, and business lines such as International, Corporate, Property, Health, Personal Lines, Reinsurance, Renewals and Claims]
5. SWOT analysis
The SWOT analysis is a fundamental tool in business analysis. SWOT stands for strengths, weaknesses, opportunities, and threats. A SWOT analysis evaluates a business's strengths and weaknesses and identifies any opportunities or threats to that business.
SWOT analysis helps stakeholders make strategic decisions regarding their
business. The goal is to capitalize on strengths and opportunities while reducing
the impact of internal or external threats and weaknesses.
From a visual modeling perspective, SWOT analysis is fairly straightforward. A typical model will have four boxes or quadrants, one for each category, with bulleted lists outlining the respective results.
[Figure: SWOT Analysis quadrant]
7. Process flow diagrams

These diagrams focus on broad, high-level systems rather than annotating minor process details.
8. PESTLE analysis
A PESTLE analysis often goes hand-in-hand with a SWOT analysis. PESTLE
evaluates external factors that could impact business performance. This acronym
stands for six elements affecting business: political, economic, sociological, technological, legal, and environmental.
PESTLE analysis assesses the possible factors within each category, as well as
their potential impact, duration of effect, type of impact (i.e., negative or positive),
and level of importance.
This type of business analysis helps stakeholders manage risk, strategically plan
and review business goals and performance, and potentially gain an advantage
over competitors.
9. Entity-relationship diagram
An entity-relationship diagram (ER diagram) illustrates how entities (e.g., people,
objects, or concepts) relate to one another in a system. For example, a logical
ER diagram visually shows how the terms in an organization’s business glossary
relate to one another.
ER diagrams comprise three main parts:
Entities
Relationships
Attributes
Attributes apply to the entities, describing further details about the concept.
Relationships are where the key insights from ER diagrams arise. In a visual model,
the relationships between entities are illustrated either numerically or via crow’s foot
notation.
These diagrams are most commonly used to model database structures in software engineering and business information systems, and are particularly valuable tools for business analysts in those fields.
End User Involvement and Buy-In : End users should be involved in adopting Business Analytics and have a stake in the predictive model.

Explainability vs. the "Perfect Lift" : Balance building precise statistical models with being able to explain the model and how it will produce results.
Ans :
Adopting and implementing Business Analytics is not something a company can do overnight. But if a company follows some best practices for Business Analytics, it will get the levels of insight it seeks and become more competitive and successful. We list some of the most important best practices for Business Analytics here, though your organization will need to determine which best practices are most fitting for their needs.
Know the objective for using Business Analytics. Define the business use case
and the goal ahead of time.
Define the criteria for success and failure.
Select the methodology and be sure to know the data and relevant internal and
external factors
Validate models using the predefined success and failure criteria.
Business Analytics is critical for remaining competitive and achieving success. When organizations get BA best practices in place and get buy-in from all stakeholders, they will benefit from data-driven decision making.
PolyBase. ...
Presto
Some of the most common of those big data challenges include the following:
Organizational resistance
Ans :
The evolution of big data is discussed below:

(i) 1970s and before : The data generation and storage of the 1970s and before was fundamentally primitive and structured. This is termed the era of mainframes, as it stored only basic data.

(ii) 1980s and 1990s : In the 1980s and 1990s, the evolution of relational databases took place. Relational data utilization is complex, and thus this era comprises data-intensive applications.

(iii) 2000s and beyond : The World Wide Web (WWW) and the Internet of Things (IoT) have brought an aggregation of structured, unstructured and multimedia data. The data generated is complex and unstructured.
2. Social Media Data: Social networking sites such as Facebook and Twitter contain information and views posted by millions of people across the globe.

4. Power Grid Data: The power grid data mainly holds the information consumed by a particular node in terms of base station.

5. Transport Data: It includes data from various transport sectors, such as the model, capacity, distance and availability of a vehicle.

6. Search Engine Data: Search engines retrieve a large amount of data from different sources of databases.
Ans :
The importance of big data lies in how you utilize the data you own. Data can be fetched from any source and analyzed to enable:
1. Cost reductions,
2. Time reductions,
3. New product development and optimized offerings, and
4. Smart decision making.
i) Fraud and Big Data

Fraud is intentional deception made for personal gain or to damage another individual. One of the most common forms of fraudulent activity is credit card fraud. Social media and mobile phones are forming new frontiers for fraud. The Capgemini financial services team believes that, due to the nature of the data streams and the processing required, big data technologies provide an optimal technology solution based on the following three Vs:

1. High volume: Years of consumer records and transactions (150 billion+ records per year).
2. High velocity: Dynamic transactions and social media information.
3. High variety: Social media plus other unstructured data, such as customer e-mails and call center conversations, as well as transactional structured data.
III) Big Data and Healthcare

Big data promises an enormous revolution in healthcare, with advancements in everything from the management of chronic disease to the delivery of personalized medicine. In addition to saving and improving lives, big data has the potential to transform the entire healthcare system by replacing guesswork and intuition with objective, data-driven science.

The healthcare industry now has huge amounts of data: from biological data such as gene expression, single-nucleotide polymorphisms (SNPs), proteomics, metabolomics, and next-generation gene sequence data. The exponential growth in data is further accelerated by the digitization of patient-level data stored in Electronic Medical Records (EMRs) or Electronic Health Records (EHRs) and Health Information Exchanges (HIEs), enhanced with data from imaging and test results, medical and prescription claims, and personal health devices.
IV) Advertising and Big Data

Big data is changing the way advertisers address three related needs: (i) how much to spend on advertisements, (ii) how to allocate budget across all the marketing communication touch points, and (iii) how to optimize advertising effectiveness. Given these needs, advertisers need to measure their advertising end to end in terms of reach, resonance, and reaction.

Reach: The first part of reach is to identify the people who are most volumetrically responsive to the advertising, and then answer questions such as: What do those people watch? What do they do online? How do we develop a media plan against the intended audience? The second part of reach is delivering advertisements to the right audience, that is, understanding whether we are actually reaching our desired audience. In the online world, we can deliver 100 million impressions but never really know for sure who our campaign was actually delivered to. If our intended audience is women aged 18 to 35, of our 100 million impressions, what percentage was actually delivered to the intended audience? What was the reach, what was the frequency, what was the delivery against the intended audience?

Resonance: If we know whom we want to reach and we are reaching them efficiently with our media spend, the next question is: are our ads breaking through? Do people know they are from our brand? Are they changing attitudes? Are they making consumers more likely to want to buy our brand? This is what is called resonance.

Reaction: Advertising must drive a behavioral reaction or it is not really working. We have to measure the actual behavioral impact.
Q14. Explain the life cycle of big data.
Ans :
Big Data Life Cycle
In today's big data context, the previous approaches are either incomplete or suboptimal. For example, the SEMMA methodology completely disregards data collection and the pre-processing of different data sources. These stages normally constitute most of the work in a successful big data project.
The stages of the big data life cycle are:
1. Business Problem Definition
2. Research
3. Human Resources Assessment
4. Data Acquisition
5. Data Munging
6. Data Storage
7. Exploratory Data Analysis
8. Data Preparation for Modeling and Assessment
9. Modeling
10. Implementation
UNIT - I BUSINESS ANALYTICS (OU)
In order to combine both data sources, a decision has to be made to make the two response representations equivalent. This can involve converting the first data source's response representation to the second form, for example, considering one star as negative and five stars as positive. This process often requires a large time allocation to be delivered with good quality.
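As a rough illustration, here is a minimal R sketch of this kind of harmonization, assuming one hypothetical source rates products with 1-5 stars and another records thumbs up/down; the cut-offs (1-2 stars negative, 3 neutral, 4-5 positive) are illustrative choices, not prescribed by the text:

```r
# Hypothetical raw responses from two different sources
stars  <- c(1, 5, 4, 2, 3, 5)            # source 1: 1-5 star ratings
thumbs <- c("up", "down", "up", "up")    # source 2: thumbs up/down

# Map both representations onto one common scale
stars_to_sentiment <- function(s) {
  cut(s, breaks = c(0, 2, 3, 5),
      labels = c("negative", "neutral", "positive"))
}
thumbs_to_sentiment <- function(t) {
  factor(ifelse(t == "up", "positive", "negative"),
         levels = c("negative", "neutral", "positive"))
}

# A single, equivalent representation for both sources
combined <- c(as.character(stars_to_sentiment(stars)),
              as.character(thumbs_to_sentiment(thumbs)))
table(combined)   # frequency of each harmonized response
```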
6. Data Storage

Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives on this point. The most common alternative is using the Hadoop File System for storage, together with a limited version of SQL known as the HIVE Query Language. This allows most analytics tasks to be done in a similar way as in traditional BI data warehouses, from the user's perspective. Other storage options to be considered are MongoDB, Redis and Spark.
This stage of the cycle is related to the human resources' knowledge in terms of their ability to implement different architectures. Modified versions of traditional data warehouses are still being used in large-scale applications. For example, Teradata and IBM offer SQL databases that can handle terabytes of data, and open source solutions such as PostgreSQL and MySQL are still being used for large-scale applications.
Even though there are differences in how the different storages work in the
background, from the client side, most solutions provide a SQL API. Hence,
having a good understanding of SQL is still a key skill to have for big data analytics.
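Since most of these stores expose a SQL API, the querying skill carries over directly into R. Below is a small sketch using the DBI interface with an in-memory SQLite database as a stand-in for a real warehouse; the sales table and its columns are invented for illustration:

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver, used here as a stand-in backend

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a toy transactions table (hypothetical data)
dbWriteTable(con, "sales",
             data.frame(region = c("East", "West", "East"),
                        amount = c(120, 95, 210)))

# Plain SQL works the same way regardless of the storage engine
dbGetQuery(con, "SELECT region, SUM(amount) AS total
                 FROM sales GROUP BY region")

dbDisconnect(con)
```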
This stage a priori seems to be the most important topic; in practice, this is not true. It is not even an essential stage. It is possible to implement a big data solution that works with real-time data, in which case we only need to gather data to develop the model and then implement it in real time, so there would not be a need to formally store the data at all.
7. Exploratory Data Analysis
Once the data has been cleaned and stored in a way that insights can be retrieved
from it, the data exploration phase is mandatory. The objective of this stage is to
understand the data. This is normally done with statistical techniques and also
plotting the data. This is a good stage to evaluate whether the problem definition
makes sense or is feasible.
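A hedged sketch of this stage in R, using the built-in mtcars data set as a stand-in for project data:

```r
data(mtcars)          # built-in sample data, standing in for project data

summary(mtcars$mpg)   # numeric summary: min, quartiles, mean, max
hist(mtcars$mpg,      # distribution of one variable
     main = "Distribution of mpg", xlab = "Miles per gallon")
plot(mtcars$wt, mtcars$mpg,   # scan for relationships between variables
     xlab = "Weight", ylab = "Miles per gallon")
```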
8. Data Preparation for Modeling and Assessment

This stage involves reshaping the cleaned data retrieved previously and using statistical pre-processing for missing-value imputation, outlier detection, normalization, feature extraction and feature selection.
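A minimal R sketch of two of these steps, median imputation and normalization, plus a simple IQR-based outlier flag, on an invented vector:

```r
x <- c(4.1, NA, 5.0, 3.8, 120, NA, 4.6)   # invented data with gaps and an outlier

# Missing-value imputation: replace NA with the median of the observed values
x[is.na(x)] <- median(x, na.rm = TRUE)

# Outlier detection with the 1.5 * IQR rule
fence   <- quantile(x, c(.25, .75)) + c(-1.5, 1.5) * IQR(x)
outlier <- x < fence[1] | x > fence[2]    # flags the 120 entry

# Normalization: rescale to zero mean and unit variance
x_scaled <- as.numeric(scale(x))
```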
9. Modelling

The prior stage should have produced several data sets for training and testing, for example, of a predictive model. This stage involves trying different models and looking forward to solving the business problem at hand. In practice, it is normally desired that the model gives some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out data set.
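A compact R sketch of this train/test workflow, fitting a linear model on the built-in mtcars data; the 70/30 split and the RMSE metric are illustrative choices, not requirements:

```r
set.seed(42)
n     <- nrow(mtcars)
train <- sample(n, size = round(0.7 * n))   # 70/30 train/test split

# Fit a candidate model on the training rows only
fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])

# Evaluate on the held-out rows
pred <- predict(fit, newdata = mtcars[-train, ])
rmse <- sqrt(mean((mtcars$mpg[-train] - pred)^2))
rmse   # lower is better; compare this across competing models
```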
10. Implementation
In this stage, the data product developed is implemented in the data pipeline of
the company. This involves setting up a validation scheme while the data product
is working in order to track its performance. For example, in case of implementing
a predictive model, this stage would involve applying the model to new data and
once the response is available, evaluate the model.
Ans :
The people who are using Big Data know best what Big Data is. Let's look at some such industries:
Healthcare: Big Data has already started to create a huge difference in the
healthcare sector. With the help of predictive analytics, medical professionals
and HCPs are now able to provide personalized healthcare services to individual
patients. Apart from that, fitness wearables, telemedicine, remote monitoring -
all powered by Big Data and AI - are helping change lives for the better.
Banking: The banking sector relies on Big Data for fraud detection. Big Data tools can efficiently detect fraudulent acts in real time, such as the misuse of credit/debit cards, archival of inspection tracks, faulty alteration in customer stats, etc.
Manufacturing: According to the TCS 2013 Global Trend Study, the most significant benefit of Big Data in manufacturing is improving supply strategies and product quality. In the manufacturing sector, Big Data helps create a transparent infrastructure, thereby predicting the uncertainties and incompetencies that can affect the business adversely.
IT: One of the largest users of Big Data, IT companies around the world are
using Big Data to optimize their functioning, enhance employee productivity
and minimize risks in business operations. By combining Big Data technologies
with ML and AI, the IT sector is continually powering innovation to find solutions
even for the most complex of problems.
ii) Unstructured
Unstructured data refers to the data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and analyze
unstructured data. Email is an example of unstructured data.
Structured Data vs. Unstructured Data

Characteristics:
- Structured: pre-defined data models; usually text only; easy to search.
- Unstructured: no pre-defined data model; may be text, images, sound, video or other formats; difficult to search.

Resides in:
- Structured: relational databases; data warehouses.
- Unstructured: applications; NoSQL databases; data warehouses; data lakes.

Generated by:
- Structured: humans or machines.
- Unstructured: humans or machines.

Typical applications:
- Structured: airline reservation systems; inventory control; CRM systems; ERP systems.
- Unstructured: word processing; presentation software; email clients; tools for viewing or editing media.

Examples:
- Structured: dates; phone numbers; social security numbers; credit card numbers; customer names; addresses; product names and numbers; transaction information.
- Unstructured: text files; presentation files; email messages; audio files; video files; images; surveillance imagery.
iii) Semi-structured
Semi-structured data pertains to data containing both of the formats mentioned above, i.e., structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data.
Short Answer Questions

1. Define Business Analytics.
Ans :
Business analytics is the practice of iterative, methodical exploration of an
organization’s data, with an emphasis on statistical analysis. Business analytics is
used by companies committed to data-driven decision-making.
According to Lynda (2018) - “allows us to learn from the past and make
better predictions for the future”.
2. Types of Business Analytics.
Ans :
i) Prescriptive: This type of analysis reveals what actions should be taken. This
is the most valuable kind of analysis and usually results in rules and
recommendations for next steps.
ii) Predictive: An analysis of likely scenarios of what might happen. The deliverables are usually a predictive forecast.
iii) Diagnostic: A look at past performance to determine what happened and why. The result of the analysis is often an analytic dashboard.
iv) Descriptive: What is happening now based on incoming data. To mine the
analytics, you typically use a real-time dashboard and/or email reports.
4. Characteristics of Big Data.
Ans :
i) Data Volume
Data volume can be measured by the quantity of transactions, events and the amount of history. Big Data isn't just a description of raw volume. The real challenge is identifying or developing the most cost-effective and reliable methods for extracting value from all the terabytes and petabytes of data now available. That's where Big Data analytics becomes necessary.
ii) Data Variety
It is the assortment of data. Traditionally data, especially operational data, is structured, as it is put into a database based on the type of data (i.e., character, numeric, floating point, etc.). Today a wide variety of data exists, such as Internet data (social media and social networks like Twitter and Facebook).
iii) Data Velocity
It is the measure of how fast the data is coming in. Remember our Facebook
example. 250 billion images may seem like a lot. But if you want your mind
blown, consider this: Facebook users upload more than 900 million photos a
day. So that 250 billion number from last year will seem like a drop in the bucket
in a few months. Facebook has to handle a tsunami of photographs every day. It
has to ingest it all, process it, file it, and somehow, later, be able to retrieve it.
iv) Variability
Variability refers to the increase in the range of values typical of a large data set. A related dimension is value, which addresses the need for the valuation of enterprise data.
5. Various Challenges in Business Analytics
Ans :
Executive Ownership: Business Analytics requires buy-in from senior
leadership and a clear corporate strategy for integrating predictive models.
IT Involvement: Technology infrastructure and tools must be able to handle
the data and Business Analytics processes.
Available Production Data vs. Cleansed Modeling Data : Watch for technology infrastructure that restricts the data available for historical modeling, and know the difference between historical data for model development and real-time data in production.
Project Management Office (PMO) : The correct project management
structure must be in place in order to implement predictive models and adopt an
agile approach.
End user Involvement and Buy-In : End users should be involved in adopting
Business Analytics and have a stake in the predictive model.
Change Management : Organizations should be prepared for the changes
that Business Analytics bring to current business and technology operations.
Explainability vs. the “Perfect Lift”: Balance building precise statistical
models with being able to explain the model and how it will produce results.
6. Digital Marketing
Ans :
The main topics under digital marketing include: Database Marketers - Pioneers of Big Data; Big Data and the New School of Marketing; Cross-Channel Life-cycle Marketing; Social and Affiliate Marketing; and Empowering Marketing with Social Intelligence.

Digital marketing encompasses using any sort of online media (profit or non-profit) for driving people to a website or a mobile app, retaining them, and interacting with them to understand what consumers really want.
8. ________ refers to the data that lacks any specific form. [b]
(a) Structured data (b) Unstructured data
(c) Both (d) None of the above
9. ________ is the last stage in the Big Data life cycle. [a]
(a) Implementation (b) Data Storage
(c) Data Munging (d) Research
10. In the ________ stage, we analyze what other companies have done in the same situations. [d]
(a) Implementation (b) Data Storage
(c) Data Munging (d) Research
ANSWERS
1. Business analytics (BA)
2. Analytics
3. Prescriptive
4. Analytical
5. Descriptive
6. Product
7. Organizational charts
8. Process flow Diagram
9. Data Volume
10. Big data
UNIT - II

DESCRIPTIVE ANALYTICS

Overview of Descriptive Statistics (Central Tendency, Variability), Data Visualization - Definition, Visualization Techniques - Tables, Cross Tabulations, Charts, Data Dashboards using MS-Excel or SPSS.
Q1. Define Statistics.
Ans :
Statistics is a branch of mathematics that deals with the collection, organization, analysis and interpretation of data.
According to Prof. Horace Secrist, "Statistics may be defined as the aggregate of facts affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other."
Descriptive statistics employs a set of procedures that make it possible to
meaningfully and accurately summarize and describe samples of data. In order for
one to make meaningful statements about psychological events, the variable or variables
involved must be organized, measured, and then expressed as quantities. Such
measurements are often expressed as measures of central tendency and measures of
variability.
Q2. Explain briefly about Descriptive Statistics?
Ans :
Descriptive Statistics
Descriptive statistics is used to summarize data and make sense out of the raw
data collected during the research. Since the data usually represents a sample, then
the descriptive statistics is a quantitative description of the sample.
MBA II YEAR III SEMESTER
The level of measurement of the data affects the type of descriptive statistics.
Nominal and ordinal type data (often termed together as categorical type data) will
differ in the analysis from interval and ratio type data (often termed together as
continuous type data).
Descriptive statistics for categorical data
Contingency tables (or frequency tables) are used to tabulate categorical data.
A contingency table shows a matrix or table between independent variables at the top
row versus a dependent variable on the left column, with the cells indicating the
frequency of occurrence of possible combination of levels. (check SPSS for examples).
Descriptive statistics for continuous data
There are two aspects of descriptive statistics used for continuous type data. They are:
Central tendency
Variability of the data
2.1.1 Measures of Central Tendency
Q3. Explain briefly about Measures of Central Tendency?
Ans :
Central tendency refers to "a number (statistic) that best characterizes the group as a whole" (Sommer & Sommer, 1997). It is generally referred to as the average. The three measures of central tendency, the mean, median, and mode, describe a distribution of data and are an index of the average, or typical, value of a distribution of scores.
The three types of averages are:
1. Mean
2. Median
3. Mode
1. MEAN (M)

The mean is the arithmetic average: the sum of all scores under consideration divided by the number of scores.

The sample mean of the values \(y_1, y_2, \ldots, y_n\) is \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\).
2. MEDIAN
It is the midpoint of a distribution of data. Half the scores fall above and half
below the median. The three measures of central tendency, the mean, median,
and mode, describe a distribution of data and are an index of the average, or
typical, value of a distribution of scores.
The median is the point at which 50% of the observations fall below and 50% above or, in other words, the middle number of a set of numbers arranged in ascending or descending order. (If the list includes an even number of values, the median is the arithmetic average of the middle two numbers.) Based on the data in the table, the full list of each student's study hours would be written 10, 9, 9, 9, 8, 8, 8, 8, and so on. If the list were written out in full, it would be clear that the middle two numbers of the 40 entries are 6 and 6, which average 6. So the median of the hours studied is 6.
3. MODE

The mode is the single score that occurs most often in a distribution of data, i.e., the number that appears most frequently. Based on the data in the table, the mode of the number of hours studied is also 6 (8 students studied for 6 hours, so 6 appears 8 times in the list, more than any other number).
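All three averages can be checked in R. Base R has no built-in statistical mode (its mode() function returns a storage type), so a small helper is sketched here; the hours vector is invented, shortened stand-in data consistent with the discussion above:

```r
hours <- c(10, 9, 9, 9, 8, 8, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 5, 4)  # invented study hours

mean(hours)     # arithmetic average
median(hours)   # middle value of the sorted data (6 here)

# Statistical mode: the value that occurs most often
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[which.max(freq)])
}
stat_mode(hours)   # 6 appears most often
```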
2.1.2 Measures of Variability
Q4. Explain the various measures of variability?
Ans :
There are many ways to describe variability including :
(i) Range
(ii) Interquartile Range (IQR)
(iii) Variance
(iv) Standard Deviation
(i) Range
Range = Maximum – Minimum
(a) Easy to calculate
(b) Very much affected by extreme values (range is not a resistant measure of variability).
(ii) Interquartile Range (IQR)

The pth percentile is the value such that P% of the data values fall at or below it. Thus, the median is the 50th percentile: fifty percent of the data values fall at or below the median. Also, Q1 = lower quartile = the 25th percentile, and Q3 = upper quartile = the 75th percentile. The interquartile range is IQR = Q3 - Q1.
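In R, percentiles and the IQR come straight from quantile() and IQR(); a short sketch on an invented sample:

```r
x <- c(2, 4, 4, 5, 7, 8, 9, 12, 15)   # invented sample

quantile(x, 0.50)          # the median is the 50th percentile
quantile(x, c(.25, .75))   # Q1 and Q3
IQR(x)                     # interquartile range: Q3 - Q1
```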
(iii) Variance
Two vending machines A and B drop candies when a quarter is inserted. The
number of pieces of candy one gets is random. The following data are recorded
for six trials at each vending machine:
Pieces of candy from vending machine A:
1, 2, 3, 3, 5, 4
Mean = 3, Median = 3, Mode = 3
Pieces of candy from vending machine B:
2, 3, 3, 3, 3, 4
Mean = 3, Median = 3, Mode = 3
Dotplots for the pieces of candy from vending machine A and vending machine
B:
[Dotplot figure: values from 1 to 5 for Machine A and Machine B]
They have the same center, but what about their spreads? One way to compare
their spreads is to compute their standard deviations. In the following section,
we are going to talk about how to compute the sample variance and the sample
standard deviation for a data set.
Variance is the average squared distance from the mean.

Population variance is defined as:

\(\sigma^2 = \dfrac{\sum_{i=1}^{N}(y_i - \mu)^2}{N}\)

In this formula, \(\mu\) is the population mean and the summation is over all possible values of the population. \(N\) is the population size.

The sample variance that is computed from the sample and used to estimate \(\sigma^2\) is:

\(s^2 = \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}\)

Why do we divide by \(n - 1\) instead of by \(n\)? Since \(\mu\) is unknown and estimated by \(\bar{y}\), the \(y_i\)'s tend to be closer to \(\bar{y}\) than to \(\mu\). To compensate, we divide by a smaller number, \(n - 1\).
Sample Variance
This is the common default calculation used by software. When asked to calculate the variance or standard deviation of a set of data, assume, unless otherwise instructed, that it is sample data, and therefore calculate the sample variance and sample standard deviation.
Examples

Let's find \(s^2\) for the data set from vending machine A: 1, 2, 3, 3, 4, 5.

\(\bar{y} = \dfrac{1 + 2 + 3 + 3 + 4 + 5}{6} = 3\)

\(s^2 = \dfrac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1} = \dfrac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{6 - 1} = 2\)

Calculate \(s^2\) for the data set from vending machine B yourself and check that it is smaller than the \(s^2\) for data set A.
(iv) Standard Deviation

The population standard deviation is denoted by \(\sigma\) and found as \(\sigma = \sqrt{\sigma^2}\). It has the same unit as the \(y_i\)'s, which is a desirable property, since one may think about the spread in terms of the original unit.

\(\sigma\) is estimated by the sample standard deviation \(s = \sqrt{s^2}\).

For data set A, \(s = \sqrt{2} \approx 1.414\) pieces of candy.
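The vending machine comparison is easy to reproduce in R. Note that var() and sd() compute the sample versions, dividing by n - 1, exactly as discussed above:

```r
a <- c(1, 2, 3, 3, 4, 5)   # machine A
b <- c(2, 3, 3, 3, 3, 4)   # machine B

var(a)   # 2     : sample variance, divides by n - 1
sd(a)    # 1.414 : square root of the sample variance
var(b)   # 0.4   : same center as A, but much less spread
sd(b)    # 0.632
```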
Q5. Define Data Visualization. Explain its importance.
Ans :
Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.
Today's data visualization tools go beyond the standard charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as infographics, dials and gauges, geographic maps, sparklines, heat maps, and detailed bar, pie and fever charts. The images may include interactive capabilities, enabling users to manipulate them or drill into the data for querying and analysis. Indicators designed to alert users when data has been updated or predefined conditions occur can also be included.
Importance of data visualization
Data visualization has become the de facto standard for modern business intelligence (BI). The success of the two leading vendors in the BI space, Tableau and Qlik, both of which heavily emphasize visualization, has moved other vendors toward a more visual approach in their software. Virtually all BI software has strong data visualization functionality.
Data visualization tools have been important in democratizing data and analytics
and making data-driven insights available to workers throughout an organization.
They are typically easier to operate than traditional statistical analysis software
or earlier versions of BI software. This has led to a rise in lines of business
implementing data visualization tools on their own, without support from IT.
Data visualization software also plays an important role in big data and advanced
analytics projects. As businesses accumulated massive troves of data during the
early years of the big data trend, they needed a way to quickly and easily get an
overview of their data. Visualization tools were a natural fit.
Visualization is central to advanced analytics for similar reasons. When a data
scientist is writing advanced predictive analytics or machine learning algorithms,
it becomes important to visualize the outputs to monitor results and ensure that
models are performing as intended. This is because visualizations of complex
algorithms are generally easier to interpret than numerical outputs.
Examples of data visualization
Data visualization tools can be used in a variety of ways. The most common use
today is as a BI reporting tool. Users can set up visualization tools to generate automatic dashboards that track company performance across key performance indicators and visually interpret the results.
Q6. How does Data Visualization work?
Ans :
Most of today’s data visualization tools come with connectors to popular data
sources, including the most common relational databases, Hadoop and a variety of
cloud storage platforms. The visualization software pulls in data from these sources
and applies a graphic type to the data.
Data visualization software allows the user to select the best way of presenting
the data, but, increasingly, software automates this step. Some tools automatically
interpret the shape of the data and detect correlations between certain variables and
then place these discoveries into the chart type that the software determines is optimal.
[Figure: grouped bar chart of mobile OS market share over three periods - Android 67.4%, 66.2%, 57.8%; iOS 23.9%, 22.4%, 22.9%; Windows 8.6%, 11.3%, 19.3%]
Q7. What are the benefits of Data Visualization?
Ans :
By using data visualization, it becomes easier for business owners to understand their large data in a simple format.

The visualization method is also time-saving, so businesses do not have to spend much time making a report or solving a query. They can do it in less time and in a more appealing way.

Visual analytics offers a story to the viewers. By using charts, graphs or images, a person can easily express the whole concept, and the viewers will be able to understand the whole thing in an easy way.

The most complicated data will look easy when it goes through the process of visualization. A complicated data report gets converted into a simple format, and this helps people understand the concept in an easy way.

With the visualization process, it gets easier for business owners to understand their product growth and market competition in a better way.
Q8. Explain the various Data Visualization techniques.
Ans :
The main data visualization techniques are diagrams, charts and graphs. The most widely used forms of data visualization are presented below:
1. Pie Chart
Pie charts are one of the most common and popular techniques, and they also come under data visualization techniques in Excel. However, for some people it can be hard to understand a pie chart compared to line and bar type charts.
2. Line Chart
To make your data simple and more appealing, you can use the line chart technique. A line chart basically displays the relationship between two patterns, and it is one of the most used techniques worldwide.
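A line chart takes a single call in base R; a sketch with invented monthly sales figures:

```r
sales <- c(12, 15, 14, 18, 21, 19)   # invented monthly figures

plot(sales, type = "o",              # "o" draws a line with point markers
     xlab = "Month", ylab = "Sales (units)",
     main = "Monthly sales trend")
```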
3. Bar Chart
Bar charts are also one of the most commonly used techniques when it comes to comparing two different patterns. Bar charts can display the data horizontally or vertically, depending on your needs.
4. Area Chart
An area chart or area graph is similar to a line chart but presents quantitative data graphically. The areas can be filled with colour, hatching or a pattern. This chart is generally used when comparing quantities that are depicted by area.
5. Heat Map
This type of chart is widely used by websites, mobile application makers, research institutes, etc. These maps show the concentration of activity or entities over a particular area.
6. Network Diagram
This is a powerful tool for finding out connections and correlations. It highlights and bridges the gaps, and shows how strongly one activity is connected to another.
7. Scattered 3D Plot
2.3.1 Tables
Q9. How can the data visualization technique "Tables" display data analysis reports using MS Excel?
Ans :
Data analysis reports using MS Excel can be displayed in a number of ways. However, if the data analysis results can be visualized as charts that highlight the notable points in the data, the audience can quickly grasp what you want to project in the data. It also leaves a good impact on the presentation style.
Here you will get to know how to use Excel charts and Excel formatting features
on charts that enable you to present your data analysis results with emphasis.
In Excel, charts are used to make a graphical representation of any set of data.
A chart is a visual representation of the data, in which the data is represented by
symbols such as bars in a Bar Chart or lines in a Line Chart. Excel provides you with
many chart types and you can choose one that suits your data or you can use the
Excel Recommended Charts option to view charts customized to your data and select
one of those.
Refer to the Tutorial Excel Charts for more information on chart types.
In this chapter, you will understand the different techniques that you can use with the
Excel charts to highlight your data analysis results more effectively.
Suppose you have the target and actual profits for the fiscal year 2015-2016
that you obtained from different regions.
Click Combo.
Change the Chart Type for the series Actual to Line with Markers. The preview
appears under Custom Combination.
Click OK.
As you observe in the chart, the Target values are in Columns and the Actual
values are marked along the line. The data visualization has become better as it also
shows you the trend of your results.
However, this type of representation does not work well when the data ranges of
your two data values vary significantly.
Creating a Combo Chart with Secondary Axis
Suppose you have the data on the number of units of your product that was
shipped and the actual profits for the fiscal year 2015-2016 that you obtained from
different regions.
If you use the same combination chart as before, you will get the following:
In the chart, the data of No. of Units is not visible as the data ranges are
varying significantly.
In such cases, you can create a combination chart with secondary axis, so that
the primary axis displays one range and the secondary axis displays the other.
Click the INSERT tab.
Click Combo in Charts group.
Click Create Custom Combo Chart from the drop-down list.
You can observe the values for Actual Profits on the primary axis and the values
for No. of Units on the secondary axis.
A significant observation in the above chart is for Quarter 3 where No. of Units
sold is more, but the Actual Profits made are less. This could probably be attributed to
the promotion costs that were incurred to increase sales. The situation is improved in
Quarter 4, with a slight decrease in sales and a significant rise in the Actual Profits
made.
Discriminating Series and Category Axis
Suppose you want to project the Actual Profits made in Years 2013-2016.
As you observe, the data visualization is not effective as the years are not
displayed. You can overcome this by changing year to category.
Remove the header year in the data range.
Now, year is considered as a category and not a series. Your chart looks as follows -
Chart Elements give more description to your charts, thus helping to visualize your data more meaningfully.
Chart Elements
Chart Styles
Chart Filters
You can use Trendline to graphically display trends in data. You can extend a
Trendline in a chart beyond the actual data to predict future values.
Q10. What is Cross Tabulation? Explain with an example.
Ans :
Cross tabulation is usually performed on categorical data that can be divided
into mutually exclusive groups.
An example of categorical data is the region of sales for a product. Typically,
region can be divided into categories such as geographic area (North, South, Northeast,
West, etc) or state (Andhra Pradesh, Rajasthan, Bihar, etc). The important thing to
remember about categorical data is that a categorical data point cannot belong to
more than one category.
Cross tabulations are used to examine relationships within data that may not be
readily apparent. Cross tabulation is especially useful for studying market research or
survey responses. Cross tabulation of categorical data can be done through tools such as SPSS, SAS, and Microsoft Excel.
An example of cross tabulation
"No other tool in Excel gives you the flexibility and analytical power of a pivot table." - Bill Jelen
One simple way to do cross tabulations is Microsoft Excel’s pivot table feature.
Pivot tables are a great way to search for patterns as they help in easily grouping raw
data.
Consider the below sample data set in Excel. It displays details about commercial
transactions for four product categories. Let’s use this data set to show cross tabulation
in action.
This data can be converted to pivot table format by selecting the entire table and
inserting a pivot table in the Excel file. The table can correlate different variables row-
wise, column-wise, or value-wise in either table format or chart format.
Let's use cross tabulation to check the relation between the type of payment method (e.g., Visa, MasterCard, PayPal) and the product category with respect to the region of sales. We can select these three categories in the pivot table.
Cross tabulation 1: Relation between the payment method and the total amount of sales in each product category with respect to the region in which products were sold.

It is now clear that the highest sales were made for P1 using MasterCard. Therefore, we can conclude that the MasterCard payment method and the product category P1 are the most profitable combination.
Similarly, we can use cross tabulation and find the relation between the product
category and the payment method type with regard to the number of transactions.
This can be done by grouping the payment method, product category, and units sold:
By default, Excel’s pivot table aggregates values as a sum. Summing the units
will give us the total number of units sold. Since we want to compare the number of
transactions instead of the number of units sold, we need to change the Value Field
Setting from Sum to Count for Units.
The results of this pivot table mapping are shown below. This is a cross tabulation analysis of 3 variables: it analyses the correlation between the payment method and the product category according to the number of transactions.
For all regions, we can observe that the highest selling category of products was
P1 and the highest number of transactions was done using MasterCard. We can also
see the preferred payment method in each of the product categories. For example,
American Express is the preferred card for P2 products.
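The same kind of cross tabulation can be produced outside Excel as well; a sketch in R using table() and xtabs() on invented transactions mirroring the example above:

```r
tx <- data.frame(   # invented transactions
  payment = c("MasterCard", "Visa", "PayPal", "MasterCard", "Visa", "MasterCard"),
  product = c("P1", "P2", "P1", "P1", "P4", "P2")
)

# Counts of transactions for each payment-method / product combination
table(tx$payment, tx$product)

# The same contingency table via a formula interface
xtabs(~ payment + product, data = tx)
```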
Q11. What are the advantages of Cross Tabulation?
Ans :
i) Eliminates confusion while interpreting data
Raw data can be difficult to interpret. Even for small data sets, it is all too easy
to derive wrong results by just looking at the data. Cross tabulation offers a
simple method of grouping variables, which minimizes the potential for confusion
or error by providing clear results.
As we observed in our example, cross tabulation can help us derive great insights
from raw data. These insights are not easy to see when the raw data is formatted
as a table. Since cross tabulation clearly maps out relations between categorical
variables, researchers can gain better and deeper insights — insights that
otherwise would have been overlooked or would have taken a lot of time to
decode from more complicated forms of statistical analysis.
Q12. Explain Band Chart.
Ans :
You might have to present customer survey results of a product from different
regions. Band Chart is suitable for this purpose. A Band Chart is a Line Chart with an
added shaded area to display the upper and lower boundaries of groups of data.
Suppose your customer survey results from the east and west regions, month-
wise are:
Here, in the data < 50% is Low, 50% - 80% is Medium and > 80% is High.
With Band Chart, you can display your survey results as follows:
Q13. Explain Gantt Chart.
Ans :
A Gantt Chart is a chart in which a series of horizontal lines shows the amount
of work done in certain periods of time in relation to the amount of work planned for
those periods.
In Excel, you can create a Gantt Chart by customizing a Stacked Bar Chart type
so that it depicts tasks, task duration and hierarchy. An Excel Gantt Chart typically
uses days as the unit of time along the horizontal axis.
Set Minor Tick Marks at 1-day intervals.
Format the Data Series to make it look impressive.
Give a Chart Title.
3. Operational Dashboards
4. Informational Dashboards
Informational dashboards are just for displaying figures, facts and/or statistics. They can be either static or dynamic with live data, but they are not interactive. An example is a flight arrival/departure information dashboard in an airport.
Ans :
Create Interactive Excel Dashboard
Most of us probably rely on our trusted MS Excel dashboards for the day-to-day running of our businesses, but like many, we struggle to turn that data into something that will actually interest people and make them want to know more about it. A dashboard is a comprehensive and complete visual report or analysis of your project which can be shared with other people concerned. Creating an Excel dashboard can be tedious, time-consuming and difficult if you do not have proper knowledge of how to go about it. But fret not, that is what the following steps are for.
2. Select a background
If you are using a pivot table, use the GETPIVOTDATA function. If you use a flat file, there are a number of formulae you can use, like DSUM, DGET, VLOOKUP, MATCH, INDEX, or even a few math formulas like SUM, SUMIF, etc.

But be careful here: do not punch in formula after formula. Fewer formulas mean a safer and more reliable Excel dashboard, which is also easier to maintain. You can automatically reduce the number of formulas by using pivot tables.

Also, another important point is that you should name all your ranges and always document your work. Simplify your work by making your Excel dashboard formulas cleaner.
Dashboards that a user can't interact with don't make much sense. All your Excel dashboards should have controls that enable you to change the markets, product details and other nitty-gritty details. What is most important is that the user must be in complete charge of his or her own Excel dashboard and able to make changes whenever and wherever they want.

If you are creating interactive charts, you will need dynamic ranges. You can do this by using the OFFSET() function. You can also add a few cool things to your Excel dashboard, like greeting the user and selecting the corresponding profile when they open the Excel dashboard. All this can be done using
macros. All you need to do is record a macro, add a FOR NEXT or a FOR
EACH loop. If you have never recorded a macro before, there are a large
number of sites online which give you perfectly tailored macros as per your
needs.
a) Bar Charts

Bar charts, as we all know, are bars on the x-axis. One of the most common misconceptions about Excel dashboards is that more is better; the truth is, that is seldom true. Bar charts are simple and very effective, and they are particularly useful for comparing one concept to another, as well as trends.
(d) Tables
Tables are great if you have detailed information with different measuring
units, which may be difficult to represent through other charts or graphs.
(e) Area charts
Area charts are very useful for multiple data series, which may or may not be
related to each other (partially or wholly). They are also useful for an individual
series that represents a physically countable set.
So choose wisely, and you will be good.
8. Colour theory
Colours in an Excel dashboard make it livelier as opposed to the drab and overused grey, black and white. I could write an entire book on how colour theory works, but well, that's already done. You must know which colours work together and which do not. For example, you cannot pair bright pink and red together unless you want an assault on the eyes. One thing you must keep in mind while selecting a colour coding is that 8% of men and 0.5% of women are colour blind.
Most people can perceive a colour, but cannot correctly distinguish between
two shades of the same colour. These people can perceive changes in
brightness though, just like me and you. Avoid having shades that overlap,
like the example I gave above. That would not only look ugly, but also be
completely useless for users we discussed above.
What are the benefits of data dashboards?
Ans :
Dashboards allow managers to monitor the contribution of the various departments
in the organization. To monitor the organization’s overall performance, dashboards
allow you to capture and report specific data points from each of the departments in
the organization, providing a snapshot of current performance and a comparison with
earlier performance.
Benefits of dashboards include the following:
Visual presentation of performance measures
Ability to identify and correct negative trends
Measurement of efficiencies/inefficiencies
Ability to generate detailed reports showing new trends
Ability to make more informed decisions based on collected data
Alignment of strategies and organizational goals
Instant visibility of all systems in total
Quick identification of data outliers and correlations
Time-saving with the comprehensive data visualization as compared to
running multiple reports.
Ans :
Statistics is a branch of mathematics that deals with the collection, organization, analysis and interpretation of data.
2. Descriptive Statistics
Ans :
Descriptive Statistics
Descriptive statistics is used to summarize data and make sense of the raw data collected during research. Since the data usually represents a sample, descriptive statistics provide a quantitative description of that sample.
The level of measurement of the data affects the type of descriptive statistics.
Nominal and ordinal type data (often termed together as categorical type data) will
differ in the analysis from interval and ratio type data (often termed together as
continuous type data).
Contingency tables (or frequency tables) are used to tabulate categorical data. A contingency table shows a matrix with the levels of one (independent) variable across the top row and those of the other (dependent) variable down the left column, with the cells indicating the frequency of occurrence of each possible combination of levels (check SPSS for examples).
There are two aspects of descriptive statistics used for continuous type data. They are:
Central tendency
Ans :
It is a general term that describes any effort to help people understand the
significance of data by placing it in a visual context. Patterns, trends and correlations
that might go undetected in text-based data can be exposed and recognized more easily
with data visualization software.
Today’s data visualization tools go beyond the standard charts and graphs used
in Microsoft Excel spreadsheets, displaying data in more sophisticated ways
such as infographics, dials and gauges, geographic maps, sparklines, heat
maps, and detailed bar, pie and fever charts. The images may include
interactive capabilities, enabling users to manipulate them or drill into the data
for querying and analysis. Indicators designed to alert users when data has been
updated or predefined conditions occur can also be included.
4. Cross Tabulations
Ans :
Cross tabulation is usually performed on categorical data - data that can be
divided into mutually exclusive groups.
Cross tabulations are used to examine relationships within data that may not be
readily apparent. Cross tabulation is especially useful for studying market research or
survey responses. Cross tabulation of categorical data can be done with tools such as SPSS, SAS, and Microsoft Excel.
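As an illustration, the same kind of cross tabulation can be produced in R (the language covered in Unit V). This is only a minimal sketch on made-up gender and preference values; the variable names are hypothetical:

gender <- c("Male", "Female", "Female", "Male", "Female", "Male")
prefers <- c("Yes", "No", "Yes", "Yes", "No", "No")
ct <- table(gender, prefers)   # build the contingency table
addmargins(ct)                 # add row and column totals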
Ans :
i) Eliminates confusion while interpreting data
Raw data can be difficult to interpret. Even for small data sets, it is all too easy
to derive wrong results by just looking at the data. Cross tabulation offers a
simple method of grouping variables, which minimizes the potential for confusion
or error by providing clear results.
As we observed in our example, cross tabulation can help us derive great insights
from raw data. These insights are not easy to see when the raw data is formatted
as a table. Since cross tabulation clearly maps out relations between categorical
variables, researchers can gain better and deeper insights — insights that
otherwise would have been overlooked or would have taken a lot of time to
decode from more complicated forms of statistical analysis.
Ans :
By using data visualization, it becomes easier for business owners to understand their large data in a simple format.
The visualization method is also time-saving, so businesses do not have to spend much time making a report or solving a query. They can do it in less time and in a more appealing way.
Visual analytics offers a story to the viewers. By using charts, graphs or images, a person can easily convey the whole concept, and the viewers will be able to understand the whole thing in an easy way.
The most complicated data will look easy when it goes through the process of visualization. A complicated data report gets converted into a simple format, and this helps people to understand the concept in an easy way.
With the visualization process, it gets easier for business owners to understand their product growth and market competition in a better way.
9. Gantt Chart
Ans :
A Gantt Chart is a chart in which a series of horizontal lines shows the amount
of work done in certain periods of time in relation to the amount of work planned for
those periods.
In Excel, you can create a Gantt Chart by customizing a Stacked Bar Chart type
so that it depicts tasks, task duration and hierarchy. An Excel Gantt Chart typically
uses days as the unit of time along the horizontal axis.
Here, 'Start' represents the number of days from the Start Date of the project.
Ans :
Benefits of dashboards include the following:
Measurement of efficiencies/inefficiencies
UNIT - III
PREDICTIVE ANALYTICS
supervised machine learning techniques are used to predict a future value (How long
can this machine run before requiring maintenance?) or to estimate a probability (How
likely is this customer to default on a loan?).
Predictive Analytics starts with a business goal to use data to reduce waste, save
time or cut costs. The process harnesses heterogeneous, often massive, data sets into
models that can generate clear, actionable outcomes to support achieving that goal,
such as less material waste, less stocked inventory, and manufactured product that
meets specifications.
d) Nonlinear Regression. If the curve of the regression is not a straight line, the
regression is termed as curved or non-linear regression. The regression equation
will be a functional relation between variables x and y involving terms in x and y
of degree more than one.
Applications / Utility of Regression Test
Regression lines or equations are useful in the predictions of values of one variable
for a specified value of the other variable.
Example
i) For pharmaceutical firms which are interested in studying the effect of new drugs
in patients, regression test helps in such predictions.
ii) When price and demand are related, we can estimate or predict the future demand
for a specified price.
iii) When crop yield depends on the amount of rainfall, then regression test can
predict crop yield for a particular amount of rainfall.
iv) If advertising expenditure and sales are related, then regression analysis helps in
estimating the advertising expenditure for a required amount of sales (or) sales
expected for a particular advertising expenditure.
v) When capital employed and profits earned are related, the test can be used to
predict profits for a specified amount of capital invested.
Q5. Explain the limitations of regression analysis.
Ans :
Limitations of Regression Analysis
Some of the limitations of regression analysis are as follows :
1. Regression analysis assumes that a linear relationship exists among the related variables. But in the area of social sciences, a linear relationship may not exist among the related variables.
2. When regression analysis is used to evaluate the value of the dependent variable based on the independent variable, it is assumed that static conditions of relationship exist between them. These static conditions do not exist in social sciences, so this assumption minimizes the use of regression analysis in social science.
Ans :
Regression is mainly concerned with the estimation of the unknown value of one variable from the known value of the other variable in the given observations. For doing so, there must be a relation between the two variables. This relationship is mathematically expressed in the form of an equation known as the "Regression Equation" or "Estimating Equation".
The regression equation which states and explains the linear relationship between
two variables is known as ‘Linear Regression Equation’. Basically, as there are two
regression lines, there would be two regression equations i.e.,
1. Regression equation of Y on X and
2. Regression equation of X on Y.
The regression equation of Y on X is considered for predicting the value of Y
when a specific value of X is given. Whereas the regression equation of X on Y is used
for predicting the unknown value of X when a specific value of Y is given.
Formation of Regression Equations
There are two ways of forming regression equations as follows,
a) Normal equation and
b) Regression coefficient.
Formation of Regression Equation through Normal Equation
Generally, in situations where a perfect linear relationship exists between the two variables X and Y, there would be two regression lines, and when there are two regression lines there would be two regression equations, as follows:
1. The regression equation of Y on X is denoted as Yc = a + bX.
2. The regression equation of X on Y is denoted as Xc = a + bY.
In the above equations, 'a' and 'b' are two unknown constants which ascertain the position of the regression line. Therefore, these constants are known as parameters of the regression lines.
The parameter 'a' ascertains the level of the fitted line, whereas 'b' ascertains the slope of the line. Yc and Xc are the symbols showing the values of Y and X calculated from the relationship for a given X or Y.
Regression Equation of Y on X
Y = a + bX
By applying the least squares principle, the values of 'a' and 'b' are determined in such a way that the sum of squared deviations Σ(Y – Yc)² is minimum.
The normal equations for determining the values of a and b are,
ΣY = Na + bΣX ...(1)
ΣXY = aΣX + bΣX² ...(2)
Regression Equation of X on Y
Xc = a + bY
The normal equations for obtaining the values of a and b are,
ΣX = Na + bΣY ...(1)
ΣXY = aΣY + bΣY² ...(2)
After calculating the values of N, ΣX, ΣY, ΣX², ΣY² and ΣXY, substitute them in the normal equations for Y on X and X on Y to ascertain the values of a and b. Lastly, by substituting the values of a and b in the regression equation, the required best-fitting straight line is obtained.
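For comparison, the same least squares fit can be obtained in R with the built-in lm() function, which solves these normal equations internally. A minimal sketch on made-up x and y values:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.9, 8.2, 9.8)
fit <- lm(y ~ x)                           # least squares estimates of a and b
coef(fit)                                  # intercept 'a' and slope 'b'
predict(fit, newdata = data.frame(x = 6))  # predict Y for a new value of X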
b) Regression Coefficients
To estimate the values of the population parameters β0 and β1, under certain assumptions, the fitted or estimated regression equation representing the straight-line regression model is written as:
ŷ = a + bx
where
ŷ = estimated average (mean) value of the dependent variable y for a given value of the independent variable x.
a or b0 = y-intercept that represents the average value of ŷ
iii) No Relationship
The graph of no relationship between two variables looks as follows,
Fig. : No Relationship
If the relationship displayed in your scatterplot is not linear, you will have to
either run a non-linear regression analysis, perform a polynomial regression or
“transform” your data, which you can do using SPSS Statistics. We show you
how to: (a) create a scatterplot to check for linearity when carrying out linear
regression using SPSS Statistics; (b) interpret different scatterplot results; and (c)
transform your data using SPSS Statistics if there is not a linear relationship
between your two variables.
Assumption #3: There should be no significant outliers. An outlier is an observed
data point that has a dependent variable value that is very different to the value
predicted by the regression equation. As such, an outlier will be a point on a
scatterplot that is (vertically) far away from the regression line indicating that it
has a large residual, as highlighted below:
The problem with outliers is that they can have a negative effect on the
regression analysis (e.g., reduce the fit of the regression equation) that is used to
predict the value of the dependent (outcome) variable based on the independent
(predictor) variable. This will change the output that SPSS Statistics produces
and reduce the predictive accuracy of your results. Fortunately, when using SPSS
Statistics to run a linear regression on your data, you can easily include criteria
to help you detect possible outliers. We: (a) show you how to detect outliers using
“case-wise diagnostics”, which is a simple process when using SPSS Statistics;
and (b) discuss some of the options you have in order to deal with outliers.
Assumption #4: We should have independence of observations, which you
can easily check using the Durbin-Watson statistic, which is a simple test to run
using SPSS Statistics.
Assumption #5: Your data needs to show homoscedasticity, which is where
the variances along the line of best fit remain similar as you move along the line.
Whilst we explain more about what this means and how to assess the homoscedasticity of your data, take a look at the three scatterplots below, which
provide three simple examples: two of data that fail the assumption (called
heteroscedasticity) and one of data that meets this assumption (called
homoscedasticity):
Whilst these help to illustrate the differences in data that meets or violates the
assumption of homoscedasticity, real-world data can be a lot more messy and
illustrate different patterns of heteroscedasticity. Therefore, we explain: (a) some
of the things you will need to consider when interpreting your data; and (b)
possible ways to continue with your analysis if your data fails to meet this
assumption.
Assumption #6: Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed. Two common methods to check this are a histogram (with a superimposed normal curve) and a Normal P-P Plot.
2. Click on the 'Data' tab present on the Excel ribbon and then click the 'Data Analysis' command present under the 'Analysis' group.
3. As a result, the 'Data Analysis' dialog box will be displayed. Go to the 'Analysis Tools' section, select the 'Regression' option from the menu list and then click the "OK" button.
5. In the 'Regression' dialog box, go to the 'Input Y Range' field and provide the range of the dependent variable 'Y'. Similarly, go to the 'Input X Range' field and provide the range of the independent variable 'X'.
7. Go to the 'Output Options' section and checkmark one of the three output options.
8. Go to the 'Residuals' section and checkmark one of the four options ('Residuals', 'Residual Plots', 'Standardized Residuals', 'Line Fit Plots') to provide residuals on the output table.
9. Go to the 'Normal Probability' section and checkmark the option beside 'Normal probability plots' to build or construct a normal probability plot for the dependent variable 'Y'.
10. Click on “OK” button. As a result, the regression analysis output will be displayed
on the screen.
Ans :
Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance
explained. For example, you might want to know how much of the variation in exam
performance can be explained by revision time, test anxiety, lecture attendance and
gender “as a whole”, but also the “relative contribution” of each independent variable
in explaining the variance.
Assumption #1: Your dependent variable should be measured on a continuous
scale (i.e., it is either an interval or ratio variable). Examples of variables that
meet this criterion include revision time (measured in hours), intelligence
(measured using IQ score), exam performance (measured from 0 to 100), weight
(measured in kg), and so forth. You can learn more about interval and ratio
variables in our article: Types of Variable. If your dependent variable was
measured on an ordinal scale, you will need to carry out ordinal regression rather
than multiple regression. Examples of ordinal variables include Likert items (e.g.,
a 7-point scale from “strongly agree” through to “strongly disagree”), amongst
other ways of ranking categories (e.g., a 3-point scale explaining how much a
customer liked a product, ranging from “Not very much” to “Yes, a lot”). You
can access our SPSS Statistics guide on ordinal regression here.
Assumption #2: You have two or more independent variables which can be
either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal
or nominal variable). For examples of continuous and ordinal variables, see the
bullet above. Examples of nominal variables include gender (e.g., 2 groups: male
and female), ethnicity (e.g., 3 groups: Caucasian, African American and
Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and
high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and
so forth. Again, you can learn more about variables in our article: Types of
Variable. If one of your independent variables is dichotomous and considered a
moderating variable, you might need to run a Dichotomous moderator analysis.
Assumption #3: You should have independence of observations (i.e.,
independence of residuals), which you can easily check using the Durbin-Watson
statistic, which is a simple test to run using SPSS Statistics. We explain how to
interpret the result of the Durbin-Watson statistic, as well as showing you the
SPSS Statistics procedure required, in our enhanced multiple regression guide.
Assumption #4: There needs to be a linear relationship between: (a) the
dependent variable and each of your independent variables, and (b) the dependent
variable and the independent variables collectively. Whilst there are a number of
ways to check for these linear relationships, we suggest creating scatterplots and
partial regression plots using SPSS Statistics, and then visually inspecting these
scatterplots and partial regression plots to check for linearity. If the relationships displayed in your scatterplots and partial regression plots are not linear, you will
have to either run a non-linear regression analysis or “transform” your data,
which you can do using SPSS Statistics. In our enhanced multiple regression
guide, we show you how to: (a) create scatterplots and partial regression plots to
check for linearity when carrying out multiple regression using SPSS Statistics;
(b) interpret different scatterplot and partial regression plot results; and (c) transform
your data using SPSS Statistics if you do not have linear relationships between
your variables.
Assumption #5: Your data needs to show homoscedasticity, which is where
the variances along the line of best fit remain similar as you move along the line.
We explain more about what this means and how to assess the homoscedasticity
of your data in our enhanced multiple regression guide. When you analyze your
own data, you will need to plot the studentized residuals against the unstandardized
predicted values. In our enhanced multiple regression guide, we explain: (a) how
to test for homoscedasticity using SPSS Statistics; (b) some of the things you will
need to consider when interpreting your data; and (c) possible ways to continue
with your analysis if your data fails to meet this assumption.
Assumption #6: Your data must not show multicollinearity, which occurs when
you have two or more independent variables that are highly correlated with each
other. This leads to problems with understanding which independent variable
y = β0 + β1X1 + β2X2 + ..... + βpXp + ε
In the above equation, β0, β1, ..., βp specify the population parameters, X1, X2, .... Xp specify the independent variables, y defines the dependent variable and 'ε' defines the error term.
The expected value of 'y' for a given value of x can be calculated using the above equation if the parameter values β0, β1, ..., βp are known. On the other hand, if the parameter values are not known then they must be calculated using the sample data.
The estimated regression equation for multiple linear regression can be attained by substituting the values of the sample statistics b0, b1, ... , bp for β0, β1, ... , βp.
The estimated regression equation in multiple regression model is,
ŷ = b0 + b1 x1 + b2 x2 + .... + bpxp
In the above equation, ŷ refers to the point estimator of the expected value of y for a given value of x; the partial regression coefficients b1, ... , bp indicate the change in the mean value of the dependent variable 'y' for a unit increase in the corresponding independent variable, while holding the values of the remaining independent variables constant. For instance, consider the following Excel file containing salary details of employees.
In the above table, the multiple regression model can be written with 'CTC' as the dependent variable. Each coefficient bj indicates the change in the mean value of 'CTC' for a unit increase in the associated independent variable while holding the remaining independent variables ('Basic Salary', 'EPF', 'ESI' and 'Gross Salary') constant. Like simple linear regression, multiple linear regression also follows the least squares technique for estimating both the intercept and the slope coefficients.
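A minimal R sketch of the same idea, assuming a small hypothetical salary table (all figures below are made up):

salary <- data.frame(
  basic = c(30, 42, 55, 61, 48),   # Basic Salary (in thousands)
  epf   = c(3.6, 5.0, 6.6, 7.3, 5.8),
  esi   = c(0.5, 0.7, 0.9, 1.0, 0.8),
  ctc   = c(40, 55, 72, 80, 63)    # dependent variable
)
fit <- lm(ctc ~ basic + epf + esi, data = salary)  # least squares fit
summary(fit)    # reports b0, b1, ..., bp and the overall fit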
2. Click on the 'Data' tab present on the Excel ribbon and then click the 'Data Analysis' command present under the 'Analysis' group.
3. As a result, the 'Data Analysis' dialog box will be displayed. Go to the 'Analysis Tools' section, select the 'Regression' option from the menu list and then click the "OK" button.
5. In the 'Regression' dialog box, go to the 'Input Y Range' field and provide the range of the dependent variable 'Y'. Similarly, go to the 'Input X Range' field and provide the entire range of the independent variables 'X'.
(i) Labels: Checkmark this option if the data range includes descriptive labels.
7. Go to the 'Output Options' section and checkmark one of the three output options.
The forecasting method you select is a function of multiple qualities about your
item. Is demand steady, cyclical or sporadic? Are there seasonal trends? Are trends
strong or limited? Is the item new? Each item being forecast has a somewhat unique
history (and future), and therefore an optimal method. A method that accurately forecasts
one data set might prove inaccurate for another.
Determining the optimal forecast method is a rather complex science, especially
across a large product line. This may be nearly impossible using only spreadsheets.
However, sophisticated forecasting software can within seconds test multiple methods
for each item to determine which method will give you the most accurate results.
Specific Forecasting Methods
1. Moving Averages
2. Exponential Smoothing
3. Regression Analysis Models
4. Hybrid Forecasting Methods
1. Moving Averages
Moving average methods take the average of past actuals and project it forward.
These methods assume that the recent past represents the future. As a result,
they work best for products with relatively little change — steady demand, no
seasonality, limited trends or cycles and no significant demand shifts. Many
companies apply this method because it is simple and easy to use. However,
since few products actually behave in this way, it tends to be less useful than
more specialized methods.
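A minimal R sketch of a trailing 3-period moving average, using made-up monthly demand figures:

demand <- c(120, 132, 128, 140, 135, 142, 138, 150)   # hypothetical actuals
ma3 <- stats::filter(demand, rep(1/3, 3), sides = 1)  # trailing 3-period average
ma3   # the last value serves as the forecast for the next period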
2. Exponential Smoothing
Exponential smoothing is a more advanced form of time series forecasting. Unlike
moving averages, exponential smoothing methods can capture trends and recurring
patterns. They accomplish this by:
Emphasizing the more recent data (as opposed to a moving average which
weights all data equally), and
Smoothing out fluctuations, which are often caused by pure randomness in
the data (or “noise” in the system).
Forecasters determine the forecast weights, controlling how fast or slow the model
responds to demand changes in your actuals. Not all exponential smoothing
methods can handle seasonality or other recurring patterns.
Exponential smoothing forecasting methods include:
(i) Simple exponential smoothing
(ii) Holt’s linear method
(iii) Winters’ multiplicative season
(iv) Winters’ additive season
Several informal methods used in causal forecasting do not rely solely on the
output of mathematical algorithms, but instead use the judgment of the forecaster.
Some forecasts take account of past relationships between variables: if one variable
has, for example, been approximately linearly related to another for a long period of
time, it may be appropriate to extrapolate such a relationship into the future, without
necessarily understanding the reasons for the relationship.
Causal methods include
Regression analysis includes a large group of methods for predicting future
values of a variable using information about other variables. These methods
include both parametric (linear or non-linear) and non-parametric
techniques.
Autoregressive moving average with exogenous inputs
Judgmental Methods
Judgmental forecasting methods incorporate intuitive judgement, opinions and subjective probability estimates. Judgmental forecasting is used in cases where there is a lack of historical data or during completely new and unique market conditions.
Artificial Intelligence Methods
Artificial neural networks
Group method of data handling
Support vector machines
Often these are done today by specialized programs loosely labelled:
Data mining
Machine learning
Pattern recognition
Other methods
Simulation
Prediction market
Probabilistic forecasting and Ensemble forecasting.
The modeling phase and the evaluation phase are coupled. They can be
repeated several times to change parameters until optimal values are achieved.
When the final modeling phase is completed, a model of high quality has been
built.
Evaluation : Data mining experts evaluate the model. If the model does not
satisfy their expectations, they go back to the modeling phase and rebuild the
model by changing its parameters until optimal values are achieved. When they
are finally satisfied with the model, they can extract business explanations and
evaluate the following questions:
Does the model achieve the business objective?
Have all business issues been considered?
At the end of the evaluation phase, the data mining experts decide how to use
the data mining results.
5. Deployment
Data mining experts use the mining results by exporting the results into database
tables or into other applications, for example, spreadsheets.
The Intelligent Miner products assist you to follow this process. You can
apply the functions of the Intelligent Miner products independently, iteratively, or
in combination.
The following figure shows the phases of the Cross Industry Standard Process for Data Mining (CRISP-DM) process model.
IM Modeling helps you to select the input data, explore the data, transform the
data, and mine the data. With IM Visualization you can display the data mining results
to analyze and interpret them. With IM Scoring, you can apply the model that you have
created with IM Modeling.
Ans :
Scope of Data Mining
1. Data mining processes the work in such a manner that it allows businesses to be more proactive and to grow substantially.
2. It optimizes large databases within a short time and works as business intelligence, which is more vital to organizational growth.
3. It represents the data in some logical order, or maybe in a pattern, to identify the sequential way of processing of data.
4. It includes a tree-shaped structure to understand the hierarchy of data and the representation of the set of information described in the database.
5. It brings a generic way of classification of different sets of data items to view the data at a quick glance.
4. Association rules: This data mining technique helps to find the association
between two or more items. It discovers a hidden pattern in the data set.
Ans :
Benefits of Data Mining
It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.
Pros
Since an external scoring engine performs the scoring calculation, model
complexity and performance is hidden within the scoring engine. Thus, the scoring
process does not require any database resources and does not impact other
business intelligence work.
At run time, data is simply read from the database without having to calculate
the score on the fly. Scoring on the fly can slow analysis especially if millions of
scores are involved.
MicroStrategy can use this approach by just creating metrics or attributes for the
scored data.
Cons
This approach requires database space and the support of a database
administrator.
New records that are inserted after the batch scoring are not scored.
Updating the model or scores requires more database and database administrator
overhead.
In many companies, adding or updating information in the enterprise data
warehouse is not done easily or whenever desired. The cross functional effort
required to score the database limits the frequency of scoring and prevents the
vast majority of users from trying new models or changing existing ones.
This approach is really no different than adding other entities to a MicroStrategy
project. For more information, see the Project Design Guide.
ii) Database does the scoring
In this approach, data mining features of the database system are used to perform
the scoring. Nearly all major databases have the ability to score data mining models.
The most common approach persists the model in the database and then generates
scores by using extensions to the SQL queries processed by the database to invoke the
model. A key feature of this approach is that the model can be scored in a system that
is different from the data mining tool that developed the model.
The model can be saved in the database as a Predictive Model Markup Language
(PMML) object, or, less frequently, in some form of executable code. For more
information on PMML, see PMML overview.
Persisting the model in this way is possible since the sophisticated algorithms needed to create the model are not required to score it. Scoring simply involves mathematical calculations on a set of inputs to generate a result.
The ability to represent the model and score it outside of the model creation tool
is relatively new, but more companies are adopting this approach. Its advantages
and disadvantages are described below.
Pros
Scores can be calculated on the fly even if new records are added.
Updating the model is easier than in the Score the database option.
This approach requires less database space than the score the database option.
When the database supports accessing its data mining features via SQL, MicroStrategy can take advantage of this approach using its SQL Engine.
iii) MicroStrategy does the scoring
In this approach, predictive models are applied from within the Business
Intelligence platform environment, without requiring support from the database
and from database administrators to implement data mining models. This direct
approach reduces the time required, the potential for data inconsistencies, and
cross-departmental dependencies.
Pros
Scores can be done on the fly even if new records are added.
This approach does not require database space or support from a database
administrator.
MicroStrategy can take advantage of this approach by using the Analytical Engine.
Cons
This approach does not take advantage of the database data mining features.
Predictor inputs need to be passed from the database to Intelligence Server. For
large result sets, databases typically handle data operations more efficiently than
moving data to MicroStrategy and scoring it there.
Ans :
Data exploration is an informative search used by data consumers to form true
analysis from the information gathered. Often, data is gathered in a non-rigid or controlled
manner in large bulks. For true analysis, this unorganized bulk of data needs to be
narrowed down. This is where data exploration is used to analyze the data and
information from the data to form further analysis.
Data often converges in a central warehouse called a data warehouse. This data
can come from various sources using various formats. Relevant data is needed for
tasks such as statistical reporting, trend spotting and pattern spotting. Data exploration
is the process of gathering such relevant data.
2. Univariate Analysis
At this stage, we explore variables one by one. The method used to perform univariate analysis depends on whether the variable type is categorical or continuous. Let's look at these methods and statistical measures for categorical and continuous variables individually:
(i) Continuous Variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods, as shown below:
Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will look at methods to handle missing and outlier values. To know more about these methods, you can refer to the Descriptive Statistics course from Udacity.
(ii) Categorical Variables: For categorical variables, we'll use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as a visualization.
Bi-variate Analysis
Bi-variate Analysis finds out the relationship between two variables. Here, we
look for association and disassociation between variables at a pre-defined significance
level. We can perform bi-variate analysis for any combination of categorical and
continuous variables. The combination can be: Categorical & Categorical, Categorical
& Continuous and Continuous & Continuous. Different methods are used to tackle
these combinations during analysis process.
Let’s understand the possible combinations in detail:
Continuous and Continuous
While doing bi-variate analysis between two continuous variables, we should
look at scatter plot. It is a nifty way to find out the relationship between two variables.
The pattern of scatter plot indicates the relationship between variables. The relationship
can be linear or non-linear.
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation. Correlation varies between –1 and +1.
–1 : perfect negative linear correlation
+1 : perfect positive linear correlation
0 : no correlation
Correlation can be derived using following formula:
Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
Various tools have functions or functionality to identify the correlation between variables. In Excel, the function CORREL() is used to return the correlation between two variables, and SAS uses the procedure PROC CORR to identify the correlation. These functions return the Pearson correlation value to identify the relationship between two variables:
X 65 72 78 65 72 70 65 68
Y 72 69 79 69 84 75 60 73
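These X and Y values can be checked in R, where cor() plays the role of Excel's CORREL() and the second line reproduces the covariance formula given above:

X <- c(65, 72, 78, 65, 72, 70, 65, 68)
Y <- c(72, 69, 79, 69, 84, 75, 60, 73)
cor(X, Y)                          # Pearson correlation
cov(X, Y) / sqrt(var(X) * var(Y))  # same value from the formula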
Chi-Square Test
This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference
between the expected and observed frequencies in one or more categories in the two-
way table. It returns the probability for the computed chi-square statistic with the given degrees of freedom.
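A minimal R sketch of the test on a hypothetical 2 x 2 two-way table (the counts are made up):

observed <- matrix(c(30, 20, 25, 25), nrow = 2,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Prefers = c("Yes", "No")))
chisq.test(observed)   # tests independence of the two categorical variables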
Z-Test
Z = (X̄1 – X̄2) / √(S1²/n1 + S2²/n2)
t-Test
t = (X̄1 – X̄2) / √(S² (1/N1 + 1/N2))
Where
X̄1, X̄2 : Averages
S1², S2² : Variances (S² is the pooled variance)
N1, N2 : Counts
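A minimal R sketch of this pooled-variance two-sample t-test; var.equal = TRUE matches the pooled S² in the formula above, and the sample values are made up:

group1 <- c(65, 72, 78, 65, 72)
group2 <- c(70, 65, 68, 72, 69)
t.test(group1, group2, var.equal = TRUE)   # compares the two averages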
ANOVA
It assesses whether the average of more than two groups is statistically different.
Example
Suppose, we want to test the effect of five different exercises. For this, we recruit
20 men and assign one type of exercise to 4 men (5 groups). Their weights are recorded
after a few weeks. We need to find out whether the effect of these exercises on them is
significantly different or not. This can be done by comparing the weights of the 5 groups
of 4 men each.
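A minimal R sketch of this one-way ANOVA, with simulated (hypothetical) weights for the 5 groups of 4 men each:

set.seed(1)
exercise <- factor(rep(paste0("E", 1:5), each = 4))  # 5 exercises, 4 men each
weight <- rnorm(20, mean = 70, sd = 5)               # hypothetical weights
summary(aov(weight ~ exercise))  # F-test: are the group means different?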
So far, we have understood the first three stages of Data Exploration: Variable Identification, Univariate and Bivariate analysis. We also looked at various statistical and visual methods to identify the relationship between variables.
Now, we will look at the methods of Missing values Treatment. More importantly,
we will also look at why missing values occur in our data and why treating them is
necessary.
b) Nonparametric
Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Regression and Log-Linear Models
Regression and log-linear models can be used to approximate the given
data.
In (simple) linear regression, the data are modeled to fit a straight line.
Multiple linear regression is an extension of (simple) linear regression, which
allows a response variable y to be modeled as a linear function of two or
more predictor variables.
Log-linear models approximate discrete multidimensional probability
distributions.
Log-linear models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
This allows a higher-dimensional data space to be constructed from lower
dimensional spaces.
Log-linear models are therefore also useful for dimensionality reduction and
data smoothing
Regression and log-linear models can both be used on sparse data, although
their application may be limited.
While both methods can handle skewed data, regression does exceptionally
well. Regression can be computationally intensive when applied to high
dimensional data, whereas log-linear models show good scalability for up
to 10 or so dimensions.
iii) Cardinality Reduction
Transformations applied to obtain a reduced representation of the original data.
The term cardinality refers to the uniqueness of data values contained in a
particular column (attribute) of a database table. The lower the cardinality, the
more duplicated elements in a column. Thus, a column with the lowest possible
cardinality would have the same value for every row. SQL databases use
cardinality to help determine the optimal query plan for a given query.
A classification task begins with a data set in which the class assignments
are known. For example, a classification model that predicts credit risk could
be developed based on observed data for many loan applicants over a period
of time.
In addition to the historical credit rating, the data might track employment
history, home ownership or rental, years of residence, number and type of
investments, and so on.
Credit rating would be the target, the other attributes would be the predictors,
and the data for each customer would constitute a case.
The simplest type of classification problem is binary classification. In binary
classification, the target attribute has only two possible values:
For example, high credit rating or low credit rating. Multi-class targets have more than two values: for example, low, medium, high, or unknown credit rating.
In the model build (training) process, a classification algorithm finds
relationships between the values of the predictors and the values of the target.
Different classification algorithms use different techniques for finding
relationships. These relationships are summarized in a model, which can
then be applied to a different data set in which the class assignments are
unknown.
Classification models are tested by comparing the predicted values to known
target values in a set of test data. The historical data for a classification
project is typically divided into two data sets: one for building the model; the
other for testing the model.
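As a sketch of the build/test split described above, the following R code fits a logistic regression (one simple binary classification algorithm) on simulated data; all variable names and figures are hypothetical:

set.seed(42)
df <- data.frame(income = rnorm(200, 50, 12), years = rpois(200, 5))
df$high <- rbinom(200, 1, plogis(-6 + 0.08 * df$income + 0.3 * df$years))
train <- df[1:150, ]    # one data set for building (training) the model
test  <- df[151:200, ]  # the other for testing it
model <- glm(high ~ income + years, family = binomial, data = train)
pred <- as.numeric(predict(model, test, type = "response") > 0.5)
table(Predicted = pred, Actual = test$high)   # predicted vs. known values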
Q24. Explain various issues relating to data classification.
Ans :
The major issue is preparing the data for Classification and Prediction. Preparing
the data involves the following activities:
Data Cleaning: Data cleaning involves removing the noise and treatment of
missing values. The noise is removed by applying smoothing techniques and the
problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
Support determines how often a rule is applicable to the data set, while confidence determines how frequently items in Y appear in transactions that contain X.
Use packages like arules, arulesCBA and arulesSequences in R.
Ex :
library("arules")    # association rule mining package
data("Adult")        # built-in transactions data set
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules"))
inspect(head(rules)) # view the first few mined rules
These lagged relationships signify the time lag between the cause–effect
parameters. Identifying lagged relationships between socioeconomic processes
is challenging due to the presence of various complex dependencies in the data.
This dependency among the various parameters has enabled us to identify
relationships among different domain parameters in time series data.
The cause–effect relationship for time series prediction is a step towards extracting
the various existing causal relations between different domains, such as employment, education, agriculture and rural development.
It has also emerged in economics and social sciences such as to improve the
economic development and growth of a country and to study the impact of climate
change.
Q27. Explain the process of cause and effect analysis.
Ans :
The following are the steps to solve a problem with Cause and Effect Analysis:
Step 1: Identify the Problem
First, write down the exact problem you face. Where appropriate, identify who is
involved, what the problem is, and when and where it occurs.
Then, write the problem in a box on the left-hand side of a large sheet of paper,
and draw a line across the paper horizontally from the box. This arrangement, looking
like the head and spine of a fish, gives you space to develop ideas.
Example: In this simple example, a manager is having problems with an
uncooperative branch office.
Step 2: Work Out the Major Factors Involved
Next, identify the factors that may be part of the problem. These may be systems,
equipment, materials, external forces, people involved with the problem, and so on.
Try to draw out as many of these as possible. As a starting point, you can use
models such as the McKinsey 7S Framework (which offers you Strategy, Structure,
Systems, Shared Values, Skills, Style and Staff as factors that you can consider) or the
4Ps of Marketing (which offers Product, Place, Price, and Promotion as possible factors).
Brainstorm any other factors that may affect the situation.
Then draw a line off the “spine” of the diagram for each factor, and label each
line.
Regression test generates lines of regression of the two variables which helps in
estimating the values. Lines of regression of y on x is the line which gives the best
estimate for the value of y for any specified value of x. Similarly, line of regression
of x on y is the line which gives the best estimate for the value of x for any
specified value of y.
3. Limitations of Regression Analysis
Ans :
Some of the limitations of regression analysis are as follows:
i) Regression analysis assumes that a linear relationship exists among the related variables. But in the area of social sciences, a linear relationship may not exist among the related variables.
ii) When regression analysis is used to evaluate the value of the dependent variable based on the independent variable, it is assumed that static conditions of relationship exist between them. These static conditions do not exist in social sciences, so this assumption minimizes the use of regression analysis in social science.
iii) The value of the dependent variable can be evaluated based on the independent variable by using regression analysis, but only up to some limits. If the circumstances go beyond these limits, the results would be inaccurate.
4. Moving Averages
Ans :
Moving average methods take the average of past actuals and project it forward.
These methods assume that the recent past represents the future. As a result, they work
best for products with relatively little change - steady demand, no seasonality, limited
trends or cycles and no significant demand shifts. Many companies apply this method
because it is simple and easy to use. However, since few products actually behave in
this way, it tends to be less useful than more specialized methods.
5. Data Mining
Ans :
Data mining is the process of discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database systems. Data
mining is an interdisciplinary subfield of computer science and statistics with an overall
goal to extract information (with intelligent methods) from a data set and transform the
information into a comprehensible structure for further use.
Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
The difference between data analysis and data mining is that data analysis is to
summarize the history such as analyzing the effectiveness of a marketing
campaign, in contrast, data mining focuses on using specific machine learning
and statistical models to predict the future and discover the patterns among data.
The term “data mining” is in fact a misnomer, because the goal is the extraction
of patterns and knowledge from large amounts of data, not the extraction (mining)
of data itself. It also is a buzzword and is frequently applied to any form of large-
scale data or information processing (collection, extraction, warehousing, analysis,
and statistics) as well as any application of computer decision support system,
including artificial intelligence (e.g., machine learning) and business intelligence.
6. Scope of Data Mining
Ans :
i) Data mining processes the work in such a manner that it allows businesses to be more proactive and to grow substantially.
ii) It optimizes large databases within a short time and works as business intelligence, which is more vital to organizational growth.
iii) It represents the data in some logical order, or maybe in a pattern, to identify the sequential way of processing of data.
iv) It includes a tree-shaped structure to understand the hierarchy of data and the representation of the set of information described in the database.
v) It brings a generic way of classification of different sets of data items to view the data at a quick glance.
7. Benefits of Data Mining
Ans :
Data mining technique helps companies to get knowledge-based information.
It helps organizations to make the profitable adjustments in operation and
production.
It is a cost-effective and efficient solution compared to other statistical data
applications.
It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.
Ans :
There are chances of companies selling useful information about their customers to other companies for money. For example, American Express has sold the credit card purchases of their customers to other companies.
Much data mining analytics software is difficult to operate and requires advanced training to work on.
Different data mining tools work in different manners due to the different algorithms employed in their design. Therefore, the selection of the correct data mining tool is a very difficult task.
The data mining techniques are not always accurate. Hence, they can cause serious consequences in certain conditions.
Ans :
Data exploration is an informative search used by data consumers to form true
analysis from the information gathered. Often, data is gathered in a non-rigid or controlled
manner in large bulks. For true analysis, this unorganized bulk of data needs to be
narrowed down. This is where data exploration is used to analyze the data and
information from the data to form further analysis.
Data often converges in a central warehouse called a data warehouse. This data
can come from various sources using various formats. Relevant data is needed for
tasks such as statistical reporting, trend spotting and pattern spotting. Data exploration
is the process of gathering such relevant data.
7. If one item is fixed and unchangeable and the other item varies, the correlation
coefficient will be: [c]
(a) Positive (b) Negative
(c) Zero (d) Undecided
8. A process by which we estimate the value of dependent variable on the basis of
one or more independent variables is called: [b]
(a) Correlation (b) Regression
(c) Residual (d) Slope
9. The slope of the regression line of Y on X is also called the: [d]
(a) Correlation coefficient of X on Y
(b) Correlation coefficient of Y on X
(c) Regression coefficient of X on Y
(d) Regression coefficient of Y on X
10. In simple linear regression, the numbers of unknown constants are: [b]
(a) One (b) Two
(c) Three (d) Four
UNIT - IV
PRESCRIPTIVE ANALYTICS
By a linear form is meant a mathematical expression of the type a1x1 + a2x2 + ... + anxn, where a1, a2, ..., an are constants and x1, x2, ..., xn are variables. The term 'Programming' refers to the process of determining a particular programme or plan of action. So Linear Programming (L.P.) is one of the most important optimization (maximization/minimization) techniques developed in the field of Operations Research.
3. Once a basic plan is arrived at through LP, it can be reevaluated for changing
conditions.
4. The highlighting of bottlenecks in the production process is a striking advantage of this technique.
5. It provides flexibility in analyzing a variety of multidimensional problems.
Limitations of LP
In spite of its wide area of applications, some limitations are associated with linear programming techniques. These are stated below:
1. In some problems, objective functions and constraints are not linear. Generally, in real-life situations concerning business and industrial problems, constraints are not linearly related to the variables.
2. There is no guarantee of getting integer-valued solutions; for example, in finding out how many men and machines would be required to perform a particular job, rounding off the solution to the nearest integer will not give an optimal solution. Integer programming deals with such problems.
3. Linear programming model does not take into consideration the effect of time
and uncertainty. Thus the model should be defined in such a way that any change
due to internal as well as external factors can be incorporated.
4. Sometimes large-scale problems cannot be solved with linear programming
techniques even when the computer facility is available. Such difficulty may be
removed by decomposing the main problem into several small problems and
then solving them separately.
5. Parameters appearing in the model are assumed to be constant. But in real-life situations, they are neither constant nor deterministic.
6. Linear programming deals with only a single objective, whereas in real-life situations problems come with multiple objectives.
Q3. State the assumptions and applications of LPP.
Ans :
Assumptions of LPP
1. Proportionality
A primary requirement of linear programming problem is that the objective function
and every constraint function must be linear. Roughly speaking, it simply means
that if 1 kg of a product costs Rs. 2, then 10 kg will cost Rs. 20. If a steel mill can
produce 200 tons in 1 hour, it can produce 1000 tons in 5 hours.
Intuitively, linearity implies that products of variables such as x1x2, powers of variables such as x1², and combinations of variables such as a1x1 + a2 log x2, are not allowed.
2. Additivity
Additivity may not hold in general. If we mix several liquids of different chemical composition, then the total volume of the mixture may not be the sum of the volumes of the individual liquids.
3. Multiplicativity
It requires :
(a) If it takes one hour to make a single item on a given machine, it will take 10 hours to make 10 such items.
(b) The total profit from selling a given number of units is the unit profit times
the number of units sold.
4. Divisibility
It means that the fractional levels of variables must be permissible besides integral
values.
5. Deterministic
All the parameters in the linear programming model are assumed to be known exactly, while in actual practice production may depend upon chance also.
Applications of LPP
1. Assignment Problem
Suppose we are given m persons, n jobs, and the expected productivity cij of the ith person on the jth job. We want to find an assignment of persons to jobs, xij ≥ 0 for all i and j, so that the average productivity of the persons assigned is maximum.
2. Transportation Problem
We suppose that m factories (called sources) supply n warehouses (called
destinations) with a certain product. Factory Fi (i = 1, 2, ..., m) produces ai units
(total or per unit time) and warehouse Wj (j = 1, 2, 3 ..., n) requires bj units. Let
the decision variables xij, be the amount shipped from factory Fi to warehouse
Wj. The objective is to determine the number of units xij transported from each factory Fi to each warehouse Wj so that the total transportation cost, summed over all i = 1, ..., m and j = 1, ..., n, is minimized.
8. Physical Distribution
Linear programming determines the most economic and efficient manner of
locating manufacturing plants and distribution centres for physical distribution.
Q4. What are the requirements of linear programming problem ?
Ans :
1. Decision variables and their relationship
The decision (activity) variables refer to candidates (products, services, projects, etc.) that are competing with one another for sharing the given limited resources. These variables are usually inter-related in terms of utilization of resources and need simultaneous solutions. The relationship among these variables should be linear.
2. Well-defined objective function
A linear programming problem must have a clearly defined objective function to optimize: it may be to maximize contribution by utilizing available resources, or to produce at the lowest possible cost by using a limited amount of productive factors. It should be expressed as a linear function of the decision variables.
3. Presence of constraints or restrictions
There must be limitations on resources (like production capacity, manpower,
time, machines, markets, etc.) which are to be allocated among various competing
activities. These must be capable of being expressed as linear equalities or
inequalities in terms of decision variables.
4. Alternative courses of action
There must be alternative courses of action. For example, it must be possible to
make a selection between various combinations of the productive factors such
as men, machines, materials, markets, etc.
5. Non-negative restrictions
All decision variables must take non-negative values, since negative values of physical quantities are meaningless.
Example
Rahul and Co. manufactures two brands of products, namely Shivnath and Harinath. Both these models have to undergo operations on three machines: lathe, milling and grinding. Each unit of Shivnath gives a profit of Rs. 45 and requires 2 hours on lathe, 3 hours on milling and 1 hour on grinding. Each unit of Harinath gives a profit of Rs. 70 and requires 3, 5 and 4 hours on lathe, milling and grinding respectively. Due to prior commitments, the use of the lathe is restricted to a maximum of 70 hours in a week. The operators of the milling machines are hired for 110 hours/week. Due to scarce availability of skilled manpower for the grinding machine, the grinding hours are limited to 100 hours/week. Formulate the data into an LPP.
Sol :
Step 1 : Selection of Variables
In the above problem, we can observe that the decision is to be taken on how
many products of each brand is to be manufactured. Hence the quantities of products
to be produced per week are the decision variables.
Therefore we assume that the number of units of product Shivnath brand produced
per week = x1.
The number of units of product of Harinath brand produced per week = x2.
Step 2 : Setting Objective
In the given problem the profits on the brands are given.
Therefore objective function is to maximize the profits.
Now, the profit on each unit of Shivnath brand = Rs. 45.
Number of units of Shivnath to be manufactured = x1
The profit on x1 units of Shivnath brand = 45 x1
Similarly, the profit on each unit of Harinath brand = Rs. 70
Number of units of Harinath brand to be manufactured = x2
The profit on x2 units of Harinath brand = 70 x2
The total profit on both brands = 45 x1 + 70 x2
This total profit (say z) is to be maximized
Hence, the objective function is to Maximize z = 45x1 + 70x2
Step 5 : Summary
Maximize Z = 45x1 + 70x2
Subject to 2x1 + 3x2 ≤ 70
3x1 + 5x2 ≤ 110
x1 + 4x2 ≤ 100
x1 ≥ 0 and x2 ≥ 0.
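Such a formulation can be checked numerically. Below is a minimal sketch using the lpSolve package (the choice of package is our assumption; any LP solver would do) to solve the above LPP in R:
# Solve the Rahul and Co. LPP with the lpSolve package
library(lpSolve)
obj <- c(45, 70)                  # profit per unit of Shivnath, Harinath
con <- matrix(c(2, 3,             # lathe hours
                3, 5,             # milling hours
                1, 4),            # grinding hours
              nrow = 3, byrow = TRUE)
sol <- lp(direction = "max", objective.in = obj, const.mat = con,
          const.dir = rep("<=", 3), const.rhs = c(70, 110, 100))
sol$solution                      # optimal x1 and x2: 20 10
sol$objval                        # maximum profit z: 1600
Note that lp() enforces non-negativity of the decision variables by default, so x1 ≥ 0 and x2 ≥ 0 need not be added explicitly.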
Q5. Describe the steps involved in graphical solution to linear
programming models.
Ans :
Simple linear programming problems of two decision variables can be easily
solved by graphical method. The outlines of graphical procedure are as follows :
Step 1 : Consider each inequality-constraint as equation.
Step 2 : Plot each equation on the graph, as each one will geometrically represent a
straight line.
Step 3 : Shade the feasible region. Every point on the line will satisfy the equation of the line. If the inequality-constraint corresponding to that line is '≤', then the region below the line lying in the first quadrant (due to non-negativity of variables) is shaded. For an inequality-constraint with the '≥' sign, the region above the line in the first quadrant is shaded. The points lying in the common region satisfy all the constraints simultaneously. The common region thus obtained is called the feasible region.
Step 4 : Choose a convenient value of z (say z = 0) and plot the objective function line.
Step 5 : Move the objective function line parallel to itself towards the extreme points of the feasible region. In the maximization case, this line will stop farthest from the origin, passing through at least one corner of the feasible region. In the minimization case, this line will stop nearest to the origin, passing through at least one corner of the feasible region.
Step 6 : Read the coordinates of the extreme point(s) selected in step 5 and find the
maximum or minimum (as the case may be) value of z.
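For the Rahul and Co. example formulated above, the corner points of the feasible region can be evaluated directly. The coordinates below are our own computation (by intersecting the binding constraint lines), not taken from the text:
# Evaluate z = 45x1 + 70x2 at the corner points of the feasible region
corners <- rbind(c(0, 0), c(35, 0), c(20, 10), c(0, 22))
z <- 45 * corners[, 1] + 70 * corners[, 2]
cbind(corners, z)   # z is largest (1600) at the corner x1 = 20, x2 = 10
This is consistent with Step 5: the objective line leaves the feasible region last at the corner where the lathe and milling constraints intersect.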
Step 7: Check whether all the values of Zj – Cj are positive. If all are positive, the
optimal solution is reached. Write the solution values and find Zopt. (i.e., Zmax
or Zmin as the case may be).
If some Zj – Cj values are still negative, again choose the most negative among them, go to step 5, and repeat the iteration till all the values of Zj – Cj become positive.
[Flowchart of the simplex procedure: Start → rewrite the objective function using decision, slack/surplus and artificial variables → if Zj – Cj < 0 for any column, continue iterating; if all the minimum-ratio values are negative or infinity, the solution is unbounded; otherwise the optimal solution is obtained — write the solution values, find Zmax, and stop.]
The firm's profit from producing and selling x units is the sales revenue xp(x) minus the production costs, i.e., P(x) = xp(x) – cx.
If each of the firm's products has a similar profit function, say Pj(xj) for producing and selling xj units of product j, then the overall objective function is the sum f(x1, x2, ..., xn) = P1(x1) + P2(x2) + ... + Pn(xn).
The obtained optimum is tested for being an integer solution. If it is not, there is
guaranteed to exist a linear inequality that separates the optimum from the convex
hull of the true feasible set.
Finding such an inequality is the separation problem, and such an inequality is a
cut. A cut can be added to the relaxed linear program.
Then, the current non-integer solution is no longer feasible to the relaxation. This
process is repeated until an optimal integer solution is found.
We start by solving the LP relaxation to get a lower bound for the minimum
objective value.
We assume the final simplex tableau is given, the basic variables having columns
with coefficient 1 in one constraint row and 0 in the other rows. The solution can be read from this form: when the non-basic variables are 0, the basic variables take the values on the right-hand side (RHS). The objective function row is of the same form, with its basic variable f.
If the LP solution is fractional, i.e., not integer, at least one of the RHS values is
fractional. We proceed by appending to the model a constraint that cuts away a part of
the feasible set so that no integer solutions are lost.
Take a row i from the final simplex tableau, with a fractional RHS d. Denote by xj0 the basic variable of this row and by N the index set of the non-basic variables.
Row i as an equation:
xj0 + Σj∈N wij xj = d
Denote by ⌊d⌋ the largest integer not exceeding d (the whole part of d, if d is positive). Because all variables are non-negative,
Σj∈N ⌊wij⌋ xj ≤ Σj∈N wij xj
so that
xj0 + Σj∈N ⌊wij⌋ xj ≤ d
If we denote the fractional parts by r = d – ⌊d⌋ and fij = wij – ⌊wij⌋, we get a cut constraint, or a cutting plane, in the solution space:
Σj∈N fij xj ≥ r
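As a small numerical sketch of the fractional-part computation (the tableau row below is hypothetical, chosen only for illustration):
# Hypothetical tableau row: xB + 0.5*x3 + 1.25*x4 = 3.75, fractional RHS
w <- c(x3 = 0.5, x4 = 1.25)   # coefficients of the non-basic variables
d <- 3.75                     # fractional right-hand side
f <- w - floor(w)             # fractional parts fij = wij - floor(wij)
r <- d - floor(d)             # r = d - floor(d) = 0.75
f; r                          # the cut is 0.5*x3 + 0.25*x4 >= 0.75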
2. Localization method
Goal: find a point in a convex set C described by a cutting-plane oracle.
Algorithm: choose a bounded set P0 containing C; repeat for k = 1, 2, ...:
Choose a point x(k) in Pk–1 and query the cutting-plane oracle at x(k). If the oracle reports that x(k) ∈ C, quit; otherwise add the returned cutting plane to Pk–1 to form Pk.
Terminate if Pk = ∅ (then C is empty).
Since this is associated with the worst possible outcome, a decrease in sales worth Rs. 3,00,000, the optimal course of action (or strategy) obtained by applying the maximin criterion is S3.
b) Including an extra row representing the maximum payoff associated with each course of action and then applying the maximax criterion, the optimal course of action is S1, since it is associated with the maximum outcome of Rs. 7,00,000, as shown in table (2).
c) The minimum value among the maximum regrets, as shown in table (3), is zero, and this corresponds to course of action S1.
[Table (3): maximum regret under each course of action S1, S2, S3.]
d) Here it is assumed that each state of nature has a probability of occurrence equal to 1/3. Therefore, expected returns can be obtained as shown in table (4).
[Table (4): expected return for each course of action.]
Thus, the Laplace criterion suggests that the executive should choose the strategy with the highest expected return in table (4).
Q. Explain about decision making under risk.
Ans :
Decision-making under risk assumes that the long-run relative frequencies of occurrence of the states of nature are given, and besides this it also enumerates several states of nature. The states-of-nature information is probabilistic in nature, i.e., the decision maker cannot predict which outcome will occur as a result of selecting a particular
course of action. As each course of action results in more than one outcome, it is not
easy to calculate the exact monetary payoffs or outcomes for the various combination
of courses of action and states of nature.
The decision maker with the help of the past records or experience assigns
probabilities to the likely possible occurrence of each state of nature. Once the probability
distribution of the states of nature is known, then the best course of action must be
selected which yields the highest expected payoffs.
The most widely used criterion for evaluating alternative courses of action is the Expected Monetary Value (EMV), which is also called expected utility. The objective of decision-making under this condition is to optimize the expected payoff.
Example
An electrical manufacturing company has seen its business expand to the point where it needs to increase production beyond its existing capacity. It has narrowed the alternatives to two approaches to increase the maximum production capacity: (a) expansion, at a cost of Rs. 8 million, or (b) modernization, at a cost of Rs. 5 million. Both approaches would require the same amount of time for implementation. Management believes that over the required payback period, demand will either be high or moderate. Since high demand is considered to be somewhat less likely than moderate demand, the probability of high demand has been set at 0.35.
If the demand is high, expansion would gross an estimated additional
Rs.12 million but modernization only an additional Rs. 6 million, due to
lower maximum production capability. On the other hand, if the demand
is moderate, the comparable figures would be Rs. 7 million for expansion
and Rs. 5 million for modernization.
a) Calculate conditional profit in relation to various action and
outcome combinations and states of nature.
b) If company wishes to maximize its expected monetary value,
then it should modernize or expand?
c) Calculate the EVPI.
d) Construct the conditional opportunity loss table and also
calculate EOL.
Sol :
a) Defining the states of nature or outcomes (over which the company has no control) and the courses of action (the company's possible decisions).
Let,
States of nature : O1 = High demand, O2 = Moderate demand
b) The payoff table (1) can be rewritten as follows along with the given probabilities of the states of nature (conditional profits in Rs. million):
State of Nature Oj     Probability P(Oj)   S1 (Expand)   S2 (Modernize)   Best payoff × P(Oj)
O1 (high demand)       0.35                4             1                4 × 0.35 = 1.40
O2 (moderate demand)   0.65                –1            0                0 × 0.65 = 0
EPPI                                                                      1.40
The optimal EMV* is Rs. 0.75 million corresponding to the course of action S1.
Then,
EVPI = EPPI – EMV (S1) = 1.40 – 0.75 = Rs. 0.65 million
Alternatively, if the company could get perfect information (a forecast) of demand (high or moderate), it should consider paying up to Rs. 0.65 million for that information. The expected value of perfect information gives an absolute upper bound on the amount that should be spent to get additional information on which a given decision is based.
d) The opportunity loss value are shown below.
State of Nature Oj   Probability P(Oj)   Conditional Profit (Rs. million)   Opportunity Loss (Rs. million)
                                         S1      S2                         S1      S2
O1 0.35 4 1 0 3
O2 0.65 –1 0 1 0
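The EMV, EPPI and EVPI figures above can be reproduced with a few lines of R (a sketch of the arithmetic, not part of the original solution):
# EMV / EVPI for the expansion-vs-modernization example
p <- c(0.35, 0.65)              # P(high demand), P(moderate demand)
payoff <- rbind(S1 = c(4, -1),  # expand:    12 - 8 and 7 - 8
                S2 = c(1,  0))  # modernize:  6 - 5 and 5 - 5
emv  <- payoff %*% p            # EMV of each course of action
eppi <- sum(apply(payoff, 2, max) * p)  # expected payoff with perfect information
evpi <- eppi - max(emv)         # expected value of perfect information
emv; eppi; evpi                 # EMV* = 0.75 (S1), EPPI = 1.40, EVPI = 0.65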
v) Parameters appearing in the model are assumed to be constant. But, in real-life situations they are neither constant nor deterministic.
vi) Linear programming deals with only a single objective, whereas real-life problems often involve multiple objectives.
UNIT - V BUSINESS ANALYTICS (OU)
PROGRAMMING USING R
UNIT V : R Environment, R packages, Reading and Writing data in R, R functions, Control Statements, Frames and Subsets, Managing and Manipulating data in R.
Statistical Features of R
1. R has some topical relevance
It is free, open-source software, available under the Free Software Foundation's GNU General Public License.
2. R has some statistical features
Basic Statistics : Mean, variance, median.
Static graphics : Basic plots, graphic maps.
Probability distributions : Beta, Binomial.
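For instance, the basic statistics listed above are one-line calls in R (the sample vector is invented for illustration):
x <- c(12, 7, 9, 15, 11)
mean(x); var(x); median(x)   # 10.8, 9.2, 11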
Programming Features of R
1. R has some topical relevance
Data inputs such as data type, importing data, keyboard typing.
Data Management such as data variables, operators.
2. R has some programming features
Distributed Computing – Distributed computing is an open source, high-
performance platform for the R language. It splits tasks between multiple
processing nodes to reduce execution time and analyze large datasets.
R packages – R packages are a collection of R functions, compiled
code and sample data. By default, R installs a set of packages during
installation.
Q2. Explain the basic tips for using R?
Ans :
R is command-line driven. It requires you to type or copy-and-paste commands
after a command prompt (>) that appears when you open R. After typing a
command in the R console and pressing Enter on your keyboard, the command
will run. If your command is not complete, R issues a continuation prompt
(signified by a plus sign: +). Alternatively you can write a script in the script
window, and select a command, and click the Run button.
R is case sensitive. Make sure your spelling and capitalization are correct.
Commands in R are also called functions. The basic format of a function in R is: function.name(argument, options).
The up arrow (↑) on your keyboard can be used to bring up previous commands that you've typed in the R console.
The $ symbol is used to select a particular column within a table (e.g., table$column).
Any text that you do not want R to act on (such as comments, notes, or instructions) needs to be preceded by the # symbol (a.k.a. hash-tag, comment, pound, or number symbol). R ignores the remainder of the script line following the # symbol.
For example: plot(x, y) # This text will not affect the plot function because of the comment.
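A quick illustration of the $ selector and the comment symbol together (the data frame is invented for the example):
df <- data.frame(height = c(150, 165, 172))
df$height          # '$' selects the height column
mean(df$height)    # everything after '#' is ignored by R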
Q3. What is the R environment?
Ans :
R is an integrated suite of software facilities for data manipulation, calculation
and graphical display. It includes
An effective data handling and storage facility,
A suite of operators for calculations on arrays, in particular matrices,
A large, coherent, integrated collection of intermediate tools for data analysis,
Graphical facilities for data analysis and display either on-screen or on hardcopy,
and
A well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and
coherent system, rather than an incremental accretion of very specific and inflexible
tools, as is frequently the case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to
add additional functionality by defining new functions. Much of the system is itself
written in the R dialect of S, which makes it easy for users to follow the algorithmic
choices made. For computationally-intensive tasks, C, C++ and Fortran code can be
linked and called at run time. Advanced users can write C code to manipulate R objects
directly.
Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended
(easily) via packages. There are about eight packages supplied with the R distribution
and many more are available through the CRAN family of Internet sites covering a very
wide range of modern statistics.
R has its own LaTeX-like documentation format, which is used to supply
comprehensive documentation, both on-line in a number of formats and in hardcopy.
Q4. Explain the various types of operators in R.
Ans :
1. Arithmetic Operators
2. Relational Operators
3. Logical Operators
4. Assignment Operators
5. Miscellaneous Operators
1. Arithmetic Operators
The following table shows the arithmetic operators supported by the R language. The operators act on each element of the vector.
Operator Description Example
+ Adds two vectors v <- c( 2, 5.5, 6)
t <- c(8, 3, 4)
print(v+t)
It produces the following result:
[1] 10.0 8.5 10.0
– Subtracts the second vector v <- c( 2, 5.5, 6)
from the first t <- c(8, 3, 4)
print(v-t)
It produces the following result:
[1] -6.0 2.5 2.0
2. Relational Operators
The following table shows the relational operators supported by the R language. Each element of the first vector is compared with the corresponding element of the second vector. The result of the comparison is a Boolean value.
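A small example of the element-wise comparison behaviour (the vectors are chosen for illustration):
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)
print(v > t)    # [1] FALSE  TRUE FALSE FALSE
print(v == t)   # [1] FALSE FALSE FALSE  TRUE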
3. Logical Operators
Each element of the first vector is compared with the corresponding element of the second vector. The result of the comparison is a Boolean value.
The logical operators && and || consider only the first element of each vector and give a vector of a single element as output.
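A short sketch (note that recent versions of R require the operands of && and || to have length one):
v <- c(TRUE, FALSE, TRUE)
t <- c(TRUE, TRUE, FALSE)
print(v & t)         # element-wise: [1]  TRUE FALSE FALSE
print(v[1] && t[1])  # single comparison: [1] TRUE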
5.2 R PACKAGE
Q6. What is an R package?
Ans :
R packages are a collection of R functions, compiled code and sample data. They are stored under a directory called "library" in the R environment. By default, R installs a set of packages during installation. More packages are added later, when they are needed for some specific purpose. When we start the R console, only the default packages are available. Other packages which are already installed have to be loaded explicitly to be used by the R program that is going to use them.
All the packages available in R language are listed at R Packages.
Below is a list of commands to be used to check, verify and use the R packages.
Check Available R Packages
Get library locations containing R packages
.libPaths()
When we execute the above code, it produces the following result. It may vary depending on the local settings of your PC.
install.packages("Package Name")
# Install the package named "XML".
install.packages("XML")
ii) Install a package manually
Go to the link R Packages to download the package needed. Save the package as a .zip file in a suitable location in the local system.
Now you can run the following command to install this package in the R environment.
install.packages(file_name_with_path, repos = NULL, type = "source")
# Install the package named "XML"
install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")
iii) Load Package to Library
Before a package can be used in the code, it must be loaded to the current R
environment. You also need to load a package that is already installed previously but
not available in the current environment.
A package is loaded using the following command:
library("package Name", lib.loc = "path to library")
# Load the package named "XML"
library("XML")
Writing Data in R
Following are a few functions for writing (exporting) data to files.
write.table() and write.csv() export data to a wide range of file formats including CSV and tab-delimited.
writeLines() writes text lines to a text-mode connection.
dump() takes a vector of names of R objects and produces text representations of the objects on a file (or connection). A dump file can usually be sourced into another R session.
dput() writes an ASCII text representation of an R object to a file (or connection) or uses one to recreate the object.
save() writes an external representation of R objects to the specified file.
Reading data files with read.table()
The read.table() function is one of the most commonly used functions for reading data into R. It has a few important arguments.
file, the name of a file, or a connection
header, logical indicating if the file has a header line
sep, a string indicating how the columns are separated
colClasses, a character vector indicating the class of each column in the data set
nrows, the number of rows in the dataset
comment.char, a character string indicating the comment character
skip, the number of lines to skip from the beginning
stringsAsFactors, should character variables be coded as factors?
read.table() and read.csv() Examples
> data <- read.table("mydata.txt")
> data <- read.table("D:\\datafiles\\mydata.txt")
> data <- read.csv("D:\\datafiles\\mydata.csv")
R will automatically skip lines that begin with a #, figure out how many rows there are (and how much memory needs to be allocated). R also figures out what type of variable is in each column of the table.
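A minimal round trip tying the reading and writing functions together (the file name is chosen for the example):
# Write a data frame to CSV, then read it back
df <- data.frame(id = 1:3, score = c(88, 92, 79))
write.csv(df, "scores.csv", row.names = FALSE)
back <- read.csv("scores.csv", header = TRUE, stringsAsFactors = FALSE)
str(back)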
5.4 R FUNCTIONS
Q9. What is R Function? Explain the components of R Functions.
Ans :
A function is a set of statements organized together to perform a specific task. R
has a large number of in-built functions and the user can create their own functions. In
R, a function is an object so the R interpreter is able to pass control to the function,
along with arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well
as any result which may be stored in other objects.
Definition
An R function is created by using the keyword function. The basic syntax of an
R function definition is as follows :
function_name <- function(arg_1, arg_2, ...) {
Function body
}
Components of R
The different parts of a function are:
Function Name: This is the actual name of the function. It is stored in R
environment as an object with this name.
Arguments: An argument is a placeholder. When a function is invoked, you
pass a value to the argument. Arguments are optional; that is, a function may
contain no arguments. Also arguments can have default values.
Function Body: The function body contains a collection of statements that
defines what the function does.
Return Value: The return value of a function is the last expression in the function
body to be evaluated.
R has many in-built functions which can be directly called in the program without defining them first. We can also create and use our own functions, referred to as user-defined functions.
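A small sketch of a user-defined function with a default argument (the function name is arbitrary):
# The last evaluated expression in the body is the return value
new.function <- function(a, b = 3) {
a * b
}
new.function(5)      # uses the default b = 3, returns 15
new.function(5, 4)   # returns 20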
Q10. Explain different types of functions?
Ans :
i) Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and
paste(...) etc. They are directly called by user written programs. You can refer most
widely used R functions.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find the mean of numbers from 25 to 82.
print(mean(25:82))
# Find the sum of numbers from 41 to 68.
print(sum(41:68))
When we execute the above code, it produces the following result:
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
UNIT - V BUSINESS ANALYTICS (OU)
[Flowchart of an if statement: the condition is tested; if it is true, the conditional code is executed; if it is false, the code is skipped.]
UNIT - V BUSINESS ANALYTICS (OU)
y <- c(8, 3, 2, 5)
if(any(y < 0)){
print("y contains negative numbers")
}
Q. Explain briefly about the if...else statement.
Ans :
The conditional if...else statement is used to test an expression similar to the if
statement. However, rather than nothing happening if the test_expression is FALSE,
the else part of the function will be evaluated.
# syntax of if...else statement
if (test_expression) {
statement 1
} else {
statement 2
}
The following extends the previous example illustrated for the if statement in
which the if statement tests if any values in a vector are negative; if TRUE it produces
one output and if FALSE it produces the else output.
# this test results in statement 1 being executed
x <- c(8, 3, -2, 5)
if(any(x < 0)){
print("x contains negative numbers")
} else{
print("x contains all positive numbers")
}
## [1] "x contains negative numbers"
# this test results in statement 2 (or the else statement) being executed
y <- c(8, 3, 2, 5)
x
## [1] "The year is 2010" "The year is 2011" "The year is 2012" "The year is 2013"
## [5] "The year is 2014" "The year is 2015" "The year is 2016"
Another example in which we create an empty matrix with 5 rows and 5 columns. The for loop then iterates over each column (note how i takes on the values 1 through the number of columns in the my.mat matrix) and takes a random draw of 5 values from a Poisson distribution with mean i in column i:
my.mat <- matrix(NA, nrow = 5, ncol = 5)
for(i in 1:ncol(my.mat)){
my.mat[, i] <- rpois(5, lambda = i)
}
my.mat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 2 1 7 1
## [2,] 1 2 2 3 9
## [3,] 2 1 5 6 6
## [4,] 2 1 5 2 10
## [5,] 0 2 2 2 4
Q. Explain briefly about the while loop.
Ans :
While loops begin by testing a condition. If it is true, then they execute the statement. Once the statement is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits. It is considered a best practice to include a counter object to keep track of the total iterations.
# syntax of while loop
counter <- 1
while(test_expression) {
statement
counter <- counter + 1
}
while loops can potentially result in infinite loops if not written properly; therefore, you must use them with care. To provide a simple example to illustrate how similar for and while loops are:
counter <- 1
while(counter <= 10) {
print(counter)
counter <- counter + 1
}
# this for loop provides the same output
counter <- vector(mode = "numeric", length = 10)
for(i in 1:length(counter)) {
print(i)
}
The primary difference between a for loop and a while loop is: a for loop is used when the number of iterations the code should run is known, whereas a while loop is used when the number of iterations is not known. For instance, the following takes the value x and randomly adds or subtracts 1 from it until x falls outside the range in the test expression. The output illustrates that the code runs 14 times, until x exceeds the threshold with the value 9.
counter <- 1
x <- 5
set.seed(3)
while(x >= 3 && x <= 8 ) {
coin <- rbinom(1, 1, 0.5)
if(coin == 1) { ## random walk
x <- x + 1
} else {
x <- x - 1
}
cat("On iteration", counter, ", x =", x, "\n")
counter <- counter + 1
}
## On iteration 1 , x = 4
## On iteration 2 , x = 5
## On iteration 3 , x = 4
## On iteration 4 , x = 3
## On iteration 5 , x = 4
## On iteration 6 , x = 5
## On iteration 7 , x = 4
## On iteration 8 , x = 3
## On iteration 9 , x = 4
## On iteration 10 , x = 5
## On iteration 11 , x = 6
## On iteration 12 , x = 7
## On iteration 13 , x = 8
## On iteration 14 , x = 9
Q. Explain briefly about the repeat loop.
Ans :
A repeat loop is used to iterate over a block of code multiple times. There is no test expression in a repeat loop to end or exit the loop. Rather, we must put a condition statement explicitly inside the body of the loop and use the break function to exit the loop. Failing to do so will result in an infinite loop.
# syntax of repeat loop
counter <- 1
repeat {
statement
if(test_expression){
break
}
counter <- counter + 1
}
213
Rahul Publications
MBA II YEAR III SEMESTER
For example, say we want to randomly draw values from a uniform distribution between 1 and 25. Furthermore, we want to continue to draw values randomly until our sample contains at least each integer value between 1 and 25; however, we do not care if we've drawn a particular value multiple times. The following code repeats the random draws of values between 1 and 25 (which we round). We then include an if statement to check if all values between 1 and 25 are present in our sample. If so, we use the break statement to exit the loop. If not, we add to our counter and let the loop repeat until the conditional if statement is found to be true. We can then check the counter object to assess how many iterations were required to reach our conditional requirement.
counter <- 1
x <- NULL
repeat {
x <- c(x, round(runif(1, min = 1, max = 25)))
if(all(1:25 %in% x)){
break
}
counter <- counter + 1
}
The break argument stops a loop entirely as soon as its condition is met. In this example, the for loop prints each element of x but breaks out of the loop when it reaches the element that equals 3:
x <- 1:5
for (i in x) {
if (i == 3){
break
}
print(i)
}
## [1] 1
## [1] 2
The next argument is useful when we want to skip the current iteration of a
loop without terminating it. On encountering next, the R parser skips further evaluation
and starts the next iteration of the loop. In this example, the forloop will iterate for each
element in x; however, when it gets to the element that equals 3 it will skip the for loop
execution of printing the element and simply jump to the next iteration.
x <- 1:5
for (i in x) {
if (i == 3){
next
}
print(i)
}
## [1] 1
## [1] 2
## [1] 4
## [1] 5
Extract the 3rd and 5th rows with the 2nd and 4th columns
# Create the data frame.
emp.data <- data.frame(
emp_id = c(1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
Join_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
When we execute the above code, it produces the following result:
emp_name Join_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
iv) Expand Data Frame
A data frame can be expanded by adding columns and rows.
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c(1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
Join_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)
# Create the second data frame
emp.newdata <- data.frame(
emp_id = c(6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
Join_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)
# Bind the two data frames.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
When we execute the above code, it produces the following result:
emp_id emp_name salary Join_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Finance
# first 5 observations
newdata <- mydata[1:5,]
# based on variable values
newdata <- mydata[ which(mydata$gender=='F'
& mydata$age > 65), ]
# or
attach(mydata)
newdata <- mydata[ which(gender=='F' & age > 65),]
detach(mydata)
Selection using the Subset Function
The subset( ) function is the easiest way to select variables and observations. In the following example, we select all rows that have a value of age greater than or equal to 20 or age less than 10. We keep the ID and Weight columns.
# using subset function
newdata <- subset(mydata, age >= 20 | age < 10,
select=c(ID, Weight))
In the next example, we select all men over the age of 25 and we keep the variables weight through income (weight, income and all columns between them).
# using subset function (part 2)
newdata <- subset(mydata, sex=="m" & age > 25,
select=weight:income)
Random Samples
Use the sample( ) function to take a random sample of size n from a
dataset.
# take a random sample of size 50 from a dataset mydata
# sample without replacement
mysample <- mydata[sample(1:nrow(mydata), 50,
replace=FALSE),]
Once you have access to your data, you will want to massage it into useful form. This includes creating new variables (including recoding and renaming existing variables), sorting and merging datasets, aggregating data, reshaping data, and subsetting datasets (including selecting observations that meet criteria, randomly sampling observations, and dropping or keeping variables).
Each of these activities usually involves the use of R's built-in operators (arithmetic and logical) and functions (numeric, character, and statistical). Additionally, you may need to use control structures (if-then, for, while, switch) in your programs and/or create your own functions. Finally, you may need to convert variables or datasets from one type to another (e.g. numeric to character or matrix to data frame).
This section describes each task from an R perspective.
Q. How to manipulate data in R?
Ans :
Data Manipulation In R
Data structures provide the way to represent data in data analytics. We can
manipulate data in R for analysis and visualization.
Before we start playing with data in R, let us see how to import data in
R and ways to export data from R to different external sources like SAS, SPSS, text
file or CSV file.
One of the most important aspects of computing with data is data manipulation in R, which enables its subsequent analysis and visualization. Let us see a few basic data structures in R:
225
Rahul Publications
MBA II YEAR III SEMESTER
(a) Vectors in R
These are ordered containers of primitive elements and are used for 1-dimensional data.
Types – integer, numeric, logical, character, complex
(b) Matrices in R
These are rectangular collections of elements and are useful when all data is of a single class, that is, numeric or character.
Dimensions – two, three, etc.
(c) Lists in R
These are ordered containers for arbitrary elements and are used for higher-dimension data, like the customer data information of an organization. When data cannot be represented as an array or a data frame, a list is the best choice. This is so because lists can contain all kinds of other objects, including other lists or data frames, and in that sense they are very flexible.
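A compact sketch of the three structures side by side (the values are invented for illustration):
v <- c(1.5, 2, 3.7)                   # numeric vector, 1-dimensional
m <- matrix(1:6, nrow = 2)            # 2 x 3 matrix, a single class
l <- list(name = "Asha", scores = v)  # list mixing character and numeric
str(l)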
Q. What is Data Manipulation and Data Processing?
Ans :
In this section we learn data manipulation in R and data processing with R. Moreover, we will see the three subset operators in R and how to perform R data manipulation such as subsetting, sorting and merging of data in the R programming language. Also, we will learn data structures in R, how to create subsets in R, usage of the R sample() command, and ways to create data subgroups or bins of data in R. Along with this, we will look at different ways to combine data in R, how to merge data in R, sorting and ordering data in R, ways to traverse data in R, and the formula interface in R. At last, these topics provide a complete tutorial on ways of manipulating and processing data in R.
So, let's start Data Manipulation in R.
Data Manipulation in R covers: creating subsets, creating subgroups, the match() function, sorting and ordering, and adding calculated variables/fields.
226
Rahul Publications
UNIT - V BUSINESS ANALYTICS (OU)
227
Rahul Publications
MBA II YEAR III SEMESTER
228
Rahul Publications
UNIT - V BUSINESS ANALYTICS (OU)
Ans :
Following are a few functions for writing (exporting) data to files.
write.table() and write.csv() export data to a wide range of file formats including CSV and tab-delimited.
writeLines() writes text lines to a text-mode connection.
dump() takes a vector of names of R objects and produces text representations of the objects on a file (or connection). A dump file can usually be sourced into another R session.
dput() writes an ASCII text representation of an R object to a file (or connection) or uses one to recreate the object.
save() writes an external representation of R objects to the specified file.
5. R Function
Ans :
A function is a set of statements organized together to perform a specific task. R
has a large number of in-built functions and the user can create their own functions. In
R, a function is an object so the R interpreter is able to pass control to the function,
along with arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well
as any result which may be stored in other objects.
Definition
An R function is created by using the keyword function. The basic syntax of an
R function definition is as follows :
function_name <- function(arg_1, arg_2, ...) {
Function body
}
6. Components of R
Ans :
The different parts of a function are:
Function Name: This is the actual name of the function. It is stored in R
environment as an object with this name.
Arguments: An argument is a placeholder. When a function is invoked, you
pass a value to the argument. Arguments are optional; that is, a function may
contain no arguments. Also arguments can have default values.
Function Body: The function body contains a collection of statements that
defines what the function does.
Return Value: The return value of a function is the last expression in the function
body to be evaluated.
7. Control statements.
Ans :
Looping is similar to creating functions in that they are merely a means to automate a certain multi-step process by organizing sequences of R expressions. R consists of several loop control statements which allow you to perform repetitive code processes with different intentions and allow these automated expressions to naturally respond to features of your data. Consequently, learning these loop control statements will go a long way in reducing code redundancy and becoming a more efficient data wrangler.
[Flowchart of an if statement: the condition is tested; if it is true, the conditional code is executed; if it is false, the code is skipped.]
ANSWERS
1. R
2. Environment
3. Library
4. Two
5. Function
6. Looping
7. While loops
8. Repeat
9. Data
10. R
SOLVED MODEL PAPERS BUSINESS ANALYTICS (OU)
FACULTY OF MANAGEMENT
BBA III Year - VI Semester
Model Paper - I
BUSINESS ANALYTICS
Time : 3 Hours ] [Max. Marks : 80
PART - A (5 × 4 = 20 Marks)
[Short Answer type]
ANSWERS
1. a) What is Business analytics? (Unit-I, SQA 1)
PART - B (5 × 12 = 60 Marks)
[Essay Answer type]
Answer all the questions using the internal choice
SOLVED MODEL PAPERS BUSINESS ANALYTICS (OU)
FACULTY OF MANAGEMENT
BBA III Year - VI Semester
Model Paper - II
BUSINESS ANALYTICS
Time : 3 Hours ] [Max. Marks : 80
PART - A (5 × 4 = 20 Marks)
[Short Answer type]
ANSWERS
1. a) What is Big Data? (Unit-I, SQA 3)
b) Descriptive Statistics (Unit-II, SQA 2)
c) What is Data visualization? (Unit-II, SQA 3)
d) Data Mining (Unit-III, SQA 5)
e) Association in Data Mining (Unit-III, SQA 9)
f) Cutting Plane Method (Unit-IV, SQA 5)
g) R Function (Unit-V, SQA 5)
h) Reading Data in R (Unit-V, SQA 3)
PART - B (5 × 12 = 60 Marks)
[Essay Answer type]
Answer all the questions using the internal choice
SOLVED MODEL PAPERS BUSINESS ANALYTICS (OU)
FACULTY OF MANAGEMENT
BBA III Year - VI Semester
Model Paper - III
BUSINESS ANALYTICS
Time : 3 Hours ] [Max. Marks : 80
PART - A (5 × 4 = 20 Marks)
[Short Answer type]
ANSWERS
1. a) Various Challenges in Business Analytics (Unit-I, SQA 5)
b) Cross Tabulations (Unit-II, SQA 4)
c) Gantt Chart (Unit-II, SQA 9)
d) Limitations of Regression Analysis (Unit-III, SQA 3)
e) Data Reduction (Unit-III, SQA 10)
f) Decision analysis (Unit-IV, SQA 6)
g) Control statements (Unit-V, SQA 7)
h) Data frame in R (Unit-V, SQA 8)
PART - B (5 × 12 = 60 Marks)
[Essay Answer type]
Answer all the questions using the internal choice
SOLVED PREVIOUS QUESTIONS PAPER BUSINESS ANALYTICS (OU)
FACULTY OF MANAGEMENT
B.B.A VI-Semester (CBCS) Examination
MAY - 2019
BUSINESS ANALYTICS
Time: 3 Hours Max. Marks : 80
PART – A (5 × 4 = 20 Marks)
(Short Answer Type)
Note: Answer all the questions.
ANSWERS
1. Answer any five of the following questions in not exceeding 20 lines each.
(a) Define data and data types with examples. (Unit-I, Q.No. 16)
(b) Explain business analytics in practice with examples. (Unit-I, Q.No. 7)
(c) Explain types of charts. (Unit-II, Q.No. 8)
(d) Prepare a Dash Board for daily sales report using MS-Excel.
Ans :
Dashboards are made up of tables, charts, gauges, and numbers. They can be
used in any industry, for almost any purpose. For example, you could make a project
dashboard, financial dashboard, marketing dashboard, and more.
(i) How to Bring Data into Excel
Before creating dashboards in Excel, you need to import the data into Excel. You
can copy and paste the data, or if you use CommCare, you can create an Excel
Connection to your export. But, the best way is to use ODBC (or Live Data
Connector). ODBC can connect your apps to Excel, passing real-time data from
your app to Excel. As data is updated in your app, your Excel dashboard will
also be updated to reflect the latest information. This is a perfect option if you
track and store data in another place, and prefer creating a dashboard in
Excel. Data can be imported two different ways: in a flat file or a pivot table.
(ii) Set Up Your Excel Dashboard File
Once you have added your data, you need to structure your workbook. Open a
new Excel Workbook and create two to three sheets (two to three tabs). You
could have one sheet for your dashboard and one sheet for the raw data (so you
can hide the raw data). This will keep your Excel workbook organized. In this
example, we’ll have two tabs.
(iii) Create a Table with Raw Data
(a) In the Raw Data sheet, import or copy and paste your data. Make sure the
information is in a tabular format. This means that each item or data point
lives in one cell.
(b) In this example, we’re adding columns for Project Name, Timeline, Number
of Team Members, Budget, Risks, Open Tasks, and Pending Actions.
(iv) Analyze the Data
Before building the dashboard, take some time to look at your data and figure
out what you want to highlight. Do you need to display all the information? What kind
of story are you trying to communicate? Do you need to add or remove any data?
Once you have an idea of your dashboard’s purpose, think about the different
tools you can use. Options include:
Excel formulas like SUMIF, OFFSET, COUNT, VLOOKUP, GETPIVOTDATA
and others
Pivot tables
Excel tables
Data validation
Auto-shapes
Named ranges
Conditional formatting
(v) Build the Dashboard
Add a Gantt Chart
We’ll add a Gantt chart to visually show your project timeline.
(1) Go to your Dashboard sheet and click Insert.
(2) In the Charts section, click the bar chart icon and select the second option.
PART – B (5 × 12 = 60 Marks)
(Essay Answer Type)
Note: Answer all the questions using the internal choice.
2. (a) Explain categories of Business Analytical Methods and models with examples. (Unit-I, Q.No. 3, 4)
OR
(b) Explain the role of big data in competing food apps Swiggy and Zomato.
Ans :
As the online food ordering trend is becoming more and more prominent in
India, the food delivery platforms like Swiggy and Zomato are growing their user base
at an exponential rate.
1. Swiggy
According to a report, the number of user interactions on Swiggy has grown
exponentially from 2 billion in October 2017 to a massive 40 billion in January 2019.
To keep up with the massive growth, the company looks to artificial intelligence as a solution to many of its problems. The head of the Engineering and Data Science team at Swiggy, Dale Vaz, says, "AI is critical for us to sustain our growth."
Artificial intelligence helps Swiggy distinguish dishes from images, classifying them as vegan or non-vegan. Natural Language Processing can greatly help the platform serve a wider geography without having to consider linguistic boundaries, enabling search using colloquial terms which customers can use to obtain accurate results.
2. Zomato
Seen by Swiggy as its arch-rival in the food delivery market, Zomato doesn't seem to be backing down either. Recently the company raised Rs 284 crore from a US investor, Glade Brook Capital Partners, as part of its strategy to acquire more market share from its rivals. Last month Zomato claimed to have achieved a 28 million monthly order run rate as of December, compared to 21 million in October, which also helps the company project future order volume. The platform's Gold subscription package also claims to have worked out for the company, bringing on board 7 lakh members and over 6,000 restaurant partners, up from 6 lakh members and 4,000 restaurants.
It was in late December that Zomato acquired the Lucknow-based startup TechEagle Innovations, looking forward to establishing a drone-based delivery network in India.
Zomato's Founder and CEO Deepinder Goyal said in a press release: "We believe that robots powering the last-mile delivery is an inevitable part of the future and hence is going to be a significant area of investment for us."
3. (a) Explain any six data visualization techniques and their importance in Business Analysis. (Unit-II, Q.No. 8)
OR
(b) Explain the process of charts in SPSS by clearly mentioning the path. (Unit-II, Q.No. 16)
4. (a) Explain cause and effect modelling with a hypothetical example. (Unit-III, Q.No. 26)
OR
OR
(b) Explain the concept of Decision Analysis. (Unit-IV, Q.No. 11)
OR
(b) Using the example given explain the remaining data
types in R.
Data type      Example        Verify
1. Logical     TRUE, FALSE    v <- TRUE; class(v) gives "logical"
2. Numeric     ?              ?
3. Integer     ?              ?
4. Complex     ?              ?
5. Character   ?              ?
6. Raw         ?              ?
Ans :
Data type      Example                              Verify
1. Logical     TRUE, FALSE                          v <- TRUE; class(v) gives "logical"
2. Numeric     12.3, 5, 999                         v <- 23.5; class(v) gives "numeric"
3. Integer     2L, 34L, 0L                          v <- 2L; class(v) gives "integer"
4. Complex     3 + 2i                               v <- 2 + 5i; class(v) gives "complex"
5. Character   'a', "good", "TRUE", '23.4'          v <- "TRUE"; class(v) gives "character"
6. Raw         "Hello" is stored as 48 65 6c 6c 6f  v <- charToRaw("Hello"); class(v) gives "raw"