BSADM Question Bank - MBA Sem 1
Semester – I
Module 1
Measures of Dispersion (Variation) & Symmetry: Significance of measuring Dispersion, Requisites and
classification of measures of Dispersion, Distance measures - Range, Interquartile range. Average Deviation
measures - Mean Absolute Deviation, Variance and Standard deviation, Chebyshev’s Theorem, Coefficient of
variation & its significance. Concept of Skewness & Kurtosis
CO1: For a given dataset, the student should be able to estimate the dispersion/variance and symmetry of
the data using various measures and draw inferences to facilitate decision making
Question 1:
Question 2:
Question 4:
Question 5:
Question 6:
Question 7:
Measures of Association: Correlation, Types & Methods of Correlation analysis - Karl Pearson’s coefficient of
correlation, Spearman’s Rank correlation, Probable error, Coefficient of Determination, Standard error of
coefficient of correlation. Introduction to regression analysis and its advantages, Types of regression models,
methods to determine regression coefficients (normal equations)
CO2: For a given dataset, the student should be able to assess the level of association between given
variables in the data using various types of correlation analysis techniques. The student should also be
able to predict the values of a variable using regression analysis techniques.
Question 1. Summarize Correlation Analysis by exemplifying its meaning, nature, assumptions and
limitations. (4+4+4+4)
Ans: Bivariate data: Data relating to two variables is called bivariate data. A bivariate data set may reveal
some kind of association between two variables x and y, and we may be interested in numerically measuring the
strength of this association. Correlation provides such a measure.
Positive or direct correlation
If higher values of one variable are associated with higher values of the other, or lower values of one are
accompanied by lower values of the other (in other words, the two variables move in the same direction), it is
said that there exists positive or direct correlation between the variables.
Example
The greater the sides of a rectangle, the greater will be its area; the higher the dividend declared by a company,
the higher will be market price of its shares.
rs = 1 − 6Σd² / [n(n² − 1)]
So, rs = 1 − (6 × 11) / [8 × (64 − 1)] = 1 − 66/504 = 1 − 0.130952 = +0.869048
Interpretation:
Here, rs = 0.869048. We conclude that there is a strong positive correlation between per capita income and
expenditure.
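The calculation above can be checked with a short Python sketch (the function name is illustrative):

```python
def spearman_rho(d_squared_sum, n):
    """Spearman's rank correlation: rs = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    return 1 - (6 * d_squared_sum) / (n * (n ** 2 - 1))

# Values from the worked example: n = 8 pairs of ranks, sum of d^2 = 11
rho = spearman_rho(11, 8)
print(round(rho, 6))  # 0.869048
```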
Question 5:
Question 7:
Question 8:
Probability: Basic terminology, types of probability, probability rules, conditional probabilities, Bayes'
Theorem. Random Variables, Probability distributions: Binomial distribution, Poisson distribution, Normal
distribution. Choosing the correct probability distribution
CO3: For given situations a student should be able to determine the various probabilities arising out of the
situation and make use of probability theory and appropriate probability distributions for the purpose
of decision making.
Question 1: The lifetimes of certain electronic devices have a mean of 300 hours and a standard deviation of 25
hours. Assume that the distribution of these lifetimes, measured to the nearest hour, can be approximated
closely by the normal curve.
a) Find the probability that any one of these electronic devices will have a lifetime of more than 350 hours
b) What percentage has lifetimes of 300 hours or less?
c) What percentage will have lifetimes from 220 to 260 hours?
Solution:
a) Z = (x − μ)/σ = (350 − 300)/25 = 2
The area under the normal curve to the left of z = 2 is 0.9772. Thus the required probability is 1 − 0.9772 =
0.0228 = 2.28%
b) Z = (x − μ)/σ = (300 − 300)/25 = 0
Half of the area under the curve lies below z = 0, so 50% of the devices have lifetimes of 300 hours or less.
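All three parts can be verified with Python's standard-library `statistics.NormalDist`, with no external packages needed:

```python
from statistics import NormalDist

lifetime = NormalDist(mu=300, sigma=25)

p_a = 1 - lifetime.cdf(350)                  # a) P(X > 350)
p_b = lifetime.cdf(300)                      # b) P(X <= 300)
p_c = lifetime.cdf(260) - lifetime.cdf(220)  # c) P(220 <= X <= 260)

print(round(p_a, 4))  # 0.0228 -> 2.28%
print(round(p_b, 4))  # 0.5    -> 50%
print(round(p_c, 4))  # 0.0541 -> 5.41%
```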
Question 2: A company X has 1500 employees and every year 300 employees quit the company. Estimate the
probability of attrition in this company per year. Also estimate the probability of retention of employees in
the same company. (8+8 marks)
Solution
1. Calculation of attrition rate: According to the frequency estimation equation, the probability of an event is
given by,
P(X) = No. of observations in favour of event X/Total no. of observations = n(X)/N
In the given condition of the problem,
P(Attrition) = No. of employees quitting in a year/Total no. of employees
= 300/1500 = 1/5 = 0.2 = 20%
Hence the probability of attrition in this company per year = 0.2 or 20%
2. Calculation of retention rate: According to the frequency estimation equation, the probability of an event is
given by,
P(Y) = No. of observations in favour of event Y/Total no. of observations = n(Y)/N
In the given condition of the problem,
P(Retention) = No. of employees staying in the job in a year/Total no. of employees
= 1200/1500 = 0.8 = 80%
Hence the probability of retention in this company per year = 0.8 or 80%
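The same relative-frequency estimate can be written as a few lines of Python:

```python
# Relative-frequency estimate of probability: P(event) = n(event) / N
total_employees = 1500
quit_per_year = 300

p_attrition = quit_per_year / total_employees   # 300/1500 = 0.2 -> 20%
p_retention = 1 - p_attrition                   # 1200/1500 = 0.8 -> 80%
print(p_attrition, p_retention)  # 0.2 0.8
```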
Question 3: A web site displays a total of 10 advertisements. When a visitor to this web site clicks and sees any
of the advertisements, the web site gets its revenue. Out of a total of 2500 visitors to this site in a day, thirty
visitors clicked on one advertisement, fifteen visitors clicked on two advertisements, while five visitors
clicked on three advertisements. The rest of the visitors did not click on any of the advertisements. Under these
conditions, estimate,
1. Probability that the visitor will click on any of the advertisements.
2. Probability that any visitor will click on at least two of the advertisements.
3. Probability that any visitor will not click on any of the advertisements. (6+6+4)
Solution
1. Probability that any visitor will click on any of the advertisements.
According to the frequency estimation equation, the probability of an event is given by,
P(X) = No. of observations in favour of event X/Total no. of observations = n(X)/N
In the given condition of the problem,
P(Any advertisement click) = No. of visitors clicking on at least one advertisement in a day/Total no. of
visitors to the site
= 50/2500 = 1/50 = 0.02 = 2%
The probability that a visitor will click on any of the advertisements is 0.02 = 2%
2. Probability that any visitor will click on at least two of the advertisements.
According to the frequency estimation equation, the probability of an event is given by,
P(Y) = No. of observations in favour of event Y/Total no. of observations = n(Y)/N
In the given condition of the problem,
P(At least two advertisement clicks) = No. of visitors clicking on two or three advertisements in a day/Total
no. of visitors to the site
= (15 + 5)/2500 = 20/2500 = 1/125 = 0.008 = 0.8%
The probability that a visitor will click on at least two of the advertisements is 0.008 = 0.8%
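All three parts, including the complement in part 3, reduce to simple relative frequencies:

```python
total_visitors = 2500
one_ad, two_ads, three_ads = 30, 15, 5   # visitors by number of ads clicked

p_any = (one_ad + two_ads + three_ads) / total_visitors   # 50/2500 = 0.02
p_at_least_two = (two_ads + three_ads) / total_visitors   # 20/2500 = 0.008
p_none = 1 - p_any                                        # 2450/2500 = 0.98
print(p_any, p_at_least_two, p_none)  # 0.02 0.008 0.98
```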
Question 4:
Question 7:
Hypothesis Testing: Introduction, Hypothesis testing procedure, errors in hypothesis testing, Power
of a statistical test. t-test, ANOVA and Chi–Square test, (Students should be able to perform testing on
spreadsheets)
CO4: For a given research problem, student should be able to construct appropriate hypotheses
and draw conclusions by using a suitable hypothesis testing procedure so as to address the
research problem in question.
Question 1: An automobile company has given you the task of ascertaining customers' preferences for
particular colours of cars: red, silver, white and black. The company also wants to know whether the customers'
gender may have any impact on this decision. Explain how you would plan to conduct this research in a
step-by-step manner, keeping in mind the research process and the hypothesis testing procedure. What may be
the possibilities of committing errors, if any?
Solution:
Expected steps in research process, hypothesis testing procedure and choice of suitable hypothesis test. In
addition, possibility of committing Type I and Type II errors to be discussed.
Question 2: Helix Corporation wants to study whether the welfare facilities provided by the organisation are
effective or not. The organisation has conducted a survey of a sample of 487 staff members, asking about the
effectiveness of the policies. The sample comprises all categories of staff: 260 male employees, 140 female
employees and 87 contract employees. The results of the survey are tabulated as under (the "Not effective" row
follows from the staff totals).

Response                   Male   Female   Contract   Total
Effective                   95      55        42       192
Effective to some extent    89      49        32       170
Not effective               76      36        13       125
Total                      260     140        87       487

Formulate the hypothesis and apply the chi-square test at the 5% significance level to test whether the policies
are effective or not. Draw the inference. (Use the tabulated value of chi-square, 9.488.)
Solution:
α = 0.05
n = 487
Calculated chi-square = Σ(fo − fe)²/fe = 7.5859
fe – Expected frequency = (RT × CT)/n
RT – Row total
CT – Column total
Degrees of freedom = (r − 1)(c − 1) = (3 − 1) × (3 − 1) = 4
For 4 degrees of freedom and a significance level of 0.05, the tabulated value of chi-square is 9.488. The
calculated value is 7.586, which is less than the tabulated value. Hence the null hypothesis, that Helix
Corporation is not providing effective welfare facilities to employees, is accepted and the alternate hypothesis
is rejected. It is concluded that Helix Corporation does not provide effective welfare policies to its employees.
Question 3: Supratech Hospital is conducting the treatment of 250 patients suffering from a certain disease.
The hospital has devised a new treatment and wants to know whether the new treatment is superior to the
conventional treatment. The details of the results of treatment are given in the following table (the
new-treatment row follows from the totals).

Treatment       Favourable   Not favourable   Total
New                140             30          170
Conventional        60             20           80
Total              200             50          250

Formulate the hypothesis and test it using the chi-square test at the 5% level of significance and draw the
inference. The tabulated value of chi-square at the 5% level of significance and 1 degree of freedom is 3.84.
Solution:
α = 0.05
n = 250

Row   Column   fo    fe = (RT × CT)/n   fo − fe   (fo − fe)²   (fo − fe)²/fe
1       1     140          136             4          16           0.118
1       2      30           34            −4          16           0.471
2       1      60           64            −4          16           0.250
2       2      20           16             4          16           1.000
Total                                                              1.839
fe – Expected frequency
RT – Row total
CT – Column total
Degrees of freedom = (r − 1)(c − 1) = (2 − 1) × (2 − 1) = 1
For 1 degree of freedom and a significance level of 0.05, the tabulated value of chi-square is 3.84. The
calculated value is 1.839, which is less than the tabulated value. Hence the null hypothesis, that there is no
significant difference between the new and conventional treatments, is accepted and the alternate hypothesis is
rejected.
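The chi-square computation can be reproduced in plain Python. The conventional-treatment row (60, 20) is given in the question; the new-treatment row (140, 30) is implied by the grand total of 250 and the expected frequencies shown in the solution table:

```python
# Observed frequencies: rows = (New, Conventional) treatments
observed = [[140, 30], [60, 20]]

row_totals = [sum(row) for row in observed]        # [170, 80]
col_totals = [sum(col) for col in zip(*observed)]  # [200, 50]
n = sum(row_totals)                                # 250

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n     # fe = RT * CT / n
        chi_sq += (fo - fe) ** 2 / fe

# 1.838; the solution's 1.839 comes from rounding each term before adding
print(round(chi_sq, 3))
```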
Preference \ Area     Mumbai   Bangalore   Delhi   Total
New Method              20         7          3      30
Conventional Method     14         5          1      20
Total                   34        12          4      50
Formulate the hypothesis and test it at 5% level of significance using Chi square test and draw the inference.
Use 5.99 as the tabulated value of Chi square for 2 degrees of freedom and at 5% level of significance.
Solution:
Null Hypothesis: There is no significant difference between the proportion of respondents who prefer new
training method and those who prefer conventional training method.
Alternate Hypothesis: There is significant difference between the proportion of respondents who prefer new
training method and those who prefer conventional training method
H0 : PM = PB = PD
H1 : PM ≠ PB ≠ PD
α = 0.05
The hypothesis is tested using chi square test as the data comes from more than one population and also the
distribution is not known.
Calculated chi-square = Σ(fo − fe)²/fe = 0.176
fe – Expected frequency
Degrees of freedom = (2 − 1) × (3 − 1) = 2
For 0.05 significance level & 2 degrees of freedom the tabulated value of chi square is 5.99. The calculated value
is 0.176 which is less than the tabulated value. Hence the null hypothesis that there is no significant difference
between the proportion of respondents who prefer new training method and those who prefer conventional
training method is accepted and alternate hypothesis is rejected.
Question 5: Triveni Industries Inc operates in Mumbai, Kolkata and Delhi for retail selling of a certain
commodity. The company wants to test the significance of the variation in the pricing of the commodity across
its outlets in the above-mentioned cities. For this purpose it has randomly chosen 4 shops in each city. The
data regarding the pricing are given below.

City      Shop 1   Shop 2   Shop 3   Shop 4
Mumbai      16        8       12       14
Kolkata     14       10       10        6
Delhi        4       10        8        8

Formulate the hypothesis at the 5% significance level and draw the inference as to whether the variation in
pricing across the cities is significant or not. For df1 = 2 and df2 = 9 the F statistic has a critical value of 4.26.
Solution:
H0: There is no significant difference in the prices of the commodity in the three cities
H1: There is a significant difference in the prices of the commodity in the three cities
α = 0.05

Source            Sum of Squares   df   Mean Square
Between samples         50          2       25
Within samples          86          9       9.55

F = 25/9.55 = 2.617
Since the calculated value of F (2.617) is less than the critical value 4.26 at df1 = 2 and df2 = 9, the null
hypothesis is accepted: the variation in pricing across the three cities is not significant.
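The one-way ANOVA above can be verified by hand in Python (between-samples and within-samples sums of squares computed directly from the shop prices):

```python
prices = {
    "Mumbai":  [16, 8, 12, 14],
    "Kolkata": [14, 10, 10, 6],
    "Delhi":   [4, 10, 8, 8],
}

groups = list(prices.values())
n = sum(len(g) for g in groups)                     # 12 observations
k = len(groups)                                     # 3 cities
grand_mean = sum(x for g in groups for x in g) / n  # 10.0

# Between-samples and within-samples sums of squares
ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

msb = ssb / (k - 1)      # 50/2 = 25.0
msw = ssw / (n - k)      # 86/9 ~ 9.556 (rounded to 9.55 in the text)
f_stat = msb / msw
print(round(f_stat, 3))  # 2.616
```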
Question 6: Pride agency is engaged in selling electronic and electrical goods. The agency has 4 salesmen and
wants to assess their sales performance in selling refrigerators during the months of May, June and July. The
following table gives the details of their sales for the mentioned months.

Month    A    B    C    D
May     50   40   48   39
June    46   48   50   45
July    39   44   40   39

The company wants to know whether there is a significant difference in the sales made by the four salesmen.
The company also wants to know whether there is a significant difference in the sales made during the months
specified. Formulate the hypothesis and test it at the 5% level of significance. The critical values of the F
statistic at the 5% level are 4.75 for df (3, 6) and 5.14 for df (2, 6).
Solution:
H0: There is no significant difference in the sales made by the four salesmen during different months.
H1: There is a significant difference in the sales made by the four salesmen during different months.
α = 0.05
Coding of data: The problem contains data with large values. Hence, for the sake of simplicity, the data is
coded by subtracting 40 from every value in the problem. The new table is as under.

Month    A    B    C    D
May     10    0    8   -1
June     6    8   10    5
July    -1    4    0   -1
Months        A: x1   x1²   B: x2   x2²   C: x3   x3²   D: x4   x4²   Row Sum
May             10    100     0      0      8     64     -1      1      17
June             6     36     8     64     10    100      5     25      29
July            -1      1     4     16      0      0     -1      1       2
Column Sum      15    137    12     80     18    164      3     27      48
T = Grand Total = 48
Correction Factor (CF) = T²/n = (48)²/12 = 192
SST = Total Sum of Squares
= (Σx1² + Σx2² + Σx3² + Σx4²) − CF
= (137 + 80 + 164 + 27) − 192
= 216
SSC = Sum of Squares between the Salesmen (Columns)
= (Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3 + (Σx4)²/n4 − CF
= (15)²/3 + (12)²/3 + (18)²/3 + (3)²/3 − 192
= 42
SSR = Sum of Squares between the Months (Rows)
= (17)²/4 + (29)²/4 + (2)²/4 − 192
= 91.5
SSE = SST – (SSC+SSR) = 216 – (42+91.5) = 82.5
Degrees of Freedom:
Total Degrees of freedom df = 12 – 1 = 11
dfc = c-1 = 4-1 = 3 (Column wise)
dfr = r-1 = 3 – 1 = 2 (Row wise)
dfE = (c-1)(r-1) = 3 X 2 = 6
MSC = mean sum of squares (variance) between salesmen (columns)
= SSC/dfc = 42/3 = 14
MSR = mean sum of squares (variance) between months (rows)
= SSR/dfr = 91.5/2 = 45.75
MSE = SSE/dfE = 82.5/6 = 13.75
1. Since the calculated value of Ftreatment = MSC/MSE = 14/13.75 = 1.018 is less than its critical value F = 4.75
at df1 = 3 and df2 = 6 for the 5% significance level, the null hypothesis is accepted. Hence it is concluded
that the sales made by the salesmen do not differ significantly.
2. Since the calculated value of Fblock = MSR/MSE = 45.75/13.75 = 3.327 is less than its critical value F = 5.14
at df1 = 2 and df2 = 6 at the 5% significance level, the null hypothesis is accepted. Hence it may be concluded
that the sales made during different months do not differ significantly.
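The whole two-way ANOVA above (correction factor, sums of squares and both F ratios) can be reproduced from the coded table in a few lines of Python:

```python
# Coded sales (original value minus 40): rows = months, columns = salesmen
data = [
    [10, 0,  8, -1],   # May
    [ 6, 8, 10,  5],   # June
    [-1, 4,  0, -1],   # July
]
r, c = len(data), len(data[0])
grand = sum(x for row in data for x in row)              # T = 48
cf = grand ** 2 / (r * c)                                # 192

sst = sum(x * x for row in data for x in row) - cf       # 216
ssc = sum(sum(col) ** 2 / r for col in zip(*data)) - cf  # 42
ssr = sum(sum(row) ** 2 / c for row in data) - cf        # 91.5
sse = sst - ssc - ssr                                    # 82.5

msc = ssc / (c - 1)                 # 14.0
msr = ssr / (r - 1)                 # 45.75
mse = sse / ((c - 1) * (r - 1))     # 13.75

f_salesmen = msc / mse              # ~1.018 < 4.75 -> accept H0
f_months = msr / mse                # ~3.327 < 5.14 -> accept H0
print(round(f_salesmen, 3), round(f_months, 3))
```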
Question 7: Falcon Fitness Studio has declared a weight reduction program and claims that it reduces weight by
more than 7 kg a month. An overweight corporate executive is interested in joining the program, but he is not
confident of the claim made by the company and has asked for evidence. The company asked him to choose any
10 previous participants at random and check their weights before joining the program and after finishing it.
The details are as given below.

Weight before (kg)   Weight after (kg)
94.5 85
101 89.5
110 101.5
103.5 96
97 86
88.5 80.5
96.5 87
101 93.5
104 93
116.5 102
The overweight executive wants to test, at the 5% significance level, the claimed average weight loss of more
than 7 kg. Formulate the hypothesis and test whether there is a significant difference between the actual
weight loss and the claimed weight loss. The critical value of t at 9 degrees of freedom and the 5% level of
significance is 1.833 (one-tailed).
Solution:
H0: μ = 7
H1: μ > 7
α = 0.05
n = 10
Mean weight loss x̄ = 9.85 kg
t = (x̄ − μ)/(s/√n) = (9.85 − 7)/1.215 = 2.85/1.215 = 2.345
The calculated value of the t statistic is 2.345. Since the test is one-tailed, the critical value of t at 9
degrees of freedom and the 5% significance level is 1.833. (For a one-tailed test, the entire 5% rejection
region lies in a single tail, so the one-tailed critical value 1.833 is used rather than the two-tailed value
2.262.) The calculated value is more than the critical value. Hence the null hypothesis is rejected, and the
claim of the company is legitimate.
Solution
H0: The machine mixes 12 kg of nitrate in every 100 kg of fertiliser.
H1: The machine does not mix 12 kg of nitrate in every 100 kg of fertiliser.
In other words:
H0: μ = 12
H1: μ ≠ 12
α = 0.05
n = 10
It is assumed that the weight of nitrate in fertiliser bags is normally distributed and its standard deviation is not
known. The values of 𝑥̅ , and sample standard deviation s are calculated below.
Computation table: weight of nitrate x, x², deviation d = x − x̄, d²

s = √(174.78/9 − 112.5/9) = √(19.42 − 12.5) = √6.92 = 2.63
t = (x̄ − μ)/(s/√n) = (12.5 − 12)/(2.63/√10) = 0.5/0.832 = 0.601
The critical value of the t statistic at 9 degrees of freedom and the 5% significance level for a two-tailed
test is 2.262. The calculated value (0.601) is less than the critical value. Hence the null hypothesis is
accepted, and it is concluded that the machine mixes 12 kg of nitrate in every 100 kg of fertilizer.
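Using the summary statistics from the solution (x̄ = 12.5, s = 2.63, n = 10), the one-sample t statistic can be checked directly:

```python
from math import sqrt

# Summary statistics from the solution above
x_bar, mu, s, n = 12.5, 12.0, 2.63, 10

t_stat = (x_bar - mu) / (s / sqrt(n))
print(round(t_stat, 3))  # 0.601, well below the two-tailed critical value 2.262
```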
CO5: The student will be able to differentiate between various forms of analytics and will also be able to
choose suitable analytics for decision making.
Question.1: Examine the role played by analytics in the growth of e-commerce companies.
Ans.
1.Demand forecasting
Question 2: A firm wants to develop a new product. List out the various steps involved in data driven
decision making in solving the problems likely to be encountered using analytics.
Ans.
Ans.
3. Explain the example of price insensitivity of the parents towards their children.
1. Constitutes Statistical & Operations Research Techniques, Machine Learning & Deep Learning Algorithms
2. Classification problems.
Solution
C. Prescriptive Analytics - Which is the best model for optimization of the business revenue?
The big data revolution has given birth to different kinds, types and stages of data analysis. Boardrooms
across companies are buzzing with data analytics, offering enterprise-wide solutions for business success.
However, what do these really mean to businesses? The key to companies successfully using big data is gaining
the right information, which delivers knowledge that gives businesses the power to gain a competitive edge.
The main goal of big data analytics is to help organizations make smarter decisions for better business
outcomes.
Big data analytics cannot be considered a one-size-fits-all blanket strategy. In fact, what distinguishes the
best data scientists and data analysts from others is their ability to identify the kind of analytics that can
be leveraged, at an optimum, to benefit the business. The three dominant types of analytics - descriptive,
predictive and prescriptive - are interrelated solutions that help companies make the most of the big data they
have. Each of these analytics types offers a different insight. In this article we explore the three types -
descriptive analytics, predictive analytics and prescriptive analytics - to understand what each delivers to
improve an organization's operational capabilities.
Big data analytics helps a business understand the requirements and preferences of a customer, so that
businesses can increase their customer base and retain the existing ones with personalized and relevant
offerings of their products or services. According to IDC, the big data and analytics industry is anticipated to
grow at a CAGR of 26.4% reaching a value of $41.5 billion by end of 2018. The big data industry is growing at a
rapid pace due to various applications like smart power grid management, sentiment analysis, fraud detection,
personalized offerings, traffic management, etc. across myriad industries. After the organizations collect big
data, the next important step is to get started with analytics. Many organizations do not know where to begin,
what kind of analytics can nurture business growth and what these different types of analytics mean.
90% of organizations today use descriptive analytics, which is the most basic form of analytics. The simplest
way to define descriptive analytics is that it answers the question “What has happened?”. This type of
analytics analyses real-time and historical data for insights on how to approach the future. The main objective
of descriptive analytics is to find out the reasons behind previous success or failure. The ‘past’ here refers
to any particular time at which an event occurred, which could be a month ago or even just a minute ago. The
vast majority of big data analytics used by organizations falls into the category of descriptive analytics.
A business learns from past behaviours to understand how they will impact future outcomes. Descriptive
analytics is leveraged when a business needs to understand the overall performance of the company at an
aggregate level and describe the various aspects.
Descriptive analytics are based on standard aggregate functions in databases, which just require knowledge of
basic school math. Most of the social analytics are descriptive analytics. They summarize certain groupings
based on simple counts of some events. The number of followers, likes, posts, fans are mere event counters.
These metrics are used for social analytics like average response time, average number of replies per post,
%index, number of page views, etc. that are the outcome of basic arithmetic operations.
The best example to explain descriptive analytics are the results, that a business gets from the web server
through Google Analytics tools. The outcomes help understand what actually happened in the past and validate
if a promotional campaign was successful or not based on basic parameters like page views.
The subsequent step up from descriptive analytics is predictive analytics. Analysing past data patterns and
trends can accurately inform a business about what could happen in the future. This helps in setting realistic
goals for the business, effective planning and restraining expectations. Predictive analytics is used by
businesses to study the data and gaze into the crystal ball to find answers to the question “What could happen
in the future based on previous trends and patterns?”
Organizations collect contextual data and relate it with other customer user behaviour datasets and web server
data to get real insights through predictive analytics. Companies can predict business growth in future if they
keep things as they are. Predictive analytics provides better recommendations and more future looking
answers to questions that cannot be answered by BI.
Predictive analytics helps predict the likelihood of a future outcome by using various statistical and machine
learning algorithms but the accuracy of predictions is not 100%, as it is based on probabilities. To make
predictions, algorithms take data and fill in the missing data with best possible guesses. This data is pooled
with historical data present in the CRM systems, POS Systems, ERP and HR systems to look for data patterns
and identify relationships among various variables in the dataset. Organizations should capitalise on hiring a
group of data scientists in 2016 who can develop statistical and machine learning algorithms to leverage
predictive analytics and design an effective business strategy.
Data Mining - Identifying correlated data.
Pattern Identification and Alerts - When should an action be invoked to correct a process?
Sentiment analysis is the most common kind of predictive analytics. The learning model takes input in the form
of plain text and the output of the model is a sentiment score that helps determine whether the sentiment is
positive, negative or neutral.
Organizations like Walmart, Amazon and other retailers leverage predictive analytics to identify trends in sales
based on purchase patterns of customers, forecasting customer behaviour, forecasting inventory levels,
predicting what products customers are likely to purchase together so that they can offer personalized
recommendations, predicting the amount of sales at the end of the quarter or year. The best example where
predictive analytics find great application is in producing the credit score. Credit score helps financial
institutions decide the probability of a customer paying credit bills on time.
Big data might not be a reliable crystal ball for predicting the exact winning lottery numbers but it definitely
can highlight the problems and help a business understand why those problems occurred. Businesses can use
the data-backed and data-found factors to create prescriptions for the business problems, that lead to
realizations and observations.
Prescriptive analytics is the next step of predictive analytics that adds the spice of manipulating the future.
Prescriptive analytics advises on possible outcomes and results in actions that are likely to maximise key
business metrics. It basically uses simulation and optimization to ask “What should a business do?”
It relies on stochastic optimization, which helps understand how to achieve the best outcome and identify data
uncertainties to make better decisions.
Simulating the future, under various set of assumptions, allows scenario analysis - which when combined with
different optimization techniques, allows prescriptive analysis to be performed. Prescriptive analysis explores
several possible actions and suggests actions depending on the results of descriptive and predictive analytics of
a given dataset.
Prescriptive analytics is a combination of data and various business rules. The data for prescriptive analytics
can be both internal (within the organization) and external (like social media data); the business rules are
the constraints, preferences and best practices within which decisions are made. Prescriptive analytics is
comparatively complex in nature, and many companies are not yet using it in day-to-day business activities, as
it can be difficult to manage. If implemented properly, prescriptive analytics can have a major impact on
business growth. Large-scale organizations use prescriptive analytics for scheduling inventory in the supply
chain, optimizing production, etc., to optimize customer experience.
Aurora Health Care system saved $6 million annually by using prescriptive analytics to reduce re-admission
rates by 10%. Prescriptive analytics can be used in healthcare to enhance drug development, finding the right
patients for clinical trials, etc.
An increasing number of organizations realize that big data is a competitive advantage, and they should ensure
that they choose the right kind of data analytics solutions to increase ROI, reduce operational costs and
enhance service quality.
Question 5: Discuss how regression analysis serves as a basis for effective predictive analysis.
Ans: Regression analysis is a statistical tool with the help of which we are in a position to estimate (or
predict) the unknown values of one variable from known values of another variable.
With the help of regression analysis we are in a position to predict the average probable change in one
variable given a certain amount of change in another. Regression analysis is thus designed to examine the
relationship of a variable y to a variable x. Through regression analysis one can predict the possible new
values of a dependent variable based on changes in the independent variable. Hence regression analysis forms an
effective tool for predictive analysis.
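As a sketch of the idea, the normal equations for a simple linear regression y = a + b·x can be solved directly in Python (the data here are hypothetical, chosen only for illustration):

```python
# Hypothetical data: x = advertising spend, y = sales
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(p * q for p, q in zip(x, y))
sum_x2 = sum(p * p for p in x)

# Solving the normal equations for the line y = a + b*x
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(a, b)       # 1.0 2.0 -> fitted line y = 1 + 2x
print(a + b * 6)  # 13.0    -> predicted y for a new x = 6
```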
Question 6: A company engaged in manufacturing and marketing of FMCG products wants to enter a new,
untapped market. Based on the available purchase patterns of the consumers, suggest which products the
company should introduce in that market. What data would you require to make this decision? Would you be
using data analytics, data mining or both? Elaborate.
Ans :
In the above-mentioned situation, both tools are required for the launch decision. The data mining tools will
provide insights into the buying patterns of the retailers and consumers in that market, while the analytics
tools will provide an idea about the projected sales, projected success rate, likely buyers for the new
products, etc.
The following data is required by this company while taking a decision to introduce the existing products in a
new market.
1. Data about the new market: total population of the market, demographic profiling of the population,
gender-wise classification of the total population, existing direct and indirect competitors, their monthly
sales, retail shelf space occupied, buying habits of the consumers, number of available retailers in that
market, their payment terms, population/market growth, seasonality if any, etc.
2. Historical data: past one year's sales of the competitors, new entrants into the market during the last six
months, competitors who left the market during the last one year.
3. Logistics data: available modes of supply, number of visits of the distributors to that market, lead time, etc.
Based on the above three sets of data, the management of the company can take a suitable decision about
introducing its products into that market.
MBA –Outcome Based Syllabus: Question Bank 2019-20 Page 45
Course: Business Statistics and Analytics for Decision Making
Question 7: Discuss the role of Predictive Analytics in business decision making. Illustrate using any
real life/hypothetical situation.
Predictive analytics is a basic technique in analytics, drawing on data science, artificial intelligence and
statistical tools for measuring or estimating the relationships among variables that constitute the essence of
economic theory.
If we know that two variables, price (x) and demand (y), are closely related, we can predict the most probable
value of x for a given value of y, or the most probable value of y for a given value of x.
Similarly, if we know that the amount of tax and the rise in price of a commodity are closely related, we can
find out the expected price for a certain amount of tax levy.
Question 8: Critically evaluate the differences between Data Analytics, Data Analysis and Data Mining?
Solution
People are generating tons of data every second. Every post on social media, each heartbeat, every link clicked
on the internet is data. The world generated more than 1 ZB of data in 2010. These massive data sets are often
stored in data warehouses, which collect data from all possible sources. However, the data are often
unstructured and meaningless, so professionals need to make sense of them. Experts in this field use certain
tools to make sense of these data in order to help businesses make informed decisions. Those tools include data
analytics, data analysis and data mining.
The terms data analytics, data analysis and data mining are often used interchangeably. However, there are
small differences between the three. In the simplest terms, data mining is a proper subset of data analytics,
data analytics is a proper subset of data analysis, and all of them are proper subsets of data science. It is
easy to get confused; read on to get a better understanding of the three terms.
Data Mining
We are starting with data mining because it is the smallest in the set we’re considering. Every tool, method or
process used in data mining is also used in data analytics. Data analytics is data mining plus more. Wikipedia
defines data mining as “the process of discovering patterns in large data sets involving methods at the
intersection of machine learning, statistics, and database systems”. The Economic Times defines it as “process
used to extract usable data from a larger set of any raw data”. These definitions give an overview of what data
mining is about. Let’s delve deeper.
Data mining was very popular in the 1990s and early 2000s. Some sources say data mining is another name for Knowledge Discovery in Databases (KDD), while others say it is one stage of KDD. What matters most is that data mining draws data together from a large pool and tries to find correlations between items or concepts, for instance between almonds and fungi, or between beer and diapers.
The most common operations used in data mining to make meaning of data include clustering, predictive or descriptive modelling (forecasting), deviation detection, correlations between data sets, classification, regression and summarisation.
Clustering
This is a common data-mining task used to group similar data together: information with similar characteristics is placed in the same group. Clustering brings a set of data together to measure how similar its members are and to uncover facts that were previously unknown. It both describes the data and can be used to predict possible future trends.
Deviation Detection
This is also known as anomaly detection. It aims to understand why certain patterns differ from the rest; it studies data errors and tries to find out why they are different and what caused the difference.
Summarisation: This makes data more compact, making it easier to understand, visualise and report.
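Summarisation in this sense means compressing raw values into a few descriptive figures. A small sketch, using made-up sales figures:

```python
# Compress a list of raw values into a compact summary (count, mean,
# minimum, maximum) that is easier to report and visualise.
import statistics

sales = [120, 135, 128, 150, 142, 138]
summary = {
    "count": len(sales),
    "mean": round(statistics.mean(sales), 1),
    "min": min(sales),
    "max": max(sales),
}
print(summary)  # → {'count': 6, 'mean': 135.5, 'min': 120, 'max': 150}
```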
Classification
This task aims to place data into groups: new data is classified into already existing structures or groups. For instance, a blood test places a person into one of the four blood groups; similarly, incoming emails can be classified as junk or genuine.
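The email example above can be sketched as a toy rule-based classifier that places new emails into the existing "junk" and "genuine" groups (the keyword list and threshold are hypothetical; real spam filters use statistical models):

```python
# Toy classifier: an email containing two or more junk keywords
# is placed in the "junk" group, otherwise in "genuine".
JUNK_WORDS = {"winner", "free", "prize", "urgent"}

def classify_email(text):
    words = set(text.lower().split())
    return "junk" if len(words & JUNK_WORDS) >= 2 else "genuine"

print(classify_email("You are a WINNER claim your FREE prize"))  # → junk
print(classify_email("Minutes from the budget meeting"))         # → genuine
```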
Correlations
This is understanding the links between two data sets and is sometimes known as association rule learning. Its goal is to find patterns between two apparently unrelated sets of items, for instance the relationship between 'diapers' and 'beer'.
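The diapers-and-beer association can be quantified with the two standard measures of association rule learning, support and confidence. A minimal sketch with made-up shopping baskets:

```python
# Support: fraction of baskets containing an itemset.
# Confidence: of the baskets with the antecedent, the fraction
# that also contain the consequent.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # → 0.5
print(confidence({"diapers"}, {"beer"}))  # 2 of 3 diaper baskets contain beer
```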
Regression
This estimates the relationship between variables so that the value of one variable can be predicted from the others, for example predicting sales from advertising spend.
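A minimal simple linear regression sketch, solving the least-squares normal equations for y = a + b·x directly (the data points are illustrative and chosen to lie exactly on a line):

```python
# Fit y = a + b*x by least squares using the closed-form solution
# of the normal equations (covered under regression in this module).
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies on y = 1 + 2x
print(a, b)  # → 1.0 2.0
```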
Data Analytics
Techopedia defines data analytics as "qualitative and quantitative techniques and processes used to enhance productivity and business gain". Analytics is the logic and tooling behind analysis; it is the engine that drives analysis, and businesses make decisions on the outcome of analytics. Margaret Rouse, in her article on data analytics, includes the use of "specialised systems and software" in the definition. Data analysts use numerous tools, among them Tableau Public, Open Refine, KNIME, Rapid Miner, Google Fusion Tables and NodeXL.
Data analytics is a superset of data mining and a proper subset of data analysis. It involves using tools to analyse data in order to make business decisions. For instance, suppose your business offers massage services using electric massage chairs to help relieve stress and backache. If you want to know who patronises you, you can build a table of your customers and then group the data by occupation, age, home address and so on using a data analytics tool.
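The customer-grouping example above can be sketched in a few lines (the customer records are hypothetical):

```python
# Group massage-service customers by occupation so patterns
# in who patronises the business become visible.
from collections import defaultdict

customers = [
    {"name": "Asha", "occupation": "teacher", "age": 34},
    {"name": "Ravi", "occupation": "driver", "age": 41},
    {"name": "Meena", "occupation": "teacher", "age": 29},
]

by_occupation = defaultdict(list)
for c in customers:
    by_occupation[c["occupation"]].append(c["name"])

print(dict(by_occupation))
# → {'teacher': ['Asha', 'Meena'], 'driver': ['Ravi']}
```

The same grouping could be done by age band or home address; in practice a tool such as a spreadsheet pivot table or pandas performs this step.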
Quantitative techniques use mathematical and statistical tools and theories to manipulate numbers and obtain a result or pattern. Qualitative analytics, on the other hand, is interpretive: it uses non-numerical data such as images, audio, video, points of view, interviews or texts. More advanced data analytics tools include data mining, machine learning, text mining and big data analytics. Data analytics can also refer to software ranging from business intelligence (BI) to online analytical processing (OLAP).
Data analytics starts with defining the business objective, then collecting data, checking data quality, building an analytical model, and finally making a decision based on the outcome.
1. Business objective: Data analytics starts with understanding the final goal. The team needs to know what is required of it. This is the stage where the team plans, selects candidate datasets and establishes a project plan in line with company goals.
2. Collecting data: The team selects the data required for the intended analysis. Since data comes from different sources, the team has to check and collect the data most relevant to the information it is trying to uncover.
3. Checking data quality: The collected data is inspected and cleaned, so that inaccurate or incomplete records do not skew the analysis.
4. Building analytical models: Once the team ensures the data is clean, it gathers the data for analysis and builds analytical models. This is done with analytics software and programming languages such as Python, SQL, R and Scala. In most cases a test run is performed to check whether the outcome is close to, or in line with, the predicted outcome; if so, the team runs the full analysis.
5. Outcome and decision: In the final stage the result is evaluated. The team checks the accuracy of the results and the degree of error generated. The result is then deployed, a report is written, and the team performs a final check on the project as a whole, termed a project review. Once this is done, the observations and results are passed to management to support an informed decision.
Data Analysis
EDUCBA defines data analysis as "extracting, cleaning, transforming, modeling and visualization of data with an intention to uncover meaningful and useful information that can help in deriving conclusion and take decisions". This definition is comprehensive and covers every aspect of data analysis. However, John Tukey, a world-renowned statistician, added that data analysis also includes making the results more precise or accurate over time.
Data analysis is often used interchangeably with data analytics, but there are slight differences between them. In the definition of analytics we saw that it involves the use of specialised software and tools. Data analysis is the broader term and fully engulfs data analytics; in other words, data analytics is a subcomponent of data analysis.
Data analysis involves both technical and non-technical tools. It proceeds in several stages, and the phases can be iterated to improve accuracy and obtain better results. Data analysis is very broad and different teams work on different aspects, but the most common steps are: putting a team together, understanding the business objective, data collection, data cleaning, data manipulation, communication, and optimise-and-repeat.
1. Put a team together: In testing any hypothesis, the first step is to put together a team that will carry out the analysis.
2. Business objective: The problem facing the business is put to the team. This serves as the background for the analysis on which the team hopes to form a hypothesis.
3. Data collection: Once the team understands the business objective, it sets out to collect the data needed.
4. Data cleaning: This is a crucial step: identifying inaccurate or incomplete data and deleting or modifying it. Dirty data can lead to wrong conclusions, which can be fatal for a business, so the team must ensure the data is as clean as possible. This is the stage at which the data is inspected.
5. Data manipulation: In this stage the data is subjected to mathematical and statistical methods, algorithms and modelling, and is transformed from one structure to another.
6. Optimise, communicate and repeat: Before communicating results and reports to management, the team optimises the analysis by checking for and accounting for errors due to calculation or the mathematical methods used. Once the results are ready, the team presents its findings to management in the form of images, graphs or video. If the results require a new perspective, the team repeats the process from the beginning.
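The data-cleaning step in the list above can be sketched as a small filtering pass (the customer records and validity rules are hypothetical):

```python
# Data cleaning: drop records with missing or clearly invalid fields
# before they can distort the downstream analysis.
raw = [
    {"customer": "A01", "amount": 250.0},
    {"customer": "A02", "amount": None},   # incomplete record: drop
    {"customer": "",    "amount": 90.0},   # missing customer id: drop
    {"customer": "A03", "amount": -5.0},   # negative amount: drop
    {"customer": "A04", "amount": 120.0},
]

clean = [
    r for r in raw
    if r["customer"] and r["amount"] is not None and r["amount"] >= 0
]
print([r["customer"] for r in clean])  # → ['A01', 'A04']
```

In practice the team would log or correct rejected records rather than silently discarding them, since the causes of dirty data are themselves worth investigating.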