Notes of Week-1 and Week-2
for
STATISTICS FOR DATA SCIENCE - 1
Week-1 and 2
Contents
1 Statistics
1.1 Population and Sample
1.2 Major branches of statistics
1.3 Purpose of statistical analysis
2 Data
2.1 Unstructured and Structured Data
2.1.1 Variables and Cases
2.2 Classification of Data
2.2.1 Categorical Data and Numerical Data
2.2.1.1 Categorical Data
2.2.1.2 Numerical Data
2.2.2 Time-series and cross-sectional Data
2.2.3 Scales of measurement
2.2.3.1 Nominal scale of measurement
2.2.3.2 Ordinal scale of measurement
2.2.3.3 Interval scale of measurement
2.2.3.4 Ratio scale of measurement
Chapter 1
1 Statistics
Statistics is the art of learning from data. It is concerned with the collection of data, their
subsequent description, and their analysis, which often leads to the drawing of conclusions.
The following picture illustrates the relationship between a sample and a population:
Example:
Suppose a survey is conducted to find the prices of all houses in Tamil Nadu, and 1000
houses are randomly selected from the urban areas of Tamil Nadu for this study. It is
concluded that the price of a house per square foot is roughly Rs. 5680. Then the sample
consists of the 1000 selected houses from the urban areas of Tamil Nadu, and the population
consists of all houses in Tamil Nadu.
• Summarization of data means giving a numerical or graphical summary of the data, i.e.,
describing its main points.
• A descriptive study may be performed either on a sample or on a population.
• If the information is obtained from a sample of a population and the purpose of the
study is to use that information to draw conclusions/inferences about the population,
the study is inferential.
For example: A teacher wants to know the average marks of all students in the school.
Since there is a large number of students in the school, the teacher selects a sample
of students and calculates the average marks of the selected students, which is, say,
60 marks. The teacher then concludes (using statistical techniques) that the average
marks of all students in the school is 60. This type of study is called inferential
statistics because we are drawing a conclusion about the population based on the
sample data.
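The teacher example can be sketched in Python. This is only an illustration under stated assumptions: the marks are simulated (hypothetical data, not from the notes), and the names `population` and `sample`, the school size of 2000 and the sample size of 50 are all choices made here.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: marks of 2000 students, each between 20 and 100.
population = [random.randint(20, 100) for _ in range(2000)]

# The teacher only observes a random sample of 50 students.
sample = random.sample(population, 50)

sample_mean = mean(sample)          # computed from the sample
population_mean = mean(population)  # usually unknown in practice

print(f"Sample mean:     {sample_mean:.1f}")
print(f"Population mean: {population_mean:.1f}")
```

Reporting the sample mean on its own is a descriptive summary of the sample; using it as an estimate of the population mean is the inferential step.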
Chapter 2
2 Data
Definition
Data are the facts and figures collected, analyzed, and summarized for presentation and
interpretation.
• Purpose of collecting data:
Generally, we collect data when we are interested in understanding the characteristics or
attributes of some group or groups of people, places, things, or events.
For Example:
When data are scattered with no structure, i.e., not in any standard format, the
information is of very little use.
Structured Data
Structured data follows a standardized format for providing information about a dataset;
it is clearly defined and searchable. For the information in a dataset to be useful, we must
know the context of the numbers and text it holds. Structured data is also easy to analyze
and understand. Hence, we need to organize the data.
Let’s consider the following two examples:
(1) Dataset of students:
The student dataset shown in Table 1 can be considered structured data because it is
in tabular form and provides information about the Gender, Date of Birth, Marks in
10th class and Board of the students. This data is also easy to analyze and understand,
as we can easily get information about any student, e.g., Anjali scored 484 marks in
class 10 under the State Board, Pradeep is male and his date of birth is 3rd June 2002,
etc.
The fertilizers dataset shown in Table 2 can also be considered structured data because
it is in tabular form and provides information about fertilizers. This data is also easy
to analyze and understand, as we can easily get information, e.g., potassium is an
inorganic fertilizer and can be used for pulse in the amount of 320 kg, etc.
2.1.1 Variables and Cases
Case (observation): A case/observation is a unit for which data is collected. Cases should
uniquely identify each row in the dataset.
Variable: A variable is a characteristic or attribute that varies across units. Intuitively,
a variable is something that “varies”.
For example:
In Table 1 of the student dataset, each student, i.e., “Anjali, Pradeep, Divya, etc.”, is a
case, as data is collected for every student and the names uniquely identify each row in
the dataset.
The variables are “Name, Gender, Date of Birth, Board, etc.”, as their values keep
varying.
Note: The student dataset is in tabular form. If we want to organise data in tabular
form, then the following two points should be taken into consideration:
• Columns represent variables: For each variable, the same type of value is recorded for
each case.
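The idea of cases and variables can be sketched with a small table in Python. This is a hedged illustration: Table 1 is not reproduced here, so only the values mentioned in the text (Anjali's 484 marks and State Board, Pradeep being male) are taken from the notes, and every other value below is a hypothetical placeholder.

```python
# Each dict is one case (row); each key is one variable (column).
students = [
    # Anjali's marks and board come from the text; her gender is hypothetical.
    {"Name": "Anjali", "Gender": "F", "Marks": 484, "Board": "State Board"},
    # Pradeep's gender comes from the text; his marks and board are hypothetical.
    {"Name": "Pradeep", "Gender": "M", "Marks": 410, "Board": "CBSE"},
    # Divya is named in the text; all of her values are hypothetical.
    {"Name": "Divya", "Gender": "F", "Marks": 455, "Board": "ICSE"},
]

variables = list(students[0])              # column names
cases = [row["Name"] for row in students]  # one identifier per row

# Cases should uniquely identify each row.
names_are_unique = len(cases) == len(set(cases))

# Each column should hold the same type of value for every case.
column_types = {
    var: {type(row[var]).__name__ for row in students} for var in variables
}

print("Variables:", variables)
print("Cases:", cases)
print("Unique cases:", names_are_unique)
print("Types per column:", column_types)
```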
it has two categories, F and M. We can classify any observation into one of these two
categories.
Also, Board is a categorical variable since it has three categories, State Board, ICSE and
CBSE, and any observation can be classified into one of these three groups.
• If the data are observed at the same point in time, they are called cross-sectional data.
Example:
The data collected to observe the temperature of Delhi, Chennai, Jaipur and Bhopal
on a particular day are cross-sectional data, because the data are recorded at the same
time and observed for several places.
• Sometimes nominal variables are numerically coded; for example, we might code men as 1
and women as 2, or code men as 3 and women as 1.
• In short, “a nominal scale is just categories or labels that do not have any order.”
• Ratios of values have no meaning here because the value of zero is arbitrary.
Example:
Consider an air-conditioned room where the temperature is set at 20°C and the temperature
outside the room is 40°C. It is correct to say that the difference in temperature is 20°C,
but it is incorrect to say that it is twice as hot outdoors as indoors.
Also, temperature in degrees Fahrenheit or degrees Celsius has an interval scale of
measurement, because it has no absolute zero. On the Celsius scale, 0 and 100 are set as
the freezing point and the boiling point of water, whereas on the Fahrenheit scale they are
32 and 212.
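A short Python sketch makes the room example concrete. It uses the 20°C and 40°C readings from the example and the standard Celsius-to-Fahrenheit conversion; the function and variable names are chosen here for illustration.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

indoor_c, outdoor_c = 20.0, 40.0
indoor_f = celsius_to_fahrenheit(indoor_c)    # 68.0
outdoor_f = celsius_to_fahrenheit(outdoor_c)  # 104.0

# Differences are meaningful: 20 Celsius degrees is the same gap as 36 Fahrenheit degrees.
diff_c = outdoor_c - indoor_c
diff_f = outdoor_f - indoor_f

# Ratios are not meaningful: they depend on where the arbitrary zero is placed.
ratio_c = outdoor_c / indoor_c  # 2.0 in Celsius
ratio_f = outdoor_f / indoor_f  # about 1.53 in Fahrenheit

print(f"Difference: {diff_c} °C = {diff_f} °F")
print(f"Ratio in Celsius: {ratio_c:.2f}, ratio in Fahrenheit: {ratio_f:.2f}")
```

The same two temperatures give a ratio of 2 on one scale and about 1.53 on the other, which is why "twice as hot" has no meaning on an interval scale.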
2.2.3.4 Ratio scale of measurement
If the data have all the properties of interval data and the ratio of two values is meaningful,
then the scale of measurement is a ratio scale.
The ratio scale of measurement has an absolute zero, which is the key difference between
the ratio and interval scales.
Example: Height (in cm), weight (in kg), marks, etc. Such data can be added, subtracted,
multiplied or divided because each of them has an absolute zero.
A summary about all scales of measurement can be described as follows :
Unsolved Problems
(1) An analyst wants to conduct a survey for testing the maintenance of hospitals in a
particular district in Bihar, for which he selects 25 hospitals randomly from that district.
Identify the sample and population. [2 Marks]
(a) The population is all the hospitals in Bihar and the sample is all the hospitals in
the district.
(b) The population is all the hospitals in Bihar and the sample is 25 selected hospitals
in Bihar.
(c) The population is all hospitals in the district of Bihar and the sample is 25 selected
hospitals in the district.
(d) None of the above
Answer: c
(2) In the 2011 Cricket ODI World Cup quarter-final match between India and Australia,
a media organization estimated that Australia would beat India by 50 runs if Australia
bats first, based on the information of matches played between the two teams previously.
Which branch of statistics does the above analysis belong to?
Answer: Inferential Statistics
(3) Values of temperature and humidity of a room are measured for 24 hours at a regular
time interval of 30 minutes. Based on this information, choose the correct option:
Answer: b
Answer: a
(5) What kind of variable is the qualification of a candidate sitting for a job interview?
Answer: b
(6) If addition and subtraction can be performed on a variable, then the scale(s) of
measurement of the variable could be:
(a) Ordinal
(b) Ratio
(c) Interval
(d) Nominal
Answer: b, c
Answer: b, c, d
Chapter 3
(1) A, A, B, C, A, D, A, B, D, C
(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A
(3) A, B, B, C, A, D, B, B, D, C, A, B, C, D, B
(4) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D
Category   Tally mark   Frequency
A                       6
B                       3
C                       4
D                       5
Total                   18
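The frequency table above corresponds to dataset (4). As a minimal sketch, the counts can be reproduced in Python with `collections.Counter`; the simple stroke tally printed here (one stroke per observation, not grouped in fives) is a choice made for this illustration.

```python
from collections import Counter

# Dataset (4) from the list above.
data = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D".split(", ")

freq = Counter(data)  # maps each category to its frequency

print(f"{'Category':<10}{'Tally':<10}{'Frequency':>9}")
for category in sorted(freq):
    tally = "|" * freq[category]  # one stroke per observation
    print(f"{category:<10}{tally:<10}{freq[category]:>9}")
print(f"{'Total':<20}{sum(freq.values()):>9}")
```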
(1) A, A, B, C, A, D, A, B, D, C
(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A
Example: Consider the frequency table of the dataset A, A, B, C, A, D, A, B, D, C.
Table 3.1
Figure 3.1 is the pie chart representation of the dataset in Table 3.1:
A pie chart shows each category's share of the whole pie: the share of category A is 40%,
and the shares of categories B, C and D are 20% each.
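The shares quoted for the pie chart follow directly from the frequencies. A small Python sketch, using the ten-value dataset of this example (variable names are chosen here):

```python
from collections import Counter

data = ["A", "A", "B", "C", "A", "D", "A", "B", "D", "C"]

freq = Counter(data)
total = len(data)

# Share of the pie for each category, as a percentage of all observations.
shares = {category: 100 * count / total for category, count in freq.items()}

for category in sorted(shares):
    print(f"{category}: {shares[category]:.0f}%")
```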
Example: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D
Table 3.2
Figure 3.2 represents the bar chart of the dataset in Table 3.2 as follows:
Category   Frequency   Relative frequency
A          6           0.33
B          3           0.17
C          4           0.22
D          5           0.28
Total      18          1
Table 3.3
Figure 3.3 is the Pareto chart representation of the dataset in Table 3.3 as follows:
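The relative frequencies of Table 3.3 and the bar order of a Pareto chart can be sketched in Python as follows. The sketch assumes the usual convention that a Pareto chart lists categories in decreasing order of frequency; the variable names are chosen here.

```python
from collections import Counter

data = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D".split(", ")

freq = Counter(data)
n = len(data)

# Relative frequency = frequency / total number of observations.
relative = {category: round(freq[category] / n, 2) for category in sorted(freq)}

# A Pareto chart draws the bars from the most to the least frequent category.
pareto_order = sorted(freq, key=lambda category: freq[category], reverse=True)

print("Relative frequencies:", relative)
print("Pareto order:", pareto_order)
```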
Note: If the categorical variable is ordinal, then the bar chart must preserve the ordering.
For example:
The T-shirt sizes L, M, M, S, L, S, S, M, L, M, M, S, S, L, M, S, M, S, L, M of twenty
students are listed in Table 3.4:
Table 3.4
The dataset of Table 3.4 is ordinal, so we have preserved the order of the categories.
The bar chart representation of the dataset of Table 3.4 is given as follows:
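As a minimal sketch, the ordered frequency table for the T-shirt sizes can be built in Python by listing the categories in their natural order rather than alphabetically or by frequency. The order S, M, L and the text bars made of `#` characters are choices made for this illustration.

```python
from collections import Counter

sizes = "L, M, M, S, L, S, S, M, L, M, M, S, S, L, M, S, M, S, L, M".split(", ")

# Natural ordering of the ordinal categories: small < medium < large.
order = ["S", "M", "L"]

freq = Counter(sizes)

# Print the bars in the order of the scale, not in order of frequency.
for size in order:
    print(f"{size:<3}{'#' * freq[size]:<10}{freq[size]}")
```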
Now, we can group the other categories together as follows:
Grouping the other categories together into one major category conveys two important things.
3.4 The Area Principle
The area principle says that the area occupied by a part of the graph should correspond to
the amount of data it represents.
Displays of data must obey the area principle; violations of the area principle are
a common way to mislead with statistics.
Figure 3.5 gives us the total wine exports in the UK, Canada, Japan and Italy. But there
is no baseline, and the chart shows bottles on top of labeled boxes of various sizes and
shapes.
Now, Figure 3.6 shows the chart without decoration:
Figure 3.6
We have labeled each of the categories. The chart is accurate and has a baseline. It is
consistent, and the bars for all countries have equal width. Also, the area occupied by
each part of the graph is proportional to the data being presented.
Figure 3.7
The pie chart in Figure 3.7 violates the area principle, as the areas occupied by the sales
distributions of HTC and Apple do not correspond to the amount of data they represent.
The left graph exaggerates the numbers as its baseline is not at zero, whereas the graph
on the right shows the same data with the baseline at zero.
(2) The following figure represents the share of votes in an election in the USA.
From the lengths of the bars, it appears that the Republican Party's voting percentage is
less than half that of the Democratic Party, but the actual numbers show that this is not
the case.
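The effect of a baseline that is not at zero can be sketched numerically. The figures themselves are not reproduced here, so the vote shares and the baseline of 42 below are hypothetical values chosen only to illustrate the effect; `apparent_ratio` is a helper name introduced for this sketch.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths for values a and b when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Hypothetical vote shares in percent (not taken from the figure).
party_1, party_2 = 46.0, 52.0

true_ratio = apparent_ratio(party_1, party_2)                 # baseline at zero
drawn_ratio = apparent_ratio(party_1, party_2, baseline=42)   # truncated baseline

print(f"Ratio of actual values:       {true_ratio:.2f}")
print(f"Ratio of drawn bar lengths:   {drawn_ratio:.2f}")
```

With the baseline at 42, the first bar is drawn at less than half the length of the second even though the actual values differ by only a few percentage points.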
3.4.3 Manipulated y-axis
Expanding or compressing the scale on a graph so that changes in the data seem more or less
significant than they actually are is known as manipulation of the y-axis.
For example: The following bar charts represent the number of sales of smartphones A and B
at a local shop.
Figure 3.8
Figure 3.9
From Figure 3.8, we get the impression that a significant number of both smartphones is
being sold, but from Figure 3.9 it seems that the sales of smartphones A and B are very low.
So, the graph in Figure 3.9 is misleading because it has a manipulated y-axis.
Figure 3.10
Category   Percentage
A          22.3
B          35.6
C          12.6
D          11
E          18.5
Total      100
In the table, the total is 100%.
Suppose we round off the values and draw a pie chart as follows:
This pie chart has round-off errors because the sum of all entries is 100.5%, which is
different from 100%.
• Numbers that are used to describe data sets are called descriptive measures.
• Descriptive measures that indicate where the center or most typical value of a data set
lies are called measures of central tendency.
3.5.1 Mode
The mode of a categorical variable is the most common category, the category with the
highest frequency.
Mode labels the longest bar in a bar chart, the widest slice in a pie chart and the first
category shown in a Pareto chart.
Example: Let’s consider the dataset A, A, B, C, A, D, A, B, C, C, A, B, C, D, A.
Here, category A is the mode of the data as it occurs with the highest frequency.
Now, Figures 3.11, 3.12 and 3.13 show the bar chart, pie chart and Pareto chart for the
dataset:
(1) The bar chart representation of the above dataset is:
Figure 3.11
In Figure 3.11, category A has the longest bar. Thus, the mode of the dataset is category
“A”.
(2) The pie chart representation of the above dataset is:
Figure 3.12
In the above pie chart, category A has the widest slice. Thus, the mode of the dataset is
category “A”.
(3) The Pareto chart for the above dataset is:
Figure 3.13
In the above Pareto chart, the first bar is for category A. Thus, the mode of the dataset is
category “A”.
In the above bar chart, both categories “A” and “C” have the highest frequency.
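Finding the mode amounts to picking the category (or categories) with the highest frequency. A minimal Python sketch, using the dataset A, A, B, C, A, D, A, B, C, C, A, B, C, D, A from the example; the helper name `modes` and the small tied dataset are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return every category that occurs with the highest frequency."""
    freq = Counter(values)
    highest = max(freq.values())
    return sorted(category for category, count in freq.items() if count == highest)

data = "A, A, B, C, A, D, A, B, C, C, A, B, C, D, A".split(", ")
print(modes(data))  # single mode

# Hypothetical dataset in which two categories tie for the highest frequency.
tied = ["A", "C", "A", "C", "B"]
print(modes(tied))  # more than one mode
```

Returning a list rather than a single value keeps the case of tied categories visible instead of silently picking one of them.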
3.5.2 Median
The median of an ordinal variable is the category of the middle observation of the sorted
values.
If there is an even number of observations, then we can choose the category on either side
of the middle of the sorted list as the median.
Examples:
Note: Median can be defined only for ordinal data whereas mode can be defined for both
nominal as well as ordinal data.
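The median of an ordinal variable can be sketched in Python by sorting the observations according to the order of the categories and reading off the middle. The helper name `ordinal_medians`, the order S, M, L and the four-value dataset are assumptions made for this illustration; the twenty T-shirt sizes are those of Table 3.4.

```python
def ordinal_medians(values, order):
    """Return the median category, or both middle categories if they differ."""
    rank = {category: position for position, category in enumerate(order)}
    ordered = sorted(values, key=lambda v: rank[v])  # sort by the ordinal scale
    n = len(ordered)
    if n % 2 == 1:
        return [ordered[n // 2]]
    lower, upper = ordered[n // 2 - 1], ordered[n // 2]
    return [lower] if lower == upper else [lower, upper]

order = ["S", "M", "L"]

sizes = "L, M, M, S, L, S, S, M, L, M, M, S, S, L, M, S, M, S, L, M".split(", ")
print(ordinal_medians(sizes, order))

# Hypothetical even-sized dataset where the two middle values differ;
# either of the returned categories may be chosen as the median.
print(ordinal_medians(["S", "S", "M", "L"], order))
```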
Unsolved Problems
(1) If an analyst wants to represent the revenues of various companies using graphs, then
which of the following graphical representation(s) is/are most appropriate for the
purpose? (More than one option can be correct)
(a) A pie chart with a pie/slice for each company and the width corresponding to its
revenue in crore rupees.
(b) A bar chart with a bar for each company on the x-axis and the length corresponding
to its revenue in crore rupees on the y-axis.
(c) A bar chart with a bar for each company on the y-axis and the length corresponding
to its revenue in crore rupees on the x-axis.
(d) A bar chart with the minimum revenue as a baseline.
Answer: b, c
(2) Mode of a categorical variable is: (More than one option can be correct)
(a) The last bar in ascending order of a Pareto chart.
(b) The middle-most bar in a Pareto chart.
(c) The longest bar in a bar chart.
(d) The widest slice in a pie chart.
Answer: a, c, d
(3) Which of the following can be defined for both nominal and ordinal data?
(a) Mean
(b) Median
(c) Mode
(d) All of the above
Answer: c
A total of 2000 cases of Covid-19 were registered on 5th May 2020 in 5 key districts
of Maharashtra. The proportion (out of the 5 districts) of cases in each district is
listed in Table 2.1.A. Based on the information given, answer questions (4) and (5).
(4) Find the relative frequency of district Nagpur.
Answer: 0.12