
Notes of Week-1 and Week-2

The document discusses key concepts in statistics and data science, including:
- It defines statistics, population, sample, descriptive statistics, and inferential statistics.
- It differentiates between unstructured and structured data, and provides examples of each.
- It describes different types of data like categorical, numerical, time-series, and cross-sectional data. It also defines nominal, ordinal, interval, and ratio scales of measurement.
- Frequency distributions and different charts to describe categorical data like pie charts, bar charts, and Pareto charts are discussed. Issues like violating the area principle and misleading graphs are also covered.

Uploaded by

ram7177

Notes

for
STATISTICS FOR DATA SCIENCE - 1
Week-1 and 2
Contents
1 Statistics
1.1 Population and Sample
1.2 Major branches of statistics
1.3 Purpose of statistical analysis

2 Data
2.1 Unstructured and Structured Data
2.1.1 Variables and Cases
2.2 Classification of Data
2.2.1 Categorical Data and Numerical Data
2.2.1.1 Categorical Data
2.2.1.2 Numerical Data
2.2.2 Time-series and cross-sectional Data
2.2.3 Scales of measurement
2.2.3.1 Nominal scale of measurement
2.2.3.2 Ordinal scale of measurement
2.2.3.3 Interval scale of measurement
2.2.3.4 Ratio scale of measurement

3 Describing categorical data: Frequency distribution
3.1 Frequency Distribution
3.2 Relative frequency
3.3 Charts of categorical data
3.3.1 Pie Chart
3.3.2 Bar Chart
3.3.3 Pareto Chart
3.4 The Area Principle
3.4.1 Misleading graphs: violating area principle
3.4.2 Misleading graphs: truncated graphs
3.4.3 Manipulated y-axis
3.4.4 Indicating a y-axis break
3.4.5 Round-off errors
3.5 Summarizing Categorical Data
3.5.1 Mode
3.5.1.1 Bimodal and Multimodal data
3.5.2 Median

Chapter 1

1 Statistics
Statistics is the art of learning from data. It is concerned with the collection of data, their
subsequent description, and their analysis, which often leads to the drawing of conclusions.

1.1 Population and Sample


Population
The total collection of all the elements that we are interested in is called a population.
Sample
A subgroup of the population that will be studied in detail is called a sample.

The relationship between a sample and a population can be understood from the following picture:

Example:
Suppose a survey is conducted to find the prices of all houses in Tamil Nadu, and 1000
houses are randomly selected from the urban areas of Tamil Nadu for this study. It is
concluded that the price of a house per square foot is roughly Rs. 5680. Then, the sample
consists of the 1000 selected houses from the urban areas of Tamil Nadu and the population
consists of all houses in Tamil Nadu.

1.2 Major branches of statistics


1. Descriptive Statistics
The part of statistics concerned with the description and summarization of data is called
descriptive statistics.

• Summarization of data means giving a numerical or graphical summary of the data, or
describing its main points.
• A descriptive study may be performed on either sample data or population data.

2. Inferential Statistics


The part of statistics concerned with drawing conclusions from the data is called inferential
statistics.

1.3 Purpose of statistical analysis


• If the purpose of the analysis is to examine and explore information about the collected
data only, then the study is descriptive.
For Example: A class of 50 students took an exam (of 100 marks) and the average
mark of the class is calculated as 65. This type of study is called descriptive statistics
because here we are just summarizing the data (calculating the average mark of the
whole class).

• If the information is obtained from a sample of a population and the purpose of the
study is to use that information to draw conclusions/inferences about the population,
the study is inferential.
For Example: A teacher wants to know the average mark of all students in the school.
Since there is a large number of students in the school, the teacher selects a sample
of students from the school and calculates the average mark of the selected students,
which is, say, 60. Then, the teacher concludes (using statistical techniques) that the
average mark of all students in the school is approximately 60. This type of study is
called inferential statistics because here we are drawing a conclusion about the
population based on the sample data.
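
The difference between the two kinds of study can be sketched in a few lines of Python. This is only an illustration: the population of 2000 marks is randomly generated and hypothetical, and the sample size of 100 is an assumption, not something taken from the notes.

```python
import random
import statistics

# Hypothetical population: marks (out of 100) of 2000 students in a school.
random.seed(0)  # fixed seed so that the sketch is reproducible
population = [random.randint(30, 90) for _ in range(2000)]

# Descriptive statistics: summarize the data we actually hold.
population_mean = statistics.mean(population)

# Inferential statistics: use a random sample to estimate the population mean.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"Population mean (usually unknown): {population_mean:.1f}")
print(f"Sample mean (our estimate):        {sample_mean:.1f}")
```

In practice the population mean is not available, which is why the sample mean is used as an estimate of it.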

Chapter 2

2 Data
Definition
Data are the facts and figures collected, analyzed, and summarized for presentation and
interpretation.
• Purpose of collecting data:
Generally, we collect data when we are interested in understanding the characteristics or
attributes of some group or groups of people, places, things, or events.
For Example:

(1) To know about temperatures in a particular month in Chennai, India.

(2) To know about the marks obtained by students in their Class X.

2.1 Unstructured and Structured Data


Unstructured Data
Unstructured data is a dataset that is not organized in a predefined manner. Unstructured
information is typically text-heavy, but may contain data such as dates, numbers, and facts
as well. Also, unstructured data requires more work to process and understand.
For Example: YouTube comments, image files, social-media posts, lyrics of a song, etc.

When data are scattered with no structure, i.e., not in any standard format, the
information is of very little use.

Structured Data
Structured data is data organized in a standardized format that is clearly defined and
searchable. For the information in a dataset to be useful, we must know the context of the
numbers and text it holds. Structured data is easy to analyze and understand; hence, we
need to organize the data.
Let’s consider the following two examples:

(1) Dataset of students:

Name      Gender   Date of Birth    Marks in class 10th   Board
Anjali    F        17 Feb, 2003     484                   State Board
Pradeep   M        3 June, 2002     514                   ICSE
Divya     F        22 Mar, 2003     397                   State Board
Sarita    F        19 May, 2002     533                   ICSE
Harsha    M        4 March, 2002    436                   CBSE
Bhavana   F        7 Apr, 2003      526                   State Board
Rohit     M        4 March, 2002    378                   CBSE
Vikash    M        11 Oct, 2001     526                   CBSE

Table 1: Student dataset

The student dataset shown in Table 1 can be considered structured data because it
is in tabular form and provides information about the Gender, Date of Birth,
Marks in class 10th and Board of the students. Also, this data is easy to analyze and
understand, as we can easily get information about any student, e.g. Anjali scored
484 marks in class 10th under the State Board, Pradeep is male and his date of birth is
3 June, 2002, etc.

(2) Dataset of fertilizers:

Fertilizers   Types of Fertilizers   Area of fields (in acres)   Types of Crops   Amount of fertilizers (in Kg)
Nitrogen      Inorganic              1                           Rice             200
Phosphorus    Inorganic              2                           Wheat            400
Manure        Organic                1.5                         Potato           300
Compost       Organic                1.3                         Rice             260
Potassium     Inorganic              1.6                         Pulse            320

Table 2: Fertilizers dataset

The fertilizers dataset shown in Table 2 can also be considered structured data because
it is in tabular form and provides information about fertilizers. Also, this
data is easy to analyze and understand, as we can easily get information from it, e.g.
Potassium is an inorganic fertilizer and can be used for pulse in the amount of 320 Kg, etc.

2.1.1 Variables and Cases
Case (observation): A case/observation is a unit for which data is collected. Cases should
uniquely identify each row in the dataset.
Variable: A variable is a characteristic or attribute that varies across units. Intuitively,
a variable is something that “varies”.
For Example:
In Table 1 (the student dataset), each student, i.e., Anjali, Pradeep, Divya, etc., is a case,
as data is collected for every student and the names uniquely identify each row in the
dataset.
The variables are Name, Gender, Date of Birth, Board, etc., as their values keep
varying from case to case.
Note: The student dataset is in tabular form. If we want to organise data in tabular
form, then the following two points should be taken into consideration:

• Rows represent cases: For each case, the same attributes are recorded.

• Columns represent variables: For each variable, the same type of value is recorded for
each case.
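
As an illustration of the rows-as-cases and columns-as-variables idea, the first three rows of the student dataset can be written as a list of dictionaries in Python. The choice of a list of dictionaries and the shortened key "Marks" are assumptions made for this sketch, not part of the notes.

```python
# Each dictionary is one case (row); each key is a variable (column).
students = [
    {"Name": "Anjali", "Gender": "F", "Date of Birth": "17 Feb, 2003",
     "Marks": 484, "Board": "State Board"},
    {"Name": "Pradeep", "Gender": "M", "Date of Birth": "3 June, 2002",
     "Marks": 514, "Board": "ICSE"},
    {"Name": "Divya", "Gender": "F", "Date of Birth": "22 Mar, 2003",
     "Marks": 397, "Board": "State Board"},
]

# The variables are the keys shared by every case.
variables = list(students[0].keys())

# Reading one column gives the values of one variable across all cases.
marks = [case["Marks"] for case in students]

print(variables)
print(marks)
```

Every case records the same variables, and every value in the "Marks" column has the same type, which is what makes the data structured.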

2.2 Classification of Data


Data is broadly classified into two categories: categorical data and numerical data.

2.2.1 Categorical Data and Numerical Data


2.2.1.1 Categorical Data
Categorical data are also called qualitative variables; they identify group membership.
We cannot perform any meaningful mathematical operations on them.
In the student dataset illustrated in Table 1, Gender is a categorical variable because
it has two categories, F and M. We can classify any observation into one of these two
categories.
Also, Board is a categorical variable since it has three categories, State Board, ICSE and
CBSE, and any observation can be categorized into one of these three groups.

2.2.1.2 Numerical Data


Numerical data are also called quantitative variables. They describe numerical properties
of the data, i.e., we can perform mathematical operations on them.
In the student dataset of Table 1, Marks is a numerical variable because we can describe
numerical properties of the data, e.g. the marks of Rohit are 378, the marks of Pradeep
are 514, or the marks of Bhavana are more than the marks of Harsha, etc.
• Measurement units
The scale (unit of measurement) defines the meaning of numerical data, such as weights
measured in kilograms, prices in rupees, heights in centimeters, etc.
Also, the data that make up a numerical variable in a data table must share a common unit.

2.2.2 Time-series and cross-sectional Data


• If the data is recorded over a period of time, then it is called time-series data. A
graph of a time series showing values in chronological order is known as a time plot.
Example:
The data collected to observe the temperature in Delhi on seven different days is
time-series data, because the data is recorded for only one place (i.e. Delhi) and it is
recorded over a period of time (i.e. seven different days).

• If the data is observed at the same point in time, then it is called cross-sectional data.
Example:
The data collected to observe the temperature of Delhi, Chennai, Jaipur and Bhopal
on a particular day is cross-sectional data, because the data is recorded at the same
time and it is observed for several places.
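
The two examples above differ only in what varies between observations, which a small Python sketch can make concrete. All temperature values below are hypothetical and chosen only for illustration.

```python
# Time-series data: one place (Delhi), observed on seven different days.
# The temperatures (in degrees Celsius) are hypothetical.
delhi_by_day = {
    "Day 1": 31.0, "Day 2": 32.5, "Day 3": 30.0, "Day 4": 33.5,
    "Day 5": 34.0, "Day 6": 32.0, "Day 7": 31.5,
}

# Cross-sectional data: several places, all observed on the same day.
one_day_by_city = {"Delhi": 31.0, "Chennai": 34.5, "Jaipur": 33.0, "Bhopal": 30.5}

# A time plot draws the time-series values in chronological order.
time_plot_points = list(delhi_by_day.items())

print(time_plot_points)
print(one_day_by_city)
```

In the first dictionary the keys are points in time; in the second they are places observed at one point in time.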

2.2.3 Scales of measurement


We have four scales of measurement called nominal, ordinal, interval and ratio scale. Data
collection requires any one of the scales of measurement.

2.2.3.1 Nominal scale of measurement


When the data for a variable consist of labels or names used to identify the characteristic of
an observation, the scale of measurement is considered a nominal scale.
Example: Name, Board, Gender, Blood group etc.
Note:

• Sometimes nominal variables are numerically coded, e.g. we might code men as 1
and women as 2, or code men as 3 and women as 1.

• There is no ordering in the variable.

• In short, “a nominal scale is just categories or labels which do not have any order.”

2.2.3.2 Ordinal scale of measurement


When data exhibit the properties of nominal data and the order or rank of the data is
meaningful, the scale of measurement is considered an ordinal scale.
Example:
Each customer who visits a restaurant provides a service rating of excellent, good, or poor.
Here, the data obtained are the labels excellent, good, or poor, i.e., the data have the
properties of nominal data. Also, the data can be ranked/ordered with respect to the service
quality.
Note:
• We can code an ordinal scale of measurement: poor can be coded as 1, good as 2
and excellent as 3. There is an order in 1, 2, 3, but the distance between poor and good
need not be the same as the distance between good and excellent. It is just an order:
we know that excellent is better than good, but we cannot say that the difference between
good and excellent is the same as the difference between good and poor.

• In short, “an ordinal scale is just categories or labels which have an order.”

2.2.3.3 Interval scale of measurement


If the data have all the properties of ordinal data and the interval between values is expressed
in terms of a fixed unit of measure, then the scale of measurement is an interval scale.
Note:
• Data with an interval scale of measurement are always numeric and we can find the
difference between any two values.

• Ratios of values have no meaning here because the value of zero is arbitrary.
Example:
Consider an air-conditioned room where the temperature is set at 20°C and the temperature
outside the room is 40°C. It is correct to say that the difference in temperature is 20°C,
but it is incorrect to say that it is twice as hot outdoors as indoors.
Also, temperature in degrees Fahrenheit or degrees Celsius has an interval scale of
measurement, because neither scale has an absolute zero. On the Celsius scale, 0 and 100 are
set as the freezing point and the boiling point of water, whereas on the Fahrenheit scale
these are 32 and 212.
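
One way to see why ratios are not meaningful on an interval scale is to convert the same two temperatures to Fahrenheit and compare the ratios. The sketch below uses the standard conversion F = C × 9/5 + 32; the function and variable names are chosen only for this illustration.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

indoor_c, outdoor_c = 20.0, 40.0
indoor_f = celsius_to_fahrenheit(indoor_c)
outdoor_f = celsius_to_fahrenheit(outdoor_c)

# The ratio changes with the unit, so "twice as hot" has no meaning.
ratio_c = outdoor_c / indoor_c
ratio_f = outdoor_f / indoor_f

# The difference stays consistent: it is only rescaled by the unit factor 9/5.
diff_c = outdoor_c - indoor_c
diff_f = outdoor_f - indoor_f

print(f"Ratio in Celsius:    {ratio_c:.2f}")
print(f"Ratio in Fahrenheit: {ratio_f:.2f}")
print(f"Difference: {diff_c} degrees C = {diff_f} degrees F")
```

The ratio is 2 in Celsius but only about 1.53 in Fahrenheit, whereas the difference of 20°C corresponds to 36°F on both readings of the same physical gap.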

2.2.3.4 Ratio scale of measurement
If the data have all the properties of interval data and the ratio of two values is meaningful,
then the scale of measurement is a ratio scale.
A ratio scale of measurement has the absolute zero property, which is the key difference
between the ratio and interval scales.
Example: Height (in cm), Weight (in kg), Marks, etc. Data such as height,
weight and marks can be added, subtracted, multiplied or divided, as they all have the
absolute zero property.
A summary of all the scales of measurement can be described as follows:

Unsolved Problems

(1) An analyst wants to conduct a survey for testing the maintenance of hospitals in a
particular district in Bihar, for which he selects 25 hospitals randomly from that district.
Identify the sample and population. [2 Marks]

(a) The population is all the hospitals in Bihar and the sample is all the hospitals in
the district.
(b) The population is all the hospitals in Bihar and the sample is 25 selected hospitals
in Bihar.
(c) The population is all hospitals in the district of Bihar and the sample is 25 selected
hospitals in the district.
(d) None of the above

Answer: c

(2) In the 2011 Cricket ODI World Cup quarter-final match between India and Australia,
a media organization estimated that Australia would beat India by 50 runs if Australia
bats first, based on the information of matches played between the two teams previously.
Which branch of statistics does the above analysis belong to?
Answer: Inferential Statistics

(3) Values of temperature and humidity of a room are measured for 24 hours at a regular
time interval of 30 minutes. Based on this information, choose the correct option:

(a) It is a cross-sectional data.


(b) It is time-series data.

Answer: b

(4) What kind of data is “Social media posts”?

(a) Unstructured data


(b) Structured data

Answer: a

(5) What kind of variable is the qualification of a candidate sitting for a job interview?

(a) Numerical/ Quantitative


(b) Categorical/ Qualitative
(c) Numerical and discrete
(d) Numerical and continuous

Answer : b

(6) If addition, subtraction can be performed on a variable, then the scale(s) of measurement
of the variable could be:

(a) Ordinal
(b) Ratio
(c) Interval
(d) Nominal

Answer : b, c

(7) Which of the following variable(s) have nominal scale of measurement?

(a) Education qualification of a person.


(b) Hair color
(c) Brand name of mobile phone
(d) Number plate of cars

Answer: b, c, d

Chapter 3

3 Describing categorical data: Frequency distribution


3.1 Frequency Distribution
A frequency distribution of qualitative data is a listing of the distinct values and their
frequencies.
Each row of a frequency table lists a category along with the number of cases in this category.

Example: Let’s construct a frequency table for the following data.

(1) A, A, B, C, A, D, A, B, D, C

Category   Tally mark   Frequency
A          ||||         4
B          ||           2
C          ||           2
D          ||           2
Total                   10

(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A

Category   Tally mark   Frequency
A          ||||| |      6
B          |||          3
C          |||          3
D          |||          3
Total                   15

(3) A, B, B, C, A, D, B, B, D, C, A, B, C, D, B

Category   Tally mark   Frequency
A          |||          3
B          ||||| |      6
C          |||          3
D          |||          3
Total                   15

(4) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category   Tally mark   Frequency
A          ||||| |      6
B          |||          3
C          ||||         4
D          |||||        5
Total                   18
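
The counting behind a frequency table can also be done programmatically. The following is a minimal sketch for dataset (4), assuming Python's standard `collections.Counter` is an acceptable tool; the variable names are illustrative.

```python
from collections import Counter

# Dataset (4) from the examples above.
data = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D".split(", ")

# Counter maps each distinct category to the number of times it occurs.
frequency = Counter(data)

print("Category  Frequency")
for category in sorted(frequency):
    print(f"{category:<9} {frequency[category]}")
print(f"{'Total':<9} {sum(frequency.values())}")
```

The printed counts match the last table: A 6, B 3, C 4, D 5, with a total of 18.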

3.2 Relative frequency


The ratio of the frequency to the total number of observations is called relative frequency.
Note: Relative frequency plays an important role in comparing two data sets. Because
relative frequencies always fall between 0 and 1, they provide a standard for comparison.
Examples: Let us find the relative frequencies for the following data.

(1) A, A, B, C, A, D, A, B, D, C

Category Frequency Relative Frequency


A 4 0.4
B 2 0.2
C 2 0.2
D 2 0.2
Total 10 1

(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A

Category Frequency Relative Frequency


A 6 0.4
B 3 0.2
C 3 0.2
D 3 0.2
Total 15 1
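
The same calculation can be sketched in Python by dividing each frequency by the number of observations. The helper name `relative_frequencies` is hypothetical and introduced only for this example.

```python
from collections import Counter

def relative_frequencies(values):
    """Return each category's frequency divided by the number of observations."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

# Dataset (2) from the examples above.
data = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A".split(", ")
rel = relative_frequencies(data)

for category in sorted(rel):
    print(f"{category}  {rel[category]:.2f}")
```

Although datasets (1) and (2) have different sizes (10 and 15 observations), their relative frequencies are identical, which is what makes relative frequency useful for comparison.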

3.3 Charts of categorical data


The two most common displays of a categorical variable are a bar chart and a pie chart.

3.3.1 Pie Chart


A pie chart is a circle divided into pieces proportional to the relative frequencies of the
qualitative data. It is used to show the proportions of a categorical variable, and it is
a good way to show that one category makes up more than half of the total.

Example: Consider the frequency table of the dataset A, A, B, C, A, D, A, B, D, C.

Category Frequency Relative Frequency


A 4 0.4
B 2 0.2
C 2 0.2
D 2 0.2
Total 10 1

Table 3.1

Figure 3.1 is the pie chart representation of the dataset in Table 3.1:

Figure 3.1: Pie chart representation

Since a pie chart shows each category’s share of the whole, the share of category A is 40%,
category B is 20%, category C is 20% and category D is 20%.

3.3.2 Bar Chart


A bar chart displays the distinct values of the qualitative data on a horizontal axis and the
relative frequencies (or frequencies or percents) of those values on a vertical axis. The
frequency/relative frequency of each distinct value is represented by a vertical bar whose
height is equal to the frequency/relative frequency of that value. The bars should be
positioned so that they do not touch each other.
A bar chart is most appropriate for representing the count of each category, and it can be
oriented either horizontally or vertically.

Example: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category Frequency Relative frequency


A 6 0.33
B 3 0.17
C 4 0.22
D 5 0.28
Total 18 1

Table 3.2

Figure 3.2 represents the bar chart of the dataset in Table 3.2 as follows:

Figure 3.2: Bar chart representation

3.3.3 Pareto Chart


When the categories in a bar chart are sorted by frequency, the bar chart is sometimes called
a Pareto chart. Pareto charts are popular in quality control to identify problems in a business
process.
Example: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category Frequency Relative frequency
A 6 0.33
B 3 0.17
C 4 0.22
D 5 0.28
Total 18 1

Table 3.3

Figure 3.3 is the Pareto chart representation of the dataset in Table 3.3:

Figure 3.3: Pareto chart representation
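
The ordering used by a Pareto chart can be sketched by sorting the categories of Table 3.3 by frequency. The text-based bars below are only a stand-in for a drawn chart, and the use of `Counter.most_common` is a choice made for this illustration.

```python
from collections import Counter

data = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D".split(", ")

# most_common() returns (category, frequency) pairs sorted by decreasing frequency.
pareto_order = Counter(data).most_common()

for category, count in pareto_order:
    print(f"{category} {'#' * count} ({count})")
```

The bars appear in the order A, D, C, B, so the most frequent category always comes first.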

Note: If the categorical variable is ordinal, then the bar chart must preserve the ordering.
For example:
The T-shirt sizes L, M, M, S, L, S, S, M, L, M, M, S, S, L, M, S, M, S, L, M of twenty
students are summarized in Table 3.4:

Size Frequency Relative frequency


Small 7 0.35
Medium 8 0.40
Large 5 0.25
Total 20 1

Table 3.4

The dataset of Table 3.4 is ordinal, so we have preserved the order of the categories.
The bar chart representation of the dataset of Table 3.4 is given as follows:

Figure 3.4: Bar chart of Ordinal data

Purpose of using charts


(1) Pie charts are best used when we are trying to compare parts of a whole.
(2) Bar graphs are used to compare things between different groups.
Many Categories:
A bar chart or pie chart with too many categories might conceal the more important
categories. In some cases, the smaller categories might be grouped together.
Now, let’s consider the following bar chart with too many categories:

Now, we can group the other categories together as follows:

Grouping other categories together in a major category conveys two important things.

(1) We are not excluding any data.

(2) We have a significant number that comes from smaller categories.

3.4 The Area Principle
The area principle says that the area occupied by a part of the graph should correspond to
the amount of data it represents.
A display of data must obey the area principle; violations of the area principle are
a common way to mislead with statistics.

3.4.1 Misleading graphs: violating area principle


(1) Decorated graphs: Sometimes charts are decorated to attract attention, which often
violates the area principle.
For Example: Figure 3.5 is an example of a decorated graph:

Figure 3.5: Decorated graph

Figure 3.5 gives us the total wine exports in the UK, Canada, Japan and Italy. But there
is no baseline, and the chart shows bottles on top of labeled boxes of various sizes and
shapes.

Now, Figure 3.6 represents the chart which is not decorated:

Figure 3.6

We have labeled each of the categories. The chart is accurate and it has a baseline. It
is also consistent: the widths of the bars for all countries are equal. Also,
the area occupied by each bar is proportional to the data that is being presented.

(2) Violation of area principle in a pie chart


Figure 3.7 represents the pie chart of the sales distribution of mobile phones of different
companies.

Figure 3.7

The pie chart in Figure 3.7 violates the area principle, as the areas occupied by the sales
distribution of HTC and Apple do not correspond to the amount of data they represent.

3.4.2 Misleading graphs: truncated graphs


Another common violation is when the baseline of a bar chart is not at zero.

(1) Consider the following two bar charts:

The left graph exaggerates the differences as its baseline is not at zero. The graph on
the right shows the same data with the baseline at zero.

(2) The following figure represents the share of votes in an election in the USA.

From the lengths of the bars it appears that the Republican party’s voting percentage is less
than half of the Democratic party’s, but if we consider the actual numbers this is not the case.
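
The effect of a truncated baseline can be quantified by comparing the drawn bar lengths with the actual values. In the sketch below, the vote shares (52% and 48%), the party labels and the baseline of 45 are all hypothetical; they are not the numbers from the figure.

```python
def drawn_bar_lengths(values, baseline):
    """Length of each bar as it appears when the axis starts at `baseline`."""
    return {label: value - baseline for label, value in values.items()}

# Hypothetical vote shares in percent.
shares = {"Party X": 52.0, "Party Y": 48.0}

honest = drawn_bar_lengths(shares, baseline=0.0)
truncated = drawn_bar_lengths(shares, baseline=45.0)

honest_ratio = honest["Party X"] / honest["Party Y"]
truncated_ratio = truncated["Party X"] / truncated["Party Y"]

print(f"Ratio of bar lengths, baseline at 0:  {honest_ratio:.2f}")
print(f"Ratio of bar lengths, baseline at 45: {truncated_ratio:.2f}")
```

With the baseline at zero the longer bar is about 1.08 times the shorter one, but with the baseline at 45 it looks about 2.33 times as long, even though the data are unchanged.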

3.4.3 Manipulated y-axis
Expanding or compressing the scale on a graph can make changes in the data seem more or less
significant than they actually are. This is known as manipulation of the y-axis.
For example: The following bar charts represent the sales of smart phones A and B at
a local shop.

Figure : 3.8

Figure : 3.9

From Figure 3.8 we get the impression that a significant number of sales is being
made of both smart phones, but from Figure 3.9 it seems that the sales of smart phones
A and B are very low. So, the graph in Figure 3.9 is misleading because it has a manipulated
y-axis.

3.4.4 Indicating a y-axis break


We can indicate a y-axis break in a bar chart in the following way:

Figure : 3.10

3.4.5 Round-off errors


It is important to check for round-off errors. Round-off errors occur when table entries are
rounded percentages or proportions; the total sum may then differ slightly from 100% or 1.
This can carry over into a pie chart drawn from the rounded values.
For Example: Consider the following table:

Category Percentage
A 22.3
B 35.6
C 12.6
D 11
E 18.5
Total 100

In the table, the total sum is 100%.
Suppose we round off the values and draw a pie chart as follows:

This pie chart has round-off errors because the total sum of all entries is 100.5%, which is
different from 100%.
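
A quick check for round-off error is to compare the total before and after rounding. The sketch below uses the percentages from the table, but the rounding rule is an assumption: it rounds each entry to the nearest whole number with halves rounded up. The rounded values used in the pie chart are not shown in the text, so the total obtained here differs from the 100.5% quoted above.

```python
from decimal import Decimal, ROUND_HALF_UP

# Percentages from the table, kept as strings so Decimal stores them exactly.
percentages = {"A": "22.3", "B": "35.6", "C": "12.6", "D": "11", "E": "18.5"}
exact = {category: Decimal(value) for category, value in percentages.items()}

# Assumed rounding rule: nearest whole number, halves rounded up.
rounded = {
    category: value.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    for category, value in exact.items()
}

print("Total before rounding:", sum(exact.values()))
print("Total after rounding: ", sum(rounded.values()))
```

Under this assumed rule the rounded entries are 22, 36, 13, 11 and 19, which add up to 101 rather than 100.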

3.5 Summarizing Categorical Data


• Bar chart and Pie chart are graphical summaries of categorical data.

• Numbers that are used to describe data sets are called descriptive measures.

• Descriptive measures that indicate where the center or most typical value of a data set
lies are called measures of central tendency.

3.5.1 Mode
The mode of a categorical variable is the most common category, the category with the
highest frequency.
Mode labels the longest bar in a bar chart, the widest slice in a pie chart and the first
category shown in a Pareto chart.
Example: Let’s consider the dataset A, A, B, C, A, D, A, B, C, C, A, B, C, D, A.
Here, category A is the mode of the data as it occurs with the highest frequency.
Now, Figures 3.11, 3.12 and 3.13 represent the bar chart, pie chart and Pareto chart for the
dataset:

(1) Bar chart representation for the above dataset is:

Figure : 3.11

In the figure 3.11, category A has the longest bar. Thus, mode of the dataset is category
“A”.
(2) Pie chart representation of the above dataset is:

Figure : 3.12

In the above pie chart, category A has the widest slice. Thus, mode of the dataset is
category “A”.

(3) Pareto chart for the above dataset is:

Figure : 3.13

In the above Pareto chart, the first bar is for category A. Thus, the mode of the dataset is
category “A”.

3.5.1.1 Bimodal and Multimodal data


If two or more categories tie for the highest frequency, the data is called bimodal (in the case
of two) or multimodal (more than two).
Example:
Let’s consider the dataset A, A, B, C, A, C, A, B, C, C, A, C, C, D, A, A, C, D, B.
Here both categories “A” and “C” have highest frequency. Thus, this data is bimodal.
Now, we can consider the following bar chart also.

In the above bar chart, both categories “A” and “C” have highest frequency.
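
Finding the mode, including ties, can be sketched as follows. The helper name `modes` is hypothetical; it returns every category that attains the highest frequency, so it covers the unimodal, bimodal and multimodal cases.

```python
from collections import Counter

def modes(values):
    """Return all categories that tie for the highest frequency, sorted by name."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(category for category, n in counts.items() if n == highest)

# Dataset from Section 3.5.1 (one mode) and from this section (two modes).
unimodal = "A, A, B, C, A, D, A, B, C, C, A, B, C, D, A".split(", ")
bimodal = "A, A, B, C, A, C, A, B, C, C, A, C, C, D, A, A, C, D, B".split(", ")

print(modes(unimodal))
print(modes(bimodal))
```

For the first dataset only "A" is returned; for the second both "A" and "C" are returned, each with a frequency of 7.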

3.5.2 Median
The median of an ordinal variable is the category of the middle observation of the sorted
values.
If there are an even number of observations, then we can choose the category on either side
of the middle of the sorted list as the median.
Examples:

(1) When number of observations is odd:


Let’s consider the grades of 15 students as A, B, B, C, A, D, B, B, A, C, B, B, C, D, A.
Now to find the median of the categorical data, we need to order the data. So, the
ordered data is A, A, A, A, B, B, B, B, B, B, C, C, C, D, D.
Hence, the median grade is the category associated with the 8th observation which is
“B”.

(2) When number of observations is even:


Let’s consider the grades of 14 students which is listed as A, B, B, C, A, D, B, B, A, C,
B, B, C, D.
Now, the ordered data is A, A, A, B, B, B, B, B, B, C, C, C, D, D.
The median grade is the category associated with the 7th or 8th observation, which is
“B”.
In examples (1) and (2), the mode of the dataset is also category “B”. Here, the mode and
the median are the same.

(3) Consider the grades of 15 students, which are listed as A, B, B, C, A, D, A, B, A, C, B,
A, C, D, A.
The ordered data is A, A, A, A, A, A, B, B, B, B, C, C, C, D, D.
The median grade is the category associated with the 8th observation, which is “B”.
The most common grade is “A”, hence the mode is “A”. In this example, the mode and the
median are different.

Note: Median can be defined only for ordinal data whereas mode can be defined for both
nominal as well as ordinal data.
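
The procedure used in the three examples (order the data, then read off the middle observation) can be sketched in Python. The function name `ordinal_median` is hypothetical, and the category order A, B, C, D must be supplied explicitly because an ordinal scale has no built-in numeric values.

```python
def ordinal_median(values, order):
    """Return the median category (or both candidates when they differ)."""
    rank = {category: position for position, category in enumerate(order)}
    ordered = sorted(values, key=lambda category: rank[category])
    n = len(ordered)
    if n % 2 == 1:
        return [ordered[n // 2]]          # single middle observation
    lower, upper = ordered[n // 2 - 1], ordered[n // 2]
    return [lower] if lower == upper else [lower, upper]

grade_order = ["A", "B", "C", "D"]
example_1 = "A, B, B, C, A, D, B, B, A, C, B, B, C, D, A".split(", ")
example_2 = "A, B, B, C, A, D, B, B, A, C, B, B, C, D".split(", ")
example_3 = "A, B, B, C, A, D, A, B, A, C, B, A, C, D, A".split(", ")

for grades in (example_1, example_2, example_3):
    print(len(grades), ordinal_median(grades, grade_order))
```

All three examples give "B" as the median grade. When the number of observations is even and the two middle observations fall in different categories, the function returns both, since either may be chosen as the median.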

Unsolved Problems
(1) If an analyst wants to represent the revenues of various companies using graphs, then
which of the following graphical representation(s) is/are most appropriate for the
purpose? (More than one option can be correct)
(a) A pie chart with a pie/slice for each company and the width corresponding to its
revenue in crore rupees.
(b) A bar chart with a bar for each company on the x-axis and the length corresponding
to its revenue in crore rupees on the y-axis.
(c) A bar chart with a bar for each company on the y-axis and the length corresponding
to its revenue in crore rupees on the x-axis.
(d) A bar chart with the minimum revenue as a baseline.
Answer: b, c
(2) Mode of a categorical variable is: (More than one option can be correct)
(a) The last bar in ascending order of a Pareto chart.
(b) The middle-most bar in a Pareto chart.
(c) The longest bar in a bar chart.
(d) The widest slice in a pie chart.
Answer: a, c, d
(3) Which of the following can be defined for both nominal and ordinal data?
(a) Mean
(b) Median
(c) Mode
(d) All of the above
Answer: c
A total of 2000 cases of Covid-19 have been registered on 5th May 2020 in 5 key districts
of Maharashtra. The proportion (out of 5 districts) of cases in each district has been
listed in Table 2.1.A. Based on the information given, answer questions (4) and (5).

District Relative Frequency


Mumbai 0.35
Pune 0.20
Nagpur x
Thane 0.25
Nashik 0.08

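(4) Find the relative frequency of district Nagpur.
Answer: 0.12 (the relative frequencies must sum to 1, so x = 1 - (0.35 + 0.20 + 0.25 + 0.08) = 0.12)

(5) How many cases were registered in Pune on 5th May?

Answer: 400 (0.20 × 2000 = 400)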
