Notes of Week-1 and Week-2

The document discusses key concepts in statistics and data science:
• It defines statistics, population, sample, descriptive statistics, and inferential statistics.
• It differentiates between unstructured and structured data, and provides examples of each.
• It describes different types of data like categorical, numerical, time-series, and cross-sectional data. It also defines nominal, ordinal, interval, and ratio scales of measurement.
• Frequency distributions and different charts to describe categorical data like pie charts, bar charts, and Pareto charts are discussed. Issues like violating the area principle and misleading graphs are also covered.

Uploaded by

ram7177


Notes

for
STATISTICS FOR DATA SCIENCE - 1
Week-1 and 2
Contents
1 Statistics
1.1 Population and Sample
1.2 Major branches of statistics
1.3 Purpose of statistical analysis

2 Data
2.1 Unstructured and Structured Data
2.1.1 Variables and Cases
2.2 Classification of Data
2.2.1 Categorical Data and Numerical Data
2.2.1.1 Categorical Data
2.2.1.2 Numerical Data
2.2.2 Time-series and cross-sectional Data
2.2.3 Scales of measurement
2.2.3.1 Nominal scale of measurement
2.2.3.2 Ordinal scale of measurement
2.2.3.3 Interval scale of measurement
2.2.3.4 Ratio scale of measurement

3 Describing categorical data: Frequency distribution
3.1 Frequency Distribution
3.2 Relative frequency
3.3 Charts of categorical data
3.3.1 Pie Chart
3.3.2 Bar Chart
3.3.3 Pareto Chart
3.4 The Area Principle
3.4.1 Misleading graphs: violating area principle
3.4.2 Misleading graphs: truncated graphs
3.4.3 Manipulated y-axis
3.4.4 Indicating a y-axis break
3.4.5 Round-off errors
3.5 Summarizing Categorical Data
3.5.1 Mode
3.5.1.1 Bimodal and Multimodal data
3.5.2 Median

Chapter 1

1 Statistics
Statistics is the art of learning from data. It is concerned with the collection of data, their
subsequent description, and their analysis, which often leads to the drawing of conclusions.

1.1 Population and Sample

Population
The total collection of all the elements that we are interested in is called a population.
Sample
A subgroup of the population that will be studied in detail is called a sample.

The relationship between a sample and a population is illustrated in the following picture:

Example:
Suppose a survey is conducted to find the prices of all houses in Tamil Nadu, and 1000
houses are randomly selected from the urban areas of Tamil Nadu for this study. It is
concluded that the price of a house per square foot is roughly Rs. 5680. Here, the sample
consists of the 1000 selected houses from the urban areas of Tamil Nadu, and the population
consists of all houses in Tamil Nadu.

1.2 Major branches of statistics

1. Descriptive Statistics
The part of statistics concerned with the description and summarization of data is called
descriptive statistics.

• Summarization of data means giving a numerical or graphical summary of the data, i.e.,
describing its main points.
• A descriptive study may be performed on either sample data or population data.

2. Inferential Statistics

The part of statistics concerned with drawing conclusions from the data is called inferential
statistics.

1.3 Purpose of statistical analysis

• If the purpose of the analysis is to examine and explore information about the collected
data only, then the study is descriptive.
For Example: A class of 50 students took an exam (of 100 marks) and the average
marks of the class are calculated as 65. This type of study is called descriptive statistics
because here we are just summarizing the data (calculating the average marks of the
whole class).

• If the information is obtained from a sample of a population and the purpose of the
study is to use that information to draw conclusions/inferences about the population,
the study is inferential.
For Example: A teacher wants to know the average marks of all students in the school.
Since there is a large number of students in the school, the teacher collects a sample
of students from the school and calculates the average marks of the selected students,
which is, say, 60 marks. The teacher then concludes (using statistical techniques) that
the average marks of all students in the school are about 60. This type of study is
called inferential statistics because here we are drawing a conclusion about the
population based on the sample data.
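
The distinction can be sketched in Python. The marks below are simulated, not real data: the school size (1200), the sample size (50), the range of marks and the random seed are all assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # assumed seed, only so that the run is repeatable

# Hypothetical population: marks (out of 100) of 1200 students in a school.
population = [random.randint(30, 95) for _ in range(1200)]

# Descriptive: summarize exactly the data we hold (here, a sample of 50 students).
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

# Inferential: treat the sample mean as an estimate of the population mean.
# In a real study the population mean is unknown; it is computed here only
# because the population is simulated.
population_mean = statistics.mean(population)

print(f"sample mean (descriptive summary): {sample_mean:.1f}")
print(f"population mean (target of inference): {population_mean:.1f}")
```

The two numbers are usually close but not equal, which is why inferential techniques also try to quantify how far the estimate may be from the true value.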

Chapter 2

2 Data
Definition
Data are the facts and figures collected, analyzed, and summarized for presentation and
interpretation.
• Purpose of collecting data:
Generally, we collect data when we want to understand the characteristics or
attributes of some group or groups of people, places, things, or events.
For Example:

(1) To know about temperatures in a particular month in Chennai, India.

(2) To know about the marks obtained by students in their Class X.

2.1 Unstructured and Structured Data

Unstructured Data
Unstructured data is data that is not organized in a predefined manner. It is typically
text-heavy, but may also contain dates, numbers, and facts. Unstructured data requires
more work to process and understand.
For Example: YouTube comments, image files, social-media posts, lyrics of a song, etc.

When data are scattered with no structure, i.e., not in any standard format, the
information is of very little use.

Structured Data
Structured data is data organized in a standardized, clearly defined and searchable format.
For the information in a dataset to be useful, we must know the context of the numbers and
text it holds; hence, we need to organize the data. Structured data is easy to analyze
and understand.
Let’s consider the following two examples:

(1) Dataset of students:

Name      Gender   Date of Birth    Marks in class 10th   Board

Anjali    F        17 Feb, 2003     484                   State Board
Pradeep   M        3 June, 2002     514                   ICSE
Divya     F        22 Mar, 2003     397                   State Board
Sarita    F        19 May, 2002     533                   ICSE
Harsha    M        4 March, 2002    436                   CBSE
Bhavana   F        7 Apr, 2003      526                   State Board
Rohit     M        4 March, 2002    378                   CBSE
Vikash    M        11 Oct, 2001     526                   CBSE

Table 1: Student dataset

The student dataset shown in Table 1 can be considered structured data because it is in
tabular form and provides information about the Gender, Date of Birth, Marks in class
10th and Board of the students. It is also easy to analyze and understand, as we can
easily get information about any student, e.g., Anjali scored 484 marks in class 10th
under the State Board, Pradeep is male and his date of birth is 3 June, 2002, etc.

(2) Dataset of fertilizers:

Fertilizers   Types of Fertilizers   Area of fields (in acres)   Types of Crops   Amount of fertilizers (in kg)

Nitrogen      Inorganic              1                           Rice             200
Phosphorus    Inorganic              2                           Wheat            400
Manure        Organic                1.5                         Potato           300
Compost       Organic                1.3                         Rice             260
Potassium     Inorganic              1.6                         Pulse            320

Table 2 : Fertilizers dataset

The fertilizers dataset shown in Table 2 can also be considered structured data because
it is in tabular form and provides information about fertilizers. It is also easy to
analyze and understand, as we can easily get information from it, e.g., Potassium is an
inorganic fertilizer and 320 kg of it is used for a 1.6-acre field of pulse, etc.

2.1.1 Variables and Cases
Case (observation): A case (or observation) is a unit for which data is collected. Cases
should uniquely identify each row in the dataset.
Variable: A variable is a characteristic or attribute that varies across units. Intuitively,
a variable is something that “varies”.
For Example:
In the student dataset of Table 1, each student (Anjali, Pradeep, Divya, etc.) is a case,
as data is collected for every student and the names uniquely identify each row in the
dataset.
The variables are Name, Gender, Date of Birth, Board, etc., as their values vary from
student to student.
Note: The student dataset is in tabular form. To organise data in tabular form, the
following two points should be taken into consideration:

• Rows represent cases: For each case, the same attributes are recorded.

• Columns represent variables: For each variable, the same type of value is recorded for
each case.
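
As a rough illustration of rows as cases and columns as variables, Table 1 can be written as a list of Python dictionaries. The variable names `students`, `variables`, `cases` and `marks`, and the shortened key "Marks", are choices made for this sketch and are not part of the notes.

```python
# Table 1 from the notes: each dictionary is one case (row),
# and each key is one variable (column).
students = [
    {"Name": "Anjali", "Gender": "F", "Date of Birth": "17 Feb, 2003", "Marks": 484, "Board": "State Board"},
    {"Name": "Pradeep", "Gender": "M", "Date of Birth": "3 June, 2002", "Marks": 514, "Board": "ICSE"},
    {"Name": "Divya", "Gender": "F", "Date of Birth": "22 Mar, 2003", "Marks": 397, "Board": "State Board"},
    {"Name": "Sarita", "Gender": "F", "Date of Birth": "19 May, 2002", "Marks": 533, "Board": "ICSE"},
    {"Name": "Harsha", "Gender": "M", "Date of Birth": "4 March, 2002", "Marks": 436, "Board": "CBSE"},
    {"Name": "Bhavana", "Gender": "F", "Date of Birth": "7 Apr, 2003", "Marks": 526, "Board": "State Board"},
    {"Name": "Rohit", "Gender": "M", "Date of Birth": "4 March, 2002", "Marks": 378, "Board": "CBSE"},
    {"Name": "Vikash", "Gender": "M", "Date of Birth": "11 Oct, 2001", "Marks": 526, "Board": "CBSE"},
]

# Variables are the column names; cases are identified by the student names.
variables = list(students[0].keys())
cases = [student["Name"] for student in students]

# Every case records the same set of variables.
same_variables = all(list(student.keys()) == variables for student in students)

# Reading one variable across all cases gives one column of the table.
marks = [student["Marks"] for student in students]

print(variables)
print(cases)
print(same_variables, marks)
```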

2.2 Classification of Data

Data is broadly classified into two categories: categorical data and numerical data.

2.2.1 Categorical Data and Numerical Data

2.2.1.1 Categorical Data
Categorical data are also called qualitative variables; they identify group membership.
We cannot perform any meaningful mathematical operations on them.
In the student dataset illustrated in Table 1, Gender is a categorical variable because
it has two categories, F and M. We can classify any observation into one of these two
categories.
Board is also a categorical variable, since it has three categories (State Board, ICSE and
CBSE) and any observation can be classified into one of these three groups.

2.2.1.2 Numerical Data

Numerical data are also called quantitative variables. They describe numerical properties
of the data, i.e., we can perform meaningful mathematical operations on them.
In the student dataset of Table 1, Marks is a numerical variable because it describes a
numerical property: Rohit’s marks are 378, Pradeep’s marks are 514, Bhavana’s marks are
higher than Harsha’s, etc.
• Measurement units
The measurement unit defines the meaning of numerical data, such as weights measured in
kilograms, prices in rupees, heights in centimeters, etc.
Also, the data that make up a numerical variable in a data table must share a common unit.

2.2.2 Time-series and cross-sectional Data

• If the data is recorded over a period of time, it is called time-series data. A graph
of a time series showing values in chronological order is known as a time plot.
Example:
The data collected to observe the temperature in Delhi on seven different days is
time-series data, because it is recorded for only one place (Delhi) over a period of
time (seven different days).

• If the data is observed at the same point in time, it is called cross-sectional data.
Example:
The data collected to observe the temperature of Delhi, Chennai, Jaipur and Bhopal
on a particular day is cross-sectional data, because it is recorded at the same time
for several places.

2.2.3 Scales of measurement

There are four scales of measurement: nominal, ordinal, interval and ratio. Every variable
for which data is collected is measured on one of these scales.

2.2.3.1 Nominal scale of measurement

When the data for a variable consist of labels or names used to identify the characteristic of
an observation, the scale of measurement is considered a nominal scale.
Example: Name, Board, Gender, Blood group etc.
Note:

• Sometimes nominal variables are numerically coded, e.g., we might code men as 1 and
women as 2, or men as 3 and women as 1. The codes are only labels.

• There is no ordering in the variable.

• In short, “a nominal scale is just categories or labels which do not have any order.”

2.2.3.2 Ordinal scale of measurement

When data exhibit the properties of nominal data and the order or rank of the data is
meaningful, the scale of measurement is considered an ordinal scale.
Example:
Each customer who visits a restaurant provides a service rating of excellent, good, or poor.
Here, the data obtained are the labels excellent, good, or poor, i.e., the data have the
properties of nominal data. In addition, the data can be ranked/ordered with respect to the
service quality.
Note:
• We can code an ordinal scale of measurement, e.g., poor can be coded as 1, good as 2
and excellent as 3. There is an order in 1, 2, 3, but the distance between poor and
good need not be the same as the distance between good and excellent. We know that
excellent is better than good, but we cannot say that the difference between good and
excellent is the same as the difference between good and poor. Thus, we have just an
order.

• In short, “an ordinal scale is just categories or labels which have an order.”

2.2.3.3 Interval scale of measurement

If the data have all the properties of ordinal data and the interval between values is
expressed in terms of a fixed unit of measure, then the scale of measurement is an interval
scale.
Note:
• Data with an interval scale of measurement are always numeric, and we can find the
difference between any two values.

• Ratios of values have no meaning here because the value of zero is arbitrary.
Example:
Consider an air-conditioned room where the temperature is set at 20°C while the temperature
outside the room is 40°C. It is correct to say that the difference in temperature is 20°C,
but it is incorrect to say that it is twice as hot outdoors as indoors.
Temperature in degrees Fahrenheit or degrees Celsius has an interval scale of measurement
because it has no absolute zero. On the Celsius scale, 0 and 100 are set as the freezing
point and the boiling point of water, whereas on the Fahrenheit scale they are 32 and 212.
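
One way to see why ratios are not meaningful on an interval scale is to convert the two temperatures from the example to Fahrenheit and compare the ratios. The function name below is chosen for this sketch; the conversion formula F = C × 9/5 + 32 is the standard one.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


indoor_c, outdoor_c = 20.0, 40.0              # values from the example above
indoor_f = celsius_to_fahrenheit(indoor_c)    # 68.0
outdoor_f = celsius_to_fahrenheit(outdoor_c)  # 104.0

# The same two temperatures give different ratios in different units,
# so "twice as hot" is not a property of the temperatures themselves.
ratio_c = outdoor_c / indoor_c   # 2.0
ratio_f = outdoor_f / indoor_f   # about 1.53

print(ratio_c, round(ratio_f, 2))
```

If the ratio were meaningful, it would not change with the unit, as is the case for heights measured in centimeters or in meters.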

2.2.3.4 Ratio scale of measurement
If the data have all the properties of interval data and the ratio of two values is meaningful,
then the scale of measurement is a ratio scale.
The ratio scale of measurement has the absolute zero property, which is the key difference
between the ratio and interval scales.
Example: Height (in cm), Weight (in kg), Marks, etc. Data such as height, weight and marks
can be added, subtracted, multiplied or divided because they all have the absolute zero
property.
A summary of all scales of measurement is as follows:

Unsolved Problems

(1) An analyst wants to conduct a survey for testing the maintenance of hospitals in a
particular district in Bihar, for which he selects 25 hospitals randomly from that district.
Identify the sample and population. [2 Marks]

(a) The population is all the hospitals in Bihar and the sample is all the hospitals in
the district.
(b) The population is all the hospitals in Bihar and the sample is 25 selected hospitals
in Bihar.
(c) The population is all hospitals in the district of Bihar and the sample is 25 selected
hospitals in the district.
(d) None of the above

Answer: c

(2) In the 2011 Cricket ODI World Cup quarter-final match between India and Australia,
a media organization estimated that Australia would beat India by 50 runs if Australia
bats first, based on the information of matches played between the two teams previously.
Which branch of statistics does the above analysis belong to?
Answer: Inferential Statistics

(3) Values of temperature and humidity of a room are measured for 24 hours at a regular
time interval of 30 minutes. Based on this information, choose the correct option:

(a) It is a cross-sectional data.

(b) It is time-series data.

Answer: b

(4) What kind of data is “Social media posts”?

(a) Unstructured data

(b) Structured data

Answer: a

(5) What kind of variable is the qualification of a candidate sitting for a job interview?

(a) Numerical/ Quantitative

(b) Categorical/ Qualitative
(c) Numerical and discrete
(d) Numerical and continuous

Answer : b

(6) If addition, subtraction can be performed on a variable, then the scale(s) of measurement
of the variable could be:

(a) Ordinal
(b) Ratio
(c) Interval
(d) Nominal

Answer : b, c

(7) Which of the following variable(s) have nominal scale of measurement?

(a) Education qualification of a person.

(b) Hair color
(c) Brand name of mobile phone
(d) Number plate of cars

Answer: b, c, d

Chapter 3

3 Describing categorical data: Frequency distribution

3.1 Frequency Distribution
A frequency distribution of qualitative data is a listing of the distinct values and their
frequencies.
Each row of a frequency table lists a category along with the number of cases in this category.

Example: Let’s construct a frequency table for the following data.

(1) A, A, B, C, A, D, A, B, D, C

Category Tally mark Frequency

A 4
B 2
C 2
D 2
Total 10

(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A

Category Tally mark Frequency

A 6
B 3
C 3
D 3
Total 15

(3) A, B, B, C, A, D, B, B, D, C, A, B, C, D, B

Category Tally mark Frequency

A 3
B 6
C 3
D 3
Total 15

(4) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category Tally mark Frequency
A 6
B 3
C 4
D 5
Total 18
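
A frequency table like the ones above can be built programmatically. The following is a minimal sketch for dataset (4) using `collections.Counter` from the Python standard library; the variable names are chosen for illustration.

```python
from collections import Counter

# Dataset (4) from the examples above.
data = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D".split(", ")

# Counter maps each distinct category to the number of cases in it.
frequency = Counter(data)

print("Category  Frequency")
for category in sorted(frequency):
    print(f"{category:<9} {frequency[category]}")
print(f"{'Total':<9} {sum(frequency.values())}")
```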

3.2 Relative frequency

The ratio of the frequency to the total number of observations is called relative frequency.
Note: Relative frequency plays an important role in comparing two data sets: because
relative frequencies always fall between 0 and 1, they provide a standard for comparison.
Examples: Let us find the relative frequencies for the following data.

(1) A, A, B, C, A, D, A, B, D, C

Category Frequency Relative Frequency

A 4 0.4
B 2 0.2
C 2 0.2
D 2 0.2
Total 10 1

(2) A, A, B, C, A, D, A, B, D, C, A, B, C, D, A

Category Frequency Relative Frequency

A 6 0.4
B 3 0.2
C 3 0.2
D 3 0.2
Total 15 1
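
The two examples above have different sizes (10 and 15 observations) but the same relative frequencies, which is what makes relative frequency useful for comparison. A short sketch, with a helper function named here only for illustration:

```python
from collections import Counter


def relative_frequencies(data):
    """Return each category's frequency divided by the number of observations."""
    counts = Counter(data)
    total = len(data)
    return {category: counts[category] / total for category in sorted(counts)}


data_1 = "A, A, B, C, A, D, A, B, D, C".split(", ")
data_2 = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A".split(", ")

rel_1 = relative_frequencies(data_1)
rel_2 = relative_frequencies(data_2)

print(rel_1)           # {'A': 0.4, 'B': 0.2, 'C': 0.2, 'D': 0.2}
print(rel_1 == rel_2)  # True
```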

3.3 Charts of categorical data

The two most common displays of a categorical variable are a bar chart and a pie chart.

3.3.1 Pie Chart

A pie chart is a circle divided into pieces proportional to the relative frequencies of the
qualitative data; it is used to show the proportions of a categorical variable. A pie
chart is a good way to show that one category makes up more than half of the total.

Example: Consider the frequency table of the dataset A, A, B, C, A, D, A, B, D, C.

Category Frequency Relative Frequency

A 4 0.4
B 2 0.2
C 2 0.2
D 2 0.2
Total 10 1

Table 3.1

Figure 3.1 is the pie chart representation of the dataset in Table 3.1:

Figure 3.1: Pie chart representation

Since a pie chart shows each category’s share of the whole, the share of category A is 40%,
and the shares of categories B, C and D are 20% each.

3.3.2 Bar Chart

A bar chart displays the distinct values of the qualitative data on a horizontal axis and the
relative frequencies (or frequencies or percents) of those values on a vertical axis. The
frequency/relative frequency of each distinct value is represented by a vertical bar whose
height is equal to the frequency/relative frequency of that value. The bars should be
positioned so that they do not touch each other.
A bar chart is most appropriate for representing the count of each category, and it can be
oriented either horizontally or vertically.

Example: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category Frequency Relative frequency

A 6 0.33
B 3 0.17
C 4 0.22
D 5 0.28
Total 18 1

Table 3.2

Figure 3.2 represents the bar chart of the dataset in Table 3.2 as follows:

Figure 3.2: Bar chart representation

3.3.3 Pareto Chart

When the categories in a bar chart are sorted by frequency, the bar chart is sometimes called
a Pareto chart. Pareto charts are popular in quality control to identify problems in a business
process.
Example: A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D

Category Frequency Relative frequency
A 6 0.33
B 3 0.17
C 4 0.22
D 5 0.28
Total 18 1

Table 3.3

Figure 3.3 is the pareto chart representation of the dataset in Table 3.3 as follows:

Figure 3.3: Pareto chart representation
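
The ordering used in a Pareto chart can be obtained by sorting the categories by frequency, highest first. The sketch below does this for the dataset of Table 3.3 and prints a simple text version of the bars; the use of `#` characters as bars is only an illustration, not the figure from the notes.

```python
from collections import Counter

# Dataset of Table 3.3.
data = "A, A, B, C, A, D, A, B, D, C, A, B, C, D, A, C, D, D".split(", ")

# most_common() returns (category, frequency) pairs sorted by decreasing frequency.
pareto_order = Counter(data).most_common()
print(pareto_order)  # [('A', 6), ('D', 5), ('C', 4), ('B', 3)]

# Text rendering: one bar per category, longest bar first.
for category, count in pareto_order:
    print(f"{category} {'#' * count} {count}")
```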

Note: If the categorical variable is ordinal, then the bar chart must preserve the ordering.
For example:
The T-shirt sizes L, M, M, S, L, S, S, M, L, M, M, S, S, L, M, S, M, S, L, M of twenty
students are summarized in Table 3.4:

Size Frequency Relative frequency

Small 7 0.35
Medium 8 0.40
Large 5 0.25
Total 20 1

Table 3.4

The dataset of Table 3.4 is ordinal, so we have preserved the order of the categories.
The bar chart representation for the dataset of Table 3.4 is given as follows:

Figure 3.4: Bar chart of Ordinal data
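
For ordinal data the categories are listed in their natural order rather than by frequency. A minimal sketch for the T-shirt data, assuming the order Small < Medium < Large and using the one-letter codes from the raw data:

```python
from collections import Counter

# T-shirt sizes of the twenty students (raw data behind Table 3.4).
sizes = "L, M, M, S, L, S, S, M, L, M, M, S, S, L, M, S, M, S, L, M".split(", ")

# The natural order of the ordinal categories, stated explicitly.
size_order = ["S", "M", "L"]

counts = Counter(sizes)
ordered_table = [
    (size, counts[size], counts[size] / len(sizes)) for size in size_order
]

for size, frequency, relative in ordered_table:
    print(f"{size}  {frequency}  {relative:.2f}")
```

Sorting the same counts by frequency would put Medium first, which is the Pareto ordering and not the ordering wanted for ordinal data.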

Purpose of using charts

(1) Pie charts are best used when we are trying to compare parts of a whole.
(2) Bar graphs are used to compare things between different groups.
Many Categories:
A bar chart or pie chart with too many categories might conceal the more important
categories. In some cases, the smaller categories can be grouped together.
Now, let’s consider the following bar chart with too many categories:

Now, we can group the other categories together as follows:

Grouping other categories together in a major category conveys two important things.

(1) We are not excluding any data.

(2) We have a significant number that comes from smaller categories.
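
The grouping step can be sketched as follows. The counts and the threshold of 5 are hypothetical values made up for this illustration (the data behind the charts in the notes is not available), and `group_small_categories` is a helper name chosen here.

```python
def group_small_categories(counts, min_count, other_label="Other"):
    """Merge every category whose count is below min_count into one bucket."""
    grouped = {}
    other_total = 0
    for category, count in counts.items():
        if count >= min_count:
            grouped[category] = count
        else:
            other_total += count
    if other_total:
        grouped[other_label] = other_total
    return grouped


# Hypothetical frequencies with several small categories.
counts = {"A": 40, "B": 25, "C": 3, "D": 2, "E": 4, "F": 1}
grouped = group_small_categories(counts, min_count=5)

print(grouped)  # {'A': 40, 'B': 25, 'Other': 10}

# No data is excluded: the totals before and after grouping agree.
print(sum(grouped.values()) == sum(counts.values()))  # True
```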

3.4 The Area Principle
The area principle says that the area occupied by a part of the graph should correspond to
the amount of data it represents.
Displays of data must obey the area principle; violations of the area principle are a
common way to mislead with statistics.

3.4.1 Misleading graphs: violating area principle

(1) Decorated graphs: Sometimes charts are decorated to attract attention, and the
decoration often violates the area principle.
For Example: Figure 3.5 is an example of a decorated graph:

Figure 3.5: Decorated graph

Figure 3.5 gives us the total wine exports in UK, Canada, Japan and Italy. However, there
is no baseline, and the chart shows bottles on top of labeled boxes of various sizes and
shapes.

Now, Figure 3.6 represents the chart without decoration:

Figure 3.6

Each of the categories is labeled. The chart is accurate and has a baseline. It is also
consistent: the widths of the bars for all countries are equal, and the area occupied by
each part of the graph is proportional to the data being presented.

(2) Violation of area principle in a pie chart

Figure 3.7 represents the pie chart of the sales distribution of mobile phones of different
companies.

Figure 3.7

The pie chart in Figure 3.7 violates the area principle, as the areas occupied by the sales
shares of HTC and Apple do not correspond to the amount of data they represent.

3.4.2 Misleading graphs: truncated graphs

Another common violation is when the baseline of a bar chart is not at zero.

(1) Consider the following two bar charts:

The left graph exaggerates the differences because its baseline is not at zero. The graph
on the right shows the same data with the baseline at zero.

(2) The following figure represents the share of votes in an election in the USA.

From the lengths of the bars, it appears that the Republican Party’s vote percentage is less
than half of the Democratic Party’s, but if we consider the actual numbers this is not the
case.

3.4.3 Manipulated y-axis
Expanding or compressing the scale on a graph can make changes in the data seem more or less
significant than they actually are. This is known as manipulation of the y-axis.
For example: The following bar charts represent the number of sales of smart phones A and B
at a local shop.

Figure : 3.8

Figure : 3.9

Figure 3.8 suggests that both smart phones have a significant number of sales, whereas
Figure 3.9 makes the sales of smart phones A and B seem very low. So, the graph in
Figure 3.9 is misleading because it has a manipulated y-axis.

3.4.4 Indicating a y-axis break

We can indicate a y-axis break in a bar chart in the following way:

Figure : 3.10

3.4.5 Round-off errors

It is important to check for round-off errors. When table entries are percentages or
proportions that have been rounded, their total may differ slightly from 100% or 1. Such
errors can show up in a pie chart.
For Example: Consider the following table:

Category Percentage
A 22.3
B 35.6
C 12.6
D 11
E 18.5
Total 100

In the table, the total is 100%.
Suppose we round off the values and draw a pie chart as follows:

This pie chart has round-off errors because the total of all entries is 100.5%, which is
different from 100%.
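
The effect can be reproduced with the percentages from the table. The rounding used in the pie chart is not recoverable from the text, so the sketch below assumes rounding to the nearest whole percent with halves rounded up; under that assumption the total comes to 101 rather than the 100.5% of the chart, but the point is the same. `Decimal` is used so that the arithmetic is exact.

```python
from decimal import Decimal, ROUND_HALF_UP

# Percentages from the table above.
percentages = {
    "A": Decimal("22.3"),
    "B": Decimal("35.6"),
    "C": Decimal("12.6"),
    "D": Decimal("11"),
    "E": Decimal("18.5"),
}

# Assumed rounding rule: nearest whole percent, halves rounded up.
rounded = {
    category: value.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    for category, value in percentages.items()
}

exact_total = sum(percentages.values())
rounded_total = sum(rounded.values())

print(exact_total)    # 100.0
print(rounded_total)  # 101
```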

3.5 Summarizing Categorical Data

• Bar chart and Pie chart are graphical summaries of categorical data.

• Numbers that are used to describe data sets are called descriptive measures.

• Descriptive measures that indicate where the center or most typical value of a data set
lies are called measures of central tendency.

3.5.1 Mode
The mode of a categorical variable is the most common category, i.e., the category with the
highest frequency.
The mode labels the longest bar in a bar chart, the widest slice in a pie chart and the first
category shown in a Pareto chart.
Example: Let’s consider the dataset A, A, B, C, A, D, A, B, C, C, A, B, C, D, A.
Here, category A is the mode of the data as it occurs with the highest frequency.
Figures 3.11, 3.12 and 3.13 show the bar chart, pie chart and Pareto chart for the
dataset:

(1) Bar chart representation for the above dataset is:

Figure : 3.11

In the figure 3.11, category A has the longest bar. Thus, mode of the dataset is category
“A”.
(2) Pie chart representation of the above dataset is:

Figure : 3.12

In the above pie chart, category A has the widest slice. Thus, mode of the dataset is
category “A”.

(3) Pareto chart for the above dataset is:

Figure : 3.13

In the above pareto chart, first bar is for category A. Thus, mode of the dataset is
category “A”.

3.5.1.1 Bimodal and Multimodal data

If two or more categories tie for the highest frequency, the data is called bimodal (in the case
of two) or multimodal (more than two).
Example:
Let’s consider the dataset A, A, B, C, A, C, A, B, C, C, A, C, C, D, A, A, C, D, B.
Here both categories “A” and “C” have highest frequency. Thus, this data is bimodal.
Now, we can consider the following bar chart also.

In the above bar chart, both categories “A” and “C” have highest frequency.
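
Finding the mode, including ties, amounts to collecting every category whose frequency equals the highest frequency. A minimal sketch applied to the dataset of Section 3.5.1 and to the bimodal dataset above (`modes` is a helper name chosen for this illustration):

```python
from collections import Counter


def modes(data):
    """Return all categories that tie for the highest frequency."""
    counts = Counter(data)
    highest = max(counts.values())
    return sorted(category for category, n in counts.items() if n == highest)


unimodal = "A, A, B, C, A, D, A, B, C, C, A, B, C, D, A".split(", ")
bimodal = "A, A, B, C, A, C, A, B, C, C, A, C, C, D, A, A, C, D, B".split(", ")

print(modes(unimodal))  # ['A']
print(modes(bimodal))   # ['A', 'C']
```

One category in the result means a single mode, two mean the data is bimodal, and more than two mean it is multimodal.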

3.5.2 Median
The median of an ordinal variable is the category of the middle observation of the sorted
values.
If there are an even number of observations, then we can choose the category on either side
of the middle of the sorted list as the median.
Examples:

(1) When number of observations is odd:

Let’s consider the grades of 15 students as A, B, B, C, A, D, B, B, A, C, B, B, C, D, A.
Now to find the median of the categorical data, we need to order the data. So, the
ordered data is A, A, A, A, B, B, B, B, B, B, C, C, C, D, D.
Hence, the median grade is the category associated with the 8th observation which is
“B”.

(2) When number of observations is even:

Let’s consider the grades of 14 students, listed as A, B, B, C, A, D, B, B, A, C,
B, B, C, D.
Now, the ordered data is A, A, A, B, B, B, B, B, B, C, C, C, D, D.
The median grade is the category associated with the 7th or 8th observation, which is
“B” in both cases.
In example (1), the mode of the dataset is also category “B”, so the mode and the
median are the same.

(3) Consider the grades of 15 students, listed as A, B, B, C, A, D, A, B, A, C, B,
A, C, D, A.
The ordered data is A, A, A, A, A, A, B, B, B, B, C, C, C, D, D.
The median grade is the category associated with the 8th observation, which is “B”.
The most common grade is “A”, hence the mode is “A”. In this example, the mode and the
median are different.

Note: Median can be defined only for ordinal data whereas mode can be defined for both
nominal as well as ordinal data.
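
The three examples can be checked with a short sketch. The helper `ordinal_medians` and the explicit grade order A, B, C, D are assumptions made for this illustration; for an even number of observations the function returns both middle categories when they differ.

```python
def ordinal_medians(data, order):
    """Return the median category (or both middle categories) of ordinal data."""
    rank = {category: position for position, category in enumerate(order)}
    ordered = sorted(data, key=rank.__getitem__)
    n = len(ordered)
    if n % 2 == 1:
        return [ordered[n // 2]]            # the single middle observation
    lower, upper = ordered[n // 2 - 1], ordered[n // 2]
    return [lower] if lower == upper else [lower, upper]


grade_order = ["A", "B", "C", "D"]  # assumed ordering of the grades

grades_1 = "A, B, B, C, A, D, B, B, A, C, B, B, C, D, A".split(", ")
grades_2 = "A, B, B, C, A, D, B, B, A, C, B, B, C, D".split(", ")
grades_3 = "A, B, B, C, A, D, A, B, A, C, B, A, C, D, A".split(", ")

print(ordinal_medians(grades_1, grade_order))  # ['B']
print(ordinal_medians(grades_2, grade_order))  # ['B']
print(ordinal_medians(grades_3, grade_order))  # ['B']
```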

Unsolved Problems
(1) If an analyst wants to represent the revenues of various companies using graphs, then
which of the following graphical representation(s) is/are most appropriate for the
purpose? (More than one option can be correct)
(a) A pie chart with a pie/slice for each company and the width corresponding to its
revenue in crore rupees.
(b) A bar chart with a bar for each company on the x-axis and the length corresponding
to its revenue in crore rupees on the y-axis.
(c) A bar chart with a bar for each company on the y-axis and the length corresponding
to its revenue in crore rupees on the x-axis.
(d) A bar chart with the minimum revenue as a baseline.
Answer: b, c
(2) Mode of a categorical variable is: (More than one option can be correct)
(a) The last bar in ascending order of a Pareto chart.
(b) The middle-most bar in a Pareto chart.
(c) The longest bar in a bar chart.
(d) The widest slice in a pie chart.
Answer: a, c, d
(3) Which of the following can be defined for both nominal and ordinal data?
(a) Mean
(b) Median
(c) Mode
(d) All of the above
Answer: c
A total of 2000 cases of Covid-19 have been registered on 5th May 2020 in 5 key districts
of Maharashtra. The proportion (out of 5 districts) of cases in each district has been
listed in Table 2.1.A. Based on the information given, answer questions (4) and (5).

District Relative Frequency

Mumbai 0.35
Pune 0.20
Nagpur x
Thane 0.25
Nashik 0.08

(4) Find the relative frequency of district Nagpur.
Answer: 0.12

(5) How many cases were registered in Pune on 5th May?

Answer: 400
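
The arithmetic behind answers (4) and (5) can be sketched as follows, using the fact that relative frequencies sum to 1 and that frequency equals relative frequency times the total. `Decimal` is used only to keep the decimal arithmetic exact; the variable names are chosen for illustration.

```python
from decimal import Decimal

total_cases = 2000

# Known relative frequencies from the table; Nagpur is the unknown x.
relative_frequency = {
    "Mumbai": Decimal("0.35"),
    "Pune": Decimal("0.20"),
    "Thane": Decimal("0.25"),
    "Nashik": Decimal("0.08"),
}

# (4) Relative frequencies sum to 1, so x is whatever remains.
nagpur = 1 - sum(relative_frequency.values())

# (5) Frequency = relative frequency x total number of observations.
pune_cases = relative_frequency["Pune"] * total_cases

print(nagpur)           # 0.12
print(int(pune_cases))  # 400
```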

