Unit - I: Descriptive Statistics and Methods For Data Science
Contents
➢ Descriptive Statistics
• 1.2 Statistics
• 1.5 Types of variables
• 1.6 Data Visualization
1.1 Data Science
Data science is the study of data. It is an interdisciplinary field that uses statistics,
scientific computing, scientific methods, processes, algorithms and systems to extract or
explore knowledge and insights from noisy, structured and unstructured data.
It involves developing methods of recording, storing and analyzing data
to effectively extract useful information.
1.2 STATISTICS
Definition: Statistics is a tool in the hands of mankind to translate complex facts into simple
The word statistics is derived from the Italian word “Stato”, which means a political state.
Statistical methods are applied for investigation in every important field of science.
STATISTICAL METHODS
• Collection of data: The first step of an investigation is the collection of data. Careful
collection is needed because further analysis is based on this.
• Organization of data: The large mass of figures that are collected from a survey needs
organization.
• Presentation of data: The collected data must be edited so that irrelevant items and wrong
computations are corrected or adjusted.
The collected data must be classified and tabulated before they can be analyzed.
Types of Statistics
• Descriptive statistics: It consists of the methods for organizing and summarizing the
information.
• Inferential statistics: It consists of the methods for drawing and measuring the reliability of
conclusions about a population based on the information obtained.
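As a minimal sketch of the descriptive side, the snippet below summarizes a small data set with Python's standard `statistics` module. The variable names are assumptions made for illustration, and the values reuse the "No. of Books" figures that appear later in this unit.

```python
import statistics

# Illustrative data set: number of books per subject.
book_counts = [100, 120, 80, 130]

# Descriptive statistics organize and summarize the data at hand.
mean_books = statistics.mean(book_counts)      # arithmetic average
median_books = statistics.median(book_counts)  # middle value of the sorted data
spread_books = statistics.stdev(book_counts)   # sample standard deviation

print(f"mean={mean_books}, median={median_books}, stdev={spread_books:.2f}")
```

These summaries describe only the observed values. Drawing conclusions about a larger population from them is the job of inferential methods.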
• Statistics is widely used in education. Research has become a common feature in all
branches of activities.
• Statistics is necessary for the formulation of policies to start a new course, consideration of
facilities available for new courses etc.
• There are many people engaged in research work to test the past knowledge and evolve new
knowledge. These are possible only through statistics.
Statistics and Modern applications
Limitations of statistics
1.3 Collection of data
Data and the main sources of collecting data
➢In statistics, data means the mass of information collected from different sources.
➢One should take care while collecting the data, otherwise it leads to wrong conclusions and
faulty decisions.
➢According to the basic sources of collection, data may be classified into two types:
1. Primary data 2. Secondary data.
Methods of collecting primary data or sources of primary data
For the collection of primary data the investigator may choose any one of the following
methods.
1. Direct personal observation.
2. Indirect oral interviews.
3. Information from correspondents.
4. Mailed questionnaire method.
5. Schedules sent through enumerators.
2. Indirect Oral Interviews
4. Mailed Questionnaire Method
➢Under this method a list of questions is prepared and is sent to all the informants by post.
The list of questions is technically called questionnaire.
➢A covering letter accompanying the questionnaire explains the purpose of the investigation
and the importance of correct information, and requests the informants to fill in the blank
spaces provided and to return the form within a specified time.
➢This method is appropriate in those cases where the informants are literate and are spread
over a wide area.
Sources of secondary data
The secondary data can be divided into two categories
1) Published sources
2) Unpublished sources.
1) Published sources:
Under this method the data has previously been collected and published.
Example: government publications, statistical reports, journals and newspapers, census
reports, and reports on national sample surveys conducted in India etc.
2) Unpublished sources:
In this method the data is not published, or is kept for personal or departmental
use.
Example: books of accounts, research work of various institutions and
universities, bank account details.
Primary data Vs Secondary data
• Primary data are not readily available. Secondary data are readily available.
• Much care is required at the time of collecting primary data. Not much care is required at the
time of collecting secondary data.
1.4 Population Vs Sample
Population is a complete set of all possible observations of the type which is to be
investigated (the term population does not necessarily refer to people).
The entire group of individuals is called the population.
Finite and Infinite population: A population can be either finite or infinite.
When the number of observations can be counted and is definite, it is known as a “finite
population”. When the number of observations cannot be counted and is infinite, it is known as an
“infinite population”.
Examples: The number of workers in a factory and the production of articles on a particular day for a
company are finite populations.
The number of stars in the sky and the number of people watching television programs are treated as
infinite populations.
b. Sample Method: In the case of sample enquiry, only a part of the whole population
is studied. We can study the characteristics of a population from the sample.
➢Statisticians use the word sample to describe a portion chosen from the population.
➢A finite subset of statistical individuals defined in a population is called a sample.
➢The number of units in a sample is called the sample size.
Example: Taking one tablespoon of rice to check whether it is cooked or not.
Merits:
➢It saves time because fewer items are collected and processed.
➢When the results are urgently required, this method is very helpful.
➢It reduces the cost of the enumeration.
➢More reliable results can be obtained, since there are fewer chances of sampling errors.
➢Expert and trained people can be employed for scientific processing and analysis.
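The sample method can be sketched in a few lines of Python. The population below is made-up data (hypothetical daily output of 1,000 workers), and the names `population`, `sample` and `sample_size` are assumptions for illustration only.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical finite population: daily output of 1,000 workers (made-up values).
population = [random.randint(50, 150) for _ in range(1000)]

# Sample method: study only a part of the population.
sample_size = 50
sample = random.sample(population, sample_size)  # simple random sample, without replacement

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f} (from {sample_size} of {len(population)} units)")
```

Only 50 units are processed, yet the sample mean is usually close to the population mean, which is why the method saves time and cost.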
‘Independent variables’.
• Dependent variables: It is something that depends on other factors.
• A dependent variable depends on an independent variable, while an independent
variable depends on external manipulation.
1.6 Data Visualization
Visualizing data using visual elements like charts, graphs and maps, to provide an
accessible way to see and understand trends, outliers and patterns in data, is called data
visualization.
➢Moreover, even a layman who has nothing to do with numbers can understand
diagrams.
➢It is the graphical representation of information and data.
➢By using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data.
➢Diagrammatic presentation of data gives an immediate understanding of the real situation
described by the data, in comparison to tabular or textual presentation of data.
1. Bar chart
2. Line chart
3. Scatter plot
4. Sparkline
5. Pie chart
1. BAR GRAPH
It is a pictorial representation of the numerical data by a number of bars of uniform width with
different heights, erected horizontally or vertically with equal spacing between them.
[Figure: bar graph of No. of Books for Mathematics, Physics, Chemistry and Geography]
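A rough text-only sketch of the same idea: bars of uniform width whose length is proportional to the value. The data is the "No. of Books" example from this unit. The function name `text_bar_chart` and the scale of 10 books per character are assumptions made for illustration.

```python
# Data from the "No. of Books" example.
books = {"Mathematics": 100, "Physics": 120, "Chemistry": 80, "Geography": 130}

def text_bar_chart(data, scale=10):
    """Return one horizontal bar per category; bar length is proportional to the value."""
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        bar = "#" * (value // scale)  # one '#' per `scale` units (assumed scale)
        lines.append(f"{label:<{label_width}} | {bar} {value}")
    return lines

for line in text_bar_chart(books):
    print(line)
```

A plotting library would normally be used for a real chart. The sketch only shows how each value maps to a bar length.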
Example 2
2. Line Graph/Chart
It is a type of chart which displays information as a series of data points called markers,
connected by straight line segments.
[Figure: line chart of No. of Books]
3. Scatter Graph/Chart
It is a type of mathematical diagram using cartesian coordinates to display values for two
variables for a set of data. With scatter graphs we often talk about how the variables relate to
each other.
Example:
Subject        Mathematics   Physics   Chemistry   Geography
No. of Books   100           120       80          130
[Figure: scatter chart of No. of Books]
4. Sparkline
It is a very small line chart drawn without axes or coordinates. It represents the general shape of
the variation in some measurement, such as temperature or stock market price, in a simple and
highly condensed way.
Example:
Subject        Mathematics   Physics   Chemistry   Geography
No. of Books   100           120       80          130
[Figure: sparkline of No. of Books]
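One common way to approximate a sparkline in plain text is to map each value to one of eight Unicode block characters. The sketch below assumes that approach. The function name `sparkline` is hypothetical and the data is the "No. of Books" example.

```python
def sparkline(values):
    """Map each value to one of eight block characters, lowest to highest."""
    blocks = "▁▂▃▄▅▆▇█"
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid division by zero when all values are equal
    return "".join(blocks[int((v - low) / span * (len(blocks) - 1))] for v in values)

books = [100, 120, 80, 130]  # Mathematics, Physics, Chemistry, Geography
print(sparkline(books))
```

No axes or coordinates are drawn. Only the general shape of the variation is kept.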
[Figure: pie chart of No. of Books: Mathematics 23%, Physics 28%, Chemistry 19%, Geography 30%]
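The percentages in the pie chart follow from each subject's share of the total. A minimal sketch of that arithmetic, using the "No. of Books" data (the variable names are assumptions for illustration):

```python
books = {"Mathematics": 100, "Physics": 120, "Chemistry": 80, "Geography": 130}
total = sum(books.values())

# Each slice's share of the whole, as a percentage and as a central angle in degrees.
shares = {subject: count / total * 100 for subject, count in books.items()}
angles = {subject: count / total * 360 for subject, count in books.items()}

for subject in books:
    print(f"{subject:<12} {shares[subject]:5.1f}%  {angles[subject]:6.1f} degrees")
```

Rounded to whole numbers, the shares are 23%, 28%, 19% and 30%, and the central angles add up to 360 degrees.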
Advantages of Diagrammatic Presentation:
(1) Diagrams Are Attractive and Impressive:
• Data presented in the form of diagrams are able to attract the attention of even a common man.
(2) Easy to Remember
• Diagrams have a great memorizing effect.
(3) Diagrams save Time
• They present complex masses of data in a simplified manner.
Limitations of diagrammatic presentation:
1. They do not provide detailed information.
2. Diagrams can be easily misinterpreted.
3. Preparing diagrams can take much time and labour.
4. Exact measurement is not possible in diagrams.
1. Histogram: A two-dimensional diagram in which the length of each rectangle shows the frequency and
the breadth shows the size of the class interval.
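Before a histogram can be drawn, the raw data has to be grouped into class intervals. The sketch below shows one way to do that. The marks, the class width of 10 and the range 0 to 50 are all hypothetical values chosen for illustration.

```python
# Hypothetical marks of 12 students (illustrative values).
marks = [12, 25, 37, 8, 19, 33, 28, 41, 15, 22, 36, 29]

class_width = 10  # assumed size of each class interval
lower, upper = 0, 50

# Count how many observations fall in each class interval [start, start + width).
frequencies = {}
for start in range(lower, upper, class_width):
    count = sum(1 for m in marks if start <= m < start + class_width)
    frequencies[(start, start + class_width)] = count

for (start, end), freq in frequencies.items():
    print(f"{start:>2}-{end:<2} | {'#' * freq} ({freq})")
```

Each printed row corresponds to one rectangle: the class interval gives its breadth and the frequency gives its length.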
Example 1:
Example 2:
2. Frequency Polygon
• A histogram becomes a frequency polygon when a line is drawn joining the
midpoints of the tops of all rectangles in the histogram.
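The vertices of a frequency polygon can be computed directly from a frequency table. In this sketch the table is hypothetical, and the names `frequency_table` and `polygon_points` are assumptions for illustration.

```python
# Hypothetical frequency table: class interval -> frequency (illustrative values).
frequency_table = {(0, 10): 1, (10, 20): 3, (20, 30): 4, (30, 40): 3, (40, 50): 1}

# Each vertex sits at the midpoint of the top of a histogram rectangle:
# x = class midpoint, y = class frequency.
polygon_points = [((low + high) / 2, freq) for (low, high), freq in frequency_table.items()]

print(polygon_points)
```

Joining these points in order with straight lines gives the frequency polygon.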
Example :
Example :