
Unit - I: Descriptive Statistics and Methods For Data Science

The document discusses descriptive statistics and methods for data science. It covers topics like introduction to data science, statistics, data collection and visualization, population vs sample, types of variables, and role of data science in various sectors. Statistical methods like descriptive statistics, inferential statistics, and limitations of statistics are also explained.

UNIT - I

Descriptive Statistics and Methods for Data Science

Contents
➢ Descriptive Statistics
• 1.1 Introduction to Data Science
• 1.2 Statistics
• 1.3 Collection of Data – Primary and Secondary data
• 1.4 Population Vs. Sample
• 1.5 Types of variables
• 1.6 Data Visualization

1.1 Data Science

Data science is the study of data. It is an interdisciplinary field that uses statistics,
scientific computing, scientific methods, processes, algorithms and systems to extract or
extrapolate knowledge and insights from noisy, structured and unstructured data.
It involves developing methods of recording, storing and analyzing data
to effectively extract useful information.

➢Role of Data Science in banking sectors


Applying data science technologies like AI, NLP, and machine learning algorithms
can help banks in several areas like fraud detection, risk management, customer
sentiment analysis, and personalized marketing.

➢Role of Data Science in the educational institutions


Using IBM Cognos Analytics, a university can analyze and predict student
performance. It uses variables such as student background, demographics, high
school grades and economic background to assess the dropout probability of students.

➢Role of Data Science in the decision making process
Self-driving cars collect live data from sensors, including radars, cameras
and lasers, to generate a map of their surroundings. Based on this information, the car
decides when to speed up, when to slow down, where to turn and when to
overtake, making use of advanced machine learning algorithms.

➢How Data Science can be used in predictive analytics.


Data from ships, aircraft, radars and satellites can be collected and analyzed to
build models. These models can not only forecast the weather but also help in
predicting the occurrence of natural calamities.

1.2 STATISTICS
Definition: Statistics is a tool in the hands of mankind to translate complex facts into simple
and understandable statements of facts.

The word statistics is derived from the Italian word “Stato”, which means a political state.

Statistics is defined as a science which deals with scientific methods of collection,
organization, summarization, presentation, analysis and interpretation of numerical data.

Statistical methods are applied for investigation in every important field of science.

STATISTICAL METHODS

• Collection of data: The first step of an investigation is the collection of data. Careful
collection is needed because further analysis is based on this.

• Organization of data: The large mass of figures that are collected from a survey needs
organization.

• Presentation of data: The collected data must be edited so that irrelevant or wrong
computations are corrected or adjusted.

The collected data must be classified and tabulated before they can be analyzed.

Types of Statistics

Statistics is divided into descriptive statistics and inferential statistics.

• Descriptive statistics: It consists of the methods for organizing and summarizing
information.
Ex: Mean, mode, median, graphs.

• Inferential statistics: It consists of the methods for drawing conclusions about a population
based on the information obtained from a sample, and for measuring the reliability of those conclusions.
Ex: Point estimation, interval estimation, testing of hypothesis.
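
The descriptive measures named above (mean, median and mode) can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the list `marks` is a made-up sample, not data from this unit.

```python
import statistics

# Hypothetical sample: marks of nine students (made-up values for illustration).
marks = [45, 52, 52, 60, 61, 67, 70, 52, 81]

mean_mark = statistics.mean(marks)      # arithmetic average of the values
median_mark = statistics.median(marks)  # middle value of the sorted data
mode_mark = statistics.mode(marks)      # most frequently occurring value

print("Mean  :", mean_mark)
print("Median:", median_mark)
print("Mode  :", mode_mark)
```

Each of these numbers summarizes the sample itself, which is what distinguishes descriptive statistics from inferential statistics.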
Scope of Statistics
Statistics and Agriculture
Analysis of variance (ANOVA), one of the statistical tools developed by Professor R.A.
Fisher, plays an important role in agricultural experiments.
In tests of significance based on small samples, it can be shown that statistics is
sufficient to test the significance of the difference between two sample means. In analysis of variance,
we are concerned with testing the equality of several population means.
Example: Five fertilizers are applied to five plots each of wheat, and the yield of wheat on each
of the plots is given. In such a situation, we are interested in finding out whether the effects of
these fertilizers on the yield are significantly different or not.
In other words, we ask whether the samples are drawn from the same normal population or
not. The answer to this problem is provided by the technique of ANOVA, which is used to test
the homogeneity of several population means.
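
As a rough sketch of the fertilizer example, the F statistic of a one-way ANOVA can be computed from the between-group and within-group sums of squares. The function name `one_way_anova_f` and the yield figures are assumptions made for illustration; they do not come from this unit.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way analysis of variance."""
    k = len(groups)                          # number of groups (treatments)
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation between the group means and the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations inside each group.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)        # mean square between groups
    ms_within = ss_within / (n - k)          # mean square within groups
    return ms_between / ms_within

# Made-up wheat yields: five fertilizers (rows), five plots each.
yields = [
    [20, 21, 23, 22, 24],
    [25, 27, 26, 28, 29],
    [19, 18, 20, 21, 22],
    [30, 29, 31, 32, 33],
    [24, 23, 25, 26, 27],
]
f_value = one_way_anova_f(yields)
df_between = len(yields) - 1
df_within = sum(len(g) for g in yields) - len(yields)
print(f"F = {f_value:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

The computed F is then compared with the tabulated F value for the same degrees of freedom at the chosen level of significance. A larger computed value suggests that the population means are not all equal.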

Statistics and Education

• Statistics is widely used in education. Research has become a common feature in all
branches of activities.
• Statistics is necessary for the formulation of policies to start a new course, consideration of
facilities available for new courses etc.
• There are many people engaged in research work to test the past knowledge and evolve new
knowledge. These are possible only through statistics.

Statistics and Modern applications

• Recent developments in computer technology and information technology
have enabled statisticians to integrate their models, making statistics a part of the decision
making procedures of many organizations.
• Many software packages are available for solving design of experiments,
forecasting and simulation problems.
• SYSTAT, a software package, is promoted as offering more scientific and technical graphing options than
other desktop statistics packages.


Limitations of statistics

• Statistics is not suitable for the study of qualitative phenomena.
• It does not study individuals.
• Statistical laws are not exact.
• It is only one of the methods of studying a problem.

1.3 Collection of data
Data and the main sources of collecting data

➢In the view of a layman, data means “information”.
➢In statistics, data means a mass of information collected from different sources.
➢The collection of data is an important task in a statistical enquiry.
➢One should take care while collecting the data; otherwise it leads to wrong conclusions and
faulty decisions.
➢Definition: Collection of data is the process of enumeration (the process of making or stating a
list of things one after another) together with the proper recording of results. The success of
an enquiry depends on the proper collection of data.

➢According to their source, data may be classified into two types:
1. Primary data 2. Secondary data.

1. Primary data: If an individual or an officer collects data to study a particular
problem, the data are the raw materials of the enquiry. They are primary data,
collected by the investigator himself to study that particular problem.
2. Secondary data: Data which have already been collected by someone else for some other purpose
and are available for the present study.

Methods of collecting primary data or sources of primary data

For the collection of primary data the investigator may choose any one of the following
methods.
1. Direct personal observation.
2. Indirect oral interviews.
3. Information from correspondents.
4. Mailed questionnaire method.
5. Schedules sent through enumerators.


1. Direct personal observations

➢The persons from whom information is collected are known as informants.
➢The investigator personally meets them and asks questions to gather the necessary
information.
➢It is a suitable method for intensive rather than extensive field surveys. It is best suited to the
intensive study of a limited field.

2. Indirect Oral Interviews

➢Under this method the investigator contacts witnesses, neighbors, friends or
some other third parties who are capable of supplying the necessary information.
➢This method is preferred if the required information concerns addiction, or the cause of a fire,
theft, murder etc.
➢If a fire has broken out at a certain place, the persons living in the neighborhood and
witnesses are likely to give information on the cause of the fire.
➢In some cases, the police interrogate third parties who are supposed to have
knowledge of a theft or a murder and get some clues.


3. Information from correspondents

➢The investigator appoints local agents or correspondents in different places and
compiles the information sent by them.
➢Newspapers and some government departments receive information by this
method.
➢The advantage of this method is that it is cheap and appropriate for extensive
investigations. But it may not ensure accurate results because the correspondents are
likely to be negligent, prejudiced and biased.

4. Mailed Questionnaire Method

➢Under this method a list of questions is prepared and sent to all the informants by post.
The list of questions is technically called a questionnaire.
➢A covering letter accompanying the questionnaire explains the purpose of the investigation
and the importance of correct information, and requests the informants to fill in the blank
spaces provided and to return the form within a specified time.
➢This method is appropriate in those cases where the informants are literate and are spread
over a wide area.


5. Schedules sent through enumerators

➢It is the most widely used method of collecting primary data.
➢In this method a number of enumerators are selected and trained.
➢They are provided with a standardized questionnaire, and specific training and instructions
are given to them for filling up the schedules.
➢Each enumerator is in charge of a certain area.
➢The enumerators go to the informants with the questionnaire, explain the object and purpose
of the enquiry, get replies to the questions in the schedules and record the answers.

Sources of secondary data
The sources of secondary data can be divided into two categories:
1) Published sources
2) Unpublished sources.
1) Published sources:
Here the data have previously been collected and published.
Example: government publications, statistical reports, journals and newspapers, census
reports, and reports on national sample surveys conducted in India.
2) Unpublished sources:
Here the data are not published, or are kept for personal or departmental
use.
Example: books of accounts, research work of various institutions and
universities, bank account details.


Difference between primary and secondary data


Primary data | Secondary data
Primary data are those data which are collected from the primary sources. | Secondary data are those data which are collected from the secondary sources.
Primary data are known as basic data. | Secondary data are known as subsidiary data.
The collection of primary data is more expensive. | The collection of secondary data is comparatively less expensive.
It takes more time to collect data. | It takes less time to collect data.
Primary data are more accurate. | Secondary data are less accurate than the primary data.
Primary data are known as first hand data. | Secondary data are known as second hand data.
Primary data are not readily available. | Secondary data are readily available.
It is required to take much care at the time of collecting data. | It is not required to take much care at the time of collecting secondary data.
1.4 Population Vs Sample
Population is the complete set of all possible observations of the type which is to be
investigated (the term population does not necessarily refer to people).
The entire group of individuals is called the population.
Finite and infinite population: A population can be either finite or infinite.
When the number of observations can be counted and is definite, it is known as a “finite
population”. When the number of observations cannot be counted and is infinite, it is known as an
“infinite population”.
Examples:
Finite population: the number of workers in a factory, the articles produced by a company on a
particular day.
Infinite population: the number of stars in the sky, the number of people watching television
programs.

Information on population can be collected in two ways.

a. Census method b. Sample method

a. Census method: The object of a census, or complete enumeration, is to collect information
for each and every unit of the population. In this method every element of the population is
included in the investigation. When we make a complete enumeration of all items in the
population, it is known as the census method of collection of data.
Example: If we study the average expenditure of the students of a university which has
20,000 students, we must study the expenditure of all 20,000 students.

b. Sample method: In a sample enquiry, only a part of the whole population
is studied. We can study the characteristics of a population through sampling.
➢Statisticians use the word sample to describe a portion chosen from the population.
➢A finite subset of statistical individuals defined in a population is called a sample.
➢The number of units in a sample is called the sample size.
Example: A tablespoon of rice is taken from the pot to check whether the rice is cooked or not.
Merits:
➢It saves time because fewer items are collected and processed.
➢When the results are urgently required, this method is very helpful.
➢It reduces the cost of the enumeration.
➢More reliable results can be obtained, since handling fewer items leaves fewer chances of
non-sampling errors such as recording or processing mistakes.
➢Expert and trained people can be employed for scientific processing and analysis.
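
The difference between the census method and the sample method can be sketched with a simulated population. The expenditure figures, the sample size of 400 and the fixed random seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is repeatable

# Simulated population: monthly expenditure of 20,000 students (invented values).
population = [random.gauss(5000, 1200) for _ in range(20_000)]

# Census method: every unit of the population is examined.
census_mean = statistics.mean(population)

# Sample method: only a randomly chosen part of the population is examined.
sample_size = 400
sample = random.sample(population, sample_size)
sample_mean = statistics.mean(sample)

print(f"Census mean: {census_mean:.2f}")
print(f"Sample mean: {sample_mean:.2f} (n = {sample_size})")
```

The sample mean is usually close to the census mean even though only a small part of the population is processed, which is the reason for the savings in time and cost listed under the merits.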

1.5 Types of variables


In general, there are two types of variables: ‘dependent variables’ and
‘independent variables’.
• Dependent variables: A dependent variable is something that depends on other factors.
A dependent variable depends on an independent variable, while an independent
variable depends on external manipulation.
• Independent variables: An independent variable is sometimes called an experimental or predictor
variable. It is a variable that is manipulated in an experiment in order to observe the
effect on a dependent variable.

1.6 Data Visualization
Data visualization is the graphical representation of information and data.
➢By using visual elements like charts, graphs and maps, data visualization tools provide an
accessible way to see and understand trends, outliers and patterns in data.
➢Even a layman who has nothing to do with numbers can understand diagrams.
➢Diagrammatic presentation of data gives an immediate understanding of the real situation
described by the data, in comparison to tabular or textual presentation.

Five types of Big data visualization categories:

1. Bar chart
2. Line chart
3. Scatter plot
4. Sparkline
5. Pie chart

1. BAR GRAPH
It is a pictorial representation of numerical data by a number of bars of uniform width and
different heights, erected horizontally or vertically with equal spacing between them.

Example 1:
Subject        Mathematics   Physics   Chemistry   Geography
No. of Books   100           120       80          130

[Figure: bar graph of No. of Books for each subject]

Example 2: [Figure not reproduced]

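
A bar graph of the book counts in Example 1 can be sketched in plain text, with bars of uniform width whose lengths are proportional to the values. The scale of one `#` per 10 books and the helper name `bar_chart_lines` are assumptions made for this illustration.

```python
# Data from Example 1: number of books per subject.
books = {"Mathematics": 100, "Physics": 120, "Chemistry": 80, "Geography": 130}

UNIT = 10  # assumed scale: one '#' stands for 10 books

def bar_chart_lines(data, unit=UNIT):
    """Build one text line per category; bar length is proportional to the value."""
    width = max(len(name) for name in data)  # pad labels so the bars line up
    return [f"{name:<{width}} | {'#' * (value // unit)} {value}"
            for name, value in data.items()]

for line in bar_chart_lines(books):
    print(line)
```

The bars here run horizontally. A plotting library would normally be used for a presentable chart; the text version only shows the principle.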
2. Line Graph/Chart
It is a type of chart which displays information as a series of data points, called markers,
connected by straight line segments.

Example:
Subject        Mathematics   Physics   Chemistry   Geography
No. of Books   100           120       80          130

[Figure: line chart of No. of Books for each subject]


3. Scatter Graph/Chart
It is a type of mathematical diagram using Cartesian coordinates to display values for two
variables for a set of data. With scatter graphs we often talk about how the variables relate to
each other. This is called correlation.

Example:
Subject        Mathematics   Physics   Chemistry   Geography
No. of Books   100           120       80          130

[Figure: scatter chart of No. of Books, vertical axis 0 to 140, horizontal axis 0 to 5]
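
The strength of the relationship seen in a scatter graph is commonly summarized by Pearson's correlation coefficient. The sketch below computes it directly from its definition. The paired values (hours studied and marks obtained) are invented for illustration and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented paired observations: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5, 6]
marks = [35, 42, 50, 49, 63, 70]

r = pearson_r(hours, marks)
print(f"Correlation coefficient r = {r:.3f}")
```

A value of r near +1 means the points rise together along a line, a value near -1 means one variable falls as the other rises, and a value near 0 means there is little linear relationship.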
4. Sparkline
It is a very small line chart drawn without axes or coordinates. It represents the general shape of
the variation in some measurement, such as temperature or stock market price, in a simple and
highly condensed way.

Example:
Subject        Mathematics   Physics   Chemistry   Geography
No. of Books   100           120       80          130

[Figure: sparkline of No. of Books]


5. Pie chart/Pie graph

Meaning of Pie graph:
• A pie diagram is a circle divided into sections. The size of each section indicates the
magnitude of each component as a part of the whole.
Steps Involved in Constructing Pie graph
• Convert the given values into percentages and multiply each percentage by 3.6 to get the
angle in degrees for each item (100% corresponds to 360°).
• Draw a circle and start the diagram at the 12 o'clock position.
• Mark the largest angle first with a protractor and then mark the smaller angles successively.
• Shade the sectors differently to distinguish the items.

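
The first construction step can be checked with a short calculation. This sketch uses the book counts from the bar graph example. The function name `pie_angles` is an assumption made for illustration.

```python
# Number of books per subject, as in the bar graph example.
books = {"Mathematics": 100, "Physics": 120, "Chemistry": 80, "Geography": 130}

def pie_angles(data):
    """Return {category: (percentage, angle in degrees)} for a pie diagram."""
    total = sum(data.values())
    result = {}
    for name, value in data.items():
        percentage = value / total * 100   # share of the whole, in percent
        angle = percentage * 3.6           # 100% corresponds to 360 degrees
        result[name] = (percentage, angle)
    return result

angles = pie_angles(books)
# List the largest sector first, following the order used when drawing.
for name, (pct, angle) in sorted(angles.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"{name:<12} {pct:5.1f}%  {angle:6.1f} degrees")
```

The angles always add up to 360 degrees, which is a quick check before drawing the sectors.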

Example 2:
Subject        Mathematics   Physics   Chemistry   Geography
No. of Books   100           120       80          130

[Figure: pie chart of No. of Books: Mathematics 23%, Physics 28%, Chemistry 19%, Geography 30%]

Advantages of Diagrammatic Presentation:
(1) Diagrams Are Attractive and Impressive:
• Data presented in the form of diagrams are able to attract the attention of even a common man.
(2) Easy to Remember
• Diagrams have a great memorizing effect.
(3) Diagrams save Time
• It presents complex mass data in a simplified manner.
Limitations of diagrammatic presentation:
1. They do not provide detailed information.
2. Diagrams can be easily misinterpreted.
3. Diagrams can take much time and labour.
4. Exact measurement is not possible in diagrams.


Some more important graphs


1. Histogram
2. Frequency Polygon
3. Ogive

1. Histogram: A two-dimensional diagram in which the length (height) of each rectangle shows the
frequency and its breadth shows the size of the class interval.

Example 1: [Figure not reproduced]

Example 2: [Figure not reproduced]

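
The frequency table behind a histogram can be sketched as follows. The marks and the class width of 10 are invented for illustration, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Invented marks of 20 students, for illustration only.
marks = [12, 25, 37, 41, 45, 48, 52, 55, 56, 58,
         61, 63, 64, 67, 69, 72, 75, 78, 84, 93]

CLASS_WIDTH = 10  # assumed size of each class interval

def frequency_table(values, width=CLASS_WIDTH):
    """Count how many values fall in each class interval [lower, lower + width)."""
    counts = Counter((v // width) * width for v in values)
    return {(lower, lower + width): counts[lower] for lower in sorted(counts)}

table = frequency_table(marks)
for (lower, upper), freq in table.items():
    # The breadth of a bar is the class interval; its length is the frequency.
    print(f"{lower:>3}-{upper:<3} | {'#' * freq} ({freq})")
```

Only classes that contain at least one value appear in this table. A complete histogram would also show empty classes as gaps.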
2. Frequency Polygon
• A frequency polygon is obtained from a histogram by joining the
midpoints of the tops of all its rectangles with straight lines.
Example: [Figure not reproduced]


3. Ogive: A curve obtained by plotting cumulative frequency data on graph paper.

Example: [Figure not reproduced]
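
The points to be plotted for a frequency polygon and for an ogive can be derived from a frequency distribution as sketched below. The class intervals and frequencies are invented. The ogive shown is the "less than" type, which plots cumulative frequency against the upper class limit.

```python
from itertools import accumulate

# Invented frequency distribution: (lower limit, upper limit, frequency).
classes = [(0, 10, 4), (10, 20, 7), (20, 30, 12), (30, 40, 9), (40, 50, 3)]

# Frequency polygon: plot each frequency against the midpoint of its class.
polygon_points = [((lower + upper) / 2, freq) for lower, upper, freq in classes]

# "Less than" ogive: plot cumulative frequency against the upper class limit.
cumulative = list(accumulate(freq for _, _, freq in classes))
ogive_points = [(upper, cum) for (_, upper, _), cum in zip(classes, cumulative)]

print("Frequency polygon points:", polygon_points)
print("Ogive points            :", ogive_points)
```

The last cumulative frequency equals the total number of observations, so the ogive always ends at that height.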
