University of Mumbai
Applied Data Science
Strictly as per the New Syllabus (REV-2019 ‘C’ Scheme) of
Mumbai University w.e.f. academic year 2022-2023
Semester 8 : Computer Engineering
(Course Code : CSDC8013) (Department Optional Course -5)
Prof. R. M. Baphana
Adjunct Faculty,
Government College of Engineering, Pune (C.O.E.P)
(Teaching M. Tech & Ph.D Students)

Prof. K. S. Londhe
Formerly, Assistant Professor,
Department of Computer Engineering,
JSPM's Imperial College of Engineering and Research, Wagholi, Pune

Tech-Neo Publications
A Sachin Shah Venture

Preface
Dear students,
We are extremely happy to come out with this edition of
“Applied Data Science” for you. We have divided the subject into
small chapters so that the topics can be arranged and understood
properly. The topics within the chapters have been arranged in a
proper sequence to ensure smooth flow of the subject.
We are thankful to Shri. Sachin Shah for the
encouragement and support that he has extended. We are also
thankful to the staff members of "Tech-Neo Publications" and
others for their efforts to make this book as good as it is. We have
made every possible effort to eliminate all the errors in this book.
However, if you find any, please let us know, because that will help
us to improve further.
We are also thankful to our family members and friends for their
patience and encouragement.
Syllabus
University of Mumbai
Applied Data Science
(Course Code : CSDC8013) (Department Optional Course -5)
Prerequisite : Machine Learning, Data Structures & Algorithms
Course Objectives
To introduce students to the basic concepts of data science.
To acquire an in-depth understanding of data exploration and data visualization.
To be familiar with various anomaly detection techniques.
To understand the data science techniques for different applications.
Course Outcomes
After successful completion of the course students will be able to :
To gain fundamental knowledge of the data science process.
To apply data exploration and visualization techniques.
To apply anomaly detection techniques.
To gain an in-depth understanding of time-series forecasting.
To apply different methodologies and evaluation strategies.
To apply data science techniques to real-world applications.
Introduction to Data Science
Introduction to Data Science, Data Science Process
Motivation to use Data Science Techniques: Volume, Dimensions and
Complexity, Data Science Tasks and Examples
Overview of Data Preparation, Modeling, Difference between data
science and data analytics. (Refer Chapter 1)

Data Exploration
Types of data, Properties of data
Descriptive Statistics : Univariate Exploration: Measure of Central
Tendency, Measure of Spread, Symmetry, Skewness: Karl Pearson
Coefficient of skewness, Bowley's Coefficient, Kurtosis Multivariate
Exploration: Central Data Point, Correlation, Different forms of
correlation, Karl Pearson Correlation Coefficient for bivariate
distribution.
Inferential Statistics : Overview of Various forms of distributions:
Normal, Poisson, Test Hypothesis, Central limit theorem, Confidence
Interval, Z-test, t-test, Type-I, Type-II Errors, ANOVA.
(Refer Chapter 2)
Methodology and Data Visualization
Methodology : Overview of model building, Cross Validation, K-fold
cross validation, leave-1 out, Bootstrapping
Data Visualization
Univariate Visualization: Histogram, Quartile, Distribution Chart
Multivariate Visualization: Scatter Plot, Scatter Matrix, Bubble chart,
Density Chart. Roadmap for Data Exploration.
Self-Learning Topics : Visualizing high dimensional data: Parallel chart,
Deviation chart, Andrews Curves. (Refer Chapter 3)
Anomaly Detection
Outliers, Causes of Outliers, Anomaly detection techniques, Outlier
Detection using Statistics
Outlier Detection using Distance based method, Outlier detection using
density-based methods, SMOTE. (Refer Chapter 4)
Time Series Forecasting
Taxonomy of Time Series Forecasting methods, Time Series
Decomposition
Smoothening Methods: Average method, Moving Average smoothing,
Time series analysis using linear regression, ARIMA Model,
Performance Evaluation: Mean Absolute Error, Root Mean Square Error,
Mean Absolute Percentage Error, Mean Absolute Scaled Error
Self-Learning Topics : Evaluation parameters for Classification,
regression and clustering. (Refer Chapter 5)
Applications of Data Science
Predictive Modeling : House price prediction, Fraud Detection
Clustering: Customer Segmentation.
Time series forecasting : Weather Forecasting
Recommendation engines : Product recommendation.
(Refer Chapter 6)
Assessment
Internal Assessment
Assessment consists of two class tests of 20 marks each. The first class test is to be conducted when
approx. 40% of the syllabus is completed and the second class test when an additional 40% of the
syllabus is completed. Duration of each test shall be one hour.
End Semester Theory Examination :
Question paper will comprise a total of six questions.
All questions carry equal marks.
Questions will be mixed in nature (for example, suppose Q.2 has part (a) from module 3, then
part (b) will be from any module other than module 3).
Only four questions need to be solved.

Contents
Introduction to Data Science 1-1 to 1-25
Data Exploration 2-1 to 2-112
Methodology and Data 3-1 to 3-55
Visualization
Anomaly Detection 4-1 to 4-21
Time Series Forecasting 5-1 to 5-25
APPLIED DATA SCIENCE LAB
Please Download LAB PRACTICALS
from Tech-Neo Website
www.techneobooks.in
CHAPTER 1 : Introduction to Data Science

Introduction to Data Science, Data Science Process
Motivation to use Data Science Techniques: Volume, Dimensions and Complexity,
Data Science Tasks and Examples.
Overview of Data Preparation, Modeling, Difference between data science and
data analytics.

1.1   Introduction to Data Science and Big Data
      1.1.1  Introduction to Data Science
      GQ.    Explain data science in brief. (4 Marks)
      1.1.2  Introduction to Big Data
      GQ.    Write a short note on Big Data. (4 Marks)
1.2   Defining Data Science and Big Data
      GQ.    Define the term data science. (2 Marks)
      GQ.    Define Big Data. (2 Marks)
1.3   The Requisite Skill Set in Data Science
      GQ.    Explain requisite skill set in data science. (4 Marks)
1.4   5 V's of Big Data
1.5   Data Science Life Cycle
1.6   Data : Data Types, Data Collection
      1.6.1  Methods of Collecting Primary Data
1.7   Data Analytic Life Cycle : Overview
      1.7.1  Phase 1 - Discovery Phase
      1.7.2  Phase 2 - Data Preparation
      1.7.3  Phase 3 - Model Planning
      1.7.4  Phase 4 - Model Building
      1.7.5  Phase 5 - Communicate Results
      1.7.6  Phase 6 - Operationalize
1.8   Modeling
      1.8.1  Purpose of Data Modeling
      1.8.2  Different types of Data Models
1.9   Difference between data science and data analytics
1.10  Case Study - GINA : Global Innovation Network and Analysis
      GQ.    Write a short note on Case of GINA. (8 Marks)
      1.10.1 Phase 1 - Discovery
      1.10.2 Phase 2 - Data Preparation
      1.10.3 Phase 3 - Model Planning
Chapter Ends
1.1 INTRODUCTION TO DATA SCIENCE AND BIG DATA

1.1.1 Introduction to Data Science
• Data science is a process in which it is examined where information can be taken from,
what it signifies and how it can be converted into a useful resource in the creation of business and
IT strategies.
• Mining huge quantities of structured and unstructured data to recognize patterns can help an
organization to reduce costs, raise efficiency, identify new market opportunities and enhance
the organization's competitive advantage.
• The data science field draws on the mathematics, statistics and computer science disciplines,
and includes methods like machine learning, cluster analysis, data mining and visualization.
Fig. 1.1.1 : Data science combines domain expertise, math and statistics, a
hacker mindset, visualization, advanced computing, data engineering and the
scientific method

Data scientists
• The amount of data generated by typical modern businesses keeps increasing.
Because of this, the importance of data scientists is also increasing.
• The task of data scientists is to convert the organization's raw data into useful information.
• Data extraction is a method of retrieving particular data from unstructured or badly structured data
sources for further processing and analysis.
Data scientists must acquire a mixture of analytic, machine learning, data mining and statistical
skills, as well as familiarity with algorithms and coding.
• Another task for data scientists, along with managing and understanding large amounts of data, is to
create data visualization models that facilitate demonstrating the business value of digital
information.
• Data scientists must possess emotional intelligence in addition to education and experience in
data analytics to be effective.
• With the help of smartphones, Internet of Things (IoT) devices, social media, internet searches
and behavior, data scientists can illustrate digital information very easily because they
study it on a regular basis.
• Definition : Data mining is the process by which data scientists identify patterns in order to solve
problems when such large data sets are sorted with the help of data analysis.
Data science and machine learning

• Machine learning is often integrated in data science. Machine learning is an Artificial Intelligence
(AI) tool that basically automates the data-processing piece of data science.
• Machine learning includes advanced algorithms that are self-learning and can process huge
amounts of data in a fraction of the time.
• After gathering and processing the structured data from the machine learning tools, data scientists
take the data, transform it and summarize it so it becomes useful for the company's decision-
makers.
• Example : Examples of machine learning applications in the data science field are image
recognition, speech recognition, self-driving vehicles, etc.
1.1.2 Introduction to Big Data

RQ. What is Big Data? Explain the characteristics of big data.
GQ. Write a short note on Big Data.
• Nowadays the amount of data created by various advanced technologies like social networking
sites, e-commerce, etc. is very large. It is really difficult to store such huge data using
traditional data storage facilities.
• Until 2003, the size of data produced was 5 billion gigabytes. If this data were stored in the form of
disks, it might fill an entire football field. In 2011, the same amount of data was created every two
days, and in 2013 it was created every ten minutes. This is a really tremendous rate.
• In this topic, we will discuss big data at a fundamental level and define common concepts
related to big data. We will also look in depth at some of the processes and technologies currently
being used in this field.
Big Data

Definition : Big data means a huge amount of data; it is a collection of large datasets that cannot be
processed using traditional computing techniques. Big Data is complex and difficult to store,
maintain or access in a regular file system. Big Data has become a complete subject, which involves
different techniques, tools, and frameworks.
Sources of big data

There are various sources of big data. Nowadays such huge data gets created in a number of
fields. Some of these fields are :

1. Stock Exchange Data
2. Social Media Data
3. Video Sharing Portals
4. Search Engine Data
5. Transport Data
6. Banking Data

Fig. 1.1.2 : Sources of big data
(1) Stock Exchange : The data in the share market regarding information about prices and status
details of shares of thousands of companies is very huge.
(2) Social Media Data : The data of social networking sites contains information about all the
account holders, their posts, chat history, advertisements, etc. On the topmost sites like Facebook
and WhatsApp, there are literally billions of users.
(3) Video Sharing Portals : Video sharing portals like YouTube, Vimeo, etc. contain millions of
videos, each of which requires a lot of memory to store.
(4) Search Engine Data : Search engines like Google and Yahoo hold a lot of
metadata regarding various sites.
(5) Transport Data : Transport data contains information about the model, capacity, distance and
availability of various vehicles.
(6) Banking Data : The big giants in the banking domain, like SBI or ICICI, hold large amounts of
data regarding the huge transactions of account holders.
Categories of Data

The data can be categorized into three types :

1. Structured data
2. Semi-structured data
3. Unstructured data

Fig. 1.1.3 : Categories of data
(1) Structured Data

This type of data is stored in relations (tables) in a Relational Database Management System.

(2) Semi-structured Data

This type of data is neither raw data nor typed data in a conventional database system. A lot of data
found on the web can be described as semi-structured data. This type of data does not have any
standard formal model. This data is stored using various formats like XML and JSON.

(3) Unstructured Data

This data does not have any pre-defined data model. The data of video, audio, image, text, web logs,
system logs, etc. comes under this category.
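The distinction is easiest to see in code. The sketch below is illustrative only (the records and field names are hypothetical, not from the text) and contrasts the three categories using Python's standard library:

```python
import json
import sqlite3

# Structured data : rows in a relational table with a fixed schema (RDBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'Pune')")

# Semi-structured data : self-describing JSON; no standard formal model,
# and fields may vary from record to record.
record = json.loads('{"id": 2, "name": "Ravi", "orders": [101, 102]}')
print(record["orders"])

# Unstructured data : free text (e.g. a web log line) with no pre-defined
# data model; it must be parsed or mined to extract meaning.
log_line = "2023-01-15 10:42:07 user Ravi viewed product page"
print("Ravi" in log_line)
```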
Important issues regarding data in traditional file storage

In general there are some important issues regarding data in traditional file storage systems:
volume, velocity, variety, variability and complexity.

Fig. 1.1.4 : Important issues regarding data in traditional file storage
(1) Volume

Nowadays the volume of data regarding different fields is high and is potentially increasing day by
day. Organizations collect data from a variety of sources, including business transactions, social
media and other information.

(2) Velocity

A system configuration with a single processor, limited RAM and limited storage capacity
cannot store and manage a high volume of data arriving at high speed.

(3) Variety

The form of data from different sources is different.

(4) Variability

The flow of data coming from sources like social media is inconsistent because of daily emerging
new trends. It can show a sudden increase in the size of data, which is difficult to manage.

(5) Complexity

As the data is coming from various sources, it is difficult to link, match and transform such data
across systems. It is necessary to connect and correlate relationships, hierarchies and multiple data
linkages of the data.
All these issues are solved by the new advanced Big Data Technology.
1.2 DEFINING DATA SCIENCE AND BIG DATA

GQ. Define the term data science.

Defining Data science

Definition : Data science is a field of Big Data which aims at providing meaningful
information from huge amounts of complex data. Data science is a system used for retrieving
information in different forms, either structured or unstructured.

Data science unites different fields of work in statistics and computation in order to understand
data for the purpose of decision making.
Defining Big Data

Definition : Big Data is described as volumes of data available at changing levels of complexity,
produced at different velocities and with changing levels of ambiguity, that cannot be processed using
conventional technologies, processing methods, algorithms, or any commercial off-the-shelf
solutions.
• Data that can be defined as Big Data comes from a variety of fields, such as machine-generated data
from sensor networks, nuclear plants, and airplane engines, and consumer-driven data from social
media.
• The producers of the Big Data that resides within organizations include legal, sales, marketing,
procurement, finance, and human resources departments.
1.3 THE REQUISITE SKILL SET IN DATA SCIENCE
Data science is a combination of skills consisting of three most important areas.
They are explained in brief in Fig. 1.3.1.
Fig. 1.3.1 : Requisite skill set in data science
(1) Mathematics Expertise
• The most important thing required while constructing data products and data mining
insights is the capability to view the data in a quantitative way. There are texture,
measurement, and relationships in data that can be described mathematically.
• Solutions to numerous business problems involve building analytic models grounded in
hard math, where being able to recognize the underlying mechanics of those models is key to
success in building them.
• Also, a common misconception is that data science is all about statistics. While statistics is
important, it is not the only type of math utilized in data science.
© There are two branches of statistics as given in Fig. 1.3.2.
Fig. 1.3.2 : Branches of statistics (classical statistics and Bayesian statistics)
• Having knowledge of both classical and Bayesian statistics is helpful, but when the majority of
people refer to stats, they are normally referring to classical statistics.
(2) Technology and Hacking

• Here the term "hacking" relates to innovation, not to tampering with any confidential data.
• We are going to refer to hacking as a programmer's creativity and cleverness in solving the
problems that arise while building things.
• Hacking is important because data scientists make use of technology in order to handle
huge data sets and work with complex algorithms, and this needs tools far more sophisticated than
Excel.
• Data scientists have to know the fundamentals of a programming language to find quick
solutions for complex data as well as to integrate that data.
• But having only fundamental knowledge is not sufficient for data scientists, because data
science hackers are very creative and can find a way through technical challenges
to make their code work in the desired manner.
• Data science hackers have strong algorithmic thinking, with the ability to
break down messy problems and recompose them in ways that are solvable.
(3) Strong Business Acumen

• For data scientists, it is necessary to behave like a tactical business consultant. As
data scientists work very close to the data, they can learn from it like no one else can.
• This creates a responsibility to transform observations into shared knowledge, and to contribute
to strategy on how to solve core business problems.
• This means a core ability of data science is using data to clearly tell a story. No data-
puking; rather, present a unified description of the problem and solution, with the data
insights as supporting pillars, that leads to guidance.
1.4 5 V'S OF BIG DATA

• Big data is a collection of data from many different sources and is often described by five
characteristics: volume, value, variety, velocity, and veracity.
Fig. 1.4.1 : The 5 V's of big data - Volume (huge amount of data), Velocity
(high speed of accumulation of data), Variety (various sources), Veracity
(inconsistencies and uncertainty in data) and Value (extracting useful data)
• Volume : The size and amount of big data that companies manage and analyze. The name Big
Data itself is related to a size which is enormous. The size of data plays a very crucial role in
determining value out of data. Also, whether particular data can actually be considered Big
Data or not is dependent upon the volume of data. Hence, 'Volume' is one characteristic which
needs to be considered while dealing with Big Data solutions.
• Variety : The diversity and range of different data types, including unstructured data, semi-
structured data and raw data. Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. In earlier days, most applications used databases and spreadsheets
(structured data). Nowadays data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. (unstructured data) is also being considered in analysis applications.
• Value : The most important "V" from the perspective of the business; the value of big data usually
comes from insight discovery and pattern recognition that lead to more effective operations,
stronger customer relationships and other clear and quantifiable business benefits. This refers to the
value that big data can provide, and it relates directly to what organizations can do with that
collected data. Being able to pull value from big data is a requirement, as the value of big data
increases significantly depending on the insights that can be gained from it.
• Velocity : The speed at which companies receive, store and manage data - e.g., the specific
number of social media posts or search queries received within a day, hour or other unit of time.
• Veracity : The "truth" or accuracy of data and information assets, which often determines
executive-level confidence.
1.5 DATA SCIENCE LIFE CYCLE
Fig. 1.5.1 : Data Science Lifecycle
(1) Business Understanding

• The complete cycle revolves around the business goal. What will you solve if you do not
have a specific problem? It is extremely important to understand the business goal clearly,
because that will be the ultimate aim of the analysis.
• Only after proper understanding can we set the precise aim of the analysis, in sync with the
business objective. You need to understand whether the customer desires to minimize savings loss,
or prefers to predict the price of a commodity, etc.
(2) Data Understanding

• After business understanding, the subsequent step is data understanding. This includes a collection
of all the available data. Here you need to work closely with the business team, as they
are certainly aware of what data is present, what data could be used for this
business problem, and other information.
• This step includes describing the data, its structure, its relevance and its data types. Explore
the data using graphical plots. Basically, extract any information that you can get about the
data by simply exploring it.
(3) Preparation of Data

• Next comes the data preparation stage. This consists of steps like selecting the relevant data,
integrating the data by merging the data sets, cleaning the data, treating the missing values
by either eliminating them or imputing them, treating erroneous data by eliminating
it, and additionally checking for outliers using box plots and dealing with them.
• Constructing new data means deriving new features from existing ones. Format the data into the
preferred structure and eliminate unwanted columns and features.
• Data preparation is the most time-consuming, but arguably the most essential, step in the complete
life cycle. Your model will only be as accurate as your data.
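As a minimal sketch of these cleaning steps (illustrative only; the table and column names are made up, not from the text), the pandas snippet below imputes missing values and flags outliers using the same 1.5 x IQR rule that box plots are based on:

```python
import pandas as pd

# Hypothetical raw data with a missing value and an outlier.
df = pd.DataFrame({"age": [25, 31, None, 29, 27, 95],
                   "city": ["Pune", "Mumbai", "Pune", None, "Nashik", "Pune"]})

# Impute missing values: median for numeric, mode for categorical.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Flag outliers using the 1.5 * IQR rule that box plots are based on.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
print(df[outliers])   # the row with age 95 is flagged for treatment
```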
(4) Exploratory Data Analysis

• This step involves getting some idea about the solution and the factors affecting it, before
building the actual model.
• The distribution of data within different variables is explored graphically using
bar graphs, and relations between different features are captured through graphical
representations like scatter plots and heat maps.
• Many data visualization techniques are extensively used to explore each and every feature
individually and in combination with other features.
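A compact sketch of this kind of exploration (illustrative; the dataset is synthetic), drawing a histogram, a scatter plot and a correlation heat map with pandas and matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: house areas, ages and prices.
df = pd.DataFrame({"area":  [650, 800, 1200, 950, 1500, 700],
                   "age":   [12, 8, 3, 10, 1, 15],
                   "price": [52, 68, 110, 80, 140, 55]})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
# Univariate: distribution of a single variable.
df["price"].plot.hist(ax=axes[0], title="Price distribution")
# Bivariate: relation between two features.
df.plot.scatter(x="area", y="price", ax=axes[1], title="Area vs price")
# Heat map of the correlation matrix between all features.
im = axes[2].imshow(df.corr(), cmap="coolwarm", vmin=-1, vmax=1)
axes[2].set_xticks(range(len(df.columns)))
axes[2].set_xticklabels(df.columns)
axes[2].set_yticks(range(len(df.columns)))
axes[2].set_yticklabels(df.columns)
axes[2].set_title("Correlation heat map")
fig.colorbar(im, ax=axes[2])
plt.tight_layout()
plt.show()
```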
(5) Data Modeling

• Data modeling is the heart of data analysis. A model takes the prepared data as input and
gives the desired output.
• This step consists of selecting the suitable kind of model, depending on whether the problem is a
classification problem, a regression problem or a clustering problem. After deciding on the model
family, among the number of algorithms in that family, we need to carefully pick the
algorithms to implement, and implement them.
• We need to tune the hyperparameters of every model to obtain the desired performance. We
also need to make sure there is the right balance between performance and
generalizability. We do not want the model to memorize the data and perform poorly on new
data.
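One common way to put this into practice (a sketch assuming scikit-learn is available; the model family and grid values are arbitrary choices, not prescribed by the text) is cross-validated hyperparameter tuning:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Tune hyperparameters with 5-fold cross-validation to balance
# performance against overfitting.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [3, 6, None]},
                    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```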
(6) Model Evaluation

• Here the model is evaluated to check whether it is ready to be deployed. The model is tested
on unseen data and evaluated on a carefully thought out set of evaluation metrics. We also
need to make sure that the model conforms to reality.
• If we do not achieve a quality result in the evaluation, we have to re-iterate the complete
modelling process until the desired level of metrics is achieved. Any data science solution, a
machine learning model, just like a human, must evolve, must be able to improve itself with
new data and adapt to a new evaluation metric.
• We can build more than one model for a certain phenomenon; however, a lot of them may
also be imperfect. Model evaluation helps us select and build an ideal model.
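For instance, a minimal held-out evaluation might look like the following sketch (illustrative, using scikit-learn on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
# Hold out unseen data that the model never trains on.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
print("accuracy:", round(accuracy_score(y_test, pred), 3))
print("F1 score:", round(f1_score(y_test, pred), 3))
```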
(7) Model Deployment

• The model, after a rigorous evaluation, is finally deployed in the desired structure and channel.
This is the last step in the data science life cycle.
• Each step in the data science life cycle described above must be worked upon carefully. If any step
is performed improperly, it will affect the subsequent steps and the complete effort
goes to waste. For example, if data is not collected properly, you will lose records and you
will not build an ideal model.
• If data is not cleaned properly, the model will not work. If the model is not evaluated
properly, it will fail in the real world. Right from business understanding to model deployment,
every step has to be given appropriate attention, time, and effort.
1.6 DATA : DATA TYPES, DATA COLLECTION
• Data collection is the process of acquiring, collecting, extracting, and storing the voluminous
amount of data which may be in structured or unstructured form, like text, video, audio, XML
files, records, or other image files, used in later stages of data analysis.
• In the process of big data analysis, "data collection" is the initial step before starting to analyze
the patterns or useful information in data. The data which is to be analyzed must be collected from
different valid sources.

Fig. 1.6.1

• The data which is collected is known as raw data, which is not useful as-is, but on cleaning the
impurities and utilizing that data for further analysis it forms information; the insight obtained is
known as "knowledge". Knowledge has many applications, like business knowledge, sales of
enterprise products, disease treatment, etc. The main goal of data collection is to collect
information-rich data.
• Data collection starts with asking some questions, such as what type of data is to be collected and
what is the source of collection.
• Most of the data collected is of two types: "qualitative data", which is a group of non-
numerical data such as words and sentences, mostly focused on the behavior and actions of a group,
and "quantitative data", which is in numerical form and can be calculated using different
scientific tools and sampling data.
The actual data is then further divided mainly into two types, known as primary data and secondary data.

Fig. 1.6.2 : Types of data

(A) Primary data

• The data which is raw, original, and extracted directly from official sources is known as
primary data. This type of data is collected directly by performing techniques such as
questionnaires, interviews, and surveys.
• The data collected must be according to the demand and requirements of the target audience
on which analysis is performed; otherwise it would be a burden in the data processing.
1.6.1 Methods of Collecting Primary Data

(1) Interview method

• The data collected during this process is through interviewing the target audience by a person
called the interviewer, and the person who answers the interview is known as the interviewee.
• Some basic business or product related questions are asked and noted down in the form of
notes, audio, or video, and this data is stored for processing. These can be both structured and
unstructured, like personal interviews or formal interviews through telephone, face to face,
email, etc.

(2) Survey method

• The survey method is the process of research where a list of relevant questions is asked and
answers are noted down in the form of text, audio, or video.
• The survey method can be conducted in both online and offline mode, like through website
forms and email. Then those survey answers are stored for analyzing data. Examples are online
surveys or surveys through social media polls.

(3) Observation method

• The observation method is a method of data collection in which the researcher keenly
observes the behavior and practices of the target audience using some data collecting tool and
stores the observed data in the form of text, audio, video, or any raw format.
• In this method, the data is collected directly by posing a few questions to the participants. For
example, observing a group of customers and their behavior towards the products. The data
obtained will be sent for processing.
(4) Experimental method

The experimental method is the process of collecting data by performing experiments,
research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD and
FD (a small ANOVA sketch follows this list).

(i) CRD : Completely Randomized Design is a simple experimental design used in data analytics
which is based on randomization and replication. It is mostly used for comparing
experiments.
(ii) RBD : Randomized Block Design is an experimental design in which the experiment is
divided into small units called blocks. Random experiments are performed on each of the
blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD
originated in the agriculture sector.
(iii) LSD : Latin Square Design is an experimental design that is similar to CRD and RBD
but contains rows and columns. It is an arrangement of N x N squares with an equal number of
rows and columns, which contain letters that occur only once in a row. Hence the differences
can easily be found with fewer errors in the experiment. The Sudoku puzzle is an example of a
Latin square design.
(iv) FD : Factorial Design is an experimental design where each experiment has two or more
factors, each with possible values, and on performing trials other combinational factors are derived.
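As a minimal illustration of the ANOVA analysis used with such designs (the treatment blocks and yield numbers below are made up for illustration):

```python
from scipy import stats

# Hypothetical yields from three treatment blocks of an experiment.
block_a = [20.1, 21.4, 19.8, 22.0]
block_b = [23.5, 24.1, 22.8, 23.9]
block_c = [19.0, 18.7, 20.2, 19.5]

# One-way ANOVA: do the group means differ significantly?
f_stat, p_value = stats.f_oneway(block_a, block_b, block_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests at least one mean differs.
```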
(B) Secondary data

Secondary data is data which has already been collected and is reused again for some valid
purpose. This type of data is previously recorded from primary data and it has two types of sources,
named internal sources and external sources.

(1) Internal sources

• These types of data can easily be found within the organization, such as market records, sales
records, transactions, customer data, accounting resources, etc. The cost and time consumption
in obtaining internal sources is less.

(2) External sources

• The data which can't be found in internal organizations and is gained through external
third party resources is external source data.
• The cost and time consumption is more because this contains a huge amount of data.
Examples of external sources are Government publications, news publications, Registrar
General of India, planning commission, international labour bureau, syndicate services, and
other non-governmental publications.
(3) Other sources

(i) Sensor data : With the advancement of IoT devices, the sensors of these devices collect data
which can be used for sensor data analytics to track the performance and usage of products.
(ii) Satellite data : Satellites collect a lot of images and data, in terabytes, on a daily basis through
surveillance cameras, which can be used to collect useful information.
(iii) Web traffic : Due to fast and cheap internet facilities, many formats of data which are uploaded
by users on different platforms can be predicted and collected with their permission for data
analysis. The search engines also provide their data through the keywords and queries searched
most often.
1.7 DATA ANALYTIC LIFE CYCLE : OVERVIEW

RQ. Explain different phases of the data analytics life cycle. (Q. 1(b), Aug. 18, 6 Marks)
RQ. Explain the Data Analytic Life Cycle.
RQ. Draw the Data Analytics Lifecycle and give a brief description of all phases.
    (Q. 1(b), May 19, 5 Marks)
• At this level we need deeper knowledge of the specific roles and responsibilities of the
data scientist.
• The data scientist lifecycle is illustrated in Fig. 1.7.1, which gives a high-level overview of the
data scientist discovery and analysis process.
• It depicts the iterative behaviour of the work performed by data scientists, with several stages being
repeated in order to make sure that the data scientist is utilizing the "right" analytic model to
locate the "right" insights.

Fig. 1.7.1 : Data Scientist Lifecycle
1.7.1 Phase 1 - Discovery Phase

The Discovery phase focuses on the following activities of data scientists :

• Acquisition of a complete understanding of the business process and the business domain. This
consists of recognizing the key metrics and KPIs against which the business users will measure
success.
• Recognizing the most vital business questions and business decisions that the business users are
attempting to answer in support of the targeted business process. This should also cover the
frequency and optimal timeliness of those answers and decisions.
• Evaluating available resources and going through the process of framing the business problem as
an analytic hypothesis. At this stage the data scientist constructs the initial analytics development
plan that will be used to direct and document the resulting analytic models and insights.
• It should be noted that understanding into which production or operational environments the
analytic insights need to be published is something that should be recognized in the analytics
development plan.
• Such information will be essential as the data scientist identifies in the plan where to
"operationalize" the analytic insights and models.
• This is a good opportunity for close collaboration with the BI analyst, who likely has already
defined the metrics and processes required to support the business proposal.
• The BI analyst understands the requirements and the decision-making environment of the business
users well, which can jump-start the data scientist's analytics development plan.
1.7.2 Phase 2 - Data Preparation

The Data Preparation phase focuses on the following activities of data scientists :

• Provisioning an analytic workspace, or an analytic sandbox, where the data scientist can work free
of the constraints of a production data warehouse environment. Preferably, the analytic
environment is set up such that the data scientist can self-provision as much data space and analytic
horsepower as required and can fine-tune those requirements throughout the analysis process.
• Obtaining, cleaning, aligning, and examining the data. This includes the use of data visualization
techniques and tools to get an understanding of the data, recognizing outliers in the data and
calculating the gaps in the data to decide the overall data quality; determine if the data is "good
enough."
• Transforming and enhancing the data. The data scientist will look to use analytic techniques, such
as logarithmic and wavelet transformations, to sort out potential skewing in the data. The data
scientist will also look to use data enhancement techniques to create new composite metrics such as
frequency, recency, and order. The data scientist will make use of standard tools like SQL and
Java, as well as both commercial and open source extract, transform, load (ETL) tools to transform
the data.
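A tiny sketch of such transformation and enhancement (illustrative only; the transaction table is hypothetical), applying a log transform to reduce skew and deriving frequency and recency composite metrics with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical transaction log: one row per purchase.
tx = pd.DataFrame({"customer": ["A", "A", "B", "C", "C", "C"],
                   "amount":   [120, 80, 15000, 40, 55, 60],
                   "days_ago": [2, 30, 90, 5, 12, 20]})

# Log transform to sort out skew in the heavy-tailed amount column.
tx["log_amount"] = np.log1p(tx["amount"])

# Derive composite metrics per customer: frequency and recency.
summary = tx.groupby("customer").agg(frequency=("amount", "count"),
                                     recency=("days_ago", "min"),
                                     avg_log_amount=("log_amount", "mean"))
print(summary)
```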
• After this stage is completed, the data scientist should feel comfortable enough with the quality
and richness of the data to move ahead to the next stage of the analytics development process.
1.7.3 Phase 3 - Model Planning

The Model Planning phase focuses on the following activities of data scientists :

• Determining the numerous analytical models, methods, techniques and workflows to explore as
part of the analytic model development. The data scientist may know in advance which of the
analytic models and methods are suitable, but it is a good thing to plan to check at least one more
to make sure that the opportunity to build a more predictive model is not missed.
• Determining association and co-linearity between variables in order to select the key variables to
be used in the model development. The data scientist wants to estimate the cause-and-effect
variables as early as possible. Keep in mind, association does not guarantee causation, so care
must be taken in choosing the variables that will be carried forward.
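A short sketch of such a co-linearity check (illustrative; the variables and the 0.9 cut-off are arbitrary assumptions, not from the text):

```python
import pandas as pd

# Hypothetical candidate variables for a model.
df = pd.DataFrame({"area_sqft": [650, 800, 1200, 950, 1500],
                   "rooms":     [1, 2, 3, 2, 4],
                   "age_years": [12, 8, 3, 10, 1]})

corr = df.corr().abs()
print(corr.round(2))

# Flag highly collinear pairs so only one of each pair is kept.
for i, a in enumerate(df.columns):
    for b in df.columns[i + 1:]:
        if corr.loc[a, b] > 0.9:
            print(f"collinear: {a} vs {b} (r = {corr.loc[a, b]:.2f})")
```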
1.7.4 Phase 4 - Model Building

The Model Building phase focuses on the following activities of data scientists :

• Manipulating the data sets for testing, training, and production. Whatever new transformation
techniques are developed can be tested to observe whether the quality, reliability, and predictive
capabilities of the data can be enhanced or not.
• Calculating the feasibility and reliability of the data for use in the predictive models. Decision
calls depend on the quality and reliability of the data; is the data "good enough" to be used in
developing the analytic models?
• Finally, developing, testing, and refining the analytic models is done. Testing is carried out to
see which variables and analytic models deliver the highest quality, most predictive and
actionable analytic insights.
• The model building stage is a highly iterative step where manipulation of the data, calculating the
reliability of the data, and determining the quality and predictive powers of the analytic model will
be repeated a number of times.
• In this stage the data scientist may be unsuccessful many times in testing different variables and
modelling techniques before settling on the "right" one.
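For example, the iterative test-and-compare loop this phase describes could be sketched as follows (illustrative, using scikit-learn; the candidate models are arbitrary assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)

# Iteratively test candidate models; keep the most predictive one.
candidates = {"logistic": LogisticRegression(max_iter=1000),
              "tree":     DecisionTreeClassifier(max_depth=4, random_state=1)}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```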
1.7.5 Phase 5 - Communicate Results

The Communicate Results phase focuses on the following activities of data scientists :

• Determining the quality and reliability of the analytic model and its statistical implications, and
the ability to measure and act on the resulting analytic insights. The data scientist wants to make
sure that the analytic process and model were successful and accomplish the required analytic
goals of the project.
• Communicating the insights of the analytic model, the results and the suggestions requires the use
of graphics and charts. It is significant that the business stakeholders, such as business users,
business analysts, and the BI analysts, should understand and absorb the resulting analytic insights.
• The BI analysts are partners in this stage of the data science lifecycle. The BI analysts have a
strong understanding of what to present to their business users and how to present it.
1.7.6 Phase 6 - Operationalize

The Operationalize phase focuses on the following activities of data scientists :

• Providing the final suggestions, reports, meetings, code, and technical documents.
• Optionally, running a pilot or analytic lab to validate the business case, the financial return on
investment (ROI) and the analytic lift.
• Deploying the analytic models in the production and operational environments. This involves
working with the application and production teams to decide how best to surface the analytic
results and insights.
• Combining the analytic scores into management dashboards and operational reporting systems,
like sales systems, procurement systems, financial systems, etc.
• The operationalization stage is another area where collaboration between the data scientist and
the BI analysts should be very useful.
• Numerous BI analysts have experience in integrating reports and dashboards into
operational systems, as well as establishing centers of excellence to spread analytic learning and
skills across the organization.
1.8 MODELING

• Data modeling is the process of creating a simplified diagram of a software system and the data
elements it contains.
• It uses text and symbols to represent the data and how it flows.
• Data models provide a blueprint for designing a new database.
• Thus, data modeling helps an organisation to use its data effectively to meet business needs for
information.
• Actually, a data model is a flowchart that illustrates data entities and the relationships between
entities.
• It enables data management to document data requirements for applications in development plans.
• It also helps to identify errors before any code is written.

1.8.1 Purpose of Data Modeling
• Data modeling is a core data management discipline. It provides a visual representation of data
sets and their business context.
• It helps to locate information that is needed for different business processes.
• It specifies the characteristics of the data elements that are included in the
datasets and that are processed.
• Data modeling plays an important role in data architecture processes.
• It maps how data moves through IT systems and creates a conceptual data management framework.
• Earlier, data models were built by data architects and other data management professionals.
• They used to take input from business analysts, executives and users.
• But nowadays data modeling is an important skill for data scientists and analysts.
• They develop business intelligence applications and more complex data science and advanced
analytics.
1.8.2 Different types of Data Models

• Data modeling uses three types of models to separately represent business concepts and
workflows and technical structures for managing the data.
• The models are created in progression as organisations plan new applications and databases.
• The different types of data models are as follows :

(1) Conceptual data model

• This is a high-level visualisation of the business or analytics processes that a system will support.
• It gives the kinds of data that are required.
• It shows how different business entities interrelate.
• These conceptual data models help users to see how a system will work and ensure that it
meets business needs.
• These conceptual models are not connected to specific database or application technologies.

(2) Logical data model

• Logical data models show how data entities are related and describe the data from a technical
perspective.
• They define data structures and provide details on attributes, data types, keys and
other important characteristics.
• The technical side of an organisation uses logical models to help understand required
application and database designs. Again, they are not tied to a particular technology
platform.
(3) Physical data model

• A logical model acts as the basis for the creation of a physical model. Physical models are
specific to the application software that will be implemented.
• They define the structures that the database or a file system will use to store and manage the
data.
• This includes tables, column fields, indexes, constraints, triggers and other DBMS elements.
• Database designers use physical data models to create designs and generate schemas for
databases.
1.9 DIFFERENCE BETWEEN DATA SCIENCE AND DATA ANALYTICS

Data science

• Data science deals with extracting meaningful information and insights by applying various
algorithms, processes and scientific methods to structured and unstructured data.
• This field of data science is related to big data and is one of the most in-demand skills at present.
• Data science consists of mathematics, computations, statistics, programming, etc. to gain
important and relevant insights from the large amounts of data that are provided in various formats.

Data Analytics

• Data analytics draws conclusions by processing the raw data.
• It helps the company to make decisions based upon the conclusions from the data.
• It converts a large number of figures in the form of data into plain English, and these conclusions
are further helpful in making the required decisions.
• The table below compares Data Science and Data Analytics.
Table 1.9.1

Feature                 | Data science                                   | Data analytics
------------------------|------------------------------------------------|--------------------------------
Coding language         | Python is the commonly used language for data  | Knowledge of Python and R is
                        | science, along with other languages such as    | essential for data analytics.
                        | C++, Java, etc.                                |
Programming skills      | In-depth knowledge of programming is required  | Basic programming skills are
                        | for data science.                              | sufficient for data analytics.
Use of machine learning | Data science makes use of machine learning     | Data analytics does not make
                        | algorithms to get insights.                    | use of machine learning.
Other skills            | Data science makes use of data mining          | Hadoop-based analysis is used
                        | activities for getting meaningful insights.    | for getting conclusions from
                        |                                                | raw data.
Scope                   | The scope of data science is very large,       | The scope of data analytics is
                        | i.e., macro.                                   | very small, i.e., micro.
Goals                   | Data science deals with explorations and       | Data analytics makes use of
                        | new innovations.                               | existing resources.
1.10 CASE STUDY - GINA : GLOBAL INNOVATION NETWORK AND ANALYSIS

GQ. Write a case study on Global Innovation Network & Analysis (GINA).
GQ. Write a short note on the case of GINA. (8 Marks)
• EMC's GINA (Global Innovation Network and Analytics) team is a group of senior technologists
placed in centers of excellence (COEs) all over the world.
• The main goal of the team is to connect employees all over the world to drive innovation,
research, as well as university partnerships.
• The basic consideration of the GINA team was that its approach would offer an interface to share
ideas globally and enhance sharing of knowledge between GINA members who are not at one
place geographically.
• A data repository was created to store both structured and unstructured data, to achieve three
important goals :
(1) Store formal as well as informal data.
(2) Keep track of research from technologists all over the world.
(3) To enhance operations and strategy, extract data for patterns and insights.
• The case study of GINA illustrates an example of the way a team applied the Data
Analytics Lifecycle for the purpose of analyzing innovation data at EMC.
• Innovation is generally considered a hard concept to measure, and this team used
advanced analytical methods to identify key innovators within the company.
1.10.1 Phase 1 - Discovery

• In this phase, identification of data sources was started by the team.
• Even though GINA had technologists skilled in several different aspects of engineering, it had
only some data and ideas regarding what it needed to explore, and did not have a formal team
which could perform these analytics.
• The team consulted various experts and decided to outsource the work to volunteers within
EMC.
• The list of roles fulfilled on the working team is as follows :
(i)   Business User, Project Sponsor, Project Manager : Vice President
(ii)  Business Intelligence Analyst : Representatives from IT
(iii) DBA (Data Engineer and Database Administrator) : Representatives from IT
(iv)  Data Scientist : Distinguished Engineer, able to develop social graphs
• The approach of the project sponsor was to leverage social media and blogging for the purpose of
accelerating the collection of innovation and research data across the world and to inspire teams of
data scientists who could work as "volunteers" globally.
• The data scientists should show passion about data, and the project sponsor should have the
ability to tap into this passion of greatly talented people to achieve challenging work in a creative
way.
• The data regarding the project is divided into two important categories. The first category
concerns the idea submissions of roughly five years from EMC's internal innovation contests,
called the Innovation Roadmap or Innovation Showcase.
• The Innovation Roadmap is nothing but an organic innovation process in which ideas are
submitted by employees globally, which are then judged.
• Some of these ideas are then selected for further incubation.
• Consequently the data is a combination of structured data, like idea counts, submission dates and
inventor names, and unstructured content, like the textual descriptions of the ideas
themselves.
• The second category of data consists of minutes as well as notes representing
innovation and research activity globally.
• It too is a combination of structured and unstructured data. The structured data
consists of attributes like dates, names and geographic locations.
• The unstructured documents capture the "who, what, when, and where", which represents
rich data regarding knowledge growth and transfer inside the company.
• The GINA team developed 10 important IHs (initial hypotheses) :
(1) IH1 : It is possible to map innovation activity in dissimilar geographic locations to corporate
strategic directions.
(2) IH2 : The delivery time of ideas is minimized by the transfer of global knowledge as part of
the idea delivery process.
(3) IH3 : Innovators participating in global knowledge transfer are able to deliver ideas faster
than those who do not.
(4) IH4 : It is possible to analyze and evaluate an idea submission for the likelihood of receiving
funding.
(5) IH5 : Knowledge invention and growth for a specific topic can be measured as well as
compared across geographic locations.
(6) IH6 : Research-specific boundary spanners can be identified by knowledge transfer activity
in different regions.
(7) IH7 : It is possible to map strategic corporate themes to geographic locations.
(8) IH8 : Continuous knowledge growth and transfer events minimize the time required to create
a corporate asset from an idea.
(9) IH9 : Lineage maps reveal when knowledge expansion and transfer did not generate a
corporate asset.
(10) IH10 : It is possible to classify and map emerging research topics to particular ideators,
innovators, boundary spanners, and assets.
1.10.2 Phase 2 - Data Preparation

• A new analytics sandbox was set up by the team with its IT department for the purpose of storing
and experimenting on the data.
• In the process of the data exploration exercise, the data scientists and data engineers came to know
that specific data required conditioning and normalization.
• They also found that various missing datasets made it difficult to test some of the analytic
hypotheses.
• As the data was explored by the team, it promptly realized that without good quality data, it would
not be able to carry out the subsequent steps in the lifecycle process.
• Consequently it was essential to decide, for the project, what level of data quality and cleanliness
was necessary.
• In the case of GINA, the team realized that several of the names of the researchers and people
who were communicating with the universities were misspelled or had leading and trailing spaces
in the data store.
• Such small problems must be addressed in this phase to enable better analysis as well as data
aggregation in subsequent phases.
1.10.3 Phase 3 - Model Planning

• In the GINA project, for the large dataset, it looked viable to use social network analysis
techniques to observe the networks of innovators (a small sketch of this kind of analysis is given
at the end of this section).
• In other cases, it was hard to find appropriate methods to test hypotheses because of the lack of
data.
• In one case (IH9), a decision was made by the team to begin a longitudinal study to start tracking
data points over time about people who were developing new intellectual property.
• This data collection would support the team in testing the following two hypotheses later :
(i)  IH8 : Do continuous knowledge growth and transfer events minimize the time required to
create a corporate asset from an idea?
(ii) IH9 : Lineage maps reveal when knowledge expansion and transfer did not generate a
corporate asset.
• For the proposed longitudinal study, the team needed to establish goal criteria for the
purpose of the study.
In particular, it needed to decide the end goal of a successful idea that had traversed the entire
journey. The parameters regarding the scope of the study consist of the following considerations :
(i)   Identify the correct milestones for the purpose of accomplishing this goal.
(ii)  Trace the way people move ideas from each and every milestone towards the goal.
(iii) After this, trace ideas which were unable to reach the goal, and trace others which were able
to reach the goal. Compare the journeys of both types of ideas.
(iv)  Make comparisons regarding the times and the outcomes with the help of a few different
methods, based on the way the data is collected and assembled.
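As a flavour of the social network analysis mentioned in this phase (an illustrative sketch, not the GINA team's actual code; the collaboration edges are invented), degree centrality over a co-authorship graph can surface candidate key innovators and boundary spanners:

```python
import networkx as nx

# Hypothetical collaboration graph: an edge means two employees
# co-authored an idea submission.
g = nx.Graph()
g.add_edges_from([("Asha", "Ravi"), ("Asha", "Mei"), ("Asha", "Joao"),
                  ("Ravi", "Mei"), ("Joao", "Lena"), ("Lena", "Omar")])

# Degree centrality: highly connected people are candidate
# "boundary spanners" / key innovators.
centrality = nx.degree_centrality(g)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```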
Chapter Ends...
MODULE 2
CHAPTER 2 : Data Exploration

Types of data, Properties of data
Descriptive Statistics : Univariate Exploration: Measure of Central Tendency, Measure of
Spread, Symmetry, Skewness: Karl Pearson Coefficient of Skewness, Bowley's Coefficient,
Kurtosis. Multivariate Exploration: Central Data Point, Correlation, Different forms of
correlation, Karl Pearson Correlation Coefficient for bivariate distribution.
Inferential Statistics : Overview of various forms of distributions: Normal, Poisson, Test of
Hypothesis, Central Limit Theorem, Confidence Interval, Z-test, t-test, Type-I, Type-II
Errors, ANOVA.
2.1 Introduction to Statistics
2.2 Measures of Central Tendency
2.3 Review of Basic Results in the Theory of Statistics
  2.3.1 Range and Mid-range
  2.3.2 Variance and Standard Deviation
  2.3.3 Arithmetic Mean
  2.3.4 Moments about Mean
  2.3.5 Relation between Moments about Mean (μr) and Moments about Origin (μr')
  2.3.6 Karl Pearson's Coefficients of Kurtosis
  2.3.7 The Expected Value of X (Mean Value of X)
  2.3.8 Covariance
  2.3.9 Examples
2.4 Sampling Distributions
  2.4.1 Random Sampling
2.5 Testing of Hypothesis
  2.5.1 Statistical Hypothesis
  2.5.2 Test of Hypothesis
  2.5.3 Tests of Significance
  2.5.4 Null Hypothesis
  2.5.5 Alternate Hypothesis
  2.5.6 Types of Errors
  2.5.7 Type I Error and Type II Error
  2.5.7(A) Comparison between Type I and Type II Errors
  2.5.8 Power of Test
  2.5.9 Level of Significance
  2.5.10 Critical Region
  2.5.11 Examples
2.6 Chi-square Test of Goodness of Fit
  2.6.1 Contingency Table
  2.6.2 Degrees of Freedom
2.7 Chi-Square Test
  2.7.1 Probability Density Function (p.d.f.) of Chi-square Distribution
  2.7.2 Remark
  2.7.3 Applications of χ²-Distribution
  2.7.4 Chi-Square Test of Goodness of Fit
  2.7.5 Steps to Compute χ² and Drawing Conclusions
  2.7.6 Conditions for the Validity of Chi-Square Test
  2.7.7 Examples
  2.7.8 Levels of Significance
  2.7.9 Method of Solving the Problem
  2.7.10 Student's t-Distribution
  2.7.11 Properties of t-Distribution
  2.7.12 Applications of t-Distribution
2.8 t-Test for Significance of Sample Correlation Coefficient
  2.8.1 Examples
2.9 t-Test for Difference of Means
  2.9.1 Assumptions for Difference of Means Test
  2.9.2 Examples
2.10 Z-Test
  2.10.1 Use of Z-Test
  2.10.2 Hypothesis Testing
  2.10.3 Steps of Performing Z-Test
  2.10.4 Types of Z-Test
  2.10.5 Solved Examples
  2.10.6 Two-Sample Z-Test
2.11 Correlation
  2.11.1 Types of Correlation
  2.11.2 Scatter Diagram
2.12 Central Data Point
  2.12.1 Data Point
  2.12.2 Requirements for a Good Data Point
  2.12.3 Use of Data Points
  2.12.4 Collection of Data Points
  2.12.5 Examples of Data Point Collection Methods
  2.12.6 Analysis of Data Points
  2.12.7 Examples of Data Points
  2.12.8 Karl Pearson's Coefficient of Correlation
  2.12.9 Properties of Coefficient of Correlation
  2.12.10 Examples on Correlation Coefficient
2.13 Rank Correlation
  2.13.1 Spearman's Rank Correlation Coefficient
  2.13.2 Tied Ranks
2.14 Bowley's Coefficient
  2.14.1 Bowley Skewness
  2.14.2 Why Bowley Skewness Works
  2.14.3 Limitations of Bowley Skewness
2.15 Poisson Distribution
  UQ. Write short note on Rayleigh distribution.
  UQ. Explain Poisson distribution and its properties.
  2.15.1 Derivation of the Poisson Distribution from the Binomial Distribution
  2.15.2 Moments of the Poisson Distribution
  2.15.3 Moment Generating Function
  2.15.4 Additive Property of Poisson Distribution
2.16 Examples on Poisson Distribution
2.17 Normal Distribution (or Gaussian Distribution)
  UQ. Discuss normal distribution and its characteristics.
  UQ. State the importance of normal distribution.
  UQ. What is the importance of the standard normal variate? (MU - Dec. 2020)
  2.17.1 Characteristics of Normal Distribution
  2.17.2 Properties of Normal Distribution (N.D.)
2.18 Importance of Normal Distribution
2.19 Solved Examples on Normal Distribution
2.20 Analysis of Variance (ANOVA)
  2.20.1 Definition of ANOVA
  2.20.2 Assumptions for ANOVA Test
2.21 Hypothesis Testing for More than Two Means (ANOVA)
  2.21.1 Alternative for Computation of Various Sums of Squares
2.22 Solved Examples
2.23 Central Limit Theorem
  UQ. State the central limit theorem and explain.
  UQ. Explain the central limit theorem.
  2.23.1 Examples on Central Limit Theorem
2.24 Confidence Intervals
Chapter Ends...
2.1 INTRODUCTION TO STATISTICS
Definition (1) : A variate is any quantity or attribute whose value varies from one unit of investigation to another.
Definition (2) : An observation is the value taken by a variate for a particular unit of investigation.
Variates differ in nature, and the methods of analysis of a variate depend on its nature. We can distinguish between quantitative variates (like the birth-weight of a baby) and qualitative variates (such as the sex of a baby).
Definition (3) : A quantitative variate is a variate whose values are numerical.
Definition (4) : A qualitative variate or attribute is a variate whose values are not numerical.
Quantitative variates can also be divided into two types :
(i) they may be continuous, if they can take any value in some specified range, or
(ii) discrete, if their values change by steps or jumps.
Definition (5) : A continuous variate is a variate which may take all values within a given range.
Definition (6) : A discrete variate is a variate whose values change by steps.
The choice of which variates to record is important in any investigation. Once the choice is made, the information can be summarized by the frequency distribution of the possible values.
Definition (7) : The frequency distribution of a (discrete) variate is the set of possible values of the variate, together with the associated frequencies.
Definition (8) : The frequency distribution of a (continuous) variate is the set of class-intervals for the variate, together with the associated class-frequencies.
If we classify the whole population according to birth-weights, then instead of looking at the frequency of each value of the variate, we first group the values into intervals, i.e. sub-divisions of the total range of possible values of the variate.
In this example, the variate may be classified into intervals 1-500, 501-1000, ..., 4501-5000, 5001-5500 grams.
Definition (9) : A class-interval is a sub-division of the total range of values which a (continuous) variate may take.
Definition (10) : The class-frequency is the number of observations of the variate which fall in a given interval.
Definition (11) : Cumulative frequency distribution is the sum of all observations which are less than the upper boundary of a given class interval; i.e. this number is the sum of the frequencies up to and including the class to which the upper class boundary corresponds.
• For example, consider the heights of 50 students. We prepare Tables 2.1.1 and 2.1.2.

Table 2.1.1 : Cumulative frequency (less than) table

Class interval (cm) | Frequency | Cumulative frequency (less than)
145-146 | 2 | 2
147-148 | 5 | 7
149-150 | 8 | 15
151-152 | 15 | 30
153-154 | 9 | 39
155-156 | 6 | 45
157-158 | 4 | 49
159-160 | 1 | 50
Total | 50 |

Table 2.1.2 : Cumulative frequency (more than) table

Class interval (cm) | Frequency | Cumulative frequency (more than)
145-146 | 2 | 50
147-148 | 5 | 48
149-150 | 8 | 43
151-152 | 15 | 35
153-154 | 9 | 20
155-156 | 6 | 11
157-158 | 4 | 5
159-160 | 1 | 1
Total | 50 |
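The two cumulative columns are simply running totals of the frequency column, taken from the top (less than) and from the bottom (more than). A minimal Python sketch (illustrative only, not part of the text) that reproduces both columns :

```python
import numpy as np

freq = np.array([2, 5, 8, 15, 9, 6, 4, 1])   # class frequencies, 145-146 up to 159-160

less_than = np.cumsum(freq)                  # 2, 7, 15, 30, 39, 45, 49, 50 (Table 2.1.1)
more_than = np.cumsum(freq[::-1])[::-1]      # 50, 48, 43, 35, 20, 11, 5, 1 (Table 2.1.2)

print(less_than)
print(more_than)
```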
Points to note while constructing the tables :
(1) Make the table self-explanatory : provide a title, give a brief description of the source of the data, state in what units the figures are expressed, and label rows and columns where appropriate.
(2) Keep the table as simple as possible.
(3) Distinguish between zero values and missing observations.
(4) Make alterations clearly.
(5) Lay out the calculations in a logical pattern on the sheet.
2.2 MEASURES OF CENTRAL TENDENCY
• One of the most important aspects of describing a distribution is the central value around which the observations are distributed.
• Any arithmetical measure which gives the centre or central value of a set of observations is known as a measure of central tendency or measure of location.
2.3 REVIEW OF BASIC RESULTS IN THE THEORY OF STATISTICS

2.3.1 Range and Mid-range
One way to measure the variability in a sample is simply to look at the highest and the lowest of the observations in the set, and calculate the difference between them.
Definition (1) : The range of a set of observations is the difference in values between the largest and smallest observations in the set.
Definition (2) : The mid-range is the average of the largest and smallest values in the data set.
e.g., for X = {1, 3, 5, 7, 9, 11, 13},
range = 13 - 1 = 12 and mid-range = (1 + 13)/2 = 7
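As a quick illustrative check (plain Python, not from the text), the range and mid-range of the set X above can be computed as :

```python
X = [1, 3, 5, 7, 9, 11, 13]

data_range = max(X) - min(X)         # 13 - 1 = 12
mid_range = (max(X) + min(X)) / 2    # (1 + 13) / 2 = 7.0

print(data_range, mid_range)
```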
2.3.2 Variance and Standard Deviation

Pursuing the idea of measuring how closely a set of observations cluster round their mean, we square each deviation $(x_i - \bar{x})$ instead of taking its absolute value.
The next measure of variability is the variance : it is the mean of the squared deviations.
Definition (1) : Variance
(I) The variance of a set of observations $x_1, x_2, \ldots, x_n$ is the average of the squared deviations from their mean and equals
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
On simplification it is equal to
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2 \qquad \ldots(2.3)$$
(II) For grouped data,
$$\sigma^2 = \frac{1}{N}\sum_i f_i (x_i - \bar{x})^2, \qquad N = \sum_i f_i$$
Definition : Standard deviation
(I) The standard deviation is the positive square root of the variance,
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
and is denoted by $\sigma$.
(II) For grouped data,
$$\sigma = \sqrt{\frac{1}{N}\sum_i f_i (x_i - \bar{x})^2}$$
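The following short Python sketch (illustrative only; the data values are hypothetical) computes the population variance and standard deviation by the formulas above, for raw data and for grouped data :

```python
import numpy as np

# raw (ungrouped) observations -- hypothetical values
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])
variance = np.mean((x - x.mean()) ** 2)      # (1/n) * sum (x_i - xbar)^2
sigma = np.sqrt(variance)                    # here sigma = 2.0

# grouped data : variates x_i with frequencies f_i -- hypothetical values
xi = np.array([1, 2, 3, 4])
fi = np.array([3, 5, 8, 4])
mean_g = (fi * xi).sum() / fi.sum()
var_g = (fi * (xi - mean_g) ** 2).sum() / fi.sum()

print(sigma, np.sqrt(var_g))
```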
2.3.3 Arithmetic Mean

If $f_1, f_2, \ldots, f_n$ are the frequencies of the variates $x_1, x_2, \ldots, x_n$, then
$$M = \text{Arithmetic Mean (A.M.)} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
Short-cut method of finding the mean :
Let $x' = \dfrac{x - x_0}{h}$, where $x_0$ is the assumed mean and $h$ is the length of the class interval. Then
$$M = x_0 + hA, \qquad \text{where } A = \frac{\sum f x'}{\sum f}$$
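As an illustrative sketch of the short-cut method (the code is not part of the text; it borrows the class data of Ex. 2.3.2 further below) :

```python
import numpy as np

x = np.array([61, 64, 67, 70, 73])   # class midpoints
f = np.array([5, 18, 42, 27, 8])     # class frequencies
x0, h = 67, 3                        # assumed mean and class width

u = (x - x0) // h                    # coded variates: -2, -1, 0, 1, 2
A = (f * u).sum() / f.sum()          # A = sum(f u) / N = 15/100 = 0.15
M = x0 + h * A                       # M = 67 + 3 * 0.15 = 67.45

assert np.isclose(M, (f * x).sum() / f.sum())   # agrees with the direct mean
print(M)
```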
2.3.4 Moments about Mean

Let $(x_i, f_i)$ be the given frequency distribution. Then the $r$-th moment about the mean $M$ is given by
$$\mu_r = \frac{\sum f_i (x_i - M)^r}{\sum f_i}$$
For $r = 1$,
$$\mu_1 = \frac{\sum f_i (x_i - M)}{\sum f_i} = \frac{\sum f_i x_i}{\sum f_i} - M = M - M = 0$$
and for $r = 2$,
$$\mu_2 = \frac{\sum f_i (x_i - M)^2}{\sum f_i} = \sigma^2,$$
the square of the standard deviation.
Definition : The $r$-th moment about the origin is given by
$$\mu_r' = \frac{\sum f_i x_i^r}{\sum f_i}$$
In particular,
$$\mu_1' = \frac{\sum f_i x_i}{\sum f_i} = M, \quad \mu_2' = \frac{\sum f_i x_i^2}{\sum f_i}, \quad \mu_3' = \frac{\sum f_i x_i^3}{\sum f_i}, \quad \mu_4' = \frac{\sum f_i x_i^4}{\sum f_i}$$
2.3.5 Relation between Moments about Mean (μr) and Moments about Origin (μr')

(i) $\mu_2 = \mu_2' - (\mu_1')^2$
(ii) $\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3$
(iii) $\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$
Also note that if $x' = \dfrac{x - x_0}{h}$, then the moments about the assumed mean $x_0$ are $\mu_r' = h^r\,\dfrac{\sum f x'^r}{\sum f}$.
2.3.6 Karl Pearson's Coefficients of Kurtosis

(i) $\beta_1$ = measure of skewness $= \dfrac{\mu_3^2}{\mu_2^3}$
(ii) $\beta_2$ = measure of flatness of a single-humped distribution $= \dfrac{\mu_4}{\mu_2^2}$
Note : For the normal distribution, $\beta_2 = 3$. If $\beta_2 > 3$, the distribution is peaked more sharply than the normal curve and is known as lepto-kurtic. If $\beta_2 < 3$, the distribution is flat compared to the normal curve and is known as platy-kurtic.
Fig. 2.3.1 : Lepto-kurtic, normal and platy-kurtic curves
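A small Python sketch (hypothetical data, not from the text) that computes the central moments and Karl Pearson's β coefficients for a frequency distribution, and classifies the curve accordingly :

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])        # variates (hypothetical)
f = np.array([4, 11, 20, 11, 4])     # frequencies (symmetric, single-humped)

mean = (f * x).sum() / f.sum()
mu = lambda r: (f * (x - mean) ** r).sum() / f.sum()   # r-th central moment

beta1 = mu(3) ** 2 / mu(2) ** 3      # measure of skewness (0 here, by symmetry)
beta2 = mu(4) / mu(2) ** 2           # measure of kurtosis

print(beta1, beta2, "lepto-kurtic" if beta2 > 3 else "platy-kurtic")
```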
2.3.7 The Expected Value of X (Mean Value of X)

If X is a random variable, then the expected value of X is denoted by E(X) and means the value, on average, that X takes.
Definition : If X takes values $x_i$, $i = 1$ to $n$, with probabilities $f(x_i)$, $i = 1$ to $n$ (a discrete random variable), then
$$E(X) = \sum_{i=1}^{n} x_i\, f(x_i)$$
Note : The expected value of X is also called the mean value of X and is also denoted by M, i.e. M = E(X).
Properties
(i) The $r$-th moment about the origin can also be written as
$$\mu_r' = E(x^r) = \sum_{i=1}^{n} x_i^r\, f_i$$
Clearly, $E(x) = \mu_1' = \sum x_i f_i$, $E(x^2) = \mu_2' = \sum x_i^2 f_i$, $E(x^3) = \mu_3' = \sum x_i^3 f_i$, and so on.
(ii) Moments about the mean $\bar{x}$ are defined as
$$\mu_r = E\big[(x - \bar{x})^r\big] = \sum_{i=1}^{n} (x_i - \bar{x})^r f_i,$$
called the $r$-th moment about the mean $\bar{x}$.
Clearly, $\mu_1 = 0$,
$$\mu_2 = E\big[(x - \bar{x})^2\big] = \mu_2' - (\mu_1')^2$$
$$\mu_3 = E\big[(x - \bar{x})^3\big] = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3$$
and
$$\mu_4 = E\big[(x - \bar{x})^4\big] = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$$
2.3.8 Covariance

• In probability theory and statistics, covariance is a measure of the joint variability of two random variables.
• If the greater values of one variable correspond with the greater values of the other variable, and the same holds for the lesser values (that is, the variables tend to show similar behaviour), the covariance is positive.
• In the opposite case, when the greater values of one variable correspond to the lesser values of the other (that is, the variables tend to show opposite behaviour), the covariance is negative.
• The sign of the covariance shows the tendency in the linear relationship between the variables.

Fig. 2.3.2 : Scatter diagrams for cov(x, y) < 0, cov(x, y) = 0 and cov(x, y) > 0
"= Formulae of covariance
If X and Y are two random variables, then covariance between them is defined as :
cov (X, Y)= E ([X-E QO} [Y-E(¥)])
E {XY -XE(Y)-Y E(X) +E(X) E(¥)}
= E(XY)-E(X)E(Y)-E(Y)E(X)+E(X)-E(Y)
cov (X, Y)= E (XY) -E(X)-E(¥) .-i)
TEX and Y are independent, then
E(XY) = B(X)-E(Y)
and hence in this case,
cov (X,Y) = E(X)E(Y)-E(X)E(Y)
=0
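A minimal Python check (hypothetical data, not from the text) of the formula cov(X, Y) = E(XY) - E(X)E(Y) derived above, compared against NumPy's population covariance :

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical paired observations
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = (x * y).mean() - x.mean() * y.mean()   # E(XY) - E(X)E(Y)
print(cov_xy)                                   # 1.2 > 0 : x and y move together

# numpy.cov with bias=True uses the same population (divide-by-n) convention
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
```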
"Remarks
(i) cov(aX, bY) = E {[aX -E (ax)] [bY -E (bY)])
= E {aX ab (X)] [bY ~bE (¥)]) =E (a(X-E (X)] b[Y-E(Y)])
= abE ([X~E (X)] [Y -E (¥)]) =ab cov (X, Y)
(ii) cov (X +a, Y +b) =cov (X, Y)
(iii) cov (aX +b, (Y + d)) = ac cov (X, Y)
(iv) cov (X + Y, Z) = cov (X, Z) + cov (Y, D
(v) IfX and Y are independent, then cov (X, Y) = 0 but the converse is not true.
2.3.9 Examples

Ex. 2.3.1 : For the following distribution, find :
(i) the arithmetic mean, (ii) the standard deviation,
(iii) the first 4 moments about the mean, and (iv) β₁ and β₂.

x : 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5
f : 5 | 38 | 65 | 92 | 70 | 40 | 10

Soln. :
With assumed mean x₀ = 3.5 and h = 0.5, we code x' = (x - x₀)/h and prepare the table :

x | f | x' | f x' | f x'² | f x'³ | f x'⁴
2 | 5 | -3 | -15 | 45 | -135 | 405
2.5 | 38 | -2 | -76 | 152 | -304 | 608
3 | 65 | -1 | -65 | 65 | -65 | 65
3.5 | 92 | 0 | 0 | 0 | 0 | 0
4 | 70 | 1 | 70 | 70 | 70 | 70
4.5 | 40 | 2 | 80 | 160 | 320 | 640
5 | 10 | 3 | 30 | 90 | 270 | 810
Total | 320 | | 24 | 582 | 156 | 2598
(i) Arithmetic mean : Using the short-cut method,
$$A = \frac{\sum f x'}{N} = \frac{24}{320} = 0.075$$
and arithmetic mean $= x_0 + hA = 3.5 + (0.5)(0.075) = 3.5375 \approx 3.538$
(ii) Standard deviation :
$$\sigma^2 = h^2\left[\frac{\sum f x'^2}{N} - \left(\frac{\sum f x'}{N}\right)^2\right] = (0.5)^2\left[\frac{582}{320} - (0.075)^2\right] = 0.4533$$
$$\therefore\ \sigma = 0.673$$
(iii) Moments about the mean M :
With assumed mean $x_0 = 3.5$, the moments about $x_0$ are
$$\mu_1' = h\,\frac{\sum f x'}{N} = (0.5)\,\frac{24}{320} = 0.0375$$
$$\mu_2' = h^2\,\frac{\sum f x'^2}{N} = (0.5)^2\,\frac{582}{320} = 0.4547$$
$$\mu_3' = h^3\,\frac{\sum f x'^3}{N} = (0.5)^3\,\frac{156}{320} = 0.0609$$
$$\mu_4' = h^4\,\frac{\sum f x'^4}{N} = (0.5)^4\,\frac{2598}{320} = 0.5074$$
Using the relations of Section 2.3.5, the moments about the mean are
$$\mu_1 = 0$$
$$\mu_2 = \mu_2' - (\mu_1')^2 = 0.4547 - (0.0375)^2 = 0.4533$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 = 0.0609 - 3(0.4547)(0.0375) + 2(0.0375)^3 = 0.0099$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 = 0.5074 - 4(0.0609)(0.0375) + 6(0.4547)(0.0375)^2 - 3(0.0375)^4 = 0.5021$$
(iv) By the definitions of β₁ and β₂,
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(0.0099)^2}{(0.4533)^3} \approx 0.0011, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{0.5021}{(0.4533)^2} \approx 2.44$$
Since β₂ < 3, the distribution is platy-kurtic, i.e. flatter than the normal distribution.
UEx. 2.3.2 : From the following frequency distribution, compute the standard deviation of the masses of 100 students.

Table P. 2.3.2
Mass in kg | No. of students
60-62 | 5
63-65 | 18
66-68 | 42
69-71 | 27
72-74 | 8

Soln. :
Let x₀ = 67 be the assumed mean and h = 3 the class width, and let u = (x - 67)/3. We construct the table :

Table P. 2.3.2(a)
Class of masses | Midpoint of class x | Number of students f | u = (x - 67)/3 | f u | f u²
60-62 | 61 | 5 | -2 | -10 | 20
63-65 | 64 | 18 | -1 | -18 | 18
66-68 | 67 | 42 | 0 | 0 | 0
69-71 | 70 | 27 | 1 | 27 | 27
72-74 | 73 | 8 | 2 | 16 | 32
Total | | 100 | | 15 | 97

We have $\sum f u = 15$, $\sum f u^2 = 97$, $h = 3$ and $N = \sum f = 100$.
By definition,
$$\sigma = h\sqrt{\frac{1}{N}\sum f u^2 - \left(\frac{1}{N}\sum f u\right)^2} = 3\sqrt{\frac{97}{100} - \left(\frac{15}{100}\right)^2} = 2.92$$
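A one-line numeric check of this result (illustrative, not part of the text) :

```python
import math

N, h = 100, 3
sum_fu, sum_fu2 = 15, 97   # column totals from Table P. 2.3.2(a)

sigma = h * math.sqrt(sum_fu2 / N - (sum_fu / N) ** 2)
print(round(sigma, 2))     # 2.92
```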
2.4 SAMPLING DISTRIBUTIONS

• A group of pupils in a school plan to investigate how long it takes to travel between home and school. There are 2000 pupils in their school, and they realise that they do not have the time to collect and analyse such a large amount of data.
• They argue that information from some of the pupils should give them what they want, provided these pupils are chosen properly. So they decide to collect data from only a part of the complete school population.
• We call this part a sample of the school population.
Definition : A sample is any subset of a population.
An investigation of this type is said to be a survey of a population.
Definition : In the above example, the information is collected by sampling; such an investigation is called a sample survey.
Definition : If a survey plans to collect information from every member of a population, it is called a census of that population.
• The sample chosen should be a reflection of the whole population; it should reproduce the characteristics of the population. In our problem, the mean journey time is the characteristic for the population of school children.
2.4.1 Random Sampling

There are many sampling schemes that may be called random. We shall only define a simple random sample, which is very straightforward. Other, more complex random sampling schemes are of particular use in certain special types of problem.
Definition of a simple random sample :
Definition : A (simple) random sample is a sample which is chosen so that every member of the population is equally likely to be a member of the sample, independently of which other members of the population are chosen.
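A minimal sketch (hypothetical population of pupil IDs) of drawing such a sample in Python; random.sample draws without replacement, the usual practical reading of the definition above :

```python
import random

population = list(range(2000))           # e.g. IDs of the 2000 pupils
sample = random.sample(population, 50)   # simple random sample of size 50

print(len(sample), sample[:5])
```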
Some useful terms
For practical reasons the investigator often has to settle for obtaining information about a population which has properties similar to those of the target population. It is convenient to distinguish these two populations by giving them separate names.
(i) Definition (target population) : The target population is the population about which we want information.
(ii) Definition (study population) : The study population is the population about which we can obtain information.
(iii) Definition (sample unit) : A sample unit is a potential member of the sample.
2.5 TESTING OF HYPOTHESIS

UQ. Explain hypothesis testing with example. (Q. 3(b), Aug. 18, 4 Marks)
UQ. Explain hypothesis testing in detail with example. (Q. 3(b), Oct. 19, 5 Marks)
• Inference that decides about the characteristics of the population on the basis of a sample study is called inductive inference.
• Such decisions involve an element of risk, the risk of taking wrong decisions. For example, a pharmaceutical concern may have to decide whether a newly developed drug is really effective for the particular ailment it targets.
• The modern theory of probability plays a very vital role in such decision making, and the branch of statistics which helps us in arriving at the criterion for such decisions is known as testing of hypothesis.
• The theory of testing of hypothesis employs statistical techniques to arrive at decisions in certain situations where there is an element of uncertainty, on the basis of a sample whose size is fixed in advance.
2.5.1 Statistical Hypothesis

• A statistical hypothesis is some assumption or statement, which may or may not be true, about a population, or about the probability distribution which characterises the given population.
• We are supposed to test it on the basis of the evidence from a random sample.
• If the hypothesis completely specifies the population, it is known as a simple hypothesis; otherwise it is known as a composite hypothesis.
2.5.2 Test of Hypothesis

• A test of a statistical hypothesis is a two-action decision problem after observing a random sample from the given population, the two actions being the acceptance or rejection of the hypothesis under consideration.
• The truth or falsity of a statistical hypothesis is based on the information contained in the sample. The sample may be consistent or inconsistent with the hypothesis, and accordingly the hypothesis may be accepted or rejected.
• The acceptance of a statistical hypothesis is due to insufficient evidence provided by the sample to reject it, and does not necessarily imply that it is true.

2.5.3 Tests of Significance

From the knowledge of the sampling distribution of a statistic, it is possible to find the probability that a sample statistic would differ from a given hypothetical value of the parameter, or from another sample value, by more than a certain amount, and then to answer the question of the significance of the difference between two independent statistics. This is known as a test of significance.
Thus we can say that :
(i) the difference between a statistic and the corresponding population parameter, or
(ii) the difference between two independent statistics
is not significant if it is due to fluctuations of sampling; otherwise it is said to be significant.
2.5.4 Null Hypothesis

UQ. Explain Null Hypothesis.

• For applying any test of significance, we set up a hypothesis : 'a definite statement about the population parameter(s)'.
• In the words of Prof. R. A. Fisher : "Null hypothesis is the hypothesis which is tested for possible rejection under the assumption that it is true."
Setting up a null hypothesis :
As the name suggests, it is always taken as a hypothesis of no difference.
To set up the null hypothesis :
(i) Express the claim or hypothesis to be tested in symbolic form.
(ii) Identify the null hypothesis and the alternate hypothesis :
• Take the expression involving the equality sign as the null hypothesis (H₀) and the other as the alternative hypothesis (H₁).
• Thus, depending on the wording of the original claim, the original claim can be regarded as H₀ (if it contains the equality sign) and sometimes as H₁ (if it does not contain the equality sign).

2.5.5 Alternate Hypothesis

• Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis. It is usually denoted by H₁.
• The alternative hypothesis H₁ is stated in respect of any null hypothesis H₀, because the acceptance or rejection of H₀ is meaningful only if it is being tested against a rival hypothesis.
2.5.6 Types of Errors

UQ. Explain Type I and Type II errors. (Q. 4(a), Aug. 18; Q. 3(b), May 19, 4 Marks)

• The decision to accept or reject the null hypothesis H₀ is made on the basis of the information supplied by the observed sample.
• The four possible situations that arise in any test procedure are as follows :

True state \ Decision from sample | Reject H₀ | Accept H₀
H₀ true | Wrong (Type I error) | Correct
H₀ false (H₁ true) | Correct | Wrong (Type II error)

From the above table, it is clear that we may commit two types of errors.
2.5.7 Type I Error and Type II Error

(i) Type I error : A Type I error occurs when we reject the null hypothesis even though the null hypothesis is true. The probability of this error is denoted by α.
(ii) Type II error : A Type II error occurs when we do not reject the null hypothesis even though the hypothesis is false. The probability of this error is denoted by β.

Decision \ True state | Null hypothesis is true | Null hypothesis is false
Reject null hypothesis | Type I error (false positive) | Correct decision
Fail to reject null hypothesis | Correct decision | Type II error (false negative)
2.5.7(A) Comparison between Type I and Type II Errors

UQ. Compare Type-I and Type-II errors. (Q. 4(a), Oct. 19, 5 Marks)

• We make a Type I error by rejecting a true null hypothesis.
• We make a Type II error by accepting a wrong null hypothesis.
If we write :
P[rejecting H₀ when it is true] = P[Type I error] = α
and P[accepting H₀ when it is wrong] = P[Type II error] = β,
then α and β are also called the sizes of Type I and Type II errors respectively.
• In the terminology of industrial quality control, a Type I error amounts to rejecting a good lot and a Type II error amounts to accepting a bad lot. Hence
α = P[rejecting a good lot] and β = P[accepting a bad lot]
• The sizes of Type I and Type II errors are therefore also known as producer's risk and consumer's risk respectively.
• Practically, it is not possible to minimise both the errors simultaneously; an attempt to decrease α results in an increase in β, and vice-versa.
• It is more risky to accept a wrong hypothesis than to reject a correct one; i.e. the consequences of a Type II error are likely to be more serious than the consequences of a Type I error.
• So, for a given sample, a compromise is made by minimising the more serious error after fixing the less serious error. Thus we fix α, the size of the Type I error, and then try to obtain a criterion which minimises β, the size of the Type II error.
We have
β = P[Type II error] = P[accepting H₀ when H₀ is false, i.e. when H₁ is true]
Since
P[accept H₀ when H₀ is wrong] + P[reject H₀ when H₀ is wrong] = 1,
∴ P[reject H₀ when H₀ is wrong] = 1 - P[accept H₀ when H₀ is wrong] = 1 - β
2.5.8 Power of Test

1 - β = P[rejecting H₀ when H₀ is false]
is called the power of the test.
• Naturally, when H₀ is false, it ought to be rejected. Hence, minimizing β amounts to maximizing (1 - β), which is called the power of the test.
• Hence, the usual practice in testing of hypothesis is to fix α, the size of the Type I error, and then try to obtain a criterion which minimizes β, the size of the Type II error, or maximises (1 - β), the power of the test.
2.5.9 Level of Significance

• The maximum size of the Type I error which we are prepared to risk is known as the level of significance. It is denoted by α :
P[rejecting H₀ when H₀ is true] = α
• Commonly used levels of significance in practice are 5% (0.05) and 1% (0.01).
• If we adopt the 5% level of significance, it implies that we are 95% confident that our decision to reject H₀ is correct.
• The level of significance is always fixed in advance, before collecting the sample information.
2.5.10 Critical Region

• Suppose we take several samples of the same size from a given population and compute some statistic t (say x̄, p, etc.) for each of these samples.
• Let t₁, t₂, ..., tₙ be the values of the statistic for these samples. Each of these values may be used to test some null hypothesis H₀.
• These sample statistics t₁, t₂, ..., tₙ (comprising the sample space) may be divided into two mutually disjoint groups, one leading to the rejection of H₀ and the other leading to the acceptance of H₀.
• The statistics which lead to the rejection of H₀ give us a region called the critical region (C) or rejection region (R), while those which lead to the acceptance of H₀ give us a region called the acceptance region (A).
• Thus, if the statistic t ∈ C, H₀ is rejected, and if t ∈ A, H₀ is accepted.
• The sizes of the Type I and Type II errors in terms of the critical region are defined as
α = P[rejecting H₀ when H₀ is true] = P[rejecting H₀ | H₀] = P[t ∈ C | H₀]
β = P[accepting H₀ when H₀ is wrong] = P[accepting H₀ when H₁ is true] = P[accepting H₀ | H₁] = P[t ∈ A | H₁]
• where C is the critical (rejection) region, A is the acceptance region, and C ∩ A = ∅, C ∪ A = S (the sample space).
2.5.11 Examples

Ex. 2.5.1 : In order to test whether a coin is perfect, it is tossed 5 times. The null hypothesis of perfection is rejected if and only if more than 4 heads are obtained. Obtain :
(i) the critical region,
(ii) the probability of Type I error, and
(iii) the probability of Type II error, when the corresponding probability of getting a head is 0.2.

Soln. :
Let X be the number of heads obtained in 5 tosses of the coin.
H₀ : the coin is perfect, i.e. unbiased; thus H₀ : p = 1/2.
We use the binomial distribution : under H₀, X ~ B(n = 5, p = 1/2), so
$$P(X = x \mid H_0) = {}^5C_x\,p^x q^{5-x} = {}^5C_x \left(\tfrac{1}{2}\right)^5, \qquad x = 0, 1, 2, 3, 4, 5$$
(i) Critical region (region of rejection) : reject H₀ if more than 4 heads are obtained, so the
critical region = {X > 4} = {X = 5}
(ii) The probability of Type I error (α) is
$$\alpha = P[\text{reject } H_0 \mid H_0] = P[X = 5 \mid H_0] = \left(\tfrac{1}{2}\right)^5 = \tfrac{1}{32} = 0.03125$$
(iii) The probability of Type II error (β) is
$$\beta = P[\text{accept } H_0 \mid H_1] = 1 - P[X = 5 \mid p = 0.2] = 1 - (0.2)^5 = 0.99968$$
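These two probabilities can be verified with scipy (an illustrative check, not part of the text) :

```python
from scipy.stats import binom

# alpha = P(X = 5 | p = 0.5) : reject a fair coin only when all 5 tosses are heads
alpha = binom.pmf(5, n=5, p=0.5)

# beta = P(X <= 4 | p = 0.2) : accept the coin although p is really 0.2
beta = binom.cdf(4, n=5, p=0.2)

print(alpha, beta)   # 0.03125  0.99968
```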
Ex. 2.5.2 : In order to test whether a coin is perfect, it is tossed 5 times. The null hypothesis of perfectness of the coin is accepted if and only if at most 3 heads are obtained. Then the power of the test corresponding to the alternative hypothesis that the probability of a head is 0.4 is
(i) 272/3125 (ii) 2853/3125 (iii) 56/3125 (iv) none of these.

Soln. :
Let X be the number of heads in n = 5 tosses of the coin and let p be the probability of a head in a random toss of the coin.
Null hypothesis : H₀ : p = 1/2
Alternate hypothesis : H₁ : p = 0.4
Critical region : X > 3
The power of the test for testing H₀ against H₁ is given by
$$1 - \beta = P[\text{reject } H_0 \text{ when } H_1 \text{ is true}] = P[\text{reject } H_0 \mid H_1] = P(X > 3 \mid p = 0.4)$$
$$= \sum_{x=4}^{5} {}^5C_x\,(0.4)^x (0.6)^{5-x} \qquad [\text{since } X \sim B(n = 5,\ p = 0.4) \text{ under } H_1]$$
$$= {}^5C_4 (0.4)^4 (0.6) + {}^5C_5 (0.4)^5 = 5\left(\tfrac{2}{5}\right)^4\left(\tfrac{3}{5}\right) + \left(\tfrac{2}{5}\right)^5$$
$$= \left(\tfrac{2}{5}\right)^4\left[3 + \tfrac{2}{5}\right] = \frac{16 \times 17}{3125} = \frac{272}{3125}$$
∴ (i) is the correct answer.
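A quick scipy check of this power calculation (illustrative only) :

```python
from scipy.stats import binom

power = 1 - binom.cdf(3, n=5, p=0.4)   # P(X > 3 | p = 0.4)
print(power, 272 / 3125)               # both give 0.08704
```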
2.6 CHI-SQUARE TEST OF GOODNESS OF FIT

Before going into the details of the chi-square test, we study some terms used in this connection.
2.6.1 Contingency Table

Let A and B be the attributes of the given data. Let the data be classified into s classes A₁, A₂, ..., Aₛ according to attribute A and into t classes B₁, B₂, ..., Bₜ according to attribute B. Let Oᵢⱼ be the observed frequency of the cell belonging to the classes Aᵢ (i = 1, 2, ..., s) and Bⱼ (j = 1, 2, ..., t). The data can be set into an s × t contingency table of s rows and t columns as follows :

Classes | B₁ | B₂ | ... | Bⱼ | ... | Bₜ | Total
A₁ | O₁₁ | O₁₂ | ... | O₁ⱼ | ... | O₁ₜ | (A₁)
A₂ | O₂₁ | O₂₂ | ... | O₂ⱼ | ... | O₂ₜ | (A₂)
... | | | | | | |
Aₛ | Oₛ₁ | Oₛ₂ | ... | Oₛⱼ | ... | Oₛₜ | (Aₛ)
Total | (B₁) | (B₂) | ... | (Bⱼ) | ... | (Bₜ) | N

where (Aᵢ) denotes the total of the i-th row, (Bⱼ) the total of the j-th column, and N the grand total.
2.6.2 Degrees of Freedom

The term degrees of freedom refers to the number of "independent constraints" in a set of data. We explain this concept with a few examples :
(1) If the data are given in a contingency table, then the degrees of freedom are calculated by the formula
ν = (c - 1)(r - 1)
where ν stands for the degrees of freedom, c for the number of columns and r for the number of rows. Thus in a 2 × 2 table, the degrees of freedom are (2 - 1)(2 - 1) = 1, and so on.
(2) If the data are not given in the form of a contingency table but in the shape of a series of individual observations, then the degrees of freedom are calculated in a different way.
Consider, for example, a distribution of the number of heads (0, 1, 2, ..., 10), with the observed frequency recorded against each number of heads and a total frequency of 1024.
Here, if we write down the expected frequencies, we have freedom to write any ten figures we choose, but the eleventh figure must be equal to 1024 minus the total of the ten figures we have written, because the total of the expected frequencies must equal the total of the actual frequencies. Thus, there are ten degrees of freedom in the above question. In such cases the degrees of freedom are equal to (n - 1), where n is the number of frequencies.
2.7 CHI-SQUARE TEST
The square of a standard normal variable is called a chi-square (pronounced "ky", as in sky) variate with 1 degree of freedom (d.f.).
Thus if X is a random variable following the normal distribution with mean μ and standard deviation σ, then $\dfrac{X - \mu}{\sigma}$ is a standard normal variate, and
$$\left(\frac{X - \mu}{\sigma}\right)^2 \text{ is a chi-square } (\chi^2) \text{ variate with 1 d.f.}$$
If X₁, X₂, ..., Xᵣ are r independent random variables following normal distributions with means μ₁, μ₂, ..., μᵣ and standard deviations σ₁, σ₂, ..., σᵣ respectively, then the variate
$$\chi^2 = \sum_{i=1}^{r}\left(\frac{X_i - \mu_i}{\sigma_i}\right)^2,$$
which is the sum of the squares of r independent standard normal variates, follows the chi-square distribution with r d.f.
2.7.1 Probability Density Function (p.d.f.) of Chi-square Distribution

If χ² is a random variable following the chi-square distribution with ν d.f., then its probability density function is given by
$$P(\chi^2) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\; e^{-\chi^2/2}\,(\chi^2)^{(\nu/2)-1}, \qquad 0 \le \chi^2 < \infty$$
where Γ(ν/2) is the Gamma function.
2.7.2 Remark

(1) The probability function P(χ²) depends on the degrees of freedom; as ν changes, P(χ²) changes.
(2) Constants of the χ² distribution with ν d.f. :
Mean = ν ; Mode = ν - 2 ; Variance = 2ν
$$\mu_2 = 2\nu, \qquad \mu_3 = 8\nu, \qquad \mu_4 = 48\nu + 12\nu^2$$
(3) Pearson's coefficient of skewness :
$$S_k = \frac{\text{Mean} - \text{Mode}}{\sigma} = \frac{\nu - (\nu - 2)}{\sqrt{2\nu}} = \sqrt{\frac{2}{\nu}}$$
(i) Since the coefficient of skewness > 0 for ν ≥ 1, the χ²-distribution is positively skewed.
(ii) Since the skewness is inversely proportional to the square root of the d.f., the distribution tends to symmetry as the d.f. increase.
Thus for large d.f., the χ²-distribution tends to the normal distribution.
(4) For large ν, the standardised variate
$$Z = \frac{\chi^2 - E(\chi^2)}{\sqrt{V(\chi^2)}} = \frac{\chi^2 - \nu}{\sqrt{2\nu}}$$
is approximately a standard normal variate.
(5) Additive property :
If $\chi_1^2, \chi_2^2, \ldots, \chi_k^2$ are independent χ² variates with $n_1, n_2, \ldots, n_k$ d.f. respectively, then the sum
$$\sum_{i=1}^{k} \chi_i^2 = \chi_1^2 + \chi_2^2 + \ldots + \chi_k^2$$
is a χ² variate with $(n_1 + n_2 + \ldots + n_k)$ d.f.
2.7.3 Applications of χ²-Distribution

Some of the applications of the χ²-distribution are :
(i) the chi-square test of goodness of fit,
(ii) the χ² test for independence of attributes.
2.7.4 Chi-Square Test of Goodness of Fit

The χ² test of goodness of fit is used to test whether the deviation between observation (experiment) and theory may be attributed to chance, or whether it is really due to the inadequacy of the theory to fit the observed data.
Karl Pearson proved that the statistic
$$\chi^2 = \sum_{i=1}^{n}\frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \ldots + \frac{(O_n - E_n)^2}{E_n}$$
follows the χ²-distribution with ν = n - 1 d.f., where O₁, O₂, ..., Oₙ are the observed frequencies and E₁, E₂, ..., Eₙ are the corresponding expected frequencies under some theory or hypothesis.
2.7.5 Steps to Compute χ² and Drawing Conclusions
(i) Compute the expected frequencies E₁, E₂, ..., Eₙ corresponding to the observed frequencies O₁, O₂, ..., Oₙ under the given theory.
(ii) Compute the deviations (Oᵢ - Eᵢ) and square them to obtain (Oᵢ - Eᵢ)².
(iii) Divide each (Oᵢ - Eᵢ)² by Eᵢ and add the values to compute
$$\chi^2 = \sum_{i=1}^{n}\frac{(O_i - E_i)^2}{E_i}$$
(iv) Under the hypothesis H₀ that the theory fits the data well, the above statistic follows the χ²-distribution with ν = n - 1 d.f.
(v) Look up the tabulated (critical) value of χ² at the 5% or 1% level of significance and draw the conclusion.

2.7.6 Conditions for the Validity of the Chi-Square Test
The χ² test can be used only if the following conditions are satisfied :
(i) The total frequency N should be large, say greater than 50.
(ii) The sample observations should be independent; i.e., no individual item should be included twice or more in the sample.
(iii) The constraints should not involve squares or higher powers of the frequencies.
(iv) No theoretical frequency should be small; each should preferably be larger than 10, and in no case less than 5.
(v) The data should be in original units.
2.7.7 Examples

Ex. 2.7.1 : The number of scooter accidents per month in a certain town were as follows : 12, 8, 20, 2, 14, 10, 15, 6, 9, 4.
Are these frequencies in agreement with the belief that accident conditions were the same during this 10-month period?
Soln. :
Null hypothesis H₀ : The given frequencies (i.e. the number of accidents per month) are consistent with the belief that the accident conditions were the same during the 10-month period.
Total number of accidents = 12 + 8 + 20 + 2 + 14 + 10 + 15 + 6 + 9 + 4 = 100.
Under the null hypothesis, the expected number of accidents for each of the 10 months = 100/10 = 10 (the accidents are assumed to be uniformly distributed over the months).
Now, d.f. = 10 - 1 = 9.
∴ Tabulated χ²₀.₀₅ for 9 d.f. = 16.919 ...(i)
We prepare the table for the computation of χ² :

Month | Observed no. of accidents (O) | Expected no. of accidents (E) | O - E | (O - E)² | (O - E)²/E
1 | 12 | 10 | 2 | 4 | 0.4
2 | 8 | 10 | -2 | 4 | 0.4
3 | 20 | 10 | 10 | 100 | 10.0
4 | 2 | 10 | -8 | 64 | 6.4
5 | 14 | 10 | 4 | 16 | 1.6
6 | 10 | 10 | 0 | 0 | 0
7 | 15 | 10 | 5 | 25 | 2.5
8 | 6 | 10 | -4 | 16 | 1.6
9 | 9 | 10 | -1 | 1 | 0.1
10 | 4 | 10 | -6 | 36 | 3.6
Total | 100 | 100 | 0 | | 26.6

$$\chi^2 = \sum\left[\frac{(O - E)^2}{E}\right] = 26.6 \qquad \ldots(\mathrm{ii})$$
Since the calculated value of χ² = 26.6 from (ii) is greater than the tabulated value 16.919 from (i), it is significant, and hence the null hypothesis is rejected at the 5% level of significance.
Hence we conclude that the accident conditions were certainly not uniform over the 10-month period.
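The computation of Ex. 2.7.1 can be verified with scipy.stats (an illustrative check, not part of the text); chisquare defaults to equal expected frequencies, which matches the hypothesis here :

```python
from scipy.stats import chi2, chisquare

observed = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]
stat, p_value = chisquare(observed)    # expected defaults to 100/10 = 10 per month

critical = chi2.ppf(0.95, df=9)        # tabulated chi^2 at 5% for 9 d.f. = 16.919
print(stat, critical, p_value)         # 26.6 > 16.919, so H0 is rejected
```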
Ex. 2.7.2 : The theory predicts that the proportion of beans in the four groups A, B, C and D should be 9 : 3 : 3 : 1. In an experiment among 1,600 beans, the numbers in the four groups were 882, 313, 287 and 118. Does the experimental result support the theory? (The table value of χ² for 3 d.f. at the 5% level of significance is 7.81.)

Soln. :
Null hypothesis H₀ : There is no significant difference between the experimental values and the theory; i.e. the theory supports the experiment.
• The proportions of beans in the four groups A, B, C and D should be 9 : 3 : 3 : 1. Hence the theoretical (expected) frequencies are as shown below.
Category | Expected frequency (E)
A | (9/16) × 1600 = 900
B | (3/16) × 1600 = 300
C | (3/16) × 1600 = 300
D | (1/16) × 1600 = 100
Computation of χ² :

Category | Observed frequency (O) | Expected frequency (E) | O - E | (O - E)² | (O - E)²/E
A | 882 | 900 | -18 | 324 | 0.360
B | 313 | 300 | 13 | 169 | 0.563
C | 287 | 300 | -13 | 169 | 0.563
D | 118 | 100 | 18 | 324 | 3.240
Total | 1600 | 1600 | 0 | 986 | 4.726

Now, d.f. = 4 - 1 = 3, and the tabulated χ²₀.₀₅ for 3 d.f. = 7.81.
Conclusion : Since the calculated value of χ² = 4.726 is less than the tabulated value, it is not significant. Hence we accept the null hypothesis at the 5% level of significance. Thus, the experimental results support the theory.
Ex. 2.7.3 : A die is rolled 100 times with the following distribution :

Number : 1 | 2 | 3 | 4 | 5 | 6
Frequency : 17 | 14 | 20 | 17 | 17 | 15

At the 0.01 level of significance, determine whether the die is true (or uniform).

Soln. :
We have : number of categories = 6,
N = total frequency = 17 + 14 + 20 + 17 + 17 + 15 = 100.
• Null hypothesis H₀ : The die is true (uniform).
• Under H₀, the probability of obtaining each of the six faces 1, 2, ..., 6 is the same, i.e. P = 1/6.
∴ Expected frequency for each face = N·P = 100 × (1/6) = 16.67.

Computation of χ² :

Number | Observed frequency (O) | Expected frequency (E) | O - E | (O - E)² | (O - E)²/E
1 | 17 | 16.67 | 0.33 | 0.1089 | 0.0065
2 | 14 | 16.67 | -2.67 | 7.1289 | 0.4276
3 | 20 | 16.67 | 3.33 | 11.0889 | 0.6652
4 | 17 | 16.67 | 0.33 | 0.1089 | 0.0065
5 | 17 | 16.67 | 0.33 | 0.1089 | 0.0065
6 | 15 | 16.67 | -1.67 | 2.7889 | 0.1673
Total | 100 | | 0 | | 1.2796

$$\chi^2 = \sum\left[\frac{(O - E)^2}{E}\right] = 1.2796 \qquad \ldots(\mathrm{i})$$
The degrees of freedom = 6 - 1 = 5.
The critical (tabulated) value of chi-square for ν = 5 at the 1% level of significance is
χ²(0.01) = 15.086 ...(ii)
Since the calculated value of χ² from (i) is less than the critical value from (ii), it is not significant.
Hence H₀ may be accepted at the 1% level of significance; i.e. the die may be regarded as true or uniform.
Ex. 2.7.4 : Records taken of the number of male and female births in 800 families having four children are given in Table P. 2.7.4.

Table P. 2.7.4
No. of male births | No. of female births | Frequency
0 | 4 | 32
1 | 3 | 178
2 | 2 | 290
3 | 1 | 236
4 | 0 | 64

Test whether the data are consistent with the hypothesis that the binomial law holds and that the chance of a male birth is equal to that of a female birth.