Lecture 2 - Statistical Inference - EDA and DS Process - 02032023 111156am 1 - 1 27022024 012412pm

Uploaded by

Abdul Mueed Paracha


Doing Data Science - Rachel Schutt, Cathy O’Neil

 What is data ?
 Data uncertainty and randomness
 Understanding of Descriptive and Statistical Inference
 Populations and samples
 Fitting a model
 Data Science Process
 What is exploratory data analysis (EDA)?
 Python and its basic functions
 Numpy and Pandas
 Data represents the traces of real-world processes.
OR
 A datum is an abstraction of a real-world entity.
 The terms variable, feature, and attribute are often used
interchangeably to denote an individual abstraction.
 Each entity is typically described by a number of attributes.
 A data set consists of the data relating to a collection of
entities, with each entity described in terms of a set of
attributes.
 Randomness and uncertainty are inherent aspects of data
analysis
 Effective data analysis involves understanding and
appropriately accounting for these factors to make informed
decisions and draw reliable conclusions.
 A mathematical model for uncertainty and randomness is
offered by probability theory.
 Uncertainty refers to the lack of complete knowledge about
the outcome of a measurement or observation.
 It arises from various sources, including measurement error,
variability in the data, and incomplete information.
 For instance, if we're analyzing survey responses and some
participants didn't answer certain questions, there's
uncertainty about the true values of those missing responses.
 Randomness refers to the lack of predictability or pattern in
a sequence of events or observations.
 Randomness can arise from inherent variability in a system,
stochastic processes, or chance events.
 Example: Traffic patterns on roads and highways can exhibit
elements of randomness due to various factors such as
weather conditions, accidents, and human behavior.
 Statistics is a way to get information from data
 Descriptive statistics are used to summarize and describe a
variable or variables for a sample of data.
 For example, sample statistics such as the mean (x̅) and
standard deviation (s) are often used to summarize and
describe continuous variables
 Descriptive statistics help us understand the important
characteristics of a group of data, like its average, spread,
and shape, without having to examine each data point
separately.
 We can use descriptive statistics to describe either an entire
population or an individual sample.
 The mean or average is the sum of all values in a dataset
divided by the number of values.
 It represents the central point around which the data is
distributed.
 Mean = (85 + 90 + 78 + 92 + 87) / 5 = 432 / 5 = 86.4
 Example (grading): In education, teachers often use the mean
to calculate students' average scores on assignments, tests,
or exams. This helps to assess overall performance and
determine grades.
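The mean calculation above can be reproduced in a few lines of Python. This is a minimal sketch using the five scores from the slide; the helper name `mean` is chosen here for illustration only.

```python
# Scores taken from the worked example above
scores = [85, 90, 78, 92, 87]

def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

average = mean(scores)
print(average)  # 86.4
```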
 Median is the middle value of a data set when it is ordered.
With an even number of values, it is the average of the two
middle values.
9, 3, 1, 8, 3, 6
1, 3, 3, 6, 8, 9
median = (3 + 6) / 2 = 4.5
Example: Divide and conquer rule
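A short sketch of the median for the same six values. The function name `median` is an illustrative choice; because the count is even, the two middle values of the sorted data are averaged.

```python
def median(values):
    """Return the middle value of the sorted data.

    With an even number of values, return the mean of the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [9, 3, 1, 8, 3, 6]
print(sorted(data))  # [1, 3, 3, 6, 8, 9]
print(median(data))  # 4.5
```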
 Mode is the value that appears most frequently in a dataset.
 Example: e-commerce
A person who shops more gets more discounts.
9, 3, 1, 8, 3, 6
1, 3, 3, 6, 8, 9
Mode = 3
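One way to find the mode is to count how often each value occurs. The sketch below uses `collections.Counter` from the Python standard library on the same six values.

```python
from collections import Counter

data = [9, 3, 1, 8, 3, 6]

# Count occurrences of each value, then take the most frequent one
counts = Counter(data)
value, frequency = counts.most_common(1)[0]
print(value, frequency)  # 3 2
```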
 Range is a measure of the spread or dispersion of a dataset.
 Example: shopping price range (budget-friendly goods)
It is defined as the difference between the minimum and maximum values in a set of
observations.
9, 3, 1, 8, 3, 6
range = 9 - 1 = 8
 Variance is a measure of the spread or dispersion of a set of data
points. It is calculated as the average of the squared
differences from the mean.
 Standard deviation is a statistical measure that quantifies the
amount of variation or dispersion in a set of data values. In
simpler terms, it tells you how spread out the values in a
dataset are from the mean (average) value.
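The spread measures (range, variance, standard deviation) can be sketched with the standard-library `statistics` module. The data is the six-value example used for the median and mode. Note one assumption that the slides do not spell out: the "average of the squared differences" divides by n (population variance), while the sample variance divides by n - 1; both are shown.

```python
import statistics

data = [9, 3, 1, 8, 3, 6]

# Range: difference between the largest and smallest value
data_range = max(data) - min(data)

# Population variance divides by n; sample variance divides by n - 1
pop_var = statistics.pvariance(data)
sample_var = statistics.variance(data)

# Standard deviation is the square root of the corresponding variance
pop_std = statistics.pstdev(data)
sample_std = statistics.stdev(data)

print(data_range)
print(pop_var, sample_var)
print(pop_std, sample_std)
```

NumPy's `np.var` and `np.std` compute the population versions by default (`ddof=0`).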
 Statistical inference is the process of making conclusions or
predictions about a population based on data collected from
a sample of that population.
 In other words, it involves drawing inferences or
generalizations about a larger group (the population) using
information obtained from a smaller subset (the sample).
 It covers the development of procedures, methods, and theorems
that allow us to extract meaning and information from data that
has been generated by stochastic (random) processes.
 A population is the complete set of traces/data points.
 A sample is a subset of the complete set or population.
 Population, mathematical model, sample (e.g., average height
of all adults in New York).
 No matter what we do, there will be sampling error or
variation due to sampling as we are looking at the part of a
population, not the whole population.
 In statistics, sampling errors are incurred when the
statistical characteristics of a population are estimated from
a subset, or sample, of that population. Since the sample
does not include all members of the population, statistics of
the sample (often known as estimators), such as means and
quartiles, generally differ from the statistics of the entire
population (known as parameters).
 The difference between the sample statistic and population
parameter is considered the sampling error.
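Sampling error can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is hypothetical (10,000 simulated heights drawn from a normal distribution with mean 170 cm and standard deviation 10 cm), and the sample size of 100 is arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated adult heights in cm
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.fmean(population)  # population parameter

# Draw a random sample of 100 people without replacement
sample = random.sample(population, 100)
sample_mean = statistics.fmean(sample)          # sample statistic (estimate)

# Sampling error: sample statistic minus population parameter
sampling_error = sample_mean - population_mean

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"sampling error:  {sampling_error:.2f}")
```

Drawing a different sample (a different seed) gives a different sampling error, which is the variation due to sampling described above.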
 Definition: Fitting a model means finding the best
mathematical representation that describes the relationship
between the input variables and the output variable in your
dataset.

 Example: Suppose you have a dataset containing information
about house prices, including features like the size of the
house, number of bedrooms, and location. You can fit a
linear regression model to predict house prices based on
these features.
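The house-price example can be sketched with NumPy's least-squares solver. All numbers below are made up for illustration, and location is left out because it is categorical and would need encoding first.

```python
import numpy as np

# Hypothetical data: size in square metres, bedrooms, price in thousands
size = np.array([50, 70, 80, 100, 120, 150], dtype=float)
bedrooms = np.array([1, 2, 2, 3, 3, 4], dtype=float)
price = np.array([150, 200, 220, 280, 310, 390], dtype=float)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(size), size, bedrooms])

# Ordinary least squares: coefficients that minimise the squared error
coef, _, rank, _ = np.linalg.lstsq(X, price, rcond=None)
intercept, per_sq_m, per_bedroom = coef

# Fitted values for the training data and a prediction for a new house
predicted = X @ coef
new_house = np.array([1.0, 90.0, 3.0])  # intercept term, 90 sq m, 3 bedrooms
print("coefficients:", coef)
print("predicted price:", new_house @ coef)
```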
 Underfitting occurs when the model is too simple to capture
the underlying patterns in the data.
 Overfitting happens when the model is too complex and
captures noise or random fluctuations in the training data,
rather than the underlying relationships.
 Appropriate fitting occurs when the model captures the
underlying patterns in the data without being too simple
(underfitting) or too complex (overfitting).

 https://www.linkedin.com/pulse/model-fitting-i4data/
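Underfitting and overfitting can be seen by fitting polynomials of different degree to the same noisy data. This is a sketch on hypothetical data (a quadratic trend plus random noise); the degrees 1, 2, and 9 are arbitrary choices standing in for "too simple", "appropriate", and "too complex".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: quadratic trend plus random noise
x_train = np.linspace(-1, 1, 12)
y_train = 1.0 + 2.0 * x_train**2 + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 1.0 + 2.0 * x_test**2 + rng.normal(0, 0.1, x_test.size)

def mse(degree):
    """Fit a polynomial of the given degree; return (train error, test error)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

errors = {degree: mse(degree) for degree in (1, 2, 9)}
for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

The training error can only go down as the degree increases, so a low training error alone does not indicate a good fit. The straight line (degree 1) misses the curve on both sets, while a high-degree fit typically starts chasing the noise, which shows up as a test error that no longer improves.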
 “Exploratory data analysis” is an attitude, a state of
flexibility, a willingness to look for those things that we
believe are not there, as well as those we believe to be
there. — John Tukey

 It's a crucial step in the data science process where analysts
explore and summarize the main characteristics of a dataset
in order to understand the data.
OR
 A method used to analyze and summarize data sets.
 The basic tools and techniques used in EDA include plots,
graphs, and summary statistics (such as mean, median, and
standard deviation).
 Before you start data analysis or run your data through a
machine learning algorithm, you must clean your data and
make sure it is in a suitable form. Further, it is essential to
know any recurring patterns and significant correlations that
might be present in your data. The process of getting to
know your data in depth is called Exploratory Data Analysis
 In the end, EDA helps you make sure the product is
performing as intended.
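A first pass at EDA often checks for missing values and correlations. The sketch below uses pandas on a small hypothetical table; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value in the "bedrooms" column
df = pd.DataFrame({
    "size_sq_m": [50, 70, 80, 100, 120, 150],
    "bedrooms": [1, 2, 2, 3, np.nan, 4],
    "price_k": [150, 200, 220, 280, 310, 390],
})

# How many values are missing in each column
missing_per_column = df.isna().sum()

# Pairwise Pearson correlation between the numeric columns
correlations = df.corr()

print(df.head())
print(missing_per_column)
print(correlations.round(2))
```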
 The data science process typically involves several stages
aimed at extracting insights and knowledge from data to
solve specific problems or make informed decisions.
 The exact steps may vary depending on the context and
the specific methodology.
 NumPy is a Python library used for working with arrays.
 It has functions for working in the domains of linear algebra, Fourier
transforms, and matrices too.
 NumPy was created in 2005 by Travis Oliphant.
 It is an open source project and you can use it freely.
 import numpy as np
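A short sketch of the NumPy features mentioned above: arrays, matrix operations, linear algebra, and the Fourier transform. The matrix and vectors are arbitrary example values.

```python
import numpy as np

# A 2x2 matrix and a vector stored as NumPy arrays
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

print(a.shape)  # (2, 2)

# Matrix-vector product
product = a @ b

# Linear algebra: solve the system a @ x = b for x
x = np.linalg.solve(a, b)

# Fourier transform of a constant signal: all energy lands in the first bin
spectrum = np.fft.fft(np.array([1.0, 1.0, 1.0, 1.0]))

print(product)
print(x)
print(spectrum)
```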
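 Calculate the standard deviation and variance using Python
functions.

import numpy as np

data = [85, 90, 78, 92, 87]  # example values; any numeric list works
variance = np.var(data)      # population variance (NumPy default, ddof=0)
std_dev = np.sqrt(variance)  # standard deviation is the square root of the variance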
 Take a string of names as input from the user and display the names
having more than 5 letters.

names = input("Enter names separated by spaces: ").split()

# Print only the names with more than 5 letters
for name in names:
    if len(name) > 5:
        print(name)
 Write a Python function that checks whether a passed string is a
palindrome or not.

def is_palindrome(s):
    # Compare the string with its reverse
    return s == s[::-1]
 Write a Python program to check whether the given integer is
a multiple of both 5 and 7 using a function.

def is_multiple_of_5_and_7(num):
    # True only when num is divisible by both 5 and 7
    return num % 5 == 0 and num % 7 == 0

 What is Pandas? Pandas is a Python library used for working
with data sets.
 It has functions for analyzing, cleaning, exploring, and
manipulating data.
 The name "Pandas" refers to both "Panel Data" and
"Python Data Analysis". The library was created by Wes McKinney in
2008.
 Pandas is a software library written for the Python
programming language for data manipulation and analysis.
 In particular, it offers data structures and operations for
manipulating numerical tables and time series
 The describe() function returns the statistical summary of
the DataFrame or Series.
 This includes the count, mean, median (50th percentile),
standard deviation, minimum, maximum, and percentile values
of the columns.
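A minimal sketch of `describe()` on a pandas Series, using the five scores from the mean example. The Series name `score` is an illustrative choice. Note that the standard deviation reported by `describe()` is the sample standard deviation (dividing by n - 1).

```python
import pandas as pd

# Scores from the mean example, stored as a pandas Series
scores = pd.Series([85, 90, 78, 92, 87], name="score")

# Summary with count, mean, std, min, 25%, 50%, 75%, max
summary = scores.describe()
print(summary)

# Individual entries can be read by label; "50%" is the median
print(summary["mean"], summary["50%"])
```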
