
Lecture 2 - Statistical Inference - EDA and DS Process - 02032023 111156am 1 - 1 27022024 012412pm


Doing Data Science - Rachel Schutt, Cathy O’Neil

 What is data?
 Data uncertainty and randomness
 Descriptive statistics and statistical inference
 Populations and samples
 Fitting a model
 Data science process
 What is exploratory data analysis (EDA)?
 Python and its basic functions
 NumPy and Pandas
 Data represent the traces of real-world processes.
OR
 A datum is an abstraction of a real-world entity.
 The terms variable, feature, and attribute are often used
interchangeably to denote an individual abstraction.
 Each entity is typically described by a number of attributes.
 A data set consists of the data relating to a collection of
entities, with each entity described in terms of a set of
attributes.
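As a minimal sketch of these definitions, assuming a hypothetical data set of three houses (the field names and values are invented for illustration), each entity can be written as a Python dictionary of attributes:

```python
# Hypothetical data set: each dictionary is one entity (a house),
# and each key is an attribute (also called a variable or feature).
houses = [
    {"size_sqft": 1000, "bedrooms": 2, "location": "suburb"},
    {"size_sqft": 1500, "bedrooms": 3, "location": "city"},
    {"size_sqft": 2000, "bedrooms": 3, "location": "city"},
]

# The attributes shared by every entity in this data set
attributes = list(houses[0].keys())
print("Number of entities:", len(houses))
print("Attributes:", attributes)

# One attribute viewed across all entities
sizes = [house["size_sqft"] for house in houses]
print("Sizes:", sizes)
```

A table (for example a Pandas DataFrame) stores the same idea with one row per entity and one column per attribute.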
 Randomness and uncertainty are inherent aspects of data
analysis.
 Effective data analysis involves understanding and
appropriately accounting for these factors to make informed
decisions and draw reliable conclusions.
 A mathematical model for uncertainty and randomness is
offered by probability theory.
 Uncertainty refers to the lack of complete knowledge about
the outcome of a measurement or observation.
 It arises from various sources, including measurement error,
variability in the data, and incomplete information.
 For instance, if we're analyzing survey responses and some
participants didn't answer certain questions, there's
uncertainty about the true values of those missing responses.
 Randomness refers to the lack of predictability or pattern in
a sequence of events or observations.
 Randomness can arise from inherent variability in a system,
stochastic processes, or chance events.
 Example: Traffic patterns on roads and highways can exhibit
elements of randomness due to various factors such as
weather conditions, accidents, and human behavior.
 Statistics is a way to get information from data.
 Descriptive statistics are used to summarize and describe a
variable or variables for a sample of data.
 For example, sample statistics such as the mean (x̅) and
standard deviation (s) are often used to summarize and
describe continuous variables.
 Descriptive statistics help us understand the important
characteristics of a group of data, like its average, spread,
and shape, without having to examine each data point
separately.
 We can use descriptive statistics to describe either an entire
population or an individual sample.
 The mean, or average, is the sum of all values in a dataset
divided by the number of values.
 It represents the central point around which the data are
distributed.
 Mean = (85 + 90 + 78 + 92 + 87) / 5 = 432 / 5 = 86.4
 Example (grading): In education, teachers often use the mean
to calculate students' average scores on assignments, tests,
or exams. This helps to assess overall performance and
determine grades.
 The median is the middle value of a data set when it is ordered.
With an even number of values, it is the average of the two
middle values.
9, 3, 1, 8, 3, 6
1, 3, 3, 6, 8, 9
median = (3 + 6) / 2 = 4.5
Example: divide and conquer rule
 The mode is the value that appears most frequently in a dataset.
 Example (e-commerce): a person who shops more often gets more
discounts.
9, 3, 1, 8, 3, 6
1, 3, 3, 6, 8, 9
Mode = 3
 The range is a measure of the spread or dispersion of a dataset.
 Example: shopping price range (budget-friendly goods)
It is defined as the difference between the maximum and minimum
values in a set of observations.
9, 3, 1, 8, 3, 6
range = 9 - 1 = 8
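The hand calculations above can be checked with a short sketch that uses Python's standard `statistics` module. The variable names are chosen here for illustration; the numbers are the ones from the slides.

```python
import statistics

scores = [85, 90, 78, 92, 87]   # values from the mean example
data = [9, 3, 1, 8, 3, 6]       # values from the median, mode, and range examples

mean_score = statistics.mean(scores)   # 432 / 5
median = statistics.median(data)       # average of the two middle values, 3 and 6
mode = statistics.mode(data)           # most frequent value
value_range = max(data) - min(data)    # largest value minus smallest value

print("mean of scores:", mean_score)
print("median:", median)
print("mode:", mode)
print("range:", value_range)
```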
 Variance is a measure of the spread or dispersion of a set of data
points. It is calculated as the average of the squared
differences from the mean.
 Standard deviation is a statistical measure that quantifies the
amount of variation or dispersion in a set of data values. In
simpler terms, it tells you how spread out the values in a
dataset are from the mean (average) value. It is the square root
of the variance.
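A step-by-step sketch of this calculation, reusing the values 9, 3, 1, 8, 3, 6 from the earlier examples. It assumes the population form (divide by n), which matches the definition above; the sample form (divide by n - 1) is shown for comparison and is an addition, not something the slides specify.

```python
import math
import statistics

data = [9, 3, 1, 8, 3, 6]

mean = sum(data) / len(data)                          # 30 / 6
squared_diffs = [(x - mean) ** 2 for x in data]       # 16, 4, 16, 9, 4, 1

population_variance = sum(squared_diffs) / len(data)  # average squared difference
population_std = math.sqrt(population_variance)       # square root of the variance

# Sample variance divides by n - 1 instead of n
sample_variance = sum(squared_diffs) / (len(data) - 1)

print("population variance:", population_variance)
print("population std dev:", population_std)
print("sample variance:", sample_variance)

# The standard library gives the same results
print(statistics.pvariance(data), statistics.variance(data))
```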
 Statistical inference is the process of drawing conclusions or
making predictions about a population based on data collected
from a sample of that population.
 In other words, it involves drawing inferences or
generalizations about a larger group (the population) using
information obtained from a smaller subset (the sample).
 It covers the development of procedures, methods, and theorems
that allow us to extract meaning and information from data that
has been generated by stochastic (random) processes.
 A population is the complete set of traces or data points.
 A sample is a subset of the complete set or population.
 Population, mathematical model, sample (e.g., the average height
of all adults in New York).
 No matter what we do, there will be sampling error, or
variation due to sampling, because we are looking at only part of
a population, not the whole population.
 In statistics, sampling errors are incurred when the
statistical characteristics of a population are estimated from
a subset, or sample, of that population. Since the sample
does not include all members of the population, statistics of
the sample (often known as estimators), such as means and
quartiles, generally differ from the statistics of the entire
population (known as parameters).
 The difference between the sample statistic and population
parameter is considered the sampling error.
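The idea can be illustrated with a small simulation. This is only a sketch: the population below is synthetic (10,000 heights drawn from a normal distribution with invented parameters), not real data about any city.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 adult heights in cm (synthetic values)
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)   # the parameter

# Draw a random sample of 100 people without replacement
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)           # the statistic (estimator)

# Sampling error: difference between the statistic and the parameter
sampling_error = sample_mean - population_mean

print("population mean:", round(population_mean, 2))
print("sample mean:", round(sample_mean, 2))
print("sampling error:", round(sampling_error, 2))
```

Changing the seed or the sample size gives a different sampling error each time; larger samples tend to give smaller errors.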
 Definition: Fitting a model means finding the best
mathematical representation that describes the relationship
between the input variables and the output variable in your
dataset.

 Example: Suppose you have a dataset containing information
about house prices, including features like the size of the
house, number of bedrooms, and location. You can fit a
linear regression model to predict house prices based on
these features.
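A minimal sketch of such a fit, using NumPy's least-squares solver. The data are synthetic and noise-free (generated from price = 50 + 0.1 * size + 10 * bedrooms, in thousands) so the recovered coefficients are easy to check; the location feature is left out because it would first need to be encoded as numbers.

```python
import numpy as np

# Synthetic, noise-free data invented for illustration:
# price (in thousands) = 50 + 0.1 * size + 10 * bedrooms
size = np.array([1000, 1500, 2000, 2500, 1200], dtype=float)
bedrooms = np.array([2, 3, 3, 4, 2], dtype=float)
price = np.array([170, 230, 280, 340, 190], dtype=float)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(size), size, bedrooms])

# Least-squares fit: coefficients that minimize the squared prediction error
coefficients, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, per_sqft, per_bedroom = coefficients

# Predict the price of a hypothetical 1800 sq ft, 3-bedroom house
predicted = intercept + per_sqft * 1800 + per_bedroom * 3

print("coefficients:", coefficients)
print("predicted price (thousands):", predicted)
```

With real, noisy data the fitted coefficients would only approximate the underlying relationship.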
 Underfitting occurs when the model is too simple to capture
the underlying patterns in the data.
 Overfitting happens when the model is too complex and
captures noise or random fluctuations in the training data,
rather than the underlying relationships.
 Appropriate fitting occurs when the model captures the
underlying patterns in the data without being too simple
(underfitting) or too complex (overfitting).
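One way to see the three cases is to fit polynomials of different degrees to the same noisy data. The sketch below assumes synthetic data generated from a quadratic curve plus noise; the degrees 1, 2, and 9 are arbitrary choices standing in for "too simple", "appropriate", and "too complex".

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data: a quadratic relationship plus random noise
x_train = np.linspace(-1, 1, 15)
y_train = x_train**2 + rng.normal(0, 0.1, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = x_test**2 + rng.normal(0, 0.1, x_test.size)

def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 2, 9):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree increases. The error on unseen test data typically stops improving, and often gets worse, once the model starts fitting noise.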

 https://www.linkedin.com/pulse/model-fitting-i4data/
 “Exploratory data analysis” is an attitude, a state of
flexibility, a willingness to look for those things that we
believe are not there, as well as those we believe to be
there. — John Tukey

 It is a crucial step in the data science process where analysts
explore and summarize the main characteristics of a dataset
in order to understand the data.
OR
 A method used to analyze and summarize data sets.
 The basic tools and techniques used in EDA include plots,
graphs, and summary statistics (such as the mean, median, and
standard deviation).
 Before you start data analysis or run your data through a
machine learning algorithm, you must clean your data and
make sure it is in a suitable form. Further, it is essential to
know any recurring patterns and significant correlations that
might be present in your data. The process of getting to
know your data in depth is called Exploratory Data Analysis.
 In the end, EDA helps you make sure the product is
performing as intended.
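A first EDA pass might look like the sketch below. It assumes a small hypothetical housing table (the column names and values are invented), and dropping incomplete rows is just one simple cleaning choice; plots are omitted to keep the example short.

```python
import pandas as pd

# Hypothetical housing data with one missing price (None becomes NaN)
df = pd.DataFrame({
    "size_sqft": [1000, 1500, 2000, 2500, 1200],
    "bedrooms": [2, 3, 3, 4, 2],
    "price_k": [170, 230, None, 340, 190],
})

missing_per_column = df.isna().sum()  # count missing values in each column
clean = df.dropna()                   # one simple cleaning choice: drop incomplete rows
correlations = clean.corr()           # pairwise Pearson correlations between columns

print(missing_per_column)
print(clean.describe())               # summary statistics for each column
print(correlations)
```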
 The data science process typically involves several stages
aimed at extracting insights and knowledge from data to
solve specific problems or make informed decisions.
 The exact steps may vary depending on the context and the
specific methodology.
 NumPy is a Python library used for working with arrays.
 It also has functions for working in the domains of linear algebra,
Fourier transforms, and matrices.
 NumPy was created in 2005 by Travis Oliphant.
 It is an open source project and you can use it freely.
 import numpy as np
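A short sketch of what working with arrays looks like in practice. The values are arbitrary examples.

```python
import numpy as np

# A one-dimensional array and some element-wise operations
values = np.array([9, 3, 1, 8, 3, 6])
doubled = values * 2          # multiplies every element by 2
total = values.sum()          # sum of all elements

# A small linear algebra example: matrix-vector product and determinant
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([1, 1])
product = matrix @ vector             # matrix-vector multiplication
determinant = np.linalg.det(matrix)   # 1*4 - 2*3

print(doubled, total)
print(product, determinant)
```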
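 Calculate the standard deviation and variance using Python
functions.

import numpy as np

data = np.array([9, 3, 1, 8, 3, 6])  # example values from the earlier slides
variance = np.var(data)      # population variance (ddof=0 by default)
std_dev = np.sqrt(variance)  # same result as np.std(data)
print(variance, std_dev)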
 Take names as input from the user and display the names
that have more than 5 letters.

names = input("Enter names separated by spaces: ").split()

# Print each name that has more than 5 letters
for name in names:
    if len(name) > 5:
        print(name)
 Write a Python function that checks whether a passed string is
a palindrome or not.

def is_palindrome(s):
    # Compare the string with its reverse
    return s == s[::-1]
 Write a Python program that uses a function to check whether
a given integer is a multiple of both 5 and 7.

def is_multiple_of_5_and_7(num):
    return num % 5 == 0 and num % 7 == 0


 What is Pandas? Pandas is a Python library used for working
with data sets.
 It has functions for analyzing, cleaning, exploring, and
manipulating data.
 The name "Pandas" refers to both "Panel Data" and
"Python Data Analysis". The library was created by Wes McKinney
in 2008.
 Pandas is a software library written for the Python
programming language for data manipulation and analysis.
 In particular, it offers data structures and operations for
manipulating numerical tables and time series.
 The describe() function returns the statistical summary of
the DataFrame or Series.
 This includes the count, mean, standard deviation, minimum,
maximum, and percentile values of the columns (the 50th
percentile is the median).
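As a small sketch, describe() can be applied to the five scores from the mean example earlier. The Series name "score" is chosen here for illustration.

```python
import pandas as pd

# The five scores used in the mean example
scores = pd.Series([85, 90, 78, 92, 87], name="score")

summary = scores.describe()   # count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# Individual entries can be read by label; "50%" is the median
print("mean:", summary["mean"])
print("median:", summary["50%"])
```

Note that the standard deviation reported by describe() is the sample standard deviation (it divides by n - 1).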
