Data Sciences Unit-I

Instructor: Krishna Dutt, [email protected]

March 23, 2024

Disclaimer: The views expressed in this presentation are those of the author. Much open-source content is referenced, with all authors duly acknowledged.
Unit-I: Introduction to Data Science

Data Science
Data Objects and Attribute types
Measuring Data similarity and dissimilarity
Data Preprocessing
  Data Cleaning
  Data Integration
  Data Reduction
  Data Transformation
  Data Discretization
Traits of Big Data
Hypothesis and inference
Analysis vs Reporting
Unit-I: Data Science Toolkits using Python

Matplotlib
NumPy
Scikit-learn
Visualizing Data: Bar Charts, Line Charts, Scatter plots
Working with data: Reading Files
Traits of Big Data

Volume: Large amounts of data (terabytes to petabytes).
Velocity: Rapid generation and processing of data.
Variety: Different types of data (structured, unstructured).
Veracity: Reliability and trustworthiness of data.
Variability: Inconsistency in data flow.
Value: Extracting meaningful insights from data.
Complexity: Complex data structures and relationships.
Data Quality

Real-world data is often inaccurate, incomplete, and inconsistent for various reasons. Processing of data requires that it is accurate, complete for the intended purpose, and consistent. The following describe important requirements for data quality.

Accuracy: Data values may be inaccurate because the instruments used for data collection may be faulty; because individuals or organizations, for reasons of privacy etc., purposely submit incorrect data values; or because of errors in data transmission due to technology limitations, like limited buffer size for coordinating synchronized data transfer.

Completeness: Attributes of interest may not always be available, such as customer information in sales transaction data. Some attributes may not be included as they were not considered important.
Data Preprocessing: Missing Values and Noise

Data preprocessing is crucial for ensuring the quality and reliability of analysis results. Two key aspects of preprocessing are handling missing values and cleaning noise.

Missing Value Handling: Missing values can arise due to reasons like data entry errors, sensor malfunctions, or intentional data omissions. Techniques for handling missing values include:
  Imputation: Replace missing values with a suitable estimate, such as the mean, median, or mode of the column.
  Deletion: Remove rows or columns with missing values, especially if the affected portion of the data is small.

Noise Cleaning: Noise refers to random errors or fluctuations in the data that can obscure meaningful patterns. Techniques for noise cleaning include:
  Smoothing: Apply filters or moving averages to remove high-frequency noise while preserving the underlying trends.
  Outlier detection and removal: Identify and eliminate data points that deviate significantly from the rest of the dataset.
  Transformation: Transform the data using mathematical functions to reduce the impact of noise and improve interpretability.

Example: In a dataset of sensor readings, missing values can be imputed using the median value of the column, while noise from sensor errors can be cleaned using smoothing techniques.
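A minimal pandas sketch of both steps (the column name "reading" and the window size are illustrative assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({"reading": [10.1, 10.3, np.nan, 10.2, 55.0, 10.4, np.nan, 10.2]})

# Imputation: replace missing values with the column median
df["reading"] = df["reading"].fillna(df["reading"].median())

# Smoothing: 3-point centered moving average damps high-frequency noise
df["smoothed"] = df["reading"].rolling(window=3, center=True, min_periods=1).mean()
print(df)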
Data Integration

Data integration combines data from multiple sources into a unified dataset for analysis. This process aims to provide a comprehensive view of the underlying phenomenon and enable more robust analysis. Techniques for data integration include:
  Schema matching: Identify and reconcile differences in data schemas across sources to ensure consistency.
  Entity resolution: Resolve discrepancies in entity names or identifiers to avoid duplication and ensure accurate integration.
  Data fusion: Merge datasets based on common attributes or keys, taking into account data quality and reliability.

Example: Integrating sales data from different regions and time periods to analyze overall sales trends and patterns.
Data Reduction

Data reduction obtains a reduced representation of the dataset that is much smaller in volume yet yields nearly the same analytical results. Techniques for data reduction include:
  Principal Component Analysis (PCA): PCA identifies the principal components of variation in the data and represents the dataset in a lower-dimensional space while retaining most of its variance.
  Attribute Subset Selection: Select a subset of relevant features or attributes from the dataset based on criteria such as correlation, importance, or domain knowledge.
  Parametric Reduction: Utilize techniques like feature engineering or model fitting so that the data can be represented by a small set of model parameters rather than the raw values.
Data Transformation

Data transformation converts data into forms better suited to analysis, for example by rescaling or smoothing variables. Techniques for data transformation include:
  Standard Normalization: Standardize data to have zero mean and unit variance, making it more amenable to certain algorithms like linear regression or neural networks.
  Min-Max Normalization: Rescale data to a fixed range, typically [0, 1], to maintain relative differences while ensuring uniformity across features.
  Smoothing: Apply smoothing techniques such as moving averages to reduce noise and expose underlying trends or patterns.

Example: Normalizing features such as income, age, and education level to a common scale to avoid bias in a machine learning model based on Euclidean distances.
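A short scikit-learn sketch of both scalings (the [age, income] values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[30, 50000], [25, 60000], [35, 75000]], dtype=float)  # columns: age, income

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]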
Data Discretization

Data discretization transforms a continuous variable into a finite number of intervals or bins. This can help simplify analysis, reduce complexity, and facilitate pattern discovery. Techniques for data discretization include:
  Binning: Divide the range of a continuous variable into equal-width or equal-frequency bins to group similar values together.
  Histogram Analysis: Construct histograms to visualize the distribution of continuous variables and identify natural breakpoints for discretization.

Example: Discretizing the age variable into age groups (e.g., 0-18, 19-35, 36-60, 60+).
Data Matrix

A dataset of n objects, each described by m attributes, is represented as an n × m data matrix:

X = [ x_{1,1}  x_{1,2}  ...  x_{1,j}  ...  x_{1,m}
      x_{2,1}  x_{2,2}  ...  x_{2,j}  ...  x_{2,m}
        ...      ...           ...           ...
      x_{n,1}  x_{n,2}  ...  x_{n,j}  ...  x_{n,m} ]   (1)

Column and row views: Each column represents a feature/attribute/variable of the objects. Each row depicts one object with its m features. For example, the second column and second row are

X_2 = [x_{1,2}, x_{2,2}, ..., x_{n,2}]^T,   O_2 = [x_{2,1}, x_{2,2}, ..., x_{2,j}, ..., x_{2,m}]   (2)
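In NumPy the same object/feature views fall out of array indexing (values are illustrative):

import numpy as np

X = np.array([[30, 50000, 2000],
              [25, 60000, 2500],
              [35, 75000, 3000]])  # 3 objects (rows) x 3 features (columns)

print(X[:, 1])  # column view: the second feature across all objects
print(X[1, :])  # row view: all features of the second object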
Attributes/features/variables

Attributes: Each column in a data matrix is an attribute (also called a feature or variable). Ex. Age, Gender, Salary, Degree.
Attributes are broadly classified into qualitative and quantitative attributes, with examples below.

Qualitative Attributes: Qualitative attributes, also known as categorical or nominal attributes, represent characteristics or qualities that cannot be measured numerically. These attributes describe the quality or category of an object or entity. They are often represented by labels or names.
Examples:
  Gender: Male, Female, Other
  Marital Status: Single, Married, Divorced
Quantitative Attributes: Quantitative attributes represent measurable quantities expressed as numerical values and can be further categorized into discrete or continuous attributes.

Discrete: Discrete attributes can only take on specific, distinct values within a finite or countable range. They are often counts or integers.
Examples:
  Number of Children in a Family (e.g., 0, 1, 2, 3)
  Number of Employees in a Company (e.g., 10, 20, 30)
  Number of Pets (e.g., 0, 1, 2, 3)

Continuous: Continuous attributes can take on any value within a range and are typically measurements, such as height, weight, or temperature.
Summary of attribute types:

Qualitative Attributes
  Definition: Characteristics or qualities that cannot be measured numerically. They describe the quality or category of an object or entity.
  Examples: Gender (Male, Female, Other); Marital Status (Single, Married, Divorced); Type of Vehicle (Sedan, SUV, Truck); Colors (Red, Blue, Green)

Quantitative Attributes
  Definition: Attributes representing quantities or measurable quantities that can be expressed numerically. Can be discrete or continuous.
  Examples (Discrete): Number of Children in a Family; Number of Employees in a Company; Number of Pets
Binary Attributes

A binary attribute has only two states, e.g., Y/N. If both states carry equal weight the attribute is symmetric; if one state (Y) is given more weightage it is asymmetric. Ex. has fever/infection: absence of fever may be less important than having it. So, binary attributes are further classified as symmetric and asymmetric binary attributes.

Ex:
  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  P1    M       Y      N      P       N       N       N
  P2    M       Y      Y      N       N       N       N
  P3    F       Y      N      P       N       P       N

In the above table we see binary attributes for Gender, Fever, Cough, and the tests. Gender is symmetric, since both states carry equal weight. In case of the other attributes, having fever and cough, and testing positive, are more important than the negative state. Hence, these can be considered asymmetric.
Tables vs Matrices

In data science and ML/DL, data is represented as a matrix.
Question: Is a table the same as a matrix?
  In a matrix, as defined in linear algebra, the cells belong to R, the set of real numbers.
  Hence rows and columns are vectors in a real vector space.
  Hence both statistical techniques as well as algebraic techniques can directly be used.
  A table, however, may contain non-numeric (qualitative) attributes, as described previously.
  Hence, non-numeric data need to be mapped into numeric data before representing the table as a matrix.
Data Accuracy:
  Definition: Accuracy means the recorded values reflect the true values.
  Example: In a database of student ages, if a student's age is recorded as 25 instead of 23, it represents an inaccuracy in the data.
Data Consistency:
  Definition: Consistency ensures uniformity and coherence of data across various databases or data sources.
  Example: In a customer database, if the same customer is represented as "Raghava Ram" in one record and "Ram, Raghava" in another, it reflects inconsistency.
Data Completeness:
  Definition: Completeness ensures that all the data required for the intended purpose is present.
Data Cleaning

Data cleaning is the process of detecting and correcting corrupt or inaccurate records, aiming to improve data quality by addressing issues like errors, inconsistencies, and missing values.

Numerical Example: Consider a dataset with age values of individuals, where you observe entries with negative values or outliers.

Data Cleaning Steps:
  Remove entries with negative age values, as they are erroneous.
  Address outliers by either replacing them with a reasonable value or removing them.

Common data cleaning techniques include removing or correcting invalid values, handling outliers, and imputing missing entries, as in the sketch below.
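A pandas sketch of the age-cleaning example (values and the 100-year threshold are illustrative):

import pandas as pd

df = pd.DataFrame({"age": [23, -5, 31, 150, 45, 28]})

df = df[df["age"] >= 0]                      # drop erroneous negative ages
median_age = df.loc[df["age"] <= 100, "age"].median()
df.loc[df["age"] > 100, "age"] = median_age  # replace outliers with a reasonable value
print(df)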
Data Integration involves combining data from different sources to provide a unified view, enabling better analysis and decision-making.

Example: Combining Customer Data
  Assume you have customer information in two datasets - one containing personal details and another with purchase history.

Data Integration Steps:
  Identify a common key, such as customer ID, in both datasets.
  Merge datasets based on the common key to create a unified dataset with both personal details and purchase history.
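A pandas sketch of this merge (column names and values are illustrative):

import pandas as pd

personal = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
purchases = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [250, 100, 400]})

# Join on the common key to get a unified view of customers and purchases
unified = personal.merge(purchases, on="customer_id", how="left")
print(unified)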
Data Reduction

Data reduction obtains a smaller representation of the data that produces the same or similar analytical results. It involves techniques to minimize the amount of data while preserving its integrity.

Example: Principal Component Analysis (PCA)
  Assume you have a dataset with many correlated variables.
  PCA can be applied to transform the dataset into a new set of uncorrelated variables (principal components) while retaining most of the original information.
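A scikit-learn sketch of PCA on synthetic correlated data (the shapes and the choice of 2 components are illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
# Second column is strongly correlated with the first; third is independent
X = np.hstack([x, 2 * x + rng.normal(0.0, 0.1, size=(100, 1)), rng.normal(size=(100, 1))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # 100 x 3 -> 100 x 2
print(pca.explained_variance_ratio_)   # variance retained by each component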
Numerosity Reduction

Numerosity reduction replaces the data with a smaller representation while preserving essential information. This can be achieved through techniques like sampling and aggregation.

Random Sampling: Selecting a random subset of the data points to represent the entire dataset. This is useful when dealing with large datasets.
Stratified Sampling: Dividing the dataset into strata and then applying random sampling within each stratum. It ensures representation from different subgroups.
Aggregation Example

Consider a dataset with individual daily sales data. To reduce numerosity, we can aggregate this data into weekly sales totals.

  Day    Sales          Week    Weekly Sales
  Day 1  100            Week 1  370
  Day 2  150
  Day 3  120
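A pandas sketch of this aggregation (the three-day "week" follows the table above):

import pandas as pd

daily = pd.DataFrame({"day": ["Day 1", "Day 2", "Day 3"], "sales": [100, 150, 120]})
daily["week"] = "Week 1"                       # assign each day to its week
weekly = daily.groupby("week")["sales"].sum()  # 100 + 150 + 120 = 370
print(weekly)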
Introduction to Data Transformation:
  Data Transformation involves converting data into a suitable format for analysis.
  It may include normalization, standardization, encoding, or other techniques.

Numerical Example - Normalization:
  Original Data: [2, 5, 10, 7]
  Normalized Data: [0, 0.375, 1, 0.625]
  Formula: x_normalized = (x − min(x)) / (max(x) − min(x))
Numerical Example - Standardization:
  Original Data: [10, 15, 20, 25]
  Standardized Data: [−1.34, −0.45, 0.45, 1.34]
  Formula: x_standardized = (x − mean(x)) / std(x)

Benefits of Data Transformation:
  Improves the performance and accuracy of machine learning algorithms.
Data Discretization

Data discretization converts continuous data into discrete categories or bins. It simplifies the data and makes it easier to analyze or apply certain algorithms.

Example 1 - Age Discretization:
  Original Age Data: [25, 30, 35, 40, 45]
  Discretized Age Categories: [Young, Young Adult, Adult, Middle-Aged, Senior]
  Criteria: [0−30, 31−35, 36−40, 41−45, 46+]
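A pandas sketch of this binning (bin edges follow the criteria above; the upper edge 120 is an illustrative cap):

import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45])
labels = ["Young", "Young Adult", "Adult", "Middle-Aged", "Senior"]
# Edges chosen so that 0-30, 31-35, 36-40, 41-45, 46+ map to the five labels
groups = pd.cut(ages, bins=[0, 30, 35, 40, 45, 120], labels=labels)
print(groups.tolist())  # ['Young', 'Young', 'Young Adult', 'Adult', 'Middle-Aged']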
Standardization in Vector Form

Standardized Feature = (X − µ) / σ

The standardization process scales the data to have a mean of 0 and a standard deviation of 1. Consider

X = [1, 2, 3]^T   (3)

µ = (1/3) · 1^T · X = (1/3)(1 + 2 + 3) = 2   (4)

(X − µ · 1) = [1 − 2, 2 − 2, 3 − 2]^T = [−1, 0, 1]^T   (5)
σ = sqrt( (1/N) (X − µ · 1)^T · (X − µ · 1) )   (6)

  = sqrt( (1/3) [−1 0 1] · [−1, 0, 1]^T ) = sqrt(2/3)   (7)

Standardized Feature = (X − µ · 1) / σ = (1/sqrt(2/3)) · [−1, 0, 1]^T   (8)
Min-Max Scaling:

Scale the data within a specific range (e.g., [0, 1]):

Scaled Feature = (X − min(X)) / (max(X) − min(X))
Example: consider the following data with different ranges in the numerical attributes.

  Name  Age  Income  Expenses
  A     30   50000   2000
  B     25   60000   2500
  C     35   75000   3000

After column-wise standardization (zero mean, unit variance, using the 1/n convention):

  Name  Age      Income   Expenses
  A      0.000   −1.136   −1.224
  B     −1.224   −0.162    0.000
  C      1.224    1.298    1.224
Statistics in Vector Form

Consider columns X and Y, of numerical type, in a data matrix. Define 1 as a vector of ones of the same size as X. Then

Mean:
  X̄ = (1/n) · 1^T · X

Variance:
  σ² = (1/n) · (X − 1X̄)^T · (X − 1X̄)

Standard Deviation:
  σ = sqrt( (1/n) · (X − 1X̄)^T · (X − 1X̄) )
Given the data vector X = [4, 6, 8]^T, the mean (X̄) in vector form is calculated as follows:

X̄ = (1/3) · 1^T · X = (1/3) [1 1 1] · [4, 6, 8]^T = 18/3 = 6

Given the same X, the standard deviation (σ) in vector form is calculated as follows:

σ = sqrt( (1/3) (X − 1X̄)^T · (X − 1X̄) )
  = sqrt( (1/3) [−2 0 2] · [−2, 0, 2]^T ) = sqrt(8/3)
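The same quantities in NumPy (ddof=0 matches the 1/n definition used above):

import numpy as np

X = np.array([4.0, 6.0, 8.0])
print(X.mean())       # 6.0
print(X.std(ddof=0))  # sqrt(8/3) ~ 1.633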
Relationship between all features

Given data vectors X = [1, 2, 3]^T and Y = [2, 3, 4]^T, the covariance (Σ) in vector form is calculated as follows:

Σ = (1/3) (X − 1X̄)^T · (Y − 1Ȳ)
  = (1/3) [−1 0 1] · [−1, 0, 1]^T = 2/3
Covariance Matrix

For a data matrix X with vector of column means X̄,

Σ = (1/N) (X − 1X̄)^T · (X − 1X̄)

Let's take a numerical example with a 2x3 data matrix:

X = [ 1 2 3
      4 5 6 ]

The mean (X̄) for each column is calculated as:

X̄ = (1/2) [1^T·X_1, 1^T·X_2, 1^T·X_3]^T = [(1+4)/2, (2+5)/2, (3+6)/2]^T = [2.5, 3.5, 4.5]^T

The centered matrix and the covariance matrix are then:

X − 1X̄ = [ −1.5 −1.5 −1.5
             1.5  1.5  1.5 ]

Σ = (1/2) (X − 1X̄)^T (X − 1X̄) = [ 2.25 2.25 2.25
                                    2.25 2.25 2.25
                                    2.25 2.25 2.25 ]
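A NumPy check of this computation (np.cov with bias=True uses the same 1/N convention):

import numpy as np

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
Xc = X - X.mean(axis=0)             # center each column
print(Xc.T @ Xc / X.shape[0])       # matches the matrix above
print(np.cov(X, rowvar=False, bias=True))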
Covariance vs Correlation

Both quantify the relationship between two variables.

Covariance:
  Measures the degree of joint variability between two variables.
  Formula: cov(X, Y) = (1/n) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)
  Unit of measurement is the product of the units of the two variables.
  Scale is not standardized, making it difficult to compare across different datasets.

Correlation:
  Formula: corr(X, Y) = cov(X, Y) / (σ_X σ_Y)
  Ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
  Unitless, making it easier to compare across different datasets.

Key Differences:
  Covariance can take any value, while correlation is normalized between -1 and 1.
Euclidean Distance:

  Euclidean Distance(X, Y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )

  Vector Form: sqrt( (X − Y)^T · (X − Y) )

Manhattan Distance:

  Manhattan Distance(X, Y) = Σ_{i=1}^{n} |x_i − y_i|

  Vector Form: Σ_i |X_i − Y_i|

Cosine Similarity:

  Cosine Similarity(X, Y) = X^T · Y / sqrt( (X^T · X) · (Y^T · Y) )
  Vector Form: X^T·Y / (∥X∥·∥Y∥)

Example:

  X = [1, 2, 3]^T,  Y = [2, 3, 4]^T

  Cosine Similarity(X, Y) = 20 / sqrt(406)

Jaccard Similarity:

  Jaccard Similarity(X, Y) = |X ∩ Y| / |X ∪ Y|

  For binary vectors: X^T·Y / (X^T·X + Y^T·Y − X^T·Y)

Choosing a similarity metric:

  Similarity Metric     Use Cases
  Euclidean Distance    Numerical data, continuous features
  Cosine Similarity     Text data, high-dimensional sparse vectors
  Jaccard Similarity    Sets, binary data, categorical features
  Pearson Correlation   Linear relationships, continuous data
  Hamming Distance      Binary strings, categorical data
Cosine Similarity for Text Data

In the bag-of-words representation, each document is a vector in which each dimension corresponds to a unique word in the vocabulary, and the value of each dimension represents the frequency or TF-IDF score of the word in the document.
  Cosine similarity measures the cosine of the angle between two vectors, which reflects their similarity in direction regardless of their magnitude.
  For text data, cosine similarity is used to calculate the similarity between documents or text passages based on their word frequencies or TF-IDF scores.
  Documents with similar content will have vectors pointing in similar directions.
Example: Document 1: D1 = [3, 2, 1, 0], Document 2: D2 = [2, 1, 0, 3]

The cosine similarity between D1 and D2 is calculated as:

similarity(D1, D2) = D1 · D2 / (∥D1∥ ∥D2∥)

Substituting the values:

similarity(D1, D2) = (3×2 + 2×1 + 1×0 + 0×3) / ( sqrt(3² + 2² + 1² + 0²) × sqrt(2² + 1² + 0² + 3²) )
                   = (6 + 2) / ( sqrt(14) × sqrt(14) )
                   = 8/14 ≈ 0.571

Thus, the cosine similarity between the two documents is approximately 0.571.
Numerical Example: Cosine Similarity

Given vectors X = [1, 2, 3]^T and Y = [2, 3, 4]^T, the cosine similarity in vector form is calculated as follows:

Cosine Similarity = X^T · Y / (∥X∥ · ∥Y∥)
                  = (1·2 + 2·3 + 3·4) / ( sqrt(1² + 2² + 3²) · sqrt(2² + 3² + 4²) )
                  = 20 / ( sqrt(14) · sqrt(29) ) ≈ 0.993
Numerical Example: Euclidean Distance

Given vectors X = [1, 2, 3]^T and Y = [4, 5, 6]^T, the Euclidean distance in vector form is calculated as follows:

Euclidean Distance = sqrt( (X − Y)^T · (X − Y) )
                   = sqrt( [−3 −3 −3] · [−3, −3, −3]^T )
                   = sqrt(27) ≈ 5.196
Numerical Example: Manhattan Distance

Given the same vectors, the Manhattan distance in vector form is calculated as follows:

Manhattan Distance = 1^T · |X − Y|
                   = [1 1 1] · [|1−4|, |2−5|, |3−6|]^T
                   = 1·3 + 1·3 + 1·3 = 9
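The three metrics in NumPy, reproducing the worked examples above:

import numpy as np

X, Y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
print(np.sqrt((X - Y) @ (X - Y)))  # Euclidean: sqrt(27) ~ 5.196
print(np.abs(X - Y).sum())         # Manhattan: 9.0

A, B = np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0])
print(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))  # Cosine: ~0.993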
Dissimilarity of Nominal Attributes

The following shows the original data table, the transformed data table, and the dissimilarity table. Let us map the nominal attributes into integers as A maps to 1, B maps to 2, C maps to 3, arbitrarily.

  Object  Color   Shape     Size
  1       Red     Circle    Small
  2       Blue    Triangle  Medium
  3       Green   Square    Large

  Object  Color  Shape  Size
  1       1      2      3
  2       2      1      2
  3       3      3      1

Euclidean distances on the transformed values give:

        O1    O2    O3
  O1    0     √3    3
  O2    √3    0     √6
  O3    3     √6    0

Note that these distances depend entirely on the arbitrary integer mapping: every pair of objects differs in all three attributes, yet the distances come out unequal.
Since such integer mappings are arbitrary, the following measure is defined for obtaining the dissimilarity of nominal attributes:

d(i, j) = m / p,   (9)

where m is the count of differing attributes, and p is the total number of attributes. For the above table p = 3 and the dissimilarity table becomes

        O1   O2   O3
  O1    0    1    1
  O2    1    0    1
  O3    1    1    0

From the above it can be seen that all three objects are mutually equally dissimilar, since each pair differs in all three attributes.
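A small Python sketch of d(i, j) = m/p for nominal tuples:

def nominal_dissimilarity(a, b):
    """Fraction of attributes on which two objects differ."""
    m = sum(1 for x, y in zip(a, b) if x != y)
    return m / len(a)

print(nominal_dissimilarity(("Red", "Circle", "Small"), ("Blue", "Triangle", "Medium")))  # 1.0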
Dissimilarity of Ordinal Attributes

Ordinal attributes are ranked, as they show an order in qualifying real-world attributes, e.g. excellent, very good, good, fair, each of which is better than the subsequent one, starting from excellent. Hence they can be transformed into numerical values by ranking them. Dissimilarity is obtained in steps, as follows.

Step 1: The value of an ordinal attribute in any object is mapped to a rank r ∈ {1, ..., M}, where M is the number of ranks for that attribute.

Step 2: Each rank r is mapped to a number in [0.0, 1.0] by the formula

Z = (r − 1) / (M − 1)   (10)
Example: The table below maps the ordinal attribute Performance into numerical values; the dissimilarity matrix is then obtained with the Manhattan distance on the transformed values.

  Object  Performance       Numerical value
  1       Excellent (r=3)   (3−1)/(3−1) = 1
  2       Fair (r=1)        (1−1)/(3−1) = 0
  3       Good (r=2)        (2−1)/(3−1) = 0.5
  4       Excellent (r=3)   (3−1)/(3−1) = 1

        O1    O2    O3    O4
  O1    0     1.0   0.5   0
  O2    1.0   0     0.5   1.0
  O3    0.5   0.5   0     0.5
  O4    0     1.0   0.5   0
Dissimilarity of Binary Attributes

Consider two objects i and j described by binary attributes, e.g. in a medical diagnosis. Let
  q = the number of attributes that equal 1 for both objects i and j,
  r = the number of attributes that equal 1 for object i but equal 0 for object j,
  s = the number of attributes that equal 0 for object i but equal 1 for object j,
  t = the number of attributes that equal 0 for both objects i and j.

The total number of attributes is p = q + r + s + t.

Dissimilarity is defined as

d(i, j) = (r + s) / (q + r + s + t)   for the symmetric case   (11)

d(i, j) = (r + s) / (q + r + s)      for the asymmetric case   (12)
Dissimilarity of binary attributes - example

Consider an example.

  Name     Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Ramu     M       Y      N      P       N       N       N
  Prakash  M       Y      Y      N       N       N       N
  Jyoti    F       Y      N      P       N       P       N

The asymmetric attribute values are cast as P = 1, Y = 1 and N = 0. Counting over the six asymmetric attributes (Gender, being symmetric, is excluded):

d(Ramu, Prakash) = (1 + 1) / (1 + 1 + 1) = 2/3;   r = 1, s = 1, q = 1
d(Ramu, Jyoti) = (0 + 1) / (2 + 0 + 1) = 1/3;   r = 0, s = 1, q = 2   (13)
d(Prakash, Jyoti) = (1 + 2) / (1 + 1 + 2) = 3/4;   r = 1, s = 2, q = 1
Dissimilarity of Mixed-Type Attributes

Objects may be described by a mixture of the attribute types discussed above. Dissimilarity of mixed type is given by

d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f)   (14)

where the indicator δ_ij^(f) = 0 if (1) x_if or x_jf is missing, or (2) x_if = x_jf = 0 and f is an asymmetric binary attribute; else (3) δ_ij^(f) = 1.

The per-attribute dissimilarity d_ij^(f) is computed according to the type of f:
  numeric: d_ij^(f) = |x_if − x_jf| / (max_h x_hf − min_h x_hf), where h runs over all non-missing values of attribute f
  ordinal: compute the rank r_if and map it to a numeric value z_if = (r_if − 1) / (M_f − 1)
  nominal/binary: d_ij^(f) = 0 if x_if = x_jf, else 1
Example: Consider

  Fruit  Color   Sweetness   Weight
  F1     Orange  Very Sweet  25
  F2     Red     Sweet       20
  F3     Yellow  Sour        15
  F4     Orange  Very Sweet  30

In the above table max_h x_h = 30 and min_h x_h = 15, so (max_h x_h − min_h x_h) = 15.

For the numerical attribute (Weight) the dissimilarity matrix is

        O1                  O2                  O3                O4
  O1    0
  O2    |20−25|/15 = 0.33   0
  O3    |15−25|/15 = 0.66   |15−20|/15 = 0.33   0
  O4    |30−25|/15 = 0.33   |30−20|/15 = 0.66   |30−15|/15 = 1    0
Dissimilarity for the nominal attribute Color (0 if equal, else 1):

        O1   O2   O3   O4
  O1    0    1    1    0
  O2    1    0    1    1
  O3    1    1    0    1
  O4    0    1    1    0

Dissimilarity for the ordinal attribute Sweetness (Very Sweet, Sweet, Sour map to z = 1, 0.5, 0):

        O1    O2    O3    O4
  O1    0     0.5   1     0
  O2    0.5   0     0.5   0.5
  O3    1     0.5   0     1
  O4    0     0.5   1     0

With δ_ij^(f) = 1 for each of the three attributes, the combined dissimilarity matrix is

        O1                      O2                      O3                O4
  O1    0
  O2    (1+0.5+0.33)/3 = 0.61   0
  O3    (1+1+0.66)/3 = 0.89     (1+0.5+0.33)/3 = 0.61   0
  O4    (0+0+0.33)/3 = 0.11     (1+0.5+0.66)/3 = 0.72   (1+1+1)/3 = 1.0   0
Unbiased Estimates of Population Parameters

Definition
An unbiased estimate of a population parameter is a statistic calculated from a sample that, on average, tends to be neither consistently higher nor consistently lower than the true value of the parameter it estimates.

Population vs. Sample:
  Population: The entire group you study
  Sample: A subset drawn from the population
Estimate vs. Parameter:
  Estimate: Approximation of a parameter based on a sample
  Parameter: Numerical characteristic of the entire population
Biased vs. Unbiased:
  Biased: Systematically over- or under-estimates the parameter
  Unbiased: Its expected value equals the true parameter

Example
  Sample mean: Unbiased estimate of the population mean
  Sample variance (with divisor n − 1): Unbiased estimate of the population variance
  Sample proportion: Unbiased estimate of the population proportion

Unbiased estimates are crucial for reliable inferences about the population. Unbiasedness is not always achievable, especially with small samples.
Key Idea
The grand mean, calculated by averaging individual sample means, is an unbiased estimate of the true population mean.

  Population mean µ: True average of the entire population.
  Sample mean µ_i: Mean of a single sample drawn from the population.
  Multiple samples: Draw multiple random samples, get multiple µ_i values.
  Grand mean (µ̄): Average of all the individual sample means µ_1, µ_2, ...

Why the grand mean estimates µ:

E[µ̄] = µ   (the grand mean is an unbiased estimate)

Important Points:
  The grand mean is an estimate, not exact, due to sampling error.
  More samples lead to a more accurate estimate.
  The grand mean is one estimation method; others exist (e.g., confidence intervals).
Confidence Intervals

Since a sample mean µ_s may not exactly represent the true population mean µ_p, we construct a confidence interval (CI) around µ_s to capture the range of values where µ_p is likely to lie. This interval is associated with a confidence level of 1 − α, where α is the significance level.

  µ_p: True average of the entire population.
  µ_s: Average of a single sample drawn from the population.
  CI: Range of values likely to contain µ_p with a certain probability.
  Confidence level (1 − α): Probability that the interval captures µ_p in repeated sampling.
Confidence Interval:
A range of values that is likely to contain the true value of an unknown parameter. It is often expressed as an interval with an associated confidence level, indicating the probability that the interval will contain the true parameter value.
Example: A 95% confidence interval for the mean height of a population might be (150 cm, 160 cm), indicating that we are 95% confident that the true mean height falls within this range.

Test Statistic:
A numerical summary of a set of data used in a hypothesis test. It is calculated from sample data and is used to determine whether to reject the null hypothesis.
T-Test:
A t-test is a statistical test used to compare the means of two groups to determine if they are significantly different from each other. It is often used when the sample size is small and the population standard deviation is unknown.
Example: Comparing the mean scores of two groups of students who received different teaching methods.

Z-Test:
A z-test is a statistical test used to determine if there is a significant difference between sample and population means or between the means of two samples. It is often used when the sample size is large and the population standard deviation is known.
Hypothesis Testing

A hypothesis is a testable proposition or educated guess.
  It consists of a null hypothesis (H0) and an alternative hypothesis (H1 or Ha).
  The null hypothesis represents no effect or no difference.
  The alternative hypothesis contradicts the null hypothesis.
Example: Comparing the mean scores of two groups.

  Null Hypothesis (H0): There is no difference in the mean scores between Group A and Group B.
  Alternative Hypothesis (H1): There is a significant difference in the mean scores between Group A and Group B.
Two-Sided T-Test: Example on Heights of Girls

Consider a sample of n = 30 heights (cm); the recorded values include:
152, 158, 149, 162, 155, 157, 160, 150, 154, 151, 156, 153, 157, 148, 159, 155, 162

Step 1: Formulate Hypotheses
  H0: µ = 155
  H1: µ ≠ 155

Step 2: Choose Significance Level
  α = 0.05
Step 4: Determine the Critical Region
  Degrees of freedom (df): n − 1 = 29. For α/2 = 0.025 in a two-sided test with 29 df, t_critical ≈ ±2.045.

Step 5: Calculate the Test Statistic, where
  x̄ = mean of the sample
  s = standard deviation of the sample
  n = sample size
One-Sided T-Test: Example on Heights of Girls

Step 1: Formulate Hypotheses
  H0: µ ≤ 155
  H1: µ > 155

Step 2: Choose Significance Level
  α = 0.05

Step 3: Select the Test Statistic
  Test Statistic: t = (x̄ − µ0) / (s / √n)

Step 4: Determine the Critical Region
  Degrees of freedom (df): n − 1 = 29. For α = 0.05 in a one-sided test with 29 df, t_critical ≈ 1.699.
One-Sided T-Test: Example on Heights of Girls (Continued)

Step 5: Calculate the Test Statistic

x̄ = (1/30) · 1^T X ≈ 155.6

s = sqrt( (1/n) (X − 1x̄)^T · (X − 1x̄) ) ≈ 4.58

t = (x̄ − µ0) / (s / √n) ≈ (155.6 − 155) / (4.58 / √30) ≈ 0.72

Step 6: Make a Decision
Since t ≈ 0.72 < t_critical ≈ 1.699, we fail to reject H0: the sample does not provide evidence that the mean height exceeds 155 cm.
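A SciPy sketch of the same one-sided test (the list below stands in for the full sample of 30 heights):

import numpy as np
from scipy import stats

heights = np.array([152, 158, 149, 162, 155, 157, 160, 150, 154, 151,
                    156, 153, 157, 148, 159, 155, 162], dtype=float)

t_stat, p_value = stats.ttest_1samp(heights, popmean=155, alternative="greater")
print(t_stat, p_value)  # fail to reject H0 when p_value > 0.05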
Steps in Hypothesis Testing

1. State Hypotheses: Formulate the null hypothesis (H0) and the alternative hypothesis (H1).
2. Choose Significance Level (α): Select the level of significance to determine the threshold for rejecting H0.
3. Select Test Statistic: Choose an appropriate test statistic based on the data and the hypotheses.
4. Collect Data: Gather data from the sample or population under study.
5. Compute Test Statistic: Calculate the value of the test statistic from the data.
6. Make a Decision: Compare the test statistic against the critical value (or the p-value against α) and reject or fail to reject H0.
Z-Test Example: Television Viewing

Data suggests that the average youngster watches less television than seniors do. The population average is 30.0 hours per week, with a standard deviation of 4 hours. A sample of 65 youngsters has a mean of 25 hours. Is there enough evidence to support the claim at α = 0.01?

Step 1: State the Hypotheses
  Null Hypothesis (H0): The average youngster watches the same amount of television as the general population (µ = 30).
  Alternative Hypothesis (H1): The average youngster watches less television than the seniors (µ < 30).
The test statistic is

Z = (x̄ − µ) / (σ / √n)

Where:
  x̄ is the sample mean,
  µ is the population mean,
  σ is the population standard deviation, and
  n is the sample size.

Given:
  Sample mean (x̄) = 25 hours
  Population mean (µ) = 30 hours
  Population standard deviation (σ) = 4 hours
  Sample size (n) = 65
From the standard normal distribution table, the critical value for α = 0.01 is −2.33.

Step 5: Make a Decision. If the test statistic Z is less than the critical value (−2.33), we reject the null hypothesis; otherwise, we fail to reject it. Let's calculate the test statistic:

Z = (25 − 30) / (4 / √65) = −5 / 0.496 ≈ −10.08

Since −10.08 < −2.33, we reject the null hypothesis and conclude that youngsters watch significantly less television.
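The same computation in Python (values taken from the example above):

import math

x_bar, mu, sigma, n = 25.0, 30.0, 4.0, 65
z = (x_bar - mu) / (sigma / math.sqrt(n))
print(round(z, 2))  # ~ -10.08, well below the -2.33 critical value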
Statistical Inference

Statistical inference draws conclusions about a population from a sample. Example: If 75% of a sample of 100 individuals prefer Product A, we may infer that approximately 75% of the entire population has a preference for Product A.
Reporting vs Analysis

Reporting presents data as it is, without deeper exploration, typically providing a snapshot of information. Analysis goes further and extracts insight from the data.

Numerical Example:
Consider a dataset of monthly sales for a product over a year:
  Sales = [100, 120, 150, 130, 110, 90, 80, 100, 120, 140, 160, 180]
Analysis:
  Calculate monthly averages, identify seasonal trends, and forecast future sales based on historical patterns.
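A small NumPy sketch of such an analysis (the trend line is a simple least-squares fit):

import numpy as np

sales = np.array([100, 120, 150, 130, 110, 90, 80, 100, 120, 140, 160, 180])
months = np.arange(1, 13)

print(sales.mean())                # average monthly sales
slope, intercept = np.polyfit(months, sales, 1)
print(slope * 13 + intercept)      # naive trend-based forecast for month 13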
Python Packages for Data Science

Important packages applicable to data science include, but are not limited to:

Pandas: Widely used for large data sets.
  Data manipulation
  Analysis
  The DataFrame structure simplifies data handling, cleaning, and filtering.
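A minimal pandas illustration of these points (data is illustrative):

import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "age": [30, 25, 35], "income": [50000, 60000, 75000]})

print(df.describe())       # quick analysis of the numerical columns
print(df[df["age"] > 26])  # filtering rows with a boolean condition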
Choosing a Chart Type

Bar Chart: Suitable for displaying and comparing individual categories or groups.
Line Chart: Ideal for visualizing trends or patterns in data, especially over a continuous variable like time.
Scatter Plot: Effective for showing relationships or correlations between two variables.
Pie Chart: Good for illustrating the proportion of each category in a whole.
Histogram: Useful for displaying the distribution of a single continuous variable.
Box Plot: Helpful in visualizing the distribution and identifying outliers in a dataset.
Heatmap: Great for representing the correlation between multiple variables in a matrix format.
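A Matplotlib sketch of the three basic charts listed in the syllabus (data is illustrative):

import matplotlib.pyplot as plt

categories, counts = ["A", "B", "C"], [10, 24, 17]
xs, ys = [1, 2, 3, 4], [2, 3, 5, 4]

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].bar(categories, counts)  # bar chart: compare categories
axes[1].plot(xs, ys)             # line chart: trend over a continuous axis
axes[2].scatter(xs, ys)          # scatter plot: relationship between two variables
plt.show()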
Matplotlib: Widely used for visualization of data, before and after preprocessing, and for analytics. It also supports 3D plots:
  Scatter Plot in 3D: Visualizes individual data points in a 3D space.
  Line Plot in 3D: Connects data points with lines in a 3D space.
  Surface Plot: Visualizes a surface in 3D.
  Wireframe Plot: Represents a 3D surface with lines connecting the data points.
D
ht
Broadcasting: allows operations on arrays of different shapes and
sizes. Broadcasting automatically expands the smaller array to the
ig
shape of the larger array, making it easier to perform element-wise
r
py
operations.
Vectorized Operations: supports vectorized operations, allowing
Co
mathematical expressions to be applied element-wise on entire arrays,
which can significantly improve performance.
ft-
Indexing and Slicing: provides advanced indexing techniques,
ra
Linear Algebra: Provides routines for matrix operations, eigenvalue decomposition, and solving linear systems.
Random Number Generation: Includes functions for generating random numbers with different distributions, essential for simulations and statistical applications.
Integration with Other Libraries: A fundamental building block for many other scientific computing libraries in Python, such as SciPy, pandas, and scikit-learn.
Memory Efficiency: NumPy arrays are more memory-efficient than Python lists.
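A quick illustration of broadcasting and vectorized operations (values are illustrative):

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
col_means = X.mean(axis=0)  # shape (2,)

centered = X - col_means    # broadcasting: (3, 2) minus (2,) works element-wise
print(centered)
print(np.sqrt((centered ** 2).mean(axis=0)))  # vectorized column std, no loops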
Web Scraping Steps

1. Identify Target Website: Choose the website from which you want to extract data.
2. Inspect HTML Structure: Understand the structure of the HTML to locate data elements.
3. Use a Scraping Library: Utilize a scraping library like BeautifulSoup or Scrapy in Python.
4. HTTP Requests: Send HTTP requests to the website to retrieve HTML content, then parse it, as sketched below.
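A requests + BeautifulSoup sketch of these steps (the URL and the tag being extracted are placeholder assumptions):

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                # placeholder target website
html = requests.get(url, timeout=10).text  # step 4: retrieve the HTML content

soup = BeautifulSoup(html, "html.parser")  # parse the page
for heading in soup.find_all("h1"):        # locate elements found by inspection (step 2)
    print(heading.get_text(strip=True))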