Data Science - Unit - I

Instructor: Krishna Dutt, [email protected]

March 23, 2024

Disclaimer: The views expressed in this presentation are those of the author. Much open-source content is referenced, with all authors duly acknowledged.

Copyright: This beamer is for private, targeted circulation only. Content in this beamer should not be copied, in part or in full, without prior approval of the instructor.
Unit-I: Introduction to Data Science

Data Science
Data Objects and Attribute types
Measuring data similarity and dissimilarity
Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Data Discretization
Traits of Big Data
Hypothesis and inference
Analysis vs Reporting
Unit-I: Data Science Toolkits using Python

Matplotlib
NumPy
Scikit-learn
Visualizing Data: Bar Charts, Line Charts, Scatter Plots
Working with data
Reading Files
Scraping the Web
Using APIs (Example: Using the Twitter APIs)
Cleaning and Munging
Manipulating Data
Case Study: Getting Data - Twitter Data API
Characteristics of Big Data

Volume: Large amounts of data (terabytes to petabytes).
Velocity: Rapid generation and processing of data.
Variety: Different types of data (structured, unstructured).
Veracity: Reliability and trustworthiness of data.
Variability: Inconsistency in data flow.
Value: Extracting meaningful insights from data.
Complexity: Complex data structures and relationships.
Accessibility: Efficient access and utilization of data.


Data Pre-Processing - Tasks

Data pre-processing covers the essential steps required to make data accurate, complete, and consistent. Real-world data may be inaccurate, incomplete, and inconsistent for various reasons, and processing requires data that is accurate, complete for the intended purpose, and consistent. The following slides describe the important data quality requirements.


Data Issues

Real-world data suffers from many drawbacks; the major problems are:

Accuracy: Among the many reasons, inaccurate data (i.e., incorrect attribute values) can be due to:
- faulty selection or calibration of the instruments used;
- individuals or organizations purposely submitting incorrect values, e.g., for reasons of privacy;
- errors in data transmission due to technology limitations, like limited buffer sizes for coordinating synchronized data transfer.

Completeness: Attributes of interest may not always be available, such as customer information in sales transaction data. Some attributes may not be included because they were not considered important at the time of recording, and some values of a set of attributes may be missing. These make data incomplete; complete data suffers from none of these issues.

Consistency: Data spread across different sources may contain discrepancies, e.g., different department codes used to categorize the same items, or the same person's name or age recorded differently in different tables.
Data Pre-processing - Missing Value Handling and Noise Cleaning

Data preprocessing is crucial for ensuring the quality and reliability of analysis results. Two key aspects of preprocessing are handling missing values and cleaning noise.

Missing Value Handling: Missing values can arise due to reasons like data entry errors, sensor malfunctions, or intentional data omissions. Techniques for handling missing values include:
- Imputation: Replace missing values with a suitable estimate, such as the mean, median, or mode of the column.
- Deletion: Remove rows or columns with missing values, especially if they are a small proportion of the dataset.
- Advanced methods: Utilize techniques like K-nearest neighbors (KNN) imputation or matrix completion algorithms for more accurate estimation.
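A minimal pandas sketch of the first two techniques; the column names and values are illustrative, not taken from any dataset in these slides:

```python
import numpy as np
import pandas as pd

# Hypothetical patient records with gaps; names and values are made up.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 37, np.nan],
    "blood_pressure": [120, 118, np.nan, 130, 125],
})

# Imputation: replace missing blood-pressure readings with the column median.
df["blood_pressure"] = df["blood_pressure"].fillna(df["blood_pressure"].median())

# Deletion: drop the rows that still contain missing values (here, missing ages).
df = df.dropna()
print(df)
```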


Data Preprocessing - Data Cleaning - contd..

Noise Cleaning: Noise refers to irrelevant or random fluctuations in data that can obscure meaningful patterns. Techniques for noise cleaning include:
- Smoothing: Apply filters or moving averages to remove high-frequency noise while preserving the underlying trends.
- Outlier detection and removal: Identify and eliminate data points that deviate significantly from the rest of the dataset.
- Transformation: Transform the data using mathematical functions to reduce the impact of noise and improve interpretability.

Example: In a dataset of patient health records, missing values in the "Blood Pressure" column can be imputed using the median value of the column, while noise from sensor errors can be cleaned using smoothing techniques.
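A short sketch of smoothing and z-score outlier removal on a synthetic noisy signal (the data here is generated, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = pd.Series(np.sin(np.linspace(0, 6, 100)) + rng.normal(0, 0.3, 100))

# Smoothing: a centered 5-point moving average damps high-frequency noise.
smoothed = signal.rolling(window=5, center=True).mean()

# Outlier removal: discard points more than 3 standard deviations from the mean.
z_scores = (signal - signal.mean()) / signal.std()
cleaned = signal[z_scores.abs() <= 3]
print(len(signal), len(cleaned))
```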


Data Preprocessing - Data Integration

Data integration involves combining data from multiple sources into a unified dataset for analysis. This process aims to provide a comprehensive view of the underlying phenomenon and enable more robust analysis. Techniques for data integration include:
- Schema matching: Identify and reconcile differences in data schemas across sources to ensure consistency.
- Entity resolution: Resolve discrepancies in entity names or identifiers to avoid duplication and ensure accurate integration.
- Data fusion: Merge datasets based on common attributes or keys, taking into account data quality and reliability.

Example: Integrating sales data from different regions and time periods to analyze overall sales trends and identify patterns across diverse markets.


Data Reduction

Data reduction techniques aim to reduce the dimensionality or size of the dataset while preserving its essential information. This helps improve computational efficiency and mitigate the curse of dimensionality. Techniques for data reduction include:
- Principal Component Analysis (PCA): PCA identifies the principal components of variation in the data and represents the dataset in a lower-dimensional space while retaining most of its variance.
- Attribute Subset Selection: Select a subset of relevant features or attributes from the dataset based on criteria such as correlation, importance, or domain knowledge.
- Parametric Reduction: Utilize techniques like feature engineering or feature extraction to transform the dataset into a more compact representation.

Example: Applying PCA to reduce the dimensionality of a dataset containing features such as temperature, humidity, and air pressure to a smaller set of principal components capturing the most significant variation.
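A hedged sketch of PCA with scikit-learn; the three-feature matrix is random stand-in data, not real weather readings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data for three features
# (e.g., temperature, humidity, air pressure).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))

pca = PCA(n_components=2)              # keep the two strongest directions
X_reduced = pca.fit_transform(X)       # project the 3-D data onto them
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component
```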
Data Transformation

Data transformation involves converting raw data into a more suitable format for analysis or modeling. This may include standardizing or normalizing data, smoothing noisy signals, or discretizing continuous variables. Techniques for data transformation include:
- Standard Normalization: Standardize data to have zero mean and unit variance, making it more amenable to certain algorithms like linear regression or neural networks.
- Min-Max Normalization: Rescale data to a fixed range, typically [0, 1], to maintain relative differences while ensuring uniformity across features.
- Smoothing: Apply smoothing techniques such as moving averages or exponential smoothing to remove noise and reveal underlying trends or patterns.

Example: Normalizing features such as income, age, and education level to a common scale to avoid bias in a machine learning model based on Euclidean distances.
Data Discretization

Data discretization involves partitioning continuous variables into a finite number of intervals or bins. This can help simplify analysis, reduce complexity, and facilitate pattern discovery. Techniques for data discretization include:
- Binning: Divide the range of a continuous variable into equal-width or equal-frequency bins to group similar values together.
- Histogram Analysis: Construct histograms to visualize the distribution of continuous variables and identify natural breakpoints for discretization.

Example: Discretizing the age variable into age groups (e.g., 0-18, 19-35, 36-50, 51+) to analyze age demographics in a population dataset.


Data Matrix - Column & Row View

A data matrix is an ordered collection of n rows and m columns.

Data Matrix:

X = \begin{pmatrix} x_{1,1} & x_{1,2} & \dots & x_{1,j} & \dots & x_{1,m} \\ x_{2,1} & x_{2,2} & \dots & x_{2,j} & \dots & x_{2,m} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \dots & x_{n,j} & \dots & x_{n,m} \end{pmatrix}   (1)

Column and row views: Each column represents a feature/attribute/variable of the objects; each row depicts an object of m features. For example, the second column and the second row are

X_2 = (x_{1,2}, x_{2,2}, \dots, x_{n,2})^T,   O_2 = (x_{2,1}, x_{2,2}, \dots, x_{2,j}, \dots, x_{2,m})   (2)
Attributes/features/variables

Attributes: Each column in the data matrix is an attribute (feature, variable), e.g., Age, Gender, Salary, Degree.


Attributes/features/variables..contd

Database engineers historically use the word attribute, while statisticians call it a variable. ML/DL engineers use the word feature.


Qualitative and Quantitative Attributes

The following summarises the distinction between qualitative and quantitative attributes, with examples.

Qualitative Attributes: Qualitative attributes, also known as categorical or nominal attributes, represent characteristics or qualities that cannot be measured numerically. These attributes describe the quality or category of an object or entity and are often represented by labels or names.

Examples:
- Gender: Male, Female, Other
- Marital Status: Single, Married, Divorced
- Type of Vehicle: Sedan, SUV, Truck
- Colors: Red, Blue, Green


Qualitative and Quantitative Attributes..contd..

Quantitative Attributes: Quantitative attributes, also known as numerical attributes, represent measurable quantities that can be expressed numerically. These attributes are typically represented by numerical values and can be further categorized into discrete or continuous attributes.

Discrete: Discrete attributes can only take on specific, distinct values within a finite or countable range. They are often counts or integers. Examples:
- Number of Children in a Family (e.g., 0, 1, 2, 3)
- Number of Employees in a Company (e.g., 10, 20, 30)
- Number of Pets (e.g., 0, 1, 2, 3)

Continuous: Continuous attributes can take on any value within a range and are often measured using real numbers. Examples:
- Height of Individuals (e.g., 150 cm, 175 cm, 180 cm)
- Weight of Objects (e.g., 5.2 kg, 10.6 kg, 15.3 kg)
- Temperature (e.g., 25.5°C, 30.2°C, 35.1°C)
Qualitative and Quantitative Attributes

Attribute Type          | Definition                                                | Examples
Qualitative Attributes  | Characteristics or qualities that cannot be measured     | Gender: Male, Female, Other
                        | numerically. They describe the quality or category of    | Marital Status: Single, Married, Divorced
                        | an object or entity.                                     | Type of Vehicle: Sedan, SUV, Truck
                        |                                                           | Colors: Red, Blue, Green
Quantitative Attributes | Attributes representing quantities that can be expressed | Discrete: Number of Children in a Family,
                        | numerically. Can be discrete or continuous.              | Number of Employees in a Company, Number of Pets
                        |                                                           | Continuous: Height of Individuals,
                        |                                                           | Weight of Objects, Temperature


Attributes/features/variables.. contd

An object is a set of attributes, and this set can contain a mixture of attribute types, as stated previously. In the case of binary attributes, sometimes Yes (Y) is given more weight, e.g., has fever/infection: the absence of fever may be less important than having it. So binary attributes are further classified as symmetric and asymmetric.

Example:

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
P1   | M      | Y     | N     | P      | N      | N      | N
P2   | M      | Y     | Y     | N      | N      | N      | N
P3   | F      | Y     | N     | P      | N      | P      | N

In the above table we see binary attributes for Gender, Fever, Cough, and Test-1/2/3/4. Here Gender is symmetric as there can be no preference, but for the other attributes, having fever and cough or testing positive is more important than testing negative. Hence, these can be considered asymmetric.


Data Table vs Matrix

Generally, while using database systems, the data is shown as a table structure. In data science and ML/DL, data is represented as a matrix.

Question: Is a table the same as a matrix?
- In a matrix, as defined in linear algebra, the cells belong to R, the set of real numbers.
- Hence rows and columns are vectors in a vector space over R, and both statistical and algebraic techniques can be applied directly.
- However, in a table structure, each attribute can be of any type as described previously.
- Hence, non-numeric data needs to be mapped into numeric data before representing the table as a matrix.


Data Quality Metrics

Data Accuracy:
- Definition: Accuracy refers to how close the data values are to the true values or the actual state of the real-world phenomenon.
- Example: In a database of student ages, if a student's age is recorded as 25 instead of 23, it represents an inaccuracy in the data.

Data Consistency:
- Definition: Consistency ensures uniformity and coherence of data across various databases or data sources.
- Example: In a customer database, if the same customer is represented as "Raghava Ram" in one record and "Ram, Raghava" in another, it reflects inconsistency.

Data Completeness:
- Definition: Completeness measures the presence of all required data elements in a dataset without any missing values.
- Example: In a survey dataset, if some respondents' income information is missing, it indicates incomplete data.
Data Cleaning

Introduction to Data Cleaning:
- Data cleaning is a crucial step in the data preprocessing pipeline, aiming to improve data quality by addressing issues like errors, inconsistencies, and missing values.

Numerical Example:
- Consider a dataset with age values of individuals in which you observe entries with negative values or outliers.
- Data cleaning steps: remove entries with negative age values as they are erroneous; address outliers by either replacing them with a reasonable value or removing them.

Common Data Cleaning Techniques:
- Handling missing values, correcting inaccuracies, dealing with outliers, and ensuring consistency.
Data Integration

Introduction to Data Integration:
- Data integration involves combining data from different sources to provide a unified view, enabling better analysis and decision-making.

Example: Combining Customer Data
- Assume you have customer information in two datasets: one containing personal details and another with purchase history.
- Data integration steps: identify a common key, such as customer ID, in both datasets; merge the datasets on the common key to create a unified dataset with both personal details and purchase history, as in the sketch below.

Benefits of Data Integration:
- Enhances data quality, reduces redundancy, and provides a holistic view for more informed analysis.
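A minimal sketch of the merge step with pandas, assuming customer_id is the common key; the names and amounts are made up:

```python
import pandas as pd

# Hypothetical tables; customer_id is the assumed common key.
personal = pd.DataFrame({"customer_id": [1, 2, 3],
                         "name": ["Asha", "Ravi", "Meena"]})
purchases = pd.DataFrame({"customer_id": [1, 1, 3],
                          "amount": [250, 400, 120]})

# Merge on the shared key to obtain the unified dataset.
unified = personal.merge(purchases, on="customer_id", how="inner")
print(unified)
```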


Data Reduction

Introduction to Data Reduction:
- Data reduction aims to reduce the volume of data while producing the same or similar analytical results.
- It involves techniques to minimize the amount of data while preserving its integrity.

Example: Principal Component Analysis (PCA)
- Assume you have a dataset with many correlated variables.
- PCA can be applied to transform the dataset into a new set of uncorrelated variables (principal components) while retaining most of the original information.
- This reduces dimensionality and simplifies subsequent analysis.

Benefits of Data Reduction:
- Improves efficiency in processing and analysis.
- Removes redundant or irrelevant features, making the dataset more manageable.
Data Reduction - Numerosity Reduction

Numerosity reduction involves reducing the number of data instances while preserving essential information. This can be achieved through techniques like sampling and aggregation.
- Random Sampling: Selecting a random subset of the data points to represent the entire dataset. This is useful when dealing with large datasets.
- Stratified Sampling: Dividing the dataset into strata and then applying random sampling within each stratum. It ensures representation from different subgroups.
- Aggregation: Combining multiple data points into a single representative point, for example, averaging values within a time interval.


Data Reduction - Numerosity - Numerical Example

Consider a dataset with individual daily sales data. To reduce numerosity, we can aggregate this data into weekly sales totals.

Day   | Sales        Week   | Weekly Sales
Day 1 | 100          Week 1 | 370
Day 2 | 150
Day 3 | 120
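A small pandas sketch of the same aggregation, assuming the three daily figures fall in one calendar week:

```python
import pandas as pd

daily = pd.Series([100, 150, 120],
                  index=pd.date_range("2024-01-01", periods=3, freq="D"))

# Resample daily figures into weekly totals: one row replaces three.
weekly = daily.resample("W").sum()
print(weekly)   # 370, matching the table above
```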


Data Transformation

Introduction to Data Transformation:
- Data transformation involves converting data into a suitable format for analysis.
- It may include normalization, standardization, encoding, or other techniques.

Numerical Example - Normalization (Min-Max):
- Original Data: [2, 5, 10, 7]
- Normalized Data: [0, 0.375, 1, 0.625]
- Formula: x_normalized = (x - min(x)) / (max(x) - min(x))


Data Transformation (Contd.)

Numerical Example - Standardization:
- Original Data: [10, 15, 20, 25]
- Standardized Data: [-1.34, -0.45, 0.45, 1.34]
- Formula: x_standardized = (x - mean(x)) / std(x)

Benefits of Data Transformation:
- Improves the performance and accuracy of machine learning algorithms.
- Ensures that data is in a consistent and usable format for analysis.
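A quick NumPy check of both worked examples above (np.std defaults to the population standard deviation, matching the formula):

```python
import numpy as np

x = np.array([2, 5, 10, 7], dtype=float)
minmax = (x - x.min()) / (x.max() - x.min())
print(minmax)                        # [0.    0.375 1.    0.625]

y = np.array([10, 15, 20, 25], dtype=float)
zscores = (y - y.mean()) / y.std()   # population std
print(zscores.round(2))              # [-1.34 -0.45  0.45  1.34]
```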


Data Discretization

Introduction to Data Discretization:
- Data discretization involves converting continuous data into discrete categories or bins.
- It simplifies the data and makes it easier to analyze or apply certain algorithms.

Example 1 - Age Discretization:
- Original Age Data: [25, 30, 35, 40, 45]
- Discretized Age Categories: [Young, Young Adult, Adult, Middle-Aged, Senior]
- Criteria: [0-30, 31-35, 36-40, 41-45, 46+]

Example 2 - Income Discretization:
- Original Income Data: [30000, 50000, 70000, 90000, 110000]
- Discretized Income Categories: [Low, Medium, High]
- Criteria: [0-50000, 50001-90000, 90001+]
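A possible pandas sketch of the age discretization, using pd.cut with bin edges implied by the criteria above (right edges inclusive):

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45])
groups = pd.cut(ages,
                bins=[0, 30, 35, 40, 45, 120],   # right edges inclusive
                labels=["Young", "Young Adult", "Adult", "Middle-Aged", "Senior"])
print(groups.tolist())
# ['Young', 'Young', 'Young Adult', 'Adult', 'Middle-Aged']
```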
Data Normalization

Standardization: For each feature, subtract the mean and divide by the standard deviation:

Standardized Feature = (X - µ) / σ

This scales the data to have a mean of 0 and a standard deviation of 1. Worked example:

X = (1, 2, 3)^T   (3)

µ = (1/3) 1^T X = (1/3)(1 + 2 + 3) = 2   (4)

X - µ·1 = (1 - 2, 2 - 2, 3 - 2)^T = (-1, 0, 1)^T   (5)

σ = sqrt((1/N) (X - µ·1)^T (X - µ·1))   (6)
  = sqrt((1/3)((-1)^2 + 0^2 + 1^2)) = sqrt(2/3)   (7)

Standardized X = (1/σ)(X - µ·1) = (1/sqrt(2/3)) (-1, 0, 1)^T ≈ (-1.22, 0, 1.22)^T   (8)


Data Normalization

Min-Max Scaling: Scale the data within a specific range (e.g., [0, 1]):

Scaled Feature = (X - min(X)) / (max(X) - min(X))


Example of Data Normalization

Consider the data table below. The attributes are all numerical but have different ranges, which can bias methods toward attributes with larger values. Normalization mitigates this bias.

Name | Age | Income | Expenses
A    | 30  | 50000  | 2000
B    | 25  | 60000  | 2500
C    | 35  | 75000  | 3000

Standardized (z-scores, population standard deviation):

Name | Age    | Income | Expenses
A    | 0      | -1.135 | -1.224
B    | -1.224 | -0.162 | 0
C    | 1.224  | 1.298  | 1.224

Min-max scaled:

Name | Age | Income | Expenses
A    | 0.5 | 0      | 0
B    | 0   | 0.4    | 0.5
C    | 1   | 1      | 1
Statistical Measures

If a column or set of columns in a data matrix representing features are numerical, statistical metrics can be defined.

Column-based statistical metrics: consider numerical columns X and Y in a data matrix, and let 1 be a vector of ones of the same size as X. Then:

Mean: X̄ = (1/n) 1^T · X

Variance: σ² = (1/n) (X - X̄·1)^T · (X - X̄·1)

Standard Deviation: σ = sqrt((1/n) (X - X̄·1)^T · (X - X̄·1))

Covariance: Cov(X, Y) = (1/n) (X - X̄·1)^T · (Y - Ȳ·1)
Numerical Example: Mean/Standard Deviation/Covariance

Given the data vector X = (4, 6, 8)^T, the mean X̄ in vector form is:

X̄ = (1/3) 1^T · X = (1/3)(4 + 6 + 8) = 18/3 = 6

The standard deviation σ in vector form is:

σ = sqrt((1/3) (X - X̄·1)^T · (X - X̄·1))
  = sqrt((1/3)((-2)^2 + 0^2 + 2^2)) = sqrt(8/3)
Relationship between two features

Given data vectors X = (1, 2, 3)^T and Y = (2, 3, 4)^T, the covariance in vector form is:

Cov(X, Y) = (1/3) (X - X̄·1)^T · (Y - Ȳ·1)
          = (1/3)((-1)(-1) + 0·0 + 1·1) = 2/3
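A NumPy sketch verifying the worked examples above with the same 1/n (population) formulas:

```python
import numpy as np

X = np.array([4.0, 6.0, 8.0])
print(X.mean())                              # 6.0
print(np.sqrt(((X - X.mean()) ** 2).mean())) # sqrt(8/3) ≈ 1.633

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 4.0])
# Population covariance (divide by n), matching the formula above.
print(np.mean((x - x.mean()) * (y - y.mean())))  # 2/3 ≈ 0.667
```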


Relationship between all features - example

Consider a data matrix X with M rows (objects) and N columns (features). The covariance matrix of the features is

Σ = (1/M) (X - 1·X̄^T)^T · (X - 1·X̄^T)

Take a numerical example with a 2×3 data matrix:

X = ( 1 2 3
      4 5 6 )

The mean X̄ for each column is:

X̄ = (1/2) ((1+4), (2+5), (3+6))^T = (2.5, 3.5, 4.5)^T


Relationship between all features

Centering each column and substituting into the covariance matrix formula:

X - 1·X̄^T = ( -1.5 -1.5 -1.5
               1.5  1.5  1.5 )

Σ = (1/2) (X - 1·X̄^T)^T · (X - 1·X̄^T) = ( 2.25 2.25 2.25
                                            2.25 2.25 2.25
                                            2.25 2.25 2.25 )

Note: The covariance matrix Σ is symmetric, and its diagonal elements, the variances of the individual features, are always non-negative.


Covariance vs Correlation

Covariance:
- Measures the degree of joint variability between two variables.
- Formula: cov(X, Y) = (1/n) Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ)
- The unit of measurement is the product of the units of the two variables.
- The scale is not standardized, making it difficult to compare across different datasets.


Correlation:
- A standardized measure that indicates the strength and direction of a linear relationship between two variables.
- Formula: corr(X, Y) = cov(X, Y) / (σ_X σ_Y)
- Ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
- Unitless, making it easier to compare across different datasets.

Key Differences:
- Covariance can take any value, while correlation is normalized between -1 and 1.
- Correlation is a more interpretable measure of linear association.
- Correlation is not affected by the scale of the variables, as the sketch below illustrates.
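A small NumPy demonstration of the last point; the data values are arbitrary:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

cov = np.cov(x, y)[0, 1]           # sample covariance
corr = np.corrcoef(x, y)[0, 1]     # correlation, always within [-1, 1]

# Rescaling x by 100 multiplies the covariance by 100 ...
print(np.cov(100 * x, y)[0, 1] / cov)        # 100.0
# ... but leaves the correlation unchanged (up to floating point).
print(np.corrcoef(100 * x, y)[0, 1] - corr)  # ~0.0
```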


Similarity and Dissimilarity Measures

Euclidean Distance:

Euclidean Distance(X, Y) = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )

Vector form: sqrt((X - Y)^T · (X - Y))

Manhattan Distance:

Manhattan Distance(X, Y) = Σ_{i=1}^{n} |x_i - y_i|

Vector form: 1^T · |X - Y|


Cosine Similarity:

Cosine Similarity(X, Y) = X^T · Y / sqrt((X^T · X)(Y^T · Y))

Vector form: X^T · Y / (‖X‖ ‖Y‖)

Example: X = (1, 2, 3)^T, Y = (2, 3, 4)^T gives

Cosine Similarity(X, Y) = 20 / sqrt(406)

Jaccard Similarity (for binary vectors):

Jaccard Similarity(X, Y) = |X ∩ Y| / |X ∪ Y| = X^T · Y / (X^T · X + Y^T · Y - X^T · Y)
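A sketch of these metrics with scipy.spatial.distance; note that SciPy's cosine and jaccard functions return dissimilarities, so similarity is 1 minus the returned value:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print(distance.euclidean(x, y))    # sqrt(27) ≈ 5.196
print(distance.cityblock(x, y))    # Manhattan distance = 9
print(1 - distance.cosine(x, y))   # cosine similarity

# Jaccard on binary vectors; SciPy returns a dissimilarity.
a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
print(1 - distance.jaccard(a, b))  # Jaccard similarity = 2/4 = 0.5
```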
Comparison of Similarity Metrics

Similarity Metric   | Use Cases
Euclidean Distance  | Numerical data, continuous features
Cosine Similarity   | Text data, high-dimensional sparse vectors
Jaccard Similarity  | Sets, binary data, categorical features
Pearson Correlation | Linear relationships, continuous data
Hamming Distance    | Binary strings, categorical data


Cosine Similarity for Text Data

- Cosine similarity is particularly useful for comparing text data represented as high-dimensional sparse vectors.
- In text analysis, documents are often represented as vectors where each dimension corresponds to a unique word in the vocabulary, and the value of each dimension is the frequency or TF-IDF score of the word in the document.
- Cosine similarity measures the cosine of the angle between two vectors, which reflects their similarity in direction regardless of magnitude.
- For text data, cosine similarity is used to calculate the similarity between documents or text passages based on their word frequencies or TF-IDF scores.
- Documents with similar content have vectors pointing in similar directions, resulting in a higher cosine similarity score.
- Cosine similarity is widely used in information retrieval, document clustering, recommendation systems, and natural language processing tasks.
Cosine Similarity: Numerical Example

Consider two documents, each represented by a vector of word frequencies [freq(apple), freq(banana), freq(orange), freq(grape)]:

Document 1: D1 = [3, 2, 1, 0]
Document 2: D2 = [2, 1, 0, 3]

The cosine similarity between D1 and D2 is:

similarity(D1, D2) = D1 · D2 / (‖D1‖ ‖D2‖)
                   = (3×2 + 2×1 + 1×0 + 0×3) / (sqrt(3² + 2² + 1²) × sqrt(2² + 1² + 3²))
                   = (6 + 2) / (sqrt(14) × sqrt(14))
                   = 8/14 ≈ 0.571

Thus, the cosine similarity between the two documents is approximately 0.571.
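A NumPy check of this calculation:

```python
import numpy as np

d1 = np.array([3, 2, 1, 0])   # [apple, banana, orange, grape] counts, document 1
d2 = np.array([2, 1, 0, 3])   # document 2

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 3))          # 0.571, matching the hand calculation
```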
Numerical Example: Cosine Similarity

Given vectors X = (1, 2, 3)^T and Y = (2, 3, 4)^T, the cosine similarity in vector form is:

Cosine Similarity = X^T · Y / (‖X‖ ‖Y‖)
                  = (1·2 + 2·3 + 3·4) / (sqrt(1² + 2² + 3²) × sqrt(2² + 3² + 4²))
                  = 20 / (sqrt(14) × sqrt(29)) = 20/sqrt(406) ≈ 0.993


Numerical Example: Euclidean Distance in Vector Form

Given vectors X = (1, 2, 3)^T and Y = (4, 5, 6)^T, the Euclidean distance in vector form is:

Euclidean Distance = sqrt((X - Y)^T · (X - Y))
                   = sqrt((-3)² + (-3)² + (-3)²) = sqrt(27) ≈ 5.196


Numerical Example: Manhattan Distance in Vector Form

Given vectors X = (1, 2, 3)^T and Y = (4, 5, 6)^T, the Manhattan distance in vector form is:

Manhattan Distance = 1^T · |X - Y|
                   = (1, 1, 1) · (|1-4|, |2-5|, |3-6|)^T = 3 + 3 + 3 = 9


Dissimilarity of Nominal Attributes

The following shows the original data table, the transformed data table, and the dissimilarity table. Let us map the nominal attribute values to integers arbitrarily, per attribute, as shown.

Object | Color | Shape    | Size
1      | Red   | Circle   | Small
2      | Blue  | Triangle | Medium
3      | Green | Square   | Large

Object | Color | Shape | Size
1      | 1     | 2     | 3
2      | 2     | 1     | 2
3      | 3     | 3     | 1

Euclidean distances on the coded values:

   | O1 | O2 | O3
O1 | 0  | √3 | 3
O2 | √3 | 0  | √6
O3 | 3  | √6 | 0

Although all three objects differ in every attribute, the distances vary with the arbitrary integer coding (e.g., O1 appears farther from O3 than from O2). This shows that plain integer coding is unsuitable for measuring dissimilarity of nominal attributes.




Dissimilarity of Nominal Attributes

Given the unsuitability of integer coding for nominal attributes, the following measure is defined for their dissimilarity:

d(i, j) = m / p,   (9)

where m is the count of differing attributes and p is the total number of attributes. For the above table p = 3, and the dissimilarity table becomes

   | O1 | O2 | O3
O1 | 0  | 1  | 1
O2 | 1  | 0  | 1
O3 | 1  | 1  | 0

From the above it can be seen that all three objects are completely dissimilar.


Dissimilarity of Ordinal Attributes

Ordinal attributes are ranked, as they show an order in qualifying real-world attributes, e.g., Excellent, Very Good, Good, Fair, each better than the next. Hence they can be transformed into numerical values by ranking them. Dissimilarity is obtained in steps, as follows:

Step 1: The value of an ordinal attribute in any object is mapped to an integer rank r ∈ {1, ..., M}, where M is the number of ranks for that attribute.

Step 2: Each rank r is mapped to a number in [0.0, 1.0] by the formula

Z = (r - 1) / (M - 1)   (10)

This process maps an ordinal attribute to a numerical type, and dissimilarity can then be obtained using any dissimilarity metric used for numerical attributes, like Euclidean, Manhattan, etc.


Dissimilarity of Ordinal Attributes - Example

Let us consider performance as an attribute and rank it as Excellent = 3, Good = 2, and Fair = 1. The following shows the mapping of the ordinal values to numerical values and the dissimilarity matrix obtained with the Manhattan distance on the transformed values.

Object | Performance     | Numerical value
1      | Excellent (r=3) | (3-1)/(3-1) = 1
2      | Fair (r=1)      | (1-1)/(3-1) = 0
3      | Good (r=2)      | (2-1)/(3-1) = 0.5
4      | Excellent (r=3) | (3-1)/(3-1) = 1

   | O1  | O2  | O3  | O4
O1 | 0   | 1.0 | 0.5 | 0
O2 | 1.0 | 0   | 0.5 | 1.0
O3 | 0.5 | 0.5 | 0   | 0.5
O4 | 0   | 1.0 | 0.5 | 0


Dissimilarity of Binary Attributes

Consider a table with both symmetric and asymmetric binary variables. Gender is symmetric, and all other attributes are asymmetric: in tests, testing positive is more important than testing negative for diagnosis. Define

q = number of attributes that equal 1 for both objects i and j,
r = number of attributes that equal 1 for object i but 0 for object j,
s = number of attributes that equal 0 for object i but 1 for object j,
t = number of attributes that equal 0 for both objects i and j.

The total number of attributes is p = q + r + s + t. Dissimilarity is defined as

d(i, j) = (r + s) / (q + r + s + t)   for the symmetric case   (11)
d(i, j) = (r + s) / (q + r + s)       for the asymmetric case   (12)
Dissimilarity of Binary Attributes - Example

Consider an example.

Name    | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Ramu    | M      | Y     | N     | P      | N      | N      | N
Prakash | M      | Y     | Y     | N      | N      | N      | N
Jyoti   | F      | Y     | N     | P      | N      | P      | N

The asymmetric values are cast as Y = P = 1 and N = 0, and Gender is omitted as a symmetric attribute. Then

d(Ramu, Prakash)  = (1 + 1) / (1 + 1 + 1) = 2/3;   q = 1, r = 1, s = 1
d(Ramu, Jyoti)    = (0 + 1) / (2 + 0 + 1) = 1/3;   q = 2, r = 0, s = 1   (13)
d(Prakash, Jyoti) = (1 + 2) / (1 + 1 + 2) = 3/4;   q = 1, r = 1, s = 2
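A minimal sketch of the asymmetric measure as a helper function; the 0/1 encodings follow the table above, with Gender omitted as symmetric:

```python
import numpy as np

def asym_binary_dissim(i, j):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s)."""
    i, j = np.asarray(i), np.asarray(j)
    q = np.sum((i == 1) & (j == 1))
    r = np.sum((i == 1) & (j == 0))
    s = np.sum((i == 0) & (j == 1))
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1..Test-4 encoded with Y/P = 1 and N = 0.
ramu    = [1, 0, 1, 0, 0, 0]
prakash = [1, 1, 0, 0, 0, 0]
jyoti   = [1, 0, 1, 0, 1, 0]

print(asym_binary_dissim(ramu, prakash))   # 2/3
print(asym_binary_dissim(ramu, jyoti))     # 1/3
print(asym_binary_dissim(prakash, jyoti))  # 3/4
```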


Dissimilarity - Mixed-Type Attributes

Previously we studied the dissimilarity of different types of attributes separately. However, a data table can have a combination of all the types discussed above. Dissimilarity of mixed-type objects is given by

d(i, j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f)   (14)

where the indicator δ_ij^(f) = 0 if (1) x_if or x_jf is missing, or (2) x_if = x_jf = 0 for an asymmetric binary attribute; otherwise δ_ij^(f) = 1. The per-attribute contribution d_ij^(f) depends on the attribute type:
- Numeric: d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf), where h runs over all objects with non-missing values of attribute f.
- Nominal or binary: d_ij^(f) = 0 if x_if = x_jf, else 1.
- Ordinal: compute the rank r_if and map it to a numeric value z_if = (r_if - 1) / (M_f - 1), then treat it as numeric.


Numerical Example of a Mixed-Attribute Table

Consider:

Fruit | Color  | Sweetness  | Weight
F1    | Orange | Very Sweet | 25
F2    | Red    | Sweet      | 20
F3    | Yellow | Sour       | 15
F4    | Orange | Very Sweet | 30

For the Weight attribute, max_h x_h = 30 and min_h x_h = 15, so (max_h x_h - min_h x_h) = 15. The dissimilarity matrix for the numerical attribute (using absolute differences) is:

   | O1                | O2                | O3              | O4
O1 | 0                 |                   |                 |
O2 | |20-25|/15 = 0.33 | 0                 |                 |
O3 | |15-25|/15 = 0.66 | |15-20|/15 = 0.33 | 0               |
O4 | |30-25|/15 = 0.33 | |30-20|/15 = 0.66 | |30-15|/15 = 1  | 0


Dissimilarity for the nominal attribute Color:

   | O1 | O2 | O3 | O4
O1 | 0  | 1  | 1  | 0
O2 | 1  | 0  | 1  | 1
O3 | 1  | 1  | 0  | 1
O4 | 0  | 1  | 1  | 0

Dissimilarity for the ordinal attribute Sweetness (Very Sweet = 3 → 1, Sweet = 2 → 0.5, Sour = 1 → 0):

   | O1  | O2  | O3  | O4
O1 | 0   | 0.5 | 1   | 0
O2 | 0.5 | 0   | 0.5 | 0.5
O3 | 1   | 0.5 | 0   | 1
O4 | 0   | 0.5 | 1   | 0

With δ_ij^(f) = 1 for each of the three attributes, the combined dissimilarity matrix is:

   | O1                    | O2                    | O3            | O4
O1 | 0                     |                       |               |
O2 | (1+0.5+0.33)/3 = 0.61 | 0                     |               |
O3 | (1+1+0.66)/3 = 0.89   | (1+0.5+0.33)/3 = 0.61 | 0             |
O4 | (0+0+0.33)/3 = 0.11   | (1+0.5+0.66)/3 = 0.72 | (1+1+1)/3 = 1 | 0
Unbiased Estimates of Population Parameters

Definition: An unbiased estimate of a population parameter is a statistic calculated from a sample that, on average, tends to be neither consistently higher nor consistently lower than the true value of the parameter it estimates.

Population vs. Sample:
- Population: the entire group you study.
- Sample: a subset drawn from the population.

Estimate vs. Parameter:
- Estimate: approximation of a parameter based on a sample.
- Parameter: numerical characteristic of the entire population.

Biased vs. Unbiased:
- Bias: a systematic tendency to over- or underestimate the true value.
- Unbiased: no such tendency; on average, as close as possible to the true value.


Unbiased Estimates of Population Parameters..example

Example:
- Sample mean: an unbiased estimate of the population mean.
- Sample variance (computed with divisor n - 1 instead of n): an unbiased estimate of the population variance.
- Sample proportion: an unbiased estimate of the population proportion.

Unbiased estimates are crucial for reliable inferences about the population; unbiasedness is not always achievable, especially with small samples.
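A small simulation sketch of the bias on synthetic data: the n-divisor variance underestimates the true variance on average, while the (n-1)-divisor version does not:

```python
import numpy as np

rng = np.random.default_rng(42)

biased, unbiased = [], []
for _ in range(5000):
    sample = rng.normal(0.0, 1.0, size=10)   # true variance = 1
    biased.append(np.var(sample))            # divides by n
    unbiased.append(np.var(sample, ddof=1))  # divides by n - 1

print(np.mean(biased))    # systematically below 1 (about 0.9)
print(np.mean(unbiased))  # close to 1 on average
```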


Grand Mean as an Estimate of Population Mean

Key Idea: The grand mean, calculated by averaging individual sample means, is an unbiased estimate of the true population mean.

- Population mean µ: the true average of the entire population.
- Sample mean µ_i: the mean of a single sample drawn from the population.
- Multiple samples: draw multiple random samples to get multiple µ_i values.
- Grand mean µ̄: the average of all the individual sample means µ_1, µ_2, ...

Why the grand mean estimates µ:
- Law of Large Numbers: as the number of samples increases, the grand mean gets closer to µ.
- Individual sampling errors tend to cancel each other out in the average.
Grand Mean as an Estimate of Population Mean.. contd

E[µ̄] = µ   (the grand mean is an unbiased estimate)

Important points:
- The grand mean is an estimate, not exact, due to sampling error.
- More samples lead to a more accurate estimate.
- The grand mean is one estimation method; others exist (e.g., confidence intervals).
- Random sampling is crucial for unbiasedness.
- The variance of the grand mean reflects the estimate's accuracy.


Confidence Interval and Confidence Level

Key Idea: Since the sample mean µ_s does not perfectly represent the true population mean µ_p, we construct a confidence interval (CI) around µ_s to capture the range of values where µ_p is likely to lie. This interval is associated with a confidence level 1 - α, where α is the significance level.

- µ_p: the true average of the entire population.
- µ_s: the average of a single sample drawn from the population.
- CI: the range of values likely to contain µ_p with a certain probability.
- Confidence level (1 - α): the probability that the interval captures µ_p under repeated sampling.
- Significance level (α): the probability of incorrectly rejecting the null hypothesis (µ_p falls outside the interval).

Example: With a 95% confidence level (1 - α = 0.95) and 5% significance level (α = 0.05), we are 95% confident that the interval contains the true population mean.
Hypothesis - Terminology

Confidence Interval: A range of values that is likely to contain the true value of an unknown parameter. It is often expressed as an interval with an associated confidence level, indicating the probability that the interval will contain the true parameter value.
Example: A 95% confidence interval for the mean height of a population might be (150 cm, 160 cm), indicating that we are 95% confident that the true mean height falls within this range.

Test Statistic: A numerical summary of a set of data used in a hypothesis test. It is calculated from sample data and is used to determine whether to reject the null hypothesis.
Example: In a t-test, the t-statistic measures the difference between the sample mean and the population mean in terms of the sample's standard error.


Hypothesis - Terminology....contd..

T-Test: A statistical test used to compare the means of two groups to determine if they are significantly different from each other. It is often used when the sample size is small and the population standard deviation is unknown.
Example: Comparing the mean scores of two groups of students who received different teaching methods.

Z-Test: A statistical test used to determine if there is a significant difference between sample and population means, or between the means of two samples. It is often used when the sample size is large and the population standard deviation is known.
Example: Testing whether the mean weight of a sample of products differs significantly from the population mean weight.


Hypothesis

- A hypothesis is a testable proposition or educated guess.
- It consists of a null hypothesis (H0) and an alternative hypothesis (H1 or Ha).
- The null hypothesis represents no effect or no difference.
- The alternative hypothesis contradicts the null hypothesis.


Example Hypothesis

- Null Hypothesis (H0): There is no difference in the mean scores between Group A and Group B.
- Alternative Hypothesis (H1): There is a significant difference in the mean scores between Group A and Group B.


Hypothesis Testing: Example on Heights of Girls

Data: Heights of 30 girls in a region (in cm):
152, 158, 149, 162, 155, 157, 160, 150, 154, 151, 156, 153, 157, 148, 159, 155, 162, ...

Step 1: Formulate Hypotheses
H0: µ = 155
H1: µ ≠ 155

Step 2: Choose Significance Level
α = 0.05

Step 3: Select the Test Statistic
t = (x̄ - µ0) / (s / √n)


Hypothesis Testing: Example on Heights of Girls.contd..

Step 4: Determine the Critical Region
Degrees of freedom: df = n - 1 = 29. For α/2 = 0.025 in a two-sided test with 29 df, t_critical ≈ ±2.045.

Step 5: Calculate the Test Statistic
x̄ = mean of the sample, s = standard deviation of the sample, n = sample size.

Step 6: Make a Decision
If |t| > t_critical, reject H0.

Step 7: Draw a Conclusion
State the decision in the context of the problem.


One-Sided T-Test: Example on Heights of Girls

Data: Heights of 30 girls in a region (in cm):
152, 158, 149, 162, 155, 157, 160, 150, 154, 151, 156, 153, 157, 148, 159, 155, 162, ...

Step 1: Formulate Hypotheses
H0: µ ≤ 155
H1: µ > 155

Step 2: Choose Significance Level
α = 0.05

Step 3: Select the Test Statistic
t = (x̄ - µ0) / (s / √n)

Step 4: Determine the Critical Region
Degrees of freedom: df = n - 1 = 29. For α = 0.05 in a one-sided test with 29 df, t_critical ≈ 1.699.
One-Sided T-Test: Example on Heights of Girls (Continued)

Step 5: Calculate the Test Statistic

x̄ = (1/30) 1^T X ≈ 155.6

s = sqrt((1/n) (X - x̄·1)^T · (X - x̄·1)) ≈ 4.58

Step 6: Make a Decision

t = (x̄ - µ0) / (s / √n) ≈ (155.6 - 155) / (4.58 / √30) ≈ 0.72

Since t < t_critical (1.699), we do not reject H0.

Step 7: Draw a Conclusion
At the 5% significance level, there is not enough evidence to reject the null hypothesis; we cannot claim that the mean height of girls exceeds 155 cm.
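A sketch of the same one-sided test with SciPy (scipy >= 1.6 for the alternative argument). Only the first 17 heights appear on the slide, so the remaining 13 values here are hypothetical stand-ins to make the example runnable end to end:

```python
import numpy as np
from scipy import stats

# First 17 values from the slide; the last 13 are hypothetical stand-ins.
heights = np.array([152, 158, 149, 162, 155, 157, 160, 150, 154, 151,
                    156, 153, 157, 148, 159, 155, 162,
                    154, 156, 158, 153, 157, 151, 160, 155, 152, 159,
                    154, 156, 158])

# One-sided test of H0: mu <= 155 against H1: mu > 155.
result = stats.ttest_1samp(heights, popmean=155, alternative="greater")
print(result.statistic, result.pvalue)  # reject H0 only if pvalue < 0.05
```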
Steps in Hypothesis Testing

Hypothesis testing involves several steps to determine the validity of a statistical hypothesis.
1. Formulate Hypotheses: Define the null hypothesis (H0) and alternative hypothesis (H1).
2. Choose Significance Level (α): Select the level of significance to determine the threshold for rejecting H0.
3. Select Test Statistic: Choose an appropriate test statistic based on the type of data and hypothesis being tested.
4. Collect Data: Gather data from the sample or population under study.
5. Compute Test Statistic: Calculate the value of the test statistic using the collected data.
6. Determine Critical Region: Determine the critical region based on the chosen significance level and test statistic.
7. Make Decision: Compare the test statistic to the critical region and decide whether to reject H0 or not.
8. Draw Conclusion: Based on the decision, draw conclusions about the population in the context of the problem.
Hypothesis - Testing Example

A study shows that the average youngster watches less television than seniors. The population average is 30.0 hours per week, with a standard deviation of 4 hours. A sample of 65 youngsters has a mean of 25 hours. Is there enough evidence to support the claim at α = 0.01?

Step 1: State the Hypotheses
- Null Hypothesis (H0): The average youngster watches the same amount of television as the general public (µ = 30).
- Alternative Hypothesis (H1): The average youngster watches less television than seniors (µ < 30).

Step 2: Significance Level
Given: α = 0.01


Hypothesis Testing - Example contd..

Step 3: Calculate the Test Statistic
We use the Z-test statistic:

Z = (x̄ - µ) / (σ / √n)

where x̄ is the sample mean, µ is the population mean, σ is the population standard deviation, and n is the sample size.

Given:
- Sample mean x̄ = 25 hours
- Population mean µ = 30 hours
- Population standard deviation σ = 4 hours
- Sample size n = 65

Substituting the values:

Z = (25 - 30) / (4 / √65)
Hypothesis Testing - Example contd..

Step 4: Determine the Critical Value
Since the alternative hypothesis is one-tailed, find the critical value corresponding to the given significance level (α = 0.01). From a standard normal distribution table, the critical value for α = 0.01 is -2.33.

Step 5: Make a Decision
If the test statistic Z is less than the critical value (-2.33), we reject the null hypothesis; otherwise, we fail to reject it. Calculating the test statistic:

Z = (25 - 30) / (4 / √65) = -5 / 0.496 ≈ -10.08

Since -10.08 is less than -2.33, we reject the null hypothesis.

Conclusion: There is enough evidence to support the claim that the average youngster watches less television than seniors at a significance level of 0.01.
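A short sketch reproducing this z-test in Python; scipy.stats.norm.ppf supplies the critical value:

```python
import math
from scipy import stats

x_bar, mu, sigma, n, alpha = 25, 30, 4, 65, 0.01

z = (x_bar - mu) / (sigma / math.sqrt(n))   # ≈ -10.08
z_critical = stats.norm.ppf(alpha)          # lower-tail critical value ≈ -2.33

print(round(z, 2), round(z_critical, 2))
print("reject H0" if z < z_critical else "fail to reject H0")
```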
Example Inference

If 75% of a sample of 100 individuals prefer Product A, we may infer that approximately 75% of the entire population has a preference for Product A.


Analysis vs Reporting

- Analysis: Involves exploring data, deriving insights, and making informed decisions based on the patterns and trends discovered.
- Reporting: Involves summarizing and presenting data without deep exploration, typically providing a snapshot of information.

Numerical Example: Consider a dataset of monthly sales for a product over a year:
Sales = [100, 120, 150, 130, 110, 90, 80, 100, 120, 140, 160, 180]

Analysis:
- Calculate monthly averages, identify seasonal trends, and forecast future sales based on historical patterns.
- Use statistical methods like moving averages or regression analysis to understand underlying factors influencing sales.

Reporting:
- Present a bar chart showing monthly sales without deep analysis.
- Provide a summary table displaying total sales for the year.
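A sketch contrasting the two on the sales series above: a single reported total versus a 3-month moving average used for analysis:

```python
import pandas as pd

sales = pd.Series([100, 120, 150, 130, 110, 90, 80, 100, 120, 140, 160, 180],
                  index=pd.period_range("2023-01", periods=12, freq="M"))

# Reporting: one summary number.
print(sales.sum())                  # total sales for the year

# Analysis: a 3-month moving average exposes the mid-year dip and recovery.
print(sales.rolling(window=3).mean())
```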
Python: Pandas

Important packages applicable to data science include, but are not limited to:

Pandas: widely used for large data sets.
- Data manipulation and analysis.
- The DataFrame structure simplifies data handling, cleaning, and filtering.
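A minimal illustrative sketch of these DataFrame operations (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"],
                   "age": [30, 25, 35],
                   "income": [50000, 60000, 75000]})

print(df.describe())                   # quick summary statistics
print(df[df["income"] > 55000])        # filtering rows
df["income_k"] = df["income"] / 1000   # deriving a new column
print(df)
```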


Python Matplotlib - 2D Plots

Matplotlib: widely used for visualization of data, before and after preprocessing, and for analytics.
- Bar Chart: Suitable for displaying and comparing individual categories or groups.
- Line Chart: Ideal for visualizing trends or patterns in data, especially over a continuous variable like time.
- Scatter Plot: Effective for showing relationships or correlations between two variables.
- Pie Chart: Good for illustrating the proportion of each category in a whole.
- Histogram: Useful for displaying the distribution of a single continuous variable.
- Box Plot: Helpful in visualizing the distribution and identifying outliers in a dataset.
- Heatmap: Great for representing the correlation between multiple variables in a matrix format.
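A compact sketch of the first three chart types on made-up monthly sales data:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [100, 120, 150, 130]
profit = [10, 12, 15, 13]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(months, sales)        # bar chart: compare categories
axes[1].plot(months, sales)       # line chart: trend over time
axes[2].scatter(sales, profit)    # scatter plot: relationship of two variables
plt.tight_layout()
plt.show()
```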


Matplotlib - 3D Plots

Matplotlib is also widely used for 3D visualization of data, before and after preprocessing, and for analytics.
- Scatter Plot in 3D: Visualizes individual data points in a 3D space.
- Line Plot in 3D: Connects data points with lines in a 3D space.
- Surface Plot: Visualizes a surface in 3D.
- Wireframe Plot: Represents a 3D surface with lines connecting the data points.


Python Package - NumPy

- Arrays: a powerful array object, numpy.ndarray, representing a grid of values of numeric or other data types.
- Broadcasting: allows operations on arrays of different shapes and sizes by automatically expanding the smaller array to the shape of the larger one, making element-wise operations easier.
- Vectorized Operations: mathematical expressions are applied element-wise on entire arrays, which can significantly improve performance.
- Indexing and Slicing: advanced indexing techniques, including slicing, masking, and fancy indexing, make it easy to access and manipulate data within arrays.
- Mathematical Functions: a wide range of functions for trigonometry, exponentiation, logarithms, and more, applied to arrays.
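A short sketch of broadcasting, vectorized operations, and indexing on a small array:

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Broadcasting: the 1-D row of column means stretches across both rows.
centered = a - a.mean(axis=0)

# Vectorized operation: element-wise, with no explicit Python loop.
scaled = centered / a.std(axis=0)

# Slicing and boolean masking.
print(a[:, 1])     # second column
print(a[a > 3])    # elements greater than 3
print(scaled)
```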
Python Package NumPy - Contd..

- Linear Algebra Operations: a rich set of functions for matrix multiplication, inversion, eigenvalue decomposition, and solving linear systems.
- Random Number Generation: functions for generating random numbers with different distributions, essential for simulations and statistical applications.
- Integration with Other Libraries: a fundamental building block for many other scientific computing libraries in Python, such as SciPy, pandas, and scikit-learn.
- Memory Efficiency: NumPy arrays are more memory-efficient than Python lists, and operations on them are faster due to their contiguous memory layout.
- Compatibility: compatible with a wide range of other libraries and tools used in scientific computing and data analysis.


Web Scraping Process

Web Scraping Steps:
1. Identify Target Website: Choose the website from which you want to extract data.
2. Inspect HTML Structure: Understand the structure of the HTML to locate data elements.
3. Use a Scraping Library: Utilize a scraping library like BeautifulSoup or Scrapy in Python.
4. HTTP Requests: Send HTTP requests to the website to retrieve HTML content.
5. Parse HTML: Parse the HTML content to extract relevant data.
6. Data Processing: Clean and process the extracted data as needed.
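A minimal sketch of steps 4-6 with requests and BeautifulSoup; the URL is a placeholder, and any scraping should respect the target site's terms of use:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; scrape only sites whose terms permit it.
url = "https://example.com"

response = requests.get(url, timeout=10)             # step 4: HTTP request
soup = BeautifulSoup(response.text, "html.parser")   # step 5: parse HTML

# Step 6: extract and process the elements located in step 2.
for heading in soup.find_all("h1"):
    print(heading.get_text(strip=True))
```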
