UNIT – I
Introduction To Data Modelling
Prepared By
Prof. Garad Varsha
Random Variable
In statistics, a random variable is a variable whose possible values are outcomes of a random
phenomenon or experiment. It assigns a numerical value to each outcome in a sample space.
Types of Random Variables:
Discrete Random Variable
◦ Takes on countable values (often whole numbers).
◦ Example: Number of heads when flipping 3 coins (0, 1, 2, 3).
◦ Common distributions: Binomial, Poisson.
Continuous Random Variable
◦ Takes on an infinite number of possible values within a given range.
◦ Example: The time it takes for a train to arrive (e.g., 2.5 minutes).
◦ Common distributions: Normal, Exponential.
Random Variable
A random variable is a function that assigns a numerical value to each outcome of a random
experiment.
In simple terms, it's a variable whose value is determined by the outcome of a random process.
Example:
Tossing a coin 3 times:
Let X be the number of heads.
Some outcomes of the experiment and their X values:
HHH → 3, HHT → 2, HTT → 1, TTT → 0
So X is a random variable taking values in {0, 1, 2, 3}.
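A short Python sketch (our own illustration, not part of the original slides) simulates this experiment and estimates the distribution of X:

import random
from collections import Counter

def num_heads(tosses=3):
    # X = number of heads in 'tosses' fair coin flips
    return sum(random.choice([0, 1]) for _ in range(tosses))

# Repeat the experiment many times and tabulate the observed values of X
counts = Counter(num_heads() for _ in range(10000))
for x in sorted(counts):
    print(f"X = {x}: relative frequency {counts[x] / 10000:.3f}")

The relative frequencies come out close to the theoretical probabilities 1/8, 3/8, 3/8, 1/8.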
Types of Random Variable
There are two main types of random variables in statistics:
1. Discrete Random Variable
Definition: A random variable that takes on a finite or countable number of distinct values.
Example: Number of students in a class, number of heads in 3 coin tosses.
Probability Function: Uses a Probability Mass Function (PMF).
Common Distributions:
◦ Binomial
◦ Poisson
◦ Geometric
2. Continuous Random Variable
Definition: A random variable that takes on infinitely many values within a given range or
interval.
Example: Height of students, time taken to run a race, temperature.
Probability Function: Uses a Probability Density Function (PDF).
Common Distributions:
◦ Normal (Gaussian)
◦ Exponential
◦ Uniform
Covariance
Covariance measures the direction of the linear relationship between two random variables X and Y:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Or equivalently:
Cov(X, Y) = E[XY] − E[X]·E[Y]
Difference Between Covariance and Correlation
Covariance
1) Measures how two variables vary together, indicating the direction of their linear relationship.
2) Units depend on the product of the units of the two variables.
3) lies between −∞ and ∞.
Correlation
1) Measures both the direction and strength of the linear relationship between two variables.
2) Unitless, making it easier to compare across datasets or features.
3) lies between -1 and 1.
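A quick numerical illustration in Python using NumPy (the sample data is made up for demonstration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x, a strong positive relationship

# Covariance: direction of the linear relationship, in units of x times units of y
cov_xy = np.cov(x, y)[0, 1]

# Correlation: unitless, always between -1 and 1
corr_xy = np.corrcoef(x, y)[0, 1]

print(f"Cov(X, Y)  = {cov_xy:.3f}")   # positive, but scale-dependent
print(f"Corr(X, Y) = {corr_xy:.3f}")  # close to +1

Rescaling x or y would change the covariance but leave the correlation unchanged, which is why correlation is easier to compare across datasets.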
Central Limit Theorem
The Central Limit Theorem explains that the sample distribution of the sample mean resembles
the normal distribution irrespective of the fact that whether the variables themselves are
distributed normally or not. The Central Limit Theorem is often called CLT in abbreviated form.
Central Limit Theorem Definition:
The Central Limit Theorem states that:
When sufficiently large samples (usually of size greater than thirty) are considered, the distribution of the sample arithmetic mean approaches the normal distribution, regardless of whether the underlying random variables are normally distributed.
Central Limit Theorem Formula
Let us assume we have a random variable X with mean μ and standard deviation σ. As per the Central Limit Theorem, the sample mean X̄ of n observations approximately follows the normal distribution:
X̄ ~ N(μ, σ/√n)
The Z-score of the sample mean X̄ is given as:
Z = (X̄ − μ) / (σ/√n)
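The following Python sketch illustrates the theorem empirically: sample means of a clearly non-normal (exponential) variable still look approximately normal (the parameter choices here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 50              # sample size (greater than 30)
num_samples = 10000

# Exponential data is strongly skewed, i.e., clearly not normal
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

# By the CLT: X-bar ~ N(mu, sigma/sqrt(n)), with mu = sigma = 1 here
print("mean of sample means:", sample_means.mean())  # close to 1
print("std of sample means :", sample_means.std())   # close to 1/sqrt(50), about 0.141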
Applications of the Central Limit Theorem
1. Confidence Intervals
When you want to estimate the population mean based on a sample, you often construct a
confidence interval.
CLT ensures that the sampling distribution of the sample mean is approximately normal.
This allows us to use z-scores or t-scores to calculate the margin of error and build accurate
confidence intervals, even if the original data is not normal.
2. Hypothesis Testing
Many hypothesis tests, like the t-test or z-test, assume that the test statistic follows a normal
distribution. Thanks to the CLT, even if your data isn’t normally distributed, the sampling
distribution of the test statistic will be approximately normal for large sample sizes.
This enables valid significance testing for population means or proportions.
Applications of the Central Limit Theorem
3. Quality Control & Process Monitoring
In manufacturing and quality control, the average measurements of products (like weight or
size) from samples are monitored.
CLT allows control charts (like X-bar charts) to assume normality of sample means.
This helps detect when a process is going out of control or producing defective products.
4. Polling and Surveys
When pollsters take samples of people to estimate population preferences (like election results),
CLT helps justify treating the sample mean (or proportion) as normally distributed.
This allows calculation of margins of error and confidence intervals for poll results.
Applications of the Central Limit Theorem
5. Financial Modeling
CLT is used in risk management and option pricing.
For example, the average returns over a period are assumed to be normally distributed, which
helps in portfolio optimization and predicting future risks.
6. Machine Learning & A/B Testing
Many algorithms assume normality or use statistics based on the normal distribution.
The CLT justifies these approximations and helps in the design of experiments, A/B testing, and evaluating model performance.
Chebyshev's Inequality
Chebyshev's inequality gives a way to estimate how much of the data lies within a certain number of standard deviations of the mean, regardless of the data's distribution.
For any random variable X with mean μ and finite standard deviation σ, and for any k > 0:
P(|X − μ| ≥ kσ) ≤ 1/k²
or equivalently,
P(|X − μ| < kσ) ≥ 1 − 1/k²
Why is Chebyshev's inequality useful?
It provides a guarantee on how spread out data can be, regardless of the distribution.
Useful for non-normal or unknown distributions.
Gives worst-case bounds on the spread of data.
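For example, with k = 2 the inequality guarantees that at least 1 − 1/4 = 75% of any distribution lies within two standard deviations of the mean. A small Python sketch checking the bound on skewed data (the choice of an exponential distribution is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=100000)   # non-normal data
mu, sigma = data.mean(), data.std()

for k in (2, 3):
    within = np.mean(np.abs(data - mu) < k * sigma)
    print(f"k={k}: observed {within:.3f} within k*sigma, "
          f"Chebyshev guarantees >= {1 - 1/k**2:.3f}")

The observed fractions are well above the Chebyshev bounds, as expected for a worst-case guarantee.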
Probability
What is Probability?
Probability measures how likely an event is to happen. It is a number between 0 and 1 (or 0% to
100%), where:
0 means the event will never happen.
1 means the event will always happen.
Numbers in between represent the chance of occurrence.
Probability Formula
If all outcomes are equally likely,
P(Event) = Number of favorable outcomes / Total number of possible outcomes
For example, the probability of rolling an even number on a fair die is 3/6 = 0.5.
What is a Probability Distribution?
A probability distribution describes how probabilities are distributed over the possible outcomes of
a random variable.
For a discrete random variable, it lists all possible outcomes and their probabilities.
For a continuous random variable, it’s described by a probability density function (pdf) that assigns
probabilities to intervals.
What is the Bernoulli Distribution?
The Bernoulli distribution is a discrete probability distribution for a random variable that has exactly two possible
outcomes:
Success (usually coded as 1)
Failure (usually coded as 0)
Definition:
A random variable X follows a Bernoulli distribution with parameter p if:
P(X=1)=p,
P(X=0)=1−p
where:
0≤p≤1
p is the probability of success
1−p is the probability of failure
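A minimal Python sketch of a Bernoulli random variable (the choice p = 0.3 is arbitrary):

import random

p = 0.3  # probability of success

def bernoulli_trial(p):
    # Returns 1 (success) with probability p, else 0 (failure)
    return 1 if random.random() < p else 0

trials = [bernoulli_trial(p) for _ in range(10000)]
print("empirical P(X=1):", sum(trials) / len(trials))  # close to 0.3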
The binomial distribution is a discrete probability distribution that models the number of successes
in a fixed number of independent trials, where each trial has only two possible outcomes: success or
failure.
Probability Mass Function (PMF)
P(X = k) = C(n, k) · p^k · (1 − p)^(n−k), where C(n, k) = n! / (k!(n − k)!) is the binomial coefficient.
Where:
X = number of successes
n = number of trials
k = number of successes (0 ≤ k ≤ n)
p = probability of success on a single trial
Mean and Variance
Mean (expected value): μ = np
Variance: σ² = np(1 − p)
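A quick check of the PMF, mean, and variance using scipy.stats (assuming SciPy is available; n = 10 and p = 0.5 are arbitrary choices):

from scipy.stats import binom

n, p = 10, 0.5
# P(X = 5): probability of exactly 5 successes in 10 trials
print("P(X=5) =", binom.pmf(5, n, p))   # about 0.246
print("mean   =", binom.mean(n, p))     # n*p = 5.0
print("var    =", binom.var(n, p))      # n*p*(1-p) = 2.5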
The geometric distribution is a discrete probability distribution that models the number of
Bernoulli trials needed to get the first success.
Probability Mass Function (PMF)
P(X = k) = (1 − p)^(k−1) · p
Where:
X = number of trials until first success (so k=1,2,3…)
p = probability of success
This is sometimes called the "first success" form.
Mean and Variance
Mean (expected value): μ = 1/p
Variance: σ² = (1 − p)/p²
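scipy.stats.geom uses this same "first success" form, so it can be used to check the formulas (assuming SciPy is available; p = 0.2 is an arbitrary choice):

from scipy.stats import geom

p = 0.2
# P(X = 3): first success on the 3rd trial = (1-p)^2 * p
print("P(X=3) =", geom.pmf(3, p))   # 0.8 * 0.8 * 0.2 = 0.128
print("mean   =", geom.mean(p))     # 1/p = 5.0
print("var    =", geom.var(p))      # (1-p)/p^2 = 20.0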
The Poisson distribution is a discrete probability distribution used to model the number of times an event occurs in a fixed interval of time, area, volume, or distance, when these events happen at a constant average rate and independently of each other.
Probability Mass Function (PMF)
P(X = k) = e^(−λ) · λ^k / k!
Where:
X = number of occurrences (events)
k = specific number of events (0, 1, 2, ...)
λ = average number of events in the interval
e ≈ 2.718
k! = factorial of k
Mean and Variance
Mean: μ = λ
Variance: σ² = λ
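A quick check with scipy.stats (assuming SciPy is available; λ = 4 is an arbitrary choice):

from scipy.stats import poisson

lam = 4.0   # average number of events per interval
# P(X = 2) = e^(-4) * 4^2 / 2!
print("P(X=2) =", poisson.pmf(2, lam))   # about 0.1465
print("mean   =", poisson.mean(lam))     # lambda = 4.0
print("var    =", poisson.var(lam))      # lambda = 4.0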
Open Addressing-
In open addressing,
Unlike separate chaining, all the keys are stored inside the hash table.
No key is stored outside the hash table.
Techniques used for open addressing are-
Linear Probing
Quadratic Probing
Double Hashing
Operations in Open Addressing-
Let us discuss how operations are performed in open addressing-
Insert Operation-
Hash function is used to compute the hash value for a key to be inserted.
Hash value is then used as an index to store the key in the hash table.
In case of collision,
Probing is performed until an empty bucket is found.
Once an empty bucket is found, the key is inserted.
Probing is performed in accordance with the technique used for open addressing.
Search Operation-
To search any particular key,
Its hash value is obtained using the hash function used.
Using the hash value, that bucket of the hash table is checked.
If the required key is found in that bucket, the search is successful.
Otherwise, the subsequent buckets are checked until the required key or an empty bucket is found.
The empty bucket indicates that the key is not present in the hash table.
Delete Operation-
The key is first searched and then deleted.
After deleting the key, that particular bucket is marked as “deleted”.
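As an illustration (our own Python sketch, not part of the original slides), the three operations can be implemented with linear probing and a tombstone object marking deleted buckets:

DELETED = object()   # tombstone marking a deleted bucket

class OpenAddressingHashTable:
    def __init__(self, size=7):
        self.size = size
        self.table = [None] * size          # None = never-used empty bucket

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        index = self._hash(key)
        for i in range(self.size):          # linear probing: try consecutive buckets
            probe = (index + i) % self.size
            if self.table[probe] is None or self.table[probe] is DELETED:
                self.table[probe] = key
                return probe
        raise RuntimeError("hash table is full")

    def search(self, key):
        index = self._hash(key)
        for i in range(self.size):
            probe = (index + i) % self.size
            if self.table[probe] is None:   # never-used bucket: key cannot be present
                return None
            if self.table[probe] == key:    # tombstones compare unequal, so we probe past them
                return probe
        return None

    def delete(self, key):
        probe = self.search(key)            # search first, then mark as deleted
        if probe is not None:
            self.table[probe] = DELETED

Note how search stops at a never-used bucket but probes past deleted ones; this is exactly why deleted buckets must be marked rather than simply emptied.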
Open Addressing Techniques-
1. Linear Probing-
In linear probing,
When collision occurs, we linearly probe for the next bucket.
We keep probing until an empty bucket is found.
Advantage-
It is easy to compute.
Disadvantage-
The main problem with linear probing is clustering: occupied buckets form long consecutive runs.
This makes it slow to search for an element or to find an empty bucket.
Time Complexity-
The worst-case time to search an element with linear probing is O(table size).
This is because even if only one element is present and all other elements have been deleted, the "deleted" markers left in the hash table can force a search to scan the entire table.
2. Quadratic Probing-
In quadratic probing,
When a collision occurs, in the i-th iteration we probe bucket (h(x) + i²) mod m, where m is the table size.
We keep probing until an empty bucket is found.
3. Double Hashing-
In double hashing,
We use another hash function hash2(x) and, in the i-th iteration, probe bucket (h(x) + i · hash2(x)) mod m.
It requires more computation time as two hash functions need to be computed.
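In code, the three techniques differ only in the probe offset used in the i-th iteration (a sketch; the secondary hash function shown is one common illustrative choice, not prescribed by the slides):

def linear_probe(h, i, m):
    return (h + i) % m

def quadratic_probe(h, i, m):
    return (h + i * i) % m

def double_hash_probe(h, h2, i, m):
    # h2 must never be 0, otherwise the probe sequence never moves
    return (h + i * h2) % m

# A typical secondary hash used with double hashing (illustrative choice):
def hash2(key, m):
    return 1 + (key % (m - 1))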
Comparison of Open Addressing Techniques-
                            Linear Probing   Quadratic Probing      Double Hashing
Primary clustering          Yes              No                     No
Secondary clustering        Yes              Yes                    No
Number of probe sequences   m                m                      m²
(m = size of table)
Cache performance           Best             Lies between the two   Poor
Conclusions-
•Linear Probing has the best cache performance but suffers from clustering.
•Quadratic probing lies between the two in terms of cache performance and clustering.
•Double hashing has poor cache performance but no clustering.
Load Factor (α)-
Load factor (α) is defined as-
α = (Number of keys stored in the table) / (Size of the hash table)
In open addressing, the value of the load factor always lies between 0 and 1.
This is because-
•In open addressing, all the keys are stored inside the hash table.
•So, size of the table is always greater or at least equal to the number of keys stored in the table.
PRACTICE PROBLEM BASED ON OPEN ADDRESSING-
Problem-
Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash table-
50, 700, 76, 85, 92, 73 and 101
Use linear probing technique for collision resolution.
Solution-
The given sequence of keys will be inserted in the hash table as follows.
Step-01:
Draw an empty hash table.
For the given hash function, the possible range of hash values is [0, 6].
So, draw an empty hash table consisting of 7 buckets (bucket-0 to bucket-6).
Step-02:
Insert the given keys in the hash table one by one.
The first key to be inserted in the hash table = 50.
Bucket of the hash table to which key 50 maps = 50 mod 7 = 1.
So, key 50 is inserted in bucket-1 of the hash table.
Step-03:
The next key to be inserted in the hash table = 700.
Bucket of the hash table to which key 700 maps = 700 mod 7 = 0.
So, key 700 is inserted in bucket-0 of the hash table.
Step-04:
The next key to be inserted in the hash table = 76.
Bucket of the hash table to which key 76 maps = 76 mod 7 = 6.
So, key 76 is inserted in bucket-6 of the hash table.
Step-05:
The next key to be inserted in the hash table = 85.
Bucket of the hash table to which key 85 maps = 85 mod 7 = 1.
Since bucket-1 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-2.
So, key 85 is inserted in bucket-2 of the hash table.
Step-06:
The next key to be inserted in the hash table = 92.
Bucket of the hash table to which key 92 maps = 92 mod 7 = 1.
Since bucket-1 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-3.
So, key 92 is inserted in bucket-3 of the hash table.
Step-07:
The next key to be inserted in the hash table = 73.
Bucket of the hash table to which key 73 maps = 73 mod 7 = 3.
Since bucket-3 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-4.
So, key 73 is inserted in bucket-4 of the hash table.
Step-08:
The next key to be inserted in the hash table = 101.
Bucket of the hash table to which key 101 maps = 101 mod 7 = 3.
Since bucket-3 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-5.
So, key 101 is inserted in bucket-5 of the hash table.
The final hash table is:
bucket-0: 700, bucket-1: 50, bucket-2: 85, bucket-3: 92, bucket-4: 73, bucket-5: 101, bucket-6: 76
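A few lines of Python (an illustrative check, using the same "key mod 7" hash and linear probing) reproduce this final table:

table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    i = key % 7
    while table[i] is not None:      # linear probing on collision
        i = (i + 1) % 7
    table[i] = key
print(table)   # [700, 50, 85, 92, 73, 101, 76]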
Video Lecture
https://www.youtube.com/watch?v=nEcUS90C4fo
Separate Chaining Vs Open Addressing-
Separate Chaining:
1) Keys are stored inside the hash table as well as outside the hash table.
2) The number of keys to be stored in the hash table can even exceed the size of the hash table.
3) Deletion is easier.
4) Extra space is required for the pointers to store the keys outside the hash table.
5) Cache performance is poor, because of the linked lists which store the keys outside the hash table.
6) Some buckets of the hash table are never used, which leads to wastage of space.
Open Addressing:
1) All the keys are stored only inside the hash table. No key is present outside the hash table.
2) The number of keys to be stored in the hash table can never exceed the size of the hash table.
3) Deletion is difficult.
4) No extra space is required.
5) Cache performance is better, because no linked lists are used.
6) Buckets may be used even if no key maps to those particular buckets.
Thank You!