UNIT – I
Introduction To Data Modelling
Prepared By
Prof. Garad Varsha
Random Variable
In statistics, a random variable is a variable whose possible values are outcomes of a random
phenomenon or experiment. It assigns a numerical value to each outcome in a sample space.
Types of Random Variables:
Discrete Random Variable
◦ Takes on countable values (often whole numbers).
◦ Example: Number of heads when flipping 3 coins (0, 1, 2, 3).
◦ Common distributions: Binomial, Poisson.
Continuous Random Variable
◦ Takes on an infinite number of possible values within a given range.
◦ Example: The time it takes for a train to arrive (e.g., 2.5 minutes).
◦ Common distributions: Normal, Exponential.
Random Variable
A random variable is a function that assigns a numerical value to each outcome of a random
experiment.
In simple terms, it's a variable whose value is determined by the outcome of a random process.
Example:
Tossing a coin 3 times:
Let X be the number of heads.
Some outcomes of the experiment and their X values:
HHH → 3, HHT → 2, HTT → 1, TTT → 0
So X is a random variable taking values in {0, 1, 2, 3}.
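A short Python sketch (our own illustration, not part of the original slides) simulates this experiment and estimates the distribution of X:

import random
from collections import Counter

def num_heads(tosses=3):
    # X = number of heads in 'tosses' fair coin flips
    return sum(random.choice([0, 1]) for _ in range(tosses))

# Repeat the experiment many times and tabulate the observed values of X
counts = Counter(num_heads() for _ in range(10000))
for x in sorted(counts):
    print(f"X = {x}: relative frequency {counts[x] / 10000:.3f}")

The relative frequencies come out close to the theoretical probabilities 1/8, 3/8, 3/8, 1/8.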
Types of Random Variable
There are two main types of random variables in statistics:
1. Discrete Random Variable
Definition: A random variable that takes on a finite or countable number of distinct values.
Example: Number of students in a class, number of heads in 3 coin tosses.
Probability Function: Uses a Probability Mass Function (PMF).
Common Distributions:
◦ Binomial
◦ Poisson
◦ Geometric
2. Continuous Random Variable
Definition: A random variable that takes on infinitely many values within a given range or
interval.
Example: Height of students, time taken to run a race, temperature.
Probability Function: Uses a Probability Density Function (PDF).
Common Distributions:
◦ Normal (Gaussian)
◦ Exponential
◦ Uniform
Covariance
Covariance measures the direction of the linear relationship between two random variables X and Y:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Or equivalently:
Cov(X, Y) = E[XY] − E[X]·E[Y]
Difference Between Covariance and Correlation
Covariance
1) Measures how two variables vary together, indicating the direction of their linear relationship.
2) Units depend on the product of the units of the two variables.
3) lies between −∞ and ∞.
Correlation
1) Measures both the direction and strength of the linear relationship between two variables.
2) Unitless, making it easier to compare across datasets or features.
3) lies between -1 and 1.
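A quick numerical illustration in Python using NumPy (the sample data is made up for demonstration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x, a strong positive relationship

# Covariance: direction of the linear relationship, in units of x times units of y
cov_xy = np.cov(x, y)[0, 1]

# Correlation: unitless, always between -1 and 1
corr_xy = np.corrcoef(x, y)[0, 1]

print(f"Cov(X, Y)  = {cov_xy:.3f}")   # positive, but scale-dependent
print(f"Corr(X, Y) = {corr_xy:.3f}")  # close to +1

Rescaling x or y would change the covariance but leave the correlation unchanged, which is why correlation is easier to compare across datasets.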
Central Limit Theorem
The Central Limit Theorem explains that the sample distribution of the sample mean resembles
the normal distribution irrespective of the fact that whether the variables themselves are
distributed normally or not. The Central Limit Theorem is often called CLT in abbreviated form.
Central Limit Theorem Definition:
The Central Limit Theorem states that:
When sufficiently large samples (usually of size greater than thirty) are considered, the distribution of the sample arithmetic mean approaches the normal distribution, regardless of whether the underlying random variables are normally distributed.
Central Limit Theorem Formula
Let us assume we have a random variable X with mean μ and standard deviation σ. As per the Central Limit Theorem, the sample mean X̄ of n observations approximately follows the normal distribution:
X̄ ~ N(μ, σ/√n)
The Z-score of the sample mean X̄ is given as:
Z = (X̄ − μ) / (σ/√n)
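The following Python sketch illustrates the theorem empirically: sample means of a clearly non-normal (exponential) variable still look approximately normal (the parameter choices here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 50              # sample size (greater than 30)
num_samples = 10000

# Exponential data is strongly skewed, i.e., clearly not normal
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

# By the CLT: X-bar ~ N(mu, sigma/sqrt(n)), with mu = sigma = 1 here
print("mean of sample means:", sample_means.mean())  # close to 1
print("std of sample means :", sample_means.std())   # close to 1/sqrt(50), about 0.141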
Applications of the Central Limit Theorem
1. Confidence Intervals
When you want to estimate the population mean based on a sample, you often construct a
confidence interval.
CLT ensures that the sampling distribution of the sample mean is approximately normal.
This allows us to use z-scores or t-scores to calculate the margin of error and build accurate
confidence intervals, even if the original data is not normal.
2. Hypothesis Testing
Many hypothesis tests, like the t-test or z-test, assume that the test statistic follows a normal
distribution. Thanks to the CLT, even if your data isn’t normally distributed, the sampling
distribution of the test statistic will be approximately normal for large sample sizes.
This enables valid significance testing for population means or proportions.
Applications of the Central Limit Theorem
3. Quality Control & Process Monitoring
In manufacturing and quality control, the average measurements of products (like weight or
size) from samples are monitored.
CLT allows control charts (like X-bar charts) to assume normality of sample means.
This helps detect when a process is going out of control or producing defective products.
4. Polling and Surveys
When pollsters take samples of people to estimate population preferences (like election results),
CLT helps justify treating the sample mean (or proportion) as normally distributed.
This allows calculation of margins of error and confidence intervals for poll results.
Applications of the Central Limit Theorem
5. Financial Modeling
CLT is used in risk management and option pricing.
For example, the average returns over a period are assumed to be normally distributed, which
helps in portfolio optimization and predicting future risks.
6. Machine Learning & A/B Testing
Many algorithms assume normality or use statistics based on the normal distribution.
The CLT justifies these approximations and helps in the design of experiments, A/B testing, and evaluating model performance.
Chebyshev's Inequality
Chebyshev's inequality gives a way to estimate how much of the data lies within a certain number of standard deviations of the mean, regardless of the data's distribution.
For any random variable X with mean μ and finite standard deviation σ, and for any k > 0:
P(|X − μ| ≥ kσ) ≤ 1/k²
or equivalently,
P(|X − μ| < kσ) ≥ 1 − 1/k²
Why is Chebyshev's inequality useful?
It provides a guarantee on how spread out data can be, regardless of the distribution.
Useful for non-normal or unknown distributions.
Gives worst-case bounds on the spread of data.
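For example, with k = 2 the inequality guarantees that at least 1 − 1/4 = 75% of any distribution lies within two standard deviations of the mean. A small Python sketch checking the bound on skewed data (the choice of an exponential distribution is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=100000)   # non-normal data
mu, sigma = data.mean(), data.std()

for k in (2, 3):
    within = np.mean(np.abs(data - mu) < k * sigma)
    print(f"k={k}: observed {within:.3f} within k*sigma, "
          f"Chebyshev guarantees >= {1 - 1/k**2:.3f}")

The observed fractions are well above the Chebyshev bounds, as expected for a worst-case guarantee.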
Probability
What is Probability?
Probability measures how likely an event is to happen. It is a number between 0 and 1 (or 0% to
100%), where:
0 means the event will never happen.
1 means the event will always happen.
Numbers in between represent the chance of occurrence.
Probability Formula
If all outcomes are equally likely,
P(Event) = Number of favorable outcomes / Total number of possible outcomes
For example, the probability of rolling an even number on a fair die is 3/6 = 0.5.
What is a Probability Distribution?
A probability distribution describes how probabilities are distributed over the possible outcomes of
a random variable.
For a discrete random variable, it lists all possible outcomes and their probabilities.
For a continuous random variable, it’s described by a probability density function (pdf) that assigns
probabilities to intervals.
What is the Bernoulli Distribution?
The Bernoulli distribution is a discrete probability distribution for a random variable that has exactly two possible
outcomes:
Success (usually coded as 1)
Failure (usually coded as 0)
Definition:
A random variable X follows a Bernoulli distribution with parameter p if:
P(X=1)=p,
P(X=0)=1−p
where:
0≤p≤1
p is the probability of success
1−p is the probability of failure
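A minimal Python sketch of a Bernoulli random variable (the choice p = 0.3 is arbitrary):

import random

p = 0.3  # probability of success

def bernoulli_trial(p):
    # Returns 1 (success) with probability p, else 0 (failure)
    return 1 if random.random() < p else 0

trials = [bernoulli_trial(p) for _ in range(10000)]
print("empirical P(X=1):", sum(trials) / len(trials))  # close to 0.3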
The binomial distribution is a discrete probability distribution that models the number of successes
in a fixed number of independent trials, where each trial has only two possible outcomes: success or
failure.
Probability Mass Function (PMF)
P(X = k) = C(n, k) · p^k · (1 − p)^(n−k), where C(n, k) = n! / (k!(n − k)!) is the binomial coefficient.
Where:
X = number of successes
n = number of trials
k = number of successes (0 ≤ k ≤ n)
p = probability of success on a single trial
Mean and Variance
Mean (expected value): μ = np
Variance: σ² = np(1 − p)
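A quick check of the PMF, mean, and variance using scipy.stats (assuming SciPy is available; n = 10 and p = 0.5 are arbitrary choices):

from scipy.stats import binom

n, p = 10, 0.5
# P(X = 5): probability of exactly 5 successes in 10 trials
print("P(X=5) =", binom.pmf(5, n, p))   # about 0.246
print("mean   =", binom.mean(n, p))     # n*p = 5.0
print("var    =", binom.var(n, p))      # n*p*(1-p) = 2.5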
The geometric distribution is a discrete probability distribution that models the number of
Bernoulli trials needed to get the first success.
Probability Mass Function (PMF)
P(X = k) = (1 − p)^(k−1) · p
Where:
X = number of trials until first success (so k=1,2,3…)
p = probability of success
This is sometimes called the "first success" form.
Mean and Variance
Mean (expected value): μ = 1/p
Variance: σ² = (1 − p)/p²
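scipy.stats.geom uses this same "first success" form, so it can be used to check the formulas (assuming SciPy is available; p = 0.2 is an arbitrary choice):

from scipy.stats import geom

p = 0.2
# P(X = 3): first success on the 3rd trial = (1-p)^2 * p
print("P(X=3) =", geom.pmf(3, p))   # 0.8 * 0.8 * 0.2 = 0.128
print("mean   =", geom.mean(p))     # 1/p = 5.0
print("var    =", geom.var(p))      # (1-p)/p^2 = 20.0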
The Poisson distribution is a discrete probability distribution used to model the number of times an event occurs in a fixed interval of time, area, volume, or distance, when these events happen at a constant average rate and independently of each other.
Probability Mass Function (PMF)
P(X = k) = e^(−λ) · λ^k / k!
Where:
X = number of occurrences (events)
k = specific number of events (0, 1, 2, ...)
λ = average number of events in the interval
e ≈ 2.718
k! = factorial of k
Mean and Variance
Mean: μ = λ
Variance: σ² = λ
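A quick check with scipy.stats (assuming SciPy is available; λ = 4 is an arbitrary choice):

from scipy.stats import poisson

lam = 4.0   # average number of events per interval
# P(X = 2) = e^(-4) * 4^2 / 2!
print("P(X=2) =", poisson.pmf(2, lam))   # about 0.1465
print("mean   =", poisson.mean(lam))     # lambda = 4.0
print("var    =", poisson.var(lam))      # lambda = 4.0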
Open Addressing-
In open addressing,
Unlike separate chaining, all the keys are stored inside the hash table.
No key is stored outside the hash table.
Techniques used for open addressing are-
Linear Probing
Quadratic Probing
Double Hashing
Operations in Open Addressing-
Let us discuss how operations are performed in open addressing-
Insert Operation-
Hash function is used to compute the hash value for a key to be inserted.
Hash value is then used as an index to store the key in the hash table.
In case of collision,
Probing is performed until an empty bucket is found.
Once an empty bucket is found, the key is inserted.
Probing is performed in accordance with the technique used for open addressing.
Search Operation-
To search any particular key,
Its hash value is obtained using the hash function used.
Using the hash value, that bucket of the hash table is checked.
If the required key is found in that bucket, the search is successful.
Otherwise, the subsequent buckets are checked until the required key or an empty bucket is found.
The empty bucket indicates that the key is not present in the hash table.
Delete Operation-
The key is first searched and then deleted.
After deleting the key, that particular bucket is marked as “deleted”.
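As an illustration (our own Python sketch, not part of the original slides), the three operations can be implemented with linear probing and a tombstone object marking deleted buckets:

DELETED = object()   # tombstone marking a deleted bucket

class OpenAddressingHashTable:
    def __init__(self, size=7):
        self.size = size
        self.table = [None] * size          # None = never-used empty bucket

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        index = self._hash(key)
        for i in range(self.size):          # linear probing: try consecutive buckets
            probe = (index + i) % self.size
            if self.table[probe] is None or self.table[probe] is DELETED:
                self.table[probe] = key
                return probe
        raise RuntimeError("hash table is full")

    def search(self, key):
        index = self._hash(key)
        for i in range(self.size):
            probe = (index + i) % self.size
            if self.table[probe] is None:   # never-used bucket: key cannot be present
                return None
            if self.table[probe] == key:    # tombstones compare unequal, so we probe past them
                return probe
        return None

    def delete(self, key):
        probe = self.search(key)            # search first, then mark as deleted
        if probe is not None:
            self.table[probe] = DELETED

Note how search stops at a never-used bucket but probes past deleted ones; this is exactly why deleted buckets must be marked rather than simply emptied.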
Open Addressing Techniques-
1. Linear Probing-
In linear probing,
When collision occurs, we linearly probe for the next bucket.
We keep probing until an empty bucket is found.
Advantage-
It is easy to compute.
Disadvantage-
The main problem with linear probing is clustering: occupied buckets form long consecutive runs.
This makes it slow to search for an element or to find an empty bucket.
Time Complexity-
The worst-case time to search an element with linear probing is O(table size).
This is because even if only one element is present and all other elements have been deleted, the "deleted" markers left in the hash table can force a search to scan the entire table.
2. Quadratic Probing-
In quadratic probing,
When a collision occurs, in the i-th iteration we probe bucket (h(x) + i²) mod m, where m is the table size.
We keep probing until an empty bucket is found.
3. Double Hashing-
In double hashing,
We use another hash function hash2(x) and, in the i-th iteration, probe bucket (h(x) + i · hash2(x)) mod m.
It requires more computation time as two hash functions need to be computed.
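In code, the three techniques differ only in the probe offset used in the i-th iteration (a sketch; the secondary hash function shown is one common illustrative choice, not prescribed by the slides):

def linear_probe(h, i, m):
    return (h + i) % m

def quadratic_probe(h, i, m):
    return (h + i * i) % m

def double_hash_probe(h, h2, i, m):
    # h2 must never be 0, otherwise the probe sequence never moves
    return (h + i * h2) % m

# A typical secondary hash used with double hashing (illustrative choice):
def hash2(key, m):
    return 1 + (key % (m - 1))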
Comparison of Open Addressing Techniques-
                            Linear Probing   Quadratic Probing      Double Hashing
Primary clustering          Yes              No                     No
Secondary clustering        Yes              Yes                    No
Number of probe sequences   m                m                      m²
(m = size of table)
Cache performance           Best             Lies between the two   Poor
Conclusions-
•Linear Probing has the best cache performance but suffers from clustering.
•Quadratic probing lies between the two in terms of cache performance and clustering.
•Double hashing has poor cache performance but no clustering.
Load Factor (α)-
Load factor (α) is defined as-
α = (Number of keys stored in the table) / (Size of the hash table)
In open addressing, the value of the load factor always lies between 0 and 1.
This is because-
•In open addressing, all the keys are stored inside the hash table.
•So, size of the table is always greater or at least equal to the number of keys stored in the table.
PRACTICE PROBLEM BASED ON OPEN ADDRESSING-
Problem-
Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash table-
50, 700, 76, 85, 92, 73 and 101
Use linear probing technique for collision resolution.
Solution-
The given sequence of keys will be inserted in the hash table as follows.
Step-01:
Draw an empty hash table.
For the given hash function, the possible range of hash values is [0, 6].
So, draw an empty hash table consisting of 7 buckets (bucket-0 to bucket-6).
Step-02:
Insert the given keys in the hash table one by one.
The first key to be inserted in the hash table = 50.
Bucket of the hash table to which key 50 maps = 50 mod 7 = 1.
So, key 50 is inserted in bucket-1 of the hash table.
Step-03:
The next key to be inserted in the hash table = 700.
Bucket of the hash table to which key 700 maps = 700 mod 7 = 0.
So, key 700 is inserted in bucket-0 of the hash table.
Step-04:
The next key to be inserted in the hash table = 76.
Bucket of the hash table to which key 76 maps = 76 mod 7 = 6.
So, key 76 is inserted in bucket-6 of the hash table.
Step-05:
The next key to be inserted in the hash table = 85.
Bucket of the hash table to which key 85 maps = 85 mod 7 = 1.
Since bucket-1 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-2.
So, key 85 is inserted in bucket-2 of the hash table.
Step-06:
The next key to be inserted in the hash table = 92.
Bucket of the hash table to which key 92 maps = 92 mod 7 = 1.
Since bucket-1 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-3.
So, key 92 is inserted in bucket-3 of the hash table.
Step-07:
The next key to be inserted in the hash table = 73.
Bucket of the hash table to which key 73 maps = 73 mod 7 = 3.
Since bucket-3 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-4.
So, key 73 is inserted in bucket-4 of the hash table.
Step-08:
The next key to be inserted in the hash table = 101.
Bucket of the hash table to which key 101 maps = 101 mod 7 = 3.
Since bucket-3 is already occupied, a collision occurs.
To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is found.
The first empty bucket is bucket-5.
So, key 101 is inserted in bucket-5 of the hash table.
The final hash table is:
bucket-0: 700, bucket-1: 50, bucket-2: 85, bucket-3: 92, bucket-4: 73, bucket-5: 101, bucket-6: 76
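A few lines of Python (an illustrative check, using the same "key mod 7" hash and linear probing) reproduce this final table:

table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    i = key % 7
    while table[i] is not None:      # linear probing on collision
        i = (i + 1) % 7
    table[i] = key
print(table)   # [700, 50, 85, 92, 73, 101, 76]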
Video Lecture
https://www.youtube.com/watch?v=nEcUS90C4fo
Separate Chaining Vs Open Addressing-
Separate Chaining:
1) Keys are stored inside the hash table as well as outside the hash table.
2) The number of keys to be stored in the hash table can even exceed the size of the hash table.
3) Deletion is easier.
4) Extra space is required for the pointers to store the keys outside the hash table.
5) Cache performance is poor, because of the linked lists which store the keys outside the hash table.
6) Some buckets of the hash table are never used, which leads to wastage of space.
Open Addressing:
1) All the keys are stored only inside the hash table. No key is present outside the hash table.
2) The number of keys to be stored in the hash table can never exceed the size of the hash table.
3) Deletion is difficult.
4) No extra space is required.
5) Cache performance is better, because no linked lists are used.
6) Buckets may be used even if no key maps to those particular buckets.
Thank You!