OUTLIERS, VARIANCES,
PROBABILITY DISTRIBUTIONS,
AND CORRELATIONS
Amisha Sarika Gowda (1GA19CS011)
Amulya V D (1GA19CS013)
Anagha M (1GA19CS014)
Anagha S (1GA19CS015)
OUTLIERS
Outliers are data points that lie numerically far from
the rest of the points in a dataset.
There are several reasons for the presence of outliers in
relationships. Some of these are:
Anomalous situation
Presence of a previously unknown fact
Human error
Sampling error
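One common way to flag points that are "numerically distant" is the interquartile range (IQR) rule. The slides do not prescribe a detection method, so the sketch below is only an illustration: the function name iqr_outliers, the multiplier k = 1.5 and the sample readings are all assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule and k = 1.5 are assumptions chosen for illustration;
    other rules (for example z-scores) are equally valid choices.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```

Whether a flagged point is an anomaly, a new fact, or an error still has to be decided by inspecting the data source.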
VARIANCE
Variance is measured by the sum of the squared
differences between the values of a variable and its
expected value.
Variance indicates how widely data points in a dataset
vary.
A high variance indicates that the data in the dataset is
spread out over a wide range (as in a random dataset),
whereas a low variance indicates that the values are very
similar to one another.
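The contrast between high and low variance can be seen on two small samples. This is a minimal sketch; the helper population_variance and both sample lists are hypothetical, and the population form (dividing by the number of values) is assumed.

```python
def population_variance(values):
    """Average of the squared differences between each value and the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Two hypothetical samples with the same mean (50) but different spread.
tight = [49, 50, 50, 51, 50]
spread = [10, 90, 35, 70, 45]

print("tight :", population_variance(tight))
print("spread:", population_variance(spread))
```

Both samples have the same mean, so the difference in the printed values comes only from how far the points lie from it.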
PROBABILITY DISTRIBUTION
A probability distribution is the distribution of
probability values P as a function of all possible
independent values, variables, situations or distances.
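The normal distribution formula is:
\[ P(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
where \mu is the mean and \sigma is the standard deviation.
The standard normal distribution is the special case
\mu = 0, \sigma = 1.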
Normal distribution
It is described by the Gaussian function. The figure shows
the distribution around the mean \mu, with standard
deviation \sigma and variance \sigma^2.
The figure also shows the percentages of areas in five
regions with respect to the total area under the curve for
P(x).
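The share of the total area that falls near the mean can be computed directly from the Gaussian function. The sketch below assumes nothing about the exact region boundaries used in the figure. It only evaluates the area within k standard deviations of the mean, and the names normal_pdf and area_within are hypothetical.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density P(x) for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def area_within(k):
    """Fraction of the total area within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * area_within(k):.2f}% of the area")
```

The result does not depend on the particular mean or standard deviation, because the regions are measured in units of the standard deviation.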
The variance of a probability distribution represents how
far individual data points are spread from one another
within a dataset.
The variance is the average of the squared differences
between each data value and the mean.
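Written as a formula, assuming a dataset of N values x_1, ..., x_N with mean \mu (the population form, which divides by N):

```latex
\[
  \mu = \frac{1}{N} \sum_{i=1}^{N} x_i ,
  \qquad
  \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\]
```

For example, for the values 2, 4 and 6 the mean is 4, the squared differences are 4, 0 and 4, and the variance is 8/3.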
CORRELATION
Correlation analysis lets us find the association, or the
absence of a relationship, between two variables, x and
y.
The squared correlation (R-squared) gives the strength of
the relationship between the model and the dependent
variable on a convenient 0-100% scale; the correlation r
itself ranges from -1 to +1.
Correlation is a statistical technique that measures and
describes the 'strength' and 'direction' of the relationship
between two variables.
The correlation r between the two variables x and y is:
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
where n is the number of observations in the sample, x_i
is the x value for observation i, \bar{x} is the sample
mean of x, y_i is the y value for observation i, \bar{y} is
the sample mean of y, s_x is the sample standard deviation
of x, and s_y is the sample standard deviation of y.
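The quantities defined above can be combined in a few lines of code: standardize each deviation from the mean by the sample standard deviation, multiply the x and y terms pairwise, and divide the sum by n - 1. This is a minimal sketch assuming that form of the sample correlation; the function name pearson_r and the sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation r between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Note: a constant sequence has zero deviation and r is undefined
    # (this sketch would raise ZeroDivisionError in that case).
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]
print(round(pearson_r(hours, scores), 3))
```

A result close to +1 or -1 indicates a strong linear relationship, the sign gives its direction, and a value near 0 indicates little or no linear association.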
THANK YOU