GLS UNIVERSITY

BCA SEMESTER -III


210302301 STATISTICS FOR DATA ANALYSIS
MODULE 3

UNIT – 4 : CORRELATION

Introduction:
In today’s business world we come across many activities that depend on each other. Businesses face a large number of problems involving two or more variables, and identifying these variables and their dependencies helps in resolving many of them. Often two variables seem to move in the same direction, both increasing or both decreasing; at other times an increase in one variable is accompanied by a decline in the other. Examples include family income and expenditure, the price of a product and its demand, and advertisement expenditure and sales volume. If two quantities vary in such a way that movements in one are accompanied by movements in the other, the quantities are said to be correlated.

Meaning:
Correlation is a statistical technique used to ascertain the association or relationship between two or more variables. Correlation analysis studies the degree and direction of that relationship.
A correlation coefficient is a statistical measure of the degree to which changes in the value of one variable predict changes in the value of another. When the fluctuation of one variable reliably predicts a similar fluctuation in another, there is a tendency to assume that the change in one causes the change in the other; however, correlation by itself does not establish causation.

Types of Correlation:
Correlation is described or classified in several different ways. Three of the
most important are:

I. Positive, Negative and Zero
II. Simple, Partial and Multiple
III. Linear and Non-linear

I. Positive, Negative and Zero Correlation:

Whether correlation is positive (direct) or negative (inverse) depends upon the direction of change of the variables.

Positive Correlation: If both variables vary in the same direction, the correlation is said to be positive. That is, if one variable increases, the other on average also increases; if one decreases, the other on average also decreases. For example, the correlation between the heights and weights of a group of persons is positive.

Negative Correlation: If the variables vary in opposite directions, the correlation is said to be negative. That is, if one variable increases, the other decreases, and vice versa. For example, the correlation between the price of a product and its demand is negative.
Zero Correlation: Strictly speaking this is not a type of correlation, but it is still called zero or no correlation. When no relationship is found between the variables, the correlation is said to be zero: a change in the value of one variable does not influence or change the value of the other. For example, the correlation between the weight of a person and intelligence is zero.

II. Simple, Partial and Multiple Correlation:

The distinction between simple, partial and multiple correlation is based upon
the number of variables studied.
Simple Correlation: When only two variables are studied, it is a case of simple correlation. For example, studying the relationship between the marks secured by a student and the student's attendance in class is a problem of simple correlation.
Partial Correlation: In partial correlation, three or more variables are studied, but only two are considered to be influencing each other, with the effect of the other influencing variables held constant. In the example above, other influences such as the effectiveness of the teacher's instruction and the use of teaching aids like computers and smart boards are assumed to be constant.
Multiple Correlation: When three or more variables are studied jointly, it is a case of multiple correlation. In the example above, if the study covers the relationship between students' marks, attendance, teacher effectiveness, and the use of teaching aids, it is a case of multiple correlation.

III. Linear and Non-linear Correlation:


Depending upon the constancy of the ratio of change between the variables, the
correlation may be Linear or Non-linear Correlation.

Linear Correlation: If the amount of change in one variable bears a constant ratio to the amount of change in the other variable, the correlation is said to be linear. If such variables are plotted on graph paper, all the points fall on a straight line. For example, if producing one unit of finished product requires 10 units of raw material, then producing 2 units requires exactly double that amount.

Non-linear Correlation: If the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, the correlation is said to be non-linear. If such variables are plotted on a graph, the points fall on a curve rather than a straight line. For example, doubling advertisement expenditure does not necessarily double sales volume.

Exercise: For each of the cases (scatter diagrams omitted in the source), state whether there is (a) positive correlation, (b) negative correlation, or (c) no correlation.
Karl Pearson’s Coefficient of Correlation:
Karl Pearson’s method of calculating the coefficient of correlation is based on the covariance of the two variables in a series. The method is widely used in practice, and the coefficient is denoted by the symbol “r”. If the two variables under study are X and Y, Karl Pearson's formula for measuring the degree of correlation is:

r = Cov(X, Y) / (σX · σY) = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ]
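Karl Pearson's formula can be sketched directly in Python. The data here is purely illustrative and not taken from the examples below:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's coefficient of correlation for two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means (covariance term).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the two standard-deviation terms.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative data: advertisement expenses (X) and sales volume (Y).
x = [10, 12, 15, 23, 20]
y = [14, 17, 23, 25, 21]
r = pearson_r(x, y)
```

By construction r always lies between −1 (perfect negative correlation) and +1 (perfect positive correlation), with 0 meaning no linear relationship.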

Example 1
From the following information, find the correlation coefficient between advertisement expenses and sales volume using Karl Pearson's coefficient of correlation method.
Interpretation: The calculation gives r = 0.7866, a high degree of positive correlation between the two variables: an increase in advertisement expenses is associated with an increase in sales volume.

Example 2
Find the correlation coefficient between age and playing habits of the following
students using Karl Pearson’s coefficient of correlation method.

Solution:
To find the correlation between age and playing habits, we first compute the percentage of students who play regularly:

Percentage of playing habit = (No. of regular players / Total no. of students) × 100

Let the ages of the students be the variable X and the percentages of playing habit be the variable Y.
Example 3
Find Karl Pearson’s coefficient of correlation between capital employed and profit
obtained from the following data.
Example 4
The coefficient of correlation between X and Y is 0.3, their covariance is 9, and the variance of X is 16. Find the standard deviation of the Y series.
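Example 4 follows directly by rearranging r = Cov(X, Y) / (σX · σY); a quick check:

```python
r = 0.3        # given: coefficient of correlation
cov_xy = 9.0   # given: covariance of X and Y
var_x = 16.0   # given: variance of X

sigma_x = var_x ** 0.5            # standard deviation of X = 4
sigma_y = cov_xy / (r * sigma_x)  # rearranged from r = cov / (sigma_x * sigma_y)
# sigma_y = 9 / (0.3 * 4) = 7.5
```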
Regression
Regression analysis is a statistical method that examines the relationship between one or more independent variables and a dependent variable. It is commonly used for prediction and for understanding the strength and nature of relationships between variables.
Types of Regression
Simple Regression: Involves one independent variable predicting a dependent variable, such as predicting someone's height based solely on their age.
Multiple Regression: Involves multiple independent variables predicting a dependent variable, such as predicting someone's weight from age, height, and maybe even daily pizza intake.
Linear Regression: Assumes a linear relationship between variables, meaning the change in the dependent variable is proportional to the change in the independent variable(s). Think of a straight line on a graph.
Non-Linear Regression: In contrast, acknowledges a non-linear relationship. The relationship might be curved, like a sine wave or a parabola, making it trickier to model.
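A minimal sketch of simple linear regression fitted by ordinary least squares, in pure Python. The data points (age predicting height) are hypothetical and chosen to lie exactly on a line so the fit is easy to verify by hand:

```python
def fit_line(xs, ys):
    """Ordinary least squares for the simple linear regression y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # the fitted line passes through the point of means
    return a, b

# Hypothetical data: age (x, years) predicting height (y, cm).
a, b = fit_line([4, 6, 8, 10], [100, 112, 124, 136])
# The data is perfectly linear, so the fit recovers y = 76 + 6x exactly.
predicted = a + b * 12
```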
Supervised & Unsupervised Learning

Introduction

Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. Supervised learning and unsupervised learning are the two main types of machine learning.

17/02/2025 2
Supervised Learning
Supervised learning is a form of ML in which the model is trained to
associate input data with specific output labels, drawing from labeled
training data.
Here, the algorithm is furnished with a dataset containing input features
paired with corresponding output labels.
The model's objective is to discern the correlation between input features
and output labels, enabling it to provide precise predictions or
classifications when confronted with unseen data.


For example, a labeled dataset of images of elephants, camels, and cows would have each image tagged as “Elephant”, “Camel”, or “Cow”.


Supervised learning involves training a machine from labeled data.

Labeled data consists of examples with the correct answer or
classification.

The machine learns the relationship between inputs (fruit images) and
outputs (fruit labels).

The trained machine can then make predictions on new, unlabeled data.
Example: Suppose you have a basket of fruit you want to identify. The machine first analyzes each image to extract features such as shape, color, and texture. It then compares these features to the features of the fruits it has already learned about. If the new image's features are most similar to those of an apple, the machine predicts that the fruit is an apple.
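The compare-to-learned-examples idea above can be sketched as a nearest-neighbour classifier. The feature vectors (roundness, redness, smoothness on a 0–1 scale) and labels below are invented for illustration, not taken from the text:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical labeled training data: (roundness, redness, smoothness) -> label.
training = [
    ((0.9, 0.8, 0.9), "apple"),
    ((0.3, 0.1, 0.9), "banana"),
    ((0.8, 0.9, 0.2), "strawberry"),
]

def classify(features):
    """Predict the label of the closest labeled example (1-nearest neighbour)."""
    return min(training, key=lambda item: dist(item[0], features))[1]

label = classify((0.85, 0.75, 0.95))  # round, red, smooth -> most apple-like
```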
Types of Supervised Learning

Classification
In classification tasks, the model predicts a discrete class label or category.
For example, it classifies emails as spam or not based on features like
keywords and sender information.

Regression
In regression tasks, the model anticipates a continuous value or quantity.
For instance, it forecasts house prices by considering features such as
square footage, number of bedrooms, and location.

1. Regression
Regression is a type of supervised learning that is used to predict
continuous values, such as house prices, stock prices, or customer
churn. Regression algorithms learn a function that maps from the input
features to the output value.

Some common regression algorithms include:

Linear Regression

Polynomial Regression

Support Vector Machine Regression

Decision Tree Regression

Random Forest Regression
2- Classification

Classification is a type of supervised learning that is used to predict
categorical values, such as whether a customer will churn or not,
whether an email is spam or not, or whether a medical image shows a
tumor or not. Classification algorithms learn a function that maps from
the input features to a probability distribution over the output classes.

Some common classification algorithms include:

Logistic Regression

Support Vector Machines

Decision Trees

Random Forests

Naive Bayes
Applications of Supervised learning
1. Spam Filtering: Identify and classify spam emails based on their content, helping users avoid unwanted messages.

2. Image Classification: Classify images into different categories, such as animals, objects, or scenes, facilitating tasks like image search, content moderation, and image-based product recommendations. Facebook's facial recognition feature uses supervised learning to identify people in photos.

3. Medical Diagnosis: Assist in medical diagnosis by analyzing patient data, such as medical images, test results, and patient history, to identify patterns that suggest specific diseases or conditions.

4. Fraud Detection: Analyze financial transactions and identify patterns that indicate fraudulent activity, helping financial institutions prevent fraud and protect their customers.

5. Natural Language Processing (NLP): Supervised learning plays a crucial role in NLP tasks, including sentiment analysis, machine translation, and text summarization, enabling machines to understand and process human language effectively.

6. Speech Recognition: Virtual assistants like Siri and Alexa use supervised
learning to recognize spoken words and phrases.

Advantages

Supervised learning allows collecting data and producing outputs from previous experience.

It helps optimize performance criteria using that experience.

It helps solve various types of real-world computation problems.

It performs both classification and regression tasks.

It allows estimating or mapping results to new samples.

We have complete control over the number of classes we want in the training data.

Disadvantages

Classifying big data can be challenging.

Training a supervised model requires a lot of computation time.

Supervised learning cannot handle all complex tasks in machine learning.

It requires a labelled data set.

It requires a training process.

Unsupervised Learning
Unsupervised learning is a type of machine learning that learns from
unlabeled data. This means that the data does not have any pre-existing
labels or categories.
The goal of unsupervised learning is to discover patterns and relationships
in the data without any explicit guidance.
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. The task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.


Unsupervised learning allows the model to discover patterns and relationships in
unlabeled data.

Clustering algorithms group similar data points together based on their inherent
characteristics.

Feature extraction captures essential information from the data, enabling the model to
make meaningful distinctions.

Label association assigns categories to the clusters based on the extracted patterns and
characteristics.
Example: Suppose the model is given an image containing both dogs and cats that it has never seen before. The machine has no idea of the features of dogs and cats, so it cannot label them as "dogs" and "cats". But it can group the pictures according to their similarities, patterns, and differences: the set can easily be split into two parts, one containing all the pictures with dogs and the other all the pictures with cats.
Types of Unsupervised Learning
Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
Common clustering algorithms include K-means clustering, hierarchical
clustering, DBSCAN, and Gaussian mixture models (GMM).
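A minimal K-means sketch (Lloyd's algorithm) on 1-D data in pure Python; the data points and starting centers are hypothetical:

```python
def kmeans_1d(points, centers, iterations=10):
    """K-means (Lloyd's algorithm) on 1-D data."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups, e.g. small vs. large purchase amounts (hypothetical).
centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0])
```

The algorithm converges here to centers 2.0 and 11.0, the means of the two natural groups.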
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as "people who buy X also tend to buy Y."
The most well-known algorithm for association rule learning is Apriori, which is used for market basket analysis. Association rule learning is commonly applied in retail to analyze purchasing patterns, identify frequently co-occurring items, and make recommendations.
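The core idea of market basket analysis can be sketched by counting how often item pairs co-occur across transactions; full Apriori prunes this search with its frequent-itemset property, but the counting core looks like this (the baskets are invented for illustration):

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (market baskets).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]

# Count the support (number of baskets) of every item pair.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs meeting a minimum support threshold become candidate rules ("X with Y").
frequent = {pair: n for pair, n in pair_counts.items() if n >= 2}
```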
Applications of Unsupervised Learning
1. Customer Segmentation: Online retailers use unsupervised learning to segment
customers based on their buying behavior.
2. Recommendation Systems: Netflix uses unsupervised learning to recommend
movies and TV shows based on user viewing history.
3. Anomaly Detection: Credit card companies use unsupervised learning to detect
unusual transaction patterns that may indicate fraud.
4. Scientific discovery: Unsupervised learning can uncover hidden relationships and
patterns in scientific data, leading to new hypotheses and insights in various scientific
fields.
5. Image analysis: Unsupervised learning can group images based on their content,
facilitating tasks such as image classification, object detection, and image retrieval.

Advantages


It does not require training data to be labeled.

Dimensionality reduction can be easily accomplished using unsupervised
learning.

Capable of finding previously unknown patterns in data.

Unsupervised learning can help you gain insights from unlabeled data that you
might not have been able to get otherwise.

Unsupervised learning is good at finding patterns and relationships in data
without being told what to look for. This can help you learn new things about
your data.

Disadvantages

Difficult to measure accuracy or effectiveness due to the lack of predefined answers during training.

The results are often less accurate.

The user needs to spend time interpreting and labeling the classes that result from the clustering.

Unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy data.

Without labeled data, it can be difficult to evaluate the performance of unsupervised learning models, making it challenging to assess their effectiveness.

Supervised Vs. Unsupervised Learning

