program-2

The document outlines a program aimed at performing exploratory data analysis (EDA) on datasets with at least two numerical columns, focusing on statistical techniques and visualizations to understand variable relationships. Key tasks include loading datasets, creating scatter plots, calculating Pearson correlation coefficients, and visualizing correlation matrices through heatmaps. The program is designed to aid users in gaining insights into their data prior to applying advanced machine learning techniques.

Uploaded by

Kasi Lingamn


Practical Insights into Data Analysis and Machine Learning

PROGRAM - 2

Develop a program to load a dataset with at least two numerical columns
(e.g., Iris, Titanic). Plot a scatter plot of two variables and calculate their
Pearson correlation coefficient. Write a program to compute the covariance
and correlation matrices for the dataset. Visualize the correlation matrix using
a heatmap to see which variables have strong positive or negative
correlations.

Objective

To analyze the relationships between numerical variables in a dataset using
statistical and visualization techniques.

2. Introduction
In the field of data analysis and machine learning, understanding the relationships between
variables in a dataset is a critical step. This program is designed to perform exploratory data
analysis (EDA) on a dataset with at least two numerical columns. By leveraging statistical
techniques and visualization tools, the program aims to uncover patterns, relationships, and insights
hidden within the data.
This program is particularly useful for people who want to gain insights into their data before
applying more advanced techniques like machine learning or predictive modeling. By combining
statistical analysis with visualizations, the program provides a comprehensive understanding of the
dataset's structure and relationships, enabling informed decision-making and hypothesis
generation. The program focuses on the following key tasks: loading a dataset, visualizing the
relationship between two variables with a scatter plot, calculating their Pearson correlation
coefficient, computing the covariance and correlation matrices, and visualizing the correlation
matrix as a heatmap.

2.1 Statistical Concepts

Correlation Analysis

• Correlation analysis is a powerful tool for exploring relationships between variables in a
dataset. It reveals patterns, strengths, and directions of associations, providing insights that
drive further analysis and informed decision-making.
• If one variable increases while the other also increases, it indicates a Positive Correlation.
For example, there's likely a positive correlation between the number of hours studied and
exam scores.
• If one variable increases while the other decreases, it signifies a Negative Correlation. For
example, there might be a negative correlation between the amount of time spent watching TV
and physical activity levels.
• If changes in one variable are not associated with changes in the other, there is No
Correlation. For example, there might be little to no correlation between shoe size and IQ.

Pearson’s correlation
• There are various correlation coefficients, but the most widely used is Pearson’s
correlation (also known as Pearson’s r). When correlation is mentioned without
specifying the type, it typically refers to Pearson’s r. However, it's important to note that
Pearson’s correlation applies only to numerical data; detecting relationships in categorical
data requires other techniques.
• Pearson’s correlation measures linear relationships. If the relationship between the variables
is non-linear (e.g., curved), the coefficient might be close to zero even if there's
a strong relationship.
• Outliers can significantly affect the correlation coefficient. A single outlier can make a weak
correlation appear strong or vice versa.

• Correlation values always fall between -1 and +1:
► -1 indicates a perfect negative linear correlation.
► +1 indicates a perfect positive linear correlation.
► 0 means no linear correlation; the variables may still be related in a non-linear way.
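The two caveats above (linearity and outliers) can be seen on synthetic data. The following is a minimal sketch, assuming NumPy is installed; the helper name `pearson_r`, the random seed, and the outlier value (50, 50) are illustrative choices, not part of the original program.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# A symmetric curve: y is fully determined by x, but not linearly.
x = np.linspace(-3.0, 3.0, 61)
r_curve = pearson_r(x, x ** 2)

# Thirty points of independent noise, then the same points plus one extreme outlier.
rng = np.random.default_rng(0)
a = rng.normal(size=30)
b = rng.normal(size=30)
r_clean = pearson_r(a, b)
r_outlier = pearson_r(np.append(a, 50.0), np.append(b, 50.0))

print(f"curve y = x^2:       r = {r_curve:.3f}")
print(f"independent noise:   r = {r_clean:.3f}")
print(f"noise + one outlier: r = {r_outlier:.3f}")
```

The curved relationship yields a coefficient of essentially zero, while a single extreme point pushes the coefficient for otherwise unrelated data close to +1. This is why a scatter plot is worth inspecting before trusting the number.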

The following table can help interpret correlation values effectively:

Range            Meaning
0.7 to 1.0       Strong positive correlation
0.3 to 0.7       Weak positive correlation
−0.3 to 0.3      Negligible correlation
−0.7 to −0.3     Weak negative correlation
−1.0 to −0.7     Strong negative correlation
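The table can be turned into a small lookup helper. This is a sketch in plain Python; the function name `interpret_correlation` is hypothetical, and because the table's ranges share their end points, the sketch assumes a boundary value (for example 0.7 or -0.3) belongs to the stronger category.

```python
def interpret_correlation(r: float) -> str:
    """Map a correlation coefficient to the label used in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError(f"correlation must lie in [-1, 1], got {r!r}")
    if r >= 0.7:
        return "Strong positive correlation"
    if r >= 0.3:
        return "Weak positive correlation"
    if r > -0.3:
        return "Negligible correlation"
    if r > -0.7:
        return "Weak negative correlation"
    return "Strong negative correlation"


for value in (0.8, 0.5, 0.1, -0.3, -0.9):
    print(f"{value:+.1f} -> {interpret_correlation(value)}")
```

These thresholds are a rule of thumb; what counts as "strong" varies between fields and sample sizes.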

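Formula for Pearson’s Correlation Coefficient

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Where:
► $x_i$ and $y_i$ are the individual values of the two variables.
► $\bar{x}$ and $\bar{y}$ are the means of the two variables.
► $n$ is the number of observations.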
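Pearson's r is the sum of the products of each pair's deviations from the means, divided by the square root of the product of the two sums of squared deviations. A minimal sketch of that computation in plain Python follows; the function name `pearson` and the hours/scores values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson's r computed directly from deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
print(f"r = {pearson(hours, scores):.3f}")
```

In practice a library routine such as `numpy.corrcoef` or `pandas.Series.corr` would normally be used; the hand-written version only makes the arithmetic visible.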

Correlation Matrix
• A correlation matrix is a table that shows the pairwise correlation coefficients between
multiple variables in a dataset. It is a square matrix where each element represents the
correlation between two variables. Correlation matrices are widely used in data analysis to
understand the relationships between variables.
• The matrix is symmetric because the correlation between variable A and variable B is the
same as the correlation between variable B and variable A.

      A      B      C
A     1      0.8   -0.3
B     0.8    1      0.1
C    -0.3    0.1    1

► A and B have a strong positive correlation (0.8).

► A and C have a weak negative correlation (-0.3).
► B and C have almost no correlation (0.1).
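A correlation matrix of this kind can be computed with `DataFrame.corr`. The sketch below assumes pandas and NumPy are installed; the columns A, B, and C are synthetic, with B constructed to follow A and C generated independently, so the exact values will differ from the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
a = rng.normal(size=n)

df = pd.DataFrame({
    "A": a,
    "B": 0.8 * a + 0.6 * rng.normal(size=n),  # built to track A
    "C": rng.normal(size=n),                  # generated independently of A and B
})

corr = df.corr(method="pearson")
print(corr.round(2))

# Properties described above: symmetric, with ones on the diagonal.
print("symmetric:", np.allclose(corr.values, corr.values.T))
print("unit diagonal:", np.allclose(np.diag(corr.values), 1.0))
```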

Covariance

• Covariance is a statistical measure that describes the extent to which two variables change
together. It indicates the direction of the linear relationship between two variables.
However, unlike the correlation coefficient, covariance is not standardized, so its value can
range from negative infinity to positive infinity, making it difficult to interpret the strength
of the relationship.
► Positive Covariance: Indicates that as one variable increases, the other tends to increase.
► Negative Covariance: Indicates that as one variable increases, the other tends to decrease.
► Zero Covariance: Indicates no linear relationship between the variables.

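Formula for Covariance

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Where $x_i$ and $y_i$ are the individual values, $\bar{x}$ and $\bar{y}$ are the means of X and Y, and $n$ is the number of observations. Dividing by $n$ gives the population covariance, which is the form the example below uses; the sample covariance divides by $n - 1$ instead.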

Example: Consider the following dataset

X Y
1 2
2 3
3 4
4 5
5 6

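The mean of X is 3 and the mean of Y is 4, and the products of the deviations from these means sum to 10. Dividing by n = 5 gives a covariance of 2, indicating a positive relationship between X and Y. (Dividing by n − 1 = 4 instead, as for a sample covariance, gives 2.5.)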
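The example can be checked numerically. This is a short sketch assuming NumPy is installed; it also shows that `numpy.cov` divides by n − 1 unless `ddof=0` is passed, so the default result differs from the value quoted above.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4, 5, 6], dtype=float)

# Population covariance: mean of the products of deviations (divide by n).
manual = float(np.mean((x - x.mean()) * (y - y.mean())))

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(X, Y).
cov_population = float(np.cov(x, y, ddof=0)[0, 1])  # divide by n
cov_sample = float(np.cov(x, y)[0, 1])              # default: divide by n - 1

print(manual, cov_population, cov_sample)  # 2.0 2.0 2.5
```

pandas follows the same default: `DataFrame.cov()` normalizes by n − 1, so results from a library call should be compared against the sample value rather than the population value.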


Differences Between Covariance and Correlation

Aspect            Covariance                               Correlation
Range             −∞ to +∞                                 −1 to +1
Scale             Depends on the units of the variables    Standardized (unitless)
Interpretation    Direction of relationship only           Direction and strength in a standardized way
Use Case          Less common in practice                  Widely used for analysis
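The "Scale" row can be illustrated by changing the units of one variable. The sketch below assumes NumPy is installed; the height and weight values are invented for the example.

```python
import numpy as np

# Hypothetical measurements for five people.
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 66.0, 75.0, 85.0])
height_cm = height_m * 100.0  # same data, different unit

cov_m = float(np.cov(height_m, weight_kg)[0, 1])
cov_cm = float(np.cov(height_cm, weight_kg)[0, 1])
r_m = float(np.corrcoef(height_m, weight_kg)[0, 1])
r_cm = float(np.corrcoef(height_cm, weight_kg)[0, 1])

print(f"covariance  (m, kg): {cov_m:.3f}   (cm, kg): {cov_cm:.3f}")
print(f"correlation (m, kg): {r_m:.3f}   (cm, kg): {r_cm:.3f}")
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is easier to compare across variable pairs.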

2.2 Data Visualization

2.2.1 Scatter plot

• A scatter plot is a type of data visualization that displays the relationship between two
numerical variables.
• Each point on the plot represents an observation in the dataset, with the x-axis representing
one variable and the y-axis representing the other. Scatter plots are useful for identifying
patterns, trends, and correlations between the two variables.
• A scatter plot helps determine whether there is a relationship (positive, negative, or no
correlation) between the two variables. For example, if the points tend to rise from left to
right, it suggests a positive correlation. If they fall from left to right, it suggests a negative
correlation. If there is no clear pattern, it suggests no correlation.
• Scatter plots can reveal outliers, which are data points that deviate significantly from the
overall pattern.
• A scatter plot can help determine whether the relationship between the two variables is
linear or nonlinear.
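A scatter plot with the Pearson coefficient in its title can be drawn along these lines. This is a sketch assuming Matplotlib and NumPy are installed; the data are synthetic (hours studied against exam score), and the output file name `scatter.png` is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to get an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: scores rise with hours studied, plus random noise.
rng = np.random.default_rng(1)
hours = rng.uniform(0, 10, size=50)
scores = 40 + 5 * hours + rng.normal(scale=6, size=50)

r = float(np.corrcoef(hours, scores)[0, 1])

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(hours, scores, alpha=0.7)  # one point per observation
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title(f"Hours studied vs. exam score (Pearson r = {r:.2f})")
fig.tight_layout()
fig.savefig("scatter.png", dpi=150)
```

The points rise from left to right, matching the positive coefficient shown in the title.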

2.2.2 Heatmap

• A heatmap is a graphical representation of data where individual values are represented
using colors. It is particularly useful for visualizing large matrices, making it easy to identify
patterns, relationships, and trends.
• A correlation matrix contains correlation coefficients between multiple variables. Using a
heatmap, we can visually interpret these correlations:

• Strong positive correlation (close to +1) → Darker warm colors (e.g., red/orange).
• Strong negative correlation (close to -1) → Darker cool colors (e.g., blue).
• No correlation (around 0) → Neutral colors (e.g., white/light yellow).

This helps quickly identify which variables are strongly or weakly correlated.

• Interpreting a Correlation Heatmap
► Diagonal Elements: Always 1 (a variable is perfectly correlated with itself).
► Off-Diagonal Elements: Show the correlation between different variables. On a red-to-blue
color scale, dark red indicates a strong positive correlation, dark blue a strong negative
correlation, and light colors a weak or no correlation.
► Symmetry: The matrix is symmetric about the diagonal, so the upper and lower triangles
show the same values.
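A correlation heatmap can be drawn with Matplotlib alone. The sketch below assumes Matplotlib and NumPy are installed and reuses the 3×3 example matrix for A, B, and C from the Correlation Matrix section; the `coolwarm` colormap and the file name `heatmap.png` are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to get an interactive window
import matplotlib.pyplot as plt
import numpy as np

labels = ["A", "B", "C"]
corr = np.array([
    [1.0, 0.8, -0.3],
    [0.8, 1.0, 0.1],
    [-0.3, 0.1, 1.0],
])

fig, ax = plt.subplots(figsize=(4, 3.5))
# Fix the color limits to [-1, 1] so that 0 maps to the neutral middle color.
image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(list(range(len(labels))))
ax.set_xticklabels(labels)
ax.set_yticks(list(range(len(labels))))
ax.set_yticklabels(labels)

# Write each coefficient inside its cell.
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{corr[i, j]:.1f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson r")
ax.set_title("Correlation heatmap")
fig.tight_layout()
fig.savefig("heatmap.png", dpi=150)
```

If seaborn is installed, `seaborn.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)` is a common shorter alternative that produces a similar annotated figure.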

2.3 Program
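A minimal sketch of the computational part of the task (the Pearson coefficient for one pair of columns, the covariance matrix, the correlation matrix, and the strongly correlated pairs), assuming pandas and NumPy are installed. The function name `analyze_relationships`, the 0.7 threshold (taken from the interpretation table), and the synthetic demo columns are illustrative choices, not the original listing.

```python
import numpy as np
import pandas as pd


def analyze_relationships(df, x_col, y_col, strong=0.7):
    """Summarize linear relationships among the numeric columns of df."""
    numeric = df.select_dtypes(include="number")  # drop text/categorical columns
    if numeric.shape[1] < 2:
        raise ValueError("need at least two numerical columns")

    pearson_r = float(numeric[x_col].corr(numeric[y_col]))  # Pearson by default
    covariance = numeric.cov()    # sample covariance (divides by n - 1)
    correlation = numeric.corr()  # pairwise Pearson coefficients

    # Collect each pair once (above the diagonal) whose |r| reaches the threshold.
    cols = list(correlation.columns)
    strong_pairs = [
        (cols[i], cols[j], float(correlation.iloc[i, j]))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if abs(correlation.iloc[i, j]) >= strong
    ]
    return {
        "pearson_r": pearson_r,
        "covariance": covariance,
        "correlation": correlation,
        "strong_pairs": strong_pairs,
    }


# Synthetic demo data: width follows length, noise is unrelated, label is text.
rng = np.random.default_rng(7)
length = rng.normal(5.0, 1.0, size=100)
demo = pd.DataFrame({
    "length": length,
    "width": 0.5 * length + rng.normal(0.0, 0.2, size=100),
    "noise": rng.normal(size=100),
    "label": ["a", "b"] * 50,
})

result = analyze_relationships(demo, "length", "width")
print(f"Pearson r (length, width): {result['pearson_r']:.3f}")
print(result["covariance"].round(3))
print(result["correlation"].round(3))
print("Strong pairs:", result["strong_pairs"])
```

To run this on Iris instead of the synthetic frame, one option (assuming scikit-learn is installed) is `sklearn.datasets.load_iris(as_frame=True).frame`, which returns the measurements as a DataFrame that can be passed to the same function.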

Viva Questions

Statistical Concepts:
• What is correlation analysis, and why is it important?
• What is the difference between positive, negative, and no correlation?
• What is Pearson's correlation coefficient, and what does it measure?
• How does Pearson's correlation handle non-linear relationships?
• What is a correlation matrix, and how is it useful?
• What is covariance, and how is it different from correlation?
• What are the key differences between covariance and correlation?

Data Visualization:
• What is a scatter plot, and what information does it provide?
• How can you interpret a scatter plot to determine the relationship between two variables?
• What is a heatmap, and how is it used in data analysis?
• How do you interpret a correlation heatmap?

Advanced Questions:
• How would you handle outliers when calculating correlation coefficients?
• What are some alternatives to Pearson's correlation for non-linear relationships?
• How would you interpret a correlation coefficient of 0.5?
• What are the limitations of using a heatmap for visualizing correlations?
• How would you determine if a correlation is statistically significant?
• What is the difference between a correlation matrix and a covariance matrix?
• How would you use the insights from a correlation matrix in a machine learning model?
