Machine Learning

Louis Fippo Fitime

April 20, 2022

Louis Fippo Fitime Machine Learning April 20, 2022 1 / 21

Exploratory Data Analysis
Exploratory Data Analysis (EDA) is about analyzing the dataset and
summarizing its key insights and characteristics.
EDA is one of the first steps in a Data Science project, carried out to
understand the data better.

Checklist for EDA

1 Checking the different features present in the dataset and its shape
2 Checking the data type of each column
3 Encoding the labels for classification problems
4 Checking for missing values
5 Descriptive summary of the dataset
6 Checking the distribution of the target variable
7 Grouping the data based on the target variable


Exploratory Data Analysis (2)

Checklist for EDA

1 Distribution plot for all the columns

2 Count plot for Categorical columns
3 Pair plot
4 Checking for Outliers
5 Correlation matrix
6 Inference from EDA


Import dataset (Breast Cancer Wisconsin (Diagnostic) Data Set)

The following Python code imports the dependencies and the data.

# Import the dependencies

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the dataset (replace the placeholder path with the location of data.csv)
breast_cancer_data = pd.read_csv('/path_to_data/data.csv')


Checking the different features present in the dataset

For this, we can use the head() method in pandas.

# Display the first five rows of the dataset

breast_cancer_data.head()

Checking the shape of the dataset:

# Return the (number of rows, number of columns) of the dataset

breast_cancer_data.shape


Checking the data type of each column and the non-null count

For this, we can use the info() method in pandas.

# Display the data type and non-null count of each column

breast_cancer_data.info()

Comment on the type of each variable...


Encoding the labels for classification problems

This example encodes the “diagnosis” column of our dataset so that
all the columns are in numerical format. “B” is encoded as 0 and
“M” as 1.

# Encode the "diagnosis" column as integers (B -> 0, M -> 1)

label_encode = LabelEncoder()
labels = label_encode.fit_transform(breast_cancer_data['diagnosis'])
breast_cancer_data['target'] = labels
# Drop the identifier and the original label column
breast_cancer_data.drop(columns=['id', 'diagnosis'], inplace=True)

Here, we encode the “diagnosis” column, store the result in a new column called “target”,
and remove the original “diagnosis” column. We also remove the “id” column, as it is not
needed for the analysis.
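
LabelEncoder assigns integers to the sorted class labels, which is why “B” becomes 0 and “M” becomes 1. As a minimal sketch of an alternative (not the method used in these slides), the same encoding can be written as an explicit mapping, which makes the intended codes visible in the code itself. The toy labels and the names `label_map` and `target` below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy labels standing in for the "diagnosis" column.
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# Explicit mapping: benign -> 0, malignant -> 1.
label_map = {"B": 0, "M": 1}
target = diagnosis.map(label_map)

# Any label missing from the mapping becomes NaN, so fail loudly if that happens.
assert target.notna().all(), "unexpected label in the diagnosis column"

target = target.astype(int)
print(target.tolist())  # [1, 0, 0, 1, 0]
```

With an explicit mapping, an unexpected label is detected immediately instead of silently receiving a new integer code.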


Checking for missing values

Now, let’s check whether there are any missing values in the dataset.

# Count the missing values in each column

breast_cancer_data.isnull().sum()

Comment on the results...
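
To make the output of isnull().sum() easier to read on a wide table, one option is to keep only the columns that actually contain gaps. The sketch below uses a small made-up frame (the values are hypothetical, only the column names echo the dataset) because the real dataset is reported to have no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing entries.
toy = pd.DataFrame({
    "radius_mean": [14.2, np.nan, 20.1, 11.8],
    "texture_mean": [19.5, 17.0, np.nan, np.nan],
    "target": [0, 0, 1, 0],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()

# Keep only the columns that have at least one missing value.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)

# Total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

If `columns_with_gaps` is empty, the dataset has no missing values and no imputation step is needed.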


Descriptive summary of the dataset

The next step is to get some statistical measures about the dataset. This
is called “Descriptive Statistics”, a summarization of the
data. For this, we can use the describe() method in pandas.

# Compute summary statistics (count, mean, std, min, quartiles, max) for each column

breast_cancer_data.describe()

Comment on the results...
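
One way to read the describe() table is to compare the mean with the median (the “50%” row): a mean noticeably above the median hints at a right-skewed column. The following is a small sketch on made-up data; the column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy frame: one symmetric column and one right-skewed column.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 14.0],
})

summary = toy.describe()

# The "50%" row of describe() is the median of each column.
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]
print(mean_minus_median)
```

A difference close to zero suggests a roughly symmetric column, while a clearly positive difference suggests a long right tail.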


Checking the distribution of the target variable

The next step is to check the distribution of the target variable to see
whether the classes are imbalanced. This step is specific to
classification problems.

# Count the number of samples in each class of the target variable

breast_cancer_data['target'].value_counts()

Comment on the results...
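
Raw counts can be complemented with class proportions and a simple majority-to-minority ratio to judge how strong the imbalance is. The sketch below uses a made-up “target” column, so the numbers are illustrative and are not the results for the breast cancer dataset.

```python
import pandas as pd

# Hypothetical toy target: six benign (0) and four malignant (1) samples.
target = pd.Series([0] * 6 + [1] * 4, name="target")

# Absolute number of samples per class.
counts = target.value_counts()

# Share of each class in the dataset.
proportions = target.value_counts(normalize=True)

# Ratio of the largest class to the smallest class (1.0 means perfectly balanced).
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions)
print(imbalance_ratio)
```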


Grouping the data based on target variable

The next step is to group the dataset by the target variable and compare
the mean of each feature between the two classes (0 = Benign,
1 = Malignant).

# Mean of every feature for each class of the target variable

breast_cancer_data.groupby('target').mean()

Comment on the results...
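
Grouping by the target and taking the mean of every feature gives one row per class, which makes the classes easy to compare. A minimal sketch on made-up values follows; the numbers are hypothetical and only the column names echo the dataset.

```python
import pandas as pd

# Hypothetical toy frame with two features and a binary target.
toy = pd.DataFrame({
    "radius_mean": [12.0, 13.0, 11.0, 18.0, 20.0],
    "area_mean": [450.0, 520.0, 380.0, 1000.0, 1200.0],
    "target": [0, 0, 0, 1, 1],
})

# One row per class, one column per feature.
group_means = toy.groupby("target").mean()
print(group_means)

# Difference between the class means (class 1 minus class 0) for each feature.
mean_gap = group_means.loc[1] - group_means.loc[0]
print(mean_gap)
```

A positive entry in `mean_gap` means the feature is larger on average for class 1 than for class 0.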


Inferences

The dataset has 569 rows and 32 columns.

There are no missing values in the dataset.
The data is right skewed for most of the features.
There is a slight imbalance in the dataset (Benign cases are more
numerous than Malignant cases).
The mean value of most of the features is greater for Malignant
cases than for Benign cases.


Exploratory Data Analysis (2)

Checklist for EDA

1 Distribution plot for all the columns

2 Count plot for Categorical columns
3 Pair plot
4 Checking for Outliers
5 Correlation matrix
6 Inference from EDA



Importing the Data Visualization Libraries

# Import the data visualization libraries

import matplotlib.pyplot as plt
import seaborn as sns

Matplotlib and Seaborn are the two main data visualization libraries in Python. There are also
other libraries such as Plotly and ggplot.

Comment on the results...


Count plot for Categorical columns

This dataset contains only one categorical variable (“target”) with two
categories: 0 (Benign) and 1 (Malignant). A categorical
variable is plotted in a count plot, whereas a numerical
variable is plotted in a distribution plot.

# Draw a count plot for the categorical column

sns.countplot(x='target', data=breast_cancer_data)

A count plot shows the total count in each category. As we can clearly see, the number of data
points with label “0” is higher than with label “1”. This means that we have more Benign cases
than Malignant cases in the dataset, so we can say that this dataset is slightly imbalanced.


Distribution plot for all columns

Now we can build a distribution plot for each of the other columns, as they contain
numerical values. A distribution plot tells us whether the data is normally
distributed or whether there is some skewness in the data.

# Draw a distribution plot for every numerical feature column

for column in breast_cancer_data.drop(columns='target'):
    sns.displot(x=column, data=breast_cancer_data)

When the skewness in the data is large, we may need to apply some transformations in order to
get better results from the Machine Learning models once we train them.
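
As an illustration of such a transformation, a logarithm is a common choice for non-negative, right-skewed features. The sketch below measures skewness with the pandas skew() method before and after applying np.log1p. The values are made up, and the choice of log1p is an assumption rather than a step prescribed by these slides.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature: mostly small values with a long right tail.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 20.0, 60.0])

# Sample skewness before the transformation (positive means right-skewed).
skew_before = values.skew()

# log1p computes log(1 + x), which compresses large values and is defined at x = 0.
transformed = np.log1p(values)

# Sample skewness after the transformation.
skew_after = transformed.skew()

print(skew_before, skew_after)
```

On this toy series the skewness stays positive but drops noticeably, which is the effect the transformation is meant to achieve.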


Pair plot

A pair plot shows the pairwise relationships in a dataset.

# Draw a pair plot of the dataset

sns.pairplot(breast_cancer_data)

The idea behind a pair plot is to understand the relationship between the variables present in the
data. Alternatively, we can find this relationship using a Correlation Matrix, which we will discuss
later in these slides.


Checking for Outliers

Outlier detection is one of the important tasks that we have to do. Many
Machine Learning models, such as Regression models and K-Nearest
Neighbors, are sensitive to outliers. On the other hand, tree-based models like
Random Forest are much less affected by outliers.

# Draw one box plot per column to reveal the outliers

for column in breast_cancer_data:
    plt.figure()
    breast_cancer_data.boxplot([column])

The circles above the top whisker and below the bottom whisker represent the outliers.
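
By default, box plot whiskers extend to the most extreme points within 1.5 times the interquartile range (IQR) of the quartiles, and anything beyond is drawn as an outlier. The same rule can be applied numerically to count outliers per column. This is a minimal sketch on made-up values, and the helper name `iqr_outlier_mask` is a hypothetical one introduced here.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)


# Hypothetical toy feature with one clearly extreme value.
values = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 40])

mask = iqr_outlier_mask(values)
outliers = values[mask]
print(outliers.tolist())  # [40]
```

Applying the helper to every column and summing each mask gives the number of outliers per feature.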


Correlation matrix

Building a Correlation Matrix is an important step in Data Visualization.

The main purpose of a correlation matrix is to understand the correlation
(in other words, the relationship) between the variables present in a dataset. It
is very helpful in Feature Selection, which is carried out to keep the
important features and remove the irrelevant or redundant ones.

# Compute the correlation matrix and draw it as a heat map

correlation_matrix = breast_cancer_data.corr()
plt.figure(figsize=(20,20))
sns.heatmap(correlation_matrix, cbar=True, fmt='.1f', annot=True, cmap='Blues')
plt.savefig('Correlation Heat map')

The heat map visualizes the correlation between the variables.
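
To turn the correlation matrix into candidates for Feature Selection, one possible approach is to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below does this on made-up data. The threshold of 0.9 and the toy values are assumptions, and only the upper triangle is scanned so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: perimeter is an exact multiple of radius, smoothness is unrelated.
toy = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 16.0, 18.0],
    "perimeter_mean": [65.0, 78.0, 91.0, 104.0, 117.0],
    "smoothness_mean": [0.10, 0.08, 0.11, 0.09, 0.10],
})

# Absolute Pearson correlation between every pair of columns.
corr = toy.corr().abs()

# Keep only the strict upper triangle; the diagonal and lower triangle become NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
high_pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(high_pairs)
```

For each reported pair, one of the two features is a candidate for removal, since it carries largely the same information as the other.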


Inference from EDA Data Visualization

No missing values in the dataset.
All variables have continuous numerical values except for the target
column.
The mean is slightly greater than the median for most of the features, so
they are right skewed. This is visible in the distribution plots.
Slight imbalance in the dataset (Benign (0) cases are more numerous than
Malignant (1) cases). See the count plot.
The mean of most features is clearly larger for Malignant cases than
for Benign cases (see the groupby result).
Most of the features have outliers.
The Correlation Matrix reveals that many of the features are highly
correlated, so we can remove certain features during Feature
Selection.
