Machine_Learning_data_analysis (1)

The document provides a comprehensive overview of Exploratory Data Analysis (EDA) in the context of a breast cancer dataset. It outlines key steps in EDA, including data import, feature checks, label encoding, and visualization techniques such as distribution plots and correlation matrices. The findings indicate no missing values, a slight imbalance in the dataset, and the presence of outliers, with most features showing a right skew.

Uploaded by

Yvan Onguene

Machine Learning

Louis Fippo Fitime

April 20, 2022

Louis Fippo Fitime Machine Learning April 20, 2022 1 / 21


Exploratory Data Analysis
Exploratory Data Analysis (EDA) consists of analyzing the dataset and
summarizing the key insights and characteristics of the data.
EDA is one of the first steps in a data science project and helps us
understand the data better.

Checklist for EDA

1 Checking the different features present in the dataset & its shape
2 Checking the data type of each column
3 Encoding the labels for classification problems
4 Checking for missing values
5 Descriptive summary of the dataset
6 Checking the distribution of the target variable
7 Grouping the data based on target variable



Exploratory Data Analysis (2)

Checklist for EDA

1 Distribution plot for all the columns
2 Count plot for Categorical columns
3 Pair plot
4 Checking for Outliers
5 Correlation matrix
6 Inference from EDA



Import dataset (Breast Cancer Wisconsin (Diagnostic) Data Set)

The following Python code imports the dependencies and the data.

# Import the dependencies
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Import the dataset
breast_cancer_data = pd.read_csv('/path_to_data/data.csv')



Checking the different features present in the dataset

For this, we can use the head() function in pandas.

# Display the first five rows of the dataset
breast_cancer_data.head()

Checking the shape of the dataset:

# Check the shape (number of rows, number of columns) of the dataset
breast_cancer_data.shape



Checking the data type of each column and the non-null count

For this, we can use the info() function in pandas.

# Display the dataset info (column names, non-null counts, data types)
breast_cancer_data.info()

Comment on the type of each variable...
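
As a small illustration of what info() reports, the sketch below builds a
hypothetical three-row frame (the column names mimic the dataset, but the
values are invented) and reads the data type and the non-null count of each
column directly. It is only a toy example, not the real data.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 11.4, None],  # one missing value on purpose
})

dtypes = toy.dtypes      # data type of each column
non_null = toy.count()   # non-null values per column, as listed by info()

print(dtypes)
print(non_null)
```

A column whose non-null count is smaller than the number of rows contains
missing values, which is why info() is a useful first check.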



Encoding the labels for classification problems

This example will encode the “diagnosis” column of our dataset, so that
all the columns are in the numerical format. We will encode “B” as 0 and
“M” as 1.

# Encode the "diagnosis" column as a numerical "target" column
label_encode = LabelEncoder()
labels = label_encode.fit_transform(breast_cancer_data['diagnosis'])
breast_cancer_data['target'] = labels

# Remove the columns that are no longer needed
breast_cancer_data.drop(columns=['id', 'diagnosis'], inplace=True)

Here, we are encoding the “diagnosis” column, storing it in a different column called “target”
and removing the “diagnosis” column. We are also removing the “id” column as it is not
necessary.
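
LabelEncoder assigns its codes in sorted order of the class labels, which is
why "B" becomes 0 and "M" becomes 1. The sketch below shows the same
encoding written as an explicit pandas mapping (a swapped-in alternative,
not the method used above) on a few invented labels.

```python
import pandas as pd

# Made-up labels standing in for the "diagnosis" column.
diagnosis = pd.Series(["M", "B", "B", "M", "B"])

# Explicit mapping: Benign -> 0, Malignant -> 1
target = diagnosis.map({"B": 0, "M": 1})

# Sorted unique labels: the order in which LabelEncoder assigns codes
classes = sorted(diagnosis.unique())

print(target.tolist())
print(classes)
```

An explicit mapping makes the meaning of each code visible in the code
itself, at the cost of having to list every class by hand.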



Checking for missing values

Now, let’s check whether there are any missing values in the dataset.

# Count the missing values in each column
breast_cancer_data.isnull().sum()

Comment on the results...
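
To see what the output looks like when values are missing, here is a
minimal sketch on an invented two-column frame with one deliberately
missing entry. The column names are borrowed from the dataset; the values
are not.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the numbers are invented.
toy = pd.DataFrame({
    "radius_mean": [17.9, np.nan, 12.4],
    "texture_mean": [10.3, 17.7, 21.2],
})

missing_per_column = toy.isnull().sum()        # missing values per column
has_missing = bool(toy.isnull().values.any())  # any missing value at all?

print(missing_per_column)
print(has_missing)
```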



Descriptive summary of the dataset

The next step is to get some statistical measures about the dataset. This
is called “Descriptive Statistics”, which is a summary of the data. For
this, we can use the describe() function in pandas.

# Compute some statistical measures about the dataset
breast_cancer_data.describe()

Comment on the results...
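
One way to read the output of describe() is to compare the mean with the
median (the "50%" row): a mean clearly above the median suggests a long
right tail. The sketch below does this on a single invented column.

```python
import pandas as pd

# Made-up values with one large observation, so the right tail is long.
toy = pd.DataFrame({"area_mean": [400.0, 450.0, 500.0, 550.0, 2100.0]})

summary = toy.describe()
mean = summary.loc["mean", "area_mean"]
median = summary.loc["50%", "area_mean"]  # the 50th percentile is the median

print(mean, median)
```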



Checking the distribution of the target variable

The next step is to check the distribution of the target variable to see
whether the classes are imbalanced. This step applies only to
classification problems.

# Count the samples in each class of the target variable
# (the "diagnosis" column was replaced by "target" during encoding)
breast_cancer_data['target'].value_counts()

Comment on the results...
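
Class imbalance is easier to judge from proportions than from raw counts.
The sketch below uses value_counts(normalize=True) on a handful of invented
labels; the real proportions have to be read from the actual dataset.

```python
import pandas as pd

# Made-up labels: 0 = Benign, 1 = Malignant
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 1])

counts = target.value_counts()                # samples per class
shares = target.value_counts(normalize=True)  # fraction of samples per class

print(counts)
print(shares)
```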



Grouping the data based on target variable

The next step is to group the data by the target variable and compare the
mean value of each feature between the two classes. This step also applies
only to classification problems.

# Mean of each feature for Benign (0) and Malignant (1) cases
breast_cancer_data.groupby('target').mean()

Comment on the results...
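
Grouping by the target yields one row of summary statistics per class,
which makes the two classes directly comparable. A minimal sketch on an
invented four-row frame (feature names borrowed from the dataset, values
made up):

```python
import pandas as pd

# Toy frame; 0 = Benign, 1 = Malignant, feature values are invented.
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "radius_mean": [11.0, 13.0, 18.0, 20.0],
    "texture_mean": [16.0, 18.0, 21.0, 23.0],
})

# One row per class, one column per feature
group_means = toy.groupby("target").mean()

# Features whose mean is larger in class 1 than in class 0
larger_in_class_1 = group_means.loc[1] > group_means.loc[0]

print(group_means)
print(larger_in_class_1)
```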



Inferences

The dataset has 569 rows and 32 columns.
We don’t have any missing values in the dataset.
The data is right-skewed for most of the features.
There is a slight imbalance in the dataset (Benign cases are more
numerous than Malignant cases).
The mean value of most of the features is greater for Malignant
cases than for Benign cases.





Importing the Data Visualization Libraries

# Import the data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

Matplotlib and Seaborn are two of the main data visualization libraries in Python. Other
libraries include Plotly and ggplot.

Comment on the results...



Count plot for Categorical columns

This dataset contains only one categorical variable (“target”) with two
categories: 0 (Benign) and 1 (Malignant). A categorical variable is
plotted with a count plot, whereas a numerical variable is plotted with a
distribution plot.

# Count plot for the categorical column
sns.countplot(x='target', data=breast_cancer_data)

As we can clearly see, the number of data points with label “0” is higher than label “1”. This
means that we have more Benign cases compared to Malignant cases in the dataset. So we can
say that this dataset is slightly imbalanced. Count plot will show the total counts in each
category.



Distribution plot for all columns

Now we can build a distribution plot for each of the other columns, as they
contain numerical values. A distribution plot shows whether the data is
approximately normally distributed or whether it is skewed.

# Distribution plot for each numerical column (every column except "target")
for column in breast_cancer_data.drop(columns='target'):
    sns.displot(x=column, data=breast_cancer_data)

When the skewness in the data is large, we may need to do some transformations, in order to
get better results from the Machine Learning models once we train them.
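
The slides do not say which transformation to use. As one common option
(an assumption here, not a prescription), the sketch below applies a
log(1 + x) transformation to a few invented right-skewed values and
compares the sample skewness before and after.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values: one observation is far larger than the rest.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])

skew_before = values.skew()     # sample skewness of the raw values

transformed = np.log1p(values)  # log(1 + x) compresses large values
skew_after = transformed.skew()

print(skew_before, skew_after)
```

The transformation reduces the skewness here but does not remove it, so
the distribution plot should be checked again after transforming.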



Pair plot

A pair plot shows the pairwise relationships in a dataset.

# Pair plot of the dataset
sns.pairplot(breast_cancer_data)

The idea behind a pair plot is to understand the relationships between the variables present in
the data. Alternatively, we can examine these relationships with a correlation matrix, which is
discussed later in these slides.



Checking for Outliers

Outlier detection is one of the important tasks that we have to do. Many
machine learning models, such as regression models and K-Nearest
Neighbors, are sensitive to outliers. On the other hand, tree-based models
like Random Forest are much less affected by outliers.

# Draw one box plot per column to reveal the outliers
for column in breast_cancer_data:
    plt.figure()
    breast_cancer_data.boxplot(column=column)

The circles above the top whisker and below the bottom whisker represent the outliers.
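
By default, a Matplotlib box plot draws its whiskers out to the most
extreme points within 1.5 times the interquartile range (IQR) of the box,
and marks anything beyond that as an outlier. The sketch below applies the
same rule numerically to a few invented values.

```python
import pandas as pd

# Made-up values with one clear outlier.
values = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 40.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # points below this bound are outliers
upper = q3 + 1.5 * iqr       # points above this bound are outliers

outliers = values[(values < lower) | (values > upper)]

print(lower, upper)
print(outliers.tolist())
```

Counting the outliers per column in this way complements the box plots
when there are too many features to inspect visually.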



Correlation matrix

Building a correlation matrix is an important step in data visualization.
Its main purpose is to show the correlation (in other words, the linear
relationship) between the variables present in a dataset. It is very
helpful in feature selection, which is carried out to keep the important
features and remove the irrelevant or redundant ones.

# Compute the correlation matrix and draw it as a heat map
correlation_matrix = breast_cancer_data.corr()
plt.figure(figsize=(20,20))
sns.heatmap(correlation_matrix, cbar=True, fmt='.1f', annot=True, cmap='Blues')
plt.savefig('Correlation Heat map')

The heat map visualizes the correlation between the variables.
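
A heat map of many features can be hard to read, so it may help to list
the strongly correlated features programmatically. The sketch below is one
possible approach on an invented three-column frame; the 0.9 threshold is
an arbitrary assumption, not a value taken from the slides.

```python
import numpy as np
import pandas as pd

# Toy frame; perimeter_mean is an exact multiple of radius_mean on purpose.
toy = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 16.0, 18.0],
    "perimeter_mean": [65.0, 78.0, 91.0, 104.0, 117.0],
    "smoothness_mean": [0.10, 0.08, 0.11, 0.09, 0.10],
})

corr = toy.corr().abs()  # absolute Pearson correlations

# Keep only the upper triangle so each pair of features appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.9  # hypothetical cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print(to_drop)
```

Each column in to_drop is highly correlated with a column that precedes it,
so removing it loses little information under this criterion.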



Inference from EDA Data Visualization

There are no missing values in the dataset.
All variables have continuous numerical values except for the Target
column.
The mean is slightly larger than the median for most of the features,
which indicates right skew. This is visible in the distribution plots.
There is a slight imbalance in the dataset (Benign (0) cases are more
numerous than Malignant (1) cases). See the count plot.
The mean of most features is clearly larger for Malignant cases than
for Benign cases (see the groupby results).
Most of the features have outliers.
The correlation matrix reveals that most of the features are highly
correlated, so certain features can be removed during feature
selection.
