Machine_Learning_data_analysis (1)
1 Checking the different features present in the dataset & its shape
2 Checking the data type of each column
3 Encoding the labels for classification problems
4 Checking for missing values
5 Descriptive summary of the dataset
6 Checking the distribution of the target variable
7 Grouping the data based on target variable
This example will encode the “diagnosis” column of our dataset, so that
all the columns are in the numerical format. We will encode “B” as 0 and
“M” as 1.
Here, we are encoding the “diagnosis” column, storing it in a different column called “target”
and removing the “diagnosis” column. We are also removing the “id” column as it is not
necessary.
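The encoding step described above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the original notebook code; the "radius_mean" column and all values are assumptions standing in for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.9, 12.4, 11.8, 20.1],
})

# Encode the labels: "B" (benign) -> 0, "M" (malignant) -> 1.
df["target"] = df["diagnosis"].map({"B": 0, "M": 1})

# Drop the original label column and the identifier column.
df = df.drop(columns=["diagnosis", "id"])

print(df)
```

Using an explicit mapping makes the meaning of 0 and 1 visible in the code. Any label missing from the mapping would become NaN, which the missing-value check would then catch.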
Now, let’s check whether there are any missing values in the dataset.
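One way to do this check in pandas is to count the missing entries per column. The sketch below uses a made-up toy frame with one deliberately missing value; the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only).
df = pd.DataFrame({
    "radius_mean": [17.9, 12.4, np.nan, 20.1],
    "texture_mean": [10.4, 17.8, 21.3, 14.3],
    "target": [1, 0, 0, 1],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# True if at least one value is missing anywhere in the frame.
has_missing = bool(df.isnull().values.any())
print(has_missing)
```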
The next step is to get some statistical measures about the dataset. This
is called “Descriptive Statistics”, which is a summary of the data. For
this, we can use the describe() function in pandas.
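A minimal sketch of describe() on a small, made-up frame is shown here; the column names and values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy numeric frame (illustrative values only).
df = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 14.0, 16.0],
    "target": [0, 0, 1, 1],
})

# For numeric columns, describe() reports count, mean, std, min,
# the 25%, 50% and 75% quantiles, and max.
summary = df.describe()
print(summary)
```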
The next step is to check the distribution of the dataset based on the
target variable to see if there is an imbalance. This step applies only to
Classification problems.
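As a rough sketch, the class balance can be read off with value_counts(). The label values below are made up for illustration and do not reflect the real class counts.

```python
import pandas as pd

# Toy encoded labels (0 = benign, 1 = malignant); illustrative values only.
target = pd.Series([0, 0, 0, 1, 1, 0, 1, 0], name="target")

# Absolute number of rows per class.
counts = target.value_counts()
print(counts)

# Share of each class, which makes an imbalance easier to judge.
shares = target.value_counts(normalize=True)
print(shares)
```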
Matplotlib and Seaborn are the two main data visualization libraries in Python. There are also
other libraries like Plotly and ggplot.
As we can clearly see, there are more data points with label “0” than with label “1”. This
means that we have more Benign cases than Malignant cases in the dataset, so we can
say that this dataset is slightly imbalanced. A count plot shows the total count in each
category.
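A count plot like the one discussed above could be drawn with Seaborn's countplot(). The sketch below uses made-up labels, so the bar heights are illustrative and are not the real class counts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display; omit in a notebook
import pandas as pd
import seaborn as sns

# Toy encoded labels (0 = benign, 1 = malignant); illustrative values only.
df = pd.DataFrame({"target": [0, 0, 0, 1, 1, 0, 1, 0]})

# One bar per class, with the bar height equal to the number of rows in that class.
ax = sns.countplot(x="target", data=df)
ax.set_xlabel("target (0 = benign, 1 = malignant)")
ax.set_ylabel("count")
```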
Now we can build distribution plots for all the other columns, as they contain
numerical values. A distribution plot tells us whether the data is Normally
Distributed or whether there is some Skewness in the data.
When the skewness in the data is large, we may need to transform the data in order to
get better results from the Machine Learning models once we train them.
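Skewness can also be measured numerically with pandas' skew(). The sketch below applies a log transform as one common choice for right-skewed data. The log transform is an assumption here, since the text does not name a specific transformation, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy right-skewed feature (illustrative values only).
area = pd.Series([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0], name="area_mean")

# Positive skewness means a longer right tail.
skew_before = area.skew()

# log1p computes log(1 + x), which compresses large values and is defined at 0.
area_log = np.log1p(area)
skew_after = area_log.skew()

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

In this toy case the transformed values are almost evenly spaced, so the skewness drops close to zero.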
The idea behind a pair plot is to understand the relationships between the variables present in the
data. Alternatively, we can find these relationships using a Correlation Matrix, which we will discuss
later in this post.
Outlier detection is one of the important tasks that we have to do. Many
Machine Learning models, such as linear regression and K-Nearest
Neighbors, are sensitive to outliers. On the other hand, tree-based models
like Random Forest are much less affected by outliers.
The circles above the top whisker and below the bottom whisker represent the Outliers.
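The rule behind those circles can be sketched directly. With Matplotlib's default settings, the whiskers reach to the last data point within 1.5 times the interquartile range (IQR) of the box, and anything beyond is drawn as an outlier. The column name and values below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy feature with one extreme value (illustrative values only).
area = pd.Series([10, 12, 13, 14, 15, 16, 18, 60], name="area_mean")

# Interquartile range: the height of the box in a box plot.
q1 = area.quantile(0.25)
q3 = area.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = area[(area < lower_fence) | (area > upper_fence)]
print(outliers.tolist())

# The same rule, shown graphically.
fig, ax = plt.subplots()
ax.boxplot(area.to_numpy())
ax.set_ylabel("area_mean")
```

Whether a flagged point should be removed, capped, or kept depends on the data; the rule only marks candidates for a closer look.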
We will create a Heat Map to visualize the correlation between the variables