Data Preprocessing, Analysis, and Visualization for Building a Machine Learning Model
Last Updated: 27 Oct, 2022
In this article, we walk through data preprocessing, analysis, and visualization as preparation for building a machine learning model. Business owners and organizations use machine learning models to predict business growth, but before any model is applied, the dataset needs to be preprocessed.
So, let’s import the data and start exploring it.
Importing Libraries and Dataset
We will be using these libraries:
- Pandas for loading and analyzing the data.
- NumPy for numerical operations on arrays.
- Matplotlib and Seaborn for the plots in the visualization section.
- Scikit-learn for preprocessing, model training, and score evaluation.
Python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the dataset; the CSV file is expected in the working directory
dataset = pd.read_csv('Churn_Modelling.csv')
Now let us observe the dataset, for example by displaying its first few rows with dataset.head().
The info() method reports information about the dataset, such as each column's data type, the number of non-null entries, and the number of rows and columns.
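As a minimal sketch of inspecting a dataset with head() and info(), assuming a small hand-made DataFrame in place of the CSV file (the values are hypothetical; only the column names are borrowed from later code in this article):

```python
import io
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "France", None],
    "Gender": ["Female", "Female", None, "Female"],
    "Age": [42.0, 41.0, None, 39.0],
})

# head(n) returns the first n rows as a new DataFrame
first_rows = sample.head(2)
print(first_rows)

# info() prints to stdout by default; passing a buffer captures the text
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

In the info() report, a non-null count lower than the number of rows is an early hint that a column has missing values.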
Exploratory data analysis and visualization
To find the correlation between the features, let's plot a heatmap.
Python3
plt.figure(figsize=(12, 6))
# Correlation is only defined for numeric columns, so text columns are skipped
sns.heatmap(dataset.corr(numeric_only=True),
            cmap='BrBG',
            fmt='.2f',
            linewidths=2,
            annot=True)
plt.show()
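The numbers behind such a heatmap can also be read directly from the correlation matrix. Here is a small sketch, assuming synthetic data in which Balance is deliberately constructed to follow Age (all values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=200)

# Synthetic frame: Balance depends on Age, CreditScore does not
toy = pd.DataFrame({
    "Age": age,
    "Balance": age * 1000 + rng.normal(0, 5000, size=200),
    "CreditScore": rng.integers(350, 850, size=200),
    "Gender": rng.choice(["Female", "Male"], size=200),  # text column
})

# numeric_only=True leaves the text column out of the computation
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none; the diagonal is always 1 because every column correlates perfectly with itself.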
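We can also explore the distributions of CreditScore, Age, Balance, and EstimatedSalary. The code uses histplot with a density curve, because the older distplot function is deprecated in recent Seaborn releases.
Python3
lis = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']
plt.figure(figsize=(15, 8))
index = 1
for i in lis:
    plt.subplot(2, 2, index)
    # Histogram with a kernel density estimate drawn on top
    sns.histplot(dataset[i], kde=True)
    index += 1
plt.show()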
We can also check how many rows fall into each category of Geography and Gender.
Python3
lis2 = ['Geography', 'Gender']
plt.figure(figsize=(10, 5))
index = 1
for col in lis2:
    # Number of rows per category
    y = dataset[col].value_counts()
    plt.subplot(1, 2, index)
    plt.xticks(rotation=90)
    sns.barplot(x=list(y.index), y=y.values)
    index += 1
plt.show()
Data Preprocessing
Data preprocessing converts raw data into a clean, consistent format. Raw data often contains missing values and noise, and it may arrive as text, images, numeric values, and so on.
In other words, data preprocessing transforms unstructured data into a structured form. If unprocessed data is fed to a machine learning model for analysis or prediction, the results will be unreliable, because the data contains missing values and unwanted entries. For good predictions, the data needs to be preprocessed.
Finding Missing Values and Handling them
Let’s check whether null values are present, for example with dataset.isnull().any().
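A null check of this kind can be sketched as follows, assuming a hypothetical four-row frame with gaps placed in Geography, Gender, and Age (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in three of the four columns
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", None, "France", "Spain"],
    "Gender": ["Female", "Female", None, "Male"],
    "Age": [42.0, np.nan, 42.0, 39.0],
})

# One boolean per column: does it contain at least one null?
has_null = toy.isnull().any()
print(has_null)

# One integer per column: how many nulls does it contain?
null_counts = toy.isnull().sum()
print(null_counts)
```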
Here, True indicates that a column contains null values and False indicates that it contains none. We can observe that three columns contain null values: Geography, Gender, and Age. There are three common ways to handle them:
- Deleting rows
- Replacing null with custom values
- Replacing using Mean, Median, and Mode
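The three options above can be sketched side by side, assuming a hypothetical two-column frame (the values and the "Unknown" placeholder are illustrative choices, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one incomplete row
toy = pd.DataFrame({
    "Geography": ["France", None, "France", "Spain"],
    "Age": [40.0, np.nan, 50.0, 30.0],
})

# 1. Delete every row that contains a null value
dropped = toy.dropna()

# 2. Replace nulls with custom values chosen per column
custom = toy.fillna({"Geography": "Unknown", "Age": 0})

# 3. Replace nulls with a statistic computed from the column itself:
#    the mode for the categorical column, the mean for the numeric one
imputed = toy.copy()
imputed["Geography"] = imputed["Geography"].fillna(imputed["Geography"].mode()[0])
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())

print(dropped)
print(custom)
print(imputed)
```

Deleting rows is the simplest option but discards data, whereas the two replacement options keep every row at the cost of introducing estimated values.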
In this scenario, we replace null values with the mean and the mode.
Python3
# Assign the filled column back instead of calling fillna(..., inplace=True)
# on a selected column, which recent pandas versions warn about
dataset["Geography"] = dataset["Geography"].fillna(dataset["Geography"].mode()[0])
dataset["Gender"] = dataset["Gender"].fillna(dataset["Gender"].mode()[0])
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())
Geography and Gender are categorical columns, so we fill them with the mode (the most frequent value). Age is numeric, so we fill it with the mean.
Note: Assigning the filled column back to dataset modifies the original dataset.
Now let us check once again whether any null values remain; the same null check should find none.
Label Encoding
Label encoding converts textual categories into integer codes. Two text columns need encoding here: “Geography” and “Gender”.
Python3
le = LabelEncoder()
# fit_transform learns the categories of a column and returns their integer codes
dataset['Geography'] = le.fit_transform(dataset["Geography"])
dataset['Gender'] = le.fit_transform(dataset["Gender"])
First we create a LabelEncoder object, then we convert the textual data to integers with its fit_transform() method.
The “Geography” and “Gender” columns now hold integer values.
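To see which integer each category receives, we can inspect the fitted encoder. This is a minimal sketch on a hypothetical list of country names rather than the actual column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values for illustration
countries = ["France", "Spain", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the categories in sorted order; each position is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Note that the integer codes imply an ordering of the categories that may carry no real meaning; one-hot encoding (the OneHotEncoder class) is a common alternative when that matters for the chosen model.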
Splitting Dependent and Independent Variables
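The dataset is split into the feature matrix x and the target y, and both are converted to NumPy arrays.
Python3
# Columns at positions 3 to 12 are the features
x = dataset.iloc[:, 3:13].values
# The column at position 13 is the target; selecting it by a single position
# gives the 1-D array that scikit-learn estimators expect
y = dataset.iloc[:, 13].values
Here x holds the independent variables and y holds the dependent variable.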
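Positional slicing with iloc excludes the end position, and a slice returns a 2-D result while a single position returns a 1-D result. A small sketch, assuming a hypothetical four-column frame (names and values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one id column, two features, one target
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Age": [42, 41, 39],
    "Balance": [0.0, 83807.86, 159660.8],
    "Label": [1, 0, 1],
})

features = toy.iloc[:, 1:3].values   # positions 1 and 2; position 3 is excluded
target_2d = toy.iloc[:, 3:4].values  # slice: shape (3, 1)
target_1d = toy.iloc[:, 3].values    # single position: shape (3,)

print(features.shape, target_2d.shape, target_1d.shape)
```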
Splitting into Train and Test Dataset
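Python3
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=0)
Here we split the data into training and test sets: test_size = 0.2 holds out 20% of the rows for testing, and random_state = 0 makes the split reproducible.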
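The effect of these two parameters can be checked on a small synthetic array. The sketch below assumes 50 made-up rows, so the sizes differ from those of the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows with 2 features each, and alternating labels
features = np.arange(100).reshape(50, 2)
labels = np.arange(50) % 2

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)

# Repeating the call with the same random_state reproduces the same split
again = train_test_split(features, labels, test_size=0.2, random_state=0)
print(np.array_equal(x_test, again[1]))
```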
Feature Scaling
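Feature scaling puts the independent variables on a comparable scale. StandardScaler standardizes each feature to zero mean and unit variance.
Python3
sc = StandardScaler()
# Learn the mean and standard deviation from the training set only
x_train = sc.fit_transform(x_train)
# Reuse the training statistics on the test set; refitting on the test set
# would scale the two sets differently
x_test = sc.transform(x_test)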
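To make the scaling step concrete, here is a minimal sketch on made-up numbers, showing that transform() applies the mean and standard deviation learned from the training data, i.e. z = (x - mean) / std:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # applies the same mean_ and scale_

# Equivalent manual computation using the learned statistics
manual = (test - scaler.mean_) / scaler.scale_

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting the scaler only on the training data keeps information about the test set out of the preprocessing, so the test score better reflects performance on unseen data.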
We have successfully preprocessed the dataset and are now ready to apply machine learning models.
Model Training and Evaluation
Since this is a classification problem, we train four classifiers: k-nearest neighbors, random forest, a support vector classifier, and logistic regression.
For evaluation, we use the accuracy score.
Python3
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
knn = KNeighborsClassifier(n_neighbors=3)
rfc = RandomForestClassifier(n_estimators=7,
                             criterion='entropy',
                             random_state=7)
svc = SVC()
lc = LogisticRegression()
# Train each classifier and report its accuracy on the test set
for clf in (rfc, knn, svc, lc):
    # ravel() passes the labels as the 1-D array the estimators expect
    clf.fit(x_train, y_train.ravel())
    y_pred = clf.predict(x_test)
    print("Accuracy score of", clf.__class__.__name__, "=",
          100 * metrics.accuracy_score(y_test, y_pred))
Output (exact values may differ slightly across library versions):
Accuracy score of RandomForestClassifier = 84.5
Accuracy score of KNeighborsClassifier = 82.5
Accuracy score of SVC = 86.15
Accuracy score of LogisticRegression = 80.75
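The accuracy score is the fraction of predictions that match the true labels. A minimal sketch with made-up labels (not taken from the dataset) shows the computation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 6 of the 8 predictions match the true values
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)

# Same quantity computed by hand: the mean of the element-wise matches
manual = float(np.mean(y_true == y_pred))

print(100 * acc)
```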
Conclusion
Of the four models, SVC (86.15%) and the Random Forest classifier (84.5%) show the best results, with an accuracy of around 85%.