Workflow of A Machine Learning Project
In this blog, we will discuss the workflow of a Machine Learning project; this includes all the steps required to build a proper machine learning project from scratch.
We will also go over data pre-processing, data cleaning, feature exploration and feature engineering, and show the impact they have on machine learning model performance. We will also cover a couple of pre-modelling steps that can help to improve model performance. The workflow consists of the following steps:
1. Gathering data
2. Data pre-processing
3. Researching the model that will be best for the type of data
4. Training and testing the model
5. Evaluation
Okay, but first, let's start from the basics.
1. Gathering data
The process of gathering data depends on the type of project we want to build. If we want to make an ML project that uses real-time data, we can build an IoT system that collects data from different sensors. The data set can be gathered from various sources, such as a file, a database, a sensor and many others, but the collected data cannot be used directly for analysis, as there might be a lot of missing data, extremely large values, unorganized text data or noisy data. Therefore, to solve this problem, Data Preparation is done.
We can also use some of the free data sets that are available on the internet. Kaggle and the UCI Machine Learning Repository are the repositories used most often for building machine learning models. Kaggle is one of the most visited websites for practicing machine learning algorithms; it also hosts competitions in which people can participate and test their knowledge of machine learning.
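For example, a data set downloaded from Kaggle or the UCI repository typically comes as a CSV file. Below is a minimal sketch of loading and inspecting one with pandas; the file name data.csv is a placeholder.
```python
import pandas as pd

# Load a downloaded data set; "data.csv" is a placeholder file name.
df = pd.read_csv("data.csv")

# First look at the data: shape, column types and missing values.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
```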
2. Data pre-processing
Data pre-processing is one of the most important steps in machine learning. It is the step that most helps to build machine learning models accurately. In machine learning, there is an 80/20 rule: every data scientist should expect to spend 80% of their time on data pre-processing and 20% on actually performing the analysis.
Most real-world data is messy. Some examples of such data are:
1. Missing data: Missing data can occur when the data is not created continuously or due to technical issues in the application (e.g. an IoT system).
2. Noisy data: This type of data is also called outliers; it can occur due to human errors (a human manually gathering the data) or a technical problem with the device at the time of data collection.
3. Inconsistent data: This type of data might be collected due to human errors (mistakes with names or values) or duplication of data.
To make such data usable, the following pre-processing steps can be performed (a short sketch follows this list):
1. Conversion of data: As we know, machine learning models can only handle numeric features, so categorical and ordinal data must somehow be converted into numeric features.
2. Ignoring the missing values: Whenever we encounter missing data in the data set, we can remove the affected row or column, depending on our need. This method is known to be efficient, but it shouldn't be performed if the data set has a lot of missing values, since too much information would be lost.
3. Filling the missing values: Whenever we encounter missing data in the data set, we can fill the missing values manually; most commonly the mean, median or highest-frequency value is used.
4. Machine learning: If we have some missing data, we can use the existing data to predict which value should be present in the empty position.
5. Outliers detection: Some erroneous data might be present in our data set that deviates drastically from the other observations. [Example: human weight = 800 kg, due to mistyping an extra 0]
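Below is a minimal sketch of steps 1, 3 and 5 using pandas; the toy DataFrame, its column names and the outlier threshold are all assumptions made for illustration.
```python
import pandas as pd

# Toy data set with a categorical column, a missing value and an
# outlier (weight = 800 kg, a mistyped extra 0) -- all made up.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue"],
    "age": [25.0, None, 31.0, 28.0],
    "weight": [70, 65, 800, 72],
})

# 1. Conversion of data: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["color"])

# 3. Filling the missing values: replace missing ages with the mean.
df["age"] = df["age"].fillna(df["age"].mean())

# 5. Outliers detection: drop rows with an implausible weight
# (the 300 kg threshold is an assumption for this toy example).
df = df[df["weight"] < 300]

print(df)
```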
3. Researching the model that will be best for the type of data
Our main goal is to train the best-performing model possible, using the pre-processed data.
Supervised Learning:
In supervised learning, an AI system is presented with data which is labelled, which means that each data point is tagged with the correct label.
Supervised learning is categorized into two subcategories: “Classification” and “Regression”.
Classification:
A classification problem is when the target variable is categorical, i.e. the output can be classified into classes, such as “red” or “blue”, “disease” or “no disease”, or “spam” or “not spam”.
Classification | GIF: www.cs.toronto.edu
As shown in the above representation, we have two classes plotted on the graph, red and blue, which can represent the ‘setosa’ and ‘versicolor’ flowers. We can imagine the X-axis as the ‘Sepal Width’ and the Y-axis as the ‘Sepal Length’, and we try to create the best-fit line that separates the two classes of flowers. Some of the most used classification algorithms are listed below; a short sketch follows the list.
K-Nearest Neighbor
Naive Bayes
Decision Trees/Random Forest
Support Vector Machine
Logistic Regression
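As a minimal sketch of classification, here is a Logistic Regression classifier (one of the algorithms listed above) trained on the sepal measurements of the setosa and versicolor Iris flowers, matching the example above; the choice of algorithm, features and scikit-learn is for illustration only.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the Iris data and keep only setosa (0) and versicolor (1),
# with sepal length and sepal width as the two features.
X, y = load_iris(return_X_y=True)
mask = y < 2
X, y = X[mask, :2], y[mask]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict the class of an unseen flower (sepal length, sepal width).
print(clf.predict([[5.0, 3.5]]))
print("Test accuracy:", clf.score(X_test, y_test))
```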
Regression:
A regression problem, on the other hand, is when the target variable is continuous (i.e. the output is numeric).
Regression | GIF: techburst.io
As shown in the above representation, we can imagine that the graph's X-axis is the ‘Test scores’ and the Y-axis represents ‘IQ’. We try to create the best-fit line in the given graph, so that we can use that line to predict an approximate IQ for a test score that isn't present in the given data. Some of the most used regression algorithms are listed below; a short sketch follows the list.
Linear Regression
Support Vector Regression
Decision Trees/Random Forest
Gaussian Process Regression
Ensemble Methods
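As a minimal sketch of regression, here is a Linear Regression model fitted to the test-score/IQ example above; the data points are made up purely for illustration.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up (test score, IQ) pairs, purely for illustration.
X = np.array([[20], [40], [60], [80], [100]])  # test scores
y = np.array([85, 95, 105, 115, 125])          # IQ

# Fit the best-fit line through the points.
model = LinearRegression()
model.fit(X, y)

# Use the line to predict an approximate IQ for an unseen test score.
print(model.predict([[70]]))  # -> about 110
```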
Unsupervised Learning:
In unsupervised learning, an AI system is presented with unlabeled, uncategorized data, and the system's algorithms act on the data without prior training. The output depends upon the coded algorithms. Subjecting a system to unsupervised learning is one way of testing AI.
Unsupervised learning is categorized into two subcategories: “Clustering” and “Association”.
Clustering:
A set of inputs is to be divided into groups. Unlike in classification, the groups are not known
beforehand, making this typically an unsupervised task.
Clustering
Some of the most used clustering algorithms are listed below; a short sketch follows the list.
Gaussian mixtures
K-Means Clustering
Hierarchical Clustering
Spectral Clustering
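As a minimal sketch of clustering, here is K-Means Clustering applied to unlabeled points; the toy data and the choice of two clusters are assumptions for illustration.
```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups (made up for illustration).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [7.8, 8.3]])

# Ask K-Means to divide the points into two groups; the algorithm
# discovers the groups itself, without any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # center of each discovered group
```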
4. Training and testing the model
For training a model, we initially split the data into three sections: ‘Training data’, ‘Validation data’ and ‘Testing data’.
You train the classifier using the ‘training data set’, tune the parameters using the ‘validation set’, and then test the performance of your classifier on the unseen ‘test data set’. An important point to note is that, while training the classifier, only the training and/or validation set is available. The test data set must not be used during the training of the classifier; it only becomes available when testing the classifier.
Training set: The training set is the material through which the computer learns how to process information. Machine learning uses algorithms to perform the training part. It is a set of data used for learning, that is, to fit the parameters of the classifier.
Validation set: A set of unseen data, held out from the training data, that is used to tune the parameters of a classifier. Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data.
Test set: A set of unseen data used only to assess the performance of a fully-specified
classifier.
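Below is a minimal sketch of dividing a data set into the three sections with scikit-learn; the 60/20/20 proportions are an assumption, since the text does not fix the split ratio.
```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix
y = np.arange(50)                  # toy labels

# First split off the test set (20%), which stays unseen until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remainder into training (60%) and validation (20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # -> 30 10 10
```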
Once the data is divided into the three segments, we can start the training process.
In a data set, a training set is used to build up a model, while a test (or validation) set is used to validate the model built. Data points in the training set are excluded from the test (validation) set. Usually, a data set is divided into a training set and a validation set (some people use ‘test set’ instead), or into a training set, a validation set and a test set.
We train the model using one of the algorithms chosen in step 3. Once the model is trained, we can use the same trained model to predict on the testing data, i.e. the unseen data. Once this is done, we can build a confusion matrix, which tells us how well our model is trained. A confusion matrix has four parameters: ‘True Positives’, ‘True Negatives’, ‘False Positives’ and ‘False Negatives’. We prefer to get more values in the true positives and true negatives, to get a more accurate model. The size of the confusion matrix depends on the number of classes.
True positives: Cases in which we predicted TRUE and the actual output is also TRUE, so our prediction is correct.
True negatives: We predicted FALSE and the actual output is FALSE, so our prediction is correct.
False positives: We predicted TRUE, but the actual output is FALSE.
False negatives: We predicted FALSE, but the actual output is TRUE.
We can also find the accuracy of the model using the confusion matrix, as sketched below.
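Here is a minimal sketch of building a confusion matrix and computing the accuracy from its four parameters with scikit-learn; the true and predicted labels are made up for illustration.
```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up true and predicted labels for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)

# Accuracy = (true positives + true negatives) / all predictions.
print("Accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("Check:", accuracy_score(y_true, y_pred))
```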
5. Evaluation
Model Evaluation is an integral part of the model development process. It helps to find the
best model that represents our data and how well the chosen model will work in the future.
To improve the model, we might tune its hyper-parameters to improve accuracy, and also look at the confusion matrix to try to increase the number of true positives and true negatives.
In this blog, we have discussed the workflow of a Machine Learning project, which gives us a basic idea of how the problem should be tackled.