
SECTOR: ICT & MULTIMEDIA

SUB-SECTOR/TRADE: NETWORKING AND INTERNET TECHNOLOGY


RTQF LEVEL: 5
ACADEMIC YEAR: 2024 -2025
CERTIFICATE TITLE: TVET CERTIFICATE IV IN NETWORK AND INTERNET TECHNOLOGY

MODULE CODE:

MODULE NAME: MACHINE LEARNING APPLICATION

Credits: 8 Hours: 12

PREPARED BY:
UNKUNDIYE Jeanette
Tel:(+250)789415047
E-mail: [email protected]

APADE on 08th July 2024

Elements of Competence and Performance Criteria

1. Apply Data Pre-processing
1.1 Environment is properly prepared based on system requirements.
1.2 Data is properly manipulated based on the Python libraries' functionalities.
1.3 Visualization results are properly interpreted based on its statistical analysis.
1.4 Data cleaning is appropriately performed based on the provided dataset.

2. Develop Machine Learning Model
2.1 Machine Learning algorithm is properly selected based on the characteristics of the dataset.
2.2 Machine Learning models are properly trained based on a training set of data.
2.3 Machine Learning model performance is properly evaluated based on appropriate evaluation metrics.
2.4 Hyperparameters are properly fine-tuned based on evaluation results.

3. Perform Model Deployment
3.1 Deployment methods are clearly selected based on the requirements.
3.2 Model file is properly integrated in the system based on the deployment method (RESTful API guidelines).
3.3 Prediction responses are accurately delivered to the clients based on the model insights.

SWDML501 MACHINE LEARNING APPLICATION


By Leon HAKIZIMANA
Email: [email protected]
Tel: +250 789932372
Learning outcome 1: Apply Data Pre-processing

1.1 Environment is properly prepared based on system requirements.


● Description of Machine learning concepts
 Machine learning overview
Machine learning (ML) is a subset of artificial intelligence (AI) focused on enabling computers to
learn from and make predictions or decisions based on data. Unlike traditional programming, where
explicit instructions are given, machine learning algorithms use statistical methods to identify patterns
and improve their performance over time as they are exposed to more data.

Definition
 Machine learning (ML) is an application of artificial intelligence that involves algorithms and
data that automatically analyse information and make decisions without human intervention.

 It describes how computers perform tasks on their own by learning from previous experience.

 Therefore, we can say that in machine learning, intelligent behaviour is generated from
experience.

Machine learning life cycle


Machine learning has given computer systems the ability to learn automatically without being
explicitly programmed. But how does a machine learning system work? It can be described using the
machine learning life cycle: a cyclic process for building an efficient machine learning project. The
main purpose of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

1. Gathering Data
The goal of this step is to identify the data requirements of the problem and obtain the relevant data.
We need to identify the different data sources, as data can be collected from various sources such
as files, databases, the internet, or mobile devices.
This step includes the below tasks:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in
further steps.

2. Data preparation
Data preparation is the step where we put our data into a suitable place and prepare it for use in
machine learning training. In this step, we first put all the data together and then randomize its
ordering.

This step can be further divided into two processes:


 Data exploration: used to understand the nature of the data we have to work with, including its
characteristics, format, and quality. A better understanding of the data leads to a more effective
outcome; here we look for correlations, general trends, and outliers.
 Data pre-processing: preparing the data so that it is ready for analysis and modeling.

3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format: cleaning the
data, selecting the variables to use, and transforming the data into a proper format so that it is more
suitable for analysis in the next step.
The data we collect is not always directly usable, as some of it may not be useful. In real-world
applications, collected data may have various issues, including:
 Missing Values
 Duplicate data
 Invalid data
 Noise

4. Analyse Data
This step aims to build a machine learning model that analyzes the data using various analytical
techniques, and to review the outcome. It starts by determining the type of problem, where we select a
machine learning technique such as classification, regression, cluster analysis, or association; we then
build the model using the prepared data and evaluate it.

Data analysis step involves:


 Selection of analytical techniques
 Building models
 Review the result

5. Train the model


In this step, we train our model to improve its performance and obtain a better outcome for the problem.
We use datasets to train the model with various machine learning algorithms. Training a model is
required so that it can learn the various patterns, rules, and features.

6. Test the model


In this step, we check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirement of project
or problem.
7. Deployment
In this step, we deploy the model in the real-world system. If the prepared model produces accurate
results that meet our requirements with acceptable speed, we deploy it in the real system. Before
deploying the project, we check whether the model keeps improving its performance using the
available data. The deployment phase is similar to preparing the final report for a project.

Machine learning applications


Machine learning is becoming increasingly popular and is being used in everyday life without us even
realizing it. Some of the most popular real-world applications of machine learning include:

1. Image Recognition: Image recognition is the task of identifying objects in digital images. It
is a machine learning technique that uses a trained model to detect objects within an image.

2. Speech Recognition: Speech recognition is a form of artificial intelligence that enables


machines to interpret and act on human speech. It is used to perform tasks such as converting
spoken words into text, understanding commands, and performing various other tasks.

3. Traffic prediction: Traffic prediction is a machine learning application that forecasts
traffic patterns based on historical data. It is used to plan traffic routes better, reduce
congestion, and improve safety.

4. Product Recommendations: Product recommendation is a machine learning application that
uses data to recommend items to customers. It is used to personalize shopping experiences
and increase sales.

5. Self-driving Cars: Self-driving cars are vehicles that are capable of detecting their
environment and navigating without human input. They use a combination of machine
learning techniques such as object recognition, path planning, and decision making to drive
autonomously.

6. E-mail Spam and Malware Filtering: E-mail spam and malware filtering uses machine
learning algorithms to detect and filter out malicious emails. It is used to protect users from
spam, phishing scams, and other forms of cybercrime.

7. Virtual Personal Assistant: A virtual personal assistant is an artificial intelligence program
used to automate tasks such as scheduling, reminders, and other everyday activities. It uses a
combination of natural language processing and machine learning algorithms to understand
user requests and provide helpful responses.

8. Online Fraud Detection: Online fraud detection is a form of machine learning that uses
algorithms to detect and prevent fraudulent activities. It is used to protect users from identity
theft, credit card fraud, and other forms of online crime.

9. Stock Market Trading: Stock market trading is a type of machine learning that uses
algorithms to analyze and predict stock prices. It is used to make better trading decisions and
increase profits.

10. Medical Diagnosis: Medical diagnosis is a form of machine learning that uses algorithms to
diagnose diseases. It is used to provide accurate and timely diagnoses to patients.

11. Automatic Language Translation: Automatic language translation is a type of machine


learning that uses algorithms to translate text from one language to another. It is used to bridge
the language gap and make communication easier.

Advantages and disadvantages


 Advantages
• Enhanced Efficiency and Accuracy: ML algorithms automate repetitive tasks, such as data
entry and document processing, reducing manual effort and minimizing errors. This leads to
streamlined operations, improved productivity, and cost savings.
• Real-time Decision Making: ML algorithms can analyse large volumes of data in real time,
allowing banks to make faster and more informed decisions. This is particularly useful in areas
such as fraud detection, where prompt action is crucial.
• Improved Risk Management: Machine Learning models enable banks to identify and manage
risks more effectively. By analysing various data sources, including market data and customer
behaviour, ML algorithms can provide early warnings of potential risks, allowing banks to
mitigate them proactively.
• Enhanced Customer Experience: Machine Learning enables banks to provide personalized
services, tailored product recommendations, and personalized customer support. This
personalized approach fosters customer satisfaction, loyalty, and retention.
 Disadvantages
• Data Privacy and Security: ML algorithms rely on large amounts of customer data, including
personal and financial information. Banks must prioritize data privacy and security to protect
customer confidentiality and comply with regulatory requirements.
• Ethical Considerations: ML algorithms may inadvertently learn biases present in the data,
leading to discriminatory outcomes. It is crucial to address and mitigate bias to ensure fair
and ethical decision-making.
• Model Interpretability: Complex ML models, such as deep learning algorithms, can lack
interpretability, making it challenging to explain their decisions. Banks must strike a balance
between accuracy and interpretability, especially in areas where explainability is essential,
such as loan approvals.
• Implementation and Integration: Implementing Machine Learning solutions requires
significant investments in technology, infrastructure, and talent. Banks must ensure smooth
integration with existing systems and overcome any technical hurdles.

Difference between machine learning, artificial intelligence, and deep learning


 Artificial Intelligence (AI): the broadest category. Encompasses intelligent agents that can
reason, learn, and act autonomously.
 Machine Learning (ML): a subset of AI. Focuses on algorithms that allow computers to learn
from data without explicit programming.
 Deep Learning (DL): a subset of ML. Uses artificial neural networks, inspired by the human
brain, to analyse complex patterns in data.

Feature-by-feature comparison of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL):

 Definition: AI is the broad science of creating intelligent agents; ML is a subset of AI focused on learning from data; DL is a subset of ML using neural networks.
 Goal: AI mimics human intelligence; ML enables computers to learn without explicit programming; DL learns complex patterns from data.
 Approach: AI can be rule-based, knowledge-based, or statistical; ML is data-driven and algorithm-based; DL is data-driven and neural-network-based.
 Data Requirement: AI can be data-driven or rule-based; ML is primarily data-driven; DL requires large amounts of data.
 Feature Engineering: AI often requires manual feature engineering; ML can involve feature engineering; DL typically doesn't require explicit feature engineering.
 Complexity: AI can range from simple to highly complex; ML varies based on the algorithm; DL is highly complex due to its neural networks.
 Tasks: AI covers problem-solving, decision-making, natural language processing, etc.; ML covers pattern recognition, prediction, classification, etc.; DL covers image recognition, speech recognition, natural language understanding, etc.
 Examples: AI includes expert systems, chatbots, and robotics; ML includes recommendation systems, fraud detection, and spam filters; DL includes image and speech recognition and self-driving cars.

 Types of Machine Learning

Supervised: Supervised learning is a machine learning approach that uses a known, labelled
dataset to predict the outcome for new data. The system uses the labelled data to build a model
that understands the dataset and learns from each example. After training and processing are
done, the model can predict outcomes for unseen data. For example, an algorithm trained on
labelled images of dogs and cats can predict whether a new image shows a cat or a dog.

Some examples of supervised learning methods include linear regression, logistic regression, support
vector machines, Naive Bayes, and decision trees.
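
As a minimal sketch of supervised learning with one of the algorithms listed above (a decision tree), using scikit-learn's built-in iris dataset; the parameter choices are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# labelled data: flower measurements (X) and known species (y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)            # learn from the labelled examples

predictions = model.predict(X_test)    # predict labels for unseen data
print("Accuracy:", accuracy_score(y_test, predictions))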

Unsupervised: Unsupervised learning is a machine learning approach that works with
unlabelled data. The objective is to discover structure and patterns within the data. Examples
include clustering and dimensionality reduction.
For example, a model trained on unlabelled images of vehicles can group each kind of vehicle
without being told the categories; the machine tries to find useful insights from the large amount
of data. Unsupervised learning can be further classified into two categories of algorithms:
clustering and association.
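
For illustration, a minimal clustering sketch using scikit-learn's KMeans on a small set of unlabelled, synthetic points:

import numpy as np
from sklearn.cluster import KMeans

# unlabelled data: two loose groups of 2-D points
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # assign each point to a discovered cluster

print(labels)                         # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)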

Semi-supervised: A machine learning approach that combines labelled and unlabelled data
in order to learn the underlying structure of the data. The objective is to use the labelled
data to better understand the structure of the unlabelled data.

Reinforcement: Reinforcement learning is a machine learning approach in which an agent
learns from its environment by taking actions and receiving rewards for those actions. The
objective is to maximize the reward it receives over the long run.

 Focus on maximizing reward from a given environment


 Use algorithms such as Q-learning, SARSA, and Deep Q-networks
 Useful for applications such as game playing and robotics

 Machine Learning tools

● Preparing Machine Learning environment


This guide provides step-by-step instructions for installing Python, essential tools for machine
learning, and testing your environment to ensure everything is set up correctly.

 Installation of Python
1.1 Download Python
1. Visit the Python Official Website: Go to Python's official website.
2. Choose the Version: Download the latest version of Python (usually the most recent stable
release is recommended). For compatibility with most tools and libraries, Python 3.x is
preferred over Python 2.x.
1.2 Install Python
 Windows:
1. Run the installer you downloaded.
2. Check the box that says "Add Python to PATH" at the start of the installation.
3. Select "Install now" to begin the installation process.
4. Once the installation is complete, you can verify it by opening Command Prompt and
typing python --version.
 macOS:
1. Open the .pkg installer you downloaded.
2. Follow the prompts in the installer.
3. Verify the installation by opening Terminal and typing python3 --version.
 Linux:
1. Most Linux distributions come with Python pre-installed. To install or update Python,
use the package manager specific to your distribution.
 For Debian-based systems (like Ubuntu): sudo apt update && sudo apt install
python3
 For Red Hat-based systems (like Fedora): sudo dnf install python3
2. Verify the installation by typing python3 --version in the Terminal.

1.3 Install pip (Python Package Installer)


pip is included with Python installations by default. To ensure you have it:
 Open Command Prompt or Terminal and type pip --version. If pip is installed, you'll see the
version number. If not, you can install it using:
o For Windows/macOS/Linux: python -m ensurepip --upgrade

 Installation of Tools
To work with machine learning in Python, you'll need several tools and libraries. The most common
ones include:

2.1 Virtual Environment Setup


Creating a virtual environment helps manage dependencies for different projects separately.
 Install virtualenv:
pip install virtualenv
 Create a Virtual Environment:
virtualenv myenv
 Activate the Virtual Environment:
o Windows:
myenv\Scripts\activate
o macOS/Linux:
source myenv/bin/activate

2.2 Install Key Libraries


Once inside the virtual environment, install the essential libraries for machine learning:
 NumPy (Numerical operations):
pip install numpy
 Pandas (Data manipulation and analysis):
pip install pandas
 Scikit-learn (Machine learning algorithms):
pip install scikit-learn
 Matplotlib (Plotting and visualization):
pip install matplotlib
 Seaborn (Statistical data visualization):
pip install seaborn
 TensorFlow or PyTorch (Deep learning frameworks):
o TensorFlow:
pip install tensorflow
o PyTorch:
pip install torch torchvision
 Jupyter Notebook (Interactive notebooks for data science):
pip install jupyter

 Environment Testing
To confirm that your Python environment and tools are correctly installed and configured, follow these
steps:

3.1 Verify Python and Pip


Open Command Prompt or Terminal and type:
python --version
pip --version
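
Beyond checking the Python and pip versions, you can also confirm that the key libraries from section 2.2 import correctly. The short script below is a minimal sanity check (the package list assumes the libraries installed earlier); run it inside the activated virtual environment:

# check_env.py - minimal sanity check for the machine learning environment
import sys

packages = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]

print("Python version:", sys.version)
for name in packages:
    try:
        module = __import__(name)
        print(name, "OK, version:", getattr(module, "__version__", "unknown"))
    except ImportError:
        print(name, "NOT INSTALLED")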

1.2 Data is properly manipulated based on the Python libraries' functionalities.


● Data Collection and Acquisition
Data collection and acquisition are fundamental aspects of machine learning. Data collection refers to
the process of gathering data relevant to the machine learning task at hand. This can be done through
various methods, while data acquisition typically involves storing the data in a format that can be
easily processed by machine learning algorithms.

Here's a summary of the data collection and acquisition methods:

Data Collection Methods:
 Manual data collection: gathering data through surveys, experiments, or other means.
 Web scraping: extracting data from websites.
 Using APIs: accessing data from external sources through programming interfaces.
 Data crawling: similar to web scraping; it automatically extracts data from websites by following
links and parsing HTML content.

Data Acquisition Methods:
 Saving data to files (CSV, JSON, etc.)
 Storing data in a database
 Streaming data from real-time sources

 Description of key terms


Data: Raw, unorganized facts and figures. The building blocks of information. Examples:
numbers, words, images, sounds.
Information: Processed data that has meaning and context. Derived from data through analysis
and interpretation.
Examples: sales reports, customer preferences, weather forecasts.
Dataset: Collection of related data items. Can be structured or unstructured. Used for analysis,
modeling, and machine learning.

Examples: customer database, sales transaction data, sensor readings.
Data warehouse: Centralized repository for storing and managing large volumes of data.
Optimized for querying and analysis. Supports business intelligence and decision-making.
Combines data from various sources into a single, consistent view.
Big data: Extremely large and complex datasets that cannot be easily managed using traditional
data processing techniques. Characterized by volume, velocity, and variety. Requires
specialized technologies and tools for processing and analysis.
Examples: social media data, IoT sensor data, financial transaction data.
 Identification of Source of data
IoT Sensors: These are devices embedded in physical objects that collect and transmit data
about their environment or the object itself. Examples include temperature sensors, motion
detectors, and smart meters.

Camera: This refers to devices that capture visual data, either through photographs or video.
Cameras can be used for security, monitoring, or image analysis.

Computer: This is a general-purpose electronic device that processes and stores data.
Computers can generate data through software applications, user inputs, or system logs.

Smartphone: This is a portable device that combines mobile phone capabilities with
computing features. Smartphones generate data from apps, sensors, user interactions, and
communications.

Social Data: This includes data generated from social media platforms and online interactions,
such as posts, likes, comments, and shares.

Transactional Data: This refers to data generated from transactions, such as purchases, sales,
or financial exchanges. It typically includes details about the transaction, such as date, amount,
and parties involved.

 Description of 6 V's of Big Data


The 6 V's of Big Data are key characteristics that describe the complexities and challenges associated
with handling large and diverse datasets. Here's a brief description of each:

 Volume: This refers to the amount of data being generated and stored. In the context of Big
Data, it means handling vast quantities of data, ranging from terabytes to exabytes. Volume
highlights the scale at which data is collected, processed, and stored.

 Variety: This pertains to the different types and sources of data. Data can come in various
formats, such as structured data (like databases), semi-structured data (like XML or JSON),
and unstructured data (like text documents or videos). Variety emphasizes the diversity of data
types and the need to integrate them.

 Velocity: This is the speed at which data is generated and processed. In the context of Big
Data, velocity refers to the fast rate at which data flows into systems and needs to be handled,
often in real-time or near-real-time scenarios.
 Veracity: This relates to the accuracy and reliability of the data. Veracity involves assessing
the quality and trustworthiness of the data, ensuring it is accurate, complete, and free from
errors or biases.

 Value: This is about the usefulness and relevance of the data. Value refers to extracting
meaningful insights and actionable information from data that can drive decisions, create
opportunities, or provide competitive advantages.

 Variability: This describes the inconsistencies in data over time or across different sources.
Variability can affect data quality and analysis, making it important to account for fluctuations
and changes in data patterns.

 Description of Types of data

Data can be categorized into three main types based on its format and organization: structured data,
semi-structured data, and unstructured data. Here's a description of each type:

1. Structured Data: Structured data is highly organized and easily searchable in predefined formats. It
is typically stored in relational databases or spreadsheets where data is arranged in rows and columns.
 Characteristics:
o Format: Consistent and well-defined, often adhering to a schema.
o Examples: Customer names, addresses, transaction records, inventory lists.
o Access and Analysis: Easily queried using SQL and analyzed using various data
analysis tools.
 Use Cases: Financial records, sales reports, customer databases.
2. Semi-structured Data: Semi-structured data does not conform to a rigid schema like structured
data but still contains organizational elements that make it easier to analyze than unstructured data. It
often includes metadata or tags to provide context.
 Characteristics:
o Format: Partially organized with some identifiable structure, such as tags or keys.
o Examples: JSON files, XML documents, HTML files, emails, and log files.
o Access and Analysis: Requires specialized tools or programming languages (like
Python or R) to parse and analyze, often using hierarchical or tag-based queries.
 Use Cases: Web data, social media posts, and data from APIs.
3. Unstructured Data: Unstructured data lacks a predefined format or structure, making it more
challenging to collect, process, and analyze. It does not fit neatly into tables or schemas.
 Characteristics:
o Format: No standardized format; data can be free-form.
o Examples: Text documents, images, videos, audio files, and social media content.
o Access and Analysis: Requires advanced techniques such as natural language
processing (NLP), machine learning, and text mining to extract insights.
 Use Cases: Customer feedback, multimedia content, and content from blogs or forums.

 Gathering Machine Learning dataset


Gathering a dataset for machine learning involves several key steps to ensure that the data is relevant,
high-quality, and suitable for the intended analysis or model training. Here’s a general guide on how to
gather and prepare a machine learning dataset:

1. Define Objectives and Requirements: define the problem you want to solve or the task you
want to perform (e.g., classification, regression, clustering). Determine the type of data needed,
including features (input variables) and labels (output variables) if applicable.

2. Identify Data Sources: determine the source of data to be used such as


 Public Datasets: Use publicly available datasets from sources like Kaggle, UCI Machine
Learning Repository, or government databases.
 Company Data: Leverage internal company data if it’s relevant to the problem.
 APIs: Access data through APIs from platforms like Twitter, Google, or financial services.
 Web Scraping: Extract data from websites using web scraping tools or libraries.
 IoT Sensors and Devices: Collect data from sensors or devices if applicable to your use
case.

3. Collect Data:
 Manual Collection: For smaller datasets, manually gather data from sources such as
surveys or forms.
 Automated Collection: Use scripts or tools to automate the collection process, especially
for large datasets or real-time data.
 APIs and Web Scraping: Write scripts to fetch data from APIs or scrape websites.
4. Preprocess Data
 Cleaning: Handle missing values, remove duplicates, and correct errors or inconsistencies
in the data.
 Normalization/Standardization: Scale numerical features to ensure that they are on a
similar scale, which is important for many algorithms.
 Encoding: Convert categorical variables into numerical formats (e.g., one-hot encoding,
label encoding).
 Splitting: Divide the dataset into training, validation, and test sets to evaluate the model's
performance (a minimal sketch of these preprocessing steps appears after this list).
5. Augment Data: Increase the size of the dataset by creating variations of existing data, such as
rotating images or introducing noise.

6. Validate and Document: Ensure that the collected data meets the quality and relevance
criteria for the problem. And, Document the data sources, collection methods, preprocessing
steps, and any assumptions or limitations.
7. Ensure Data Privacy and Compliance: Ensure that the data collection and usage comply
with privacy regulations such as GDPR, HIPAA, or CCPA. Consider ethical implications,
particularly if working with sensitive or personal data.

8. Explore and Analyze Data: Use statistical and visualization techniques to understand the
distribution, patterns, and relationships within the data. And, Create new features or modify
existing ones to improve the performance of your machine learning model.

9. Store and Manage Data: Store the dataset in a format that is easy to access and manage, such
as databases or data lakes. Use version control systems or data management tools to keep track
of changes and updates to the dataset.
10. Update and Maintain: Regularly update the dataset as new data becomes available or as
requirements change. Monitor the data quality and relevance over time to ensure ongoing
usefulness for model training.
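
As a minimal sketch of the preprocessing in step 4 above (cleaning, encoding, splitting, and scaling), assuming a small illustrative pandas DataFrame with a numeric column, a categorical column, and a target label:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# illustrative data; in practice this would come from your own data source
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 35],
    "city": ["Kigali", "Huye", "Kigali", "Musanze", "Huye", "Kigali"],
    "target": [0, 1, 0, 1, 0, 1],
})

df = df.drop_duplicates()                       # cleaning: remove duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())  # cleaning: impute missing values
df = pd.get_dummies(df, columns=["city"])       # encoding: one-hot encode the categorical column

X = df.drop(columns=["target"])
y = df["target"]

# splitting: hold out a test set before scaling so it stays unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # scaling: fit on training data only
X_test_scaled = scaler.transform(X_test)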

1.3 Visualization results are properly interpreted based on its statistical analysis.

● Interpret Data Visualization
 Description of data Visualization tools
Data Visualization is the process of presenting data in the form of graphs or charts. It helps to
understand large and complex amounts of data very easily. It allows decision-makers to make
decisions efficiently and helps them identify new trends and patterns easily.
Matplotlib
Matplotlib is a low-level Python library used for data visualization. It is easy to use and emulates
MATLAB-like graphs and visualizations. The library is built on top of NumPy arrays and provides
several plot types such as line charts, bar charts, and histograms. It offers a lot of flexibility, but at the
cost of writing more code.

To install Matplotlib type the below command in the terminal.

pip install matplotlib

1. Pyplot
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is designed to be as
usable as MATLAB, with the ability to use Python and the advantage of being free and open-source.
Example:
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
plt.plot(x, y)
plt.show()

Adding Title
The title() method in matplotlib module is used to specify the title of the visualization depicted and
displays the title using various attributes.

plt.title("Linear graph")

# Adding title to the plot with a color


plt.title("Linear graph", fontsize=25, color="green")

Adding X Label and Y Label


In layman’s terms, the X label and the Y label are the titles given to X-axis and Y-axis respectively.
These can be added to the graph by using the xlabel() and ylabel() methods.

plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')

Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the data displayed
in the graph’s Y-axis. It generally appears as the box containing a small sample of each color on the
graph and a small description of what this data means.

plt.legend(["GFG"])
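
Putting the pieces above together, a minimal sketch that combines the plot, title, axis labels, and legend (the data values are illustrative):

import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

plt.plot(x, y, label="GFG")                            # draw the line and name it for the legend
plt.title("Linear graph", fontsize=25, color="green")  # add a styled title
plt.xlabel("X-Axis")                                   # label the horizontal axis
plt.ylabel("Y-Axis")                                   # label the vertical axis
plt.legend()                                           # display the legend box
plt.show()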

Seaborn
Seaborn is a well-known Python library for data visualization that offers a user-friendly interface for
producing visually appealing and informative statistical graphics. It is designed to work with Pandas
dataframes, making it easy to visualize and explore data quickly and effectively.

Installing Seaborn

pip install seaborn

Sample Datasets
Seaborn provides several built-in datasets that we can use for data visualization and statistical analysis.
These datasets are stored in pandas dataframes, making them easy to use with Seaborn's plotting
functions.
Example:
import seaborn as sns

tips = sns.load_dataset("tips")
sns.histplot(data=tips,
x="total_bill")

import seaborn as sns
exercise = sns.load_dataset("exercise")
exercise.head()

Seaborn Plot types


Seaborn provides a wide range of plot types that can be used for data visualization and exploratory
data analysis. Broadly speaking, any visualization can fall into one of the three categories.

 Univariate – x only (contains only one axis of information)


 Bivariate – x and y (contains two axes of information)
 Trivariate – x, y, z (contains three axes of information)

Seaborn scatter plots


Scatter plots are used to visualize the relationship between two continuous variables. Each point on the
plot represents a single data point, and the position of the point on the x and y-axis represents the
values of the two variables.

import seaborn as sns


tips = sns.load_dataset("tips")
sns.scatterplot(x="total_bill", y="tip", data=tips)

Plotly
Plotly is an interactive, open-source plotting library. Plotly charts are dynamic and can be displayed
in Jupyter notebooks. Plotly is built on top of the Plotly JavaScript library (plotly.js) and can be used
almost anywhere.

Installation of Plotly is very easy. If you have not installed it yet, run:

pip install plotly

Using the basics of Plotly, let's see how to create some basic charts:

import plotly.express as px
# using the iris dataset
df = px.data.iris()

fig = px.line(df, y="sepal_width",)


fig.show()

Group and color the data according to the species. We will also change the line format.

fig = px.line(df, y="sepal_width", line_dash='species', color='species')

Tableau
Tableau is a data analytics and visualization tool, and the market leader (about 33% market share,
followed by Power BI). Tableau comes with an easy drag-and-drop interface, which makes it easy to
learn, and you can work with almost every type of data in Tableau.

Installing Tableau on your System


Tableau provides various products according to business needs: Tableau Desktop, Tableau Public,
and Tableau Online, all of which offer data visual creation. The choice of product depends on the
type of work.

Tableau Desktop is a program that allows you to execute complicated data analysis tasks and
generate dynamic, interactive representations to explain the results.

Power BI
Power BI is a cloud-based business analytics service from Microsoft that enables anyone to
visualize and analyse data, with better speed and efficiency. It is a powerful as well as a
flexible BI tool for connecting with and analysing a wide variety of data.

Power BI Desktop is a free application that can be downloaded and installed on the system. It
can be connected to multiple data sources. Usually, an analysis work begins in Power BI
Desktop where report creation takes place.

Power BI can be used in two ways:

 As an app from the Microsoft store and just sign in to get started. This is the online
version of the tool.
 Download the software locally and then install it. Make sure you read all the
installation instructions.

Types of Data Visualization

Plotly supports various types of plots, including:

1. Bar Chart
A bar chart is a pictorial representation of data that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent
Bar Charts: Great for categorical data.

import plotly.express as px
df = px.data.tips()

# Creating the bar chart


fig = px.bar(df, x='day', y="total_bill",
color='sex',facet_row='time',
facet_col='sex')

fig.show()

2. Scatter Plot
A scatter plot is a set of dotted points to represent individual pieces of data in the horizontal
and vertical axis.
Scatter Plots: Useful for displaying relationships between variables.

import plotly.express as px

# using the dataset


df = px.data.tips()

# plotting the scatter chart


fig = px.scatter(df, x='total_bill',
y="tip")
# showing the plot
fig.show()

3. Histogram

A histogram is basically used to represent data in the form of some groups. It is a type of bar
plot where the X-axis represents the bin ranges while the Y-axis gives information about
frequency.
Histograms: Useful for distribution of numerical data.

import plotly.express as px

# using the dataset


df = px.data.tips()

# plotting the histogram


fig = px.histogram(df, x="total_bill")

# showing the plot


fig.show()

4. Pie Chart
A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical
proportions. It depicts a special chart that uses “pie slices”, where each sector shows the
relative sizes of data.
Pie Charts: Useful for showing proportions.

import plotly.express as px

# Loading the iris dataset


df = px.data.tips()

fig = px.pie(df, values="total_bill",


names="day")
fig.show()

5. 3D Scatter Plot
A 3D scatter plot extends the ordinary scatter plot by plotting points along three axes (x, y, and z),
so a third variable can be visualized together with the first two.
3D Plots: For visualizing three-dimensional data.

import plotly.express as px

# data to be plotted
df = px.data.tips()

# plotting the figure


fig = px.scatter_3d(df,
x="total_bill", y="sex", z="tip")

fig.show()

6. Box Plot
A box plot, also known as a whisker plot, displays a summary of a set of data values: the minimum,
first quartile, median, third quartile, and maximum. In the box plot, a box is drawn from the first
quartile to the third quartile, with a vertical line through the box at the median.

import plotly.express as px
df = px.data.tips()

# plotting the boxplot


fig = px.box(df, x="day",
y="tip")

# showing the plot


fig.show()

Various customizations that can be used on boxplots –

 color: used to assign color to marks


 facet_row: assign marks to facetted subplots in the vertical direction
 facet_col: assign marks to facetted subplots in the horizontal direction
 boxmode: one of ‘group’ or ‘overlay’. In ‘overlay’ mode, boxes are drawn on top of
one another; in ‘group’ mode, boxes are placed beside each other.
 notched: If True, boxes are drawn with notches

 Applying data Visualization Best Practices

Applying data visualization best practices ensures that your visualizations effectively communicate
insights and are easy to understand. Here are some key best practices to keep in mind:

1. Know Your Audience: Make sure the visualization is appropriate for the audience’s level
of data literacy.
2. Choose the Right Type of Chart

 Line Charts: Good for showing trends over time.


 Bar Charts: Ideal for comparing categorical data.
 Pie Charts: Useful for showing proportions, but only with a small number of categories.
 Histograms: Great for showing distributions of data.
 Heatmaps: Useful for visualizing matrix-like data and identifying patterns or correlations.
 Scatter Plots: Best for showing relationships between two continuous variables.

3. Design for Clarity: Keep design elements to a minimum to avoid distraction. Remove
unnecessary elements like excessive gridlines, background images, or decorative elements that
don’t add value.
4. Use Effective Labels and Annotations: Clearly label axes to indicate what each axis
represents. Include units if applicable.

 Titles: Provide a descriptive title that explains what the visualization is showing.
 Legends: Use legends to explain colors, sizes, or shapes used in the visualization.
 Annotations: Highlight important data points or trends with annotations to guide the viewer’s
focus.

5. Choose Colors Wisely: Use a color scheme that is both aesthetically pleasing and
accessible. Avoid using too many colors, and consider color blindness (e.g., use colorblind-
friendly palettes like Viridis).
6. Ensure Data Integrity: Verify that data is correctly represented and calculated. Be cautious
with scale choices. For example, avoid distorting data with misleading y-axis scaling.
7. Optimize Readability

 Font Size: Ensure text is legible; don’t use overly small fonts.
 Contrast: Use high contrast between text and background for better readability.
 Spacing: Provide adequate spacing between elements to avoid overcrowding.

8. Interactive Features

 Tooltips: Use tooltips to provide additional information when hovering over data points.
 Filters: Allow users to filter or zoom in on specific data subsets if interactivity is supported.
 Drilldowns: Provide options to drill down into data for more detailed exploration.

9. Data Transformation and Aggregation

 Aggregation: Aggregate data meaningfully to avoid overloading the viewer with too much
detail.
 Normalization: Normalize data when comparing across different scales or units to make
comparisons more meaningful.

10. Test and Iterate

 Feedback: Gather feedback from actual users to identify areas for improvement.
 Testing: Test visualizations in different scenarios and with different data subsets to ensure they
work well in various contexts.

 Interpreting visualizations results

Patterns and Trends


Visualizing trends and patterns is a fundamental aspect of analysing data and it’s one of our most
reliable decision-making mechanisms. It helps us simplify complex information and turn it into
insights.

import plotly.express as px
import pandas as pd

# Sample data
df = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar', 'Apr',
'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct',
'Nov', 'Dec'],
'Sales': [200, 220, 250, 275, 300, 310,
290, 280, 320, 340, 330, 350]
})

fig = px.line(df, x='Month', y='Sales',


title='Monthly Sales Trend')

fig.show()

Context and Background


Context refers to the background information and circumstances surrounding the data: understanding
the data source tells us where the data comes from and how reliable it is. Background involves
understanding the historical, geographical, or situational factors that influence the data: reviewing
historical trends lets us compare past data with current trends.

import plotly.express as px
import pandas as pd

df = pd.DataFrame({
'Region': ['North', 'South', 'East', 'West'],
'Satisfaction': [85, 78, 88, 72]
})

fig = px.imshow(
    [df['Satisfaction']],                    # a list containing the single row of data
    x=df['Region'],                          # X-axis labels
    color_continuous_scale='Reds',
    title='Customer Satisfaction by Region',
    labels={'color': 'Satisfaction'}         # label for the color scale
)

# Show the plot


fig.show()

Correlations and Relationships


1. Correlation measures the strength and direction of the relationship between two
variables.
Types:

 Positive Correlation: As one variable increases, the other also increases.


 Negative Correlation: As one variable increases, the other decreases.
 No Correlation: No discernible relationship between the variables.

2. Relationships describe how variables interact with each other.


 Types:

 Linear Relationships: Data points follow a straight-line pattern.


 Non-Linear Relationships: Data points follow a curve or other complex pattern

import plotly.express as px
import pandas as pd

# Sample data
df = pd.DataFrame({
'Advertising Spend': [1000,
2000, 3000, 4000, 5000],
'Sales': [15000, 25000,
35000, 45000, 55000]
})

# Create scatter plot


fig = px.scatter(df, x='Advertising Spend', y='Sales',
                 title='Advertising Spend vs. Sales')

# Show the plot


fig.show()

1.4 Data cleaning is appropriately performed based on the provided dataset


● Perform Data cleaning
 Data cleaning overview
Definition

Data cleaning is the process of correcting or deleting inaccurate, damaged, improperly formatted,
duplicated, or insufficient data from a dataset. Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies.
Purpose
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

Steps

Overall, the data cleaning process involves identifying and fixing data issues, choosing appropriate
cleaning methods, and implementing those methods to improve data quality

1. Identify and Fix Structural Mistakes: This step focuses on correcting errors in data organization
and formatting. It includes tasks like:

 Standardizing data formats: Ensuring consistency in date formats, numerical values, and text
(e.g., using consistent separators, capitalization).
 Handling missing values: Deciding how to handle missing data (e.g., imputation, removal).
 Correcting typos and inconsistencies: Fixing spelling errors and ensuring data accuracy.

2. Set Data Cleaning Techniques: This involves defining the specific methods and tools used to clean
the data. It might include:

 Choosing data cleaning software: Selecting appropriate tools for the task (e.g., Excel, Python
libraries, and specialized data cleaning software).
 Determining cleaning algorithms: Selecting algorithms or techniques for tasks like outlier
detection, imputation, and data transformation.

3. Filter Outliers and Fix Missing Data: This step focuses on addressing specific data quality issues:

 Identifying and handling outliers: Outliers are data points significantly different from others.
They might be removed, corrected, or analyzed further.
 Imputing missing values: Filling in missing data with estimated values based on other data
points or statistical methods.

4. Implement Processes: This involves putting the chosen techniques into action and executing the
data cleaning plan. It might include:

 Writing cleaning scripts: Developing code or scripts to automate repetitive cleaning tasks.
 Applying cleaning techniques: Using the selected methods to clean the data.
 Validating results: Checking the cleaned data for accuracy and completeness.
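
For illustration, a minimal pandas sketch of steps 3 and 4 (imputing missing values, filtering outliers with the 1.5 × IQR rule, and validating the result); the 'salary' column and its values are hypothetical:

import pandas as pd

df = pd.DataFrame({"salary": [30, 35, 32, 31, 500, None, 33, 34]})

# impute missing values with the median
df["salary"] = df["salary"].fillna(df["salary"].median())

# flag outliers using the 1.5 * IQR rule
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = df[(df["salary"] >= lower) & (df["salary"] <= upper)]

# validate the cleaned data
print(cleaned.describe())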

 Description of Characteristics of quality Data


o Accuracy: This refers to how closely data values match the true or real values.
Accurate data should be free from errors and correctly reflect the real-world situation or
measurements it represents.
For example, if a dataset contains the height of individuals, accuracy means those heights
should be measured correctly and not contain any errors.

o Completeness: Completeness measures whether all necessary data is present and if the
dataset covers all required aspects of the information. Incomplete data might be missing
values, missing fields, or lack coverage in certain areas. For instance, in a customer
database, completeness would ensure that all required fields (like name, address, and
contact number) are filled in for every customer.

o Consistency: Consistency refers to the uniformity of data across different datasets or


within the same dataset. Data should not conflict or contradict itself.

For example, if a dataset shows a person’s age as 30 in one record and 35 in another
record, it lacks consistency.

o Relevance: Relevance indicates how well the data meets the needs of its intended use
or purpose. Data is relevant if it is appropriate and useful for the specific context or
application it is being used for.
For example, in a sales analysis, having data on customer purchase behaviour is relevant,
but data on unrelated topics like weather patterns would not be.

o Validity: Validity measures whether the data conforms to defined formats, rules, or
constraints and accurately represents the intended information. Valid data adheres to
predefined criteria or ranges.
For example, if a dataset requires dates to be in the format MM/DD/YYYY, valid data
should follow this format and not include dates in different formats.

 Data cleaning for inconsistencies rectification

Importance of data Cleaning

Data cleaning is a critical process for maintaining data quality and integrity. Here are five main
reasons why data cleaning is important:
1. Improves Accuracy and Reliability: Data cleaning helps eliminate errors, inaccuracies, and
inconsistencies from datasets. Accurate data leads to more reliable analysis and decision-
making.
2. Enhances Data Usability: Clean data is easier to work with and analyze. It ensures that
datasets are complete, consistent, and formatted correctly, making it simpler to perform
accurate analyses, generate insights, and create reports. This usability is crucial for deriving
meaningful and actionable conclusions from the data.
3. Prevents Misleading Results: Inaccurate or dirty data can lead to misleading results and
conclusions. Data cleaning helps identify and correct errors or anomalies, reducing the risk of
incorrect interpretations and ensuring that any insights or decisions based on the data are based
on sound information.
4. Boosts Efficiency: Clean data reduces the time and effort required for data processing and
analysis. By addressing issues like duplicate records, missing values, or inconsistent formatting
beforehand, analysts and data scientists can focus on extracting insights rather than spending
time correcting data problems.
5. Ensures Compliance and Integrity: For organizations that need to comply with regulations or
standards, such as GDPR or HIPAA, data cleaning is essential to ensure that data is handled
appropriately. Clean data helps maintain data integrity and supports adherence to data
governance policies and legal requirements.

Data cleaning Techniques

Several common methods of data cleaning include:

1. Clustering: This method groups similar data points together, which can help identify outliers or
inconsistencies.
2. Regression: This statistical technique can be used to fill in missing values by predicting them
based on other data points.
3. Binning: This method involves grouping continuous data into smaller ranges or bins, which can
help smooth out data and reduce noise.
4. Fill the Missing Value: This refers to various techniques for handling missing data, such as
imputation or deletion.

 Data normalization
Data normalization is a vital pre-processing, mapping, and scaling method that helps forecasting and
prediction models become more accurate. The current data range is transformed into a new,
standardized range using this method.

Normalization is scaling the data to be analysed to a specific range such as [0.0, 1.0] to provide better
results.

Min-max scaling and Z-Score Normalisation (Standardisation) are the two methods most frequently
used for normalization in feature scaling.

Importance of data normalization


 Reduced Data Redundancy: Eliminates duplicate data, saving storage space and reducing
maintenance efforts. Prevents inconsistencies and errors that can arise from updating multiple
copies of the same data.

 Enhanced Data Integrity: Ensures data accuracy and consistency by preventing anomalies
like update, insert, and delete anomalies. Maintains data reliability and trustworthiness.

 Improved Data Efficiency: Reduces query execution time by minimizing the amount of data
accessed. Improves database performance and responsiveness.

 Better Database Structure: Creates a well-organized and structured database, making it easier
to understand, manage, and modify. Facilitates data analysis and reporting.

 Facilitates Data Sharing: Normalized data is easier to share and integrate with other systems.
 Supports data exchange and collaboration.

Data normalization Techniques

1. Z-score Normalization
Z-score normalization, also known as standardization, is a common data pre-processing technique
used to transform numerical data into a standard normal distribution with a mean of 0 and a standard
deviation of 1. This process scales features to a common range, making them comparable and
improving the performance of many machine learning algorithms.

How it Works
The Z-score for a data point is calculated using the following formula:

Z = (X - μ) / σ

Where:

Z is the Z-score
X is the original value
μ is the mean of the dataset
σ is the standard deviation of the dataset

Example
Consider a dataset with two features: age (in years) and income (in thousands of dollars). Age might
range from 20 to 60, while income could vary from 30 to 200. Without normalization, the income
feature would have a much larger impact on the model due to its larger scale. By applying Z-score
normalization, both features are scaled to a similar range, allowing for a fairer comparison.
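
A minimal sketch of Z-score normalization, computed both manually with the formula above and with scikit-learn's StandardScaler (the age and income values are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[25, 40.0],
                 [35, 120.0],
                 [45, 80.0],
                 [55, 200.0]])    # columns: age (years), income (thousands)

# manual Z-score: (X - mean) / standard deviation, computed per column
z_manual = (data - data.mean(axis=0)) / data.std(axis=0)

# the same result using scikit-learn
z_sklearn = StandardScaler().fit_transform(data)

print(z_manual)
print(z_sklearn)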

2. Min-Max normalization:
This method normalizes the data with a linear transformation. The minimum and maximum values of
the attribute are obtained, and each value is then rescaled using the following formula (for a target
range of [0.0, 1.0]):

v' = (v - Xmin(A)) / (Xmax(A) - Xmin(A))

Where:

 X is the attribute being scaled.
 Xmin(A) and Xmax(A) are the minimum and maximum values of X.
 v' is the new value of every entry in the data.
 v is the old value of every entry in the data.
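
A minimal sketch of min-max scaling, computed both with the formula above and with scikit-learn's MinMaxScaler (the values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[20.0], [30.0], [45.0], [60.0]])   # e.g. ages

# manual min-max scaling to [0.0, 1.0]
scaled_manual = (values - values.min()) / (values.max() - values.min())

# the same result using scikit-learn
scaled_sklearn = MinMaxScaler().fit_transform(values)

print(scaled_manual.ravel())
print(scaled_sklearn.ravel())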

3. Decimal Scaling Normalization


Decimal scaling is another way of normalizing. It normalizes data by shifting the decimal point of the
values: each data value is divided by a power of ten determined by the largest absolute value in the
data. The data value v is normalized to v' using the formula below.

Formula: v’ = v / 10^j

Where:

 v’ is the new value after decimal scaling is applied.


 v is the original value of the attribute.
 j is the smallest integer such that the maximum absolute value of v' is less than 1.

For example, suppose the values of feature F range from 825 to 850. The greatest absolute value of
feature F is 850, so j is three. To use decimal scaling for normalization, we divide all values by 1,000.
As a result, 850 is normalized to 0.850 and 825 becomes 0.825.

The decimal point of the data is shifted according to the maximum absolute value, so the normalized
values always fall between -1 and 1.
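
As a minimal sketch, decimal scaling can be implemented in a few lines of NumPy; the values follow the example above:

import numpy as np

values = np.array([850.0, 825.0])

# choose j so that the largest scaled absolute value is below 1
j = int(np.floor(np.log10(np.abs(values).max()))) + 1
scaled = values / (10 ** j)

print(j)        # 3
print(scaled)   # [0.85  0.825]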

 Data transformation
Data transformation is the process of converting and structuring data into a usable format. Data
transformation is used when data needs to be converted to match the format of the destination system;
it turns raw data into useful information.
Data Transformation Process

 Data Discovery: identify data in its source format.


 Data Mapping: determine how individual fields are modified, mapped, filtered, joined, and
aggregated.
 Data Extraction: extract the data from its original source.
 Code Generation and Execution: create a code to complete the transformation.
 Review: check it to ensure everything has been formatted correctly.
 Sending: involves sending the data to its target destination.

Importance of data transformation


Data transformation Techniques

1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some algorithms.
Techniques used in data smoothing such as:
 Binning
 Regression
 Clustering
2. Attribute Construction
In the attribute construction method, new attributes are built from the existing attributes to produce a
data set that eases data mining.

For example, suppose we have a data set containing measurements of different plots, i.e., the height
and width of each plot. Here, we can construct a new attribute 'area' from the attributes 'height' and
'width'.

3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary format.

4. Data Discretization
Data Discretization is a process of converting continuous data into a set of data intervals.

For example, the values for the age attribute can be replaced by the interval labels such as (0-10, 11-
20…) or (kid, youth, adult, senior).
5. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy

For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a higher
conceptual level into a categorical value (young, old).

6. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0,
1.0].
Different methods to normalize the data
 Min-max normalization
 Z-score normalization
 Decimal Scaling
Ways of Data Transformation

 Scripting: Data transformation through scripting involves using Python or SQL to write code
that extracts and transforms data.
 On-Premises ETL Tools: ETL tools take the required work to script the data transformation
by automating the process.
 Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in the
cloud.

Types of Data transformation

Data transformation involves converting data from its original format or structure into a format that is
more suitable for analysis or modeling. Here are some common types of data transformation
techniques:

1. Normalization: scaling the data to a specific range, such as [0.0, 1.0], to provide better
results. This includes:

 Min-Max Scaling: Transforms data to fit within a specific range, typically [0, 1].
 Z-Score Standardization: Converts data to have a mean of 0 and a standard deviation of 1.
 Robust Scaling: Uses the median and interquartile range to scale data, which is robust to
outliers.

2. Encoding Categorical Data

 One-Hot Encoding: Converts categorical variables into binary vectors, creating a new column
for each category.
 Label Encoding: Converts categorical values into numerical labels, assigning each category a
unique integer.
 Ordinal Encoding: Similar to label encoding, but preserves the order of categories if they are
ordinal (i.e., have a meaningful sequence).
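
The short sketch below shows one possible way to apply one-hot and label encoding with pandas and
Scikit-Learn on a hypothetical 'colour' column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'green']})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['colour'], prefix='colour')

# Label encoding: one integer per category
label_encoder = LabelEncoder()
df['colour_label'] = label_encoder.fit_transform(df['colour'])

print(one_hot)
print(df)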
3. Feature Engineering

 Feature Extraction: Creating new features from existing data, such as extracting the year from
a date or deriving interaction terms between features.
 Feature Selection: Choosing a subset of relevant features to improve model performance and
reduce dimensionality.
 Polynomial Features: Generating new features by raising existing features to a power or
creating interaction terms.

4. Aggregation

 Summarization: Combining data at a higher level, such as aggregating sales data by month or
by region.
 Rolling Aggregates: Computing statistics (e.g., mean, sum) over a moving window, useful for
time-series data.

5. Discretization/Binning

 Equal-Width Binning: Dividing the range of continuous data into equal-sized intervals.
 Equal-Frequency Binning: Dividing data so that each bin contains approximately the same
number of observations.
 Custom Binning: Defining bins based on domain knowledge or specific criteria relevant to the
problem.
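
As an illustration, the sketch below applies the three binning strategies to a hypothetical series of ages
using pandas:

import pandas as pd

ages = pd.Series([5, 17, 25, 34, 52, 70])

# Equal-width binning: bins of equal size across the age range
equal_width = pd.cut(ages, bins=3, labels=['low', 'medium', 'high'])

# Equal-frequency binning: each bin holds roughly the same number of observations
equal_freq = pd.qcut(ages, q=3, labels=['low', 'medium', 'high'])

# Custom binning based on domain knowledge
custom = pd.cut(ages, bins=[0, 12, 19, 59, 120], labels=['kid', 'youth', 'adult', 'senior'])

print(equal_width.tolist())
print(equal_freq.tolist())
print(custom.tolist())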

6. Power Transformation

 Square Root Transformation: Applying the square root function to reduce skewness.
 Box-Cox Transformation: A family of power transformations that can stabilize variance and
make the data more normally distributed.
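
A minimal sketch of both transformations is given below, using NumPy for the square root and
Scikit-Learn's PowerTransformer for Box-Cox (which requires strictly positive values); the skewed
data is hypothetical:

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed data (strictly positive, as Box-Cox requires)
X = np.array([[1.0], [2.0], [4.0], [8.0], [16.0], [64.0]])

# Square root transformation reduces skewness
sqrt_X = np.sqrt(X)

# Box-Cox transformation, fitted from the data
pt = PowerTransformer(method='box-cox')
boxcox_X = pt.fit_transform(X)

print(sqrt_X.ravel())
print(boxcox_X.ravel())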

7. Data Cleaning

 Handling Missing Values: Imputing or removing missing data.


 Outlier Treatment: Identifying and managing outliers to prevent them from skewing analysis.

8. Data Reduction

 Principal Component Analysis (PCA): Reducing the dimensionality of data by projecting it


onto a lower-dimensional space while preserving variance.
 Feature Selection Techniques: Selecting a subset of relevant features to simplify models and
reduce computational cost.

9. Date and Time Transformation

 Extracting Components: Extracting parts of date/time data, such as year, month, day, hour,
etc.

 Time-based Features: Creating features like day of the week, seasonality, or time since an
event.
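
The sketch below extracts date components and a simple time-based feature with pandas, using
hypothetical transaction dates:

import pandas as pd

# Hypothetical transaction dates
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-15', '2024-03-02', '2024-07-09'])})

# Extracting components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek   # 0 = Monday

# Time-based feature: days since a reference event
reference = pd.Timestamp('2024-01-01')
df['days_since_start'] = (df['date'] - reference).dt.days

print(df)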

10. Data Scaling

 Unit Vector Scaling: Scaling data to have a norm of 1, often used in text data analysis.

Learning outcome 2: Develop Machine Learning Model

● Description of Machine learning algorithms and applications


Machine learning is the field of study that enables computers to learn from data and make decisions
without explicit programming. Machine learning models play a pivotal role in tackling real-world
problems across various domains by affecting our approach to tackling problems and decision-making.

Some of the key terminologies of ML before building one are:


 Feature: Features are the pieces of information that we use to train our model to make
predictions. In simpler terms, they are the columns or attributes of the dataset that contain the
data used for analysis and modelling.
 Label: The output or target variable that the model aims to predict in supervised learning, also
known as the dependent variable.
 Training set: The portion of the dataset that is used to train the machine learning model. The
model learns patterns and relationships in the data from the training set.
 Validation set: A subset of the dataset that is used to tune the model’s hyper parameters and
helps in assessing performance during training of the model.
 Test Set: It is also a part of the dataset that is used to evaluate our final model performance on
unseen data.

 Supervised Learning Algorithms


Supervised learning is a type of machine learning where a set of labelled data is used to train a
model for future predictions. Labelled data consists of input features and output values, allowing the
algorithm to make decisions based on the provided data.
Supervised learning algorithms can be further divided into two categories depending on the type of
output they produce.

 Regression Algorithms are used to predict a continuous numerical value, such as a
house's price or a day's temperature. Different types of regression algorithms exist,
such as linear regression, polynomial regression, and ridge regression.
 Classification Algorithms are used to predict a categorical or discrete value, such
as whether an email is spam. Examples of classification algorithms include decision
trees, support vector machines (SVM), and k-nearest neighbours (KNN).
Supervised learning can be used to perform classification or regression tasks. Standard supervised
learning algorithms include:

Linear Regression
Linear regression is used to identify the linear relationship between a dependent variable and one or
more independent variables.

It is commonly used to predict the value of a continuous variable from the input given by multiple
features or to classify observations into categories by considering their correlation with reference
variables.

Logistic Regression
Logistic regression is a type of predictive modelling algorithm used for classification tasks. It is used
to estimate the probability of an event occurring based on the values of one or more predictor
variables.

The predicted probabilities are expected to lie between 0 and 1 and are usually represented as a 0 or 1
if the predicted value is above or below a given threshold.
An advantage of logistic regression compared to other supervised machine learning algorithms is its
simplicity and interpretability.

Decision Trees
Decision Trees are popular supervised learning algorithms for classification and regression tasks
using a "tree" structure to represent decisions and their associated outcomes.

Each node of the tree represents an attribute, while each branch represents a decision.
Random Forest
Random forests are essentially multiple decision trees combined to form one powerful "forest" model
with better predictive accuracy than individual trees.

Using a random forest model also reduces overfitting because many trees are generated, so the model
is less likely to fit the training data too closely. In other words, it handles bias and variance well.

Support Vector Machine (SVM)


Support Vector Machines (SVM) are robust algorithms that use a kernel to map data into a high-
dimensional space and then draw a linear boundary between the distinct classes.

SVMs are often used for text classification or image recognition because they can accurately predict
categorical variables from large datasets.
SVMs also have several advantages over other supervised learning algorithms. For instance, they are
relatively tolerant of noise and work well with high-dimensional feature spaces.

k-Nearest Neighbors (KNN)
k-NN is a simple supervised learning algorithm that models the relationship between a given data
point and its "nearest" neighbours.

It classifies or predicts a value for a data point based on a chosen number of nearest neighbours, i.e.
the points with the most similar features or properties.

Naive Bayes
Naive Bayes classification is a powerful supervised machine learning technique. It uses Bayes'
Theorem to calculate the probability that a given data point belongs to one class or another based on
its input features.

This makes it perfect for applications such as
 Text classification,
 Recommendation systems,
 Sentiment analysis,
 Image recognition

 Unsupervised Learning Algorithms


Unsupervised learning is a machine learning approach in which models do not have any supervisor to
guide them. Models themselves find the hidden patterns and insights from the provided data.
It mainly handles unlabelled data. It can be compared to the learning that occurs when a student solves
problems without a teacher's supervision.

Unsupervised learning aims to discover the dataset’s underlying pattern, assemble that data
according to similarities, and express that dataset in a precise format. Let us discuss different
unsupervised machine learning algorithms.
K-Means Clustering
K-Means Clustering is an Unsupervised Learning algorithm. It arranges the unlabelled dataset into
several clusters.
Here K denotes the number of pre-defined groups. K can hold any value; if K=3, there will be three
clusters, and for K=4, there will be four clusters.
It is an iterative algorithm that splits the given unlabelled dataset into K clusters.
Each data point belongs to only one cluster whose members share related properties, which allows us
to organise the data into several groups.
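
A minimal sketch of K-Means with Scikit-Learn on a small, hypothetical 2-D dataset and K = 2 is
shown below:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled 2-D data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# K = 2 pre-defined clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Cluster centres:", kmeans.cluster_centers_)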

Principal Component Analysis (PCA)


Principal Component Analysis is an unsupervised learning algorithm. We use it for dimensionality
reduction in machine learning. Principal Component Analysis (PCA) is a dimensionality reduction

technique in machine learning that transforms data into a new coordinate system. It projects the data
onto a set of orthogonal axes, known as principal components, which capture the most variance in the
data. The ultimate goal is to reduce the number of dimensions while preserving as much of the
variability in the data as possible.
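
As an illustration, the sketch below uses Scikit-Learn's PCA to reduce the built-in Iris dataset from
four features to two principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Reduce the 4-feature Iris dataset to 2 principal components
X = load_iris().data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)          # (150, 4)
print("Reduced shape:", X_reduced.shape)   # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)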

Hierarchical Clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an unsupervised clustering
algorithm. It builds clusters that follow an order from top to bottom.

For example, all files and folders on the hard disk are organised in a hierarchy.
The algorithm groups related objects into clusters. In the end, we obtain a set of clusters in which each
cluster is different from the others, while the data points within each cluster are broadly similar to
each other.

There are two types of hierarchical clustering methods:


1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering
 Agglomerative Hierarchical Clustering
In an agglomerative hierarchical algorithm, each data point is considered a single cluster. Then these
clusters successively unite or agglomerate (bottom-up approach) the clusters’ sets. The hierarchy of
the clusters is shown using a dendrogram.

 Divisive Hierarchical Clustering


In a divisive hierarchical algorithm, all the data points form one colossal cluster. The clustering
method involves partitioning (Top-down approach) one massive cluster into several small clusters.

Anomaly Detection
Anomaly detection, also known as outlier detection, is a process in machine learning and statistics
used to identify data points that significantly differ from the majority of the data. These outliers or
anomalies can represent critical or rare events, errors, or novel situations that are different from the
norm.
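
One common way to do this in practice is with Scikit-Learn's IsolationForest; the sketch below flags
two obvious outliers in a small, hypothetical 1-D dataset (the contamination value is an assumption set
for this toy example):

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: most points near 0, two obvious outliers
X = np.array([[0.1], [0.2], [-0.1], [0.3], [0.0], [5.0], [-6.0]])

# IsolationForest is one common algorithm for outlier detection
detector = IsolationForest(contamination=0.3, random_state=42)
labels = detector.fit_predict(X)   # -1 = anomaly, 1 = normal

print(labels)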

 Semi-supervised Learning Models


Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labelled data and a large amount of
unlabelled data to train a model.
Examples of Semi-Supervised Learning
Text classification: In text classification, the goal is to classify a given text into one or more
predefined categories. Semi-supervised learning can be used to train a text classification model using a
small amount of labelled data and a large amount of unlabelled text data.
Image classification: In image classification, the goal is to classify a given image into one or more
predefined categories. Semi-supervised learning can be used to train an image classification model
using a small amount of labelled data and a large amount of unlabelled image data.

Anomaly detection: In anomaly detection, the goal is to detect patterns or observations that are
unusual or different from the norm
 Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions
by interacting with an environment. The agent takes actions to maximize cumulative rewards over
time. Unlike supervised learning, where the model learns from labelled data, RL is based on trial and
error.
Example:
We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best
possible path to reach the reward. The following example illustrates this more clearly.

Types of Reinforcement:
There are two types of Reinforcement:
1. Positive: Positive reinforcement occurs when an event, triggered by a particular
behaviour, increases the strength and frequency of that behaviour. In other words, it
has a positive effect on the behaviour.
2. Negative: Negative reinforcement is the strengthening of a behaviour because a
negative condition is stopped or avoided.
Elements of Reinforcement Learning
Reinforcement learning elements are as follows:
Policy: A policy defines the learning agent's behaviour at a given time.
Reward function: A reward function defines the goal in a reinforcement learning problem. It provides
a numerical score based on the state of the environment.
Value function: Value functions specify what is good in the long run. The value of a state is the total
amount of reward an agent can expect to accumulate over the future, starting from that state.
Model of the environment: Models are used for planning.
Q-Learning
Q-Learning is a model-free reinforcement learning algorithm used to find the optimal action-selection
policy for an agent. It works by learning a value function, called the Q-function, which estimates the
expected cumulative reward of taking an action in a given state and following the optimal policy
thereafter.

How Q-Learning Works

As the agent exposes itself to the environment and receives different rewards by executing different
actions, the values are updated per the following equation:

Q_new(s, a) = Q(s, a) + α [ R(s, a) + γ · max Q(s', a') - Q(s, a) ]

where Q(s, a) is the current Q-value, Q_new(s, a) is the updated Q-value, α is the learning rate,
R(s, a) is the reward, max Q(s', a') is the highest Q-value available from the next state s', and γ is a
number between 0 and 1 used to discount the reward as time passes, based on the assumption that
actions taken at the beginning are more important than those at the end (an assumption confirmed by
many real-life use cases).

Example
Our agent is a rat that has to cross a maze and reach the end (its house) point. There are mines in the
maze, and the rat can only move one tile (from one square to one another) at a time. If the rat steps
onto a mine, it will be dead. The rat wants to reach its home in the shortest time possible:
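
The sketch below shows how the Q-value update above could be coded for a hypothetical maze with 6
states and 4 actions; the state/action numbering and the reward value are made-up assumptions used
only for illustration:

import numpy as np

# Hypothetical tiny maze: 6 states, 4 actions (up, down, left, right)
n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))

alpha, gamma = 0.1, 0.9   # learning rate and discount factor

def update_q(state, action, reward, next_state):
    """Apply the Q-learning update rule shown above."""
    best_next = np.max(Q[next_state])
    Q[state, action] = Q[state, action] + alpha * (reward + gamma * best_next - Q[state, action])

# One hypothetical step: in state 0 the agent moves right (action 3),
# receives a reward of 1 and lands in state 1
update_q(state=0, action=3, reward=1, next_state=1)
print(Q[0])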

Deep Q-Networks (DQN)


Deep Q-Networks (DQN) are a type of reinforcement learning algorithm that combines Q-learning
with deep neural networks. DQNs represent a significant advance in reinforcement learning by
addressing the limitations of traditional Q-learning and enabling the handling of complex
environments with high-dimensional state and action spaces.

 Neural Network Architectures

Artificial Neural Networks (ANNs) are a broad category of neural network models that are designed
to simulate the way the human brain processes information. They consist of interconnected nodes
(neurons) organized in layers.
 Artificial Neural Networks (ANNs) Components:
These networks usually consist of an input layer, one to two hidden layers, and an output layer.
 Input Layer: Receives the input features.
 Hidden Layers: Intermediate layers where computations are performed. Each neuron in a
hidden layer receives inputs from the previous layer and sends its output to the next layer.
 Output Layer: Produces the final output.
Deep Neural Networks (DNNs) are a subset of ANNs with many hidden layers. The term "deep"
refers to the depth of the network, meaning the number of hidden layers between the input and output
layers.
 Deep Neural Networks (DNNs) Components:
 Multiple Hidden Layers: DNNs have multiple hidden layers, which allow them to learn
complex representations and features from the data.
 Hierarchical Learning: Layers are stacked in a way that each layer learns increasingly
abstract features of the data.

The various possible neural network architectures are:


A. Single-layer Feed Forward Network
A Single-Layer Feed Forward Network, often referred to as a Single-Layer Perceptron, is one of the
simplest types of artificial neural networks. The signals always flow from the input layer to the output
layer. Hence, the network is known as FEED FORWARD.
 Structure:
Input Layer: Takes input features and passes them directly to the output layer.
Output Layer: Produces the final output based on weighted sums of the inputs.
 Characteristics:
No Hidden Layers: It consists only of an input layer and an output layer.
Linear Decision Boundaries: Can only learn linear relationships between inputs and outputs. For
complex patterns, it may not perform well.

B. Multi-layer Feed Forward Network


A Multi-Layer Feed Forward Network, also known as a Multi-Layer Perceptron (MLP), consists of
multiple layers of neurons, including at least one hidden layer between the input and output layers.
 Structure:
Input Layer: Receives input features.
Hidden Layers: One or more layers of neurons that process the inputs through non-linear activation
functions.
Output Layer: Produces the final output, which can be for classification, regression, or other tasks.
 Characteristics:
Non-Linear Decision Boundaries: Due to the non-linear activation functions in hidden layers, MLPs
can model complex relationships and learn non-linear decision boundaries.
Training: Uses back propagation and gradient descent to update weights.
 Applications:
Suitable for a wide range of tasks including image recognition, speech recognition, and complex
pattern recognition.
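
As an illustration, the sketch below trains a small multi-layer perceptron with Scikit-Learn's
MLPClassifier on the built-in Iris dataset; the hidden layer sizes are arbitrary choices, not a
recommendation:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Two hidden layers with 16 and 8 neurons; ReLU activation is the default
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)

print("Test accuracy:", mlp.score(X_test, y_test))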

C. Competitive Network
A Competitive Network is a type of neural network where neurons compete with each other to respond
to a particular input. Only the "winning" neuron (or neurons) is activated, and it provides the output.
 Structure:
Input Layer: Receives the input features.
Output Layer: Consists of neurons that compete to respond to the input. The neuron with the highest
activation typically "wins" and is activated.
 Characteristics:
Winner-Takes-All: Only the neuron with the highest activation value for a given input is selected.
Unsupervised Learning: Often used in clustering and feature extraction tasks, as it can identify
patterns and similarities in the input data.

D. Recurrent Network
A Recurrent Network is a type of neural network where connections between neurons form directed
cycles. This allows the network to maintain a state and process sequences of data.
 Structure:
Input Layer: Receives sequential data.
Hidden Layers: Includes recurrent connections where outputs are fed back into the network. This
allows the network to have memory of previous inputs.
Output Layer: Produces output based on the current state and past inputs.
 Characteristics:
Temporal Dynamics: Capable of modelling temporal sequences and dependencies in data.
Stateful: Maintains a state or memory of previous inputs, which is crucial for tasks involving
sequential data.

● Types of neural networks


These are different types of neural networks used in machine learning and artificial intelligence. Here's
a brief overview of each:
Feed forward Neural Networks
Feed forward neural networks are a form of artificial neural network in which data flows in one
direction, from the input nodes through the hidden layers to the output nodes, without forming any
cycles between layers or nodes.
Architecture: Made up of layers with a unidirectional flow of data (from the input layer through the
hidden layers to the output layer).
Training: Back propagation is often used during training, with the main aim of reducing prediction
errors.

Neural Networks (Artificial Neural Networks)
Neural networks (artificial neural networks) refer to a broad category of networks inspired by the
human brain. They consist of interconnected layers of nodes (neurons), including input, hidden, and
output layers.
They can be feed forward, recurrent, or convolutional, among other types. They are used for various
complex tasks such as image recognition, language translation, and more.
Example Use: Any machine learning problem where patterns need to be learned from data.
Convolutional Neural Networks (CNNs)
Convolutional neural networks are designed to process grid-like data such as images and videos,
using convolutional layers whose filters learn patterns and spatial hierarchies.
Key Components: Convolutional layers, pooling layers, and fully connected layers.
Applications: Image classification, object detection, medical imaging analysis, autonomous driving,
and visualization in augmented reality.

Recurrent Neural Networks (RNNs)
Recurrent neural networks handle sequential data, in which the current output depends on previous
inputs, by looping connections back into the network to hold an internal state (memory).
Architecture: Contains recurrent connections that enable feedback loops for processing sequences.
Applications: Language translation, text classification, conversational interaction, and time series
prediction.

● Selection of machine learning algorithm


Choosing the right machine learning algorithm involves a series of steps, from understanding the
problem to evaluating resource constraints. Here’s a structured approach to help you select the most
appropriate algorithm:

 Identify problem
Regression: When the goal is to predict a continuous outcome. For example,
predicting house prices based on features like size and location.

Classification: When the goal is to predict discrete labels or categories. For


example, classifying emails as spam or not spam.

Clustering: When the goal is to group similar data points together without
predefined labels. For example, customer segmentation based on purchasing
behaviour.
 analyse type of data in a given dataset
Independent variable: Variables used as input features for prediction. They can be
continuous, categorical, or a mix.

Dependent variable: The target variable you are trying to predict or classify. In
regression, it’s continuous; in classification, it’s categorical.
 Resource analysis
computational power
Some algorithms are computationally intensive (e.g., deep learning models) and may
require powerful hardware like GPUs. Simple models (e.g., Linear Regression, k-NN)
are less resource-intensive and can be run on standard CPUs.

memory limitations
Large datasets and complex models require more memory. Algorithms like k-NN or
decision trees can be memory-intensive. Techniques like dimensionality reduction
(PCA) or using subsampling methods can help manage memory usage.
 choose machine learning algorithm to be used
Based on the problem type, data characteristics, and resource constraints, select an algorithm
that best fits your needs. Here are some guidelines:
For Regression:
 Simple: Linear Regression (if the relationship is linear).
 Complex: Random Forest Regression, Gradient Boosting, or Neural Networks (if the
relationship is non-linear and complex).
For Classification:
 Simple: Logistic Regression, Naive Bayes (if the data is linearly separable).
 Complex: Random Forest, Support Vector Machines, or Deep Learning models (for
non-linear and complex decision boundaries).
For Clustering:
 Simple: k-Means (if the clusters are spherical and well-separated).
 Complex: DBSCAN or Hierarchical Clustering (if the clusters have different shapes
or densities).
● Train machine learning model

 Load a dataset
To load a dataset in Python, you can use various libraries depending on the format of your data and
your specific needs. Here’s a guide on how to load a dataset using different libraries:
Using pandas
Pandas is a powerful library for data manipulation and analysis. It supports various file formats,
including CSV, Excel, and SQL databases.

 CSV File:

import pandas as pd
df = pd.read_csv('path/to/your/file.csv')

 Excel File:

import pandas as pd

# Load an Excel file into a DataFrame


df = pd.read_excel('path/to/your/file.xlsx', sheet_name='Sheet1')
 SQL Database:

import pandas as pd
import sqlite3

# Create a connection to the SQL database


conn = sqlite3.connect('path/to/your/database.db')

# Load a SQL table into a DataFrame


df = pd.read_sql('SELECT * FROM table_name', conn)

Using Numpy
Numpy is a fundamental library for numerical computing in Python. It’s commonly used for loading
datasets with numerical data, especially from text files.

 Text File:

import numpy as np
# Load data from a text file into a NumPy array
data = np.loadtxt('path/to/your/file.txt', delimiter=',')
 CSV File:

import numpy as np
# Load a CSV file into a NumPy array
data = np.genfromtxt('path/to/your/file.csv', delimiter=',', skip_header=1)

Using Scikit-Learn
Scikit-Learn includes some built-in datasets and functions for loading them. It is particularly useful for
quickly accessing standard datasets for machine learning.

 Built-in Datasets:

from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset (or any other built-in dataset)
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

Using Seaborn
Seaborn is a statistical data visualization library based on Matplotlib, and it includes some built-in
datasets. It also works well with Pandas DataFrames.

 Built-in Datasets:
import seaborn as sns
# Load a built-in Seaborn dataset
df = sns.load_dataset('iris')
Using Requests and io
When loading datasets from a URL, requests can be used to fetch the data, and io can be used to
handle the data in-memory.

 CSV File from URL:


import pandas as pd
import requests
from io import StringIO

# Fetch the data from the URL


url = 'https://example.com/path/to/your/file.csv'
response = requests.get(url)

# Load the data into a DataFrame


data = StringIO(response.text)
df = pd.read_csv(data)

 Excel File from URL:

import pandas as pd
import requests
from io import BytesIO

# Fetch the data from the URL


url = 'https://example.com/path/to/your/file.xlsx'
response = requests.get(url)

# Load the data into a DataFrame


data = BytesIO(response.content)
df = pd.read_excel(data, sheet_name='Sheet1')
 Split dataset
Splitting a dataset is a fundamental step in many machine learning workflows. It typically involves
dividing the data into separate subsets for training, validation, and testing. This helps in assessing the
performance of a model and ensuring that it generalizes well to new, unseen data.
Here’s a general approach to split a dataset:
Train set: used to train the model (typically 60-80% of the data).
Validation set: used to tune hyper parameters and validate the model during training
(typically 10-20% of the data).
Test set: used to evaluate the final model performance on unseen data (typically 10-20%
of the data).
Dataset Splitting:
Scikit-learn (imported as sklearn) is one of the most useful and robust libraries for machine learning in
Python. The scikit-learn library provides the model_selection module, which contains the splitter
function train_test_split().

Syntax:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True,
stratify=None)
Parameters:
 *arrays: inputs such as lists, arrays, data frames, or matrices
 test_size: this is a float value whose value ranges between 0.0 and 1.0. it represents the
proportion of our test size. its default value is none.
 train_size: this is a float value whose value ranges between 0.0 and 1.0. it represents the
proportion of our train size. its default value is none.
 random_state: this parameter is used to control the shuffling applied to the data before
applying the split. it acts as a seed.
 shuffle: This parameter is used to shuffle the data before splitting. Its default value is true.
 stratify: This parameter is used to split the data in a stratified fashion
Example:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
# Let's assume we have features X and target y
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 1, 0, 1, 0, 1])
# Split the dataset into training and test sets
# test_size=0.2 means 20% of the data will be used as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the results
print("Training features:\n", X_train)
print("Training labels:\n", y_train)
print("Test features:\n", X_test)
print("Test labels:\n", y_test)
Note: random_state is set to 42, ensuring that the random shuffling and splitting are identical each
time the code is run.
 Initialize model
Model initialization refers to the process of creating an instance of a machine learning model. This is
where you define the type of model you want to use and set any hyper parameters that control the
behaviour of the model.

Purpose: Defines the model’s architecture and behaviour. Proper initialization can significantly
impact model performance. Allows you to set hyper parameters that affect the learning process,
regularization, and model complexity.
Some models come with hyper parameters that you can set when initializing the model. For example:
 Linear Regression: Regularization parameters (e.g., alpha for Ridge Regression).
 Decision Trees: Maximum depth of the tree, minimum samples required to split a node.
 Logistic Regression: Solver type, regularization strength.
1. Linear Regression
Linear Regression is a model used for predicting continuous values.
from sklearn.linear_model import LinearRegression
# Initialize the Linear Regression model
linear_reg_model = LinearRegression()
2. Decision Tree
Decision Tree is used for both classification and regression tasks. Here, we’ll initialize it for
classification.
from sklearn.tree import DecisionTreeClassifier
# Initialize the Decision Tree Classifier
decision_tree_model = DecisionTreeClassifier(max_depth=3, criterion='gini')
 max_depth=3 limits the maximum depth of the tree to prevent overfitting.
 criterion='gini' specifies the function to measure the quality of a split. You can also use
'entropy'.
3. Logistic Regression
Logistic Regression is used for binary or multi-class classification problems.
from sklearn.linear_model import LogisticRegression
# Initialize the Logistic Regression model
logistic_reg_model = LogisticRegression(max_iter=200, C=1.0, solver='liblinear')
 max_iter=200 specifies the maximum number of iterations for the solver.
 C=1.0 is the regularization strength (smaller values mean stronger regularization).
 solver='liblinear' is the algorithm to use for optimization.
4. Random Forest
Random Forest is an ensemble method used for both classification and regression. Here, we’ll
initialize it for classification.
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
random_forest_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
 n_estimators=100 specifies the number of trees in the forest.
 max_depth=5 limits the maximum depth of each tree.
 random_state=42 ensures reproducibility by setting a fixed seed for random number
generation.
 Fit the training data into a model
Fitting refers to the process of training the model using your data. This involves adjusting the model’s
internal parameters to learn patterns from the training data.
Trains the model to understand patterns and relationships in the data. The model learns how to make
predictions or classifications based on the training data.
# Fit the model to the training data
model.fit(X_train, y_train)

● Evaluation of machine learning model

 prediction of result
In machine learning, once you have trained a model on your training data, you can use it to make
predictions on both test data and new (unseen) data.
on test data
Prediction on test data evaluates how well your model performs on a separate set of data that it
hasn't seen during training. This helps in assessing the model's performance and generalization ability.

# Initialize and fit the model (assuming X_train, y_train come from an earlier split)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
# Evaluate the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on test data: {accuracy:.2f}")
on new data (unseen data)
Prediction on new (unseen) data uses the trained model to make predictions on data that the model
has not encountered before. This simulates how the model will perform in real-world scenarios.
# Assume the model has already been trained as in the previous example
# New unseen data (e.g., new samples to predict)
new_data = [[5.1, 3.5, 1.4, 0.2], # Example data point
[6.7, 3.0, 5.2, 2.3]] # Another example data point
# Predict on new data
new_predictions = model.predict(new_data)
print(f"Predictions on new data: {new_predictions}")
 Visualize predictions
Visualizing predictions can provide valuable insights into how well your machine learning model
performs and how it makes decisions. Depending on the type of problem (classification or
regression), the visualization techniques may differ. Here’s how you can visualize predictions for
both classification and regression tasks
1. Classification Predictions

For classification tasks, visualizing predictions typically involves comparing predicted labels to actual
labels and understanding the decision boundaries of the model.

a. Confusion Matrix
A confusion matrix shows the number of correct and incorrect predictions for each class. Confusion
matrix is a matrix that summarizes the performance of a machine learning model on a set of test data.
It is a means of displaying the number of accurate and inaccurate instances based on the model’s
predictions. It is often used to measure the performance of classification models, which aim to predict
a categorical label for each input instance.

 True Positive (TP): The model correctly predicted a positive outcome (the actual outcome
was positive).
 True Negative (TN): The model correctly predicted a negative outcome (the actual outcome
was negative).

 False Positive (FP): The model incorrectly predicted a positive outcome (the actual outcome
was negative). Also known as a Type I error.
 False Negative (FN): The model incorrectly predicted a negative outcome (the actual outcome
was positive). Also known as a Type II error.
1. Accuracy
Accuracy is used to measure the overall performance of the model. It is the ratio of correctly
predicted instances to the total number of instances.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

2. Precision
Precision is a measure of how accurate a model's positive predictions are. It is defined as the ratio
of true positive predictions to the total number of positive predictions made by the model.
Formula: Precision = TP / (TP + FP)

3. Recall
Recall measures the effectiveness of a classification model in identifying all relevant instances
from a dataset. It is the ratio of the number of true positive (TP) instances to the sum of true
positive and false negative (FN) instances.
Formula: Recall = TP / (TP + FN)

4. F1-Score
F1-score is used to evaluate the overall performance of a classification model. It is the harmonic
mean of precision and recall.
Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)

import matplotlib.pyplot as plt


import seaborn as sns
from sklearn.metrics import confusion_matrix
# Assuming y_test, y_pred and the loaded dataset object 'data' (e.g., from load_iris()) are available
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names,
yticklabels=data.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
2. Regression Predictions
For regression tasks, visualizing predictions involves plotting the predicted values against actual
values or visualizing the regression line.
a. Scatter Plot of Predictions vs. Actual Values
This helps in understanding how well the model’s predictions match the actual values.
import matplotlib.pyplot as plt
# Assuming y_test and y_pred are available from previous code
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2) # Line of perfect
predictions
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.title('True Values vs Predictions')
plt.show()
 Analyse evaluation metrics
Analysing evaluation metrics is crucial for understanding the performance of your machine learning
model. Each metric provides different insights depending on the type of problem you're dealing with,
whether it's classification or regression. Here’s an overview of the common evaluation metrics and
how to analyse them:
Accuracy: Provides a general idea of how well the model performs. It is best used
when the classes are balanced.

from sklearn.metrics import accuracy_score


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Precision: The ratio of correctly predicted positive observations to the total
predicted positives. It measures how many of the predicted positive cases are
actually positive.
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred, average='weighted')
print(f"Precision: {precision:.2f}")
Recall: The proportion of actual positive cases correctly identified as positive. It's
crucial when the cost of false negatives is high (e.g., fraud detection).
Formula: Recall = TP / (TP + FN)
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred, average='weighted')
print(f"Recall: {recall:.2f}")
F1 score: Harmonic mean of precision and recall, providing a balance between the
two.
Formula: F1 = 2 * (precision * recall) / (precision + recall)
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 Score: {f1:.2f}")

Mean Absolute Error (MAE): Average absolute difference between predicted and
actual values.
Formula: MAE = Σ(|actual - predicted|) / n
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")

Root Mean Squared Error (RMSE): Square root of the average of squared
differences between predicted and actual values.
Formula: RMSE = sqrt(Σ(actual - predicted)^2 / n)
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse:.2f}")
R Squared score: Proportion of the variance in the dependent variable explained
by the independent variables.
Formula: R^2 = 1 - (SS_res / SS_tot)
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")

Adjusted R Squared score: Modified R-squared that accounts for the number of
predictors.
Formula: Adjusted R^2 = 1 - [(1 - R^2) * (n - 1) / (n - k - 1)], where n is the number of
observations and k is the number of predictors.

 Model interpretation

● Tuning Hyperparameters

 Define Hyperparameters search space


 Choose performance metric
 Perform hyperparameter search
 Evaluate performance
Get the best parameters
Get the best model
predict on validation data
apply evaluation metrics on validation data
apply evaluation metrics on test data
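
The steps listed above can be illustrated with a minimal GridSearchCV sketch; the Random Forest
model, the parameter grid, and the Iris dataset are assumptions chosen only for demonstration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Define the hyperparameter search space
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, None]}

# 2-3. Choose a performance metric and perform the search with cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)

# 4. Evaluate: get the best parameters and the best model
print("Best parameters:", search.best_params_)
best_model = search.best_estimator_

# 5. Apply evaluation metrics on the held-out test data
y_pred = best_model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))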
 Resolve Potential Bias
Underfitting
Overfitting
Learning outcome 3: Perform Model Deployment

● Selection of model deployment method


✔ Description of model deployment methods
Definition
Benefits
✔ Identification of system specifications
Types of application (web application, mobile app, standalone program, Embedded
system)
Technology (programming languages and frameworks)
✔ Identification of model specifications
Size of the dataset
Memory limitations
Computing power
Format of the model file(Scikit-Learn, TensorFlow SavedModel, ONNX, PyTorch
PT)
● Integration of model file
✔ Integration goal
Predictions/Insights
Generate content
Data analysis
✔ Compatibility
✔ Interpret API endpoint usage
Model serving method
Loading strategy
✔ Integrate with existing systems
✔ Identification data format: Establish the format for input data sent to the API
JSON
Form data
✔ Implement communication
Use HTTP requests
HTTP Responses to interact with existing system
✔ Testing thoroughly deployment
✔ Deploy to production
✔ Monitor performance
✔ Track API requests, response times, and model accuracy.
● Delivering Prediction to the clients
✔ Integrating the API into application
✔ Formatting the predictions
✔ Handling errors
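
To illustrate the deployment outline above, here is a minimal sketch of serving a saved model file
through a RESTful API; Flask, joblib, and the file name 'model.pkl' are assumptions used for the
example, not requirements of this module:

# A minimal sketch of serving a saved Scikit-Learn model over a RESTful API.
# Assumes Flask is installed and that 'model.pkl' is a hypothetical file
# produced earlier with joblib.dump(model, 'model.pkl').
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')   # load the model file once at start-up

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Input data is expected as JSON, e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
        features = request.get_json()['features']
        prediction = model.predict(features).tolist()
        # Format the prediction as a JSON response for the client
        return jsonify({'prediction': prediction})
    except Exception as exc:
        # Handle errors by returning an HTTP 400 response with a message
        return jsonify({'error': str(exc)}), 400

if __name__ == '__main__':
    app.run(port=5000)

The client sends an HTTP POST request with the input features as JSON and receives the prediction
back as a JSON response, which matches the data format and communication steps listed above.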


