Machine Learning-1
MODULE CODE:
Credits: 8 Hours: 12
PREPARED BY:
UNKUNDIYE Jeanette
Tel:(+250)789415047
E-mail: [email protected]
Elements of Competence and Performance Criteria
Element of competence 1: Apply Data Pre-processing
Performance criteria:
1.1 Environment is properly prepared based on system requirements.
1.2 Data is properly manipulated based on the Python libraries' functionalities.
1.3 Visualization results are properly interpreted based on their statistical analysis.
1.4 Data cleaning is appropriately performed based on the provided dataset.

Element of competence 2: Develop Machine Learning Model
Performance criteria:
2.1 Machine Learning algorithm is properly selected based on the characteristics of the dataset.
2.2 Machine Learning models are properly trained based on a training set of data.
2.3 Machine Learning model performance is properly evaluated based on appropriate evaluation metrics.
2.4 Hyperparameters are properly fine-tuned based on evaluation results.

Element of competence 3: Perform Model Deployment
Performance criteria:
3.1 Deployment methods are clearly selected based on the requirements.
3.2 Model file is properly integrated in the system based on the deployment method (RESTful API guidelines).
3.3 Prediction responses are accurately delivered to the clients based on the model insights.
Definition
Machine learning (ML) is an application of artificial intelligence that uses algorithms and data to analyse information and make decisions automatically, without human intervention.
Machine learning life cycle involves seven major steps, which are given below:
1. Gathering Data
The goal of this step is to identify and obtain the data relevant to the problem. In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices.
This step includes the below tasks:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset, which will be used in further steps.
2. Data preparation
Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training. In this step, we first put all the data together and then randomize its ordering.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It involves cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step.
The data we have collected is not always ready for use, as some of it may not be useful. In real-world applications, collected data may have various issues, including:
Missing Values
Duplicate data
Invalid data
Noise
4. Analyse Data
This step aims to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc.; we then build the model using the prepared data and evaluate it.
1. Image Recognition: Image recognition is the task of identifying objects in digital images. It is a machine learning approach that uses a trained model to detect objects within an image.
3. Traffic prediction: Traffic prediction is a form of machine learning that involves forecasting traffic patterns based on historical data. This can be used to plan traffic routes better, reduce congestion, and improve safety.
4. Product Recommendations: Product recommendation is a machine learning application that uses data to recommend items to customers. It is used to personalize shopping experiences and increase sales.
5. Self-driving Cars: Self-driving cars are vehicles that are capable of sensing their environment and navigating without human input. They use a combination of machine learning techniques such as object recognition, path planning, and decision making to drive autonomously.
6. E-mail Spam and Malware Filtering: E-mail spam and malware filtering is a form of machine learning that uses algorithms to detect and filter out malicious emails. It is used to protect users from malicious emails, phishing scams, and other forms of cybercrime.
8. Online Fraud Detection: Online fraud detection is a form of machine learning that uses
algorithms to detect and prevent fraudulent activities. It is used to protect users from identity
theft, credit card fraud, and other forms of online crime.
9. Stock Market Trading: Stock market trading is a type of machine learning that uses
algorithms to analyze and predict stock prices. It is used to make better trading decisions and
increase profits.
10. Medical Diagnosis: Medical diagnosis is a form of machine learning that uses algorithms to
diagnose diseases. It is used to provide accurate and timely diagnoses to patients.
Some examples of supervised learning methods include linear regression, logistic regression, support vector machines, Naive Bayes, and decision trees.
The purpose of the model is to classify each kind of vehicle. The machine tries to find useful insights from the huge amount of data. Unsupervised learning can be further classified into two categories of algorithms: clustering and association.
Installation of Python
1.1 Download Python
1. Visit the Python Official Website: Go to Python's official website.
2. Choose the Version: Download the latest version of Python (usually the most recent stable
release is recommended). For compatibility with most tools and libraries, Python 3.x is
preferred over Python 2.x.
1.2 Install Python
Windows:
1. Run the installer you downloaded.
2. Check the box that says "Add Python to PATH" at the start of the installation.
3. Select "Install now" to begin the installation process.
4. Once the installation is complete, you can verify it by opening Command Prompt and
typing python --version.
macOS:
1. Open the .pkg installer you downloaded.
2. Follow the prompts in the installer.
3. Verify the installation by opening Terminal and typing python3 --version.
Linux:
1. Most Linux distributions come with Python pre-installed. To install or update Python,
use the package manager specific to your distribution.
For Debian-based systems (like Ubuntu): sudo apt update && sudo apt install python3
For Red Hat-based systems (like Fedora): sudo dnf install python3
2. Verify the installation by typing python3 --version in the Terminal.
Installation of Tools
To work with machine learning in Python, you'll need several tools and libraries. The most common ones include NumPy, pandas, Matplotlib, Seaborn, Plotly, and scikit-learn.
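These libraries can typically be installed with pip (the package names below are as published on PyPI):

pip install numpy pandas matplotlib seaborn plotly scikit-learn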
Environment Testing
To confirm that your Python environment and tools are correctly installed and configured, follow these
steps:
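For example, a quick check is to import the main libraries and print their versions. A minimal sketch, assuming the libraries listed above are installed:

import sys
import numpy, pandas, sklearn, matplotlib, seaborn, plotly

# Print the Python and library versions to confirm the environment is configured
print("Python:", sys.version)
for lib in (numpy, pandas, sklearn, matplotlib, seaborn, plotly):
    print(lib.__name__, lib.__version__)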
Database: An organized collection of data, typically structured and stored electronically. Examples: customer database, sales transaction data, sensor readings.
Data warehouse: Centralized repository for storing and managing large volumes of data.
Optimized for querying and analysis. Supports business intelligence and decision-making.
Combines data from various sources into a single, consistent view.
Big data: Extremely large and complex datasets that cannot be easily managed using traditional
data processing techniques. Characterized by volume, velocity, and variety. Requires
specialized technologies and tools for processing and analysis.
Examples: social media data, IoT sensor data, financial transaction data.
Identification of Source of data
IoT Sensors: These are devices embedded in physical objects that collect and transmit data
about their environment or the object itself. Examples include temperature sensors, motion
detectors, and smart meters.
Camera: This refers to devices that capture visual data, either through photographs or video.
Cameras can be used for security, monitoring, or image analysis.
Computer: This is a general-purpose electronic device that processes and stores data.
Computers can generate data through software applications, user inputs, or system logs.
Smartphone: This is a portable device that combines mobile phone capabilities with
computing features. Smartphones generate data from apps, sensors, user interactions, and
communications.
Social Data: This includes data generated from social media platforms and online interactions,
such as posts, likes, comments, and shares.
Transactional Data: This refers to data generated from transactions, such as purchases, sales,
or financial exchanges. It typically includes details about the transaction, such as date, amount,
and parties involved.
Volume: This refers to the amount of data being generated and stored. In the context of Big Data, it means handling vast quantities of data, ranging from terabytes to exabytes. Volume highlights the scale at which data is collected, processed, and stored.
Variety: This pertains to the different types and sources of data. Data can come in various
formats, such as structured data (like databases), semi-structured data (like XML or JSON),
and unstructured data (like text documents or videos). Variety emphasizes the diversity of data
types and the need to integrate them.
Velocity: This is the speed at which data is generated and processed. In the context of Big
Data, velocity refers to the fast rate at which data flows into systems and needs to be handled,
often in real-time or near-real-time scenarios.
Veracity: This relates to the accuracy and reliability of the data. Veracity involves assessing
the quality and trustworthiness of the data, ensuring it is accurate, complete, and free from
errors or biases.
Value: This is about the usefulness and relevance of the data. Value refers to extracting
meaningful insights and actionable information from data that can drive decisions, create
opportunities, or provide competitive advantages.
Variability: This describes the inconsistencies in data over time or across different sources.
Variability can affect data quality and analysis, making it important to account for fluctuations
and changes in data patterns.
Data can be categorized into three main types based on its format and organization: structured data,
semi-structured data, and unstructured data. Here's a description of each type:
1. Structured Data: Structured data is highly organized and easily searchable in predefined formats. It
is typically stored in relational databases or spreadsheets where data is arranged in rows and columns.
Characteristics:
o Format: Consistent and well-defined, often adhering to a schema.
o Examples: Customer names, addresses, transaction records, inventory lists.
o Access and Analysis: Easily queried using SQL and analyzed using various data
analysis tools.
Use Cases: Financial records, sales reports, customer databases.
2. Semi-structured Data: Semi-structured data does not conform to a rigid schema like structured
data but still contains organizational elements that make it easier to analyze than unstructured data. It
often includes metadata or tags to provide context.
Characteristics:
o Format: Partially organized with some identifiable structure, such as tags or keys.
o Examples: JSON files, XML documents, HTML files, emails, and log files.
o Access and Analysis: Requires specialized tools or programming languages (like
Python or R) to parse and analyze, often using hierarchical or tag-based queries.
Use Cases: Web data, social media posts, and data from APIs.
3. Unstructured Data: Unstructured data lacks a predefined format or structure, making it more
challenging to collect, process, and analyze. It does not fit neatly into tables or schemas.
Characteristics:
o Format: No standardized format; data can be free-form.
o Examples: Text documents, images, videos, audio files, and social media content.
o Access and Analysis: Requires advanced techniques such as natural language
processing (NLP), machine learning, and text mining to extract insights.
Use Cases: Customer feedback, multimedia content, and content from blogs or forums.
1. Define Objectives and Requirements: define the problem you want to solve or the task you
want to perform (e.g., classification, regression, clustering). Determine the type of data needed,
including features (input variables) and labels (output variables) if applicable.
3. Collect Data:
Manual Collection: For smaller datasets, manually gather data from sources such as
surveys or forms.
Automated Collection: Use scripts or tools to automate the collection process, especially
for large datasets or real-time data.
APIs and Web Scraping: Write scripts to fetch data from APIs or scrape websites.
4. Preprocess Data
Cleaning: Handle missing values, remove duplicates, and correct errors or inconsistencies
in the data.
Normalization/Standardization: Scale numerical features to ensure that they are on a
similar scale, which is important for many algorithms.
Encoding: Convert categorical variables into numerical formats (e.g., one-hot encoding,
label encoding).
Splitting: Divide the dataset into training, validation, and test sets to evaluate the model’s
performance.
5. Augment Data: Increase the size of the dataset by creating variations of existing data, such as
rotating images or introducing noise.
6. Validate and Document: Ensure that the collected data meets the quality and relevance criteria for the problem, and document the data sources, collection methods, preprocessing steps, and any assumptions or limitations.
7. Ensure Data Privacy and Compliance: Ensure that the data collection and usage comply
with privacy regulations such as GDPR, HIPAA, or CCPA. Consider ethical implications,
particularly if working with sensitive or personal data.
8. Explore and Analyze Data: Use statistical and visualization techniques to understand the distribution, patterns, and relationships within the data. Then create new features or modify existing ones to improve the performance of your machine learning model.
9. Store and Manage Data: Store the dataset in a format that is easy to access and manage, such
as databases or data lakes. Use version control systems or data management tools to keep track
of changes and updates to the dataset.
10. Update and Maintain: Regularly update the dataset as new data becomes available or as
requirements change. Monitor the data quality and relevance over time to ensure ongoing
usefulness for model training.
1. Pyplot
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is designed to be as
usable as MATLAB, with the ability to use Python and the advantage of being free and open-source.
Example:
import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
plt.plot(x, y)
plt.show()
Adding Title
The title() method in the matplotlib module is used to specify the title of the visualization, while xlabel() and ylabel() label the axes.
plt.title("Linear graph")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the data displayed
in the graph’s Y-axis. It generally appears as the box containing a small sample of each color on the
graph and a small description of what this data means.
plt.legend(["GFG"])
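Putting the plot, title, and legend together, a minimal runnable sketch (the data values are illustrative):

import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

plt.plot(x, y, label="GFG")   # the label is what the legend displays
plt.title("Linear graph")
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.legend()
plt.show()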
Seaborn
Seaborn is a well-known Python library for data visualization that offers a user-friendly interface for
producing visually appealing and informative statistical graphics. It is designed to work with Pandas
dataframes, making it easy to visualize and explore data quickly and effectively.
Installing Seaborn
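Seaborn can be installed with pip:

pip install seaborn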
Sample Datasets
Seaborn provides several built-in datasets that we can use for data visualization and statistical analysis.
These datasets are stored in pandas dataframes, making them easy to use with Seaborn's plotting
functions.
Example:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.histplot(data=tips, x="total_bill")
Plotly
Plotly is an interactive, open-source plotting library. Plotly charts are dynamic: they can be displayed in Jupyter notebooks, saved as standalone HTML files, or embedded in web applications. Plotly is built on top of the Plotly JavaScript library (plotly.js).
Installation of Plotly is very easy; if you have not installed it yet, follow these steps:
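Plotly can be installed with pip:

pip install plotly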
With the basics of Plotly in place, let's see how to create some basic charts.
import plotly.express as px
# using the iris dataset
df = px.data.iris()
Group and color the data according to the species. We will also change the line format.
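A minimal sketch of such a plot, assuming the column names of the built-in iris dataset (the exact styling is an assumption):

import plotly.express as px

# using the iris dataset
df = px.data.iris()

# line chart grouped and coloured by species, with a different dash pattern per species
fig = px.line(df, x="sepal_width", y="sepal_length", color="species", line_dash="species")
fig.show()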
Tableau
Tableau is a data analytics and visualization tool. It is the leading data analytics and visualization tool in the market (about 33% market share, followed by Power BI). Tableau comes with a very easy drag-and-drop interface which makes it easy to learn, and you can work with almost every type of data in Tableau.
Tableau Desktop is a program that allows you to execute complicated data analysis tasks and
generate dynamic, interactive representations to explain the results.
Power BI
Power BI is a cloud-based business analytics service from Microsoft that enables anyone to
visualize and analyse data, with better speed and efficiency. It is a powerful as well as a
flexible BI tool for connecting with and analysing a wide variety of data.
Power BI Desktop is a free application that can be downloaded and installed on the system. It
can be connected to multiple data sources. Usually, an analysis work begins in Power BI
Desktop where report creation takes place.
There are two ways to get started: install it as an app from the Microsoft Store and just sign in (this is the online version of the tool), or download the software locally and then install it, making sure you read all the installation instructions.
1. Bar Chart
A bar chart is a pictorial representation of data that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent
Bar Charts: Great for categorical data.
import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x="day", y="total_bill")   # bar chart of total bill per day (column choice is illustrative)
fig.show()
2. Scatter Plot
A scatter plot is a set of dotted points to represent individual pieces of data in the horizontal
and vertical axis.
Scatter Plots: Useful for displaying relationships between variables.
import plotly.express as px
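# A minimal sketch using the built-in tips dataset (the column choice is illustrative)
df = px.data.tips()
fig = px.scatter(df, x="total_bill", y="tip")   # relationship between bill size and tip
fig.show()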
3. Histogram
A histogram is basically used to represent data in the form of some groups. It is a type of bar
plot where the X-axis represents the bin ranges while the Y-axis gives information about
frequency.
Histograms: Useful for distribution of numerical data.
import plotly.express as px
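# A minimal sketch using the built-in tips dataset (the column and bin count are illustrative)
df = px.data.tips()
fig = px.histogram(df, x="total_bill", nbins=20)   # distribution of total bills
fig.show()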
4. Pie Chart
A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical
proportions. It depicts a special chart that uses “pie slices”, where each sector shows the
relative sizes of data.
Pie Charts: Useful for showing proportions.
import plotly.express as px
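# A minimal sketch using the built-in tips dataset (the column choice is illustrative)
df = px.data.tips()
fig = px.pie(df, values="total_bill", names="day")   # share of the total bill per day
fig.show()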
5. 3D Scatter Plot
A 3D scatter plot displays data points along three axes; additional variables can be encoded using colour, size, and symbol.
3D Plots: For visualizing three-dimensional data.
import plotly.express as px
# data to be plotted
df = px.data.tips()
fig = px.scatter_3d(df, x="total_bill", y="tip", z="size", color="day")   # column choice is illustrative
fig.show()
6. Box Plot
A box plot, also known as a whisker plot, displays a summary of a set of data values: the minimum, first quartile, median, third quartile, and maximum. In a box plot, a box is drawn from the first quartile to the third quartile, and a vertical line goes through the box at the median.
import plotly.express as px
df = px.data.tips()
fig = px.box(df, x="day", y="total_bill")   # distribution of total bill per day (column choice is illustrative)
fig.show()
Applying data visualization best practices ensures that your visualizations effectively communicate
insights and are easy to understand. Here are some key best practices to keep in mind:
1. Know Your Audience: Make sure the visualization is appropriate for the audience’s level
of data literacy.
2. Choose the Right Type of Chart
3. Design for Clarity: Keep design elements to a minimum to avoid distraction. Remove
unnecessary elements like excessive gridlines, background images, or decorative elements that
don’t add value.
4. Use Effective Labels and Annotations: Clearly label axes to indicate what each axis
represents. Include units if applicable.
Titles: Provide a descriptive title that explains what the visualization is showing.
Legends: Use legends to explain colors, sizes, or shapes used in the visualization.
Annotations: Highlight important data points or trends with annotations to guide the viewer’s
focus.
5. Choose Colors Wisely: Use a color scheme that is both aesthetically pleasing and
accessible. Avoid using too many colors, and consider color blindness (e.g., use colorblind-
friendly palettes like Viridis).
6. Ensure Data Integrity: Verify that data is correctly represented and calculated. Be cautious
with scale choices. For example, avoid distorting data with misleading y-axis scaling.
7. Optimize Readability
Font Size: Ensure text is legible; don’t use overly small fonts.
Contrast: Use high contrast between text and background for better readability.
Spacing: Provide adequate spacing between elements to avoid overcrowding.
8. Interactive Features
Tooltips: Use tooltips to provide additional information when hovering over data points.
Filters: Allow users to filter or zoom in on specific data subsets if interactivity is supported.
Drilldowns: Provide options to drill down into data for more detailed exploration.
Aggregation: Aggregate data meaningfully to avoid overloading the viewer with too much
detail.
Normalization: Normalize data when comparing across different scales or units to make
comparisons more meaningful.
Feedback: Gather feedback from actual users to identify areas for improvement.
Testing: Test visualizations in different scenarios and with different data subsets to ensure they
work well in various contexts.
import plotly.express as px
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Sales': [200, 220, 250, 275, 300, 310,
              290, 280, 320, 340, 330, 350]
})
# Line chart of monthly sales (px.line is an assumed choice for this example)
fig = px.line(df, x='Month', y='Sales', title='Monthly Sales')
fig.show()
import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'],
    'Satisfaction': [85, 78, 88, 72]
})
fig = px.imshow(
    [df['Satisfaction']],                 # a list wraps the single row of data
    x=df['Region'],                       # X-axis labels
    color_continuous_scale='Reds',
    title='Customer Satisfaction by Region',
    labels={'color': 'Satisfaction'}      # label for the color scale
)
fig.show()
import plotly.express as px
import pandas as pd

# Sample data
df = pd.DataFrame({
    'Advertising Spend': [1000, 2000, 3000, 4000, 5000],
    'Sales': [15000, 25000, 35000, 45000, 55000]
})
# Scatter plot of sales against advertising spend (px.scatter is an assumed choice for this example)
fig = px.scatter(df, x='Advertising Spend', y='Sales', title='Sales vs Advertising Spend')
fig.show()
Steps
Overall, the data cleaning process involves identifying and fixing data issues, choosing appropriate
cleaning methods, and implementing those methods to improve data quality
1. Identify and Fix Structural Mistakes: This step focuses on correcting errors in data organization
and formatting. It includes tasks like:
Standardizing data formats: Ensuring consistency in date formats, numerical values, and text
(e.g., using consistent separators, capitalization).
Handling missing values: Deciding how to handle missing data (e.g., imputation, removal).
Correcting typos and inconsistencies: Fixing spelling errors and ensuring data accuracy.
2. Set Data Cleaning Techniques: This involves defining the specific methods and tools used to clean
the data. It might include:
Choosing data cleaning software: Selecting appropriate tools for the task (e.g., Excel, Python
libraries, and specialized data cleaning software).
Determining cleaning algorithms: Selecting algorithms or techniques for tasks like outlier
detection, imputation, and data transformation.
3. Filter Outliers and Fix Missing Data: This step focuses on addressing specific data quality issues:
Identifying and handling outliers: Outliers are data points significantly different from others.
They might be removed, corrected, or analyzed further.
Imputing missing values: Filling in missing data with estimated values based on other data
points or statistical methods.
4. Implement Processes: This involves putting the chosen techniques into action and executing the
data cleaning plan. It might include:
Writing cleaning scripts: Developing code or scripts to automate repetitive cleaning tasks.
Applying cleaning techniques: Using the selected methods to clean the data.
Validating results: Checking the cleaned data for accuracy and completeness.
o Completeness: Completeness measures whether all necessary data is present and if the
dataset covers all required aspects of the information. Incomplete data might be missing
values, missing fields, or lack coverage in certain areas. For instance, in a customer
database, completeness would ensure that all required fields (like name, address, and
contact number) are filled in for every customer.
o Consistency: Consistency measures whether the data is uniform and free of contradictions across records and sources.
For example, if a dataset shows a person's age as 30 in one record and 35 in another record, it lacks consistency.
o Relevance: Relevance indicates how well the data meets the needs of its intended use
or purpose. Data is relevant if it is appropriate and useful for the specific context or
application it is being used for.
For example, in a sales analysis, having data on customer purchase behaviour is relevant,
but data on unrelated topics like weather patterns would not be.
o Validity: Validity measures whether the data conforms to defined formats, rules, or
constraints and accurately represents the intended information. Valid data adheres to
predefined criteria or ranges.
For example, if a dataset requires dates to be in the format MM/DD/YYYY, valid data
should follow this format and not include dates in different formats.
Data cleaning is a critical process for maintaining data quality and integrity. Here are five main
reasons why data cleaning is important:
1. Improves Accuracy and Reliability: Data cleaning helps eliminate errors, inaccuracies, and
inconsistencies from datasets. Accurate data leads to more reliable analysis and decision-
making.
2. Enhances Data Usability: Clean data is easier to work with and analyze. It ensures that
datasets are complete, consistent, and formatted correctly, making it simpler to perform
accurate analyses, generate insights, and create reports. This usability is crucial for deriving
meaningful and actionable conclusions from the data.
3. Prevents Misleading Results: Inaccurate or dirty data can lead to misleading results and
conclusions. Data cleaning helps identify and correct errors or anomalies, reducing the risk of
incorrect interpretations and ensuring that any insights or decisions based on the data are based
on sound information.
4. Boosts Efficiency: Clean data reduces the time and effort required for data processing and
analysis. By addressing issues like duplicate records, missing values, or inconsistent formatting
beforehand, analysts and data scientists can focus on extracting insights rather than spending
time correcting data problems.
5. Ensures Compliance and Integrity: For organizations that need to comply with regulations or
standards, such as GDPR or HIPAA, data cleaning is essential to ensure that data is handled
appropriately. Clean data helps maintain data integrity and supports adherence to data
governance policies and legal requirements.
1. Clustering: This method groups similar data points together, which can help identify outliers or
inconsistencies.
2. Regression: This statistical technique can be used to fill in missing values by predicting them
based on other data points.
3. Binning: This method involves grouping continuous data into smaller ranges or bins, which can
help smooth out data and reduce noise.
4. Fill the Missing Value: This refers to various techniques for handling missing data, such as
imputation or deletion.
Data normalization
Data normalization is a vital pre-processing, mapping, and scaling method that helps forecasting and
prediction models become more accurate. The current data range is transformed into a new,
standardized range using this method.
Normalization is scaling the data to be analysed to a specific range such as [0.0, 1.0] to provide better
results.
Min-max scaling and Z-Score Normalisation (Standardisation) are the two methods most frequently
used for normalization in feature scaling.
Enhanced Data Integrity: Ensures data accuracy and consistency by preventing anomalies
like update, insert, and delete anomalies. Maintains data reliability and trustworthiness.
Improved Data Efficiency: Reduces query execution time by minimizing the amount of data
accessed. Improves database performance and responsiveness.
Better Database Structure: Creates a well-organized and structured database, making it easier
to understand, manage, and modify. Facilitates data analysis and reporting.
Facilitates Data Sharing: Normalized data is easier to share and integrate with other systems.
Supports data exchange and collaboration.
1. Z-score Normalization
Z-score normalization, also known as standardization, is a common data pre-processing technique
used to transform numerical data into a standard normal distribution with a mean of 0 and a standard
deviation of 1. This process scales features to a common range, making them comparable and
improving the performance of many machine learning algorithms.
How it Works
The Z-score for a data point is calculated using the following formula:
Z = (X - μ) / σ
Where:
Z is the Z-score
X is the original value
μ is the mean of the dataset
σ is the standard deviation of the dataset
Example
Consider a dataset with two features: age (in years) and income (in thousands of dollars). Age might
range from 20 to 60, while income could vary from 30 to 200. Without normalization, the income
feature would have a much larger impact on the model due to its larger scale. By applying Z-score
normalization, both features are scaled to a similar range, allowing for a fairer comparison.
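A minimal sketch of Z-score normalization using scikit-learn's StandardScaler (the sample values are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

# each row is a sample: [age in years, income in thousands of dollars]
X = np.array([[20, 30], [35, 80], [50, 150], [60, 200]], dtype=float)

scaler = StandardScaler()             # learns the mean and standard deviation of each column
X_scaled = scaler.fit_transform(X)    # applies Z = (X - mean) / std column-wise
print(X_scaled)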
2. Min-Max normalization:
This method of normalizing data transforms the original data linearly. The data's minimum and maximum values are obtained, and each value v of an attribute A is then changed using the following formula:

Formula: v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Where min_A and max_A are the minimum and maximum values of attribute A, and [new_min_A, new_max_A] is the target range (commonly [0, 1]).

3. Decimal Scaling
Decimal scaling normalizes data by moving the decimal point of the values.

Formula: v' = v / 10^j

Where j is the smallest integer such that the largest absolute normalized value is less than 1.

For example, suppose the values of feature F range from 825 to 850 and j is three. The greatest value of feature F is 850, so to normalize by decimal scaling we divide every value by 1,000 (10^3). As a result, 850 is normalized to 0.850 and 825 to 0.825.

The decimal point of the data is shifted according to the absolute value of the maximum in this procedure. As a result, the normalized values will always lie between -1 and 1.
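A minimal sketch of min-max scaling using scikit-learn's MinMaxScaler (the sample values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[20, 30], [35, 80], [50, 150], [60, 200]], dtype=float)

scaler = MinMaxScaler(feature_range=(0.0, 1.0))   # target range [0, 1]
X_scaled = scaler.fit_transform(X)                # (v - min) / (max - min), column-wise
print(X_scaled)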
Data transformation
Data transformation is the process of converting, cleaning, and structuring data into a usable format. It is used when data needs to be converted to match the format of the destination system. Data transformation turns raw data into useful information.
Data Transformation Process
For example, suppose we have a data set referring to measurements of different plots, i.e., we may have the height and width of each plot. Here, we can construct a new attribute 'area' from the attributes 'height' and 'width'.
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary format.
4. Data Discretization
Data Discretization is a process of converting continuous data into a set of data intervals.
For example, the values for the age attribute can be replaced by the interval labels such as (0-10, 11-
20…) or (kid, youth, adult, senior).
5. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a higher
conceptual level into a categorical value (young, old).
6. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0,
1.0].
Different methods to normalize the data
Min-max normalization
Z-score normalization
Decimal Scaling
Ways of Data Transformation
Scripting: Data transformation through scripting involves using Python or SQL to write code that extracts and transforms data.
On-Premises ETL Tools: ETL tools remove the need to hand-script the data transformation by automating the process.
Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in the cloud.
Data transformation involves converting data from its original format or structure into a format that is
more suitable for analysis or modeling. Here are some common types of data transformation
techniques:
1. Normalization: is scaling the data to be analyzed to a specific range such as [0.0, 1.0] to
provide better results. This includes:
Min-Max Scaling: Transforms data to fit within a specific range, typically [0, 1].
Z-Score Standardization: Converts data to have a mean of 0 and a standard deviation of 1.
Robust Scaling: Uses the median and interquartile range to scale data, which is robust to
outliers.
2. Encoding Categorical Variables
One-Hot Encoding: Converts categorical variables into binary vectors, creating a new column for each category.
Label Encoding: Converts categorical values into numerical labels, assigning each category a unique integer.
Ordinal Encoding: Similar to label encoding, but preserves the order of categories if they are ordinal (i.e., have a meaningful sequence).
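A minimal sketch of one-hot and label encoding with pandas and scikit-learn (the colour column is an illustrative example):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'green']})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df, columns=['colour'])
print(one_hot)

# Label encoding: each category mapped to a unique integer
le = LabelEncoder()
df['colour_label'] = le.fit_transform(df['colour'])
print(df)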
3. Feature Engineering
Feature Extraction: Creating new features from existing data, such as extracting the year from
a date or deriving interaction terms between features.
Feature Selection: Choosing a subset of relevant features to improve model performance and
reduce dimensionality.
Polynomial Features: Generating new features by raising existing features to a power or
creating interaction terms.
4. Aggregation
Summarization: Combining data at a higher level, such as aggregating sales data by month or
by region.
Rolling Aggregates: Computing statistics (e.g., mean, sum) over a moving window, useful for
time-series data.
5. Discretization/Binning
Equal-Width Binning: Dividing the range of continuous data into equal-sized intervals.
Equal-Frequency Binning: Dividing data so that each bin contains approximately the same
number of observations.
Custom Binning: Defining bins based on domain knowledge or specific criteria relevant to the
problem.
6. Power Transformation
Square Root Transformation: Applying the square root function to reduce skewness.
Box-Cox Transformation: A family of power transformations that can stabilize variance and
make the data more normally distributed.
7. Data Cleaning
8. Data Reduction
Extracting Components: Extracting parts of date/time data, such as year, month, day, hour,
etc.
Time-based Features: Creating features like day of the week, seasonality, or time since an
event.
Unit Vector Scaling: Scaling data to have a norm of 1, often used in text data analysis.
Regression Algorithms are used to predict a continuous numerical value, such as a
house's price or a day's temperature. Different types of regression algorithms exist,
such as (Linear Regression, Polynomial Regression, and Ridge Regression,)
Classification Algorithms are used to predict a categorical or discrete value, such as whether an email is spam. Some examples of classification algorithms include decision trees, support vector machines (SVM), and k-nearest neighbours (KNN).
Supervised learning can be used to perform classification or regression tasks. Standard supervised learning algorithms include:
Linear Regression
Linear regression is used to identify the linear relationship between a dependent variable and one or
more independent variables.
It is commonly used to predict the value of a continuous variable from the input given by multiple
features or to classify observations into categories by considering their correlation with reference
variables.
Logistic Regression
Logistic regression is a type of predictive modelling algorithm used for classification tasks. It is used
to estimate the probability of an event occurring based on the values of one or more predictor
variables.
Decision Trees
Decision Trees are popular supervised learning algorithms for classification and regression tasks
using a "tree" structure to represent decisions and their associated outcomes.
Each node of the tree represents an attribute, while each branch represents a decision.
Random Forest
Random forests are essentially multiple decision trees combined to form one powerful "forest" model
with better predictive accuracy than individual trees.
Using a random forest model also reduces overfitting, because many trees are generated and their predictions are combined, so the model is less likely to fit the training data too closely. In other words, it handles bias and variance well.
SVMs are often used for text classification or image recognition because they can accurately predict
categorical variables from large datasets.
Moreover, SVMs have several advantages over other supervised learning algorithms. For instance, they are highly tolerant of noise and can cope with irrelevant features.
k-Nearest Neighbours (KNN)
KNN aims to classify and predict based on a certain number of 'nearest neighbours' having similar features or properties, either in the same class or otherwise.
Naive Bayes
Naive Bayes classification is a powerful supervised machine learning technique. It uses Bayes' Theorem to calculate the probability that a given data point belongs to one class or another based on its input features.
This makes it perfect for applications such as
Text classification,
Recommendation systems,
Sentiment analysis,
Image recognition
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality-reduction technique in machine learning that transforms data into a new coordinate system. It projects the data onto a set of orthogonal axes, known as principal components, which capture the most variance in the data. The ultimate goal is to reduce the number of dimensions while preserving as much of the variability in the data as possible.
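A minimal PCA sketch with scikit-learn (the data and the number of components kept are illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

pca = PCA(n_components=1)                 # keep only the component with the most variance
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)      # share of the variance captured by that component
print(X_reduced)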
Hierarchical Clustering
Hierarchical clustering, also known as Hierarchical cluster analysis. It is an unsupervised clustering
algorithm. It includes building clusters that have a preliminary order from top to bottom.
For example, all files and folders on the hard disk are in a hierarchy.
The algorithm groups related objects into clusters, so that we end up with a set of clusters or groups in which each cluster is different from the others and the data points within each cluster are broadly similar to each other.
Anomaly Detection
Anomaly detection, also known as outlier detection, is a process in machine learning and statistics
used to identify data points that significantly differ from the majority of the data. These outliers or
anomalies can represent critical or rare events, errors, or novel situations that are different from the
norm.
Anomaly detection: In anomaly detection, the goal is to detect patterns or observations that are
unusual or different from the norm
Reinforcement Learning
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions
by interacting with an environment. The agent takes actions to maximize cumulative rewards over
time. Unlike supervised learning, where the model learns from labelled data, RL is based on trial and
error.
Example:
We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward.
Types of Reinforcement:
There are two types of Reinforcement:
1. Positive: Positive Reinforcement is defined as when an event, occurs due to a particular
behaviour, increases the strength and the frequency of the behaviour. In other words, it
has a positive effect on behaviour.
2. Negative: Negative Reinforcement is defined as strengthening of behaviour because a
negative condition is stopped or avoided.
Elements of Reinforcement Learning
Reinforcement learning elements are as follows:
Policy: A policy defines the learning agent's way of behaving at a given time.
Reward function: Reward function is used to define a goal in a reinforcement learning problem. A
reward function is a function that provides a numerical score based on the state of the environment
Value function: Value functions specify what is good in the long run. The value of a state is the total
amount of reward an agent can expect to accumulate over the future, starting from that state.
Model of the environment: Models are used for planning.
Q-Learning
Q-Learning is a model-free reinforcement learning algorithm used to find the optimal action-selection
policy for an agent. It works by learning a value function, called the Q-function, which estimates the
expected cumulative reward of taking an action in a given state and following the optimal policy
thereafter.
As the agent explores the environment and receives different rewards by executing different actions, the Q-values are updated according to the following rule:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ]

Where s is the current state, a the action taken, r the reward received, s' the next state, α the learning rate, and γ the discount factor.
Example
Our agent is a rat that has to cross a maze and reach the end point (its house). There are mines in the maze, and the rat can only move one tile (from one square to another) at a time. If the rat steps onto a mine, it dies. The rat wants to reach its home in the shortest time possible.
Artificial Neural Networks (ANNs) are a broad category of neural network models that are designed
to simulate the way the human brain processes information. They consist of interconnected nodes
(neurons) organized in layers.
Artificial Neural Networks (ANNs) Components:
These networks usually consist of an input layer, one to two hidden layers, and an output layer.
Input Layer: Receives the input features.
Hidden Layers: Intermediate layers where computations are performed. Each neuron in a
hidden layer receives inputs from the previous layer and sends its output to the next layer.
Output Layer: Produces the final output.
Deep Neural Networks (DNNs) are a subset of ANNs with many hidden layers. The term "deep"
refers to the depth of the network, meaning the number of hidden layers between the input and output
layers.
Deep Neural Networks (DNNs) Components:
Multiple Hidden Layers: DNNs have multiple hidden layers, which allow them to learn
complex representations and features from the data.
Hierarchical Learning: Layers are stacked in a way that each layer learns increasingly
abstract features of the data.
C. Competitive Network
A Competitive Network is a type of neural network where neurons compete with each other to respond
to a particular input. Only the "winning" neuron (or neurons) is activated, and it provides the output.
Structure:
Input Layer: Receives the input features.
Output Layer: Consists of neurons that compete to respond to the input. The neuron with the highest
activation typically "wins" and is activated.
Characteristics:
Winner-Takes-All: Only the neuron with the highest activation value for a given input is selected.
Unsupervised Learning: Often used in clustering and feature extraction tasks, as it can identify
patterns and similarities in the input data.
D. Recurrent Network
A Recurrent Network is a type of neural network where connections between neurons form directed
cycles. This allows the network to maintain a state and process sequences of data.
Structure:
Input Layer: Receives sequential data.
Hidden Layers: Includes recurrent connections where outputs are fed back into the network. This
allows the network to have memory of previous inputs.
Output Layer: Produces output based on the current state and past inputs.
Characteristics:
Temporal Dynamics: Capable of modelling temporal sequences and dependencies in data.
Stateful: Maintains a state or memory of previous inputs, which is crucial for tasks involving
sequential data.
Neural Networks (Artificial Neural Networks)
Neural Networks (Artificial Neural Networks) Refers to a broad category of networks inspired by the
human brain. They consist of interconnected layers of nodes (neurons) including input, hidden, and
output layers.
They can be feed forward, recurrent, or convolutional, among other types. They are used for various
complex tasks such as image recognition, language translation, and more.
Example Use: Any machine learning problem where patterns need to be learned from data.
Convolutional Neural Networks (CNNs)
The structure of a convolutional neural network is focused on processing grid-like data such as images and videos, using convolutional layers whose filters learn patterns and spatial hierarchies.
Key Components: Utilizing convolutional layers, pooling layers and fully connected layers.
Applications: Used for classification of images, object detection, medical imaging analyses,
autonomous driving and visualization in augmented reality.
Identify problem
Regression: When the goal is to predict a continuous outcome. For example,
predicting house prices based on features like size and location.
Clustering: When the goal is to group similar data points together without
predefined labels. For example, customer segmentation based on purchasing
behaviour.
analyse type of data in a given dataset
Independent variable: Variables used as input features for prediction. They can be
continuous, categorical, or a mix.
Dependent variable: The target variable you are trying to predict or classify. In
regression, it’s continuous; in classification, it’s categorical.
Resource analysis
computational power
Some algorithms are computationally intensive (e.g., deep learning models) and may
require powerful hardware like GPUs. Simple models (e.g., Linear Regression, k-NN)
are less resource-intensive and can be run on standard CPUs.
memory limitations
Large datasets and complex models require more memory. Algorithms like k-NN or
decision trees can be memory-intensive. Techniques like dimensionality reduction
(PCA) or using subsampling methods can help manage memory usage.
choose machine learning algorithm to be used
Based on the problem type, data characteristics, and resource constraints, select an algorithm
that best fits your needs. Here are some guidelines:
For Regression:
Simple: Linear Regression (if the relationship is linear).
Complex: Random Forest Regression, Gradient Boosting, or Neural Networks (if the
relationship is non-linear and complex).
For Classification:
Simple: Logistic Regression, Naive Bayes (if the data is linearly separable).
Complex: Random Forest, Support Vector Machines, or Deep Learning models (for
non-linear and complex decision boundaries).
For Clustering:
Simple: k-Means (if the clusters are spherical and well-separated).
Complex: DBSCAN or Hierarchical Clustering (if the clusters have different shapes
or densities).
● Train machine learning model
Load a dataset
To load a dataset in Python, you can use various libraries depending on the format of your data and
your specific needs. Here’s a guide on how to load a dataset using different libraries:
Using pandas
Pandas is a powerful library for data manipulation and analysis. It supports various file formats,
including CSV, Excel, and SQL databases.
CSV File:
import pandas as pd
df = pd.read_csv('path/to/your/file.csv')

Excel File:
df = pd.read_excel('path/to/your/file.xlsx')   # the path is a placeholder

SQL Database:
import sqlite3
conn = sqlite3.connect('path/to/your/database.db')
df = pd.read_sql('SELECT * FROM your_table', conn)   # 'your_table' is a placeholder table name
Using Numpy
Numpy is a fundamental library for numerical computing in Python. It’s commonly used for loading
datasets with numerical data, especially from text files.
Text File:
import numpy as np
# Load data from a text file into a NumPy array
data = np.loadtxt('path/to/your/file.txt', delimiter=',')
CSV File:
import numpy as np
# Load a CSV file into a NumPy array
data = np.genfromtxt('path/to/your/file.csv', delimiter=',', skip_header=1)
Using Scikit-Learn
Scikit-Learn includes some built-in datasets and functions for loading them. It is particularly useful for
quickly accessing standard datasets for machine learning.
Built-in Datasets:
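A minimal sketch, assuming the classic iris dataset:

from sklearn.datasets import load_iris

iris = load_iris()            # a Bunch object containing the data, targets, and metadata
X, y = iris.data, iris.target
print(X.shape, y.shape)       # (150, 4) (150,)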
Using Seaborn
Seaborn is a statistical data visualization library based on Matplotlib, and it includes some built-in
datasets. It also works well with Pandas DataFrames.
Built-in Datasets:
import seaborn as sns
# Load a built-in Seaborn dataset
df = sns.load_dataset('iris')
Using Requests and io
When loading datasets from a URL, requests can be used to fetch the data, and io can be used to
handle the data in-memory.
import pandas as pd
import requests
from io import BytesIO
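# A minimal sketch of fetching a CSV file from a URL into a DataFrame (the URL is a placeholder)
url = 'https://example.com/path/to/your/file.csv'
response = requests.get(url)
df = pd.read_csv(BytesIO(response.content))
print(df.head())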
Syntax:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True,
stratify=None)
Parameters:
*arrays: inputs such as lists, arrays, data frames, or matrices
test_size: this is a float value between 0.0 and 1.0 that represents the proportion of the dataset to include in the test split. Its default value is None.
train_size: this is a float value between 0.0 and 1.0 that represents the proportion of the dataset to include in the train split. Its default value is None.
random_state: this parameter controls the shuffling applied to the data before the split; it acts as a seed.
shuffle: this parameter determines whether to shuffle the data before splitting. Its default value is True.
stratify: this parameter is used to split the data in a stratified fashion.
Example:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
# Let's assume we have features X and target y
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 1, 0, 1, 0, 1])
# Split the dataset into training and test sets
# test_size=0.2 means 20% of the data will be used as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the results
print("Training features:\n", X_train)
print("Training labels:\n", y_train)
print("Test features:\n", X_test)
print("Test labels:\n", y_test)
Note: random_state is set to 42, ensuring that the random shuffling and splitting are identical each
time the code is run.
Initialize model
Model initialization refers to the process of creating an instance of a machine learning model. This is where you define the type of model you want to use and set any hyperparameters that control the behaviour of the model.
Purpose: Defines the model’s architecture and behaviour. Proper initialization can significantly
impact model performance. Allows you to set hyper parameters that affect the learning process,
regularization, and model complexity.
Some models come with hyper parameters that you can set when initializing the model. For example:
Linear Regression: Regularization parameters (e.g., alpha for Ridge Regression).
Decision Trees: Maximum depth of the tree, minimum samples required to split a node.
Logistic Regression: Solver type, regularization strength.
1. Linear Regression
Linear Regression is a model used for predicting continuous values.
from sklearn.linear_model import LinearRegression
# Initialize the Linear Regression model
linear_reg_model = LinearRegression()
2. Decision Tree
Decision Tree is used for both classification and regression tasks. Here, we’ll initialize it for
classification.
from sklearn.tree import DecisionTreeClassifier
# Initialize the Decision Tree Classifier
decision_tree_model = DecisionTreeClassifier(max_depth=3, criterion='gini')
max_depth=3 limits the maximum depth of the tree to prevent overfitting.
criterion='gini' specifies the function to measure the quality of a split. You can also use
'entropy'.
3. Logistic Regression
Logistic Regression is used for binary or multi-class classification problems.
from sklearn.linear_model import LogisticRegression
# Initialize the Logistic Regression model
logistic_reg_model = LogisticRegression(max_iter=200, C=1.0, solver='liblinear')
max_iter=200 specifies the maximum number of iterations for the solver.
C=1.0 is the regularization strength (smaller values mean stronger regularization).
solver='liblinear' is the algorithm to use for optimization.
4. Random Forest
Random Forest is an ensemble method used for both classification and regression. Here, we’ll
initialize it for classification.
from sklearn.ensemble import RandomForestClassifier
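# A minimal initialization sketch consistent with the earlier examples
# (the n_estimators and max_depth values are illustrative)
random_forest_model = RandomForestClassifier(n_estimators=100, max_depth=5)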
prediction of result
In machine learning, once you have trained a model on your training data, you can use it to make
predictions on both test data and new (unseen) data.
on test data
Prediction on test data evaluates how well your model performs on a separate set of data that it hasn't seen during training. This helps in assessing the model's performance and generalization ability.
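Continuing the earlier examples (the decision tree initialized above and the train/test split shown before), a minimal sketch:

# Train the model on the training set, then predict on the held-out test set
decision_tree_model.fit(X_train, y_train)
y_pred = decision_tree_model.predict(X_test)
print(y_pred)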
For classification tasks, visualizing predictions typically involves comparing predicted labels to actual
labels and understanding the decision boundaries of the model.
a. Confusion Matrix
A confusion matrix shows the number of correct and incorrect predictions for each class. Confusion
matrix is a matrix that summarizes the performance of a machine learning model on a set of test data.
It is a means of displaying the number of accurate and inaccurate instances based on the model’s
predictions. It is often used to measure the performance of classification models, which aim to predict
a categorical label for each input instance.
True Positive (TP): The model correctly predicted a positive outcome (the actual outcome was positive).
True Negative (TN): The model correctly predicted a negative outcome (the actual outcome was negative).
False Positive (FP): The model predicted a positive outcome, but the actual outcome was negative.
False Negative (FN): The model predicted a negative outcome, but the actual outcome was positive.
2. Precision
Precision is a measure of how accurate a model’s positive predictions are. It is defined as the ratio
of true positive predictions to the total number of positive predictions made by the model.
3. Recall
Recall measures the effectiveness of a classification model in identifying all relevant instances
from a dataset. It is the ratio of the number of true positive (TP) instances to the sum of true
positive and false negative (FN) instances.
4. F1-Score
F1-score is used to evaluate the overall performance of a classification model. It is the harmonic
mean of precision and recall,
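A minimal sketch of computing these classification metrics with scikit-learn, assuming y_test and y_pred hold the true and predicted binary labels:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

print(confusion_matrix(y_test, y_pred))                      # for binary labels: [[TN, FP], [FN, TP]]
print(f"Precision: {precision_score(y_test, y_pred):.2f}")   # TP / (TP + FP)
print(f"Recall: {recall_score(y_test, y_pred):.2f}")         # TP / (TP + FN)
print(f"F1-score: {f1_score(y_test, y_pred):.2f}")           # harmonic mean of precision and recall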
Mean Absolute Error (MAE): Average absolute difference between predicted and
actual values.
Formula: MAE = Σ(|actual - predicted|) / n
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
Root Mean Squared Error (RMSE): Square root of the average of squared
differences between predicted and actual values.
Formula: RMSE = sqrt(Σ(actual - predicted)^2 / n)
import numpy as np
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse:.2f}")
R Squared score: Proportion of the variance in the dependent variable explained
by the independent variables.
Formula: R^2 = 1 - (SS_res / SS_tot)
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")
Adjusted R Squared score: Modified R-squared that accounts for the number of
predictors.
Model interpretation
● Tuning Hyperparameters
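A minimal sketch of hyperparameter tuning using grid search with cross-validation (the dataset, model, and parameter grid are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate hyperparameter values to try
param_grid = {'max_depth': [2, 3, 5, None], 'criterion': ['gini', 'entropy']}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)   # evaluates every combination with 5-fold cross-validation

print(grid_search.best_params_)     # best hyperparameter combination found
print(grid_search.best_score_)      # its mean cross-validated accuracy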