Machine Learning
For Absolute Beginners:
A Plain English Introduction
Third Edition
Oliver Theobald
Copyright © 2021 by Oliver Theobald
All rights reserved. No part of this publication may be reproduced,
distributed, or transmitted in any form or by any means, including
photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the publisher,
except in the case of brief quotations embodied in critical reviews
and certain other non-commercial uses permitted by copyright law.
Edited by Jeremy Pedersen and Red to Black Editing’s Christopher
Dino.
Skillshare
www.skillshare.com/user/machinelearning_beginners
For introductory video courses on machine learning and video lessons
from other instructors.
Instagram
machinelearning_beginners
For mini-lessons, book quotes, and more!
PREFACE
Machines have come a long way since the onset of the Industrial
Revolution. They continue to fill factory floors and manufacturing plants,
but their capabilities extend beyond manual activities to cognitive tasks
that, until recently, only humans were capable of performing. Judging song
contests, driving automobiles, and detecting fraudulent transactions are
three examples of the complex tasks machines are now capable of
simulating.
But these remarkable feats trigger fear among some observers. Part of their
fear nestles on the neck of survivalist insecurities and provokes the deep-
seated question of what if? What if intelligent machines turn on us in a
struggle of the fittest? What if intelligent machines produce offspring with
capabilities that humans never intended to impart to machines? What if the
legend of the singularity is true?
The other notable fear is the threat to job security, and if you’re a taxi driver
or an accountant, there’s a valid reason to be worried. According to joint
research from the Office for National Statistics and Deloitte UK published
by the BBC in 2015, job professions including bar worker (77%), waiter
(90%), chartered accountant (95%), receptionist (96%), and taxi driver
(57%) have a high chance of being automated by the year 2035. [1]
Nevertheless, research on planned job automation and crystal ball gazing
concerning the future evolution of machines and artificial intelligence (AI)
should be read with a pinch of skepticism. In Superintelligence: Paths, Dangers, Strategies, author Nick Bostrom discusses the continuous redeployment of AI goals and how “two decades is a sweet spot… near enough to be attention-grabbing and relevant, yet far enough to make it possible that a string of breakthroughs…might by then have occurred.” [2][3]
Figure 1: Historical mentions of “machine learning” in published books. Source: Google Ngram
Viewer, 2017
Although it wasn’t the first published paper to use the term “machine
learning” per se, Arthur Samuel is regarded as the first person to coin and
define machine learning as the concept and specialized field we know
today. Samuel’s landmark journal submission, Some Studies in Machine
Learning Using the Game of Checkers, introduced machine learning as a
subfield of computer science that gives computers the ability to learn
without being explicitly programmed.
While not directly treated in Arthur Samuel’s initial definition, a key
characteristic of machine learning is the concept of self-learning. This
refers to the application of statistical modeling to detect patterns and
improve performance based on data and empirical information; all without
direct programming commands. This is what Arthur Samuel described as
the ability to learn without being explicitly programmed. Samuel didn't mean that machines can formulate decisions without any upfront programming.
On the contrary, machine learning is heavily dependent on code input.
Instead, he observed that machines can perform a set task using input data rather than relying on a direct input command.
Figure 3: The lineage of machine learning represented by a row of Russian matryoshka dolls
Emerging from computer science and data science as the third matryoshka
doll from the left in Figure 3 is artificial intelligence. Artificial intelligence,
or AI, encompasses the ability of machines to perform intelligent and
cognitive tasks. Comparable to how the Industrial Revolution gave birth to
an era of machines simulating physical tasks, AI is driving the development
of machines capable of simulating cognitive abilities.
While still broad, AI is dramatically more honed than computer science and data science, and it spans numerous subfields that are popular and newsworthy
today. These subfields include search and planning, reasoning and
knowledge representation, perception, natural language processing (NLP),
and of course, machine learning.
Figure 4: Visual representation of the relationship between data-related fields
MACHINE LEARNING
CATEGORIES
Machine learning incorporates several hundred statistics-based algorithms, and choosing the right algorithm(s) for the job is a constant challenge of
working in this field. Before examining specific algorithms, it’s important
to consolidate one’s understanding of the three overarching categories of
machine learning and their treatment of input and output variables.
Supervised Learning
Supervised learning imitates our own ability to extract patterns from known
examples and use that extracted insight to engineer a repeatable outcome.
This is how the car company Toyota designed their first car prototype.
Rather than speculate or create a unique process for manufacturing cars,
Toyota created its first vehicle prototype after taking apart a Chevrolet car
in the corner of their family-run loom business. By observing the finished
car (output) and then pulling apart its individual components (input),
Toyota’s engineers unlocked the design process kept secret by Chevrolet in
America.
This process of understanding a known input-output combination is
replicated in machine learning using supervised learning. The model
analyzes and deciphers the relationship between input and output data to
learn the underlying patterns. Input data is referred to as the independent
variable (uppercase “X”), while the output data is called the dependent
variable (lowercase “y”). An example of a dependent variable (y) might be
the coordinates for a rectangle around a person in a digital photo (face
recognition), the price of a house, or the class of an item (i.e. sports car,
family car, sedan). Their independent variables—which supposedly impact
the dependent variable—could be the pixel colors, the size and location of
the house, and the specifications of the car respectively. After analyzing a
sufficient number of examples, the machine creates a model: an algorithmic
equation for producing an output based on patterns from previous input-
output examples.
Using the model, the machine can then predict an output based exclusively
on the input data. The market price of your used Lexus, for example, can be
estimated using the labeled examples of other cars recently sold on a used
car website.
With access to the selling price of other similar cars, the supervised learning
model can work backward to determine the relationship between a car’s
value (output) and its characteristics (input). The input features of your own
car can then be inputted into the model to generate a price prediction.
Figure 5: Inputs (X) are fed to the model to generate a new prediction (y)
While input data with an unknown output can be fed to the model to push
out a prediction, unlabeled data cannot be used to build the model. When
building a supervised learning model, each item (i.e. car, product, customer)
must have labeled input and output values—known in data science as a
“labeled dataset.”
Examples of common algorithms used in supervised learning include regression analysis (i.e. linear regression, logistic regression, non-linear regression), decision trees, k-nearest neighbors, neural networks, and support vector machines, each of which is examined in later chapters.
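To make this concrete, below is a minimal sketch of supervised learning using Scikit-learn's linear regression (introduced later in this book). The car features and prices are made-up values for illustration only, not taken from a real used car website.

from sklearn.linear_model import LinearRegression

# Independent variables (X): [car age in years, mileage in thousands of km]
X = [[2, 30], [5, 80], [8, 120], [3, 45], [10, 160]]

# Dependent variable (y): selling price in dollars
y = [32000, 21000, 14000, 28000, 9000]

model = LinearRegression()
model.fit(X, y)  # learn the relationship between input and output

# Predict the price of a 4-year-old car with 60,000 km on the odometer
print(model.predict([[4, 60]]))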
Unsupervised Learning
In the case of unsupervised learning, the output variables are unlabeled, and
combinations of input and output variables aren’t known. Unsupervised
learning instead focuses on analyzing relationships between input variables
and uncovering hidden patterns that can be extracted to create new labels
regarding possible outputs.
If you group data points based on the purchasing behavior of SME (Small
and Medium-sized Enterprises) and large enterprise customers, for example,
you’re likely to see two clusters of data points emerge. This is because
SMEs and large enterprises tend to have different procurement needs. When
it comes to purchasing cloud computing infrastructure, for example,
essential cloud hosting products and a Content Delivery Network (CDN)
should prove sufficient for most SME customers. Large enterprise
customers, though, are likely to purchase a broader array of cloud products
and complete solutions that include advanced security and networking
products like WAF (Web Application Firewall), a dedicated private
connection, and VPC (Virtual Private Cloud). By analyzing customer
purchasing habits, unsupervised learning is capable of identifying these two
groups of customers without specific labels that classify a given company
as small/medium or large.
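As a rough sketch of how this grouping might be performed in code, the snippet below runs k-means clustering (covered later in this book) on made-up purchasing figures; the spend and product counts are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# [annual cloud spend in $'000, number of cloud products purchased]
X = np.array([[12, 2], [15, 3], [14, 2], [11, 2],     # SME-like customers
              [220, 14], [180, 12], [260, 16]])       # enterprise-like customers

# Ask for two clusters without providing any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster membership discovered from the inputs alone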
The advantage of unsupervised learning is that it enables you to discover
patterns in the data that you were unaware of—such as the presence of two
dominant customer types—and provides a springboard for conducting
further analysis once new groups are identified. Unsupervised learning is
especially compelling in the domain of fraud detection—where the most
dangerous attacks are those yet to be classified. One interesting example is
DataVisor, a company that has built its business model on unsupervised
learning. Founded in 2013 in California, DataVisor protects customers from
fraudulent online activities, including spam, fake reviews, fake app installs,
and fraudulent transactions. Whereas traditional fraud protection services
draw on supervised learning models and rule engines, DataVisor uses
unsupervised learning to detect unclassified categories of attacks.
As DataVisor explains on their website, "to detect attacks, existing solutions
rely on human experience to create rules or labeled training data to tune
models. This means they are unable to detect new attacks that haven’t
already been identified by humans or labeled in training data." [10] Put
another way, traditional solutions analyze chains of activity for a specific
type of attack and then create rules to predict and detect repeat attacks. In
this case, the dependent variable (output) is the event of an attack, and the
independent variables (input) are the common predictor variables of an
attack. Examples of independent variables could be:
a) A sudden large order from an unknown user. For example, established
customers might generally spend less than $100 per order, but a new user
spends $8,000 on one order immediately upon registering an account.
b) A sudden surge of user ratings. For example, as with most technology books
sold on Amazon.com, the first edition of this book rarely receives more
than one reader review per day. In general, approximately 1 in 200 Amazon
readers leave a review and most books go weeks or months without a
review. However, I notice other authors in this category (data science)
attract 50-100 reviews in a single day! (Unsurprisingly, I also see Amazon
remove these suspicious reviews weeks or months later.)
c) Identical or similar user reviews from different users. Following the
same Amazon analogy, I sometimes see positive reader reviews of my book
appear with other books (even with reference to my name as the author still
included in the review!). Again, Amazon eventually removes these fake
reviews and suspends these accounts for breaking their terms of service.
d) Suspicious shipping address. For example, for small businesses that routinely
ship products to local customers, an order from a distant location (where
their products aren’t advertised) can, in rare cases, be an indicator of
fraudulent or malicious activity.
Standalone activities such as a sudden large order or a remote shipping
address might not provide sufficient information to detect sophisticated
cybercrime and are probably more likely to lead to a series of false-positive
results. But a model that monitors combinations of independent variables, such as a large purchasing order from the other side of the globe or a landslide number of book reviews that reuse existing user content, generally leads to a better prediction.
In supervised learning, the model deconstructs and classifies what these common variables are and designs a detection system to identify and prevent
repeat offenses. Sophisticated cybercriminals, though, learn to evade these
simple classification-based rule engines by modifying their tactics. Leading
up to an attack, for example, the attackers often register and operate single
or multiple accounts and incubate these accounts with activities that mimic
legitimate users. They then utilize their established account history to evade
detection systems, which closely monitor new users. As a result, solutions that use supervised learning often fail to detect sleeper cells until the damage has been inflicted, especially in the case of new types of attacks.
DataVisor and other anti-fraud solution providers instead leverage
unsupervised learning techniques to address these limitations. They analyze
patterns across hundreds of millions of accounts and identify suspicious
connections between users (input)—without knowing the actual category of
future attacks (output). By grouping and identifying malicious actors whose
actions deviate from standard user behavior, companies can take actions to
prevent new types of attacks (whose outcomes are still unknown and
unlabeled).
Examples of suspicious actions may include the four cases listed earlier or
new instances of abnormal behavior such as a pool of newly registered
users with the same profile picture. By identifying these subtle correlations
across users, fraud detection companies like DataVisor can locate sleeper
cells in their incubation stage. A swarm of fake Facebook accounts, for
example, might be linked as friends and like the same pages but aren’t
linked with genuine users. As this type of fraudulent behavior relies on
fabricated interconnections between accounts, unsupervised learning
thereby helps to uncover collaborators and expose criminal rings.
The drawback, though, of using unsupervised learning is that because the
dataset is unlabeled, there aren’t any known output observations to check
and validate the model, and predictions are therefore more subjective than
those coming from supervised learning.
We will cover unsupervised learning later in this book, specifically k-means clustering. Other examples of unsupervised learning algorithms include social network analysis and descending dimension algorithms.
Semi-supervised Learning
A hybrid form of unsupervised and supervised learning is also available in
the form of semi-supervised learning, which is used for datasets that contain
a mix of labeled and unlabeled cases. With “the more data the better” as a core motivator, the goal of semi-supervised learning is to leverage unlabeled cases to improve the reliability of the prediction model. One technique is to build the initial model using the labeled cases (supervised learning) and then use the same model to label the remaining (unlabeled) cases in the dataset. The model can then be retrained using a larger dataset (with fewer or no unlabeled cases). Alternatively, the model can be iteratively retrained, adding newly labeled cases to the training data whenever they meet a set confidence threshold. There is, however, no guarantee that a semi-supervised model
will outperform a model trained with less data (based exclusively on the
original labeled cases).
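As a minimal sketch of the self-training technique just described, Scikit-learn's SelfTrainingClassifier wraps a supervised model and labels the unlabeled cases that pass a confidence threshold; the data values below are made up.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.10], [0.20], [0.80], [0.90], [0.15], [0.85]])
# Unlabeled cases are marked with -1, per Scikit-learn's convention
y = np.array([0, 0, 1, 1, -1, -1])

# Retrain iteratively, adding unlabeled cases once predictions exceed 75% confidence
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.75)
model.fit(X, y)
print(model.predict([[0.05], [0.95]]))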
Reinforcement Learning
Reinforcement learning is the third and most advanced category of machine
learning. Unlike supervised and unsupervised learning, reinforcement
learning builds its prediction model by gaining feedback from random trial
and error and leveraging insight from previous iterations.
Reinforcement learning aims to achieve a specific goal (output) by
randomly trialing a vast number of possible input combinations and grading
their performance.
Reinforcement learning can be complicated to understand and is probably
best explained using a video game analogy. As a player progresses through
the virtual space of a game, they learn the value of various actions under
different conditions and grow more familiar with the field of play. Those
learned values then inform and influence the player’s subsequent behavior
and their performance gradually improves based on learning and
experience.
Reinforcement learning is similar, where algorithms are set to train the
model based on continuous learning. A standard reinforcement learning
model has measurable performance criteria where outputs are graded. In the
case of self-driving vehicles, avoiding a crash earns a positive score, and in
the case of chess, avoiding defeat likewise receives a positive assessment.
Q-learning
A specific algorithmic example of reinforcement learning is Q-learning. In
Q-learning, you start with a set environment of states, represented as “S.” In
the game Pac-Man, states could be the challenges, obstacles or pathways
that exist in the video game. There may exist a wall to the left, a ghost to
the right, and a power pill above—each representing different states. The
set of possible actions to respond to these states is referred to as “A.” In
Pac-Man, actions are limited to left, right, up, and down movements, as
well as multiple combinations thereof. The third important symbol is “Q,”
which is the model’s starting value and has an initial value of “0.”
As Pac-Man explores the space inside the game, two main things happen:
1) Q drops as negative things occur after a given state/action.
2) Q increases as positive things occur after a given state/action.
In Q-learning, the machine learns to match the action for a given state that
generates or preserves the highest level of Q. It learns initially through the
process of random movements (actions) under different conditions (states).
The model records its results (rewards and penalties) and how they impact
its Q level and stores those values to inform and optimize its future actions.
While this sounds simple, implementation is computationally expensive and
beyond the scope of an absolute beginner’s introduction to machine
learning. Reinforcement learning algorithms aren’t covered in this book,
but I’ll leave you with a link to a more comprehensive explanation of
reinforcement learning and Q-learning using the Pac-Man case study.
https://inst.eecs.berkeley.edu/~cs188/sp12/projects/reinforcement/reinforcement.html
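For readers who want a feel for the mechanics, here is a bare-bones Q-learning sketch in Python. The states, actions, rewards, learning rate, and discount factor are all made-up stand-ins (this is not the Berkeley Pac-Man project itself), shown only to illustrate how Q values are updated after each trial.

# A toy Q table: every (state, action) pair starts at 0
states = ["ghost_left", "ghost_right", "clear_path"]
actions = ["left", "right", "up", "down"]
Q = {(s, a): 0.0 for s in states for a in actions}

alpha = 0.1   # learning rate (assumed value)
gamma = 0.9   # discount factor (assumed value)

def update(state, action, reward, next_state):
    # Standard Q-learning update rule: nudge Q toward the reward plus
    # the best Q value achievable from the next state
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Two random trial-and-error steps: moving toward a ghost is penalized,
# moving along a clear path is rewarded
update("ghost_left", "left", -10, "ghost_left")
update("clear_path", "up", 1, "clear_path")
print(Q[("ghost_left", "left")], Q[("clear_path", "up")])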
Compartment 1: Data
Stored in the first compartment of the toolbox is your data. Data constitutes
the input needed to train your model and generate predictions. Data comes
in many forms, including structured and unstructured data. As a beginner,
it’s best to start with (analyzing) structured data. This means that the data is
defined, organized, and labeled in a table, as shown in Table 3. Images,
videos, email messages, and audio recordings are examples of unstructured
data as they don’t fit into the organized structure of rows and columns.
Table 3: Bitcoin Prices from 2015-2017
Each column is also known as a vector. Vectors store your X and y values and multiple vectors (columns) are commonly referred to as matrices. In
the case of supervised learning, y will already exist in your dataset and be
used to identify patterns in relation to the independent variables (X). The y
values are commonly expressed in the final vector, as shown in Figure 7.
Figure 7: The y value is often but not always expressed in the far-right vector
Scatterplots, including 2-D, 3-D, and 4-D plots, are also packed into the
first compartment of the toolbox with the data. A 2-D scatterplot consists of
a vertical axis (known as the y-axis) and a horizontal axis (known as the x-
axis) and provides the graphical canvas to plot variable combinations,
known as data points. Each data point on the scatterplot represents an
observation from the dataset with X values on the x-axis and y values on
the y-axis.
Figure 8: Example of a 2-D scatterplot. X represents days passed and y is Bitcoin price.
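If you want to reproduce a 2-D scatterplot like Figure 8 yourself, the following Matplotlib sketch plots days passed (X) against Bitcoin price (y). The numbers are placeholder values rather than the actual figures from Table 3.

import matplotlib.pyplot as plt

days_passed = [1, 2, 3, 4, 5, 6, 7]                    # X (independent variable)
bitcoin_price = [430, 435, 432, 440, 450, 455, 452]    # y (dependent variable)

plt.scatter(days_passed, bitcoin_price)  # each pair becomes one data point
plt.xlabel("Days passed")
plt.ylabel("Bitcoin price (USD)")
plt.show()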
Compartment 2: Infrastructure
The second compartment of the toolbox contains your machine learning
infrastructure, which consists of platforms and tools for processing data. As
a beginner to machine learning, you are likely to be using a web application
(such as Jupyter Notebook) and a programming language like Python.
There are then a series of machine learning libraries, including NumPy,
Pandas, and Scikit-learn, which are compatible with Python. Machine
learning libraries are a collection of pre-compiled programming routines
frequently used in machine learning that enable you to manipulate data and
execute algorithms with minimal use of code.
You will also need a machine to process your data, in the form of a physical
computer or a virtual server. In addition, you may need specialized libraries
for data visualization such as Seaborn and Matplotlib, or a standalone
software program like Tableau, which supports a range of visualization
techniques including charts, graphs, maps, and other visual options.
With your infrastructure sprayed across the table (hypothetically of course),
you’re now ready to build your first machine learning model. The first step
is to crank up your computer. Standard desktop computers and laptops are
both sufficient for working with smaller datasets that are stored in a central
location, such as a CSV file. You then need to install a programming
environment, such as Jupyter Notebook, and a programming language,
which for most beginners is Python.
Python is the most widely used programming language for machine
learning because:
a) It’s easy to learn and operate.
b) It’s compatible with a range of machine learning libraries.
c) It can be used for related tasks, including data collection (web
scraping) and data piping (Hadoop and Spark).
Other go-to languages for machine learning include C and C++. If you’re
proficient with C and C++, then it makes sense to stick with what you
know. C and C++ are the default programming languages for advanced
machine learning because they can run directly on the GPU (Graphical
Processing Unit). Python needs to be converted before it can run on the
GPU, but we’ll get to this and what a GPU is later in the chapter.
Next, Python users will need to import the following libraries: NumPy,
Pandas, and Scikit-learn. NumPy is a free and open-source library that
allows you to efficiently load and work with large datasets, including
merging datasets and managing matrices.
Scikit-learn provides access to a range of popular shallow algorithms,
including linear regression, clustering techniques, decision trees, and
support vector machines. Shallow learning algorithms refer to learning
algorithms that predict outcomes directly from the input features. Non-
shallow algorithms, or deep learning algorithms, meanwhile, produce an output based
on preceding layers in the model (discussed in Chapter 13 in reference to
artificial neural networks) rather than directly from the input features. [11]
Finally, Pandas enables your data to be represented as a virtual
spreadsheet that you can control and manipulate using code. It shares many
of the same features as Microsoft Excel in that it allows you to edit data and
perform calculations. The name Pandas derives from the term “panel data,”
which refers to its ability to create a series of panels, similar to “sheets” in
Excel. Pandas is also ideal for importing and extracting data from CSV
files.
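A minimal sketch of these libraries working together is shown below; the column names and values are invented for demonstration.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# A Pandas DataFrame acts like a virtual spreadsheet
df = pd.DataFrame({
    "days_passed": [1, 2, 3, 4, 5],
    "price": [430, 435, 432, 440, 450],   # hypothetical values
})
print(df.head())  # preview the first rows, much like glancing at an Excel sheet

# NumPy and Scikit-learn then take over for the numerical and modeling work
X = df[["days_passed"]].to_numpy()
y = df["price"].to_numpy()
model = LinearRegression().fit(X, y)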
Compartment 3: Algorithms
Now that the development environment is set up and you’ve chosen your
programming language and libraries, you can next import your data directly
from a CSV file. You can find hundreds of interesting datasets in CSV
format from kaggle.com. After registering as a Kaggle member, you can
download a dataset of your choosing. Best of all, Kaggle datasets are free,
and there’s no cost to register as a user. The dataset will download directly
to your computer as a CSV file, which means you can use Microsoft Excel to open it and even perform basic algorithms such as linear regression on your dataset.
Next is the third and final compartment that stores the machine learning
algorithms. Beginners typically start out using simple supervised learning
algorithms such as linear regression, logistic regression, decision trees, and
k-nearest neighbors. Beginners are also likely to apply unsupervised learning in the form of k-means clustering and descending dimension
algorithms.
Visualization
No matter how impactful and insightful your data discoveries are, you need
a way to communicate the results to relevant decision-makers. This is
where data visualization comes in handy to highlight and communicate
findings from the data to a general audience. The visual story conveyed through graphs, scatterplots, heatmaps, box plots, and the representation of numbers as shapes makes for quick and easy storytelling.
In general, the less informed your audience is, the more important it is to
visualize your findings. Conversely, if your audience is knowledgeable
about the topic, additional details and technical terms can be used to
supplement visual elements. To visualize your results, you can draw on a
software program like Tableau or a Python library such as Seaborn, which
are stored in the second compartment of the toolbox.
Compartment 2: Infrastructure
Given that advanced learners are dealing with up to petabytes of data,
robust infrastructure is required. Instead of relying on the CPU of a personal
computer, the experts typically turn to distributed computing and a cloud
provider such as Amazon Web Services (AWS) or Google Cloud Platform
to run their data processing on a virtual graphics processing unit (GPU). As
a specialized parallel computing chip, GPU instances are able to perform many more floating-point operations per second than a CPU, allowing for much faster solutions to linear algebra and statistics problems.
GPU chips were originally added to PC motherboards and video consoles
such as the PlayStation 2 and the Xbox for gaming purposes. They were
developed to accelerate the rendering of images with millions of pixels
whose frames needed to be continuously recalculated to display output in
less than a second. By 2005, GPU chips were produced in such large
quantities that prices dropped dramatically and they became almost a
commodity. Although popular in the video game industry, their application
in the space of machine learning wasn’t fully understood or realized until
quite recently. Kevin Kelly, in his book The Inevitable: Understanding the 12 Technological Forces That Will Shape Our Future, explains that in 2009, Andrew Ng and a team at Stanford University discovered a way to link inexpensive GPU clusters to run neural networks consisting of hundreds of millions of connected nodes.
“Traditional processors required several weeks to calculate all the cascading
possibilities in a neural net with one hundred million parameters. Ng found
that a cluster of GPUs could accomplish the same thing in a day,” explains
Kelly. [12]
As mentioned, C and C++ are the preferred languages to directly edit and
perform mathematical operations on the GPU. Python can also be used and
converted into C in combination with a machine learning library such as
TensorFlow from Google. Although it’s possible to run TensorFlow on a
CPU, you can gain up to about 1,000x in performance using the GPU.
Unfortunately for Mac users, TensorFlow is only compatible with the
Nvidia GPU card, which is no longer available with Mac OS X. Mac users
can still run TensorFlow on their CPU but will need to run their workload
on the cloud if they wish to use a GPU.
Amazon Web Services, Microsoft Azure, Alibaba Cloud, Google Cloud
Platform, and other cloud providers offer pay-as-you-go GPU resources,
which may also start off free using a free trial program. Google Cloud
Platform is currently regarded as a leading choice for virtual GPU resources
based on performance and pricing. Google also announced in 2016 that it
would publicly release a Tensor Processing Unit designed specifically for
running TensorFlow, which is already used internally at Google.
DATA SCRUBBING
Like most varieties of fruit, datasets need upfront cleaning and human
manipulation before they’re ready for consumption. The “clean-up” process
applies to machine learning and many other fields of data science and is
known in the industry as data scrubbing . This is the technical process of
refining your dataset to make it more workable. This might involve
modifying and removing incomplete, incorrectly formatted, irrelevant or
duplicated data. It might also entail converting text-based data to numeric
values and the redesigning of features.
For data practitioners, data scrubbing typically demands the greatest
application of time and effort.
Feature Selection
To generate the best results from your data, it’s essential to identify which
variables are most relevant to your hypothesis or objective. In practice, this
means being selective in choosing the variables you include in your model.
Moreover, preserving features that don’t correlate strongly with the output
value can manipulate and derail the model’s accuracy. Let’s consider the
following data excerpt downloaded from kaggle.com documenting dying
languages.
Table 4: Endangered languages, database: https://www.kaggle.com/the-guardian/extinct-languages
This enables us to transform the dataset in a way that preserves and captures
information using fewer variables. The downside to this transformation is
that we have less information about the relationships between specific
products. Rather than recommending products to users according to other
individual products, recommendations will instead be based on associations
between product subtypes or recommendations of the same product
subtype.
Nonetheless, this approach still upholds a high level of data relevancy.
Buyers will be recommended health food when they buy other health food
or when they buy apparel (depending on the degree of correlation), and
obviously not machine learning textbooks—unless it turns out that there is a
strong correlation there! But alas, such a variable/category is outside the
frame of this dataset.
Remember that data reduction is also a business decision and business
owners in counsel with their data science team must consider the trade-off
between convenience and the overall precision of the model.
Row Compression
In addition to feature selection, you may need to reduce the number of rows
and thereby compress the total number of data points. This may involve
merging two or more rows into one, as shown in the following dataset, with
“Tiger” and “Lion” merged and renamed as “Carnivore.”
By merging these two rows (Tiger & Lion), the feature values for both rows
must also be aggregated and recorded in a single row. In this case, it’s
possible to merge the two rows because they possess the same categorical
values for all features except Race Time—which can be easily aggregated.
The race time of the Tiger and the Lion can be added and divided by two.
Numeric values are normally easy to aggregate unless they are categorical. For instance, it would be impossible to aggregate an animal
with four legs and an animal with two legs! We obviously can’t merge these
two animals and set “three” as the aggregate number of legs.
Row compression can also be challenging to implement in cases where
numeric values aren’t available. For example, the values “Japan” and
“Argentina” are very difficult to merge. The values “Japan” and “South
Korea” can be merged, as they can be categorized as countries from the
same continent, “Asia” or “East Asia.” However, if we add “Pakistan” and
“Indonesia” to the same group, we may begin to see skewed results, as there
are significant cultural, religious, economic, and other dissimilarities
between these four countries.
In summary, non-numeric and categorical row values can be problematic to
merge while preserving the true value of the original data. Also, row compression is usually less attainable than feature compression, especially for datasets with a high number of features.
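In Pandas, this kind of row compression can be sketched with a groupby operation; the animal rows and race times below are hypothetical values modeled on the Tiger and Lion example.

import pandas as pd

df = pd.DataFrame({
    "animal": ["Tiger", "Lion"],
    "category": ["Carnivore", "Carnivore"],
    "legs": [4, 4],
    "race_time": [30.2, 28.8],   # made-up race times in seconds
})

# Merge the two rows on their shared categorical value and average Race Time
compressed = df.groupby("category", as_index=False).agg(
    legs=("legs", "first"),
    race_time=("race_time", "mean"),
)
print(compressed)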
One-hot Encoding
After finalizing the features and rows to be included in your model, you
next want to look for text-based values that can be converted into numbers.
Aside from set text-based values such as True/False (that automatically
convert to “1” and “0” respectively), most algorithms are not compatible
with non-numeric data.
One method to convert text-based values into numeric values is one-hot
encoding , which transforms values into binary form, represented as “1” or
“0”—“True” or “False.” A “0,” representing False, means that the value
does not belong to a given feature, whereas a “1”—True or “hot”—
confirms that the value does belong to that feature.
Below is another excerpt from the dying languages dataset which we can
use to observe one-hot encoding.
Table 8: Endangered languages
Before we begin, note that the values contained in the “No. of Speakers” column do not contain commas or spaces, such as 7,500,000 or 7 500 000.
Although formatting makes large numbers easier for human interpretation,
programming languages don’t require such niceties. Formatting numbers
can lead to an invalid syntax or trigger an unwanted result, depending on
the programming language—so remember to keep numbers unformatted for
programming purposes. Feel free, though, to add spacing or commas at the
data visualization stage, as this will make it easier for your audience to interpret, especially when presenting large numbers.
On the right-hand side of the table is a vector categorizing the degree of
endangerment of nine different languages. We can convert this column into
numeric values by applying the one-hot encoding method, as demonstrated
in the subsequent table.
Table 9: Example of one-hot encoding
Using one-hot encoding, the dataset has expanded to five columns, and we
have created three new features from the original feature (Degree of
Endangerment). We have also set each column value to “1” or “0,”
depending on the value of the original feature. This now makes it possible
for us to input the data into our model and choose from a broader spectrum
of machine learning algorithms. The downside is that we have more dataset
features, which may slightly extend processing time. This is usually
manageable but can be problematic for datasets where the original features
are split into a large number of new features.
One hack to minimize the total number of features is to restrict binary cases
to a single column. As an example, a speed dating dataset on kaggle.com
lists “Gender” in a single column using one-hot encoding. Rather than
create discrete columns for both “Male” and “Female,” they merged these
two features into one. According to the dataset’s key, females are denoted as
“0” and males as “1.” The creator of the dataset also used this technique for
“Same Race” and “Match.”
Table 10: Speed dating results, database: https://www.kaggle.com/annavictoria/speed-dating-
experiment
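In code, one-hot encoding can be sketched in one line with Pandas; the language rows below are an invented fragment rather than the full Kaggle dataset.

import pandas as pd

df = pd.DataFrame({
    "language": ["Language A", "Language B", "Language C"],   # placeholder names
    "degree_of_endangerment": ["Vulnerable", "Extinct", "Extinct"],
})

# Expand the categorical column into separate binary (1/0) features
encoded = pd.get_dummies(df, columns=["degree_of_endangerment"], dtype=int)
print(encoded)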
Binning
Binning (also called bucketing) is another method of feature engineering, used for converting continuous numeric values into multiple binary features called bins or buckets according to their range of values.
Whoa, hold on! Aren’t numeric values a good thing? Yes, in most cases
continuous numeric values are preferred as they are compatible with a
broader selection of algorithms. Numeric values are less ideal, however, in situations where they list variations irrelevant to the goals of your analysis.
Let’s take house price evaluation as an example. The exact measurements
of a tennis court might not matter much when evaluating house property
prices; the relevant information is whether the property has a tennis court.
This logic probably also applies to the garage and the swimming pool,
where the existence or non-existence of the variable is generally more
influential than their specific measurements.
The solution here is to replace the numeric measurements of the tennis
court with a True/False feature or a categorical value such as “small,”
“medium,” and “large.” Another alternative would be to apply one-hot
encoding with “0” for homes that do not have a tennis court and “1” for
homes that do have a tennis court.
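A rough sketch of binning with Pandas is shown below; the tennis court sizes are hypothetical square-meter figures, with 0 meaning no court.

import pandas as pd

court_size = pd.Series([0, 0, 190, 260, 0, 310])

# Option 1: a simple binary feature (does the property have a court?)
has_court = (court_size > 0).astype(int)

# Option 2: bucket the sizes into labeled bins
bins = pd.cut(court_size, bins=[-1, 0, 200, 280, 1000],
              labels=["none", "small", "medium", "large"])

print(pd.DataFrame({"size": court_size, "has_court": has_court, "bin": bins}))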
Normalization
While machine learning algorithms can run without using the next two
techniques, normalization and standardization help to improve model
accuracy when used with the right algorithm. The former (normalization)
rescales the range of values for a given feature into a set range with a
prescribed minimum and maximum, such as [0, 1] or [−1, 1]. By containing
the range of the feature, this technique helps to normalize the variance
among the dataset’s features which may otherwise be exaggerated by
another factor. The variance of a feature measured in centimeters, for
example, might distract the algorithm from another feature with a similar or
higher degree of variance but that is measured in meters or another metric
that downplays the actual variance of the feature.
Normalization, however, usually isn’t recommended for rescaling features
with an extreme range as the normalized range is too narrow to emphasize
extremely high or low feature values.
Standardization
A better technique for emphasizing high or low feature values is
standardization. This technique rescales features to a standard normal distribution with a mean of zero and a standard deviation (σ) of one. [15] This
means that an extremely high or low value would be expressed as three or
more standard deviations from the mean.
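The two rescaling techniques can be sketched with Scikit-learn's preprocessing module; the feature values below (heights in centimeters) are made up.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[150.0], [172.0], [165.0], [199.0], [210.0]])

# Normalization: rescale values into the set range [0, 1]
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(X))

# Standardization: rescale to a mean of zero and a standard deviation of one
print(StandardScaler().fit_transform(X))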
Missing Data
Dealing with missing data is never a desired situation. Imagine unpacking a
jigsaw puzzle with five percent of the pieces missing. Missing values in
your dataset can be equally frustrating and interfere with your analysis and
the model’s predictions. There are, however, strategies to minimize the
negative impact of missing data.
One approach is to approximate missing values using the mode value. The
mode represents the single most common variable value available in the
dataset. This works best with categorical and binary variable types, such as
one to five-star rating systems and positive/negative drug tests respectively.
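A quick sketch of mode imputation with Pandas follows; the one-to-five star ratings are invented, with NaN marking the missing values.

import numpy as np
import pandas as pd

ratings = pd.Series([4, 5, 4, np.nan, 3, 4, np.nan])

# Replace missing values with the mode (the most common rating in the data)
ratings_filled = ratings.fillna(ratings.mode()[0])
print(ratings_filled)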
While it’s common to split the data 70/30 or 80/20, there is no set rule for
preparing a training-test split. Given the growing size of modern datasets
(with upwards of a million or more rows), it might be optimal to use a less
even split such as 90/10 as this will give you more data to train your model
while having enough data left over to test your model.
Before you split your data, it’s essential that you randomize the row order.
This helps to avoid bias in your model, as your original dataset might be
arranged alphabetically or sequentially according to when the data was
collected. If you don’t randomize the data, you may accidentally omit
significant variance from the training data that can cause unwanted
surprises when you apply the training model to your test data. Fortunately,
Scikit-learn provides a built-in command to shuffle and randomize your
data with just one line of code as demonstrated in Chapter 17.
After randomizing the data, you can begin to design your model and apply
it to the training data. The remaining 30 percent or so of data is put to the
side and reserved for testing the accuracy of the model later; it’s imperative
not to test your model with the same data you used for training. In the case
of supervised learning, the model is developed by feeding the machine the
training data and analyzing relationships between the features (X) of the
input data and the final output (y).
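The split-and-shuffle step can be sketched with Scikit-learn's train_test_split; the feature and target values below are synthetic placeholders.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)              # placeholder feature values
y = 3 * X.ravel() + np.random.randn(20)       # placeholder target values

# shuffle=True (the default) randomizes the row order before splitting 70/30
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=10)

print(len(X_train), len(X_test))   # 14 training rows, 6 test rows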
The next step is to measure how well the model performed. There is a range
of performance metrics and choosing the right method depends on the
application of the model. Area under the curve (AUC) – Receiver Operating
Characteristic (ROC) [16], confusion matrix, recall, and accuracy are four
examples of performance metrics used with classification tasks such as an
email spam detection system. Meanwhile, mean absolute error and root
mean square error (RMSE) are commonly used to assess models that
provide a numeric output such as a predicted house value.
In this book, we use mean absolute error (MAE), which measures the
average of the errors in a set of predictions on a numeric/continuous scale,
i.e., how far the regression hyperplane is from a given data point. Using Scikit-
learn, mean absolute error is found by inputting the X values from the
training data into the model and generating a prediction for each row in the
dataset. Scikit-learn compares the predictions of the model to the correct
output (y) and measures the model’s accuracy. You’ll know that the model
is accurate when the error rate for the training and test dataset is low, which
means the model has learned the dataset’s underlying trends and patterns. If
the average recorded MAE or RMSE is much higher using the test data than
the training data, this is usually an indication of overfitting (discussed in
Chapter 11) in the model. Once the model can adequately predict the values
of the test data, it’s ready to use in the wild.
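Below is a self-contained sketch of measuring mean absolute error on both partitions with Scikit-learn; the data is synthetic, so the exact error values carry no meaning.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(-1, 1)               # placeholder features
y = 2.5 * X.ravel() + np.random.randn(30)      # placeholder target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

model = LinearRegression().fit(X_train, y_train)

# A much higher test error than training error usually signals overfitting
print("Training MAE:", mean_absolute_error(y_train, model.predict(X_train)))
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))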
If the model fails to predict values from the test data accurately, check that
the training and test data were randomized. Next, you may need to modify
the model's hyperparameters. Each algorithm has hyperparameters; these
are your algorithm’s learning settings (and not the settings of the actual
model itself). In simple terms, hyperparameters control and impact how fast
the model learns patterns and which patterns to identify and analyze.
Algorithm hyperparameters and optimization are discussed in Chapter 11 and Chapter 18.
Cross Validation
While split validation can be effective for developing models using existing
data, question marks naturally arise over whether the model can remain
accurate when used on new data. If your existing dataset is too small to
construct a precise model, or if the training/test partition of data is not
appropriate, this may later lead to poor predictions with live data.
Fortunately, there is a valid workaround for this problem. Rather than split
the data into two segments (one for training and one for testing), you can
implement what’s called cross validation . Cross validation maximizes the
availability of training data by splitting data into various combinations and
testing each specific combination.
Cross validation can be performed using one of two primary methods. The
first method is exhaustive cross validation, which involves finding and testing all possible combinations to divide the original sample into a training set and a test set. The alternative and more common method is non-exhaustive cross validation, known as k-fold validation. The k-fold validation technique involves splitting data into k assigned buckets and reserving one of those buckets for testing the training model at each round.
To perform k-fold validation, data are randomly assigned to k number of equal-sized buckets. One bucket is reserved as the test bucket and is used to measure and evaluate the performance of the model trained on the remaining (k-1) buckets.
Figure 13: k-fold validation
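A short sketch of k-fold validation with Scikit-learn is shown below, again on synthetic placeholder data; cross_val_score handles the bucket rotation for you.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.arange(50).reshape(-1, 1)               # placeholder features
y = 1.5 * X.ravel() + np.random.randn(50)      # placeholder target

# cv=5 splits the data into five buckets; each takes one turn as the test bucket
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_absolute_error")
print(-scores)   # the mean absolute error recorded for each of the five rounds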
You can also find video tutorials on how to code models in Python using
algorithms mentioned in this book. You can find these free video tutorials at
https://scatterplotpress.teachable.com/p/ml-code-exercises .
LINEAR REGRESSION
As the “Hello World” of supervised learning algorithms, regression analysis
is a simple technique for predicting an unknown variable using the results
you do know. The first regression technique we’ll examine is linear
regression, which generates a straight line to describe linear relationships.
We’ll start by examining the basic components of simple linear regression
with one independent variable before discussing multiple regression with
multiple independent variables.
Using the Seinfeld TV sitcom series as our data, let’s start by plotting the
two following variables, with season number as the x coordinate and the
number of viewers per season (in millions) as the y coordinate.
We can now see the dataset plotted on the scatterplot, with an upward trend
in viewers starting at season 4 and the peak at season 9.
Let’s next define the independent and dependent variables. For this
example, we’ll use the number of viewers per season as the dependent
variable (what we want to predict) and the season number as the
independent variable.
Using simple linear regression, let’s now insert a straight line to describe
the upward linear trend of our small dataset.
Figure 15: Linear regression hyperplane
As shown in Figure 15, this regression line neatly dissects the full company
of data points. The technical term for the regression line is the hyperplane,
and you’ll see this term used throughout your study of machine learning. In
a two-dimensional space, a hyperplane serves as a (flat) trendline, which is
how Google Sheets titles linear regression in their scatterplot customization
menu.
The goal of linear regression is to split the data in a way that minimizes the
distance between the hyperplane and the observed values. This means that
if you were to draw a perpendicular line (a straight line at an angle of 90 degrees) from the hyperplane to each data point on the plot, the aggregate distance of all the data points would be the smallest possible total distance to the hyperplane. The distance between the best fit line and the observed values
is called the residual or error and the closer those values are to the
hyperplane, the more accurate the model’s predictions.
Figure 16: Error is the distance between the hyperplane and the observed value
The Slope
An important part of linear regression is the slope , which can be
conveniently calculated by referencing the hyperplane. As one variable
increases, the other variable will increase by the average value denoted by
the hyperplane. The slope is therefore helpful for formulating predictions,
such as predicting the number of season viewers for a potential tenth season
of Seinfeld. Using the slope, we can input 10 as the x coordinate and find
the corresponding y value, which in this case, is approximately 40 million
viewers.
Figure 17: Using the slope/hyperplane to make a prediction
While linear regression isn’t a fail-proof method for predicting trends, the
trendline does offer a basic reference point for predicting unknown or future
events.
Calculation Example
Although your programming language takes care of this automatically, it’s
interesting to know how simple linear regression works. We’ll use the
following dataset to break down the formula.
Table 12: Sample dataset
# The final two columns of the table are not part of the original dataset and have been added for
reference to complete the following formula.
Where:
Σ = Total sum
Σx = Total sum of all x values (1 + 2 + 1 + 4 + 3 = 11)
Σy = Total sum of all y values (3 + 4 + 2 + 7 + 5 = 21)
Σxy = Total sum of x*y for each row (3 + 8 + 2 + 28 + 15 = 56)
Σx² = Total sum of x*x for each row (1 + 4 + 1 + 16 + 9 = 31)
n = Total number of rows. In the case of this example, n is equal to 5.

a = (Σy × Σx² − Σx × Σxy) / (n × Σx² − (Σx)²)
a = ((21 × 31) − (11 × 56)) / ((5 × 31) − 11²)
a = (651 − 616) / (155 − 121)
a = 35 / 34 = 1.029

b = (n × Σxy − Σx × Σy) / (n × Σx² − (Σx)²)
b = ((5 × 56) − (11 × 21)) / ((5 × 31) − 11²)
b = (280 − 231) / (155 − 121)
b = 49 / 34 = 1.441
Insert the “a” and “b” values into the linear formula.
y = bx + a
y = 1.441x + 1.029
The linear formula y = 1.441x + 1.029 dictates how to draw the hyperplane.
Let’s now test the linear regression model by looking up the coordinates for
x = 2.
y = 1.441(x) + 1.029
y = 1.441(2) + 1.029
y = 3.911
In this case, the prediction is very close to the actual result of 4.0.
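If you'd like to verify the arithmetic, the short Python sketch below reproduces the a and b calculations from the worked example.

x = [1, 2, 1, 4, 3]
y = [3, 4, 2, 7, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                      # 11 and 21
sum_xy = sum(xi * yi for xi, yi in zip(x, y))      # 56
sum_x2 = sum(xi ** 2 for xi in x)                  # 31

a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

print(round(a, 3), round(b, 3))   # 1.029 and 1.441
print(b * 2 + a)                  # approximately 3.91, close to the actual 4.0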
Discrete Variables
While the output (dependent variable) of linear regression must be
continuous in the form of a floating-point or integer (whole number) value,
the input (independent variables) can be continuous or categorical. Categorical variables, such as gender, must be expressed numerically using one-hot encoding (0 or 1) and not as a string of letters (male, female).
Variable Selection
Before finishing this chapter, it’s important to address the dilemma of
variable selection and choosing an appropriate number of independent
variables. On the one hand, adding more variables helps to account for
more potential factors that control patterns in the data. On the other hand,
this rationale only holds if the variables are relevant and possess some
correlation/linear relationship with the dependent variable.
The expansion of independent variables also creates more relationships to
consider. In simple linear regression, we saw a one-to-one relationship
between two variables, whereas in multiple linear regression there is a
many-to-one relationship. In multiple linear regression, not only are the
independent variables potentially related to the dependent variable, but they
are also potentially related to each other.
Figure 19: Simple linear regression (above) and multiple linear regression (below)
Judging by the upward linear trend, we can see that these two variables are
partly correlated. However, if we were to insert a linear regression hyperplane, there would be significant residuals/error on both sides of the hyperplane, confirming that these two variables aren't strongly or directly correlated, so we can include both of these variables in our regression model.
The following heatmap, shown in Figure 21, also confirms a modest
correlation score of 0.6 between total_bill and size.
1) The dependent variable for this model should be which variable?
A) size
B) total_bill and tip
C) total_bill
D) tip
2) From looking only at the data preview above, which variable(s)
appear to have a linear relationship with total_bill?
A) smoker
B) total_bill and size
C) time
D) smoker
3) It’s important for the independent variables to be strongly
correlated with the dependent variable and one or more of the other
independent variables. True or False?
ANSWERS
1) D, tip
LOGISTIC REGRESSION
As demonstrated in the previous chapter, linear regression is useful for
quantifying relationships between variables to predict a continuous
outcome. Total bill and size (number of guests) are both examples of
continuous variables.
However, what if we want to predict a categorical variable such as “new
customer” or “returning customer”? Unlike linear regression, the dependent
variable (y) is no longer a continuous variable (such as total tip) but rather a
discrete categorical variable.
Rather than quantify the linear relationship between variables, we need to
use a classification technique such as logistic regression.
Logistic regression is still a supervised learning technique but produces a
qualitative prediction rather than a quantitative prediction. This algorithm is
often used to predict two discrete classes, e.g., pregnant or not pregnant. Given its strength in binary classification, logistic regression is used in many fields including fraud detection, disease diagnosis, emergency detection, and loan default detection, as well as identifying spam email through the process of discerning specific classes, e.g., non-spam and spam.
Using the sigmoid function, logistic regression finds the probability of
independent variables (X) producing a discrete dependent variable (y) such
as “spam” or “non-spam.”
The sigmoid function is expressed as:

y = 1 / (1 + e^-x)

Where:
x = the independent variable you wish to transform
e = Euler's constant, 2.718
Figure 22: A sigmoid function used to classify data points
The sigmoid function produces an S-shaped curve that can convert any
number and map it into a numerical value between 0 and 1 but without ever
reaching those exact limits. Applying this formula, the sigmoid function
converts independent variables into an expression of probability between 0
and 1 in relation to the dependent variable. In a binary case, a value of 0
represents no chance of occurring, and 1 represents a certain chance of
occurring. The degree of probability for values located between 0 and 1 can
be found according to how close they rest to 0 (impossible) or 1 (certain
possibility).
Based on the found probabilities of the independent variables, logistic
regression assigns each data point to a discrete class. In the case of binary
classification (shown in Figure 22), the cut-off line to classify data points is
0.5. Data points that record a value above 0.5 are classified as Class A, and
data points below 0.5 are classified as Class B. Data points that record a
result of precisely 0.5 are unclassifiable but such instances are rare due to
the mathematical component of the sigmoid function.
Following the logistic transformation using the Sigmoid function, the data
points are assigned to one of two classes as presented in Figure 23.
Figure 23: An example of logistic regression
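To see the sigmoid transformation and the 0.5 cut-off in code, here is a minimal sketch; the single input feature and the spam/non-spam labels are invented for illustration.

import math
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    # Maps any number to a value between 0 and 1 (never reaching either limit)
    return 1 / (1 + math.e ** -x)

print(sigmoid(-4), sigmoid(0), sigmoid(4))

X = np.array([[1], [2], [3], [10], [11], [12]])   # hypothetical feature values
y = np.array([0, 0, 0, 1, 1, 1])                  # 0 = non-spam, 1 = spam

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.5], [10.5]]))   # probabilities on either side of 0.5
print(clf.predict([[2.5], [10.5]]))         # the resulting class assignments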
1) Which three variables (in their current form) could we use as the
dependent variable to classify penguins?
2) Which row(s) contains missing values?
3) Which variable in the dataset preview is binary?
ANSWERS
1) species, island, or sex
3) sex
(Species and island might also be binary but we can’t judge from the
screenshot alone.)
k-NEAREST NEIGHBORS
Another popular classification technique in machine learning is k-nearest neighbors (k-NN). As a supervised learning algorithm, k-NN classifies new data points based on their position relative to nearby data points.
In many ways, k-NN is similar to a voting system or a popularity contest. Imagine you're the new kid at school and you need to know how to dress in order to fit in with the rest of the class. On your first day at school, you see six of the nine students sitting closest to you with their sleeves rolled up. Based on numerical supremacy and close proximity, the following day you also make the decision to roll up your sleeves.
Let’s now look at another example.
Figure 25: An example of k-NN clustering used to predict the class of a new data point
Here in Figure 25, the data points have been classified into two classes, and a new data point, whose class is unknown, is added to the plot. Using k-NN, we can predict the category of the new data point based on its position relative to the existing data points.
First, though, we need to set “k” to determine how many data points we want to use to classify the new data point. If we set k to 3, k-NN analyzes the new data point's position with respect to the three nearest data points (neighbors). The outcome of selecting the three closest neighbors returns two Class B data points and one Class A data point. Defined by k (3), the model's prediction for determining the category of the new data point is Class B as it returns two out of the three nearest neighbors.
The chosen number of neighbors, defined by k, is crucial in determining the results. In Figure 25, you can see that the outcome of classification changes by altering k from “3” to “7.” It's therefore useful to test numerous k combinations to find the best fit and avoid setting k too low or too high. Setting k too low will increase bias and lead to misclassification, while setting k too high will make the model computationally expensive. Setting k to an odd number will also help to eliminate the possibility of a statistical stalemate and an invalid result. Five is the default number of neighbors for this algorithm using Scikit-learn.
Given that the scale of the individual variables has a major impact on the
output of k -NN, the dataset usually needs to be scaled to standardize
variance as discussed in Chapter 5. This transformation will help to avoid
one or more variables with a high range unfairly pulling the focus of the k-NN model.
In regard to what type of data to use with k-NN, this algorithm works best with continuous variables. It is still possible to use binary categorical variables represented as 0 and 1, but the results are more likely to be informed by the binary splits than by the dispersion across other variables, as visualized in Figure 26.
Figure 26: One binary variable and two continuous variables
Above, we can see that the horizontal x-axis is binary (0 or 1), which splits
the data into two distinct sides. Moreover, if we switch one of the existing
continuous variables to a binary variable (as shown in Figure 27), we can
see that the distance between data points is influenced even more heavily by
the outcome of the binary variables.
If you do wish to examine binary variables, it’s therefore best to only
include critical binary variables for k -NN analysis.
3) One-hot encoding (to convert the variable into a numerical identifier of
0 or 1)
10
k-MEANS CLUSTERING
The next method of analysis involves grouping or clustering data points that
share similar attributes using unsupervised learning. An online business, for
example, wants to examine a segment of customers that purchase at the
same time of the year and discern what factors influence their purchasing
behavior. By understanding a given cluster of customers, they can then form
decisions regarding which products to recommend to customer groups using
promotions and personalized offers. Outside of market research, clustering
can also be applied to other scenarios, including pattern recognition, fraud
detection, and image processing.
One of the most popular clustering techniques is k-means clustering. As an
unsupervised learning algorithm, k-means clustering attempts to divide data
into k discrete groups and is highly effective at uncovering new patterns.
Examples of potential groupings include animal species, customers with
similar features, and housing market segments.
Figure 28: Comparison of original data and clustered data using k-means
Each data point can be assigned to only one cluster, and each cluster is
discrete. This means that there’s no overlap between clusters and no case of
nesting a cluster inside another cluster. Also, all data points, including
anomalies, are assigned to a centroid irrespective of how they impact the
final shape of the cluster. However, due to the statistical force that pulls
nearby data points to a central point, clusters will typically form an
elliptical or spherical shape.
After all data points have been allocated to a centroid, the next step is to
calculate the mean value of the data points in each cluster, which can be
found by averaging the x and y values of the data points contained in each
cluster.
Next, take the mean value of the data points in each cluster and plug in
those x and y values to update your centroid coordinates. This will most
likely result in one or more changes to the location of your centroid(s). The
total number of clusters, however, remains the same as you are not creating
new clusters but rather updating their position on the scatterplot. Like
musical chairs, the remaining data points rush to the closest centroid to
form k number of clusters.
Should any data point on the scatterplot switch clusters with the changing
of centroids, the previous step is repeated. This means, again, calculating
the average mean value of the cluster and updating the x and y values of
each centroid to reflect the average coordinates of the data points in that
cluster.
Once you reach a stage where the data points no longer switch clusters after
an update in centroid coordinates, the algorithm is complete, and you have
your final set of clusters.
The following diagrams break down the full algorithmic process.
Figure 32: Two clusters are formed after calculating the Euclidean distance of the remaining data
points to the centroids.
Figure 33: The centroid coordinates for each cluster are updated to reflect the cluster's mean value.
The two previous centroids stay in their original position and two new centroids are added to the
scatterplot. Lastly, as one data point has switched from the right cluster to the left cluster, the
centroids of both clusters need to be updated one last time.
Figure 34: Two final clusters are produced based on the updated centroids for each cluster
For this example, it took two iterations to successfully create our two
clusters. However, k -means clustering is not always able to reliably identify
a final combination of clusters. In such cases, you will need to switch
tactics and utilize another algorithm to formulate your classification model.
Also, be aware that you may need to rescale the input features using
standardization before running the k -means algorithm. This will help to
preserve the true shape of the clusters and avoid exaggerated variance from
affecting the final output (i.e. over-stretched clusters).
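A minimal sketch of k-means in Python with Scikit-learn (an illustration, not code from this book), using standardization and a small set of hypothetical data points:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data points (x, y)
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [8.0, 9.0], [8.2, 8.8], [7.9, 9.1]])

# Standardize the features to avoid exaggerated variance stretching the clusters
X_scaled = StandardScaler().fit_transform(X)

# Divide the data into k = 2 discrete clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels)                    # cluster assignment for each data point
print(kmeans.cluster_centers_)   # final centroid coordinates (in scaled space)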
Setting k
When setting “k” for k -means clustering, it’s important to find the right
number of clusters. In general, as k increases, clusters become smaller and
variance falls. However, the downside is that neighboring clusters become
less distinct from one another as k increases. If you set k to the same
number of data points in your dataset, each data point automatically
becomes a standalone cluster. Conversely, if you set k to 1, then all data
points will be deemed as homogenous and fall inside one large cluster.
Needless to say, setting k to either extreme does not provide any worthwhile
insight.
In order to optimize k, you may wish to use a scree plot for guidance. A
scree plot charts the degree of scattering (variance) inside the clusters as the
total number of clusters increases. Scree plots are famous for their iconic
"elbow," the pronounced kink in the plot's curve where adding further
clusters produces sharply diminishing reductions in variance. A scree plot
compares the Sum of Squared Error (SSE) for each variation of total
clusters. SSE is measured as the sum of the squared distances between each
data point in a cluster and its centroid. In a nutshell, SSE drops as more
clusters are produced.
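A hedged sketch of producing a scree (elbow) plot in Python, using Scikit-learn's inertia_ attribute as the SSE for each candidate value of k; the dataset below is a hypothetical example with four natural clusters:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical standardized dataset: 200 data points grouped around four centers
rng = np.random.RandomState(0)
X_scaled = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (-3, 0, 3, 6)])

k_values = range(1, 10)
sse = []
for k in k_values:
    # Fit k-means for each candidate k and record the SSE (inertia)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    sse.append(kmeans.inertia_)

# Plot SSE against k and look for the "elbow" where improvement slows sharply
plt.plot(k_values, sse, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('SSE')
plt.show()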
Figure 35: A scree plot
A. k = 2
B. k = 100
C. k = 12
D. k = 3
2) What mathematical technique might we use to find the appropriate
number of clusters?
A. Big elbow method
B. Mean absolute error
C. Scree plot
D. One-hot encoding
3) Which variable requires data scrubbing?
ANSWERS
1) 12
(Given there are 12 months in a year, there may be some reoccurring
patterns in regards to the number of passengers for each month.)
2) C, Scree plot
3) Month
(This variable needs to be converted into a numerical identifier in order to
measure its distance to other variables.)
11
BIAS & VARIANCE
Figure 36: Example of hyperparameters in Python for the algorithm gradient boosting
Shooting targets, as seen in Figure 37, are not a visualization technique used
in machine learning but can be used here to explain bias and variance. [19]
Imagine that the center of the target, or the bull’s-eye, perfectly predicts the
correct value of your data. The dots marked on the target represent an
individual prediction of your model based on the training or test data
provided. In certain cases, the dots will be densely positioned close to the
bull’s-eye, ensuring that predictions made by the model are close to the
actual values and patterns found in the data. In other cases, the model’s
predictions will lie more scattered across the target. The more the
predictions deviate from the bull’s-eye, the higher the bias and the less
reliable your model is at making accurate predictions.
In the first target, we can see an example of low bias and low variance. The
bias is low because the model’s predictions are closely aligned to the center,
and there is low variance because the predictions are positioned densely in
one general location.
The second target (located on the right of the first row) shows a case of low
bias and high variance. Although the predictions are not as close to the
bull’s-eye as the previous example, they are still near to the center, and the
bias is therefore relatively low. However, there is a high variance this time
because the predictions are spread out from each other.
The third target (located on the left of the second row) represents high bias
and low variance and the fourth target (located on the right of the second
row) shows high bias and high variance.
Ideally, you want a situation where there’s both low variance and low bias.
In reality, however, there’s a trade-off between optimal bias and optimal
variance. Bias and variance both contribute to error but it’s the prediction
error that you want to minimize, not the bias or variance specifically.
Like learning to ride a bicycle for the first time, finding an optimal balance
is one of the more challenging aspects of machine learning. Pedaling
algorithms through the data is the easy part; the hard part is navigating bias
and variance while maintaining a state of balance in your model.
The new data point is a circle, but it’s located incorrectly on the left side of
the logistic (A) decision boundary (designated for stars). The new data
point, though, remains correctly located on the right side of the SVM (B)
decision boundary (designated for circles) courtesy of ample “support”
supplied by the margin.
Figure 42: Mitigating anomalies
While the examples discussed so far have comprised two features plotted on
a two-dimensional scatterplot, SVM’s real strength lies with high-
dimensional data and handling multiple features. SVM has numerous
advanced variations available to classify high-dimensional data using
what’s called the Kernel Trick. This is an advanced solution to map data
from a low-dimensional to a high-dimensional space when a dataset cannot
be separated using a linear decision boundary in its original space.
Transitioning from a two-dimensional to a three-dimensional space, for
example, allows us to use a linear plane to split the data within a 3-D area.
In other words, the kernel trick lets us classify data points with non-linear
characteristics using linear classification in a higher dimension.
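As a hedged sketch of the kernel trick in Python with Scikit-learn (an illustration, not code from this book), the RBF kernel below implicitly maps a non-linearly separable dataset into a higher-dimensional space where a linear separator can be found; the ring-shaped toy data is hypothetical:
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: one class clustered near the origin, the other forming a surrounding ring
rng = np.random.RandomState(0)
inner = rng.normal(0, 0.4, size=(30, 2))
angles = rng.uniform(0, 2 * np.pi, 30)
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
X = np.vstack([inner, outer])
y = np.array([1] * 30 + [0] * 30)

# The RBF kernel applies the kernel trick: a linear boundary in the higher-dimensional
# feature space becomes a non-linear boundary in the original 2-D space
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X, y)

print(model.predict([[0.1, -0.2], [3.0, 0.2]]))   # inner point -> 1, ring point -> 0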
Figure 44: In this example, the decision boundary provides a non-linear separator between the data in
a 2-D space but transforms into a linear separator between data points when projected into a 3-D
space
2) For this model, all variables except for island could potentially be
used as independent variables.
(Penguins living on islands with abundant food sources and few
predators, for example, may have a more balanced ratio of male and
female penguins and grow larger in size. Of course, the only way to find
out is to test the relationship between the island and individual variables
using correlation analysis and other exploratory data analysis
techniques.)
The brain, for example, contains interconnected neurons with dendrites that
receive inputs. From these inputs, the neuron produces an electric signal
output from the axon and emits these signals through axon terminals to
other neurons. Similarly, artificial neural networks consist of interconnected
decision functions, known as nodes, which interact with each other through
axon-like edges.
The nodes of a neural network are separated into layers and generally start
with a wide base. This first layer consists of raw input data (such as
numeric values, text, image pixels or sound) divided into nodes. Each input
node then sends information to the next layer of nodes via the network’s
edges.
Figure 46: The nodes, edges/weights, and sum/activation function of a basic neural network
Each edge in the network has a numeric weight that can be altered based on
experience. If the sum of the connected edges satisfies a set threshold,
known as the activation function , this activates a neuron at the next layer. If
the sum of the connected edges does not meet the set threshold, the
activation function fails, which results in an all or nothing arrangement.
Moreover, the weights assigned to each edge are unique, which means the
nodes fire differently, preventing them from producing the same solution.
Using supervised learning, the model’s predicted output is compared to the
actual output (that’s known to be correct), and the difference between these
two results is measured as the cost or cost value . The purpose of training is
to reduce the cost value until the model’s prediction closely matches the
correct output. This is achieved by incrementally tweaking the network’s
weights until the lowest possible cost value is obtained. This particular
process of training the neural network is called back-propagation . Rather
than navigate from left to right like how data is fed into the network, back-
propagation rolls in reverse from the output layer on the right to the input
layer on the left.
The middle layers are considered hidden because, like human vision, they
covertly process objects between the input and output layers. When faced
with four lines connected in the shape of a square, our eyes instantly
recognize those four lines as a square. We don’t notice the mental
processing that is involved to register the four polylines (input) as a square
(output).
Neural networks work in a similar way as they break data into layers and
process the hidden layers to produce a final output. As more hidden layers
are added to the network, the model’s capacity to analyze complex patterns
also improves. This is why models with a deep number of layers are often
referred to as deep learning [23] to distinguish their deeper and superior
processing abilities.
While there are many techniques to assemble the nodes of a neural network,
the simplest method is the feed-forward network where signals flow only in
one direction and there’s no loop in the network. The most basic form of a
feed-forward neural network is the perceptron , which was devised in the
1950s by Professor Frank Rosenblatt.
Figure 48: Visual representation of a perceptron neural network
Input 1: x1 = 24
Input 2: x2 = 16
We then add a random weight to these two inputs, and they are sent to the
neuron for processing.
Figure 49: Weights are added to the perceptron
Weights
Input 1: 0.5
Input 2: -1
Although the perceptron produces a binary output (0 or 1), there are many
ways to configure the activation function. For this example, we will set the
activation function to ≥ 0. This means that if the sum is a positive number
or equal to zero, then the output is 1. Meanwhile, if the sum is a negative
number, the output is 0.
Figure 50: Activation function where the output (y) is 0 when x is negative, and the output (y) is 1
when x is positive
Thus:
Input 1: 24 * 0.5 = 12
Input 2: 16 * -1.0 = -16
Sum (Σ): 12 + -16 = -4
As a numeric value less than zero, the result produces “0” and does not
trigger the perceptron’s activation function. Given this error, the perceptron
needs to adjust its weights in response.
Updated weights:
Input 1: 24 * 0.5 = 12
Input 2: 16 * -0.5 = -8
Sum (Σ): 12 + -8 = 4
As a positive outcome, the perceptron now produces "1," which triggers the
activation function, and, if part of a larger network, this would trigger the
next layer of analysis.
In this example, the activation function was ≥ 0. We could, though, modify
the activation threshold to follow a different rule, such as:
x > 3, y = 1
x ≤ 3, y = 0
Figure 51: Activation function where the output (y) is 0 when x is equal to or less than 3, and the
output (y) is 1 when x is greater than 3
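The worked example above can be reproduced with a short Python sketch (a minimal illustration, not the book's code), using a weighted sum and a step activation function with a threshold of ≥ 0:
def perceptron(inputs, weights, threshold=0.0):
    # Weighted sum of the inputs
    total = sum(x * w for x, w in zip(inputs, weights))
    # Step activation: output 1 if the sum meets the threshold, otherwise 0
    return (1 if total >= threshold else 0), total

# First attempt: weights 0.5 and -1.0 produce a sum of -4, so the output is 0
print(perceptron([24, 16], [0.5, -1.0]))   # (0, -4.0)

# After adjusting the second weight to -0.5, the sum is 4 and the output is 1
print(perceptron([24, 16], [0.5, -0.5]))   # (1, 4.0)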
Multilayer Perceptrons
The multilayer perceptron (MLP), as with other ANN techniques, is an
algorithm for predicting a categorical (classification) or continuous
(regression) target variable. Multilayer perceptrons are powerful because
they aggregate multiple models into a unified prediction model, as
demonstrated by the classification model shown in Figure 53.
Figure 53: A multilayer perceptron used to classify a social media user’s political preference
In this example, the MLP model is divided into three layers. The input layer
consists of four nodes, each representing an input feature used to predict a
social media user's political preference: Age, City, Education, and Gender.
function is then applied to each input variable to create a new layer of nodes
called the middle or hidden layer. Each node in the hidden layer represents
a function, such as a sigmoid function, but with its own unique
weights/hyperparameters. This means that each input variable, in effect, is
exposed to five different functions. Simultaneously, the hidden layer nodes
are exposed to all four features.
The final output layer for this model consists of two discrete outcomes:
Conservative Party or Democratic Party, which classifies the sample user’s
likely political preference. Note that the number of nodes at each layer will
vary according to the number of input features and the target variable(s).
In general, multilayer perceptrons are ideal for interpreting large and
complex datasets with no time or computational restraints. Less compute-
intensive algorithms, such as decision trees and logistic regression, for
example, are more efficient for working with smaller datasets. Given their
high number of hyperparameters, multilayer perceptrons also demand more
time and effort to tune than other algorithms. In regards to processing time,
a multilayer perceptron takes longer to run than most shallow learning
techniques including logistic regression but is generally faster than SVM.
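A hedged sketch of a multilayer perceptron in Python using Scikit-learn's MLPClassifier; the encoded features, the hidden layer of five nodes, and the tiny dataset below are illustrative assumptions rather than the book's configuration:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Hypothetical encoded features: Age, City, Education, Gender
X = np.array([[25, 1, 2, 0], [52, 0, 3, 1], [33, 1, 1, 0],
              [47, 0, 2, 1], [29, 1, 3, 0], [61, 0, 1, 1]])
y = np.array([0, 1, 0, 1, 0, 1])   # 0 = Party A, 1 = Party B

# Scale inputs, then train an MLP with one hidden layer of five sigmoid (logistic) nodes
X_scaled = StandardScaler().fit_transform(X)
model = MLPClassifier(hidden_layer_sizes=(5,), activation='logistic',
                      max_iter=2000, random_state=0)
model.fit(X_scaled, y)

print(model.predict(X_scaled[:2]))   # predicted class for the first two users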
Deep Learning
For analyzing less complex patterns, a basic multilayer perceptron or an
alternative classification algorithm such as logistic regression and k -nearest
neighbors can be put into practice. However, as patterns in the data become
more complicated—especially in the form of a model with a high number
of inputs such as image pixels—a shallow model is no longer reliable or
capable of sophisticated analysis because the model becomes exponentially
complicated as the number of inputs increases. A neural network, with a
deep number of layers, though, can be used to interpret a high number of
input features and break down complex patterns into simpler patterns, as
shown in Figure 54.
This deep neural network uses edges to detect different physical features to
recognize faces, such as a diagonal line. Like building blocks, the network
combines the node results to classify the input as, say, a human’s face or a
cat’s face and then advances further to recognize individual characteristics.
This is known as deep learning . What makes deep learning “deep” is the
stacking of at least 5-10 node layers.
Object recognition, as used by self-driving cars to recognize objects such as
pedestrians and other vehicles, uses upward of 150 layers and is a popular
application of deep learning. Other applications of deep learning include
time series analysis to analyze data trends measured over set time periods or
intervals, speech recognition, and text processing tasks including sentiment
analysis, topic segmentation, and named entity recognition. More usage
scenarios and commonly paired deep learning techniques are listed in Table
13.
Table 13: Common usage scenarios and paired deep learning techniques
As can be seen from this table, multilayer perceptrons (MLP) have largely
been superseded by new deep learning techniques such as convolution
networks, recurrent networks, deep belief networks, and recursive neural
tensor networks (RNTN). These more advanced versions of a neural
network can be used effectively across a number of practical applications
that are in vogue today. While convolution networks are arguably the most
popular and powerful of deep learning techniques, new methods and
variations are continuously evolving.
CHAPTER QUIZ
Using a multilayer perceptron, your job is to create a model to classify the
gender (sex) of penguins that have been affected and rescued in a natural
disaster. However, you can only use the physical attributes of penguins to
train your model. Please note that this dataset has 244 rows.
DECISION TREES
The idea that artificial neural networks can be used to solve a wider
spectrum of learning tasks than other techniques has led some pundits to
hail ANN as the ultimate machine learning algorithm. While there is a
strong case for this argument, this isn’t to say that ANN fits the bill as a
silver bullet algorithm. In certain cases, neural networks fall short, and
decision trees are held up as a popular counterargument.
The huge amount of input data and computational resources required to
train a neural network is the first downside of any attempt to solve all
machine learning problems using this technique. Neural network-based
applications like Google's image recognition engine rely on millions of
tagged examples to recognize classes of simple objects (such as dogs) and
not every organization has the resources available to feed and power a
model of that size. The other major downside of neural networks is the
black-box dilemma, which conceals the model’s decision structure.
Decision trees, on the other hand, are transparent and easy to interpret. They
work with less data and consume less computational resources. These
benefits make decision trees a popular alternative to deploying a neural
network for less complex use cases.
Decision trees are used primarily for solving classification problems but can
also be used as a regression model to predict numeric outcomes.
Classification trees predict categorical outcomes using numeric and
categorical variables as input, whereas regression trees predict numeric
outcomes using numeric and categorical variables as input. Decision trees
can be applied to a wide range of use cases, from picking a scholarship
recipient and predicting e-commerce sales to selecting the right job
applicant.
Figure 55: Example of a regression tree
Part of the appeal of decision trees is they can be displayed graphically and
they are easy to explain to non-experts. When a customer queries why they
weren’t selected for a home loan, for example, you can share the decision
tree to show the decision-making process, which isn’t possible using a
black-box technique.
In this table we have ten employees, three input variables (Exceeded KPIs,
Leadership Capability, Aged < 30), and one output variable (Outcome). Our
aim is to classify whether an employee will be promoted/not promoted
based on the assessment of the three input variables.
Let’s first split the data by variable 1 (Exceeded Key Performance
Indicators):
- Six promoted employees who exceeded their KPIs (Yes).
- Four employees who did not exceed their KPIs and who were not
promoted (No).
This variable produces two homogenous groups at the next layer.
Of these three variables, variable 1 (Exceeded KPIs) produces the best split
with two perfectly homogenous groups. Variable 3 produces the second-
best outcome, as one leaf is homogenous. Variable 2 produces two leaves
that are heterogeneous. Variable 1 would therefore be selected as the first
binary question to split this dataset.
Whether it’s ID3 or another algorithm, this process of splitting data into
sub-partitions, known as recursive partitioning , is repeated until a stopping
criterion is met. A stopping point can be based on a range of criteria, such
as:
- When all leaves contain less than 3-5 items.
- When a branch produces a result that places all items in one binary
leaf.
Calculating Entropy
In this next section, we will review the mathematical calculations for
finding the variables that produce the lowest entropy.
As mentioned, building a decision tree starts with setting a variable as the
root node, with each outcome for that variable assigned a branch to a new
decision node, i.e. “Yes” and “No.” A second variable is then chosen to split
the variables further to create new branches and decision nodes.
As we want the nodes to collect as many instances of the same class as
possible, we need to select each variable strategically based on entropy, also
called information value . Measured in units called bits (using a base 2
logarithm expression), entropy is calculated based on the composition of
data points found in each node.
Using the following logarithm equation, we will calculate the entropy for
each potential variable split, expressed in bits between 0 and 1:
Entropy = -p1 log2(p1) - p2 log2(p2)
where p1 and p2 are the proportions of data points belonging to each class
(promoted and not promoted) inside a node. Please note that these
logarithm calculations can be performed quickly online using Google
Calculator.
Variable 1 (Exceeded KPIs)
Step 1: Calculate the entropy of each node.
Yes: p1 [6,6] and p2 [0,6], giving an entropy of 0
No: p1 [4,4] and p2 [0,4], giving an entropy of 0
Step 2: Multiply the entropy of each node by its share of the total number of data points (10)
(6/10) x 0 + (4/10) x 0 = 0
Variable 2 (Leadership Capability)
Step 1: Calculate the entropy of each node.
Yes: p1 [2,4] and p2 [2,4], giving an entropy of 1
No: p1 [4,6] and p2 [2,6], giving an entropy of 0.918
Step 2: Multiply the entropy of each node by its share of the total number of data points (10)
(4/10) x 1 + (6/10) x 0.918
0.4 + 0.5508 = 0.9508
Variable 3 (Aged < 30)
Step 1: Calculate the entropy of each node (0.985 for the node containing seven data points and 0 for the node containing three data points).
Step 2: Multiply the entropy of each node by its share of the total number of data points (10)
(7/10) x 0.985 + (3/10) x 0
0.6895 + 0 = 0.6895
Results
Exceeded KPIs = 0 bits
Leadership Capability = 0.9508 bits
Aged < 30 = 0.6895 bits
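These results can be double-checked with a short Python sketch (an illustration, not code from the book) that computes the weighted entropy of each candidate split. The class counts of [4, 3] and [3, 0] used for the Aged < 30 split are assumptions chosen to be consistent with the entropies quoted above.
import math

def entropy(counts):
    # Entropy in bits for one node, given the count of data points in each class
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def weighted_entropy(nodes):
    # Weight each node's entropy by its share of the total number of data points
    total = sum(sum(node) for node in nodes)
    return sum((sum(node) / total) * entropy(node) for node in nodes)

print(weighted_entropy([[6, 0], [0, 4]]))            # Exceeded KPIs: 0.0
print(round(weighted_entropy([[2, 2], [4, 2]]), 4))  # Leadership: ~0.951 (0.9508 above, which rounds 0.918 first)
print(round(weighted_entropy([[4, 3], [3, 0]]), 4))  # Aged < 30: ~0.6897 (0.6895 above, which rounds 0.985 first)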
Overfitting
A notable caveat of decision trees is their susceptibility to overfit the model
to the training data. Based on the patterns extracted from the training data, a
decision tree is precise at analyzing and decoding the first round of data.
However, the same decision tree may then fail to classify the test data, as
there could be rules that it’s yet to encounter or because the training/test
data split was not representative of the full dataset. Also, because decision
trees are formed by repeatedly splitting data points into partitions, a slight
change to how the data is split at the top or middle of the tree could
dramatically alter the final prediction and produce a different tree
altogether. The offender, in this case, is our greedy algorithm.
Starting with the first split of the data, the greedy algorithm picks a variable
that best partitions the data into homogenous groups. Like a kid seated in
front of a box of cupcakes, the greedy algorithm is oblivious to the future
repercussions of its short-term actions. The variable used to first split the
data does not guarantee the most accurate model at the end of production.
Instead, a less effective split at the top of the tree might produce a more
accurate model. Thus, although decision trees are highly visual and
effective at classifying a single set of data, they are also inflexible and
vulnerable to overfitting, especially for datasets with high pattern variance.
Bagging
Rather than aiming for the most efficient split at each round of recursive
partitioning, an alternative technique is to construct multiple trees and
combine their predictions. A popular example of this technique is bagging,
which involves growing multiple decision trees using a randomized
selection of input data for each tree and combining the results by averaging
the output (for regression) or voting (for classification).
A key characteristic of bagging is bootstrap sampling . For multiple
decision trees to generate unique insight, there needs to be an element of
variation and randomness across each model. There’s little sense in
compiling five or ten identical models. Bootstrap sampling overcomes this
problem by extracting a random variation of the data at each round, and in
the case of bagging, different variations of the training data are run through
each tree. While this doesn’t eliminate the problem of overfitting, the
dominant patterns in the dataset will appear in a higher number of trees and
emerge in the final class or prediction. As a result, bagging is an effective
algorithm for dealing with outliers and lowering the degree of variance
typically found with a single decision tree.
Random Forests
A closely related technique to bagging is random forests . While both
techniques grow multiple trees and utilize bootstrap sampling to randomize
the data, random forests artificially limit the choice of variables by capping
the number of variables considered for each split. In other words, the
algorithm is not allowed to consider all n variables at each partition.
In the case of bagging, the trees often look similar because they use the
same variable early in their decision structure in a bid to reduce entropy.
This means the trees’ predictions are highly correlated and closer to a single
decision tree in regards to overall variance. Random forests sidestep this
problem by forcing each split to consider a limited subset of variables,
which gives other variables a greater chance of selection, and by averaging
unique and uncorrelated trees, the final decision structure is less variable
and often more reliable. As the model is trained using a subset of variables
fewer than those actually available, random forests are considered a
weakly-supervised learning technique.
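A minimal sketch of training a random forest with Scikit-learn (illustrative only, on a hypothetical dataset); the max_features argument is what caps the number of variables considered at each split:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 6 features, binary target
rng = np.random.RandomState(0)
X = rng.rand(100, 6)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0)

# 150 bootstrapped trees; each split considers only a random subset of the features
model = RandomForestClassifier(n_estimators=150, max_features='sqrt', random_state=0)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))   # mean accuracy on the test data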
Boosting
Boosting also mitigates the issue of overfitting, and it does so using fewer
trees than random forests. While adding more trees to a random forest
usually helps to offset overfitting, the same process can cause overfitting in
the case of boosting, and caution should be taken as new trees are added.
The tendency of boosting algorithms towards overfitting can be explained
by their highly-tuned focus of learning and reiterating from earlier mistakes.
Although this typically translates to more accurate predictions—superior to
that of most algorithms—it can lead to mixed results in the case of data
stretched by a high number of outliers. In general, machine learning models
should not fit too close to outlier cases, but this can be difficult for boosting
algorithms to obey as they are constantly reacting to errors observed and
isolated during production. For complex datasets with a large number of
outliers, random forests may be a preferred alternative approach to
boosting.
The other main downside of boosting is the slow processing speed that
comes with training a sequential decision model. As trees are trained
sequentially, each tree must wait for the previous tree, thereby limiting the
production scalability of the model, especially as more trees are added. A
random forest, meanwhile, is trained in parallel, making it faster to train.
The final downside, which applies to boosting as well as random forests and
bagging, is the loss of visual simplicity and ease of interpretation that
comes with using a single decision tree. When you have hundreds of
decision trees it becomes more difficult to visualize and interpret the overall
decision structure.
If, however, you have the time and resources to train a boosting model and
a dataset with consistent patterns, the final model can be extremely
worthwhile. Once deployed, predictions from the trained decision model
can be generated quickly and accurately using this algorithm, and outside of
deep learning, boosting is one of the most popular algorithms in machine
learning today.
CHAPTER QUIZ
Your task is to predict the body mass (body_mass_g) of penguins using the
penguin dataset and the random forests algorithm.
2) False
(Gradient boosting runs sequentially, making it slower to train. A random
forest is trained simultaneously, making it faster to train.)
ENSEMBLE MODELING
When making important decisions, we generally prefer to collate multiple
opinions as opposed to listening to a single perspective or the first person to
voice their opinion. Similarly, it’s important to consider and trial more than
one algorithm to find the best model for your data. In advanced machine
learning, it can even be advantageous to combine algorithms or models
using a method called ensemble modeling , which amalgamates outputs to
build a unified prediction model. By combining the output of different
models (instead of relying on a single estimate), ensemble modeling helps
to build a consensus on the meaning of the data. Aggregated estimates are
also generally more accurate than any one technique. It’s vital, though, for
the ensemble models to display some degree of variation to avoid
mishandling the same errors.
In the case of classification, multiple models are consolidated into a single
prediction using a voting system [25] based on frequency, or by numeric
averaging in the case of regression problems. [26], [27] Ensemble models can
also be divided into sequential or parallel and homogenous or
heterogeneous.
Let’s start by looking at sequential and parallel models. In the case of the
former, the model’s prediction error is reduced by adding weights to
classifiers that previously misclassified data. Gradient boosting and
AdaBoost (designed for classification problems) are both examples of
sequential models. Conversely, parallel ensemble models work concurrently
and reduce error by averaging. Random forests are an example of this
technique.
Ensemble models can be generated using a single technique with numerous
variations, known as a homogeneous ensemble, or through different
techniques, known as a heterogeneous ensemble. An example of a
homogeneous ensemble model would be multiple decision trees working
together to form a single prediction (i.e. bagging). Meanwhile, an example
of a heterogeneous ensemble would be the usage of k -means clustering or a
neural network in collaboration with a decision tree algorithm.
Naturally, it’s important to select techniques that complement each other.
Neural networks, for instance, require complete data for analysis, whereas
decision trees are competent at handling missing values. [28] Together, these
two techniques provide added benefit over a homogeneous model. The
neural network accurately predicts the majority of instances where a value
is provided, and the decision tree ensures that there are no “null” results that
would otherwise materialize from missing values using a neural network.
While an ensemble model outperforms a single algorithm in the majority of
cases, [29] the added model complexity and sophistication can pose a
potential drawback. An ensemble model triggers the same trade-off in
benefits as a single decision tree versus a collection of trees, where the
transparency and ease of interpretation of, say, decision trees, is sacrificed
for the accuracy of a more complex algorithm such as random forests,
bagging, or boosting. The performance of the model will win out in most
cases, but interpretability is an important factor to consider when choosing
the right algorithm(s) for your data.
In terms of selecting a suitable ensemble modeling technique, there are four
main methods: bagging, boosting, a bucket of models, and stacking.
As a heterogeneous ensemble technique, a bucket of models trains multiple
different algorithmic models using the same training data and then picks the
one that performed most accurately on the test data.
Bagging , as we know, is an example of parallel model averaging using a
homogenous ensemble, which draws upon randomly drawn data and
combines predictions to design a unified model.
Boosting is a popular alternative technique that is still a homogenous
ensemble but addresses error and data misclassified by the previous
iteration to produce a sequential model. Gradient boosting and AdaBoost
are both examples of boosting algorithms.
Stacking runs multiple models simultaneously on the data and combines
those results to produce a final model. Unlike boosting and bagging,
stacking usually combines outputs from different algorithms (heterogenous)
rather than altering the hyperparameters of the same algorithm
(homogenous). Also, rather than assigning equal trust to each model using
averaging or voting, stacking attempts to identify and add emphasis to well-
performing models. This is achieved by smoothing out the error rate of
models at the base level (known as level-0) using a weighting system,
before pushing those outputs to the level-1 model where they are combined
and consolidated into a final prediction.
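A hedged sketch of stacking in Python using Scikit-learn's StackingRegressor; the choice of level-0 models and the level-1 (final) estimator here is an illustrative assumption, not a recommendation from this book:
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical regression data
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(0, 0.1, 200)

# Level-0 models produce predictions that the level-1 model learns to weight and combine
level_0 = [
    ('tree', DecisionTreeRegressor(max_depth=4, random_state=0)),
    ('forest', RandomForestRegressor(n_estimators=50, random_state=0)),
]
model = StackingRegressor(estimators=level_0, final_estimator=LinearRegression())
model.fit(X, y)

print(model.predict(X[:3]))   # predictions for the first three samples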
DEVELOPMENT ENVIRONMENT
After examining the statistical underpinnings of numerous algorithms, it’s
time to turn our attention to the coding component of machine learning and
installing a development environment.
Although there are various options in regards to programming languages (as
outlined in Chapter 4), Python has been chosen for this three-part exercise
as it’s easy to learn and widely used in industry and online learning courses.
If you don't have any experience in programming or coding with Python,
there’s no need to worry. Feel free to skip the code and focus on the text
explanations to understand the steps involved. A primer on programming
with Python is also included in the Appendix section of this book.
As for our development environment, we will be installing Jupyter
Notebook, which is an open-source web application that allows for the
editing and sharing of code notebooks. Jupyter Notebook can be installed
using the Anaconda Distribution or Python’s package manager, pip. As an
experienced Python user, you may wish to install Jupyter Notebook via pip,
and there are instructions available on the Jupyter Notebook website
(http://jupyter.org/install.html) outlining this option. For beginners, I
recommend choosing the Anaconda Distribution option, which offers an
easy click-and-drag setup
(https://www.anaconda.com/products/individual/).
This installation option will direct you to the Anaconda website. From
there, you can select an Anaconda installer for Windows, macOS, or Linux.
Again, you can find instructions available on the Anaconda website as per
your choice of operating system.
After installing Anaconda to your machine, you’ll have access to a range of
data science applications including RStudio, Jupyter Notebook, and graphviz
for data visualization. For this exercise, select Jupyter Notebook by clicking
on “Launch” inside the Jupyter Notebook tab.
Figure 60: The Anaconda Navigator portal
Import Libraries
The first step of any machine learning project in Python is installing the
necessary code libraries. These libraries will differ from project to project
based on the composition of your data and what you wish to achieve, i.e.,
data visualization, ensemble modeling, deep learning, etc.
To import Pandas, a popular Python library used in machine learning, enter
the following command:
import pandas as pd
If saved to Desktop on Windows, you would import the .csv file using a
structure similar to this example:
df = pd.read_csv('C:\\Users\\John\\Desktop\\Melbourne_housing_FULL.csv')
The default number of rows displayed using the head() command is five. To
set an alternative number of rows to display, enter the desired number
directly inside the parentheses, as shown below in Figure 66.
df.head(10)
Figure 66: Previewing a dataframe with 10 rows
This now previews a dataframe with ten rows. You’ll also notice that the
total number of rows and columns (10 rows x 21 columns) is listed below
the dataframe on the left-hand side.
In this example, df.iloc[100] is used to find the row indexed at position 100 in
the dataframe, which is a property located in Airport West. Be careful to
note that the first row in a Python dataframe is indexed as 0. Thus, the
Airport West property is technically the 101st property contained in the
dataframe.
Print Columns
The final code snippet I'd like to introduce to you is columns, which is a
convenient way to print the dataset's column titles. This will prove useful
later when configuring which features to select, modify, or remove from
the model.
df.columns
Figure 68: Print columns
Again, “Run” the code to view the outcome, which in this case is the 21
column titles and their data type (dtype), which is ‘object.’ You may notice
that some of the column titles are misspelled. We’ll discuss this issue in the
next chapter.
17
1) Import Libraries
To build our model, we first need to import Pandas and a number of
functions from Scikit-learn, including gradient boosting (ensemble) and
mean absolute error to evaluate performance.
Import each of the following libraries by entering these exact commands in
Jupyter Notebook:
#Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
2) Import Dataset
Use the pd.read_csv command to load the Melbourne Housing Market dataset
(as we did in the previous chapter) into a Pandas dataframe.
df = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
Please also note that the property values in this dataset are expressed in
Australian Dollars—$1 AUD is approximately $0.77 USD (as of 2017).
3) Scrub Dataset
This next stage involves scrubbing the dataset. Remember, scrubbing is the
process of refining your dataset such as modifying or removing incomplete,
irrelevant or duplicated data. It may also entail converting text-based data to
numeric values and the redesigning of features.
It’s worthwhile to note that some aspects of data scrubbing may take place
prior to importing the dataset into the development environment. For
instance, the creator of the Melbourne Housing Market dataset misspelled
“Longitude” and “Latitude” in the head columns. As we will not be
examining these two variables in our model, there’s no need to make any
changes. If, however, we did choose to include these two variables in our
model, it would be prudent to amend this error in the source file.
From a programming perspective, spelling mistakes contained in the
column titles don’t pose a problem as long as we apply the same spelling to
perform our code commands. However, this misnaming of columns could
lead to human errors, especially if you are sharing your code with other
team members. To avoid confusion, it’s best to fix spelling mistakes and
other simple errors in the source file before importing the dataset into
Jupyter Notebook or another development environment. You can do this by
opening the CSV file in Microsoft Excel (or equivalent program), editing
the dataset, and then resaving it again as a CSV file.
While simple errors can be corrected in the source file, major structural
changes to the dataset such as removing variables or missing values are best
performed in the development environment for added flexibility and to
preserve the original dataset for future use. Manipulating the composition of
the dataset in the development environment is less permanent and is
generally easier and quicker to implement than doing so in the source file.
Scrubbing Process
Let’s remove columns we don’t wish to include in the model using the
delete command and entering the vector (column) titles we wish to remove.
# The misspellings of “longitude” and “latitude” are preserved here
del df['Address']
del df['Method']
del df['SellerG']
del df['Date']
del df['Postcode']
del df['Lattitude']
del df['Longtitude']
del df['Regionname']
del df['Propertycount']
Keep in mind too that it’s important to drop rows with missing values after
applying the delete command to remove columns (as shown in the previous
step). This way, there’s a better chance of preserving more rows from the
original dataset. Imagine dropping a whole row because it was missing the
value for a variable that would later be deleted such as a missing post code!
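A minimal sketch of that drop step using Pandas (the exact arguments here are an assumption):
# Remove any remaining rows that contain missing values
df.dropna(axis=0, how='any', inplace=True)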
Next, let's convert columns that contain non-numeric data to numeric
values using one-hot encoding. With Pandas, one-hot encoding can be
performed using the pd.get_dummies method.
df = pd.get_dummies(df, columns = ['Suburb', 'CouncilArea', 'Type'])
This code command converts column values for Suburb, CouncilArea, and
Type into numeric values through the application of one-hot encoding.
Lastly, assign the dependent and independent variables, with Price as y and
the remaining 11 variables as X (with Price dropped from the dataframe
using the drop method).
X = df.drop('Price',axis=1)
y = df['Price']
4) Split the Dataset
We are now at the stage of splitting the data into training and test segments.
For this exercise, we'll proceed with a standard 70/30 split by calling the
Scikit-learn command below with a test_size of "0.3" and shuffling the
dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True)
The first line is the algorithm itself (gradient boosting) and comprises just
one line of code. The code below dictates the hyperparameters that
accompany this algorithm.
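For reference, here is a reconstruction of that setup based on the values described in this chapter (150 trees, a learning rate of 0.1, a maximum depth of 30, and the huber loss); the values for min_samples_split, min_samples_leaf, and max_features are assumptions matched to the optimized listing at the end of this chapter:
# Set up algorithm (initial configuration, which we will later find overfits the training data)
model = ensemble.GradientBoostingRegressor(
    n_estimators = 150,
    learning_rate = 0.1,
    max_depth = 30,
    min_samples_split = 4,   # assumed
    min_samples_leaf = 6,    # assumed
    max_features = 0.6,      # assumed
    loss = 'huber'
)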
n_estimators states the number of decision trees. Recall that a high number
of trees generally improves accuracy (up to a certain point) but will
inevitably extend the model’s processing time. I have selected 150 decision
trees as an initial starting point.
learning_rate controls the rate at which additional decision trees influence
the overall prediction. This effectively shrinks the contribution of each tree
by the set learning_rate. Inserting a low rate here, such as 0.1, should help to
improve accuracy.
max_depth defines the maximum number of layers (depth) for each
decision tree. If "None" is selected, then nodes expand until all leaves are
pure or until all leaves contain fewer than min_samples_leaf samples. Here, I
have chosen a high maximum number of layers (30), which will have a
dramatic effect on the final output, as we'll soon see.
min_samples_split defines the minimum number of samples required to
execute a new binary split. For example, min_samples_split = 10 means there
must be ten available samples in order to create a new branch.
min_samples_leaf represents the minimum number of samples that must
appear in each child node (leaf) before a new branch can be implemented.
This helps to mitigate the impact of outliers and anomalies in the form of a
low number of samples found in one leaf as a result of a binary split. For
example, min_samples_leaf = 4 requires there to be at least four available
samples within each leaf for a new branch to be created.
max_features is the total number of features presented to the model when
determining the best split. As mentioned in Chapter 14, random forests and
gradient boosting restrict the number of features fed to each individual tree
to create multiple results that can be voted upon later.
If max_features is an integer (whole number), the model considers that many
features at each split (branch). If the value is a float (e.g., 0.6), then
max_features is the percentage of total features randomly selected for each
split. Although it sets a maximum number of features to consider when
identifying the best split, the search may examine more features than the set
limit if a valid split cannot initially be found.
loss calculates the model's error rate. For this exercise, we are using huber,
which protects against outliers and anomalies. Alternative error rate options
include ls (least squares regression), lad (least absolute deviations), and
quantile (quantile regression). Huber is actually a combination of least
squares regression and least absolute deviations.
To learn more about gradient boosting hyperparameters, please refer to the
Scikit-learn documentation for this algorithm. [31]
After setting the model's hyperparameters, we'll use the fit() function from
Scikit-learn to link the training data to the learning algorithm stored in the
variable model and train the prediction model.
model.fit(X_train, y_train)
Here, we input our y_train values, which represent the correct results from
the training dataset. The predict() function is called on the X_train set and
generates predictions. The mean_absolute_error function then compares the
difference between the actual values and the model's predictions. The
second line of the code then prints the results to two decimal places
alongside the string (text) "Training Set Mean Absolute Error:". The same
process is also repeated using the test data.
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print ("Test Set Mean Absolute Error: %.2f" % mae_test)
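For completeness, the corresponding training-set lines would look like the following sketch, mirroring the description above (the variable name mae_train is assumed):
mae_train = mean_absolute_error(y_train, model.predict(X_train))
print ("Training Set Mean Absolute Error: %.2f" % mae_train)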
Let’s now run the entire model by right-clicking and selecting “Run” or
navigating from the Jupyter Notebook menu: Cell > Run All.
Wait 30 seconds or longer for the computer to process the training model.
The results, as shown below, will then appear at the bottom of the notebook.
For this model, our training set’s mean absolute error is $27,834.12, and the
test set’s mean absolute error is $168,262.14. This means that on average,
the training set miscalculated the actual property value by $27,834.12. The
test set, meanwhile, miscalculated the property value by $168,262.14 on
average.
This means that our training model was accurate at predicting the actual
value of properties contained in the training data. While $27,834.12 may
seem like a lot of money, this average error value is low given the
maximum range of our dataset is $8 million. As many of the properties in
the dataset are in excess of seven figures ($1,000,000+), $27,834.12
constitutes a reasonably low error rate.
How did the model fare with the test data? The test data provided less
accurate predictions with an average error rate of $168,262.14. A high
discrepancy between the training and test data is usually an indicator of
overfitting in the model. As our model is tailored to patterns in the training
data, it stumbled when making predictions using the test data, which
probably contains new patterns that the model hasn’t seen. The test data, of
course, is likely to carry slightly different patterns and new potential
outliers and anomalies.
However, in this case, the difference between the training and test data is
exacerbated because we configured our model to overfit the training data.
An example of this issue was setting max_depth to "30." Although placing a
high maximum depth improves the chances of the model finding patterns in
the training data, it does tend to lead to overfitting.
Lastly, please take into account that because the training and test data are
shuffled randomly, and data is fed to decision trees at random, the predicted
results will differ slightly when replicating this model on your own
machine.
MODEL OPTIMIZATION
In the previous chapter we built our first supervised learning model. We
now want to improve its prediction accuracy with future data and reduce the
effects of overfitting. A good starting point is to modify the model’s
hyperparameters. Holding the other hyperparameters constant, let’s begin
by adjusting the maximum depth from “30” to “5.” The model now
generates the following results:
Although the mean absolute error of the training set is now higher, this
helps to reduce the issue of overfitting and should improve the model’s
performance. Another step to optimize the model is to add more trees. If we
set n_estimators to 250, we now see these results from the model:
This second optimization reduces the training set's mean absolute error by
approximately $11,000, and there is a smaller gap between the training and
test results for mean absolute error. [32]
Together, these two optimizations underline the importance of
understanding the impact of individual hyperparameters. If you decide to
replicate this supervised machine learning model at home, I recommend
that you test modifying each of the hyperparameters individually and
analyze their impact on mean absolute error using the training data. In
addition, you’ll notice changes in the machine’s processing time based on
the chosen hyperparameters. Changing the maximum number of branch
layers (max_depth), for example, from "30" to "5" will dramatically reduce
total processing time. Processing speed and resources will become an
important consideration when you move on to working with larger datasets.
Another important optimization technique is feature selection. Earlier, we
removed nine features from the dataset but now might be a good time to
reconsider those features and test whether they have an impact on the
model’s prediction accuracy. “SellerG” would be an interesting feature to
add to the model because the real estate company selling the property might
have some impact on the final selling price.
Alternatively, dropping features from the current model may reduce
processing time without having a significant impact on accuracy—or may
even improve accuracy. When selecting features, it’s best to isolate feature
modifications and analyze the results, rather than applying various changes
at once.
While manual trial and error can be a useful technique to understand the
impact of variable selection and hyperparameters, there are also automated
techniques for model optimization, such as grid search. Grid search allows
you to list a range of configurations you wish to test for each
hyperparameter and then methodically tests each possible combination of
those hyperparameters. Each combination is evaluated, and the
best-performing configuration is selected as the optimal model. As the
model must examine each possible combination of hyperparameters, grid
search does take a long time to run!
[33]
It sometimes helps to run a relatively coarse grid search using
consecutive powers of 10 (i.e. 0.01, 0.1, 1, 10) and then run a finer grid
search around the best value identified. [34] Example code for grid search
using Scikit-learn is included at the end of this chapter.
Another way of optimizing algorithm hyperparameters is the randomized
search method using Scikit-learn’s RandomizedSearchCV. This method
trials far more hyperparameters per round than grid search (which only
changes one single hyperparameter per round) as it uses a random value for
each hyperparameter at each round. Randomized search also makes it
simple to specify the number of trial rounds and control computing
resources. Grid search, meanwhile, runs based on the full number of
hyperparameter combinations, which isn’t obvious from looking at the code
and might take more time than expected.
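A hedged sketch of randomized search with Scikit-learn; the hyperparameter ranges, n_iter value, and scoring metric below are illustrative assumptions, and model, X_train, and y_train are taken from the earlier exercise:
from sklearn.model_selection import RandomizedSearchCV

# Candidate ranges; a random combination is drawn for each trial round
param_distributions = {
    'n_estimators': [150, 200, 250, 300],
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_features': [0.6, 0.8, 1.0],
}

# n_iter fixes the number of trial rounds, keeping computing resources predictable
random_search = RandomizedSearchCV(model, param_distributions, n_iter=10,
                                   scoring='neg_mean_absolute_error', random_state=0)
random_search.fit(X_train, y_train)
print(random_search.best_params_)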
Finally, if you wish to use a different supervised machine learning
algorithm and not gradient boosting, the majority of the code used in this
exercise can be reused. For instance, the same code can be used to import a
new dataset, preview the dataframe, remove features (columns), remove
rows, split and shuffle the dataset, and evaluate mean absolute error. The
official website http://scikit-learn.org is also a great resource to learn more
about other algorithms as well as gradient boosting used in this exercise.
To learn how to input and test an individual house valuation using the
model we have built in these two chapters, please see this more advanced
tutorial available on the Scatterplot Press website:
http://www.scatterplotpress.com/blog/bonus-chapter-valuing-individual-property/ . In addition,
if you have trouble implementing the model using the code found in this
book, please contact the author by email for assistance
([email protected]).
Code for the Optimized Model
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
# Set up algorithm
model = ensemble.GradientBoostingRegressor(
n_estimators = 250,
learning_rate = 0.1,
max_depth = 5,
min_samples_split = 4,
min_samples_leaf = 6,
max_features = 0.6,
loss = 'huber'
)
Code for Grid Search
# Input algorithm with default hyperparameters (grid search will supply the values to test)
model = ensemble.GradientBoostingRegressor()
# Set the configurations that you wish to test. To minimize processing time,
# limit the number of variables or experiment on each hyperparameter separately.
hyperparameters = {
'n_estimators': [200, 300],
'max_depth': [4, 6],
'min_samples_split': [3, 4],
'min_samples_leaf': [5, 6],
'learning_rate': [0.01, 0.02],
'max_features': [0.8, 0.9],
'loss': ['ls', 'lad', 'huber']
}
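A sketch of the remaining grid search steps (the scoring metric and n_jobs value are assumptions, and X_train and y_train come from the earlier exercise); note that recent versions of Scikit-learn rename the 'ls' and 'lad' loss options to 'squared_error' and 'absolute_error':
from sklearn.model_selection import GridSearchCV

# Run grid search over every combination of the hyperparameters above
grid = GridSearchCV(model, hyperparameters, scoring='neg_mean_absolute_error', n_jobs=4)
grid.fit(X_train, y_train)

# Inspect the best-performing combination
print(grid.best_params_)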
Other Resources
To further your study of machine learning, I strongly recommend enrolling
in the free Andrew Ng Machine Learning course on Coursera and also
checking out OCDevel's podcast series, Machine Learning Guide, which is
the best put-together audio resource available for beginners.
Also, if you enjoyed the pace of this introduction to machine learning, you
may also like to read the next two books in the series, Machine Learning
with Python for Beginners and Machine Learning: Make Your Own
Recommender System . These two books build on the knowledge you’ve
gained here and aim to extend your knowledge of machine learning with
practical coding exercises in Python.
THANK YOU
Thank you for purchasing this book. You now have a baseline
understanding of the key concepts in machine learning and are ready to
tackle this challenging subject in earnest. This includes learning the vital
programming component of machine learning.
If you have any direct feedback, both positive and negative, or suggestions
to improve this book, please feel free to send me an email at
[email protected]. This feedback is highly valued, and I
look forward to hearing from you. Please also note that under Amazon’s
Matchbook program, you can add the Kindle version of this book (valued at
$3.99 USD) to your Amazon Kindle library free of charge.
Finally, I would like to express my gratitude to my colleagues Jeremy
Pedersen and Rui Xiong for their assistance in kindly sharing practical tips
and sections of code used in this book as well as my two editors Chris Dino
(Red to Black Editing) and again Jeremy Pedersen.
BUG BOUNTY
We offer a financial reward to readers for locating errors or bugs in this
book. Some apparent errors could be mistakes made in interpreting a
diagram or following along with the code in the book, so we invite all
readers to contact the author first for clarification and a possible reward,
before posting a one-star review! Just send an email to
[email protected] explaining the error or mistake you
encountered.
This way, we can supply further explanations and examples over email
to clarify your understanding, and in cases where you’re right and we’re
wrong, we will offer a monetary reward through PayPal or an Amazon gift card.
You can make a tidy profit from your feedback, and we can update
the book to improve the standard of content for future readers.
FURTHER RESOURCES
This section lists relevant learning materials for readers who wish to
progress further in the field of machine learning. Please note that certain
details listed in this section, including prices, may be subject to change in
the future.
| Machine Learning |
Machine Learning
Format: Free Coursera course
Presenter: Andrew Ng
Suggested Audience: Beginners (especially those with a preference for
MATLAB)
A free and expert introduction from Adjunct Professor Andrew Ng, one of
the most influential figures in this field. This course is a virtual rite of
passage for anyone interested in machine learning.
| Basic Algorithms |
Machine Learning With Random Forests And Decision Trees: A Visual
Guide For Beginners
Format: E-book
Author: Scott Hartshorn
Suggested Audience: Established beginners
A short, affordable ($3.20 USD), and engaging read on decision trees and
random forests with detailed visual examples, useful practical tips, and
clear instructions.
| The Future of AI |
The Inevitable: Understanding the 12 Technological Forces That Will
Shape Our Future
Format: E-Book, Book, Audiobook
Author: Kevin Kelly
Suggested Audience: All (with an interest in the future)
A well-researched look into the future with a major focus on AI and
machine learning by New York Times best-selling author Kevin Kelly. It
provides a guide to the twelve technological imperatives that will shape the
next thirty years.
| Programming |
Learning Python, 5th Edition
Format: E-Book, Book
Author: Mark Lutz
Suggested Audience: All (with an interest in learning Python)
A comprehensive introduction to Python published by O’Reilly Media.
| Recommender Systems |
The Netflix Prize and Production Machine Learning Systems: An
Insider Look
Format: Blog
Author: Mathworks
Suggested Audience: All
A very interesting blog post demonstrating how Netflix applies machine
learning to formulate movie recommendations.
Recommender Systems
Format: Coursera course
Presenter: The University of Minnesota
Cost: Free 7-day trial or included with $49 USD Coursera subscription
Suggested Audience: All
Taught by the University of Minnesota, this Coursera specialization covers
fundamental recommender system techniques including content-based and
collaborative filtering as well as non-personalized and product-association
recommender systems.
| Deep Learning |
Deep Learning Simplified
Format: Blog
Channel: DeepLearning.TV
Suggested Audience: All
A short video series to get you up to speed with deep learning. Available for
free on YouTube.
| Future Careers |
Will a Robot Take My Job?
Format: Online article
Author: The BBC
Suggested Audience: All
Check how safe your job is in the AI era leading up to the year 2035.
Comments
Adding comments is good practice in computer programming to signpost
the purpose and content of your code. In Python, comments can be added to
your code using the # (hash) character. Everything placed after the hash
character (on that line of code) is then ignored by the Python interpreter.
Example:
# Import Melbourne Housing dataset from my Downloads folder
dataframe = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
In this example, the second line of code will be executed, while the first line of code will be ignored
by the Python interpreter.
Arithmetic in Python
Commonly used arithmetical operators in Python are displayed in Table 18.
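As a quick illustration of the standard operators summarized in that table, the snippet below applies each of them to two sample values (the numbers themselves are arbitrary):
# Common arithmetic operators in Python
a = 9
b = 4
print(a + b)   # addition: 13
print(a - b)   # subtraction: 5
print(a * b)   # multiplication: 36
print(a / b)   # division: 2.25
print(a // b)  # floor division: 2
print(a % b)   # modulus (remainder): 1
print(a ** b)  # exponent: 6561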
Variable Assignment
In computer programming, the role of a variable is to store a data value in
the computer's memory for later use. This enables the stored value to be
referenced and manipulated later by calling that variable name. You can
select any name for the variable provided it fits with the following rules:
It contains only alpha-numeric characters and underscores (A-Z,
0-9, _ )
It starts with a letter or underscore and not a number
It does not match a reserved Python keyword such as “return”
Python, though, does not support blank spaces in variable names, and an
underscore must be used to bridge the words, as shown below.
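As a brief sketch of these naming rules (the variable names below are hypothetical), the first two assignments are valid, while the commented-out lines would raise a SyntaxError if run:
# Valid: letters, numbers, and underscores, starting with a letter or underscore
house_price = 350000
_row_count2 = 18

# Invalid: starts with a number / contains a blank space (each raises a SyntaxError)
# 2nd_value = 10
# house price = 350000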
Example:
my_dataset = 8
The stored value (8) can now be referenced by calling the variable name
my_dataset. Variables also have a “variable” nature, in that we can reassign
the variable to a different value, such as:
Example:
my_dataset = 8 + 8
The value of my_dataset is now 16.
It’s important to note that the equals operator in Python does not serve the
same function as equals in mathematics. In Python, the equals operator
assigns a value to a variable; it does not assert mathematical equality. If you
wish to evaluate a mathematical expression in Python, you can simply run
the expression without adding an equals operator.
Example:
2+2
Python will return 4 in this case.
Importing Libraries
From web scraping to gaming applications, the possibilities of Python are
dazzling, but coding everything from scratch is a difficult and time-
consuming process. This is where libraries, which are collections of pre-
written code and standardized routines, come into play. Rather than write
scores of lines of code to plot a simple graph or scrape content from the
web, you can use one line of code from a given library to execute a highly
advanced function.
There is an extensive supply of free libraries available for web scraping,
data visualization, data science, etc., and the most common libraries for
machine learning are Scikit-learn, Pandas, and NumPy. The NumPy and
Pandas libraries can be imported in one line of code, whereas for Scikit-
learn, you’ll need to specify individual algorithms or functions over
multiple lines of code.
Example:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
Using the code above, you can call commands from NumPy, Pandas, and
Scikit-learn's NearestNeighbors by calling np, pd, and NearestNeighbors in
any section of your code below. You can find the import
command for other Scikit-learn algorithms and different code libraries by
referencing their documentation online.
Importing a Dataset
CSV datasets can be imported into your Python development environment
as a Pandas dataframe (tabular dataset) from a host file using the Pandas
command pd.read_csv(). Note that the host file name should be enclosed in
single or double quotes inside the parentheses.
You will also need to assign the dataset to a variable using the equals
operator, which will allow you to call the dataset in other sections of your
code. This means that any time you call dataframe, for example, the Python
interpreter recognizes that you are directing the code to the dataset imported
and stored under that variable name.
Example:
dataframe = pd.read_csv('~/Downloads/Melbourne_housing_FULL.csv')
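As a quick follow-up, you can preview the imported dataframe with the Pandas head() command, which displays the first five rows by default (the file path above is carried over from the earlier example):
# Preview the first five rows of the imported dataset
dataframe.head()

# Preview a custom number of rows, e.g. the first ten
dataframe.head(10)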
Indexing
Indexing is a method of selecting a single element from within a data type,
such as a list or string. Each element in a data type is numerically indexed
beginning at 0, and elements can be indexed by calling the index number
inside square brackets.
Example:
my_string = "hello_world"
my_string[1]
Indexing returns the value e in this example.
Example:
my_list = [10, 20, 30, 40]
my_list[0]
Indexing returns the value 10 in this example.
Slicing
Rather than pull a single element from a collection of data, you can use
slicing to grab a customized subsection of elements using a colon (:).
Example:
my_list = [10, 20, 30, 40]
my_list[:3]
Slicing here goes up to but does not include the element at index position 3, thereby returning the
values 10, 20, and 30.
Example:
my_list = [10, 20, 30, 40]
my_list[1:3]
Slicing here starts at index position 1 and goes up to but does not include the element at index
position 3, thereby returning the values 20 and 30 in this example.
OTHER BOOKS BY THE AUTHOR
Machine Learning with Python for Beginners
Progress in ML by learning how to code in Python in order to build your
own prediction models and solve real-life problems.
Machine Learning: Make Your Own Recommender System
Learn how to make your own ML recommender system in an afternoon
using Python.
Data Analytics for Absolute Beginners
Make better decisions using every variable with this deconstructed
introduction to data analytics.
Statistics for Absolute Beginners
Master the fundamentals of inferential and descriptive statistics with a mix
of practical demonstrations, visual examples, historical origins, and plain
English explanations.
SKILLSHARE COURSE
Introduction to Machine Learning Concepts for Absolute Beginners
This class covers the basics of machine learning in video format. After
completing this class, you can push on to more complex video-based
classes available on Skillshare.
[1]
“Will A Robot Take My Job?”, The BBC, accessed December 30, 2017,
http://www.bbc.com/news/technology-34066941
[2]
Nick Bostrom, “Superintelligence: Paths, Dangers, Strategies,” Oxford University Press, 2016.
[3]
Bostrom also quips that two decades is close to the remaining duration of a typical forecaster’s
career.
[4]
Matt Kendall, “Machine Learning Adoption Thwarted by Lack of Skills and Understanding,”
Nearshore Americas, accessed May 14, 2017, http://www.nearshoreamericas.com/machine-learning-
adoption-understanding
[5]
Arthur Samuel, “Some Studies in Machine Learning Using the Game of Checkers,” IBM
Journal of Research and Development, Vol. 3, Issue 3, 1959.
[6]
Bruce Schneier, “Data and Goliath: The Hidden Battles to Collect Your Data and Control Your
World,” W. W. Norton & Company, First Edition, 2016.
[7]
Remco Bouckaert, Eibe Frank, Mark Hall, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann & Ian Witten, “WEKA—Experiences with a Java Open-Source Project,” Journal of
Machine Learning Research, Edition 11,
https://www.cs.waikato.ac.nz/ml/publications/2010/bouckaert10a.pdf
[8]
Data mining was originally known by other names including “database mining” and “information
retrieval.” The discipline became known as “knowledge discovery in databases” and “data mining” in
the 1990s.
[9]
Jiawei Han, Micheline Kamber & Jian Pei, “Data Mining: Concepts and Techniques (The Morgan
Kaufmann Series in Data Management Systems),” Morgan Kaufmann, 3rd Edition, 2011.
[10]
“Unsupervised Machine Learning Engine,” DataVisor, accessed May 19, 2017,
https://www.datavisor.com/unsupervised-machine-learning-engine
[11]
Aside from artificial neural networks, most learning algorithms qualify as shallow.
[12]
Kevin Kelly, “The Inevitable: Understanding the 12 Technological Forces That Will Shape Our
Future,” Penguin Books, 2016.
[13]
“What is Torch?” Torch, accessed April 20, 2017, http://torch.ch
[14]
Pascal Lamblin, “MILA and the future of Theano,” Google Groups Theano Users Forum,
https://groups.google.com/forum/#!topic/theano-users/7Poq8BZutbY
[15]
Standard deviation is a measure of spread among data points. It measures variability by
calculating the average squared distance of all data observations from the mean of the dataset.
[16]
The term owes its name to its origins in the field of radar engineering.
[17]
Although the linear formula is written differently in other disciplines, y = bx + a is the preferred
format used in statistics and machine learning. This formula could also be expressed using the
notation y = β0 + β1x1 + e, where β0 is the intercept, β1 is the slope, and e is the residual or error.
[18]
Brandon Foltz, “Logistic Regression,” YouTube,
https://www.youtube.com/channel/UCFrjdcImgcQVyFbK04MBEhA
[19]
Prratek Ramchandani, “Random Forests and the Bias-Variance Tradeoff,” Towards Data Science,
https://towardsdatascience.com/random-forests-and-the-bias-variance-tradeoff-3b77fee339b4
[20]
Random and/or useless information that obscures the key meaning of the data.
[21]
In Scikit-learn, the default for the C hyperparameter is 1.0 and the strength of the regularization
(the penalty for overfitting) is inversely proportional to C. This means any value below 1.0
effectively adds regularization to the model, and the penalty is squared L2 (L2 is calculated as the
square root of the sum of the squared vector values).
[22]
It’s generally good practice to train the model twice—with and without standardization—and
compare the performance of the two models.
[23]
Geoffrey Hinton et al. published a paper in 2006 on recognizing handwritten digits using a deep
neural network which they named deep learning .
[24]
Scott Hartshorn, “Machine Learning With Random Forests And Decision Trees: A Visual Guide
For Beginners,” Scott Hartshorn, 2016.
[25]
The class that receives the most votes is taken as the final output.
[26]
Generally, the more votes or numeric outputs that are taken into consideration the more accurate
the final prediction.
[27]
The aim of a regression problem is to produce a numeric prediction, such as the price
of a house, rather than to predict a discrete class (classification).
[28]
Decision trees can treat missing values as another variable. For instance, when assessing weather
outlook, the data points can be classified as sunny, overcast, rainy, or missing.
[29]
Ian H. Witten, Eibe Frank, Mark A. Hall, “Data Mining: Practical Machine Learning Tools and
Techniques,” Morgan Kaufmann, Third Edition, 2011.
[30]
“Dropna,” Pandas, https://pandas.pydata.org/pandas-
docs/stable/generated/pandas.DataFrame.dropna.html
[31]
“Gradient Boosting Regressor,” Scikit-learn, http://scikit-
learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
[32]
In machine learning, the test data is used exclusively to assess model performance rather than
optimize the model. As the test data cannot be used to build and optimize the model, data scientists
commonly use a third independent dataset called the validation set . After building an initial model
with the training set, the validation set can be fed into the prediction model and used as feedback to
optimize the model’s hyperparameters. The test set is then used to assess the prediction error of the
final model.
[33]
Most readers of this book report waiting up to 30 minutes for the model to run.
[34]
Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts,
Tools, and Techniques to Build Intelligent Systems,” O’Reilly Media, 2017.
[35]
Mike McGrath, “Python in easy steps: Covers Python 3.7,” In Easy Steps Limited, Second
Edition, 2018.