The Art of Machine Learning: A Hands-On Guide to Machine Learning with R
Ebook · 586 pages · 4 hours

About this ebook

Learn to expertly apply a range of machine learning methods to real data with this practical guide.

Packed with real datasets and practical examples, The Art of Machine Learning will help you develop an intuitive understanding of how and why ML methods work, without the need for advanced math.

As you work through the book, you’ll learn how to implement a range of powerful ML techniques, starting with the k-Nearest Neighbors (k-NN) method and random forests, and moving on to gradient boosting, support vector machines (SVMs), neural networks, and more.

With the aid of real datasets, you’ll delve into regression models through the use of a bike-sharing dataset, explore decision trees by leveraging New York City taxi data, and dissect parametric methods with baseball player stats. You’ll also find expert tips for avoiding common problems, like handling “dirty” or unbalanced data, and for troubleshooting pitfalls.

You’ll also explore:

  • How to deal with large datasets and techniques for dimension reduction
  • Details on how the Bias-Variance Trade-off plays out in specific ML methods
  • Models based on linear relationships, including ridge and LASSO regression
  • Real-world image and text classification and how to handle time series data

Machine learning is an art that requires careful tuning and tweaking. With The Art of Machine Learning as your guide, you’ll master the underlying principles of ML that will empower you to effectively use these models, rather than simply provide a few stock actions with limited practical use.

Requirements: A basic understanding of graphs and charts and familiarity with the R programming language
Language: English
Publisher: No Starch Press
Release date: Jan 9, 2024
ISBN: 9781718502116

Book preview

The Art of Machine Learning - Norman Matloff

INTRODUCTION

Machine learning! With such a science fiction-ish name, one might expect it to be technology that is strictly reserved for highly erudite specialists. Not true.

Actually, machine learning (ML) can easily be explained in commonsense terms, and anyone with a good grasp of charts, graphs, and the slope of a line should be able to both understand and productively use ML. Of course, as the saying goes, The devil is in the details, and one must work one’s way through those details. But ML is not rocket science, in spite of it being such a powerful tool.

0.1 What Is ML?

ML is all about prediction. Does a patient have a certain disease? Will a customer switch from her current cell phone service to another? What is actually being said in this rather garbled audio recording? Is that bright spot observed by a satellite a forest fire or just a reflection?

We predict an outcome from one or more features. In the disease diagnosis example, the outcome is having the disease or not, and the features may be blood tests, family history, and so on.

All ML methods involve a simple idea: similarity. In the cell phone service example, how do we predict the outcome for a certain customer? We look at past customers and select the ones who are most similar in features (size of bill, lateness record, yearly income, and so on) to our current customer. If most of those similar customers bolted, we predict the same for the current one. Of course, we are not guaranteed that outcome, but it is our best guess.

0.2 The Role of Math in ML Theory and Practice

Many ML methods are based on elegant mathematical theory, with support vector machines (SVMs) being a notable example. However, knowledge of this theory has very little use in terms of being able to apply SVM well in actual applications.

To be sure, a good intuitive understanding of how ML methods work is essential to effective use of ML in practice. This book strives to develop in the reader a keen understanding of the intuition, without using advanced mathematics. Indeed, there are very few equations in this book.

0.3 Why Another ML Book?

There are many great ML books out there, of course, but none really empower the reader to use ML effectively in real-world problems. In many cases, the problem is that the books are too theoretical, but I am equally concerned that the applied books tend to be cookbooks (too recipe-oriented) that treat the subject in a Step 1, Step 2, Step 3 manner. Their focus is on the syntax and semantics of ML software, with the result that while the reader may know the software well, the reader is not positioned to use ML well.

I wrote this book because:

  • There is a need for a book that uses the R language but is not about R. This is a book on ML that happens to use R for examples and not a book about the use of R in ML.
  • There is a need for an ML book that recognizes that ML is an art, not a science. (Hence the title of this book.)
  • There is a need for an ML book that avoids advanced math but addresses the point that, in order to use ML effectively, one does need to understand the concepts well—the why and how of ML methods. Most applied ML books do too little in explaining these things.

All three of these bullets go back to the anti-cookbook theme. My goal is, then, this:

I would like those who use ML to not only know the definition of random forests but also be ready to cogently explain how the various hyperparameters in random forests may affect overfitting. MLers also should be able to give a clear account of the problems of p-hacking in feature engineering.

We will empower the reader with strong, practical, real-world knowledge of ML methods—their strengths and weaknesses, what makes them work and fail, what to watch out for. We will do so without much formal math and will definitely take a hands-on approach, using prominent software packages on real datasets. But we will do so in a savvy manner. We will be informed consumers.

0.4 Recurring Special Sections

There are special recurring themes and sections throughout this book:

Bias vs. Variance

Numerous passages explain in concrete terms—no superstition!—how these two central notions play out for each specific ML method.

Pitfalls

Numerous sections with the Pitfall title warn the reader of potential problems and show how to avoid them.

0.5 Background Needed

What kind of background will the reader need to use this book profitably?

No prior exposure to ML or statistics is assumed.

As to math in general, the book is mostly devoid of formal equations. As long as the reader is comfortable with basic graphs, such as histograms and scatterplots, and simple algebra notions, such as the slope of a line, that is quite sufficient.

The book does assume some prior background in R coding, such as familiarity with vectors, factors, data frames, and functions. The R command line (> prompt, Console in RStudio) is used throughout. Readers without a background in R, or those wishing to have a review, may find my fasteR tutorial useful: https://github.com/matloff/fasteR.

Make sure R and the qeML package are installed on your computer. For the package, the preferred installation source is GitHub, as it will always have the most up-to-date version of the package. You’ll need the devtools package; if you don’t already have it, type:

install.packages('devtools')

Then, to install qeML, load devtools and type:

library(devtools)
install_github('https://github.com/matloff/qeML')

The qeML package will also be on the CRAN R code repository but updated less frequently.
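
For reference, installing the CRAN version instead would use R's standard mechanism:

install.packages('qeML')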

0.6 The qe*-Series Software

Most of the software used here will come from popular R packages:

  • e1071
  • gbm
  • glmnet
  • keras
  • randomForest

Readers can use these packages directly if they wish. But in order to keep things simple and convenient for readers, we usually will be using wrappers for the functions in those packages, which are available in my package, qeML. This is a big help in two ways:

  • The wrappers provide a uniform interface.
  • That uniform interface is also simple.

For instance, consider day1, a bike rental dataset used at various points in this book. We wish to predict tot, total ridership. Here’s how we would do that using random forests, an ML topic covered in this book:

qeRF(day1,'tot')

For support vector machines, another major topic, the call would be

qeSVM(day1,'tot')

and so on. Couldn’t be simpler! No preparatory code, say, to define a model; just call one of the qe functions and go! The prefix qe- stands for quick and easy. One can also specify method-specific parameters, which we will do, but still, it will be quite simple.
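
A fitted qe* object can also be used to predict new cases. Here is a minimal sketch, under the assumption that qeML's fitted objects work with R's generic predict() function; newDay is an illustrative one-row data frame of features, not something defined in the book:

# fit a random forest predicting total ridership from all other columns
rfOut <- qeRF(day1,'tot')
# form a "new day" by reusing an existing row's features (every column except 'tot')
newDay <- day1[1, names(day1) != 'tot']
# predicted total ridership for that day
predict(rfOut, newDay)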

For very advanced usage, this book shows how to use those packages directly.

0.7 The Book’s Grand Plan

Here is the path we’ll take. The first three chapters introduce general concepts that recur throughout the book, as well as specific machine learning methods. The rough description of ML above—predict on the basis of similar cases—is most easily developed using an ML method known as k-nearest neighbors (k-NN). Part I of the book will play two roles. First, it will cover k-NN in detail. Second, it will introduce the reader to general concepts that apply to all ML methods, such as choice of hyperparameters. In k-NN, the number of similar cases, usually denoted k, is the hyperparameter. For k-NN, what is the Goldilocks value of k—not too small and not too large? Again, choice of hyperparameters is key in most ML methods, and it will be introduced via k-NN.

Part II will then present a natural extension of k-NN, tree-based methods, specifically random forests and gradient boosting. These methods work in a flowchart-like manner, asking questions about features one at a time. In the disease diagnosis example given before, the first question might be, “Is the patient over age 50?” The next might be something like, “Is the patient’s body mass index below 20.2?” In the end, this process partitions the patients into small groups in which the members are similar to each other, so it’s like k-NN. But the groups do take different forms from k-NN, and tree methods often outperform k-NN in prediction accuracy and are considered a major ML tool.

Part III discusses methods based on linear relationships. Readers who have some background in linear regression analysis will recognize some of this, though again, no such background is assumed. This part closes with a discussion of the LASSO and ridge regression, which have the tantalizing property of deliberately shrinking down some classical linear regression estimates.

Part IV involves methods based on separating lines and planes. Consider again the cell phone service example. Say we plot the data for the old customers who left the service using the color blue in our graph. Then on the same graph, we plot those who remained loyal in red. Can we find a straight line that separates most of the blue points from most of the red points? If so, we will predict the action of the new customer by checking which side of the line his case falls on. This description not only fits SVM but also fits, in a sense, the most famous ML method, neural networks, which we cover as well.

Finally, Part V introduces several specific types of ML applications, such as image classification.

It’s often said that no one ML method works best in all applications. True, but hopefully this book’s structure will impart a good understanding of the similarities and differences between the methods, and an appreciation of where each fits in the grand scheme of things.

There is a website for the book at http://heather.cs.ucdavis.edu/artofml, which contains code, errata, new examples, and more.

0.8 One More Point

In reading this book, keep in mind that the prose is just as important as the code. Avoid the temptation to focus only on the code and graphs. A page that is all prose—no math, no graphs, and no code—may be one of the most important pages in the book. It is there that you will learn the all-important why of ML, such as why choice of hyperparameters is so vital. The prose is crucial to your goal of becoming adept at ML with the most insight and predictive power!

Keep in mind that those dazzling ML successes you’ve heard about come only after careful, lengthy tuning and thought on the analyst’s part, requiring real insight. This book aims to develop that insight. Formal math is minimized here, but note that this means prose, rather than equations, will carry many of the key issues.

So, let’s get started. Happy ML-ing!

PART I

PROLOGUE, AND NEIGHBORHOOD-BASED METHODS

1

REGRESSION MODELS

In this chapter, we’ll introduce regression functions. Such functions give the mean of one variable in terms of one or more others—for instance, the mean weight of children in terms of their age. All ML methods are regression methods in some form, meaning that they use the data we provide to estimate regression functions.

We’ll present our first ML method, k-nearest neighbors (k-NN), and apply it to real data. We’ll also weave in concepts that will recur throughout the book, such as dummy variables, overfitting, p-hacking, dirty data, and so on. We’ll introduce many of these concepts only briefly for the time being in order to give you a bird’s-eye view of what we’ll return to in detail later: ML is intuitive and coherent but easier to master if taken in stages. Reader, please be prepared for frequent statements like We’ll cover one aspect for now, with further details later.

Before you begin, make sure you have R and the qeML and regtools packages, version 1.7 or newer for the latter, installed on your computer. (Run packageVersion('regtools') to check.) All code displays in this book assume that the user has already made the calls to load the packages:

library(regtools)

library(qeML)

So, let’s look at our first example dataset.

1.1 Example: The Bike Sharing Dataset

Before we introduce k-NN, we’ll need to have some data to work with. Let’s start with this dataset from the UC Irvine Machine Learning Repository, which contains the Capital Bikeshare system’s hourly and daily count of bike rentals between 2011 and 2012, with corresponding information on weather and other quantities. A more detailed description of the data is available at the UC Irvine Machine Learning Repository.¹

The dataset is included as the day dataset in regtools by permission of the data curator. Note, though, that we will use a slightly modified version, day1 (also included in regtools), in which the numeric weather variables are given in their original scale rather than transformed to the interval [0,1].

Our main interest will be in predicting total ridership for a day.

SOME TERMINOLOGY

Say we wish to predict ridership from temperature and humidity. Standard ML parlance refers to the variables used for prediction—in this case, temperature and humidity—as features.

If the variable to be predicted is numeric, say, ridership, there is no standard ML term for it. We’ll just refer to it as the outcome variable. But if the variable to be predicted is an R factor—that is, a categorical variable—it is called a label.

For instance, later in this book we will analyze a dataset on diseases of human vertebrae. There are three possible outcomes or categories: normal (NO), disk hernia (DH), or spondylolisthesis (SL). The column in our dataset showing the class of each patient, NO, DH, or SL, would be the labels column.

Our dataset, say, day1 here, is called the training set. We use it to make predictions in future cases, in which the features are known but the outcome variable is unknown. We are predicting the latter.

1.1.1 Loading the Data

The data comes in hourly and daily forms, with the latter being the one in the regtools package. Load the data:

> data(day1)

With any dataset, it’s always a good idea to first take a look around. What variables are included in this data? What types are they, say, numeric or R factor? What are their typical values? One way to do this is to use R’s head() function to view the top of the data:

> head(day1)
  instant     dteday season yr mnth holiday
1       1 2011-01-01      1  0    1       0
2       2 2011-01-02      1  0    1       0
3       3 2011-01-03      1  0    1       0
4       4 2011-01-04      1  0    1       0
5       5 2011-01-05      1  0    1       0
6       6 2011-01-06      1  0    1       0
  weekday workingday weathersit     temp
1       6          0          2 8.175849
2       0          0          2 9.083466
3       1          1          1 1.229108
4       2          1          1 1.400000
5       3          1          1 2.666979
6       4          1          1 1.604356
      atemp      hum windspeed casual registered
1  7.999250 0.805833 10.749882    331        654
2  7.346774 0.696087 16.652113    131        670
3 -3.499270 0.437273 16.636703    120       1229
4 -1.999948 0.590435 10.739832    108       1454
5 -0.868180 0.436957 12.522300     82       1518
6 -0.608206 0.518261  6.000868     88       1518
   tot
1  985
2  801
3 1349
4 1562
5 1600
6 1606
> nrow(day1)
[1] 731

We see there are 731 rows (that is, 731 different days), with data on the date, nature of the date (such as weekday), and weather conditions (such as the temperature, temp, and humidity, hum). The last three columns measure ridership from casual users, registered users, and the total.

You can find more information on the dataset with the ?day1 command.

1.1.2 A Look Ahead

We will get to actual analysis of this data shortly. For now, here is a preview. Say we wish to predict total ridership for tomorrow, based on specific weather conditions and so on. How will we do that with k-NN?

We will search through our data, looking for data points that match or nearly match those same weather conditions and other variables. We will then average the ridership values among those data points, and that will be our predicted ridership for this new day.
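
To make that concrete, here is a rough base R sketch of the idea (not the book's qeML code), simplified to use temperature as the only feature and to average the k most similar days:

# predict total ridership for a day forecast to be, say, 28 degrees
k <- 10                            # number of similar days to average over
dists <- abs(day1$temp - 28)       # "similarity" measured by temperature alone
nearestDays <- order(dists)[1:k]   # row numbers of the k closest days
mean(day1$tot[nearestDays])        # predicted total ridership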

Too simple to be true? No, not really; the above description is accurate. Of course, the old saying The devil is in the details applies, but the process is indeed simple. But first, let’s address some general issues.

1.2 Machine Learning and Prediction

ML is fundamentally about prediction. Before we get into the details of our first ML method, we should be sure we know what prediction means.

Consider the bike sharing dataset. Early in the morning, the manager of the bike sharing service might want to predict the total number of riders for the day. The manager can do so by analyzing the relations between the features—the various weather conditions, the work status of the day (weekday, holiday), and so on. Of course, predictions are not perfect, but if they are in the ballpark of what turns out to be the actual number, they can be quite helpful. For instance, they can help the manager decide how many bikes to make available, with pumped-up tires and so on. (An advanced version would be to predict the demand for bikes at each station so that bikes could be reallocated accordingly.)

1.2.1 Predicting Past, Present, and Future

The famous baseball player and malapropist Yogi Berra once said, “Prediction is hard, especially about the future.” Amusing as this is, he had a point; in ML, prediction can refer not only to the future but also to the present or even the past. For example, a researcher may wish to estimate the mean wages workers made back in the 1700s. Or a physician may wish to make a diagnosis as to whether a patient has a particular disease, based on blood tests, symptoms, and so on, guessing their condition in the present, not the future. So when we in the ML field talk of prediction, don’t take the “pre-” too literally.

1.2.2 Statistics vs. Machine Learning in Prediction

A common misconception is that ML is concerned with prediction, while statisticians do inference—that is, confidence intervals and testing for quantities of interest—but prediction is definitely an integral part of the field of statistics.

There is sometimes a friendly rivalry between the statistics and ML communities, even down to a separate terminology for each (see Appendix B). Indeed, statisticians sometimes use the term statistical learning to refer to the same methods known in the ML world as machine learning!

As a former statistics professor who has spent most of his career in a computer science department, I have a foot in both camps. I will present ML methods in computational terms, but with some insights informed by statistical principles.

HISTORICAL NOTE

Many of the methods treated in this book, which compose part of the backbone of ML, were originally developed in the statistics community. These include k-NN, decision trees or random forests, logistic regression, and L1/L2 shrinkage. These evolved from the linear models formulated way back in the 19th century, but which later statisticians felt were inadequate for some applications. The latter consideration sparked interest in methods that had less restrictive assumptions, leading to the invention first of k-NN and later of other techniques.

On the other hand, two other prominent ML methods, support vector machines (SVMs) and neural networks, have been developed almost entirely outside of statistics, notably in university computer science departments. (Another method, boosting, began in computer science but has had major contributions from both factions.) Their impetus was not statistical at all. Neural networks, as we often hear in the media, were studied originally as a means to understand the workings of the human brain. SVMs were viewed simply in computer science algorithmic terms—given a set of data points of two classes, how can we compute the best line or plane separating them?

1.3 Introducing the k-Nearest Neighbors Method

Our featured method in this chapter will be k-nearest neighbors, or k-NN. It’s arguably the oldest ML method, going back to the early 1950s, but it is still widely used today, especially in applications in which the number of features is small (for reasons that will become clear later). It’s also simple to explain and easy to implement—the perfect choice for this introductory chapter.

1.3.1 Predicting Bike Ridership with k-NN

Let’s first look at using k-NN to predict bike ridership from a single feature: temperature. Say the day’s temperature is forecast to be 28 degrees centigrade. How should we predict ridership for the day, using the 28 figure and our historical ridership dataset (our training set)? A person without a background in ML might suggest looking at all the days in our data, culling out those of temperature closest to 28 (there may be few or none with a temperature of exactly 28), and then finding the average ridership on those days. We would use that number as our predicted ridership for this day.

Actually, this intuition is correct! This, in fact, is the basis for many common ML methods, as we’ll discuss further in Section 1.6 on the regression function. For now, just know that k-NN takes the form of simply averaging over the similar cases—that is, over the neighboring data points. The quantity k is the number of neighbors we use. We could, say, take
