The Art of Machine Learning: A Hands-On Guide to Machine Learning with R
About this ebook
Packed with real datasets and practical examples, The Art of Machine Learning will help you develop an intuitive understanding of how and why ML methods work, without the need for advanced math.
As you work through the book, you’ll learn how to implement a range of powerful ML techniques, starting with the k-Nearest Neighbors (k-NN) method and random forests, and moving on to gradient boosting, support vector machines (SVMs), neural networks, and more.
With the aid of real datasets, you’ll delve into regression models through the use of a bike-sharing dataset, explore decision trees by leveraging New York City taxi data, and dissect parametric methods with baseball player stats. You’ll also find expert tips for avoiding common problems, like handling “dirty” or unbalanced data, and how to troubleshoot pitfalls.
You’ll also explore:
- How to deal with large datasets and techniques for dimension reduction
- Details on how the Bias-Variance Trade-off plays out in specific ML methods
- Models based on linear relationships, including ridge and LASSO regression
- Real-world image and text classification and how to handle time series data
Machine learning is an art that requires careful tuning and tweaking. With The Art of Machine Learning as your guide, you’ll master the underlying principles of ML, empowering you to use these models effectively rather than simply applying a few stock actions of limited practical use.
Requirements: A basic understanding of graphs and charts and familiarity with the R programming language
Book preview
The Art of Machine Learning - Norman Matloff
INTRODUCTION
Machine learning! With such a science fiction-ish name, one might expect it to be technology that is strictly reserved for highly erudite specialists. Not true.
Actually, machine learning (ML) can easily be explained in commonsense terms, and anyone with a good grasp of charts, graphs, and the slope of a line should be able to both understand and productively use ML. Of course, as the saying goes, “The devil is in the details,” and one must work one’s way through those details. But ML is not rocket science, in spite of it being such a powerful tool.
0.1 What Is ML?
ML is all about prediction. Does a patient have a certain disease? Will a customer switch from her current cell phone service to another? What is actually being said in this rather garbled audio recording? Is that bright spot observed by a satellite a forest fire or just a reflection?
We predict an outcome from one or more features. In the disease diagnosis example, the outcome is having the disease or not, and the features may be blood tests, family history, and so on.
All ML methods involve a simple idea: similarity. In the cell phone service example, how do we predict the outcome for a certain customer? We look at past customers and select the ones who are most similar in features (size of bill, lateness record, yearly income, and so on) to our current customer. If most of those similar customers bolted, we predict the same for the current one. Of course, we are not guaranteed that outcome, but it is our best guess.
0.2 The Role of Math in ML Theory and Practice
Many ML methods are based on elegant mathematical theory, with support vector machines (SVMs) being a notable example. However, knowledge of this theory is of little use when it comes to applying SVMs well in actual applications.
To be sure, a good intuitive understanding of how ML methods work is essential to effective use of ML in practice. This book strives to develop in the reader a keen understanding of the intuition, without using advanced mathematics. Indeed, there are very few equations in this book.
0.3 Why Another ML Book?
There are many great ML books out there, of course, but none really empower the reader to use ML effectively in real-world problems. In many cases, the problem is that the books are too theoretical, but I am equally concerned that the applied books tend to be “cookbooks” (too “recipe-oriented”) that treat the subject in a Step 1, Step 2, Step 3 manner. Their focus is on the syntax and semantics of ML software, with the result that while the reader may know the software well, the reader is not positioned to use ML well.
I wrote this book because:
- There is a need for a book that uses the R language but is not about R. This is a book on ML that happens to use R for examples and not a book about the use of R in ML.
- There is a need for an ML book that recognizes that ML is an art, not a science. (Hence the title of this book.)
- There is a need for an ML book that avoids advanced math but addresses the point that, in order to use ML effectively, one does need to understand the concepts well—the why and how of ML methods. Most “applied” ML books do too little in explaining these things.
All three of these bullets go back to the “anti-cookbook” theme. My goal is, then, this:
I would like those who use ML to not only know the definition of random forests but also be ready to cogently explain how the various hyperparameters in random forests may affect overfitting. MLers should also be able to give a clear account of the problems of “p-hacking” in feature engineering.
We will empower the reader with strong, practical, real-world knowledge of ML methods—their strengths and weaknesses, what makes them work and fail, what to watch out for. We will do so without much formal math and will definitely take a hands-on approach, using prominent software packages on real datasets. But we will do so in a savvy manner. We will be informed consumers.
0.4 Recurring Special Sections
There are special recurring themes and sections throughout this book:
Bias vs. Variance
Numerous passages explain in concrete terms—no superstition!—how these two central notions play out for each specific ML method.
Pitfalls
Numerous sections with the “Pitfall” title warn the reader of potential problems and show how to avoid them.
0.5 Background Needed
What kind of background will the reader need to use this book profitably?
No prior exposure to ML or statistics is assumed.
As to math in general, the book is mostly devoid of formal equations. As long as the reader is comfortable with basic graphs, such as histograms and scatterplots, and simple algebra notions, such as the slope of a line, that is quite sufficient.
The book does assume some prior background in R coding, such as familiarity with vectors, factors, data frames, and functions. The R command line (> prompt, Console in RStudio) is used throughout. Readers without a background in R, or those wishing to have a review, may find my fasteR tutorial useful: https://github.com/matloff/fasteR.
Make sure R and the qeML package are installed on your computer. For the package, the preferred installation source is GitHub, as it will always have the most up-to-date version of the package. You’ll need the devtools package; if you don’t already have it, type:
install.packages('devtools')
Then, to install qeML, type:
library(devtools)
install_github('https://github.com/matloff/qeML')
The qeML package will also be on the CRAN R code repository but updated less frequently.
0.6 The qe*-Series Software
Most of the software used here will come from popular R packages:
e1071
gbm
glmnet
keras
randomForest
Readers can use these packages directly if they wish. But in order to keep things simple and convenient for readers, we usually will be using wrappers for the functions in those packages, which are available in my package, qeML. This is a big help in two ways:
The wrappers provide a uniform interface.
That uniform interface is also simple.
For instance, consider day1, a bike rental dataset used at various points in this book. We wish to predict tot, total ridership. Here’s how we would do that using random forests, an ML topic covered in this book:
qeRF(day1,'tot')
For support vector machines, another major topic, the call would be
qeSVM(day1,'tot')
and so on. Couldn’t be simpler! No preparatory code, say, to define a model; just call one of the qe functions and go! The prefix qe- stands for “quick and easy.”
One can also specify method-specific parameters, which we will do, but still, it will be quite simple.
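For instance, here is a rough sketch of such a call, using qeKNN, the wrapper for the k-nearest neighbors method featured in Part I. (Argument names can vary between package versions; see the help page, ?qeKNN, for the authoritative list.)

# k-NN on the bike data, using 10 nearest neighbors rather than the default
qeKNN(day1,'tot',k=10)

The form of the call stays the same; only the optional, method-specific arguments change.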
For very advanced usage, this book shows how to use those packages directly.
0.7 The Book’s Grand Plan
Here is the path we’ll take. The first three chapters introduce general concepts that recur throughout the book, as well as specific machine learning methods. The rough description of ML above—predict on the basis of similar cases—is most easily developed using an ML method known as k-nearest neighbors (k-NN). Part I of the book will play two roles. First, it will cover k-NN in detail. Second, it will introduce the reader to general concepts that apply to all ML methods, such as choice of hyperparameters. In k-NN, the number of similar cases, usually denoted k, is the hyperparameter. For k-NN, what is the “Goldilocks” value of k—not too small and not too large? Again, choice of hyperparameters is key in most ML methods, and it will be introduced via k-NN.
Part II will then present a natural extension of k-NN, tree-based methods, specifically random forests and gradient boosting. These methods work in a flowchart-like manner, asking questions about features one at a time. In the disease diagnosis example given before, the first question might be, “Is the patient over age 50?” The next might be something like, “Is the patient’s body mass index below 20.2?” In the end, this process partitions the patients into small groups in which the members are similar to each other, so it’s like k-NN. But the groups take a different form than in k-NN, and tree-based methods, which often outperform k-NN in prediction accuracy, are considered a major ML tool.
Part III discusses methods based on linear relationships. Readers who have some background in linear regression analysis will recognize some of this, though again, no such background is assumed. This part closes with a discussion of the LASSO and ridge regression, which have the tantalizing property of deliberately shrinking down some classical linear regression estimates.
Part IV involves methods based on separating lines and planes. Consider again the cell phone service example. Say we plot the data for the old customers who left the service using the color blue in our graph. Then on the same graph, we plot those who remained loyal in red. Can we find a straight line that separates most of the blue points from most of the red points? If so, we will predict the action of the new customer by checking which side of the line his case falls on. This description not only fits SVM but also fits, in a sense, the most famous ML method, neural networks, which we cover as well.
Finally, Part V introduces several specific types of ML applications, such as image classification.
It’s often said that no one ML method works best in all applications. True, but hopefully this book’s structure will impart a good understanding of the similarities and differences between the methods, and an appreciation of where each fits in the grand scheme of things.
There is a website for the book at http://heather.cs.ucdavis.edu/artofml, which contains code, errata, new examples, and more.
0.8 One More Point
In reading this book, keep in mind that the prose is just as important as the code. Avoid the temptation to focus only on the code and graphs. A page that is all prose—no math, no graphs, and no code—may be one of the most important pages in the book. It is there that you will learn the all-important why of ML, such as why choice of hyperparameters is so vital. The prose is crucial to your goal of becoming adept at ML with the most insight and predictive power!
Keep in mind that those dazzling ML successes you’ve heard about come only after careful, lengthy tuning and thought on the analyst’s part, requiring real insight. This book aims to develop that insight. Formal math is minimized here, which means that in place of equations you will find prose that explains many key issues.
So, let’s get started. Happy ML-ing!
PART I
PROLOGUE, AND NEIGHBORHOOD-BASED METHODS
1
REGRESSION MODELS
In this chapter, we’ll introduce regression functions. Such functions give the mean of one variable in terms of one or more others—for instance, the mean weight of children in terms of their age. All ML methods are regression methods in some form, meaning that they use the data we provide to estimate regression functions.
We’ll present our first ML method, k-nearest neighbors (k-NN), and apply it to real data. We’ll also weave in concepts that will recur throughout the book, such as dummy variables, overfitting, p-hacking, “dirty” data, and so on. We’ll introduce many of these concepts only briefly for the time being in order to give you a bird’s-eye view of what we’ll return to in detail later: ML is intuitive and coherent but easier to master if taken in stages. Reader, please be prepared for frequent statements like “We’ll cover one aspect for now, with further details later.”
Before you begin, make sure you have R and the qeML and regtools packages, version 1.7 or newer for the latter, installed on your computer. (Run packageVersion('regtools') to check.) All code displays in this book assume that the user has already made the calls to load the packages:
library(regtools)
library(qeML)
So, let’s look at our first example dataset.
1.1 Example: The Bike Sharing Dataset
Before we introduce k-NN, we’ll need to have some data to work with. Let’s start with this dataset from the UC Irvine Machine Learning Repository, which contains the Capital Bikeshare system’s hourly and daily count of bike rentals between 2011 and 2012, with corresponding information on weather and other quantities. A more detailed description of the data is available at the UC Irvine Machine Learning Repository.¹
The dataset is included as the day dataset in regtools by permission of the data curator. Note, though, that we will use a slightly modified version, day1 (also included in regtools), in which the numeric weather variables are given in their original scale rather than transformed to the interval [0,1].
Our main interest will be in predicting total ridership for a day.
SOME TERMINOLOGY
Say we wish to predict ridership from temperature and humidity. Standard ML parlance refers to the variables used for prediction—in this case, temperature and humidity—as features.
If the variable to be predicted is numeric, say, ridership, there is no standard ML term for it. We’ll just refer to it as the outcome variable. But if the variable to be predicted is an R factor—that is, a categorical variable—it is called a label.
For instance, later in this book we will analyze a dataset on diseases of human vertebrae. There are three possible outcomes or categories: normal (NO), disk hernia (DH), or spondylolisthesis (SL). The column in our dataset showing the class of each patient, NO, DH, or SL, would be the labels column.
Our dataset, say, day1 here, is called the training set. We use it to make predictions in future cases, in which the features are known but the outcome variable is unknown. We are predicting the latter.
1.1.1 Loading the Data
The data comes in hourly and daily forms, with the latter being the one in the regtools package. Load the data:
> data(day1)
With any dataset, it’s always a good idea to first take a look around. What variables are included in this data? What types are they, say, numeric or R factor? What are their typical values? One way to do this is to use R’s head() function to view the top of the data:
> head(day1)
instant dteday season yr mnth holiday
1 1 2011-01-01 1 0 1 0
2 2 2011-01-02 1 0 1 0
3 3 2011-01-03 1 0 1 0
4 4 2011-01-04 1 0 1 0
5 5 2011-01-05 1 0 1 0
6 6 2011-01-06 1 0 1 0
weekday workingday weathersit temp
1 6 0 2 8.175849
2 0 0 2 9.083466
3 1 1 1 1.229108
4 2 1 1 1.400000
5 3 1 1 2.666979
6 4 1 1 1.604356
atemp hum windspeed casual registered
1 7.999250 0.805833 10.749882 331 654
2 7.346774 0.696087 16.652113 131 670
3 -3.499270 0.437273 16.636703 120 1229
4 -1.999948 0.590435 10.739832 108 1454
5 -0.868180 0.436957 12.522300 82 1518
6 -0.608206 0.518261 6.000868 88 1518
tot
1 985
2 801
3 1349
4 1562
5 1600
6 1606
> nrow(day1)
[1] 731
We see there are 731 rows (that is, 731 different days), with data on the date, nature of the date (such as weekday), and weather conditions (such as the temperature, temp, and humidity, hum). The last three columns measure ridership from casual users, registered users, and the total.
You can find more information on the dataset with the ?day1 command.
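Base R’s str() and summary() functions are also handy for this kind of first look, reporting each column’s type and typical values (output omitted here):

> str(day1)
> summary(day1)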
1.1.2 A Look Ahead
We will get to actual analysis of this data shortly. For now, here is a preview. Say we wish to predict total ridership for tomorrow, based on specific weather conditions and so on. How will we do that with k-NN?
We will search through our data, looking for data points that match or nearly match those same weather conditions and other variables. We will then average the ridership values among those data points, and that will be our predicted ridership for this new day.
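To make this concrete, here is a minimal by-hand sketch of the idea, using just one feature, temperature. (This is purely illustrative; the qe*-series functions described in the introduction do this work, and much more, for us.) Say tomorrow’s forecast is 28 degrees. We find the five days in day1 whose temperatures were closest to 28 and average their ridership:

# distance of each day's temperature from tomorrow's forecast of 28 degrees
dists <- abs(day1$temp - 28)
# row numbers of the 5 closest days
nearest5 <- order(dists)[1:5]
# average ridership on those days is our predicted ridership
mean(day1$tot[nearest5])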
Too simple to be true? No, not really; the above description is accurate. Of course, the old saying “The devil is in the details” applies, but the process is indeed simple. But first, let’s address some general issues.
1.2 Machine Learning and Prediction
ML is fundamentally about prediction. Before we get into the details of our first ML method, we should be sure we know what “prediction” means.
Consider the bike sharing dataset. Early in the morning, the manager of the bike sharing service might want to predict the total number of riders for the day. The manager can do so by analyzing the relation between ridership and the features: the various weather conditions, the work status of the day (weekday, holiday), and so on. Of course, predictions are not perfect, but if they are in the ballpark of what turns out to be the actual number, they can be quite helpful. For instance, they can help the manager decide how many bikes to make available, with pumped-up tires and so on. (An advanced version would be to predict the demand for bikes at each station so that bikes could be reallocated accordingly.)
1.2.1 Predicting Past, Present, and Future
The famous baseball player and malapropist Yogi Berra once said, “Prediction is hard, especially about the future.” Amusing as this is, he had a point; in ML, prediction can refer not only to the future but also to the present or even the past. For example, a researcher may wish to estimate the mean wages workers made back in the 1700s. Or a physician may wish to make a diagnosis as to whether a patient has a particular disease, based on blood tests, symptoms, and so on, guessing their condition in the present, not the future. So when we in the ML field talk of “prediction,” don’t take the “pre-” too literally.
1.2.2 Statistics vs. Machine Learning in Prediction
A common misconception is that ML is concerned with prediction while statistics is concerned only with inference, that is, confidence intervals and hypothesis tests for quantities of interest. In fact, prediction is definitely an integral part of the field of statistics.
There is sometimes a friendly rivalry between the statistics and ML communities, even down to a separate terminology for each (see Appendix B). Indeed, statisticians sometimes use the term statistical learning to refer to the same methods known in the ML world as machine learning!
As a former statistics professor who has spent most of his career in a computer science department, I have a foot in both camps. I will present ML methods in computational terms, but with some insights informed by statistical principles.
HISTORICAL NOTE
Many of the methods treated in this book, which compose part of the backbone of ML, were originally developed in the statistics community. These include k-NN, decision trees and random forests, logistic regression, and L1/L2 shrinkage. They evolved from linear models, formulated way back in the 19th century, which later statisticians felt were inadequate for some applications. That concern sparked interest in methods with less restrictive assumptions, leading to the invention first of k-NN and later of other techniques.
On the other hand, two other prominent ML methods, support vector machines (SVMs) and neural networks, have been developed almost entirely outside of statistics, notably in university computer science departments. (Another method, boosting, began in computer science but has had major contributions from both factions.) Their impetus was not statistical at all. Neural networks, as we often hear in the media, were studied originally as a means to understand the workings of the human brain. SVMs were viewed simply in computer science algorithmic terms—given a set of data points of two classes, how can we compute the best line or plane separating them?
1.3 Introducing the k-Nearest Neighbors Method
Our featured method in this chapter will be k-nearest neighbors, or k-NN. It’s arguably the oldest ML method, going back to the early 1950s, but it is still widely used today, especially in applications in which the number of features is small (for reasons that will become clear later). It’s also simple to explain and easy to implement—the perfect choice for this introductory chapter.
1.3.1 Predicting Bike Ridership with k-NN
Let’s first look at using k-NN to predict bike ridership from a single feature: temperature. Say the day’s temperature is forecast to be 28 degrees centigrade. How should we predict ridership for the day, using the 28 figure and our historical ridership dataset (our training set)? A person without a background in ML might suggest looking at all the days in our data, culling out those of temperature closest to 28 (there may be few or none with a temperature of exactly 28), and then finding the average ridership on those days. We would use that number as our predicted ridership for this day.
Actually, this intuition is correct! This, in fact, is the basis for many common ML methods, as we’ll discuss further in Section 1.6 on the regression function. For now, just know that k-NN takes the form of simply averaging over the similar cases—that is, over the neighboring data points. The quantity k is the number of neighbors we use. We could, say, take