想了解机器学习?这 3 种算法你必须要知道(中英文对照)

本文介绍了机器学习领域的三大核心算法:回归(包括线性回归和逻辑回归)、决策树以及K-均值聚类。通过实际案例,展示了这些算法的具体应用,并讨论了它们各自的特点和优势。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

英文原文:https://dzone.com/articles/3-machine-learning-algorithms-you-need-to-know

翻译参考:https://www.oschina.net/translate/3-machine-learning-algorithms-you-need-to-know

Imagine you have some data-related problem that you want to solve. You have heard of all the amazing things that machine learning algorithms can achieve and want to try it for yourself — but you have no prior experience or knowledge in this area. You start googling some terms like “machine learning models” and “machine learning methodologies,” but after some time, you find yourself ready to give up, completely lost somewhere between the different algorithms.
假设有一些数据相关的问题亟待你解决。在此之前你听说过机器学习算法可以帮助解决这些问题,于是你想借此机会尝试一番,却苦于在此领域没有任何经验或知识。 你开始谷歌一些术语,如“机器学习模型”和“机器学习方法论”,但一段时间后,你发现自己完全迷失在了不同算法之间,于是你准备放弃。

Hold on, you brave guy!
朋友,请坚持下去!

Lucky for you, I’m going to walk you through three major categories of machine learning algorithms that will allow you to solve confidently a large range of data science problems.
幸运的是,在这篇文章中我将介绍三大类的机器学习算法,针对大范围的数据科学问题,相信你都能满怀自信去解决。

In the following post, we are going to talk about decision trees, clustering algorithms, and regression, point out their differences, and figure out how to choose the most suitable model for your case.
在接下来的文章中,我们将讨论决策树、聚类算法和回归,指出它们之间的差异,并找出如何为你的案例选择最合适的模型

Supervised vs. Unsupervised Learning
有监督的学习 vs. 无监督的学习

The cornerstone of your understanding is the categorization of the problems in two major classes, supervised and unsupervised learning, as any machine learning problem can be assigned to one of them.
理解机器学习的基础,就是要学会对有监督的学习和无监督的学习进行分类,因为机器学习中的任何一个问题,都属于这两大类的范畴。

In the case of supervised learning, we have a dataset that will be given to some algorithm as input. We already know what the format of the correct output should look like (assuming that there is some relationship between the input and the output).
在有监督学习的情况下,我们有一个数据集,它们将作为输入提供给一些算法。但前提是,我们已经知道正确输出的格式应该是什么样子(假设输入和输出之间存在一些关系)。

As we will see later on, regression and classification problems belong to this category.
我们随后将看到的回归和分类问题都属于这个类别。

On the other hand, unsupervised learning is applied in cases where we have no clue what the output should look like. In fact, we derive structures from data where the effect of the input variables is not known. Clustering problems are the major representatives of this class.
另一方面,在我们不知道输出应该是什么样子的情况下,就应该使用无监督学习。事实上,我们需要从输入变量的影响未知的数据中推导出正确的结构。聚类问题是这个类别的主要代表。

To make the above categorization more clear I’ll give you some real-world problems and try to classify them accordingly.
为了使上面的分类更清晰,我会列举一些实际的问题,并试着对它们进行相应的分类。

Example 1
Imagine you are running a real estate office. Given a new house’s characteristics, you want to predict at what price it is going to be sold based on previous sales of other houses that you have recorded. Your input dataset consists of the characteristics of multiple houses, like the number of bathrooms and lot size, while the variable you want to predict, often called target variable, is the price. Since you know the prices at which the houses of the dataset were sold, this is a supervised learning problem — and more specifically, a regression one.
示例一
假设你在经营一家房地产公司。考虑到新房子的特性,你要根据你以前记录的其他房屋的销售量来预测它的售价是多少。你输入的数据集包括多个房子的特性,比如卫生间的数量和大小等,而你想预测的变量(通常称为“目标变量”)就是价格。预测房屋的售价是一个有监督学习问题,更确切地说,是回归问题。

Example 2
Consider the results of a medical experiment that aims to predict whether someone is going to develop myopia based on some physical measurements and heredity. In this case, the input dataset consists of the person’s medical characteristics and the target variable is binary: 1 for those who are likely to develop myopia and 0 for those who aren’t. Since you know in advance the value of the target variable of the experiment’s participants (i.e. you know if they were myopic), this is again a supervised learning problem — and more specifically, a classification one.
示例二
假设一个医学实验的目的是预测一个人是否会因为一些体质测量和遗传导致近视程度加深。在这种情况下,输入的数据集是这个人的体质特征,而目标变量有两种:1 表示可能加深近视,而 0 表示不太可能。预测一个人是否会加深近视也是一个有监督学习问题,更确切地说,是分类问题。

Example 3
Suppose you have a company with many customers. Based on their recent interactions with your company, the products they’ve bought lately, and their demographics, you want to form groups of similar customers in order to handle them differently — i.e. offer exclusive discount coupons to some them. In this case, you will use the characteristics mentioned above as input into some algorithm and the algorithm will decide the number or kind of groups that should be formed. That’s a clear example of unsupervised learning problem since we do not have any a priori clue about how the output should look like.
示例三
假设你的公司拥有很多客户。根据他们最近与贵公司的互动情况、他们近期购买的产品以及他们的人口统计数据,你想要形成相似顾客的群体,以便以不同的方式应对他们 - 例如向他们中的一些人提供独家折扣券。在这种情况下,你将使用上述提及的特征作为算法的输入,而算法将决定应该形成的组的数量或类别。这显然是一个无监督学习的例子,因为我们没有任何关于输出会如何的线索,完全不知道结果会怎样。

That being said, it’s time to fulfill my promise and move on to some more specific algorithms…
接下来,我将介绍一些更具体的算法…

Regression
回归
To begin with, regression is not a single supervised technique but instead a whole category to which many techniques belong.
首先,回归不是一个单一的监督学习技术,而是一个很多技术所属的完整类别。

The main idea in regression is that given some input variables, we want to predict the value of the target. In the case of regression, the target variable is continuous — meaning that it can take any value within a specified range. Input variables, on the other hand, can be either discrete or continuous.
回归的主要思想是给定一些输入变量,我们要预测目标值。在回归的情况下,目标变量是连续的 - 这意味着它可以在指定的范围内取任何值。另一方面,输入变量可以是离散的也可以是连续的。

Among regression techniques, the most popular are linear and logistic. Let’s have a closer look at them.
在回归技术中,最流行的是线性回归和逻辑回归。让我们仔细研究一下。

Linear Regression
线性回归
In linear regression, we are trying to establish a relationship between the input variables and the target, which is expressed by a straight line, usually called regression line.
在线性回归中,我们尝试在输入变量和目标变量之间构建一段关系,并将这种关系用条直线表示,我们通常将其称为回归线。

For example, assuming that we have two input variables X1 and X2 and one target variable Y, this relationship can be mathematically expressed like:
例如,假设我们有两个输入变量 X1 和 X2,还有一个目标变量 Y,它们的关系可以用数学公式表示如下:

Y = a ∗ X 1 + b ∗ X 2 + c Y = a * X_1 + b*X_2 +c Y=aX1+bX2+c

Given the values of X1 and X2, we aim to tune a, b, and c so that Y will be as close to the actual value as possible.
假设 X1 和 X2 的值已知,我们需要将 a,b 和 c 进行调整,从而使 Y 能尽可能的接近真实值。

So, example time!
举个例子!

Suppose we have the famous Iris dataset, which contains measurements regarding the size of the sepal and the petal of iris flowers belonging to different types, i.e. Setosa, Versicolor, and Virginica.
假设我们拥有著名的 Iris 数据集,它提供了一些方法,能通过花朵的花萼大小以及花瓣大小判断花朵的类别,如:Setosa,Versicolor 和 Virginica。

Using R software, we are going to implement a linear regression to predict the length of the sepal given the petal’s width and length.
使用 R 软件,假设花瓣的宽度和长度已给定,我们将实施线性回归来预测萼片的长度。

Mathematically, we are going the values a and b in the following relationship:
在数学上,我们会通过以下公式来获取 a、b 值:

S e p a l L e n g t h = a ∗ P e t a l W i d t h + b ∗ P e t a l L e n g t h + c SepalLength = a * PetalWidth + b* PetalLength +c SepalLength=aPetalWidth+bPetalLength+c

The corresponding code is:
相应的代码如下所示:

# Load required packages
library(ggplot2)
# Load iris dataset
data(iris)
# Have a look at the first 10 observations of the dataset
head(iris)
# Fit the regression line
fitted_model <- lm(Sepal.Length ~ Petal.Width + Petal.Length, data = iris)
# Get details about the parameters of the selected model
summary(fitted_model)
# Plot the data points along with the regression line
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
    geom_point(alpha = 6/10) +
    stat_smooth(method = "lm", fill="blue", colour="grey50", size=0.5, alpha = 0.1)

The results of the linear regression are shown in the following graph, where the black points represent the initial data points while the blue line the fitted regression line, which is plotted for the estimated values a = -0.31955, b = 0.54178, and c = 4.19058 that result in the best possible predictions for the target variable, i.e. the sepal length.
线性回归的结果显示在下列图表中,其中黑点表示初始数据点,蓝线表示拟合回归直线,由此得出估算值:a= -0.31955,b = 0.54178 和 c = 4.19058,这个结果可能最接近实际值,即花萼的真实长度。

From now on, we will be able to make predictions regarding sepal length for new points just by applying the values of petal length and petal width into the defined linear relationship.
接下来,只要将花瓣长度和花瓣宽度的值应用到定义的线性关系中,就可以对花萼长度进行预测了。
在这里插入图片描述
Logistic Regression
逻辑回归

The main idea here is exactly the same as it was with linear regression. The difference is that the regression line is not straight anymore.
主要思想与线性回归完全相同。不同点是逻辑回归的回归线不再是直的。

Instead, the mathematical relationship we are trying to establish is of the following form:
我们要建立的数学关系是以下形式的:

Y = g ( a ∗ X 1 + b ∗ X 2 ) Y=g(a*X_1+b*X_2) Y=g(aX1+bX2)

…where g() is the logistic function.
g() 是一个对数函数。

Due to the properties of the logistic function, Y is continuous, ranging in [0,1] and can be interpreted as the probability of an event to happen.
根据该逻辑函数的性质,Y 是连续的,范围是 [0,1],可以被解释为一个事件发生的概率。

I know you love examples so I’ll show you one more!
再举个例子!

This time, we will experiment with the mtcars dataset, which comprises fuel consumption and ten aspects of automobile design and performance for 32 automobiles manufactured in 1973-1974.
这一次我们研究 mtcars 数据集,包含 1973-1974 年间 32 种汽车制造的汽车设计、十个性能指标以及油耗

Using R, we are going to predict the probability of an automobile of having automatic transmission (am = 0) or manual (am = 1) based on the measurements of V/S and Miles/(US) gallon.
使用 R,我们将在测量 V/S 和每英里油耗的基础上预测汽车的变速器是自动(AM = 0)还是手动(AM = 1)的概率。

a m = g ( a ∗ m p g + b ∗ v s + c ) : am = g(a * mpg + b* vs +c): am=g(ampg+bvs+c):

# Load required packages
library(ggplot2)
# Load data
data(mtcars)
# Keep a subset of the data features that includes on the measurement we are interested in
cars <- subset(mtcars, select=c(mpg, am, vs))
# Fit the logistic regression line
fitted_model <- glm(am ~ mpg+vs, data=cars, family=binomial(link="logit"))
# Plot the results
ggplot(cars, aes(x=mpg, y=vs, colour = am)) + geom_point(alpha = 6/10) +
 stat_smooth(method="glm",fill="blue", colour="grey50", size=0.5, alpha = 0.1, method.args=list(family="binomial"))

The results are shown in the following graph, where the black points represent the initial points of the dataset and the blue line the fitted logistic regression line for estimated a = 0.5359, b = -2.7957, and c = - 9.9183.
如下图所示,其中黑点代表数据集的初始点,蓝线代表闭合的对数回归线。估计 a = 0.5359,b = -2.7957,c = - 9.9183
在这里插入图片描述
We can observe that, as mentioned before, the logistic regression output values will lie only in the range [0,1] due to the form of the regression line.
我们可以观察到,和线性回归一样,对数回归的输出值回归线也在区间 [0,1] 内。

For any new automobile with given measurements for V/S and Miles/(US) gallon, we can now predict the probability that this car will have automatic transmission. Isn’t that utterly awesome?
对于任何新汽车的测量 V/S 和每英里油耗,我们可以预测这辆汽车将使用自动变速器。这是不是准确得吓人?

Decision Trees
决策树

Decision trees are the second type of machine learning algorithms that we are going to investigate. They are divided in regression and classification trees, and thus can be used in supervised learning problems.
决策树是我们要研究的第二种机器学习算法。它们被分成回归树和分类树,因此可以用于监督式学习问题。

Admittedly, decision trees are one of the most intuitive algorithms, as they mimic the way people decide in many cases. What they basically do is draw a “map” of all possible paths along with the corresponding result in each case.
无可否认,决策树是最直观的算法之一,因为它们模仿人们在多数情况下的决策方式。他们基本上做的是在每种情况下绘制所有可能路径的“地图”,并给出相应的结果。

A graphical representation will help understand better what exactly we are talking about.
图形表示有助于更好地理解我们正在探讨的内容。
在这里插入图片描述
Based on a tree like this, the algorithm can decide which path to follow at each step depending on the value of the corresponding criterion. The way that the algorithm chooses the splitting criteria along with the respective thresholds at each level depends on how informative the candidate variable is for the target variable and which setup minimizes the error of the prediction produced.
基于像上面这样的树,该算法可以根据相应标准中的值来决定在每个步骤要采用的路径。算法所选择的划分标准以及每个级别的相应阈值的策略,取决于候选变量对于目标变量的信息量多少,以及哪个设置可以最小化所产生的预测误差。

Yet another example is here!
再举一个例子!

This time, the dataset under examination is readingSkills. It includes information about students who took a certain exam along with their score.
这次测验的数据集是 readingSkills。它包含学生的考试信息及考试分数。

We are going to classify students into native speakers (nativeSpeaker = 1) or foreigners (nativeSpeaker= 0) based on multiple measures, including their score in the test, their shoe size, and their age.
我们将基于多重指标把学生分为两类,说母语者(nativeSpeaker = 1)或外国人(nativeSpeaker= 0),他们的考试分数、鞋码以及年龄都在指标范围内。

For this implementation in R, we first need to install the party package.
对于 R 中的实现,我们首先要安装 party 包:

# Include required packages
library(party)
library(partykit)
# Have a look at the first ten observations of the dataset
print(head(readingSkills))
input.dat <- readingSkills[c(1:105),]
# Grow the decision tree
output.tree <- ctree(
 nativeSpeaker ~ age + shoeSize + score,
 data = input.dat)
# Plot the results
plot(as.simpleparty(output.tree))

We can see that the first splitting criterion used was the score, due to its high importance in predicting the target variable, whereas shoe size wasn’t taken into consideration at all as it didn’t provide any helpful information regarding the language.
我们可以看到,它使用的第一个分类标准是分数,因为它对于目标变量的预测非常重要,鞋码则不在考虑的范围内,因为它没有提供任何与语言相关的有用的信息。
在这里插入图片描述
Now, if we have a new student and know their age and score on the test, we can predict whether they are a native speaker!
现在,如果我们多了一群新生,并且知道他们的年龄和考试分数,我们就可以预测他们是不是说母语的人。

Clustering Algorithms
聚类算法
Until now, we have talked only a little bit about supervised learning problems. Now, we are moving on to the study of clustering algorithms, a subset of unsupervised learning methods.
到目前为止,我们都在讨论有监督学习的有关问题。现在,我们要继续研究聚类算法,它是无监督学习方法的子集。

So, just a small revision…
因此,只做了一点改动…

With clustering, if we have some initial data at our disposal, we want to form groups so that the data points belonging to some group are similar and are different from data points of the other groups.
说到聚类,如果我们有一些初始数据需要支配,我们会想建立一个组,这样一来,其中一些组的数据点就是相同的,并且能与其他组的数据点区分开来。

The algorithm we are going to study is called “k-means” where k represents the number of produced clusters and is one of the most popular clustering methods.
我们将要学习的算法叫做 K-Means聚类(K-Means Clustering),也可以叫 K-Means 聚类,其中 k 表示产生的聚类的数量,这是最流行的聚类算法之一。

Remember the Iris dataset we used in the example before? We are going to use it again.
还记得我们前面用到的 Iris 数据集吗?这里我们将再次用到。

For the purpose of our study, we have plotted all the data points from the dataset using their petal measures, as shown below:
为了更好地研究,我们使用花瓣测量方法绘制出数据集的所有数据点,如图所示:在这里插入图片描述
Based solely on petal measures, we are going to cluster the data points into three groups using 3-means clustering.
仅仅基于花瓣的度量值,我们使用 3-均值聚类将数据点聚集成三组。

So how does the 3-means, or more generally, the k-means algorithm, work? The whole process can be summarized in just a few simple steps:
那么3-均值,或更普遍来说,k-聚类算法是怎样工作的呢?整个过程可以概括为几个简单的步骤:

  • Initialization step: For k=3 clusters, the algorithm randomly selects three points as centroids for each cluster.
    初始化步骤:例如 K = 3 簇,这个算法为每个聚类中心随机选择三个数据点。

  • Cluster assignment step: The algorithm goes through the rest of the data points and assigns each one of them to the closest cluster.
    群集分配步骤:该算法通过其余的数据点,并将其中的每一个分配给最近的群集。

  • Centroid move step: After cluster assignment, the centroid of each cluster is moved to the average of all points belonging to the cluster.
    重心移动步骤:在集群分配后,每个簇的质心移动到属于组的所有点的平均值。

Steps 2 and 3 are repeated multiple times until there is no change to be made regarding cluster assignments. The implementation of the k-means algorithm in R is easy and can be done with the following code:
步骤 2 和 3 重复多次,直到没有对集群分配作出更改为止。用 R 实现 k-聚类算法很简单,可以用下面的代码完成:

# Load required packages
library(ggplot2)
library(datasets)
# Load data
data(iris)
# Set seed to make results reproducible
set.seed(20)
# Implement k-means with 3 clusters
iris_cl <- kmeans(iris[, 3:4], 3, nstart = 20)
iris_cl$cluster <- as.factor(iris_cl$cluster)
# Plot points colored by predicted cluster
ggplot(iris, aes(Petal.Length, Petal.Width, color = iris_cl$cluster)) + geom_point()

From the result, we can see that the algorithm split the data into three clusters, indicated by three different colors. We can also observe that these clusters were formed based on the size of the petals. More specifically, the red cluster includes flowers with small petals, the green cluster, iris with relatively large petals, while the blue one, iris with medium sized petals.
从结果中,我们可以看到,该算法将数据分成三个组,由三种不同的颜色表示。我们也可以观察到这三个组是根据花瓣的大小分的。更具体地说,红点代表小花瓣的花,绿点代表大花瓣的花,蓝点代表中等大小的花瓣的花。
在这里插入图片描述
What is important to note at this point is that in any clustering, the interpretation of the formed groups requires some expert knowledge in the field. In our case, if you are not a botanist, you probably won’t realize that what the k-means managed to do is cluster the iris into their different types, i.e. Setosa, Versicolor, and Virginica without having any knowledge about them!
注意的是,在任何聚类中,对分组的解释都需要在领域中的一些专业知识。在上一个例子中,如果你不是一个植物学家,你可能不会知道,k-均值做的是用花瓣大小给花分组,与 Setosa、 Versicolo r和 Virginica 的区别无关!

So if we plot the data again, this time colored by their species, we will see the similarity in the clusters.
因此,如果我们再次绘制数据,这一次由它们的物种着色,我们将看到集群中的相似性。

在这里插入图片描述

Summary
总结
We have come a long way from where we began. We’ve talked about regression (both linear and logistic), decision trees, and finally, k-means clustering. We also set up some simple, yet powerful, implementations of each one of these methods in R.
我们从一开始就走了很长的路。我们已经谈到回归(线性和逻辑)、决策树,以及最后的 K-均值聚类。我们还在R中为其中的每一个方法建立了一些简单而强大的实现。

So, what are the advantages of each algorithm over the others? Which one should you choose for a real-life problem?
那么,每种算法的优势是什么呢? 在处理现实生活中的问题时你该选择哪一个呢?

Firstly, the presented methods are not some toy algorithms — they are widely used in production systems around the world, and so depending on the task, they can be pretty powerful.
首先,这里所提出的方法都是在实际操作中被验证为行之有效的算法 - 它们在世界各地的产品系统中被广泛使用,所以根据任务情况选用,能发挥十分强大的作用。

Secondly, in order to answer the above question, you have to be clear what exactly you mean by advantages, as the relative benefits of each method can be expressed in many contexts, i.e. interpretability, robustness, computational time, etc.
其次,为了回答上述问题,你必须明确你所说的优势究竟意味着什么,因为每个方法的相对优势在不同情况下的呈现不同,比如可解释性、鲁棒性、计算时间等等。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值