20 Popular Machine Learning Metrics. Part 2: Ranking & Statistical Metrics

This article continues a look at 20 common machine learning metrics, focusing in this second part on ranking and statistical metrics, including how they are used to evaluate model performance and why they matter.



Introduction

In the first part of this post, I provided an introduction to 10 metrics used for evaluating classification and regression models. In this part, I am going to provide an introduction to the metrics used for evaluating models developed for ranking (AKA learning to rank), as well as metrics for statistical models. In particular, I will cover the below 5 metrics:

  • Mean reciprocal rank (MRR)

  • Precision at k

  • DCG and NDCG (normalized discounted cumulative gain)

  • Pearson correlation coefficient

  • Coefficient of determination (R²)

Ranking Related Metrics

Ranking is a fundamental problem in machine learning, which tries to rank a list of items based on their relevance in a particular task (e.g. ranking pages on Google based on their relevance to a given query). It has a wide range of applications in E-commerce and search engines, such as:

  • Movie recommendation (as in Netflix and YouTube),

  • Page ranking on Google,

  • Ranking E-commerce products on Amazon,

  • Query auto-completion,

  • Image search on Vimeo,

  • Hotel search on Expedia/Booking.

In the learning-to-rank problem, the model tries to predict the rank (or relative order) of a list of items for a given task¹. The algorithms for the ranking problem can be grouped into:

  • Point-wise models: which try to predict a (matching) score for each query-document pair in the dataset, and use it for ranking the items.

  • Pair-wise models: which try to learn a binary classifier that can tell which document is more relevant to a query, given a pair of documents.

  • List-wise models: which try to directly optimize the value of one of the ranking evaluation measures (such as those discussed below), averaged over all queries in the training data.

During evaluation, given the ground-truth order of the list of items for several queries, we want to know how good the predicted order of those lists is.

There are various metrics proposed for evaluating ranking problems, such as:

  • MRR
  • Precision@k
  • DCG & NDCG
  • MAP
  • Kendall’s tau
  • Spearman’s rho

In this post, we focus on the first 3 metrics above, which are the most popular metrics for ranking problems.

Some of these metrics may be very trivial, but I decided to cover them for the sake of completeness. So feel free to skip over the ones you are familiar with. Without further ado, let’s begin our journey.

11- MRR

Mean reciprocal rank (MRR) is one of the simplest metrics for evaluating ranking models. MRR is essentially the average of the reciprocal ranks of “the first relevant item” for a set of queries Q, and is defined as:

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}

where rank_i denotes the position of the first relevant item for the i-th query.

To illustrate this, let’s consider the below example, in which the model is trying to predict the plural form of English words by making 3 guesses for each word. In each case, the correct answer is also given.

(Table: for each sample word, the model’s three ranked guesses and the correct plural form.)

The MRR of this system can be found as:

MRR = 1/3 × (1/2 + 1/3 + 1/1) = 11/18
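
To make this concrete, here is a minimal Python sketch of the same computation. The function name and the `first_relevant_ranks` argument are illustrative, not from the original article; each entry is assumed to be the 1-based position of the first relevant item for one query (None if no relevant item was returned).

```python
# Minimal sketch: mean reciprocal rank over a set of queries.
def mean_reciprocal_rank(first_relevant_ranks):
    reciprocal_ranks = [
        0.0 if rank is None else 1.0 / rank
        for rank in first_relevant_ranks
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# The plural-form example above: first correct guesses at positions 2, 3 and 1.
print(mean_reciprocal_rank([2, 3, 1]))  # 0.6111... = 11/18
```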

One of the limitations of MRR is that it only takes the rank of one of the items (the most relevant one) into account, and ignores other items (for example, mediums as the plural form of medium is ignored). This may not be a good metric for cases where we want to browse a list of related items.

12- Precision at k

Precision at k (P@k) is another popular metric, which is defined as “the number of relevant documents among the top k documents”, divided by k:

P@k = \frac{\text{number of relevant documents among the top } k \text{ results}}{k}

As an example, if you search for “hand sanitizer” on Google, and on the first page, 8 out of 10 links are relevant to hand sanitizer, then the P@10 for this query equals 0.8.

Now, to find the precision at k for a set of queries Q, you can take the average value of P@k over all queries in Q.

P@k has several limitations. Most importantly, it fails to take into account the positions of the relevant documents among the top k. Also, it is easy to evaluate the model manually in this case, since only the top k results need to be examined to determine if they are relevant or not.

Note that recall@k is another popular metric, which can be defined in a very similar way (the fraction of all relevant documents that appear in the top k results).
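
As a rough sketch of both metrics for a single query, assuming binary relevance (the function and variable names below are illustrative, not from the original article):

```python
# Minimal sketch: precision@k and recall@k for a single query.
# `retrieved` is the ranked list of returned document ids;
# `relevant` is the set of ids that are actually relevant to the query.
def precision_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

# Toy example: 8 of the top 10 results are relevant -> P@10 = 0.8.
retrieved = [f"doc{i}" for i in range(10)]
relevant = {f"doc{i}" for i in range(8)} | {"doc42", "doc99"}
print(precision_at_k(retrieved, relevant, k=10))  # 0.8
print(recall_at_k(retrieved, relevant, k=10))     # 0.8 (8 of the 10 relevant docs retrieved)
```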

13- DCG and NDCG

Normalized Discounted Cumulative Gain (NDCG) is perhaps the most popular metric for evaluating learning-to-rank systems. In contrast to the previous metrics, NDCG takes the order and relative importance of the documents into account, and values putting highly relevant documents high up in the recommended lists.

Before giving the formal definition of NDCG, let’s first introduce two relevant metrics, Cumulative Gain (CG) and Discounted Cumulative Gain (DCG).

Cumulative Gain (CG) of a set of retrieved documents is the sum of their relevance scores to the query, and is defined as below:

CG_p = \sum_{i=1}^{p} rel_i

where rel_i is the graded relevance of the result at position i, and p is the rank position up to which the gain is accumulated.

Here we assume that the relevance score of each document to a query is given (otherwise it is usually set to a constant value).

Discounted Cumulative Gain (DCG) is essentially the weighted version of CG, in which a logarithmic reduction factor is used to discount the relevance scores proportionally to the position of the results. This is useful, as in practice we want to give higher priority to the first few items (than to later ones) when analyzing the performance of a system. DCG is defined as:

DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}

There is also another version of DCG, which puts stronger emphasis on highly relevant documents:

DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}

Normalized Discounted Cumulative Gain (NDCG) tries to further enhance DCG to better suit real-world applications. Since the retrieved set of items may vary in size among different queries or systems, NDCG tries to compare the performance using the normalized version of DCG (by dividing it by the DCG of the ideal system). In other words, it sorts the documents of a result list by relevance, finds the highest DCG (achieved by an ideal system) at position p, and uses it to normalize DCG as:

NDCG_p = \frac{DCG_p}{IDCG_p}

where the IDCG is the “ideal discounted cumulative gain”, and is defined as below:

IDCG_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}

with REL_p denoting the list of relevant documents, ordered by relevance, up to position p.
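
A minimal sketch of these definitions, using the 2^rel − 1 gain form from above; the relevance scores in the toy example are made up, and the ideal ordering is approximated here by sorting the retrieved list itself (strictly, IDCG is computed over all relevant documents in the corpus).

```python
import math

# Minimal sketch: DCG and NDCG at position p, with (2^rel - 1) / log2(i + 1) terms.
def dcg_at_p(relevances, p):
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:p], start=1)
    )

def ndcg_at_p(relevances, p):
    ideal = sorted(relevances, reverse=True)  # ordering an ideal system would produce
    idcg = dcg_at_p(ideal, p)
    return dcg_at_p(relevances, p) / idcg if idcg > 0 else 0.0

# Hypothetical graded relevances of the documents, in the order the model ranked them.
predicted_order = [3, 2, 3, 0, 1, 2]
print(ndcg_at_p(predicted_order, p=6))  # < 1.0 here, since the predicted order is not ideal
```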

NDCG is a popular metric, but it has its own limitations too. One of its main limitations is that it does not penalize bad documents in the result. It may not be suitable for measuring the performance of queries that often have several equally good results (this is especially true when we are mainly interested in the first few results, as is done in practice).

Statistical Metrics

Although one can think of machine learning as applied statistics and therefore count all ML metrics as some kind of statistical metric, there are a few metrics which are mostly used by statisticians to evaluate the performance of statistical models. Some of the popular metrics here include: Pearson correlation coefficient, coefficient of determination (R²), Spearman’s rank correlation coefficient, p-value, and more². Here we briefly introduce the correlation coefficient and R-squared.

14- Pearson Correlation Coefficient

Pearson correlation coefficient is perhaps one of the most popular metrics in the whole statistics and machine learning area. Its application is so broad that it is used in almost every aspect of statistical modeling, from feature selection and dimensionality reduction, to regularization and model evaluation, and beyond³.

The correlation coefficient of two random variables (or any two vectors/matrices) shows their statistical dependence.

The linear correlation coefficient of two random variables X and Y is defined as below:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}

Here \mu and \sigma denote the mean and standard deviation of each variable, respectively.

In most cases the underlying statistical distribution of the variables is not known, and all we have is an N-sample of that random variable (you can think of it as an N-dimensional vector). In those cases, we can use the sample correlation coefficient of two N-dimensional vectors X and Y, as below:

r_{XY} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}}

where \bar{X} and \bar{Y} denote the sample means of X and Y.

The correlation coefficient of two variables is always a value in [-1, 1]. If two variables are independent, their correlation is 0; note that the converse does not hold in general (zero correlation does not imply independence, except in special cases such as jointly Gaussian variables).
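
As a quick sketch, the sample correlation coefficient can be computed directly from two NumPy arrays; the toy data below is illustrative, and np.corrcoef is shown only as a cross-check of the hand-rolled formula.

```python
import numpy as np

# Minimal sketch: sample Pearson correlation coefficient of two N-dimensional vectors.
def pearson_corr(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_c, y_c = x - x.mean(), y - y.mean()
    return (x_c @ y_c) / (np.sqrt((x_c ** 2).sum()) * np.sqrt((y_c ** 2).sum()))

# Toy data: y is a noisy increasing function of x, so the correlation is close to +1.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=1.0, size=x.shape)

print(pearson_corr(x, y))
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value
```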

15- Coefficient of Determination (R²)

The coefficient of determination, or R², is formally defined as the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

To better understand what this means, let’s assume a dataset has N samples with corresponding target values y_1, y_2, …, y_N, and that the corresponding values predicted by our model for these samples are f_1, f_2, …, f_N.

Now we can define the below terms that are going to be used to calculate R²:

  • The mean of observed data:

    \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i

  • The total sum of squares (proportional to the variance of the data):

    SS_{tot} = \sum_{i=1}^{N} (y_i - \bar{y})^2

  • The sum of squares of residuals, also called the residual sum of squares:

    SS_{res} = \sum_{i=1}^{N} (y_i - f_i)^2

Then the most general definition of R² can be written as below:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

In the best case, the modeled values exactly match the observed values, which results in R² = 1. A model which always predicts the mean value of the observed data would have an R² = 0.
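
Putting the three terms together, here is a minimal sketch of the computation (the y_true / y_pred values below are illustrative):

```python
import numpy as np

# Minimal sketch: coefficient of determination, R^2 = 1 - SS_res / SS_tot.
def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(r_squared(y_true, y_pred))                           # ~0.949
print(r_squared(y_true, [np.mean(y_true)] * len(y_true)))  # 0.0: always predicting the mean
```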

Summary

In this post, I provided an introduction to 5 popular metrics used for evaluating the performance of ranking and statistical models. In the next part of this post, I am going to provide an introduction to 5 more advanced metrics used for assessing the performance of Computer Vision, NLP, and Deep Learning models.

Translated from: https://towardsdatascience.com/20-popular-machine-learning-metrics-part-2-ranking-statistical-metrics-22c3e5a937b6

