In this blog, we’ll talk about one of the most widely used machine learning algorithms for classification in real-world applications: the K-Nearest Neighbors (K-NN) algorithm. K-NN is simple, easy to understand, versatile, and finds applications in a variety of fields.
Contents
- Imbalanced and balanced datasets
- Multi-class classification
- K-NN with a given distance (or similarity) matrix
- Train and test set differences
- Impact of outliers
- Scale and column standardization
- Model interpretability
- Feature importance
- Categorical features
- Missing value imputation
- Curse of dimensionality
- Bias-variance trade-off
To know how K-NN works, please read our previous blog here.
1. Case of Imbalanced vs. Balanced Datasets
First, what is an imbalanced dataset?
Consider two-class classification: if there is a very large difference between the number of positive-class and negative-class points, we say the dataset is an imbalanced dataset.

If the number of positive and negative points in the given dataset is approximately the same, we say the dataset is a balanced dataset.

K-NN is heavily impacted by an imbalanced dataset: when we take the majority vote, the prediction is sometimes dominated by the majority class.
How to work around an imbalanced dataset?
Imbalanced data is not always a bad thing, and in real data sets, there is always some degree of imbalance. That said, there should not be any big impact on your model performance if the level of imbalance is relatively low.
Now, let’s cover a few techniques to solve the class imbalance problem.
Under-Sampling
Let’s assume we have a dataset N with 1000 data points and two classes, n1 and n2, corresponding to positive and negative reviews. Here n1 is the positive class with 900 data points and n2 is the negative class with 100 data points, so n1 is the majority class (it has far more data points) and n2 is the minority class. To handle this imbalanced dataset, we keep all 100 n2 data points as they are and randomly pick 100 data points from n1 to form a new set called m. This sampling trick is called under-sampling.
Instead of using n1 and n2, we use m and n2 for modeling.

But with this approach we throw away data and lose information, which is not a good idea. To address this drawback of under-sampling, we introduce another method called over-sampling.
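Below is a minimal sketch of random under-sampling with NumPy, assuming the features and 0/1 labels live in arrays `X` and `y`; the array names and the `random_undersample` helper are illustrative, not from the original blog.

```python
import numpy as np

def random_undersample(X, y, random_state=42):
    """Keep all minority-class points and an equal-sized random sample of the majority class."""
    rng = np.random.default_rng(random_state)
    majority_idx = np.where(y == 1)[0]   # e.g. the 900 positive points
    minority_idx = np.where(y == 0)[0]   # e.g. the 100 negative points
    # the set "m" from above: a random subset of the majority class
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([sampled_majority, minority_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]
```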
Over-Sampling
This technique modifies the unequal class distribution to create a balanced dataset. When the quantity of minority-class data is insufficient, over-sampling tries to balance the classes by increasing the number of rare samples.
Over-sampling increases the number of minority class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept.
Over-sampling reduces the domination of one class over the dataset.

Instead of repeating points, we can also create artificial (synthetic) points in the minority-class region by randomly sampling characteristics from occurrences in the minority class; this technique is called the Synthetic Minority Over-sampling Technique (SMOTE).
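A hedged sketch of both flavors of over-sampling, assuming the third-party imbalanced-learn (imblearn) package is available; the synthetic imbalanced data is only for illustration.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.datasets import make_classification

# a toy 90/10 imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# plain over-sampling: repeat minority-class points until the classes are balanced
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: interpolate between a minority point and its nearest minority-class neighbours
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
```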
Note that a ‘dumb model’ can reach high accuracy on imbalanced data, so we can’t use accuracy as a performance measure when we have an imbalanced dataset.
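A quick illustration of that point, using scikit-learn's DummyClassifier as the 'dumb model' that always predicts the majority class; the toy dataset is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

dumb = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print(dumb.score(X_te, y_te))  # around 0.9 accuracy without ever predicting the minority class
```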
2. Multi-Class Classification
Consider the MNIST dataset restricted to the class labels Y ∈ {0, 1}; this is called binary classification.
A classification task with more than two classes is called multi-class classification. Consider the MNIST dataset with class labels Y ∈ {0, 1, 2, 3, 4, 5}.

K-NN extends easily to a multi-class classifier because it just takes the majority vote among the neighbors.
But in machine learning, some classification algorithms, such as logistic regression, do not extend to multi-class classification.
As such, they cannot be used for multi-class classification tasks, at least not directly.
Instead, heuristic methods can be used to split a multi-class classification problem into multiple binary classification datasets and train a binary classification model on each. One such technique is One-vs-Rest.
One-vs-rest is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.
For example, consider a multi-class classification problem with examples for each of the classes ‘red’, ‘blue’, and ‘green’. This could be divided into three binary classification datasets as follows:
Binary Classification Problem 1: red vs [blue, green]
Binary Classification Problem 2: blue vs [red, green]
Binary Classification Problem 3: green vs [red, blue]
A possible downside of this approach is that it requires one model to be created for each class. For example, three classes require three models. This could be an issue for large datasets.
To know more about One-vs-Rest in scikit-learn, visit here.
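As a quick illustration, here is a minimal One-vs-Rest sketch with scikit-learn; the three-class toy data stands in for the red/blue/green example, and logistic regression is the binary base model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 3 binary models, one per class
print(ovr.predict(X[:5]))    # each prediction comes from the most confident binary model
```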
3. K-NN with a Given Distance (or Similarity) Matrix
Distance-based classification is one of the popular families of methods for classifying instances using point-to-point distances, as in the k-nearest neighbor (k-NN) approach.
The representation of distance measure can be one of the various measures available (e.g. Euclidean distance, Manhattan distance, Mahalanobis distance, or other specific distance measures).

Even if, instead of the raw data and labels, someone only gives us the similarity between two products (or the distance between two vectors), K-NN still works very well, because internally K-NN only needs the distances between points.
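A hedged sketch of this idea in scikit-learn using a precomputed distance matrix; here plain Euclidean distances stand in for whatever similarity or distance you are actually given.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

D_train = pairwise_distances(X_train, X_train)  # (n_train, n_train) distances
D_test = pairwise_distances(X_test, X_train)    # (n_test, n_train) distances

knn = KNeighborsClassifier(n_neighbors=5, metric="precomputed").fit(D_train, y_train)
print(knn.score(D_test, y_test))
```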
4. Train and Test Set Differences
When the data changes over time, the distributions of the train and test sets can differ. If the distributions of the train and test sets are different, the model can’t give good results.
We want to check the distribution of the train and test set before we build a model.
But how can we know whether the train and test sets have different distributions?
Consider our dataset split into train and test sets, both containing x and y, where x is a data point and y is its label.
To compare the distributions of the train and test sets, we create a new dataset from our existing dataset.
In the new train set, each row is x’ = concat(x, y) with label y’ = 1, and in the new test set each row is x’ = concat(x, y) with label y’ = 0. To this new dataset we apply a binary classifier such as K-NN, and interpret its accuracy according to the cases below (a code sketch of the whole check appears at the end of this section).
Case 1:
If the model gets low accuracy, the new train and test points overlap almost completely, so the distributions of the train and test sets are very similar.

Case 2:
If the model gets medium accuracy, the train and test points overlap less, so the distributions of the train and test sets are not very similar.

Case 3:
If the model gets high accuracy, the train and test points barely overlap, so the distributions of the train and test sets are very different.

If the train and test sets come from the same distribution, we are fine; if instead the features change over time, we can design new features.
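A minimal sketch of the check described above (sometimes called adversarial validation): label original train rows as 1 and test rows as 0, then see how well K-NN can tell them apart. The function name and the choice of 5-fold cross-validation are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def distribution_check(X_train, y_train, X_test, y_test, k=5):
    # x' = concat(x, y), with y' = 1 for train rows and y' = 0 for test rows
    Xp = np.vstack([np.column_stack([X_train, y_train]),
                    np.column_stack([X_test, y_test])])
    yp = np.concatenate([np.ones(len(X_train)), np.zeros(len(X_test))])
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), Xp, yp, cv=5).mean()
    return acc  # near 0.5: similar distributions; near 1.0: very different distributions
```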
5. Impact of Outliers
The model can be easily understood by seeing the decision surface.

Consider the image above: with K = 1 (1-NN), a single outlier changes the decision surface. When K is small, outliers have a larger impact on the model; when K is large, the model is less prone to the influence of outliers.
Techniques to remove outliers in K-NN
Local Outlier Factor (LOF): The local outlier factor is based on a concept of a local density, where locality is given by k nearest neighbors, whose distance is used to estimate the density.
To understand LOF, let’s look at some basic definitions.
K-distance(Xi): the distance from Xi to its k-th nearest neighbor.
Neighborhood of Xi: the set of all points that belong to the k nearest neighbors of Xi.

For example, if K = 5, the neighborhood of Xi is the set of all points that belong to the 5-NN of Xi, i.e., {x1, x2, x3, x4, x5}.
Reachability-distance (Xi, Xj):
reach-dist(Xi, Xj) = max{k-distance(Xj), dist(Xi, Xj)}
Basically, if point Xi is within the k neighbors of point Xj, the reach-dist(Xi, Xj) will be the k-distance of Xj. Otherwise, it will be the real distance between Xi and Xj. This is just a “smoothing factor”. For simplicity, consider this the usual distance between two points.
Local reachability density (LRD): To get the LRD of a point Xi, we first calculate the reachability distance from Xi to each of its k nearest neighbors and take the average of those values. The LRD is then simply the inverse of that average reachability distance.

Intuitively, the local reachability density tells us how far we have to travel from our point to reach the next point or cluster of points. The lower it is, the less dense the region is, and the farther we have to travel.
Local Outlier Factor (LOF): LOF is basically the average LRD of the points in the neighborhood of Xi multiplied by the inverse of the LRD of Xi.
This means that if the density around Xi’s neighbors is large while the density around Xi itself is small, then Xi is considered an outlier.

When we apply LOF, a point with a large LOF value is considered an outlier; otherwise it is an inlier.
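A hedged sketch of removing outliers with scikit-learn's LocalOutlierFactor before fitting K-NN; the toy data and the 10% contamination value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import LocalOutlierFactor

X, y = make_classification(n_samples=500, random_state=0)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X)               # -1 = outlier (large LOF), 1 = inlier
X_clean, y_clean = X[labels == 1], y[labels == 1]
print(len(X), "->", len(X_clean), "points after removing outliers")
```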
6. Scale and Column Standardization
All distance-based algorithms are affected by the scale of the variables, and K-NN is a distance-based algorithm that classifies data based on proximity to the K neighbors. We often find that the features of our data are not on the same scale (or in the same units).
An example is when we have the features age and height. Obviously these two features have different units: age is in years and height is in centimeters.
This unit difference causes distance-based algorithms such as K-NN to perform sub-optimally, so it is necessary to rescale features that have different units onto the same scale. Many methods can be used for rescaling features; here we will discuss two of them, Min-Max scaling and standard scaling.
Before we build the model, we need to rescale (or standardize) the features.
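A minimal sketch of the two rescaling options in scikit-learn, fit on the training split only and then applied to both splits; the toy data is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Min-Max scaling: squashes each feature into [0, 1]
mm = MinMaxScaler().fit(X_train)
X_train_mm, X_test_mm = mm.transform(X_train), mm.transform(X_test)

# Standard scaling (column standardization): zero mean, unit variance per feature
ss = StandardScaler().fit(X_train)
X_train_ss, X_test_ss = ss.transform(X_train), ss.transform(X_test)
```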
7. Model Interpretability
A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model.
A model that can explain its results is called an interpretable model.

K-NN is interpretable when the dimensionality d is small and k is small.
8. Feature Importance
Feature importance tells us which features matter most in our model, but K-NN does not provide feature importances internally.
To know which features are important, we can apply feature-selection techniques. Two popular members of the stepwise family are the forward selection and backward selection (also known as backward elimination) algorithms.
Forward Selection: The procedure starts with an empty set of features. The best of the original features is determined and added to the reduced set. At each subsequent iteration, the best of the remaining original attributes is added to the set.
- First, the best single feature is selected (i.e., using some criterion function like accuracy, AUC, etc).
- Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.
- Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.
- This procedure continues until a predefined number of features is selected.
Backward Elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set. Given data that already has some features, this technique tries to remove one feature in each iteration based on some performance metric.
- First, the criterion function is computed for all n features.
- Then, each feature is deleted one at a time, the criterion function is computed for all subsets with n-1 features, and the worst feature is discarded (i.e., using some criterion function like accuracy, AUC, etc).
- Next, each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset with n-2 features.
- This procedure continues until a predefined number of features are left.
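Both procedures are available in scikit-learn (version 0.24 or later) as SequentialFeatureSelector; the sketch below wraps them around K-NN, and the target of 5 features is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

forward = SequentialFeatureSelector(knn, n_features_to_select=5,
                                    direction="forward", scoring="accuracy", cv=5).fit(X, y)
backward = SequentialFeatureSelector(knn, n_features_to_select=5,
                                     direction="backward", scoring="accuracy", cv=5).fit(X, y)
print(forward.get_support())   # boolean mask of features kept by forward selection
print(backward.get_support())  # boolean mask of features kept by backward elimination
```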
9. Categorical Features
In many practical Data Science activities, the data set will contain categorical variables. These variables are typically stored as text values that represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country).
Label Encoding: Label encoders transform non-numerical labels into numerical labels. Each category is assigned a unique label starting from 0 and going up to n_categories - 1 per feature. Label encoders are suitable for encoding variables where the alphabetical order or numerical value of the labels carries meaning. However, if you have nominal data, using label encoders may not be such a good idea.

OneHot encoding: One-hot encoding is the most widely used encoding scheme. It works by creating a column for each category present in the feature and assigning a 1 or 0 to indicate the presence of a category in the data.

Binary encoding: Binary encoding is not as intuitive as the above two approaches. Binary encoding works like this:
- The categories are encoded as ordinal; for example, categories like red, yellow, green are assigned labels 1, 2, 3 (let’s assume).
- These integers are then converted into binary code, so for example 1 becomes 001 and 2 becomes 010, and so on.

Binary encoding is good for high cardinality data as it creates very few new columns. Most similar values overlap with each other across many of the new columns. This allows many machine learning algorithms to learn the similarity of the values.
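A hedged sketch of the three encodings on a toy colour column; OrdinalEncoder and pd.get_dummies come from scikit-learn and pandas, while BinaryEncoder assumes the third-party category_encoders package is installed.

```python
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"color": ["Red", "Yellow", "Blue", "Red"]})

# Label / ordinal encoding: each category becomes an integer 0..n_categories-1
df["color_label"] = OrdinalEncoder().fit_transform(df[["color"]]).ravel()

# One-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(df["color"], prefix="color")

# Binary encoding: ordinal codes written out in binary, one column per bit
binary = ce.BinaryEncoder(cols=["color"]).fit_transform(df[["color"]])

print(pd.concat([df, onehot], axis=1))
print(binary)
```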
If text features are present in the given dataset, use natural language processing techniques like Bag of Words, TF-IDF, or Word2vec.
To know more about categorical features, visit here.
10. Missing Value Imputation
Missing values occur in our dataset due to data-collection errors or corrupted data.

How to work around missing values
Imputation techniques: impute with the mean, median, or mode of the given data.
Imputation by class label: if a positive-class point has a missing value, impute it using the mean of the positive class only; if a negative-class point has a missing value, impute it using the mean of the negative class only.
Model-based imputation: To prepare a dataset for machine learning we need to fix missing values, and we can fix missing values by applying machine learning to that dataset! If we consider a column with missing data as our target variable, and existing columns with complete data as our predictor variables, then we may construct a machine learning model using complete records as our train and test datasets and the records with incomplete data as our generalization target.
This is a fully scoped-out machine learning problem. K-NN is often used for model-based imputation because of its nearest-neighbors strategy.
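A minimal sketch of the simple and model-based options in scikit-learn: SimpleImputer for mean/median imputation and KNNImputer for the nearest-neighbor strategy mentioned above; the tiny array is only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))    # column-mean imputation
print(SimpleImputer(strategy="median").fit_transform(X))  # column-median imputation
print(KNNImputer(n_neighbors=2).fit_transform(X))         # each gap filled from its 2 nearest rows
```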
11. Curse of Dimensionality
In machine learning, it’s important to know that as dimensionality increases, the number of data points needed to build a good classification model increases exponentially.

Hughes phenomenon: when the size of the dataset is fixed, performance decreases as dimensionality increases.
Distance functions [Euclidean distance]: our intuition about distance in 3-D does not hold in high-dimensional spaces.
As dimensionality increases, careful choice of the number of dimensions (features) to use is the prerogative of the data scientist training the model. In general, the smaller the training set, the fewer features she should use, keeping in mind that each additional feature increases the data requirement exponentially.

As dimensionality increases (see the image above), dist_max(Xi) ≈ dist_min(Xi), i.e., the farthest and nearest neighbors become almost equally distant.
In K-NN, when the dimensionality d is high, Euclidean distance is not a good choice of distance measure; use cosine distance instead in high-dimensional spaces.
When the dimensionality d is high and the data points are dense, the impact of dimensionality is high; when the data points are sparse, the impact of dimensionality is lower.
As dimensionality increases, the chance of overfitting the model also increases.
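A hedged sketch of switching K-NN to cosine distance for high-dimensional data in scikit-learn, which requires the brute-force neighbor search; the synthetic 300-dimensional data is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=300, n_informative=50, random_state=0)

knn_euclidean = KNeighborsClassifier(n_neighbors=5)  # default Minkowski/Euclidean distance
knn_cosine = KNeighborsClassifier(n_neighbors=5, metric="cosine", algorithm="brute")

print(cross_val_score(knn_euclidean, X, y, cv=5).mean())
print(cross_val_score(knn_cosine, X, y, cv=5).mean())
```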
12. Bias-Variance Trade-Off
In machine learning theory, the bias-variance trade-off is the mathematical way to know whether a model is underfitting or overfitting.
A model is good when its error on future unseen data is low, and that error is given by:
Generalization Error = Bias² + Variance + Irreducible Error
The generalization error is the error of the model on future unseen data. The bias error is due to underfitting, the variance error is due to overfitting, and the irreducible error is the error that we cannot reduce further for the given model.
High bias means underfitting: error due to overly simple assumptions about the model.
High variance means overfitting: it measures how much the model changes as the training data changes; small changes in the data result in a very different model and a very different decision surface.
For a good model we want low generalization error: no underfitting, no overfitting, and only some amount of irreducible error.
↓ Generalization Error = ↓ Bias² + ↓ Variance + Irreducible Error

High bias (underfitting): when the train error increases, the bias also increases.
High variance (overfitting): when the test error increases while the train error decreases, the variance increases.
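For K-NN, the knob that moves us along this trade-off is K itself: very small K tends toward high variance, very large K toward high bias. A minimal sketch of picking K by cross-validation follows; the toy data and the search range are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 52, 2))},
                      cv=5, scoring="accuracy").fit(X, y)
print(search.best_params_)  # the K that balances bias and variance on this data
print(search.best_score_)
```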