Machine Learning with sklearn: K-Nearest Neighbors (KNN)

Latest recommended article published 2025-03-24 15:55:06


Tags: #machine-learning #sklearn #artificial-intelligence

Finding an Observation's Nearest Neighbors

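The k-nearest neighbors (KNN) algorithm identifies the nearest neighbors of a given test point so that a class label can be assigned to that point. The choice of distance metric therefore shapes the decision boundaries, which partition the feature space into regions, one per class.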
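To make the idea concrete, here is a minimal brute-force sketch of KNN classification in NumPy. The toy data, the labels, and the helper name `knn_predict` are hypothetical and chosen only for illustration; this is not how scikit-learn implements the search internally.

```python
import numpy as np
from collections import Counter

# Hypothetical toy training set: two well-separated 2-D clusters
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x, X, y, k=3):
    """Assign x the majority label among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

label_a = knn_predict(np.array([0.5, 0.5]), X_train, y_train)
label_b = knn_predict(np.array([5.5, 5.5]), X_train, y_train)
print(label_a, label_b)
```

Each query point receives the label of the cluster it sits in, because all three of its nearest neighbors belong to that cluster.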

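To find the k observations (neighbors) closest to a given observation, use scikit-learn's NearestNeighbors class. scikit-learn provides several distance metrics; by default, NearestNeighbors uses the Minkowski distance:

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

where x_i and y_i are the i-th components of the two observations whose distance we are computing.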

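In fact, the Minkowski distance generalizes several distance formulas: the Manhattan and Euclidean distances are special cases, and the Chebyshev distance is its limit as p approaches infinity.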

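When the Minkowski hyperparameter p = 1, it is the Manhattan distance:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

When p = 2, it is the Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

By default, scikit-learn uses p = 2.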
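The special cases can be checked numerically. The sketch below assumes a hand-written helper, `minkowski_distance` (a hypothetical name, not a scikit-learn function), and two made-up points whose coordinate gaps are 3 and 4.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))

# Hypothetical points with coordinate gaps of 3 and 4
x = [0.0, 0.0]
y = [3.0, 4.0]

d_manhattan = minkowski_distance(x, y, p=1)   # |3| + |4|
d_euclidean = minkowski_distance(x, y, p=2)   # sqrt(3^2 + 4^2)
# Chebyshev distance is the limit p -> infinity: the largest coordinate gap
d_chebyshev = float(np.max(np.abs(np.asarray(x) - np.asarray(y))))
d_large_p = minkowski_distance(x, y, p=50)    # a large p approaches Chebyshev

print(d_manhattan, d_euclidean, d_chebyshev, d_large_p)
```

As p grows, the largest coordinate difference dominates the sum, which is why the p = 50 value lands very close to the Chebyshev distance.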

The following example uses the Iris dataset and NearestNeighbors to find the two points closest to a new observation, new_observation, in the standardized feature space:

# Load libraries  
from sklearn import datasets 
from sklearn.neighbors import NearestNeighbors 
from sklearn.preprocessing import StandardScaler  

# Load data  
iris = datasets.load_iris() 
features = iris.data

# Create standardizer  
standardizer = StandardScaler()  

# Standardize features  
features_standardized = standardizer.fit_transform(features)  

# Two nearest neighbors  
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)  

# Create an observation (its values are interpreted in the standardized feature space)
new_observation = [1, 1, 1, 1]

# Find distances and indices of the observation's nearest neighbors 
distances, indices = nearest_neighbors.kneighbors([new_observation])  

# View the nearest neighbors 
features_standardized[indices]

# View distances 
distances
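The observation above is given directly in standardized units. If a new observation is instead given in raw measurement units, one reasonable approach is to pass it through the same fitted scaler before querying. The sketch below assumes this workflow; `raw_observation` is a hypothetical input (a copy of the first Iris row), and the manual recomputation is only a sanity check that the returned distances are Euclidean.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.data

standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)

# Hypothetical raw measurement (here: a copy of the first iris row)
raw_observation = features[0].copy()

# Apply the same scaling that was fitted on the training features
raw_standardized = standardizer.transform([raw_observation])

# Query the two nearest neighbors in the standardized space
distances, indices = nearest_neighbors.kneighbors(raw_standardized)

# Recompute the distances by hand to confirm they are Euclidean (p = 2)
neighbors = features_standardized[indices[0]]
manual = np.sqrt(((neighbors - raw_standardized[0]) ** 2).sum(axis=1))

print(indices, distances, manual)
```

Because the query point is itself a row of the training data, its first neighbor is at distance zero (up to floating-point error).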

The distance metric can also be set with the metric parameter, for example to use the Euclidean distance explicitly:

# Find two nearest neighbors based on Euclidean distance
nearestneighbors_euclidean = NearestNeighbors(
    n_neighbors=2, metric='euclidean').fit(features_standardized)
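To see how the metric setting affects the result, the sketch below queries the same observation under several settings. The helper `two_nearest` is a hypothetical convenience wrapper; the metric names 'euclidean', 'manhattan', and 'minkowski' are ones scikit-learn accepts.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(datasets.load_iris().data)
new_observation = [1, 1, 1, 1]

def two_nearest(**kwargs):
    """Fit NearestNeighbors with the given settings and query new_observation."""
    model = NearestNeighbors(n_neighbors=2, **kwargs).fit(features_standardized)
    return model.kneighbors([new_observation])

dist_default, idx_default = two_nearest()                     # minkowski with p=2
dist_euclid, idx_euclid = two_nearest(metric='euclidean')
dist_manhattan, idx_manhattan = two_nearest(metric='manhattan')
dist_p1, idx_p1 = two_nearest(metric='minkowski', p=1)

print(dist_default, dist_euclid)
print(dist_manhattan, dist_p1)
```

The default and 'euclidean' settings give the same distances, as do 'manhattan' and Minkowski with p = 1, which matches the special cases described earlier.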