Machine Learning - Unsupervised Learning - K-means

K-means

content: Data Science, Machine Learning, Data Analyst

k-means clustering

  • Finds clusters of samples
  • The number of clusters must be specified
  • Implemented in sklearn

Reference

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

Strengths
  • Simple and fast; can be applied to high-dimensional, large datasets
  • Easy to implement

Weaknesses

  • Need to choose k (the number of clusters)
  • Sensitive to outliers
  • Prone to local minima, with no guarantee of an optimal solution (local optima)
    • It may produce different results when run repeatedly on the same dataset
  • Difficult to guess the correct “k”
  • Not suitable for discovering clusters with non-convex shapes (as illustrated in the sketch below)
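
A minimal sketch of the last weakness, assuming scikit-learn is available: k-means applied to the two interleaving half-moons from make_moons, where the non-convex cluster shape leads to a poor partition.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moon labels is far from perfect because the
# clusters are non-convex; k-means roughly cuts each moon in half.
print("Adjusted Rand index:", adjusted_rand_score(y_true, labels))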

Partitioning around medoid (PAM)

  • Partitioning around medoids (PAM, CLARA or CLARANS) is a robust version of the K-means algorithm.
  • Both algorithms attempt to minimize the squared error (i.e., the cost function), but the K-medoids algorithm is more robust to noise than the K-means algorithm.
  • This algorithm uses compactness as the clustering criterion instead of connectivity
    • Not suitable for clustering non-spherical (arbitrarily shaped) groups of objects
  • A disadvantage of PAM is that it may obtain different results for different runs on the same dataset (as the first k medoids are chosen randomly).

PAM is less sensitive to outliers than other partitioning algorithms.
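
Core scikit-learn does not ship PAM. A hedged sketch using the optional scikit-learn-extra package (assuming it is installed via pip install scikit-learn-extra), whose KMedoids estimator offers a PAM-style method; the make_blobs data is only for illustration.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids  # optional package, not core scikit-learn

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmedoids = KMedoids(n_clusters=3, method='pam', random_state=0).fit(X)

# Unlike k-means centroids, medoids are actual data points from X.
print(kmedoids.cluster_centers_)   # the chosen medoids
print(kmedoids.labels_[:10])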

Theory

The **KMeans** algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields.

The k-means algorithm divides a set of $N$ samples $X$ into $K$ disjoint clusters $C$, each described by the mean $\mu_j$ of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from $X$, although they live in the same space.

The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert X_i - \mu_j \rVert^2 \right)$$
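
To make the criterion concrete, a small sketch (assuming scikit-learn and NumPy, with toy make_blobs data) that computes the within-cluster sum-of-squares by hand and compares it with the fitted model's inertia_ attribute:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# For every sample, take the squared distance to its nearest centroid and sum.
dists = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
manual_inertia = np.sum(np.min(dists, axis=1) ** 2)

print(manual_inertia, model.inertia_)   # the two values should match closely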

Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

  • Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.
  • Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal Component Analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations (see the sketch below).
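
Following the second point, a minimal sketch (assuming scikit-learn) that chains PCA and KMeans in a Pipeline on the digits dataset; the number of components and clusters are illustrative choices only:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)

# Reduce the 64 pixel features to 10 principal components, then cluster.
pipeline = make_pipeline(PCA(n_components=10),
                         KMeans(n_clusters=10, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)
print(labels[:20])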

K-means is often referred to as Lloyd’s algorithm:

  • It chooses the initial centroids, the most basic method being to choose k samples from the dataset X
  • After initialization, K-means consists of looping between two other steps:
    • assign each sample to its nearest centroid
    • create new centroids by taking the mean value of all of the samples assigned to each previous centroid
The difference between the old and the new centroids is computed, and the algorithm repeats these last two steps until this value is less than a threshold.
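
A bare-bones NumPy sketch of the loop described above (initial centroids drawn from the data, then repeated assignment and update steps); this is an illustration of the idea, not scikit-learn's implementation, and it does not handle empty clusters:

import numpy as np

def lloyd_kmeans(X, k, tol=1e-4, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    # 1. choose k samples from X as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # 2a. assign each sample to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 2b. recompute each centroid as the mean of the samples assigned to it
        #     (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centroids move less than the tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels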

K-means is equivalent to the expectation-maximization algorithm with a small, all-equal, diagonal covariance matrix

The algorithm can also be understood through the concept of Voronoi diagrams. First, the Voronoi diagram of the points is calculated using the current centroids. Each segment in the Voronoi diagram becomes a separate cluster. Secondly, the centroids are updated to the mean of each segment. The algorithm then repeats this until a stopping criterion is fulfilled. Usually, the algorithm stops when the relative decrease in the objective function between iterations is less than the given tolerance value. This is not the case in this implementation: iteration stops when centroids move less than the tolerance.

Given enough time, K-means will always converge; however, this may be to a local minimum. This is highly dependent on the initialization of the centroids. As a result, the computation is often done several times, with different initializations of the centroids. One method to help address this issue is the k-means++ initialization scheme, which has been implemented in scikit-learn (use the init='k-means++' parameter). This initializes the centroids to be (generally) distant from each other, leading to probably better results than random initialization, as shown in the reference.

K-means++ can also be called independently to select seeds for other clustering algorithms, see sklearn.cluster.kmeans_plusplus for details and example usage.
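
A short sketch of calling sklearn.cluster.kmeans_plusplus on its own to obtain seed points, as mentioned above (the make_blobs data is only for illustration):

from sklearn.datasets import make_blobs
from sklearn.cluster import kmeans_plusplus

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Returns the chosen centres and the indices of the samples used as seeds.
centers, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)
print(centers)
print(indices)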

Determining the parameter k

Both methods below compute the distance of each sample to its cluster centre, just in different ways. The second method is recommended: it uses the inertia produced during model fitting rather than a separate manual calculation, which avoids mistakes.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

K = range(1, 11)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data)
    # Compute each point's distance to the k centroids, take the minimum as its
    # distance to its own centroid, and average these distances over all points
    meandistortions.append(
        sum(
            np.min(cdist(data, kmeans.cluster_centers_, 'euclidean'), axis=1)
        ) / data.shape[0]
    )
# Plot the scree (elbow) plot
plt.plot(K, meandistortions, 'bx--')
plt.xlabel('k')
plt.show()
# Choose the point where the decrease starts to flatten out noticeably

An alternative method:

wcss = []  # initial list for the within-cluster sum-of-squares of each run
for i in range(1,11):
    kmeans_pca = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans_pca.fit(score_pca)  # fit k-means using the transformed data from PCA
    wcss.append(kmeans_pca.inertia_)
# wcss holds the results; choose the number of clusters from the visualisation
plt.figure(figsize = (10,18))
plt.plot(range(1,11), wcss, marker = 'o', linestyle = '--')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-means Clustering Combined with PCA')
plt.show()

sklearn.cluster.KMeans

from sklearn.cluster import KMeans
model = KMeans(n_clusters = n)  # init='k-means++' can speed up convergence
model.fit(data)
label = model.labels_  # get the cluster label of each sample
center = model.cluster_centers_  # centres of the clusters
print(label)

Parameters

  • n_clusters: int, default=8

    The number of clusters to form as well as the number of centroids to generate.

  • init: {‘k-means++’, ‘random’}, callable or array-like of shape (n_clusters, n_features), default=’k-means++’

    Method for initialization:

    • ‘k-means++’: selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence. The algorithm implemented is “greedy k-means++”. It differs from the vanilla k-means++ by making several trials at each sampling step and choosing the best centroid among them.
    • ‘random’: choose n_clusters observations (rows) at random from the data for the initial centroids.
    • If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
    • If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.
  • n_init: ‘auto’ or int, default=10

    Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Several runs are recommended for sparse high-dimensional problems (see Clustering sparse data with k-means).

  • random_state: int, RandomState instance or None, default=None

    Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. (A short construction sketch follows this list.)
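
Putting these parameters together, a minimal construction sketch (the values shown are illustrative only, not recommendations):

from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=5,        # number of clusters / centroids to form
    init='k-means++',    # greedy k-means++ seeding (the default)
    n_init=10,           # run 10 times with different seeds, keep the best inertia
    random_state=42,     # make the initialisation reproducible
)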

Attributes

  • cluster_centers_: ndarray of shape (n_clusters, n_features)

  • labels_: ndarray of shape (n_samples,)

    Labels of each point

  • inertia_: float (can be used to judge the number of clusters)

    Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.

  • n_iter_: int

    Number of iterations run

  • n_features_in_: int

    Number of features seen during fit.

  • feature_names_in_: ndarray of shape (n_features_in_,)

    Names of features seen during fit. Defined only when X has feature names that are all strings.

Methods

  • fit(X[, y, sample_weight]): Compute k-means clustering.
  • fit_predict(X[, y, sample_weight]): Compute cluster centers and predict the cluster index for each sample.
  • fit_transform(X[, y, sample_weight]): Compute clustering and transform X to cluster-distance space.
  • get_feature_names_out([input_features]): Get output feature names for transformation.
  • get_metadata_routing(): Get metadata routing of this object.
  • get_params([deep]): Get parameters for this estimator.
  • predict(X[, sample_weight]): Predict the closest cluster each sample in X belongs to.
  • score(X[, y, sample_weight]): Opposite of the value of X on the K-means objective.
  • set_fit_request(*[, sample_weight]): Request metadata passed to the fit method.
  • set_output(*[, transform]): Set output container.
  • set_params(**params): Set the parameters of this estimator.
  • set_predict_request(*[, sample_weight]): Request metadata passed to the predict method.
  • set_score_request(*[, sample_weight]): Request metadata passed to the score method.
  • transform(X): Transform X to a cluster-distance space.
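
A short sketch of the most frequently used methods, in particular transform, which maps samples into cluster-distance space (one column per centroid); the make_blobs data is only for illustration:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(model.predict(X[:5]))     # nearest cluster index for each sample
print(model.transform(X[:5]))   # shape (5, 3): distance to each of the 3 centroids
print(model.score(X))           # negative of the inertia on X (higher is better)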

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations.

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can fall into local minima. That is why it can be useful to restart it several times.

If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.

Cluster validity evaluation

K-means can produce several different results (one per run).

Validity evaluation criteria

Rand index
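
When reference (ground-truth) labels are available, the Rand index and its chance-corrected variant can be computed with scikit-learn; a minimal sketch with made-up labels:

from sklearn.metrics import rand_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
cluster_labels = [1, 1, 0, 0, 2, 2]   # cluster IDs may be permuted; both scores ignore the naming

print(rand_score(true_labels, cluster_labels))            # 1.0 for this example
print(adjusted_rand_score(true_labels, cluster_labels))   # chance-corrected version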

  • Silhouette Coefficient

    The silhouette plot illustrates the robustness of each group (it shows the number of clusters and the average silhouette width)

    from sklearn.metrics import silhouette_score, silhouette_samples
    import numpy as np

    # silhouette_score already returns the mean silhouette coefficient over all samples
    score = silhouette_score(data, label)
    print(score)
    # equivalently, average the per-sample coefficients
    print(np.mean(silhouette_samples(data, label)))
    

    The mean of the silhouette coefficients indicates how good a given number of clusters is, so we can write a loop that computes the silhouette coefficient for cluster counts from 2 to n-1 and use it to evaluate the clustering.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples
    import numpy as np

    def juleipingjia(n):
        julei = KMeans(n_clusters=n)
        julei.fit(data)
        label = julei.labels_
        lkxs = silhouette_samples(data, label, metric='euclidean')
        means = np.mean(lkxs)
        return means
    
    y=[]
    for n in range(2,23):
    	means = juleipingjia(n)
    	y.append(means)
    print(y)
    

    Computing the silhouette coefficient for each number of clusters shows which value is best; plotting these points gives a visualisation from which the optimal number of clusters can be read off directly.

    Quantitative performance evaluation

    from sklearn.metrics import silhouette_score
    # Fit the KMeans model
    kmeans_pca.fit_predict(score_pca)
    # Calculate the average silhouette score for the fitted clustering
    score = silhouette_score(score_pca, kmeans_pca.labels_, metric='euclidean')
    # Print the score
    print('Average silhouette score: %.3f' % score)
    
    silhouette_coefficients = []
    # We need at least 2 clusters for the silhouette coefficient
    for k in range(2, 10):
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(score_pca)
        score = silhouette_score(score_pca, kmeans.labels_)
        silhouette_coefficients.append(score)
    

    yellowbrick.cluster.SilhouetteVisualizer provides a dedicated visualisation for this:

    from yellowbrick.cluster import SilhouetteVisualizer
    model = KMeans(n_clusters=4)
    visualizer = SilhouetteVisualizer(model)
    
    visualizer.fit(df_PCA_2)    # Fit the data to the visualizer
    visualizer.poof()
    


    In the SilhouetteVisualizer plot, clusters with higher scores have wider silhouettes, while clusters with lower cohesion fall below the average score across all clusters (drawn as a vertical red dashed line).

    Negative values indicate that something is wrong with the clustering.

    The visualiser displays the silhouette coefficient of each sample on a per-cluster basis, giving a visual assessment of the density of, and separation between, clusters. The score is computed by averaging the silhouette coefficient of each sample, which is the difference between the mean intra-cluster distance and the mean nearest-cluster distance for that sample, normalised by the maximum of the two. This yields a score between -1 and +1, where scores near +1 indicate high separation and scores near -1 indicate that the samples may have been assigned to the wrong cluster.

    In a silhouette analysis, a value of 0 means the sample lies on (or very close to) the boundary between two neighbouring clusters.

Calinski-Harabasz index
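
A minimal sketch using sklearn.metrics.calinski_harabasz_score (higher values indicate better-separated clusters; the make_blobs data is only for illustration):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(calinski_harabasz_score(X, labels))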

  • Accuracy via confusion matrix

    from sklearn.metrics import confusion_matrix, accuracy_score

    # Extract the last characters of the index; the last character is the reference subgroup
    last_characters = dataset.index.str[-1]  
    
    true_labels = [int(float(x)) for x in last_characters]  # convert the string label into number 
    
    predicted_labels = model.labels_  # obtained from the kmeans clustering 
    
    # Create a confusion matrix
    conf_matrix = confusion_matrix(true_labels, predicted_labels)
    
    # Calculate accuracy
    accuracy = accuracy_score(true_labels, predicted_labels)
    
    # Display confusion matrix and accuracy
    print("Confusion Matrix:")
    print(conf_matrix)
    print("\nAccuracy:", accuracy)
    


Visualisation

import scipy.cluster.hierarchy as shc

There is more than one possible result.

How to choose a reliable clustering technique for your dataset

  • Evaluate your dataset from different aspects
    ▪ What type of features (e.g., numeric or categorical)?
    ▪ The size of the dataset (e.g., large or small)
    ▪ Number of features (i.e., attributes); is it a high-dimensional dataset?
    ▪ Assessing outliers and missing values
  • Consider consensus clustering
  • Evaluate the reliability (i.e., consistency/robustness) of the clustering result

Consensus clustering

  1. No knowledge about the number of clusters
  2. Clustering methods are sensitive to initialisation settings
  3. The lack of a reliable validation technique when using clustering
    1. We need a measure of confidence for cluster numbers and cluster assignments

Consensus clustering approach

  • Multiple runs of a clustering algorithm


    • Determine the number of clusters and assess the stability of the discovered clusters
    • In k-means clustering, using random restarts (see the sketch after this list)
  • Aggregating the cluster (label) results of different clustering algorithms
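
A hedged sketch of the first idea above: aggregate several random-restart k-means runs into a co-association (consensus) matrix. This only illustrates the principle and is not a full consensus-clustering implementation; the make_blobs data and the choice of 20 runs are assumptions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
n_runs = 20
co_assoc = np.zeros((len(X), len(X)))

for seed in range(n_runs):
    labels = KMeans(n_clusters=3, n_init=1, init='random', random_state=seed).fit_predict(X)
    # count how often each pair of samples ends up in the same cluster
    co_assoc += (labels[:, None] == labels[None, :]).astype(float)

co_assoc /= n_runs
# Entries close to 1 indicate pairs that are consistently clustered together
# (stable clusters); values near 0.5 indicate unstable assignments.
print(co_assoc[:5, :5])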

Ensemble clustering
