K-means Clustering Algorithm Implemented in C
Date: 2025-07-01 11:05:36
### K-means Clustering Algorithm Implementation in C
K-means clustering is an iterative algorithm that partitions a dataset into `k` clusters by repeatedly assigning each point to its nearest centroid and recomputing the centroids. Below is a detailed implementation of the K-means algorithm in the **C programming language**, which includes functions for:
- Initializing centroids randomly.
- Calculating Euclidean distance between points.
- Assigning data points to the nearest centroid.
- Updating centroids based on cluster members.
- Repeating the assignment and update steps until convergence.
---
### Code Example: K-means in C
```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define MAX_ITER 100
#define K 2   // Number of clusters
#define N 6   // Number of data points
#define D 2   // Dimensionality of data

// Function to calculate Euclidean distance
double euclidean_distance(double *a, double *b, int dim) {
    double dist = 0.0;
    for (int i = 0; i < dim; i++) {
        dist += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return sqrt(dist);
}

// Function to assign each point to the nearest centroid
void assign_clusters(double data[N][D], double centroids[K][D], int assignments[N]) {
    for (int i = 0; i < N; i++) {
        double min_dist = INFINITY;
        int closest = 0;
        for (int j = 0; j < K; j++) {
            double dist = euclidean_distance(data[i], centroids[j], D);
            if (dist < min_dist) {
                min_dist = dist;
                closest = j;
            }
        }
        assignments[i] = closest;
    }
}

// Function to update centroids based on current cluster assignments
void update_centroids(double data[N][D], double centroids[K][D], int assignments[N]) {
    int count[K] = {0};
    double sum[K][D] = {{0}};
    for (int i = 0; i < N; i++) {
        int cluster = assignments[i];
        for (int j = 0; j < D; j++) {
            sum[cluster][j] += data[i][j];
        }
        count[cluster]++;
    }
    for (int i = 0; i < K; i++) {
        if (count[i] > 0) {
            for (int j = 0; j < D; j++) {
                centroids[i][j] = sum[i][j] / count[i];
            }
        } else {
            // Re-initialize an empty cluster to a random data point
            int rand_index = rand() % N;
            for (int j = 0; j < D; j++) {
                centroids[i][j] = data[rand_index][j];
            }
        }
    }
}

// Check for convergence (centroids moved less than a small tolerance)
int has_converged(double old_centroids[K][D], double new_centroids[K][D]) {
    for (int i = 0; i < K; i++) {
        if (euclidean_distance(old_centroids[i], new_centroids[i], D) > 1e-4) {
            return 0; // Not converged
        }
    }
    return 1; // Converged
}

int main(void) {
    srand((unsigned)time(NULL));

    // Sample data points (N x D)
    double data[N][D] = {
        {1.0, 2.0},
        {1.5, 1.8},
        {5.0, 8.0},
        {8.0, 8.0},
        {1.0, 0.6},
        {9.0, 11.0}
    };

    // Initialize centroids randomly from data points
    double centroids[K][D];
    for (int i = 0; i < K; i++) {
        int idx = rand() % N;
        for (int j = 0; j < D; j++) {
            centroids[i][j] = data[idx][j];
        }
    }

    int assignments[N];
    double old_centroids[K][D];
    for (int iter = 0; iter < MAX_ITER; iter++) {
        // Save current centroids
        for (int i = 0; i < K; i++) {
            for (int j = 0; j < D; j++) {
                old_centroids[i][j] = centroids[i][j];
            }
        }
        // Assign clusters
        assign_clusters(data, centroids, assignments);
        // Update centroids
        update_centroids(data, centroids, assignments);
        // Check for convergence
        if (has_converged(old_centroids, centroids)) {
            printf("Converged at iteration %d\n", iter);
            break;
        }
    }

    // Output final results
    printf("Final Centroids:\n");
    for (int i = 0; i < K; i++) {
        printf("Centroid %d: ", i + 1);
        for (int j = 0; j < D; j++) {
            printf("%.2f ", centroids[i][j]);
        }
        printf("\n");
    }
    printf("Cluster Assignments:\n");
    for (int i = 0; i < N; i++) {
        printf("Data point %d -> Cluster %d\n", i, assignments[i]);
    }
    return 0;
}
```
---
### Explanation of Key Steps
The core idea behind K-means involves two major steps:
1. **Assignment Step**: Each data point is assigned to the nearest centroid using Euclidean distance[^1].
2. **Update Step**: The centroid positions are recalculated as the mean of all data points assigned to that cluster[^3].
This process repeats until the centroids stabilize, indicating convergence[^1].
---
### Time Complexity
The time complexity of K-means is generally expressed as $ O(T \cdot k \cdot n \cdot d) $, where:
- $ T $: Number of iterations.
- $ k $: Number of clusters.
- $ n $: Number of data points.
- $ d $: Dimensionality of the data.
Comparable expressions arise for other clustering techniques such as Border-Peeling, which additionally incurs overhead that depends on the data structures used[^2].
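Plugging the sample program above into this bound gives a concrete worst case:

$$ T \cdot k \cdot n \cdot d = 100 \cdot 2 \cdot 6 \cdot 2 = 2400 $$

coordinate operations across all distance evaluations. In practice the loop exits far earlier, as soon as the convergence check succeeds.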
---
### Notes on Optimization
- **Initialization Strategy**: Instead of purely random initialization, you can use **K-means++** for better initial centroid placement.
- **Distance Calculation**: Optimize by precomputing distances or using spatial indexing structures like k-d trees[^4].
- **Early Stopping**: Add thresholds for minimal centroid movement to reduce unnecessary iterations.
---