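Using scikit-learn, apply K-means and DBSCAN to recognize handwritten digits and iris flowers. *Without scikit-learn, reimplement K-means and DBSCAN for the same handwritten-digit/iris tasks (see the assignment document for further requirements)
Posted: 2025-01-01 22:37:34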
### Implementing K-means and DBSCAN with scikit-learn
#### Task description
Use scikit-learn's K-means and DBSCAN to recognize **handwritten digits** and **iris flowers**.
#### Procedure
1. **Import the required libraries**
```python
from sklearn.datasets import load_digits, load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score
```
2. **Load the data**
- Handwritten digits data
```python
digits = load_digits()
X_digits, y_digits = digits.data, digits.target
```
- Iris data
```python
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
```
3. **Split the datasets**
```python
X_train_digits, X_test_digits, y_train_digits, y_test_digits = train_test_split(X_digits, y_digits, test_size=0.3, random_state=42)
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.3, random_state=42)
```
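4. **Build the models**
- K-means
```python
# n_init=10 is set explicitly so the number of restarts does not depend on the scikit-learn version
kmeans_digits = KMeans(n_clusters=10, n_init=10, random_state=42).fit(X_train_digits)
kmeans_iris = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train_iris)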
```
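- DBSCAN
```python
# eps/min_samples as given in this write-up; they are untuned. On the unscaled digits
# (pixel values 0-16), eps=0.3 will probably mark almost every sample as noise (-1).
dbscan_digits = DBSCAN(eps=0.3, min_samples=10).fit(X_train_digits)
dbscan_iris = DBSCAN(eps=0.3, min_samples=10).fit(X_train_iris)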
```
5. **Test the models**
- K-means
```python
y_pred_kmeans_digits = kmeans_digits.predict(X_test_digits)
y_pred_kmeans_iris = kmeans_iris.predict(X_test_iris)
print("Kmeans (Digits) Adjusted Rand Score:", adjusted_rand_score(y_test_digits, y_pred_kmeans_digits))
print("Kmeans (Iris) Adjusted Rand Score:", adjusted_rand_score(y_test_iris, y_pred_kmeans_iris))
```
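- DBSCAN
```python
# DBSCAN has no predict(); fit_predict() clusters the test set from scratch,
# so the models fitted on the training split above are not reused here.
y_pred_dbscan_digits = dbscan_digits.fit_predict(X_test_digits)
y_pred_dbscan_iris = dbscan_iris.fit_predict(X_test_iris)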
print("DBSCAN (Digits) Adjusted Rand Score:", adjusted_rand_score(y_test_digits, y_pred_dbscan_digits))
print("DBSCAN (Iris) Adjusted Rand Score:", adjusted_rand_score(y_test_iris, y_pred_dbscan_iris))
```
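The adjusted Rand index compares two partitions without caring what the cluster ids are called, so it does not directly say which digit or species a cluster stands for. If a recognition-style accuracy is wanted, one common option is to give each cluster the most frequent true label among its members. The sketch below assumes that approach; `majority_vote_accuracy` is a hypothetical helper name, not part of scikit-learn, and treating DBSCAN's noise label `-1` as just another cluster is a simplifying assumption.

```python
import numpy as np

def majority_vote_accuracy(y_true, y_pred):
    """Map every cluster id to its most frequent true label, then score accuracy.

    Hypothetical helper for illustration; noise (-1) is treated like any other cluster.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mapped = np.empty_like(y_true)
    for cluster in np.unique(y_pred):
        members = y_pred == cluster
        # Most frequent true class among the members of this cluster
        classes, counts = np.unique(y_true[members], return_counts=True)
        mapped[members] = classes[np.argmax(counts)]
    return float(np.mean(mapped == y_true))

# Toy check: cluster ids are permuted relative to the classes, one sample is misplaced
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [2, 2, 2, 5, 5, 2]
print(majority_vote_accuracy(y_true_toy, y_pred_toy))
```

In the toy check five of the six samples end up with the right label. With the variables above the helper could be called as `majority_vote_accuracy(y_test_digits, y_pred_kmeans_digits)`. This score tends to be optimistic when there are many small clusters, so it is best read alongside the adjusted Rand index rather than instead of it.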
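### Reimplementing K-means and DBSCAN without scikit-learn
#### Reimplementing K-means
1. **Review the K-means procedure**
- Initialize the cluster centers
- Assign each point to its nearest center
- Update each center to the mean of its assigned points
- Repeat assignment and update until convergence
2. **Define the variables and helper functions**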
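```python
import numpy as np

# scipy.spatial.distance has no euclidean_distances (that function lives in
# sklearn.metrics.pairwise), so a NumPy-only equivalent is defined here.
def euclidean_distances(A, B):
    # Pairwise Euclidean distances between the rows of A and the rows of B
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def initialize_centers(X, k):
    # Pick k distinct samples as the initial centers
    return X[np.random.choice(X.shape[0], size=k, replace=False)]

def assign_labels(X, centers):
    # Index of the nearest center for every sample
    distances = euclidean_distances(X, centers)
    return np.argmin(distances, axis=1)

def update_centers(X, labels, k):
    new_centers = []
    for i in range(k):
        cluster_points = X[labels == i]
        if len(cluster_points) > 0:
            new_centers.append(np.mean(cluster_points, axis=0))
        else:
            # Empty cluster: re-seed its center with a random sample
            new_centers.append(X[np.random.randint(0, X.shape[0])])
    return np.array(new_centers)

def kmeans(X, k, max_iter=100):
    centers = initialize_centers(X, k)
    for _ in range(max_iter):
        labels = assign_labels(X, centers)
        new_centers = update_centers(X, labels, k)
        # Stop once the centers no longer move
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return labels, centers
```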
3. **Apply the reimplemented K-means**
```python
k = 10
y_pred_kmeans_digits_custom, _ = kmeans(X_test_digits, k)
y_pred_kmeans_iris_custom, _ = kmeans(X_test_iris, 3)
print("Custom Kmeans (Digits) Adjusted Rand Score:", adjusted_rand_score(y_test_digits, y_pred_kmeans_digits_custom))
print("Custom Kmeans (Iris) Adjusted Rand Score:", adjusted_rand_score(y_test_iris, y_pred_kmeans_iris_custom))
```
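Because `initialize_centers` draws the starting centers at random, two runs of the custom `kmeans` can end in different partitions. A common way to compare such runs is the within-cluster sum of squared distances, the quantity K-means tries to reduce. The helper below is a minimal sketch; the name `within_cluster_sse` is an assumption introduced here for illustration.

```python
import numpy as np

def within_cluster_sse(X, labels, centers):
    """Sum of squared distances from each sample to its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # centers[labels] picks, for every sample, the center it was assigned to
    diffs = X - centers[labels]
    return float(np.sum(diffs ** 2))

# Toy data: two pairs of points, each point exactly 1 unit from its center
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers_toy = np.array([[0.0, 1.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])
print(within_cluster_sse(X_toy, labels_toy, centers_toy))
```

For a fixed `k`, a lower value means tighter clusters, so one option is to call `kmeans(X, k)` several times and keep the run with the smallest value. The value is not comparable across different `k`, since it generally shrinks as `k` grows.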
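#### Reimplementing DBSCAN
1. **Review the DBSCAN procedure**
- Define the `eps`-neighborhood of each point
- Mark core points (points with at least `min_samples` neighbors within `eps`)
- Grow clusters from the core points
- Handle border points and noise points
2. **Define the variables and helper functions**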
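```python
import numpy as np

UNVISITED, NOISE = 0, -1  # label conventions used below

def region_query(X, point_idx, eps):
    # Indices of all points within eps of X[point_idx] (the point itself included)
    distances = np.linalg.norm(X - X[point_idx], axis=1)
    return list(np.flatnonzero(distances <= eps))

def expand_cluster(X, labels, point_idx, cluster_id, eps, min_samples):
    seeds = region_query(X, point_idx, eps)
    if len(seeds) < min_samples:
        # Not a core point: mark as noise for now (it may later become a border point)
        labels[point_idx] = NOISE
        return False
    # Core point: it and its still-unassigned neighbors start the new cluster
    seeds = [i for i in seeds if labels[i] in (UNVISITED, NOISE)]
    for seed_idx in seeds:
        labels[seed_idx] = cluster_id
    while seeds:
        current_point = seeds.pop(0)
        result = region_query(X, current_point, eps)
        if len(result) >= min_samples:  # only core points keep expanding
            for i in result:
                if labels[i] == UNVISITED:
                    seeds.append(i)  # expand from this point later
                if labels[i] in (UNVISITED, NOISE):
                    labels[i] = cluster_id  # unvisited or former noise joins the cluster
    return True

def dbscan(X, eps, min_samples):
    X = np.asarray(X, dtype=float)
    labels = np.zeros(X.shape[0], dtype=int)  # 0 = unvisited
    cluster_id = 0
    for i in range(X.shape[0]):
        if labels[i] != UNVISITED:
            continue
        if expand_cluster(X, labels, i, cluster_id + 1, eps, min_samples):
            cluster_id += 1
    return labels  # -1 = noise, 1..n = cluster ids
```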
3. **Apply the reimplemented DBSCAN**
```python
eps = 0.3
min_samples = 10
y_pred_dbscan_digits_custom = dbscan(X_test_digits, eps, min_samples)
y_pred_dbscan_iris_custom = dbscan(X_test_iris, eps, min_samples)
print("Custom DBSCAN (Digits) Adjusted Rand Score:", adjusted_rand_score(y_test_digits, y_pred_dbscan_digits_custom))
print("Custom DBSCAN (Iris) Adjusted Rand Score:", adjusted_rand_score(y_test_iris, y_pred_dbscan_iris_custom))
```
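DBSCAN results depend heavily on `eps` relative to the scale of the data. The digits features are pixel intensities in the range 0 to 16 over 64 dimensions, so `eps = 0.3` is probably far smaller than typical distances between samples, and most points may end up labeled as noise. A common heuristic for choosing `eps` is to look at each point's distance to its k-th nearest neighbor, with k tied to `min_samples`, and pick a value near the bend of the sorted curve. The sketch below assumes that heuristic; `k_distances` is a hypothetical helper written with NumPy only.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from every sample to its k-th nearest other sample."""
    X = np.asarray(X, dtype=float)
    kth = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        d = np.linalg.norm(X - X[i], axis=1)
        # d includes the zero distance to the point itself,
        # so position k after partitioning is the k-th other neighbor
        kth[i] = np.partition(d, k)[k]
    return np.sort(kth)

# Toy data on a line: three close points and one far away
X_line = np.array([[0.0], [1.0], [2.0], [10.0]])
print(k_distances(X_line, 1))
```

For the data in this write-up one might inspect `k_distances(X_test_iris, min_samples)` (or plot it) and try an `eps` near the point where the values start rising sharply. This is only a starting point; the chosen value still needs checking against the resulting clusters.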
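### Discussion questions
1. **On the same task, do DBSCAN and K-means perform differently? If so, why?**
- **Answer**: Yes. K-means assumes roughly spherical clusters and needs the number of clusters fixed in advance. DBSCAN needs no preset cluster count (it takes `eps` and `min_samples` instead), can find clusters of arbitrary shape, and labels outliers as noise. DBSCAN may therefore do better on non-spherical or noisy data sets.
2. **For a clustering task, when should you choose K-means and when DBSCAN?**
- **Answer**:
- **K-means**: suitable when the clusters are roughly spherical and their number is known.
- **DBSCAN**: suitable when the clusters have complex shapes or the data contain noise points. Because a single `eps` applies to the whole data set, DBSCAN can struggle when cluster densities differ widely.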
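
The shape argument above can be checked on a small synthetic example. The sketch below uses scikit-learn's `make_moons` to generate two interleaved crescents, a data set not used elsewhere in this write-up, and compares both algorithms with the adjusted Rand index. The parameter values (`noise=0.05`, `eps=0.2`, `min_samples=5`) are assumptions chosen for this toy data, not recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clearly separated, but not spherical
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

ari_kmeans = adjusted_rand_score(y_moons, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_moons, dbscan_labels)
print("K-means ARI:", ari_kmeans)
print("DBSCAN ARI:", ari_dbscan)
```

On data like this DBSCAN would typically recover the two crescents while K-means cuts straight across them. On compact, roughly spherical clusters the ranking can reverse, which is why the choice depends on the data.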
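### Lab report
1. **Report template**: already uploaded to the cloud platform.
2. **Length**: at most two pages.
3. **Report file name**: Class_StudentID_Name, e.g. `软工1班_1234567_张三.pdf` (Software Engineering Class 1, student ID, name).
4. **Submission**: the class monitor or study representative collects the reports and emails them as one archive to `[email protected]`.
5. **Archive name**: Class_ExperimentName, e.g. `软工1班_聚类` (Software Engineering Class 1, Clustering).
6. **Deadline**: December 2, 19:59.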