## 1.2 Cross-Validation

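Take 10-fold cross-validation as an example. First fix the candidate values of $\alpha$ to be tried, $\{\alpha_1, \alpha_2, \dots, \alpha_n\}$. Set $\alpha = \alpha_1$, split the dataset into training data and test data, and then divide the training data into 10 folds. In each round, train on 9 of the folds and measure performance on the remaining fold; denote the first such result by $p_{11}$. Rotating through all 10 folds gives a set of scores $p_{1j},\ j = 1, 2, \dots, 10$, and the best of them is $p_{1\max} = \max_j p_{1j}$. Repeating the procedure for each remaining $\alpha_i$ gives $p_{i\max},\ i = 1, 2, \dots, n$. The final choice is $\alpha = \arg\max_{\alpha_i} p_{i\max}$.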
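The selection loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `select_alpha`, `k_fold_indices`, and the toy `shrunken_mean_score` model are hypothetical names introduced here, and the folds are taken as contiguous slices without shuffling. The `aggregate` argument defaults to `max` to match the text; passing `statistics.mean` instead gives the more common variant that averages the fold scores.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def select_alpha(train_data, alphas, score_fn, k=10, aggregate=max):
    """Return (best_alpha, scores), where scores[alpha] aggregates the k fold scores.

    score_fn(alpha, fit_part, val_part) must return a score where higher is better.
    """
    folds = k_fold_indices(len(train_data), k)
    scores = {}
    for alpha in alphas:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            fit_part = [x for i, x in enumerate(train_data) if i not in held]
            val_part = [train_data[i] for i in held_out]
            fold_scores.append(score_fn(alpha, fit_part, val_part))
        scores[alpha] = aggregate(fold_scores)  # p_{i max} in the text when aggregate=max
    best_alpha = max(alphas, key=lambda a: scores[a])
    return best_alpha, scores


def shrunken_mean_score(alpha, fit_part, val_part):
    """Hypothetical toy model: a mean estimate shrunk toward zero by alpha.

    The score is the negative mean squared error on the held-out fold.
    """
    estimate = sum(fit_part) / (len(fit_part) + alpha)
    return -mean([(v - estimate) ** 2 for v in val_part])


if __name__ == "__main__":
    best_alpha, scores = select_alpha([2.0] * 20, [0.0, 1.0, 10.0], shrunken_mean_score)
    print(best_alpha, scores)
```

The held-out test data never enters this loop; it is reserved for evaluating the model once $\alpha$ has been chosen.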

## 2.1 k-means

% Generate five 2-D Gaussian clusters of 200 points each, all with identity covariance
Sigma = [1, 0; 0, 1];
mu1 = [1, -1];
x1 = mvnrnd(mu1, Sigma, 200);
mu2 = [5.5, -4.5];
x2 = mvnrnd(mu2, Sigma, 200);
mu3 = [1, 4];
x3 = mvnrnd(mu3, Sigma, 200);
mu4 = [6, 4.5];
x4 = mvnrnd(mu4, Sigma, 200);
mu5 = [9, 0.0];
x5 = mvnrnd(mu5, Sigma, 200);

% Stack the five clusters into the 1000-by-2 matrix of points to be clustered
X = [x1; x2; x3; x4; x5];

% Plot the data points, one color per generating cluster
plot(x1(:,1), x1(:,2), 'r.'); hold on;
plot(x2(:,1), x2(:,2), 'b.');
plot(x3(:,1), x3(:,2), 'k.');
plot(x4(:,1), x4(:,2), 'g.');
plot(x5(:,1), x5(:,2), 'm.');


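(1) The code is in the GitHub repo and consists mainly of three files:

• gen_k-means_data.jl: generates the data and prints it to the terminal
• k-means.jl: the main file of the clustering algorithm
• main_kmeans.jl: the entry point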
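The repository code is not reproduced in this report. As a rough illustration of what a k-means clustering step involves, the following is a minimal sketch of Lloyd's algorithm in plain Python. It is not the Julia implementation referred to above; the function names, the default of drawing initial centers from the data points, and the handling of empty clusters are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init_centers=None, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm; returns (centers, labels, cluster_sizes).

    If init_centers is None, k distinct data points are sampled as the
    initial centers (an assumption of this sketch).
    """
    rng = random.Random(seed)
    centers = [tuple(c) for c in (init_centers or rng.sample(points, k))]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    sizes = [labels.count(c) for c in range(k)]
    return centers, labels, sizes


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    print(kmeans(pts, 2, init_centers=[(0, 0), (10, 10)]))
```

Passing `init_centers` explicitly makes it possible to repeat a run from a chosen set of starting centers, which is the kind of experiment tabulated below.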

(2) Statistics over multiple runs are as follows:

| No. | Initial centers | Error | Cluster sizes | Final centers |
|---|---|---|---|---|
| 1 | [6.6939,-5.4017],[6.7155,4.5134],[0.0397,-1.9298],[4.0997,-3.7419],[5.4827,3.2733] | 0.1693 | [201.0,198.0,198.0,201.0,202.0] | [8.8699,-0.0387],[6.0164,4.5892],[1.0251,-0.9898],[5.5131,-4.5951],[1.0034,3.9247] |
| 2 | [3.4235,0.1799],[7.8776,-0.3014],[0.8074,-0.937],[7.7488,-0.1569],[6.567,4.4531] | 0.1693 | [202.0,201.0,198.0,201.0,198.0] | [1.0034,3.9247],[5.5131,-4.5951],[1.0251,-0.9898],[8.8699,-0.0387],[6.0164,4.5892] |
| 3 | [5.902,5.4364],[0.9754,5.009],[6.9544,-2.9525],[8.7147,5.3574],[4.4354,3.48] | 0.1693 | [198.0,202.0,201.0,201.0,198.0] | [6.0164,4.5892],[1.0034,3.9247],[5.5131,-4.5951],[8.8699,-0.0387],[1.0251,-0.9898] |
| 4 | [5.8873,5.5452],[1.1112,4.4041],[6.3762,2.7627],[0.9075,3.4441],[5.4542,-3.7635] | 0.1693 | [198.0,202.0,201.0,198.0,201.0] | [6.0164,4.5892],[1.0034,3.9247],[8.8699,-0.0387],[1.0251,-0.9898],[5.5131,-4.5951] |
| 5 | [6.1372,-3.326],[1.577,0.0763],[1.565,3.2955],[2.7538,-1.3246],[0.96,-0.6246] | 0.1693 | [201.0,202.0,198.0,201.0,198.0] | [8.8699,-0.0387],[1.0034,3.9247],[6.0164,4.5892],[5.5131,-4.5951],[1.0251,-0.9898] |

## 2.2 Spectral Clustering

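(1) Write a spectral clustering algorithm that implements "Normalized Spectral Clustering, Algorithm 3" (the Ng algorithm).

(2) Let the pairwise affinity (i.e. the edge weight) be computed by the following formula: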

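(1) The code is in the GitHub repo and consists mainly of the following files:

• main_spectral.jl: the entry point
• spectral.jl: the spectral clustering function
• spiral_data.jl: the data file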
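For orientation, the steps of the Ng normalized spectral clustering algorithm can be sketched with NumPy as below. This is not the code in `spectral.jl`. In particular, the affinity formula is not shown in this report, so the sketch assumes a Gaussian kernel restricted to a symmetrised k-nearest-neighbour graph; reading the report's `k` as a neighbour count is likewise an assumption. All function and parameter names are hypothetical, and the small deterministic k-means at the end is just a convenience for the example.

```python
import numpy as np


def knn_gaussian_affinity(X, n_neighbors, sigma):
    """ASSUMED affinity: exp(-||xi - xj||^2 / (2 sigma^2)) on a k-NN graph, else 0."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.zeros_like(sq)
    for i in range(X.shape[0]):
        d = sq[i].copy()
        d[i] = np.inf  # a point is never its own neighbour
        nbrs = np.argsort(d)[:n_neighbors]
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)  # symmetrise: keep an edge if either end chose it


def spectral_embedding(W, n_clusters):
    """Rows of the top eigenvectors of D^{-1/2} W D^{-1/2}, scaled to unit length."""
    deg = np.maximum(W.sum(axis=1), 1e-12)  # guard against isolated points
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    U = eigvecs[:, -n_clusters:]  # eigenvectors of the largest eigenvalues
    norms = np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return U / norms


def kmeans_rows(Y, n_clusters, n_iter=100):
    """Small Lloyd's k-means with farthest-point initialisation (deterministic)."""
    centers = [Y[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.sum((Y[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d, axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clustering(X, n_clusters, n_neighbors, sigma):
    """Affinity -> normalised embedding -> k-means on the embedded rows."""
    X = np.asarray(X, dtype=float)
    W = knn_gaussian_affinity(X, n_neighbors, sigma)
    Y = spectral_embedding(W, n_clusters)
    return kmeans_rows(Y, n_clusters)
```

With this structure, `n_neighbors` and `sigma` only affect the affinity matrix, which makes it straightforward to sweep them independently.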

(2) First fix $\sigma = 1$ and vary $k$; the results are shown in the table below.

| k | $\sigma$ | Precision |
|---|---|---|
| 3 | 1 | 100% |
| 5 | 1 | 100% |
| 10 | 1 | 100% |
| 20 | 1 | 98.5% |
| 30 | 1 | 77.5% |
| 50 | 1 | 78% |
| 200 | 1 | 74.5% |

Next, vary $\sigma$ for $k = 10, 20, 30$:

| k | $\sigma$ | Precision |
|---|---|---|
| 10 | 0.01 | 87.5% |
| 10 | 0.1 | 100% |
| 10 | 0.5 | 100% |
| 10 | 1.0 | 100% |
| 10 | 2.0 | 100% |
| 10 | 5.0 | 100% |
| 10 | 10.0 | 100% |
| 10 | 15.0 | 100% |
| 20 | 0.01 | 87.5% |
| 20 | 0.1 | 100% |
| 20 | 0.5 | 99.5% |
| 20 | 1.0 | 98.5% |
| 20 | 2.0 | 98.5% |
| 20 | 5.0 | 98.5% |
| 20 | 10.0 | 98.5% |
| 20 | 15.0 | 98.5% |
| 30 | 0.01 | 87.5% |
| 30 | 0.1 | 100% |
| 30 | 0.5 | 89.5% |
| 30 | 1.0 | 78% |
| 30 | 2.0 | 76% |
| 30 | 5.0 | 75% |
| 30 | 10.0 | 75% |
| 30 | 15.0 | 75% |
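How the precision figures above were computed is not stated. One common convention for scoring a clustering against known labels, assumed here purely for illustration, is to take the best agreement over all ways of renaming the predicted clusters, since cluster indices are arbitrary. The function name `clustering_accuracy` is hypothetical, and the brute-force search over permutations is only practical for a small number of clusters.

```python
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of points labelled correctly under the best cluster renaming.

    Assumes there are no more predicted clusters than true classes.
    """
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    best_hits = 0
    # Try every injective mapping from predicted cluster ids to true class ids.
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


if __name__ == "__main__":
    print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))
```

Under this convention a perfect clustering scores 1.0 even when its cluster indices are swapped relative to the ground truth.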