## 1. Mixture Density Methods

• Sample set $$D = \{x_k \mid k = 1, 2, \dots, n\}$$, with n samples in total
• Class set $$\{w_j \mid j = 1, 2, \dots, c\}$$, with c classes in total
• Prior probability of each class: $$\alpha = \{ \alpha_j = P(w_j) \mid j = 1, 2, \dots, c \}$$
• Class-conditional density of each class: $$p(x \mid w_j, \theta)$$, where $$\theta = \{\theta_j \mid j = 1, 2, \dots, c\}$$

Each class-conditional density is assumed to depend only on its own parameter:

$p(x \mid w_j, \theta) = p(x \mid w_j, \theta_j)$

### Mixture Density

$p(x \mid \theta) = \sum_{j=1}^c P(w_j) p(x \mid w_j, \theta_j)$

\begin{align} l(\theta) &= p(D \mid \theta) \\ &= \prod_{k=1}^n p(x_k \mid \theta) \\ ll(\theta) &= \ln l(\theta) \\ &= \sum_{k=1}^n \ln p(x_k \mid \theta) \\ &= \sum_{k=1}^n \ln \Big( \sum_{j=1}^c P(w_j) p(x_k \mid w_j, \theta_j) \Big) \end{align}

\begin{align} 0 &= \frac{\partial}{\partial \theta_j} ll(\theta) \\ &= \sum_{k=1}^n \frac{1}{p(x_k \mid \theta)} \frac{\partial}{\partial \theta_j} \sum_{i=1}^c P(w_i) p(x_k \mid w_i, \theta_i) \\ &= \sum_{k=1}^n \frac{1}{p(x_k \mid \theta)} \frac{\partial}{\partial \theta_j} P(w_j) p(x_k \mid w_j, \theta_j) \\ &= \sum_{k=1}^n \frac{P(w_j)}{p(x_k \mid \theta)} \frac{\partial}{\partial \theta_j} p(x_k \mid w_j, \theta_j) \\ &= \sum_{k=1}^n \frac{P(w_j)}{p(x_k \mid \theta)} p(x_k \mid w_j, \theta_j) \frac{\partial}{\partial \theta_j} \ln p(x_k \mid w_j, \theta_j) \\ &= \sum_{k=1}^n \frac{p(x_k, w_j \mid \theta)}{p(x_k \mid \theta)} \frac{\partial}{\partial \theta_j} \ln p(x_k \mid w_j, \theta_j) \\ &= \sum_{k=1}^n P(w_j \mid x_k, \theta) \frac{\partial}{\partial \theta_j} \ln p(x_k \mid w_j, \theta_j) \end{align}

$\max_{\theta, \alpha} ll(\theta, \alpha) \quad \text{s.t.} \quad \sum_{j=1}^c \alpha_j = 1$

$LL(\theta, \alpha, \lambda) = ll(\theta, \alpha) + \lambda(\sum_{j=1}^c \alpha_j - 1)$

\begin{align} \frac{\partial}{\partial \alpha_j} LL &= \lambda + \sum_{k=1}^n \frac{1}{p(x_k \mid \theta)} \frac{\partial}{\partial \alpha_j} \sum_{i=1}^c \alpha_i p(x_k \mid w_i, \theta_i) \\ &= \lambda + \sum_{k=1}^n \frac{p(x_k \mid w_j, \theta_j)}{p(x_k \mid \theta)} \\ &= 0 \end{align}

Multiplying the stationarity condition by $\alpha_j$ and summing over $j$:

\begin{align} 0 &= \lambda \sum_{j=1}^c \alpha_j + \sum_{k=1}^n \sum_{j=1}^c \frac{p(x_k \mid w_j, \theta_j)\alpha_j}{p(x_k \mid \theta)} \\ 0 &= \lambda + \sum_{k=1}^n \sum_{j=1}^c \frac{p(x_k, w_j \mid \theta)}{p(x_k \mid \theta)} \\ 0 &= \lambda + \sum_{k=1}^n \frac{p(x_k \mid \theta)}{p(x_k \mid \theta)} \\ 0 &= \lambda + n \\ \lambda &= -n \end{align}

Substituting $\lambda = -n$ back and multiplying by $\alpha_j$:

\begin{align} n &= \sum_{k=1}^n \frac{p(x_k \mid w_j, \theta_j)}{p(x_k \mid \theta)} \\ n \alpha_j &= \sum_{k=1}^n \frac{p(x_k \mid w_j, \theta_j)\alpha_j}{p(x_k \mid \theta)} \\ \alpha_j &= \frac{1}{n} \sum_{k=1}^n P(w_j \mid x_k, \theta) \end{align}
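The fixed-point update $\alpha_j = \frac{1}{n} \sum_{k=1}^n P(w_j \mid x_k, \theta)$ can be checked numerically. A minimal Python sketch, assuming a two-component 1-D Gaussian mixture; the data and all names are illustrative:

```python
import numpy as np

# Sketch of the weight update alpha_j = (1/n) * sum_k P(w_j | x_k, theta)
# for a two-component 1-D Gaussian mixture.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 100)])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

alpha = np.array([0.5, 0.5])                      # current prior estimates
mu, sigma = np.array([-1.0, 1.0]), np.array([1.0, 1.0])

# E-step: posterior responsibilities P(w_j | x_k, theta)
joint = alpha[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma[None, :])
resp = joint / joint.sum(axis=1, keepdims=True)

# M-step for the priors: the fixed-point update derived above
alpha_new = resp.mean(axis=0)
print(alpha_new)                                  # weights still sum to 1
```

Since 300 of the 400 points sit near the first component, the updated weight of that component exceeds 0.5, as expected.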

### Gaussian Mixture Density

$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \vert\Sigma\vert^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$

\begin{align} \frac{\partial}{\partial \mu_j} \ln p(x \mid \mu_j, \Sigma_j) &= -\frac{1}{2} \frac{\partial}{\partial \mu_j} (x - \mu_j)^T \Sigma_j^{-1}(x - \mu_j) \\ &= \Sigma_j^{-1}(x - \mu_j) \end{align}

(Addendum 2016-06-11: found a derivation of the covariance-matrix estimate that treats the quadratic form as the trace of a matrix, a rather slick trick: Wikipedia.)
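Combining the pieces above (the prior update from the Lagrangian and the mean/variance updates implied by the Gaussian gradients), one EM iteration can be sketched as follows, in 1-D for brevity; the data and names are illustrative:

```python
import numpy as np

# One full EM iteration, repeated, for a 1-D Gaussian mixture: the E-step
# computes the posteriors P(w_j | x_k, theta); the M-step applies the
# closed-form updates implied by the zero-gradient conditions above.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

alpha = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities
    pdf = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    joint = alpha * pdf
    resp = joint / joint.sum(axis=1, keepdims=True)
    nj = resp.sum(axis=0)
    # M-step: weighted means / variances, and the prior update alpha_j = nj / n
    mu = (resp * x[:, None]).sum(axis=0) / nj
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    alpha = nj / x.size

print(mu)   # should approach the true means 0 and 5 (in some order)
```

In practice one iterates until the change in $ll(\theta)$ falls below a tolerance rather than for a fixed count.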

### k-means

The k-means objective is to minimize the sum of (squared) distances from all points to their respective cluster centers. This is in fact maximizing the likelihood above when $$\Sigma_j = I$$ (with hard assignments), so k-means can be viewed as a special case of the Gaussian mixture density above.
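Under that reading, the E-step degenerates to nearest-center assignment and the M-step to per-cluster averaging, which is Lloyd's algorithm. A minimal sketch (initialization, data, and names are illustrative):

```python
import numpy as np

# k-means as the hard-assignment, Sigma_j = I special case of the Gaussian
# mixture: "E-step" -> nearest center, "M-step" -> per-cluster mean.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 0.5, (100, 2)),
                    rng.normal((4, 4), 0.5, (100, 2))])

centers = np.array([[-1.0, -1.0], [5.0, 5.0]])    # illustrative initialization
for _ in range(20):
    # hard assignment: each point goes to its nearest center
    d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # update: each center becomes the mean of its assigned points
    centers = np.array([x[labels == j].mean(axis=0) for j in range(2)])

print(np.round(centers))
```

The only differences from the soft GMM updates are the argmax in the assignment step and the fixed identity covariance.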

## 2. Spectral Clustering

### Basic Idea

The graph Laplacian $$L = D - W$$ (degree matrix minus weighted adjacency matrix) has the following properties:

• Every row sums to 0
• It always has a 0 eigenvalue, with the all-ones vector as eigenvector
• It is positive semi-definite
• The multiplicity of the 0 eigenvalue equals the number of connected components of the graph
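These properties of the graph Laplacian $L = D - W$ are easy to check numerically. A sketch on a made-up graph with two connected components (a triangle and a single edge):

```python
import numpy as np

# Toy graph with two connected components: {0,1,2} is a triangle, {3,4} an edge.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
D = np.diag(W.sum(axis=1))
L = D - W

print(L.sum(axis=1))                     # row sums are all 0
eigvals = np.linalg.eigvalsh(L)
print(eigvals)                           # all >= 0, i.e. L is PSD
print(np.sum(np.isclose(eigvals, 0)))    # multiplicity of 0 is 2 = #components
```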

### Different Approaches

• Graph cuts (minimum cut / normalized cut between partitions)
• Random walks
• Perturbation analysis [8]; besides the perturbation-analysis interpretation, Ng also did good normalization work, and his contribution to spectral clustering can fairly be called a breakthrough

• $$L_{sym} = D^{-1/2}LD^{-1/2}$$, which amounts to normalizing by vertex count
• $$L_{rw} = D^{-1}L$$, equivalent to the random-walk approach, which amounts to normalizing by volume
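A minimal spectral-bipartition sketch using the random-walk normalization $L_{rw} = D^{-1}L$; the toy graph (two triangles joined by one bridge edge) is illustrative:

```python
import numpy as np

# Spectral bipartition with L_rw on two cliques joined by a weak bridge;
# the eigenvector of the 2nd-smallest eigenvalue separates them by sign.
W = np.zeros((6, 6))
edges = [(0, 1), (1, 2), (0, 2),   # triangle A
         (3, 4), (4, 5), (3, 5),   # triangle B
         (2, 3)]                   # bridge
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
D = np.diag(W.sum(axis=1))
L_rw = np.linalg.inv(D) @ (D - W)

# L_rw is not symmetric, so use eig; its spectrum is still real here
vals, vecs = np.linalg.eig(L_rw)
order = np.argsort(vals.real)
fiedler = vecs[:, order[1]].real   # 2nd-smallest eigenvector splits the graph

labels = (fiedler > 0).astype(int)
print(labels)
```

For more than two clusters one would take the first k eigenvectors and run k-means on the rows, as in [8].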

## 3. Other Methods

• LVQ: see [4]; it requires labeled data and uses the labels to correct the prototypes, so it is debatable whether this still counts as clustering
• DBSCAN: see [4]; nothing particularly special, it just keeps growing clusters within a density neighborhood; the main engineering concern is indexing the distance queries, for which geohash or a kd-tree can be used
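The density-growing idea can be sketched with brute-force neighborhood queries (a real implementation would use a kd-tree or geohash as noted above; `eps` and `min_pts` are illustrative parameters):

```python
import numpy as np

# Brute-force DBSCAN sketch: a core point has >= min_pts neighbors within
# eps; clusters grow by repeatedly absorbing density-reachable points.
def dbscan(x, eps=0.8, min_pts=4):
    n = len(x)
    dist = np.linalg.norm(x[:, None] - x[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = np.full(n, -1)                 # -1 marks noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        stack, labels[i] = [i], cluster
        while stack:                        # expand by density reachability
            for j in neighbors[stack.pop()]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:
                        stack.append(j)     # only core points keep expanding
        cluster += 1
    return labels

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 0.3, (60, 2)),
                    rng.normal((5, 5), 0.3, (60, 2))])
labels = dbscan(x)
print(np.unique(labels))
```

The quadratic distance matrix is exactly what the spatial index is meant to avoid on large datasets.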

## Bibliography

[1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[2] 从参数估计到线性判别函数 (From Parameter Estimation to Linear Discriminant Functions). http://libzx.so/chn/learning/2016/03/27/from-estimator-to-linear-discriminant.html, 2016.
[3] R. J. Barnes, Matrix Differentiation. http://www.atmos.washington.edu/~dennis/MatrixCalculus.pdf
[4] Z.-H. Zhou, Machine Learning (机器学习). Tsinghua University Press, 2016.
[5] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[6] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[7] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[8] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” Advances in Neural Information Processing Systems, vol. 2, pp. 849–856, 2002.
[9] I. S. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: spectral clustering and normalized cuts,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 551–556.
[10] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, Jun. 2014.
[11] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.