## Bayesian Decision

### decision theory

min error: decide $\omega_i$ if $P(\omega_i\mid x) > P(\omega_j\mid x)$ for all $j \ne i$ (maximize the posterior)

min risk: compute the conditional risk $R(\alpha_i\mid x) = \sum_j \lambda(\alpha_i\mid\omega_j)\,P(\omega_j\mid x)$ and take the action with the smallest risk

for every single x, choosing the decision that minimizes the conditional error / risk makes the overall expected loss minimal.
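The minimum-risk rule above can be sketched numerically; the loss matrix and posterior below are made-up illustrative numbers, not values from the notes:

```python
import numpy as np

# lam[i, j] = loss of taking action i when the true class is j
# (illustrative numbers only)
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])

def bayes_decide(posterior, lam):
    """Pick the action with minimal conditional risk R(a_i | x)."""
    # risk[i] = sum_j lam[i, j] * P(w_j | x)
    risk = lam @ posterior
    return int(np.argmin(risk)), risk

posterior = np.array([0.3, 0.7])     # P(w_1|x), P(w_2|x)
action, risk = bayes_decide(posterior, lam)
print(action, risk)                  # action 1: risk 0.3 beats action 0's 1.4
```

With a zero-one loss matrix this reduces to picking the largest posterior, i.e. the min-error rule.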

### normal distribution

properties:

• linear: $A^TX \sim N(A^T\mu, A^T\Sigma A)$
• whiten: $A_w = \Phi \Lambda^{-1/2}$, where $\Phi$ stacks the eigenvectors of $\Sigma$ and $\Lambda$ is the diagonal eigenvalue matrix, so that $A_w^T \Sigma A_w = I$
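A minimal whitening sketch on synthetic correlated Gaussian data (the covariance and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=5000)

Sigma = np.cov(X, rowvar=False)
eigval, Phi = np.linalg.eigh(Sigma)      # Sigma = Phi diag(eigval) Phi^T
A_w = Phi @ np.diag(eigval ** -0.5)      # whitening matrix A_w = Phi Lambda^{-1/2}

Y = X @ A_w                              # y = A_w^T x, applied row-wise
print(np.cov(Y, rowvar=False))           # identity: A_w^T Sigma A_w = I
```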

## non-parametric estimation

region: $p_n(x) = \dfrac{k_n/n}{V_n}$

where:

• $k_n$: samples falling in the region
• $n$: the number of total samples
• $V_n$: the volume of the region

pointwise:

for the sequence of estimates to converge to p(x) at a single point, there are 3 conditions: $V_n \to 0$, $k_n \to \infty$, and $k_n/n \to 0$

two ways to fix one of the variables:

• k-nearest-neighbor estimation: fix $k_n$, e.g. $k_n = \sqrt{n}$
• parzen window: fix $V_n = h_n^d$, e.g. $h_n = h_1/\sqrt{n}$
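A 1-D Parzen-window sketch with a Gaussian kernel; the $h_n = h_1/\sqrt{n}$ schedule with $h_1 = 1$ is the classic textbook choice, assumed here:

```python
import numpy as np

def parzen(x, samples, h):
    """p_n(x) = (1/n) sum_i (1/h) K((x - x_i)/h), Gaussian K."""
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum() / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.standard_normal(2000)        # true density: N(0, 1)
h = 1.0 / np.sqrt(len(samples))            # h_n = h_1 / sqrt(n)
est = parzen(0.0, samples, h)
true = 1.0 / np.sqrt(2 * np.pi)            # N(0,1) density at 0
print(est, true)
```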

## parametric estimation

### MLE / BE:

key: MLE treats $\theta$ as a fixed unknown value; BE treats $\theta$ as a random variable with a prior.

MLE finds the $\theta$ that maximizes the likelihood: $\hat\theta = \arg\max_\theta p(D\mid\theta)$

BE computes the full posterior $p(\theta\mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D\mid\theta)\,p(\theta)\,d\theta}$ and predicts with $p(x\mid D) = \int p(x\mid\theta)\,p(\theta\mid D)\,d\theta$

### EM:

in order to maximize $L(\theta) = \log P(Y\mid\theta)=\log(\int_Z P(Y\mid Z,\theta)P(Z\mid\theta)\,dZ\,)$

iteratively, use Jensen's inequality to find a lower bound for $L(\theta) - L(\theta^i)$

maximizing that lower bound over $\theta$ is equivalent to maximizing the Q function $Q(\theta, \theta^i) = E_Z[\log P(Y, Z\mid\theta) \mid Y, \theta^i]$: the E-step evaluates it, the M-step maximizes it
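A minimal EM sketch for a two-component 1-D Gaussian mixture; to keep it short, the variances are assumed known and equal to 1, and the data is synthetic:

```python
import numpy as np

def em_gmm(x, iters=50):
    """EM for means and weights of a 2-component unit-variance mixture."""
    mu = np.array([x.min(), x.max()])          # crude initialization
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities gamma[i, k] = P(z = k | x_i, theta^i)
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2) * pi
        gamma = lik / lik.sum(axis=1, keepdims=True)
        # M-step: maximize the Q function (closed form for Gaussians)
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 400), rng.normal(3, 1, 600)])
mu, pi = em_gmm(x)
print(mu, pi)                                  # roughly (-3, 3) and (0.4, 0.6)
```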

## SVM

### original model

maximize the margin: $\min_{w,b} \frac{1}{2}\Vert w\Vert^2$ s.t. $y_i(w^Tx_i + b) \ge 1$

using the lagrange multiplier method, this is equivalent to $\min_{w,b}\max_{\alpha\ge 0} L(w,b,\alpha)$ with $L = \frac{1}{2}\Vert w\Vert^2 - \sum_i \alpha_i\left[y_i(w^Tx_i+b) - 1\right]$

note the max part: if any constraint is violated, the inner max is $+\infty$ (take that $\alpha_i\to\infty$); otherwise the best choice is $\alpha_i = 0$ on inactive constraints and the max equals $\frac{1}{2}\Vert w\Vert^2$, so the min-max recovers the original problem

then lagrange duality applies, giving $\max_{\alpha}\ \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^T x_j$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$
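The standard hard-margin dual, $\max_\alpha \sum_i\alpha_i - \frac12\sum_{i,j}\alpha_i\alpha_j y_iy_j x_i^Tx_j$ s.t. $\alpha_i\ge0$, $\sum_i\alpha_iy_i=0$, can be handed to a generic QP solver on a tiny separable toy set; this is a sketch, not a production SVM, and the four points are made up:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i^T x_j

def neg_dual(a):                                # minimize the negative dual
    return 0.5 * a @ G @ a - a.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)
res = minimize(neg_dual, np.zeros(4), bounds=[(0, None)] * 4, constraints=cons)
alpha = res.x
w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                           # pick a support vector
b = y[sv] - w @ X[sv]                           # margin condition at the SV
print(w, b)
```

Only the two closest points across the classes end up with nonzero $\alpha_i$: the support vectors.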

## feature extraction

### PCA

• way1: maximize the variance of the projected samples
• way2: minimize the difference (least squares) of the reconstructed samples

using way1: maximize $w^T S w$ s.t. $\Vert w\Vert = 1$; the solution is the eigenvector of the covariance $S$ with the largest eigenvalue

using way2: the same eigenvectors come out, since the reconstruction error of dropping a direction is exactly the variance along it
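Both views lead to the eigendecomposition of the sample covariance, which can be sketched directly (synthetic data, arbitrary covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=2000)

Xc = X - X.mean(axis=0)                         # center first
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
w = eigvec[:, -1]                               # top eigenvector = principal direction
z = Xc @ w                                      # 1-D projection
print(eigval[-1], z.var(ddof=1))                # projected variance = top eigenvalue
```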

### LDA

maximize the between-class scatter while minimizing the within-class scatter: Fisher criterion $J(w) = \frac{w^T S_b w}{w^T S_w w}$, solved in the two-class case by $w \propto S_w^{-1}(m_1 - m_2)$
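A two-class Fisher discriminant sketch, $w = S_w^{-1}(m_1 - m_2)$, on two synthetic blobs (the means and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(300, 2))     # class 1
X2 = rng.normal([4, 4], 1.0, size=(300, 2))     # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# within-class scatter S_w = sum of per-class scatter matrices
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(Sw, m1 - m2)                # w = S_w^{-1} (m1 - m2)
print(w / np.linalg.norm(w))                    # projection direction
```

Projecting onto `w` maximizes the ratio of between-class to within-class scatter along one axis.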

### KPCA

replace the inner products with kernel evaluations: PCA is performed in the implicit feature space by eigendecomposing the (centered) kernel matrix $K$
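A bare KPCA sketch with an RBF kernel; the bandwidth and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                         # RBF kernel matrix
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                                # center in feature space
eigval, eigvec = np.linalg.eigh(Kc)
# projection of the training data onto the top 2 kernel components
Z = eigvec[:, -2:] * np.sqrt(np.maximum(eigval[-2:], 0))
print(Z.shape)
```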

### LLE

use each point's neighbors to reconstruct it

1. find w that $\min_w\Vert x_i-\sum_{j\in N(i)}w_{ij}x_j\Vert^2$ s.t. $\sum_j w_{ij} = 1$

equivalent to solving, for each point, a small constrained least-squares problem on the local Gram matrix of its neighbors

2. find the embedding y that minimizes $\sum_i\Vert y_i-\sum_j w_{ij}y_j\Vert^2$ with the weights fixed

the result is the bottom (smallest-eigenvalue) eigenvectors of $M = (I-W)^T(I-W)$, discarding the constant one
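The two steps can be sketched end to end; the data (a noisy line in 2-D), neighborhood size, and the regularization on the local Gram matrix are standard practical assumptions, not details from the notes:

```python
import numpy as np

def lle(X, k=5, d=1, reg=1e-3):
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]          # k nearest, skipping self
        Z = X[nbrs] - X[i]                        # local coordinates
        G = Z @ Z.T                               # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)        # regularize (near-singular G)
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                  # enforce sum_j w_ij = 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigval, eigvec = np.linalg.eigh(M)
    return eigvec[:, 1:d + 1]                     # drop the constant eigenvector

t = np.linspace(0, 1, 60)
X = np.column_stack([t, 2 * t]) + 0.001 * np.sin(40 * t)[:, None]
Y = lle(X)
print(Y.shape)                                    # 1-D embedding of the curve
```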

## feature selection

• criterion: a separability measure to score candidate feature subsets
• searching methods: branch and bound, genetic algorithms, forward/backward greedy algorithms

## clustering

### mixture

start from the mixture density $p(x\mid\theta) = \sum_j \pi_j\, p(x\mid\omega_j, \theta_j)$ and estimate the parameters, e.g. by EM

### hierarchical

agglomerative: start with every point as its own cluster -> keep merging until every point is in one big cluster

• single linkage: merge the pair of clusters whose minimum inter-point distance is smallest
• complete linkage: merge using the maximum inter-point distance between clusters
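A minimal agglomerative sketch using scipy's hierarchy routines on two toy blobs (data and seed are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.2, (20, 2)), rng.normal(3, 0.2, (20, 2))])

Z = linkage(X, method='single')              # min-distance merges; 'complete' for max
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                                # two flat clusters
```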

### spectral

build a graph whose nodes are the points and whose edge weights are pairwise similarities.

simplest Laplacian: L = D - W, where $W$ is the affinity matrix and $D$ the diagonal degree matrix
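An unnormalized spectral-clustering sketch on two obvious blobs; the Gaussian similarity bandwidth is an assumption, and a median split on the second eigenvector stands in for k-means to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 2.0 ** 2))         # Gaussian similarity, bandwidth 2
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W                                # unnormalized graph Laplacian

eigval, eigvec = np.linalg.eigh(L)
f = eigvec[:, 1]                         # second-smallest eigenvector (Fiedler)
labels = (f > np.median(f)).astype(int)  # split into two clusters
print(labels)
```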

## linear discriminant functions

### perceptron

• perceptron criterion: $J(a) = \sum_{y\,\text{misclassified}}(-a^Ty)$, summed over the misclassified samples only
• add a margin to the boundary: require $a^Ty > b$
• single sample update and batch update
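A single-sample perceptron sketch with a margin, on augmented, sign-normalized samples (class-2 samples multiplied by $-1$); the toy data and margin value are assumptions:

```python
import numpy as np

def perceptron(Y, b=0.1, epochs=100):
    """Cycle through samples; add any y with a^T y <= b to the weight vector."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        updated = False
        for y in Y:
            if a @ y <= b:          # misclassified or inside the margin
                a = a + y           # single-sample update rule
                updated = True
        if not updated:
            return a                # converged: a^T y > b for all samples
    return a

# rows are [1, x1, x2], with class-2 rows negated
Y = np.array([[1, 2.0, 2.0], [1, 2.0, 3.0], [-1, 0.0, 0.0], [-1, -1.0, 0.0]])
a = perceptron(Y)
print(a)
```

Batch update would instead sum all violating samples per pass; convergence is guaranteed only for linearly separable data.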

### MSE

solve $Ya = b$ in the least-squares sense, i.e. minimize $\Vert Ya - b\Vert^2$:

• LMS: gradient descent, $a \leftarrow a + \eta\,(b_k - a^Ty_k)\,y_k$
• pseudo-inverse: $a = Y^{+}b = (Y^TY)^{-1}Y^Tb$
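The pseudo-inverse route is one line with numpy; the sample matrix is the same kind of augmented toy data assumed above, and $b=\mathbf{1}$ is the classic margin-vector choice:

```python
import numpy as np

# rows are [1, x1, x2] with class-2 rows negated; b sets all target margins to 1
Y = np.array([[1, 2.0, 2.0], [1, 2.0, 3.0], [-1, 0.0, 0.0], [-1, -1.0, 0.0]])
b = np.ones(4)
a = np.linalg.pinv(Y) @ b            # a = Y^+ b minimizes ||Ya - b||^2
print(a, Y @ a)
```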

## model selection

no free lunch: averaged over all possible problems no classifier is better than another, so model choice must rest on assumptions about the data

visualize ROC curve: http://www.navan.name/roc/