Bayesian Decision

decision theory

min error: decide $\omega_i$ if $P(\omega_i\mid x) \ge P(\omega_j\mid x)$ for all $j$

min risk: decide the action $\alpha_i$ that minimizes the conditional risk $R(\alpha_i\mid x) = \sum_j \lambda(\alpha_i\mid\omega_j)\,P(\omega_j\mid x)$

for every single x, choose the decision that minimizes the conditional error / risk; doing so at every x minimizes the overall expected loss.
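
A minimal numeric sketch of both rules (the priors, likelihoods, and loss matrix below are made-up illustration values): with an asymmetric loss, the min-risk decision can differ from the min-error one.

```python
import numpy as np

priors = np.array([0.6, 0.4])
# class-conditional likelihoods p(x | w_i) at one observed x (illustrative numbers)
likelihoods = np.array([0.2, 0.5])

post = priors * likelihoods
post = post / post.sum()                 # posteriors P(w_i | x)

# minimum-error rule: pick the class with the largest posterior
decide_min_error = np.argmax(post)

# minimum-risk rule: lam[i, j] = loss of choosing class i when the truth is j
lam = np.array([[0.0, 1.0],
                [5.0, 0.0]])
risk = lam @ post                        # conditional risk R(alpha_i | x)
decide_min_risk = np.argmin(risk)
```

Here the posteriors favor class 1, but the heavy penalty for wrongly choosing class 1 flips the min-risk decision to class 0.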

normal distribution

properties:

• linear: $A^TX \sim N(A^T\mu, A^T\Sigma A)$
• whiten: $A_w = \Phi \Lambda^{-1/2}$, where $\Phi$ and $\Lambda$ hold the eigenvectors / eigenvalues of $\Sigma$, so that $A_w^T \Sigma A_w = I$
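
A quick numpy check of the whitening property (the covariance `Sigma` and mean `mu` are arbitrary example values): after applying $A_w$, the sample covariance should be close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5], [1.5, 1.0]])
mu = np.array([1.0, -2.0])
X = rng.multivariate_normal(mu, Sigma, size=5000)

# A_w = Phi Lambda^{-1/2}: Phi eigenvectors, Lambda eigenvalues of Sigma
lam, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(lam ** -0.5)

Y = (X - mu) @ A_w          # whitened samples
cov_Y = np.cov(Y.T)         # should be close to the identity matrix
```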

non-parametric estimation

region: $p(x) \approx \dfrac{k/n}{V}$

where:

• k: the number of samples falling in the region
• n: the total number of samples
• V: the volume of the region

pointwise:

for the sequence of estimates to converge to p(x) at a single point, there are 3 conditions: $V_n \to 0$, $k_n \to \infty$, and $k_n/n \to 0$.

two ways to fix one of the variables:

• k-nearest-neighbor estimation: fix $k_n = \sqrt{n}$ and grow the region until it contains $k_n$ samples
• parzen window: fix the volume $V_n = h_n^d$ (a hypercube with side $h_n$)
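
A sketch of a Parzen estimate with a Gaussian window (the bandwidth rule $h \propto n^{-1/5}$ is Silverman's rule of thumb, an assumption not stated in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=1000)       # samples from N(0, 1)

def parzen(x, X, h):
    # average of Gaussian windows centered at each sample
    return np.mean(np.exp(-0.5 * ((x - X) / h) ** 2) / (h * np.sqrt(2 * np.pi)))

h = 1.06 * X.std() * len(X) ** (-1 / 5)   # Silverman's rule of thumb (assumed)
p0 = parzen(0.0, X, h)                    # estimate of p(0); true value is ~0.399
```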

parametric estimation

MLE / BE:

key:

MLE finds the $\theta$ that maximizes the likelihood $p(D\mid\theta) = \prod_i p(x_i\mid\theta)$ (maximizing the posterior $p(\theta\mid D) = \frac{p(D \mid \theta)\, p(\theta)}{Z}$ instead is MAP)

BE keeps the whole posterior $p(\theta\mid D) = \frac{p(D\mid\theta)\,p(\theta)}{\int p(D\mid\theta)\,p(\theta)\,d\theta}$ and predicts with $p(x\mid D) = \int p(x\mid\theta)\,p(\theta\mid D)\,d\theta$
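
A sketch contrasting the two for a Gaussian mean with known variance (the conjugate prior $N(0, 1)$ is an arbitrary choice): the MLE is the sample mean, while the Bayesian posterior mean shrinks it slightly toward the prior, and the two agree as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                                   # known variance
X = rng.normal(3.0, np.sqrt(sigma2), size=10000)

# MLE: maximize p(D | theta)  ->  sample mean for a Gaussian mean
mu_mle = X.mean()

# BE: posterior of mu under an assumed conjugate N(mu0, s02) prior
mu0, s02 = 0.0, 1.0
n = len(X)
post_var = 1.0 / (n / sigma2 + 1.0 / s02)
mu_be = post_var * (n / sigma2 * X.mean() + mu0 / s02)
```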

EM:

in order to maximize $L(\theta) = \log P(Y\mid\theta)=\log(\int_Z P(Y\mid Z,\theta)P(Z\mid\theta)\,dZ\,)$

iteratively, using Jensen's inequality to find a lower bound for $L(\theta) - L(\theta^i)$

maximizing that lower bound over $\theta$ is equivalent to maximizing the Q function $Q(\theta, \theta^i) = E_{Z\mid Y,\theta^i}\!\left[\log P(Y, Z\mid\theta)\right]$
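
A minimal EM sketch for a two-component 1-D Gaussian mixture (the data and initial values are made up): the E-step computes responsibilities, and the M-step maximizes the Q function in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

pi = np.array([0.5, 0.5])          # mixing weights
mu = np.array([-1.0, 1.0])         # arbitrary initial means
var = np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: responsibilities (posterior of the latent component)
    r = pi * gauss(X[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of the Q function
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r * X[:, None]).sum(axis=0) / Nk
    var = (r * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
```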

SVM

original model: $\min_{w,b}\ \frac{1}{2}\Vert w\Vert^2$ s.t. $y_i(w^Tx_i + b) \ge 1$ for all $i$

using the Lagrange multiplier method, this is equivalent to $\min_{w,b}\ \max_{\alpha \ge 0}\ \frac{1}{2}\Vert w\Vert^2 - \sum_i \alpha_i\left(y_i(w^Tx_i + b) - 1\right)$

note the max part: to see that it reproduces the original objective, discuss both cases — when a constraint is violated the max is $+\infty$, and when it holds the best $\alpha_i$ is 0

then Lagrange duality may apply: $\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^T x_j$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$
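
A tiny sketch solving the dual by projected gradient ascent on two hand-picked points where the answer is known in closed form ($\alpha = (0.5, 0.5)$, $w = (1, 0)$); real solvers use SMO or a QP package instead.

```python
import numpy as np

# toy separable data: the max-margin solution is w = (1, 0), alpha = (0.5, 0.5)
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

K = (y[:, None] * y[None, :]) * (X @ X.T)    # Q_ij = y_i y_j x_i . x_j
alpha = np.zeros(2)
eta = 0.1
for _ in range(1000):
    grad = 1.0 - K @ alpha                   # gradient of the dual objective
    alpha += eta * grad
    # crude projection onto alpha >= 0 and sum_i alpha_i y_i = 0
    alpha = np.maximum(alpha, 0.0)
    c = (alpha @ y) / 2.0
    alpha = np.maximum(alpha - c * y, 0.0)

w = (alpha * y) @ X                          # recover w from the multipliers
```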

feature extraction

PCA

• way1: maximize the variance of the projected samples
• way2: minimize the (least-squares) reconstruction error of the samples

using way1: $\max_w\ w^T \Sigma w$ s.t. $w^T w = 1$; the solution is the top eigenvectors of the sample covariance $\Sigma$

using way2: $\min \sum_i \Vert x_i - \hat{x}_i \Vert^2$ over rank-$d'$ reconstructions $\hat{x}_i$; the same top eigenvectors result
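
A sketch of way1 on synthetic data whose variance is concentrated along the first axis (the scales 3.0 / 0.5 are arbitrary): the top eigenvector of the sample covariance should align with that axis, and the reconstruction error stays small.

```python
import numpy as np

rng = np.random.default_rng(0)
# most of the variance lies along the first coordinate
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])

Xc = X - X.mean(axis=0)
S = np.cov(Xc.T)
vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
order = np.argsort(vals)[::-1]
U = vecs[:, order[:1]]                  # top principal direction

Z = Xc @ U                              # projected samples (max variance, way1)
X_rec = Z @ U.T                         # reconstruction (min squared error, way2)
```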

LDA

minimize the within-class scatter while maximizing the between-class scatter: $\max_w\ J(w) = \frac{w^T S_b w}{w^T S_w w}$; for two classes the solution is $w \propto S_w^{-1}(m_1 - m_2)$
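
A two-class Fisher LDA sketch (class means and spreads are arbitrary toy values): $w = S_w^{-1}(m_1 - m_2)$ should separate the projected classes completely here.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
X2 = rng.normal([3.0, 3.0], 0.5, size=(100, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# within-class scatter = sum of per-class scatter matrices
Sw = np.cov(X1.T) * (len(X1) - 1) + np.cov(X2.T) * (len(X2) - 1)

w = np.linalg.solve(Sw, m1 - m2)        # Fisher direction
w /= np.linalg.norm(w)

p1, p2 = X1 @ w, X2 @ w                 # 1-D projections of the two classes
```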

KPCA

perform PCA in a kernel-induced feature space: replace inner products with a kernel $k(x_i, x_j)$ and eigendecompose the centered kernel matrix $K$ instead of the covariance matrix

LLE

using neighbor to reconstruct the data

1. find w that $\min_w\Vert x_i-\sum_{j\in N(i)}w_{ij}x_j\Vert^2$ s.t. $\sum_j w_{ij} = 1$

equ. to solving the local linear system $G_i w_i = \mathbf{1}$, where $G_i$ is the Gram matrix of the centered neighbors, then rescaling $w_i$ so its entries sum to 1

2. find embedding y that $\min_Y \sum_i \Vert y_i - \sum_{j} w_{ij}\, y_j \Vert^2$, keeping the weights fixed, subject to $\sum_i y_i = 0$ and $\frac{1}{n}\sum_i y_i y_i^T = I$

result is the eigenvectors of $M = (I - W)^T(I - W)$ with the smallest eigenvalues, discarding the constant eigenvector for eigenvalue 0
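
A compact LLE sketch on points lying near a line in 3-D (the neighborhood size k = 4 and the tiny regularizer are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 20)
# a 1-D manifold (a line) embedded in 3-D, with slight noise
X = np.stack([t, 2 * t, 3 * t], axis=1) + 0.001 * rng.normal(size=(20, 3))

n, k = X.shape[0], 4
# step 1: local reconstruction weights
W = np.zeros((n, n))
for i in range(n):
    d = np.linalg.norm(X - X[i], axis=1)
    nbrs = np.argsort(d)[1:k + 1]           # k nearest neighbors (excluding self)
    Z = X[nbrs] - X[i]                      # centered neighbors
    G = Z @ Z.T + 1e-6 * np.eye(k)          # regularized local Gram matrix
    w = np.linalg.solve(G, np.ones(k))
    W[i, nbrs] = w / w.sum()                # enforce the sum-to-one constraint

# step 2: embedding from the bottom eigenvectors of M = (I - W)^T (I - W)
M = (np.eye(n) - W).T @ (np.eye(n) - W)
vals, vecs = np.linalg.eigh(M)
Y = vecs[:, 1:2]                            # skip the constant 0-eigenvector
```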

feature selection

• criterion
• searching methods: branch and bound, genetic algorithms, forward/backward greedy algorithms
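
A sketch of the forward greedy search with a least-squares $R^2$ criterion (the toy data, where features 0 and 2 carry the signal and feature 1 is noise, is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
# features 0 and 2 are noisy copies of the target; feature 1 is pure noise
X = np.stack([y + 0.1 * rng.normal(size=200),
              rng.normal(size=200),
              -y + 0.1 * rng.normal(size=200)], axis=1)

def score(S):
    # criterion: R^2 of a least-squares fit on the selected feature subset
    A = X[:, list(S)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / (y @ y)

selected = []
for _ in range(2):          # forward greedy: add the single best feature each round
    best = max((f for f in range(3) if f not in selected),
               key=lambda f: score(selected + [f]))
    selected.append(best)
```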

clustering

mixture

start from the mixture density: $p(x\mid\theta) = \sum_{j=1}^{c} \pi_j\, p(x\mid\omega_j, \theta_j)$, then estimate the component parameters (e.g. with EM)

hierarchical

start with every point as its own cluster -> merge until every point is assigned into one big cluster

• single linkage: merge the pair of clusters whose minimum inter-point distance is the smallest
• complete linkage: use the maximum distance between clusters instead
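
A brute-force single-linkage sketch ($O(n^3)$, fine for toy data): repeatedly merge the closest pair of clusters until the requested number remains.

```python
import numpy as np

def single_linkage(X, n_clusters):
    # start: every point is its own cluster
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: cluster distance = min pairwise distance
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
out = single_linkage(X, 2)      # two well-separated groups
```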

spectral

generalize: graph nodes for the sample points, weighted edges for their pairwise similarities.

simplest: the unnormalized graph Laplacian $L = D - W$, with $W$ the similarity matrix and $D$ the diagonal degree matrix $D_{ii} = \sum_j W_{ij}$
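
A sketch checking a basic property of $L = D - W$: the multiplicity of eigenvalue 0 equals the number of connected components of the similarity graph (the 5-node graph below is a made-up example with two components).

```python
import numpy as np

# similarity graph with two connected components: {0, 1, 2} and {3, 4}
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))
L = D - W                            # unnormalized graph Laplacian

vals, vecs = np.linalg.eigh(L)
n_zero = int(np.sum(vals < 1e-8))    # multiplicity of eigenvalue 0
```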

linear discriminant functions

perceptron

• perceptron criterion, summed over the misclassified samples: $J(a) = \sum_{y \in \mathcal{Y}_{mis}}(-a^Ty)$
• add a margin to the boundary: $a^Ty > b$
• single sample update and batch update
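
A single-sample-update perceptron sketch on separable toy data (the class centers are arbitrary; samples are augmented with a bias 1 and class-2 samples negated, so the rule becomes $a^Ty > 0$ for all samples):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 2.0], 0.5, size=(50, 2))
X2 = rng.normal([-2.0, -2.0], 0.5, size=(50, 2))

# augment with 1 for the bias; negate class-2 samples
Y = np.vstack([np.hstack([X1, np.ones((50, 1))]),
               -np.hstack([X2, np.ones((50, 1))])])

a = np.zeros(3)
for _ in range(100):                 # passes over the data
    errors = 0
    for yk in Y:
        if a @ yk <= 0:              # misclassified: contributes -a^T y to J
            a = a + yk               # single-sample update
            errors += 1
    if errors == 0:                  # converged: everything classified correctly
        break
```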

MSE

solve $Ya = b$ in the least-squares sense, minimizing $\Vert Ya - b \Vert^2$:

• LMS: gradient descent, $a \leftarrow a + \eta\,(b_k - a^T y_k)\,y_k$
• pseudo-inverse: $a = Y^\dagger b = (Y^TY)^{-1}Y^Tb$
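
A pseudo-inverse sketch (the random design matrix and the unit margin vector b are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
# augmented sample matrix: 20 samples, 2 features plus a bias column of ones
Y = np.hstack([rng.normal(size=(20, 2)), np.ones((20, 1))])
b = np.ones(20)                 # target margins

a = np.linalg.pinv(Y) @ b       # MSE solution: minimizes ||Y a - b||^2
```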

model selection

no free lunch

visualize ROC curve: http://www.navan.name/roc/