# Language models revisited

This blog is inspired by an excellent slides *From Sequence Modeling to Translation* by Kyunghyun Cho. You can view it here.

## Why language models

When it comes to the NLP problem, we often want to know how **likely** a sentence is, that is, for a sentence \(s=w_1w_2...w_T\), we want to model the probability:

As we know, n-gram is the one of the most popular language models. n is typically 1, 2, 3 or even 4 for most of the time.

n-gram has made the following two assumptions:

- Markov independence: the current word depends on previous n words.
- Non-parametric Estimator: probability is directly counted.

namely, we get

\[P(s) = \prod_i P(w_i \vert w_{i-1}, ..., w_{i-n})\]where each probability on the right is computed by counting the occurrence of words in our training corpus, like

\[P(w_i \vert w_{i-1}) = \frac{count(w_{i-1} w_i)}{\sum_j count(w_{i-1} w_j)}\]n-gram models are facing two potential problems:

- Rare terms: It’s clear that our corpus don’t cover all the word that may appear in the reality, which leads to some zero probabilities in n-gram while it’s actually very possible.
- No generalization: You know one word
*w*is followed by another, but you cannot get the similar probability for other words close to*w*in their meaning.

To solve the first problem, people invented many smoothing methods. The word *smoothing* vividly explains what it does: to *smooth* the distribution for all n-grams, those with high probability are more or less lowered and the *spared probability* is then distributed to the n-grams with 0 probability.

There’re quite a lot of smoothing methods and people have done many comparisions for them.

- Laplace smoothing or additive smoothing
- Good-Turing
- Katz smoothing
- Jelinek-Mercer smoothing
- Witten-Bell smoothing
- Kneser-Ney smoothing
- Church-Gale

## Move to Neural Network Language Models(NNLM)

### Relax the 2nd assumption

When we convert the non-parametric estimator to a parametric one, we disgard the previous n-gram probability and turn to a function parameterized by, say, theta.

\[P(x_t \vert x_{t-n}, ..., x_{t-1}) \approx f_\theta(x_{t-n}, ..., x_{t-1})\]We denote the word embedding matrix with **W**, internal map with **U** and map with **V** back to vocabulary.

\(\forall t, Wx_t\) is the word’s embedding.

Concatenate all the adjacent words together, we get the context vector.

\[h_t = (x_t, x_{t-1}, ..., x_{t-n+1})\]The unnormalized probability is thus

\[y = V(\phi(Uh+b)+c)\]Using softmax on it will yield the final probability we want.

\[P(x_t=i\vert ...) = \frac{\exp(y_i)}{\sum_j\exp(y_j)}\]### Relax the first assumption

What if we do not preserve the n-th order markov assumption? We get

\[P(x_1, ..., x_T) \approx \prod_{t=1}^T P(x_t \vert x_{t-1}, ..., x_1)\]We thus use the recurrent network to produce it. Every probability can be computed by

\[h_t = h_{t-1} \oplus x_t \text{ and } y_t = \triangleright h_t\]Here two operators are based on data and network design.

- Addition
- Pop Out

The addition may be implemented as some affine transformation(with a nonlinear function), and the Pop may be another affine one. The final probability can also be optained using softmax.

To train such an RNN, use Back-Propagation Through Time(BPTT). I haven’t did any research on it now.

## Tips for RNN

RNN does have new problems, too.

**Gradient Vanishing**may occur when the train is long.- There’s also so-called
**Gradient Exploding**, too. - RNN may not remember things with too long distance.

For the third problem, Gated Recurrent Unit(GRU) or LSTM may help. We can also choose to use **Attention-based Models**, but that’s another question and needs a new post.

We may also adopt some **adaptive learning rate algorithm** (Adagrad, Adadelta, Adam, et al.) when training networks.

Theano gives the automatic differentiation, too.

## Bibliography

- Kyunghyun Cho. From Sequence Modeling to Translation. 2015. https://drive.google.com/file/d/0B16RwCMQqrtdNEhwbHN2bXJzdXM/view
- 宗成庆. 统计自然语言处理（第2版）. 清华大学出版社, 2013. Douban. Web. 21 Oct. 2015.

This work is licensed under a Creative Commons Attribution 4.0 International License.