# Language Modeling

## N-Gram

$$P(X_1=x_1,X_2=x_2,\ldots,X_n=x_n)=P(X_1=x_1)\prod_{i=2}^{n}{P(X_i=x_i|X_1=x_1,\ldots,X_{i-1}=x_{i-1})}$$

$$P(X_1=x_1,X_2=x_2,\ldots,X_n=x_n)=\prod_{i=1}^{n}{P(X_i=x_i|X_{i-2}=x_{i-2},X_{i-1}=x_{i-1})}$$

$$P(X_1=x_1,X_2=x_2,\ldots,X_n=x_n)=\prod_{i=1}^{n}{P(X_i=x_i|X_{i-1}=x_{i-1})}$$

$$P(\text{I want to eat Chinese food})=P(I)P(want|I)P(to|want)P(eat|to)P(Chinese|eat)P(food|Chinese)$$

$$=0.25\times\frac{1087}{3437}\times\frac{786}{1215}\times\frac{860}{3256}\times\frac{19}{938}\times\frac{120}{213}\approx 0.000154171$$
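The product above can be reproduced directly from the counts in the fractions. A minimal sketch, where the count dictionaries are taken from the worked example and `p_first` stands in for $P(I)=0.25$:

```python
# Bigram and unigram counts from the worked example above
# (e.g. c(I, want) = 1087, c(I) = 3437).
bigram_counts = {
    ("I", "want"): 1087, ("want", "to"): 786, ("to", "eat"): 860,
    ("eat", "Chinese"): 19, ("Chinese", "food"): 120,
}
unigram_counts = {"I": 3437, "want": 1215, "to": 3256, "eat": 938, "Chinese": 213}

def bigram_prob(sentence, p_first=0.25):
    """P(x_1..x_n) = P(x_1) * prod_i P(x_i | x_{i-1}) under the bigram assumption."""
    prob = p_first  # P(I) = 0.25 in the example
    for prev, cur in zip(sentence, sentence[1:]):
        prob *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return prob

print(bigram_prob(["I", "want", "to", "eat", "Chinese", "food"]))  # ≈ 0.000154
```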

## Evaluating Language Models: Perplexity

$$\prod_{i=1}^{m}p(x^{(i)})$$

$$l=\frac{1}{M}\log_2\prod_{i=1}^{m}p(x^{(i)})=\frac{1}{M}\sum_{i=1}^{m}\log_2 p(x^{(i)})$$

where $M$ is the total number of words in the $m$ held-out sentences. Perplexity is then

$$\text{Perplexity}=2^{-l}$$
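A quick sketch of this computation from held-out sentence probabilities (the values below are hypothetical; `total_words` plays the role of $M$):

```python
import math

def perplexity(sentence_probs, total_words):
    """Perplexity = 2^{-l}, where l = (1/M) * sum_i log2 p(x^(i))."""
    l = sum(math.log2(p) for p in sentence_probs) / total_words
    return 2 ** (-l)

# Toy held-out set: 4 one-word "sentences", each assigned probability 0.25.
# This behaves like a uniform 4-word vocabulary, so perplexity is 4.
print(perplexity([0.25, 0.25, 0.25, 0.25], total_words=4))  # → 4.0
```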

## Smoothed Estimation of Trigram Models

### Linear interpolation

$$q_{ML}(w|u,v)=\frac{c(u,v,w)}{c(u,v)}\qquad q_{ML}(w|v)=\frac{c(v,w)}{c(v)}\qquad q_{ML}(w)=\frac{c(w)}{c()}$$

$$q(w|u,v)=\lambda_1q_{ML}(w|u,v)+\lambda_2q_{ML}(w|v)+\lambda_3q_{ML}(w)$$

$$L(\lambda_1,\lambda_2,\lambda_3)=\sum_{u,v,w}c'(u,v,w)\log q(w|u,v)=\sum_{u,v,w}c'(u,v,w)\log\bigl(\lambda_1 q_{ML}(w|u,v)+\lambda_2 q_{ML}(w|v)+\lambda_3 q_{ML}(w)\bigr)$$

$$\arg\max_{\lambda_1,\lambda_2,\lambda_3} L(\lambda_1,\lambda_2,\lambda_3)\quad\text{s.t.}\quad \lambda_1\geq 0,\ \lambda_2\geq 0,\ \lambda_3\geq 0,\quad \lambda_1+\lambda_2+\lambda_3=1$$
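A sketch of the interpolated estimate $q(w|u,v)$ with hypothetical toy counts; the hard-coded $\lambda_i$ here are placeholders, since in practice they are chosen to maximize $L$ on held-out data:

```python
# Hypothetical counts from a tiny corpus.
trigram = {("the", "cat", "sat"): 2}
bigram  = {("the", "cat"): 3, ("cat", "sat"): 2}
unigram = {"the": 10, "cat": 3, "sat": 2}
total   = 50  # c(): total number of tokens

def q_ml(num, den):
    """Maximum-likelihood estimate num/den, with 0 for an unseen context."""
    return num / den if den else 0.0

def q_interp(w, u, v, lambdas=(0.5, 0.3, 0.2)):
    """q(w|u,v) = l1*qML(w|u,v) + l2*qML(w|v) + l3*qML(w); l_i >= 0, sum to 1."""
    l1, l2, l3 = lambdas
    return (l1 * q_ml(trigram.get((u, v, w), 0), bigram.get((u, v), 0))
            + l2 * q_ml(bigram.get((v, w), 0), unigram.get(v, 0))
            + l3 * q_ml(unigram.get(w, 0), total))

print(q_interp("sat", "the", "cat"))  # 0.5*(2/3) + 0.3*(2/3) + 0.2*(2/50)
```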

### Discounting Methods

As the notes explain clearly, discounting introduces a parameter $\beta$ with $0<\beta<1$ and subtracts it from every observed count: $c^*(v,w)=c(v,w)-\beta$.

$$\alpha (v)=1-\sum_{c(v,w)>0}\frac{c^*(v,w)}{c(v)}$$

$$A(v)=\{w:c(v,w)>0\} \qquad B(v)=\{w:c(v,w)=0\}$$
$A(v)$ contains the words seen after $v$ in the corpus, and $B(v)$ those never seen after $v$. The important case is $B(v)$: even though these bigrams are unseen, their probability should not be zero, so we back off to the $(N-1)$-gram distribution (here, the unigram), giving the following formula:
$$q_D(w|v)=\begin{cases}\dfrac{c^*(v,w)}{c(v)} & \text{if } w\in A(v)\\[2ex] \alpha(v)\,\dfrac{q_{ML}(w)}{\sum_{w'\in B(v)}q_{ML}(w')} & \text{if } w\in B(v)\end{cases}$$

The trigram model is handled the same way; see the notes.

As with linear interpolation, the discount $\beta$ can be chosen to maximize the log-likelihood of held-out data:

$$\sum_{u,v,w}c'(u,v,w)\log q(w|u,v)$$
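A sketch of the discounted bigram model $q_D$ with hypothetical toy counts and $\beta=0.5$; the missing mass $\alpha(v)$ and the unigram backoff follow the formulas above:

```python
BETA = 0.5  # hypothetical discount, 0 < beta < 1

# Hypothetical counts from a tiny corpus.
bigram  = {("the", "cat"): 2, ("the", "dog"): 1}
unigram = {"the": 3, "cat": 2, "dog": 1, "fish": 1}
total   = 7  # total number of tokens

def q_ml(w):
    """Unigram MLE q_ML(w) = c(w) / c()."""
    return unigram.get(w, 0) / total

def q_discounted(w, v):
    """q_D(w|v): discounted MLE for seen bigrams, unigram backoff otherwise."""
    c_v = unigram[v]
    if bigram.get((v, w), 0) > 0:                      # w in A(v)
        return (bigram[(v, w)] - BETA) / c_v           # c*(v,w) / c(v)
    # w in B(v): spread the missing mass alpha(v) over unseen successors
    alpha = 1 - sum((c - BETA) / c_v for (u, _), c in bigram.items() if u == v)
    unseen_mass = sum(q_ml(u) for u in unigram if bigram.get((v, u), 0) == 0)
    return alpha * q_ml(w) / unseen_mass

# Sanity check: probabilities conditioned on "the" sum to 1.
print(sum(q_discounted(w, "the") for w in unigram))
```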