
5.5 Maximum Likelihood Estimation

Previously, we have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

The most common such principle is the maximum likelihood principle.

Consider a set of m examples \mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\} drawn independently from the true but unknown data generating distribution p_{data}(x).

Let p_{model}(x; \theta) be a parametric family of probability distributions over the same space indexed by \theta. In other words, p_{model}(x; \theta) maps any configuration x to a real number estimating the true probability p_{data}(x).

The maximum likelihood estimator for \theta is then defined as $$ \theta_{ML} = \argmax_\theta p_{model}(\mathbb{X};\theta) \qquad{(5.56)}\\ = \argmax_\theta \prod_{i=1}^m p_{model}(x^{(i)};\theta) \qquad{(5.57)} $$

SUMMARY : The subscript **ML** in the equation above stands for maximum likelihood.
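
As a concrete illustration of equation (5.57), here is a minimal sketch; the Bernoulli model family and the toy data are illustrative assumptions, not from the text:

```python
import numpy as np

# Sketch of equation (5.57) for a Bernoulli family,
# p_model(x; theta) = theta^x * (1 - theta)^(1 - x).
X = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # m = 8 binary examples (assumed)
thetas = np.linspace(0.01, 0.99, 99)      # candidate parameter values

def likelihood(theta, X):
    """Product over examples of p_model(x^(i); theta)."""
    return np.prod(theta ** X * (1 - theta) ** (1 - X))

theta_ml = thetas[np.argmax([likelihood(t, X) for t in thetas])]
print(theta_ml, X.mean())  # the argmax coincides with the sample mean, 0.75
```

For a Bernoulli family the argmax is also available in closed form (it is the sample mean); the grid search above is only meant to mirror the definition literally.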

This product over many probabilities can be inconvenient for a variety of reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its argmax but does conveniently transform a product into a sum: $$ \theta_{ML} = \argmax_\theta \sum_{i=1}^m \log p_{model}(x^{(i)};\theta) \qquad{(5.58)} $$ Because the argmax does not change when we rescale the cost function, we can divide by m to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution \hat p_{data} defined by the training data: $$ \theta_{ML} = \argmax_\theta \mathbb{E}_{x \sim \hat p_{data}} \log p_{model}(x;\theta) \qquad{(5.59)} $$

SUMMARY : dividing equation (5.58) by m yields equation (5.59)

SUMMARY : \hat p_{data} denotes the empirical distribution defined by the training set
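
The underflow problem mentioned above can be seen directly. A small sketch; the unit-variance Gaussian model and the sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Why the product in (5.57) underflows while the log-sum in (5.58) does not.
x = rng.normal(size=10_000)  # assumed data from a standard Gaussian
theta = 0.0                  # candidate mean of a unit-variance Gaussian model

densities = np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)
print(np.prod(densities))          # 0.0 -- underflows in float64
print(np.sum(np.log(densities)))   # finite log-likelihood (order -1e4 here)
```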

One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution \hat p_{data} defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by $$ D_{KL}(\hat p_{data} \mid\mid p_{model}) = \mathbb{E}_{x \sim \hat p_{data}} [\log \hat p_{data}(x) - \log p_{model}(x)] \qquad{(5.60)} $$ The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize $$ -\mathbb{E}_{x \sim \hat p_{data}}[\log p_{model}(x)] \qquad{(5.61)} $$ which is of course the same as the maximization in equation (5.59).
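
This decomposition is easy to verify numerically. A sketch; the two discrete distributions below are illustrative assumptions:

```python
import numpy as np

# Equation (5.60): D_KL(p_hat || p_model) splits into a model-independent
# (negative) entropy term and the cross-entropy term from (5.61).
p_hat   = np.array([0.5, 0.25, 0.25])   # assumed empirical distribution
p_model = np.array([0.6, 0.2, 0.2])     # assumed model distribution

kl            = np.sum(p_hat * (np.log(p_hat) - np.log(p_model)))
neg_entropy   = np.sum(p_hat * np.log(p_hat))      # depends only on data
cross_entropy = -np.sum(p_hat * np.log(p_model))   # the term we minimize

print(np.isclose(kl, neg_entropy + cross_entropy))  # True
```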

Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
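
To make the last claim concrete, a short derivation (not in the original text; it assumes a Gaussian model \mathcal{N}(x; \mu, \sigma^2) with fixed variance \sigma^2): $$ -\mathbb{E}_{x \sim \hat p_{data}} \log \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{2\sigma^2}\, \mathbb{E}_{x \sim \hat p_{data}} (x - \mu)^2 + \frac{1}{2}\log(2\pi\sigma^2) $$ The second term does not depend on \mu, so minimizing this cross-entropy over \mu is exactly minimizing the mean squared error.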

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution \hat p_{data} . Ideally, we would like to match the true data generating distribution p_{data} , but we have no direct access to this distribution.

While the optimal \theta is the same regardless of whether we are maximizing the likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross-entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when x is real-valued.
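
A minimal sketch of this software framing, phrasing maximum likelihood as NLL minimization with a generic optimizer; the Bernoulli model and toy data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # assumed binary training data

def nll(theta):
    """Negative log-likelihood of the data under Bernoulli(theta)."""
    return -np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, X.mean())  # the minimizer recovers the sample mean, 0.75
```

Note that here the NLL is nonnegative because the model is discrete; with a continuous density the NLL can dip below zero, which is why the KL view's known minimum of zero is convenient.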

5.5.1 Conditional Log-Likelihood and Mean Squared Error

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y \mid x; \theta) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning. If X represents all our inputs and Y all our observed targets, then the conditional maximum likelihood estimator is $$ \theta_{ML} = \argmax_\theta P(Y \mid X; \theta) \qquad{(5.62)} $$ If the examples are assumed to be i.i.d., then this can be decomposed into $$ \theta_{ML} = \argmax_\theta \sum_{i=1}^m \log P(y^{(i)} \mid x^{(i)}; \theta) \qquad{(5.63)} $$
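A sketch of equation (5.63) for a linear model with Gaussian noise; the model p(y \mid x; w) = \mathcal{N}(y; wx, \sigma^2), the true weight, and the noise level are illustrative assumptions. Under that assumption, maximizing the conditional log-likelihood coincides with ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

m = 200
x = rng.uniform(-1, 1, size=m)
y = 2.5 * x + rng.normal(scale=0.3, size=m)   # assumed true w = 2.5

def conditional_log_likelihood(w, sigma=0.3):
    """Sum over i of log N(y^(i); w * x^(i), sigma^2), as in (5.63)."""
    return np.sum(-0.5 * ((y - w * x) / sigma) ** 2
                  - 0.5 * np.log(2 * np.pi * sigma ** 2))

ws = np.linspace(0.0, 5.0, 501)
w_ml = ws[np.argmax([conditional_log_likelihood(w) for w in ws])]
w_ls = np.sum(x * y) / np.sum(x * x)           # least-squares solution
print(w_ml, w_ls)                              # agree up to grid spacing
```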

5.5.2 Properties of Maximum Likelihood

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples m \to \infty , in terms of its rate of convergence as m increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see section 5.4.5 above), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter; a short simulation after the list below illustrates this convergence. These conditions are:

  • The true distribution p_{data} must lie within the model family p_{model} ( · ; \theta ). Otherwise, no estimator can recover p_{data} .
  • The true distribution p_{data} must correspond to exactly one value of \theta . Otherwise, maximum likelihood can recover the correct p_{data} , but will not be able to determine which value of \theta was used by the data generating process.
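
The promised simulation sketch of consistency, in a setting where both conditions above hold; the Bernoulli family and the true parameter value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# The MLE of a Bernoulli parameter is the sample mean; as m grows,
# it converges to the true parameter value.
theta_true = 0.3  # assumed true parameter
for m in [10, 100, 10_000, 1_000_000]:
    X = rng.random(m) < theta_true      # m Bernoulli(theta_true) draws
    print(m, X.mean())                  # estimates tighten around 0.3
```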

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.