7.8 Early Stopping
SUMMARY : Early stopping is motivated by the following empirical fact:
When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again. See figure 7.3 for an example of this behavior. This behavior occurs very reliably.
This means we can obtain a model with better validation set error (and thus, hopefully, better test set error) by returning to the parameter setting at the point in time with the lowest validation set error. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters. The algorithm terminates when no parameters have improved over the best recorded validation error for some pre-specified number of iterations. This procedure is specified more formally in algorithm 7.1.
SUMMARY : The optimal model parameters are selected according to the validation set error.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.
Let n be the number of steps between evaluations.   // how many training steps to run before each evaluation
Let p be the “patience,” the number of times to observe worsening validation set error before giving up.
Let θ_0 be the initial parameters.
θ ← θ_0          // θ is the model parameters
i ← 0            // number of training steps taken so far
j ← 0            // number of evaluations in a row without improvement
v ← ∞            // best validation set error seen so far
θ^∗ ← θ
i^∗ ← i          // best number of training steps
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v' ← ValidationSetError(θ)   // validation set error of the current model
    if v' < v then               // validation set error improved, so update the best values
        j ← 0
        θ^∗ ← θ
        i^∗ ← i
        v ← v'
    else
        j ← j + 1
    end if
end while
Best parameters are θ^∗, best number of training steps is i^∗.
SUMMARY : What the algorithm above really determines is the optimal number of training steps, since that is the step count at which the model performs best. The number of training steps is thus itself a hyperparameter.
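As a concrete illustration, here is a minimal Python sketch of algorithm 7.1. The train_steps and validation_error callbacks are hypothetical stand-ins, not from any particular library: train_steps(theta, n) runs the training algorithm for n steps and returns the updated parameters, and validation_error(theta) returns the validation set error of those parameters.

import copy
import math

def early_stopping(theta_0, train_steps, validation_error, n=100, p=5):
    """Algorithm 7.1: return (best_theta, best_i), the parameters with the
    lowest observed validation error and the step count that produced them."""
    theta = copy.deepcopy(theta_0)    # current model parameters
    i = 0                             # training steps taken so far
    j = 0                             # evaluations in a row without improvement
    v = math.inf                      # best validation set error seen so far
    best_theta = copy.deepcopy(theta)
    best_i = i                        # best number of training steps
    while j < p:
        theta = train_steps(theta, n)      # run the training algorithm for n steps
        i += n
        v_new = validation_error(theta)    # evaluate the current model
        if v_new < v:                      # improvement: reset patience, store a copy
            j = 0
            best_theta = copy.deepcopy(theta)
            best_i = i
            v = v_new
        else:                              # no improvement: spend one unit of patience
            j += 1
    return best_theta, best_i

For example, with the toy callbacks step = lambda theta, n: theta + 0.01 * n and err = lambda theta: (theta - 1.0) ** 2, the run halts p evaluations after theta passes 1.0 and returns the copy of theta recorded nearest that point, together with the step count that produced it.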
THINKING : Why is the validation set used here? See deep learning book chapter 5.3, Hyperparameters and Validation Sets.
This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter. We can see in figure 7.3 that this hyperparameter has a U-shaped validation set performance curve. Most hyperparameters that control model capacity have such a U-shaped validation set performance curve, as illustrated in figure 5.3. In the case of early stopping, we are controlling the effective capacity of the model by determining how many steps it can take to fit the training set. Most hyperparameters must be chosen using an expensive guess and check process, where we set a hyperparameter at the start of training, then run training for several steps to see its effect. The “training time” hyperparameter is unique in that by definition a single run of training tries out many values of the hyperparameter. The only significant cost to choosing this hyperparameter automatically via early stopping is running the validation set evaluation periodically during training. Ideally, this is done in parallel to the training process, on a separate machine, separate CPU, or separate GPU from the main training process. If such resources are not available, then the cost of these periodic evaluations may be reduced by using a validation set that is small compared to the training set or by evaluating the validation set error less frequently and obtaining a lower-resolution estimate of the optimal training time.
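A small sketch of this view, reusing the hypothetical train_steps and validation_error callbacks (and the copy import) from the sketch above: a single run records validation error at every multiple of n, which amounts to trying the training-time values n, 2n, 3n, … in one pass.

def sweep_training_time(theta_0, train_steps, validation_error, n=100, max_steps=10000):
    """Record validation error every n steps during one training run; one pass
    effectively tries the training-time values n, 2n, 3n, ..., max_steps."""
    theta = copy.deepcopy(theta_0)
    history = []                          # (steps trained, validation error) pairs
    for i in range(n, max_steps + 1, n):
        theta = train_steps(theta, n)
        history.append((i, validation_error(theta)))
    best_i, best_v = min(history, key=lambda pair: pair[1])   # lowest validation error
    return history, best_i

Choosing a larger n makes the periodic validation evaluations cheaper, but the estimate of the optimal training time becomes correspondingly lower-resolution, which is exactly the trade-off described in the paragraph above.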
SUMMARY : How should the sentence “The ‘training time’ hyperparameter is unique in that by definition a single run of training tries out many values of the hyperparameter” be understood? In my understanding, early stopping involves only the number-of-training-steps hyperparameter. The point is that after i steps of a single run, the current parameters θ are exactly what we would get by setting the training-time hyperparameter to i, so one run implicitly evaluates every candidate value n, 2n, 3n, … along the way. Any other hyperparameter (say, a learning rate or a weight decay coefficient) would instead require a separate complete training run per candidate value.