5.2 Capacity, Overfitting and Underfitting
The train and test data are generated by a probability distribution over datasets called the data generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the train set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption allows us to describe the data generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data generating distribution, denoted p_{data}. This probabilistic framework and the i.i.d. assumptions allow us to mathematically study the relationship between training error and test error.
The passage above is one of the harder parts of the original text to follow. Restated: the examples within each dataset are assumed to be mutually independent, and the training set and test set are assumed to be identically distributed, i.e., drawn from the same underlying probability distribution. Under these assumptions the whole **data generating process** can be described by a probability distribution over a single example; that same distribution generates every training example and every test example, and we call it the data generating distribution. This probabilistic framework and the i.i.d. assumptions are what let us study the relationship between training error and test error mathematically.
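To make this concrete, here is a minimal sketch (assuming NumPy and a made-up generating distribution y = sin(pi·x) + noise, not anything from the book) that draws a training set and a test set i.i.d. from the same p_{data}, then compares training error and test error as model capacity (polynomial degree) varies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generating distribution p_data:
# x ~ Uniform(-1, 1), y = sin(pi * x) + Gaussian noise.
def sample_p_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=n)
    return x, y

# Under the i.i.d. assumptions, the train set and the test set are
# drawn independently from this same distribution.
x_train, y_train = sample_p_data(20)
x_test, y_test = sample_p_data(1000)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Vary capacity via polynomial degree: a low degree underfits,
# a high degree overfits the small training set.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree={degree}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Because both sets come from the same distribution, the gap between the printed training MSE and test MSE is what the book later analyzes as the generalization gap: small when the model underfits, large when it overfits.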
The "data generating distribution" is the true underlying distribution, and the "data generating process" is the real process by which the data are produced; what we learn from the train set is only a model that approximates it. In statistics, a statistical model is used to describe the "data generating process"; on this, see Statistical model.
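A small sketch of that distinction (with an assumed Gaussian generating distribution whose parameters are known only to the simulation): the statistical model estimates the parameters from the train set and only approximates the true process.

```python
import numpy as np

rng = np.random.default_rng(1)

# True (in practice unknown) data generating distribution: N(mu=2.0, sigma=0.5).
true_mu, true_sigma = 2.0, 0.5
train_set = rng.normal(true_mu, true_sigma, size=100)

# Statistical model: a Gaussian with unknown parameters, fit to the train set
# by maximum likelihood. The fitted model describes the generating process
# approximately; it is not the process itself.
mu_hat = train_set.mean()
sigma_hat = train_set.std()  # MLE uses the biased (1/n) estimator

print(f"true:      mu={true_mu:.3f}  sigma={true_sigma:.3f}")
print(f"estimated: mu={mu_hat:.3f}  sigma={sigma_hat:.3f}")
```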
For more on the **i.i.d. assumptions**, see Independent and identically distributed random variables.
"Probability distribution" is a concept from Probability theory; see probability distributions. "Identically distributed" means that the two probability distributions involved are the same.
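To illustrate what "identically distributed" means in practice, a small sketch (assuming NumPy and SciPy; the samples are made up) that uses a two-sample Kolmogorov-Smirnov test to compare empirical samples:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)

# Two samples from the same distribution ("identically distributed") ...
a = rng.normal(0.0, 1.0, size=500)
b = rng.normal(0.0, 1.0, size=500)

# ... and one sample from a shifted distribution.
c = rng.normal(0.5, 1.0, size=500)

# The two-sample Kolmogorov-Smirnov test compares empirical distributions:
# a large p-value is consistent with the samples sharing one distribution.
print(ks_2samp(a, b))  # high p-value expected
print(ks_2samp(a, c))  # low p-value expected: the distributions differ
```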
Postscript
The discussion of capacity, overfitting and underfitting is one of the central topics in machine learning, and it continues in later chapters.