Statistical Estimators, Underfitting, and Overfitting
An estimator, $\delta$, is a rule that we apply to a dataset $\mathcal{D}$ in order to compute an estimate, $\hat{\theta}$, of a model's weights. In other words, we compute
$$\hat{\theta} = \delta(\mathcal{D})$$
You can think of an estimator as a training procedure: a specific method that, when applied to data, adjusts a model's weights toward their optimal setting, the estimate. Maximum Likelihood Estimation and Maximum A Posteriori estimation, which we will discuss in the lesson on loss functions, are both examples of estimators.
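To make this concrete, here is a minimal sketch in Python: the estimator $\delta$ is simply the sample mean applied to a synthetic Gaussian dataset (which, for a Gaussian, is also the maximum likelihood estimate of $\mu$). The function name `delta` and the simulated data are assumptions made purely for illustration.

```python
import numpy as np

# An estimator delta is just a rule mapping a dataset D to an estimate theta_hat.
# Here the model is a Gaussian with unknown mean, and the estimator is the
# sample mean (also the maximum likelihood estimate of mu).
def delta(dataset: np.ndarray) -> float:
    """Estimator: map a dataset to a parameter estimate."""
    return float(dataset.mean())

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=500)  # hypothetical dataset
theta_hat = delta(D)
print(f"estimate of the mean: {theta_hat:.3f} (true value: 2.0)")
```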
Sampling Distributions
We can often characterize an estimator of a parameter by looking at its sampling distribution. The sampling distribution is the probability distribution we get by treating the estimate as a random variable: because the dataset itself is random, recomputing the estimate on different subsamples of the data yields different values, and the distribution of those values is the sampling distribution. If that sounds confusing, it's because it is, so consider an example. Say we are fitting a normal distribution $\mathcal{N}(x; \mu, \sigma^2)$ to some data, $\mathcal{D}$. Each subsample of $\mathcal{D}$ gives a slightly different sample mean, and the distribution of those sample means is the sampling distribution of $\hat{\mu}$.
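As a rough illustration, the sketch below repeatedly subsamples a synthetic dataset, computes $\hat{\mu}$ on each subsample, and summarizes the spread of the resulting estimates; the subsample size and number of repetitions are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=1000)  # hypothetical dataset

# Draw many subsamples of D, estimate mu on each, and collect the estimates.
# The distribution of these estimates approximates the sampling distribution
# of mu_hat.
estimates = []
for _ in range(5000):
    subsample = rng.choice(D, size=100, replace=True)
    estimates.append(subsample.mean())
estimates = np.array(estimates)

print(f"mean of sampling distribution: {estimates.mean():.3f}")
print(f"std  of sampling distribution: {estimates.std():.3f}")
```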
Bias and Variance
Bias reflects the implicit assumptions built into the model and the learning algorithm. It corresponds to the error arising from the estimator's inability to represent the true function it is approximating. Mathematically, it's defined as the expected difference between an estimator and the true parameter value, i.e.
$$\text{Bias}(\hat{\theta}) = \mathbb{E}\left[\hat{\theta} - \theta^* \right]$$
An unbiased estimator has a bias of 0. Equivalently, the sampling distribution of the estimator is centered on the true parameter value.
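To see bias numerically, the sketch below compares two estimators of a Gaussian's variance across many simulated datasets: the plug-in estimator (divide by $n$), which is biased, and the Bessel-corrected estimator (divide by $n-1$), which is unbiased. The Gaussian data-generating process and the dataset size are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # sigma^2 of the data-generating distribution

# Repeatedly draw small datasets and apply two estimators of the variance:
# the plug-in estimator (divide by n) and the Bessel-corrected one (divide by n-1).
plug_in, corrected = [], []
for _ in range(20000):
    D = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=10)
    plug_in.append(np.var(D, ddof=0))    # biased
    corrected.append(np.var(D, ddof=1))  # unbiased

print(f"E[plug-in]   ~ {np.mean(plug_in):.3f}  -> bias ~ {np.mean(plug_in) - true_var:.3f}")
print(f"E[corrected] ~ {np.mean(corrected):.3f}  -> bias ~ {np.mean(corrected) - true_var:.3f}")
```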
On the other hand, variance is the error that comes from fitting noise in the dataset rather than the underlying function or parameter being estimated. Mathematically, it's the expected squared difference between the estimator $\hat{\theta}$ and its mean $\overline{\theta} = \mathbb{E}[\hat{\theta}]$, i.e.
$$\text{Var}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \overline{\theta})^2\right]$$
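The sketch below estimates the variance of the sample-mean estimator by drawing many independent synthetic datasets, computing $\hat{\mu}$ on each, and measuring the spread of those estimates around their own average $\overline{\theta}$; the comparison against the textbook value $\sigma^2 / n$ is included only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n = 1.0, 25

# Variance of the estimator: draw many independent datasets, compute the
# sample mean on each, and measure how much the estimates spread around
# their own average (theta_bar), not around the true value.
estimates = np.array([rng.normal(0.0, np.sqrt(sigma2), size=n).mean()
                      for _ in range(20000)])
theta_bar = estimates.mean()
variance = np.mean((estimates - theta_bar) ** 2)

print(f"Var(theta_hat) ~ {variance:.4f}  (theory: sigma^2 / n = {sigma2 / n:.4f})")
```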
The Bias-Variance Tradeoff
We haven't gotten to it yet, but in a later lesson we will introduce the mean-squared error (MSE). The MSE computes the error between the output of a neural network $\mathcal{N}(x)$ and the true function $f(x)$. It does so using the formula
$$MSE(\mathcal{N}(x), f(x)) = \mathbb{E}\left[(\mathcal{N}(x) - f(x))^2\right]$$
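In code, the expectation is typically approximated by averaging the squared error over sampled inputs; the sketch below does this for a hypothetical true function and an equally hypothetical imperfect model output.

```python
import numpy as np

# Mean-squared error between a model's outputs and the true function values,
# approximating the expectation by averaging over sampled inputs.
def mse(predictions: np.ndarray, targets: np.ndarray) -> float:
    return float(np.mean((predictions - targets) ** 2))

x = np.linspace(-1.0, 1.0, 200)
f_true = np.sin(np.pi * x)              # hypothetical true function f(x)
f_model = np.sin(np.pi * x) + 0.1 * x   # hypothetical imperfect model output
print(f"MSE: {mse(f_model, f_true):.4f}")
```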
Now, instead of applying the MSE to neural network outputs, we can try applying it to our parameter estimate, i.e.
$$MSE(\hat{\theta}, \theta^*) = \mathbb{E}[(\hat{\theta} - \theta^*)^2]$$
We can decompose this expectation in the following way
$$\begin{align*} \mathbb{E}[(\hat{\theta} - \theta^*)^2] &= \mathbb{E}[(\hat{\theta} - \overline{\theta} + \overline{\theta} - \theta^*)^2] \\
&= \mathbb{E}[(\hat{\theta} - \overline{\theta})^2] + 2\mathbb{E}[(\overline{\theta} - \theta^*)(\hat{\theta} - \overline{\theta})] + \mathbb{E}[(\overline{\theta} - \theta^*)^2] \\
&= \mathbb{E}[(\hat{\theta} - \overline{\theta})^2] + 0 + (\overline{\theta} - \theta^*)^2 \\
&= \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2
\end{align*}$$
The cross term vanishes because $\overline{\theta} - \theta^*$ is a constant that factors out of the expectation, and $\mathbb{E}[\hat{\theta} - \overline{\theta}] = \mathbb{E}[\hat{\theta}] - \overline{\theta} = 0$. This is an essential relationship, and it communicates that there's often no free lunch in deep learning: for a fixed total error, decreasing an estimator's bias increases its variance, and vice versa.
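We can also check the decomposition numerically. The sketch below uses the biased plug-in variance estimator (so that both terms are non-zero) and verifies that the MSE of the estimates matches $\text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2$; the Gaussian data and the choice of estimator are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0   # the true parameter theta* (here: the variance of the data)
n = 10

# Use the biased plug-in variance estimator so both terms are non-zero,
# then check that MSE = Var(theta_hat) + Bias(theta_hat)^2 holds numerically.
estimates = np.array([np.var(rng.normal(0.0, np.sqrt(true_var), size=n), ddof=0)
                      for _ in range(50000)])

mse = np.mean((estimates - true_var) ** 2)
theta_bar = estimates.mean()
var = np.mean((estimates - theta_bar) ** 2)
bias = theta_bar - true_var

print(f"MSE          ~ {mse:.4f}")
print(f"Var + Bias^2 ~ {var + bias**2:.4f}")
```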
As It Applies to Classification
The decomposition of error into squared bias and variance terms above is an artifact of the mean-squared error. Since MSE is the loss function most commonly used for regression tasks (including those involving linear layers and neural networks), the decomposition holds quite well for regression. When we switch to the context of classification, however, the relationship breaks down: risk functions such as the 0-1 loss induce a multiplicative relationship between bias and variance, so the additive bias-variance decomposition is of limited use. For classification it's better to focus on the expected loss directly, which can be estimated via cross-validation.
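As a sketch of estimating expected loss via cross-validation, the example below runs $k$-fold cross-validation for a toy nearest-centroid classifier on synthetic two-class data and reports the average held-out 0-1 loss; the data, the classifier, and the choice of $k = 5$ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: two Gaussian blobs (an assumption for illustration).
n_per_class = 100
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(n_per_class, 2))])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)]).astype(int)

def nearest_centroid_predict(X_train, y_train, X_test):
    """Toy classifier: assign each point to the class with the closest centroid."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# k-fold cross-validation: estimate the expected 0-1 loss by averaging
# the held-out error over k train/test splits.
k = 5
indices = rng.permutation(len(X))
folds = np.array_split(indices, k)
errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    preds = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
    errors.append(np.mean(preds != y[test_idx]))

print(f"cross-validated 0-1 loss: {np.mean(errors):.3f} (+/- {np.std(errors):.3f})")
```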