Is Learning Feasible?: Generalization Bound

Generalization Bound Idea

A ==Generalization bound== helps us understand how large the generalization error $E_{out}$ can be, even though we cannot compute it directly.

To achieve this, the bound is expressed in terms of the training error $E_{in}$, which we can compute.

Starting from the ==Hoeffding Inequality==, we can derive two possible scenarios:

  1. ==Maximum Bound==: what is the **maximum** possible value of $E_{out}$?
  2. ==Minimum Bound==: what is the **minimum** possible value of $E_{out}$?

The first one is of more interest because it tells us how well our $E_{in}$ can estimate $E_{out}$: a small maximum bound means the estimate is tight. The following image depicts this derivation.

[Figure: derivation of the maximum and minimum bounds from the Hoeffding Inequality]

A Generalization Bound is another way to understand whether learning is feasible; we will come to similar conclusions as with the Hoeffding Inequality.

In order to get a better expression for $\epsilon$, we assume that the right-hand side of the Hoeffding Inequality equals a term $\delta$ that we, as users, define beforehand.

$$2Me^{-2N\epsilon^2} = \delta$$

and we can isolate $\epsilon$ to get $$\epsilon = \sqrt{\frac{1}{2N}\text{ln}\left(\frac{2M}{\delta}\right)}$$ Notice that by replacing $\epsilon$ in either of the two previous cases we get an expression in terms of $N$ (==training set size==) and $M$ (==model complexity==).
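
As a quick sanity check, the following minimal Python sketch evaluates this expression (the function name and the sample values $N=1000$, $M=100$, $\delta=0.05$ are illustrative choices of mine, not from the notes):

```python
import math

def generalization_epsilon(N: int, M: int, delta: float) -> float:
    """Tolerance from the Hoeffding-based bound: sqrt(ln(2M / delta) / (2N))."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

# Illustrative values: 1000 training examples, 100 hypotheses, 95% confidence.
print(generalization_epsilon(N=1000, M=100, delta=0.05))  # ~0.064
```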

Now we, as users, will define $\delta$ instead of $\epsilon$.

In this way, starting from the Hoeffding Inequality, we say that with probability ==$1 - \delta$== the following bounds hold ==Maximum Bound== $$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N}\text{ln}\left(\frac{2M}{\delta}\right)}$$ ==Minimum Bound== $$E_{out}(g) \geq E_{in}(g) - \sqrt{\frac{1}{2N}\text{ln}\left(\frac{2M}{\delta}\right)}$$
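
As a concrete example (these numbers are purely illustrative, not from the notes): with $N = 1000$, $M = 100$, $\delta = 0.05$, and a measured training error $E_{in}(g) = 0.10$, the square-root term is $\epsilon \approx 0.064$, so with probability at least $0.95$ $$0.036 \lesssim E_{out}(g) \lesssim 0.164$$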

Notice that the isolation of $\epsilon$ from $\delta$ is derived as follows.

Dividing $2Me^{-2N\epsilon^2} = \delta$ by $2M$ and taking the natural logarithm of both sides gives

$$\text{ln}\left(\frac{\delta}{2M}\right) = -2N\epsilon^2$$ $$\epsilon^2 = - \frac{1}{2N} \text{ln}\left(\frac{\delta}{2M}\right) = \frac{1}{2N} \text{ln}\left(\frac{2M}{\delta}\right)$$ $$\epsilon = \sqrt{\frac{1}{2N}\text{ln}\left(\frac{2M}{\delta}\right)}$$
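
If you want to double-check this algebra mechanically, here is a minimal sympy sketch (my own verification, not part of the notes) that plugs the isolated $\epsilon$ back into the right-hand side $2Me^{-2N\epsilon^2}$ and confirms it collapses to $\delta$:

```python
import sympy as sp

N, M, delta = sp.symbols('N M delta', positive=True)

# The isolated tolerance: epsilon = sqrt(ln(2M/delta) / (2N))
eps = sp.sqrt(sp.log(2 * M / delta) / (2 * N))

# Substitute it back into 2M * exp(-2N eps^2), the right-hand side we set to delta.
rhs = 2 * M * sp.exp(-2 * N * eps**2)

print(sp.simplify(rhs))  # prints: delta
```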

Maximum Bound (First Case)

The following is the definition of the maximum bound $$E_{out}(g) \leq E_{in}(g) + \sqrt{\frac{1}{2N}\text{ln}\left(\frac{2M}{\delta}\right)}$$ Notice we can conclude:

  1. $\downarrow E_{in} \quad \downarrow E_{out} \Rightarrow$ a smaller training error $E_{in}$ pushes down the limit on the generalization error
  2. $\uparrow N \quad \downarrow E_{out} \Rightarrow$ a larger training set gives a smaller bound on the generalization error (see the numeric sketch after this list)
  3. $\uparrow M \quad \uparrow E_{out} \Rightarrow$ more model complexity gives a larger bound on the generalization error
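
To see observation 2 in numbers, the short sketch below evaluates the square-root term for growing $N$ (the values of $M$, $\delta$, and $N$ are illustrative assumptions of mine):

```python
import math

M, delta = 100, 0.05  # illustrative: 100 hypotheses, 95% confidence

for N in (100, 1_000, 10_000):
    eps = math.sqrt(math.log(2 * M / delta) / (2 * N))
    print(f"N={N:6d}  E_out <= E_in + {eps:.3f}")

# The additive term shrinks from ~0.204 to ~0.064 to ~0.020 as N grows.
```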

**__Trade-off between M, $E_{in}$, and $E_{out}$__**
Additionally, we can note again that there is a complex relation between $M$, $E_{in}$, and $E_{out}$, similar to what happened with the Uniform Hoeffding Inequality.

By focusing on the complex relation that exists between $M$, $E_{in}$, and $E_{out}$, we can split learning into two central questions:

  1. Can we make sure that $E_{out}(g)$ is close enough to $E_{in}(g)$?
  2. Can we make $E_{in}(g)$ small enough?
Then, using the right $M$ (or $\mathcal{H}$) is important.

Notice that although the generalization bound increases with the model complexity $M$, it does not increase too quickly, thanks to the square root and the logarithm.
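
A quick numeric check of this dampening (again with illustrative values $N = 1000$ and $\delta = 0.05$): increasing $M$ by a factor of $100{,}000$ does not even double the additive term.

```python
import math

N, delta = 1_000, 0.05  # illustrative values

for M in (10, 1_000_000):
    eps = math.sqrt(math.log(2 * M / delta) / (2 * N))
    print(f"M={M:9d}  bound term = {eps:.3f}")

# M grows by a factor of 100,000, yet the term only moves from ~0.055 to ~0.094.
```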

--Regularization can account for this trade-off by minimizing $E_{in}$ while at the same time trying to limit the complexity of the model $M$--
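
One common way to formalize this idea (the symbols $\lambda$ and $\Omega$ are assumed here for illustration; they are not defined in these notes) is to minimize an augmented error $$E_{aug}(h) = E_{in}(h) + \lambda\,\Omega(h)$$ where $\Omega(h)$ penalizes the complexity of the hypothesis and $\lambda$ controls the trade-off between fitting the data and keeping the model simple.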

Minimum Bound (Second Case)

The following is the definition of the minimum bound $$E_{out}(g) \geq E_{in}(g) - \sqrt{\frac{1}{2N}\text{ln}\left(\frac{2M}{\delta}\right)}$$ Notice we can conclude $\uparrow E_{in} \quad \uparrow E_{out}$.

Then, this bound effectively provides a lower limit that our generalization error can reach: $E_{out}$ cannot necessarily be driven all the way to zero; it is limited again by $E_{in}$.

Conclusion

The generalization bound is an intrinsic (maximum and minimum) limit on the generalization error $E_{out}$, expressed in terms of quantities we can compute or control: the training error $E_{in}$, the training set size $N$, and the model complexity $M$.

Finite Hypothesis Set Assumption

Notice that the whole derivation is possible because we assume a finite hypothesis set $\mathcal{H}$. However, our ML algorithms usually have infinite hypothesis sets. We will use the VC dimension to deal with these cases.