\documentclass[../main.tex]{subfiles}
\begin{document}
\chapter { Lecture 7 - 07-04-2020}
Bounding the statistical risk of a predictor.\\
We want to design a learning algorithm that predicts with small statistical risk.\\
$$
(D, \ell) \qquad \ell_D(h) = \barra{E}\left[\, \ell(y, h(x)) \,\right]
$$
where $D$ is unknown, and
$$
\ell (y, \hat { y} ) \in [0,1] \quad \forall y, \hat { y} \in Y
$$
We cannot compute the statistical risk of any predictor, since $D$ is unknown.\\
We assume the loss is bounded between 0 and 1; this is not true for all
losses (like the logarithmic loss).\\
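For example, the zero-one loss $\ell(y, \hat{y}) = 1$ if $\hat{y} \neq y$ and $0$ otherwise takes values in $\{0,1\}$, while the logarithmic loss, e.g. $\ell(y, \hat{y}) = \ln \frac{1}{\hat{y}}$ when $\hat{y}$ is the probability assigned to the correct label $y$, grows without bound as $\hat{y} \rightarrow 0$.\\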
Before designing a learning algorithm with low risk: how can we estimate the
risk?\\
We can use the test error $\rightarrow$ a way to measure the performance of a predictor $h$.
We want to link the test error to the risk.
\\
Test set $ S' = \{ ( x' _ 1 , y' _ 1 ) ... ( x' _ n,y' _ n ) \} $ is a random sample from $ D $
\\
How can we use this assumption?\\
We go back to the definition of the test error.\\
\\
\red { Sample mean (IT: Media campionaria)} \\
$$
\hat{\ell}_{s'}(h) = \frac{1}{n} \cdot \sum_{t=1}^{n} \ell(y'_t, h(x'_t))
$$
We can look at each term \col{$\ell(y'_t, h(x'_t))$}{Blue} as a random variable.
\\
$$
\barra{E}\left[\, \ell(y'_t, h(x'_t)) \,\right] = \ell_D(h) \longrightarrow \red{risk}
$$
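By linearity of expectation, the test error is therefore an unbiased estimate of the risk:
$$
\barra{E}\left[\, \hat{\ell}_{s'}(h) \,\right] = \frac{1}{n} \cdot \sum_{t=1}^{n} \barra{E}\left[\, \ell(y'_t, h(x'_t)) \,\right] = \ell_D(h)
$$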
Using the law of large numbers (LLN), we know that:
$$
\hat{\ell}_{s'}(h) \longrightarrow \ell_D(h) \qquad \textit{as} \quad n \rightarrow \infty
$$
We cannot have a sample with $n = \infty$, so we will rely on a quantitative tool:
the \red{Chernoff-Hoeffding bound}
\subsection{Chernoff-Hoeffding bound}
$$
Z_1, \dots, Z_n \quad \textit{iid random variables} \qquad \barra{E}\left[Z_t\right] = u
$$
all drawn from the same distribution
\\
$$
0 \leq Z_t \leq 1 \qquad t = 1, \dots, n \qquad \textit{then} \quad \forall \varepsilon > 0
$$
$$
\barra{P}\left( \frac{1}{n} \cdot \sum_{t=1}^{n} Z_t > u + \varepsilon \right) \leq e^{-2\,\varepsilon^2\,n} \qquad \textit{and} \qquad \barra{P}\left( \frac{1}{n} \cdot \sum_{t=1}^{n} Z_t < u - \varepsilon \right) \leq e^{-2\,\varepsilon^2\,n}
$$
As the sample size $n$ grows, the bound decreases exponentially ($\downarrow$). We now apply the bound to
$$
Z_ t = \ell (Y'_ t, h(X'_ t)) \in \left [0,1\right]
$$
$(X'_1, Y'_1), \dots, (X'_n, Y'_n)$ are iid, therefore \\ $\ell\left(Y'_t, h\left(X'_t\right)\right)$ \quad $t = 1, \dots, n$ \quad are also iid
\\
We can therefore use the Chernoff-Hoeffding bound to control the deviation of the test error from the risk.
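Plugging $Z_t = \ell(Y'_t, h(X'_t))$ and $u = \ell_D(h)$ into the first inequality gives, for instance, the one-sided bound
$$
\barra{P}\left( \hat{\ell}_{s'}(h) > \ell_D(h) + \varepsilon \right) \leq e^{-2\,\varepsilon^2\,n}
$$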
\subsubsection { Union Bound}
Union bound: given a collection of events, not necessarily disjoint, the
probability of the union of these events is at most the sum of the
probabilities of the individual events
$$
A_ 1, ..., A_ n \qquad \barra { P} \left ( A_ 1 \cup ... \cup A_ n \right ) \leq \sum _ { t=1} ^ { n} \barra { P} \left (A_ t\right )
$$
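For two events, the union bound follows from inclusion-exclusion, since the intersection term is dropped:
$$
\barra{P}\left(A_1 \cup A_2\right) = \barra{P}\left(A_1\right) + \barra{P}\left(A_2\right) - \barra{P}\left(A_1 \cap A_2\right) \leq \barra{P}\left(A_1\right) + \barra{P}\left(A_2\right)
$$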
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez7-img1.JPG}
\caption { Example}
%\label{fig:}
\end { figure} \\
\red { that's why $ \leq $ }
\\ \\
$$
\barra { P} \left (|\, \hat { \ell } _ { s'} \left ( h \right ) - \ell _ D\left ( h \right ) \, | \, > \varepsilon \right )
$$
This is the probability according to the random draw of the test set.\\
\\
We want to bound the probability that the test error differs from the risk by
more than some $\varepsilon > 0$. How can we use the Chernoff-Hoeffding bound?
$$
|\, \hat{\ell}_{s'}\left(h\right) - \ell_D\left(h\right) \,| > \varepsilon \quad \Rightarrow \quad
\hat{\ell}_{s'}\left(h\right) - \ell_D\left(h\right) > \varepsilon \quad \vee \quad
\ell_D\left(h\right) - \hat{\ell}_{s'}\left(h\right) > \varepsilon
$$
$$
A, B \qquad A \Rightarrow B \qquad \barra{P}\left(A\right) \leq \barra{P}\left(B\right)
$$
\begin { figure} [h]
\centering
\includegraphics [width=0.2\linewidth] { ../img/lez7-img2.JPG}
\caption { Example}
%\label{fig:}
\end { figure}
$$
\barra{P}\left(|\, \hat{\ell}_{s'}\left(h\right) - \ell_D\left(h\right) \,| > \varepsilon\right)
=
\barra{P}\left( \left\{ \hat{\ell}_{s'}\left(h\right) > \ell_D\left(h\right) + \varepsilon \right\}
\cup
\left\{ \hat{\ell}_{s'}\left(h\right) < \ell_D\left(h\right) - \varepsilon \right\} \right)
\leq
$$
$$
\leq
\barra{P}\left( \hat{\ell}_{s'}\left(h\right) > \ell_D\left(h\right) + \varepsilon \right) + \barra{P}\left( \hat{\ell}_{s'}\left(h\right) < \ell_D\left(h\right) - \varepsilon \right)
\quad \leq \quad
2 \cdot e^{-2\,\varepsilon^2\,n} \quad \Rightarrow \red{\textit{we call it } \delta}
$$
Setting $\delta = 2 \cdot e^{-2\,\varepsilon^2\,n}$ and solving for $\varepsilon$:
$$
\varepsilon = \sqrt{\frac{1}{2 \cdot n} \ln \frac{2}{\delta}}
$$
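Indeed,
$$
\delta = 2 \cdot e^{-2\,\varepsilon^2\,n}
\quad \Longleftrightarrow \quad
\ln \frac{2}{\delta} = 2\,\varepsilon^2\,n
\quad \Longleftrightarrow \quad
\varepsilon^2 = \frac{1}{2 \cdot n} \ln \frac{2}{\delta}
$$
and taking the square root gives the expression above.\\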
\col { The two events are disjoint} { Blue} \\ \\
This means that the probability of a deviation larger than $\varepsilon$ is at most $\delta$!
$$
|\, \hat { \ell } _ { s'} \left (h\right )-\ell _ D\left (h\right ) \, | \leq \sqrt [] { \frac { 1} { 2\cdot n} \ln \frac { 2} { \delta } } \qquad \textit { with probability at least $ 1 - \delta $ }
$$
\red{The test error is then a good estimate of the true risk, with accuracy depending on $n$ and confidence controlled by $\delta$.}
\\
\begin { figure} [h]
\centering
\includegraphics [width=0.5\linewidth] { ../img/lez7-img3.JPG}
\caption { Example}
%\label{fig:}
\end{figure} Confidence interval for the risk at confidence level $1-\delta$.\\
We take $\delta = 0.05$ so that $1 - \delta$ is $95\%$. The test error is then an
estimate of the true risk whose precision depends on how big the test
set is ($n$).\\
As $n$ grows we can pin down the position of the true risk.\\
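For instance, assuming (for illustration) a test set of $n = 10000$ examples and $\delta = 0.05$:
$$
\sqrt{\frac{1}{2 \cdot 10000} \ln \frac{2}{0.05}} \approx 0.014
$$
so with probability at least $95\%$ the test error is within about $\pm 0.014$ of the true risk.\\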
This is how we can use probability to make sense of what we do in practice.
If we take a predictor $h$ we can compute its test error as an estimate of the risk,\\
and we can measure how accurate this estimate is.\\
\textbf { Test error is an estimate of risk for a given predictor (h).}
\\
$$
\barra { E} \left [ \, \ell\left( Y'_t, h\left(X'_t\right)\right) \, \right] = \ell _ D \left ( h\right )
$$
\textbf{$h$ is fixed with respect to $S'$} $\longrightarrow$ $h$ does not depend on the test set.
So the learning algorithm which produces $h$ must not have access to the test set.\\
If we use the test set during training, we break the equation above.
\\ \\
Now, how do we \textbf{build a good algorithm?}\\
Training set $S = \{ \left(x_1, y_1\right), \dots, \left(x_m, y_m\right) \}$, a random sample from $D$
\\ $A$ \qquad $A\left(S\right) = h$ predictor output by $A$ given $S$,
where $A$ is the \red{learning algorithm as a function of the training set $S$.}
\\
$\forall \, S$ \qquad $A\left(S\right) \in H$, \qquad and let $h^* \in H$ be the best predictor in the class:
\\
$$
\ell_D\left(h^*\right) = \min_{h \in H} \, \ell_D\left(h\right) \qquad \hat{\ell}_s\left(h^*\right) \textit{ is close to } \ell_D\left(h^*\right) \longrightarrow \textbf{it is going to have small error}
$$
where $\hat{\ell}_s\left(h^*\right)$ is the \red{training error of $h^*$}
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez7-img4.JPG}
\caption { Example}
%\label{fig:}
\end { figure} \\
The risk $\ell_D\left(h^*\right)$ is the closest to $0$ among predictors in $H$, since $h^*$ is the optimum\\
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez7-img5.JPG}
\caption { Example}
%\label{fig:}
\end { figure} \\
The risk is minimized at $h^*$, but the empirical error could be minimized by another predictor $h'$ that looks better than $h^*$ on the sample
\\ \\
In order to fix a concrete algorithm we are going to take the empirical risk
minimiser (ERM) algorithm.
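As a preview of what follows, ERM picks the predictor in $H$ with the smallest training error:
$$
A(S) = \mathop{\mathrm{arg\,min}}_{h \in H} \, \hat{\ell}_s(h)
$$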
\end{document}