Given the training set $(x_1,y_1),\dots,(x_m,y_m)$
\\
If $\hat{\ell}(f)$ is small for some $f$, we hope the test error of $f$ is also small
\\
Fix $F$, a set of predictors, and output $\hat{f}$:\\
$$\hat{f} = \mathop{\mathrm{arg\,min}}_{f \in F} \hat{\ell}(f)$$
\\
\textbf{This algorithm is called the Empirical Risk Minimiser (ERM).}
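For concreteness, here is a minimal sketch of ERM over a finite class, assuming the 0-1 loss; the helper names (\texttt{zero\_one\_loss}, \texttt{training\_error}, \texttt{erm}) and the toy data are illustrative, not from the lecture.
\begin{verbatim}
# A sketch of ERM over a finite class F with the 0-1 loss
# (all names and the toy data are illustrative).

def zero_one_loss(y_hat, y):
    # 1 if the prediction is wrong, 0 otherwise
    return 0 if y_hat == y else 1

def training_error(f, train):
    # hat-ell(f): average loss of f on the training set
    return sum(zero_one_loss(f(x), y) for x, y in train) / len(train)

def erm(F, train):
    # return the f in F with the smallest training error
    return min(F, key=lambda f: training_error(f, train))

# toy usage: three candidate predictors, four training points
F = [lambda x: 0, lambda x: 1, lambda x: 1 if x > 0 else 0]
train = [(-2, 0), (-1, 0), (1, 1), (3, 1)]
f_hat = erm(F, train)
print(training_error(f_hat, train))  # 0.0: the threshold predictor wins
\end{verbatim}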
\\
When does this strategy (ERM) fail?\\
ERM may fail if, for the given training set, there are:\\
many $f \in F$ with small $\hat{\ell}(f)$, but not all of them have small test error
\\\\
There could be many predictors with small training error, but some of them may have a large test error: the predictor with the smallest training error need not have the smallest test error.\\
I would like to pick $f^*$ such that:
$$ f^* = \mathop{\mathrm{arg\,min}}_{f \in F} \frac{1}{n}\sum_{t=1}^{n}\ell(f(x'_t),y'_t) $$
where $(x'_1,y'_1),\dots,(x'_n,y'_n)$ is the test set, so the quantity being minimised is the test error of $f$
\\
ERM works if $f^*$ also minimises the training error, i.e.\ $f^* = \mathop{\mathrm{arg\,min}}_{f \in F} \hat{\ell}(f)$
\\
In that case, minimising the training error also minimises the test error.\\
We can think of $F$ as finite, since we are working on a finite computer.\\
We want to see why this can happen, and we want to formalise a model in
which we avoid it by design:
when we run ERM, we want it to choose a good predictor with ...... PD\\\\
\section{Overfitting}
We call this overfitting: a learning algorithm $A$ overfits if the predictor $f$
output by $A$ tends to have a training error much smaller than its test error.\\
$A$ is not doing its job (it outputs a predictor with large test error); this happens
because the training error is misleading.\\
Minimising the training error does not mean minimising the test error. Overfitting is bad.\\
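A toy illustration of the definition (the data and the memorising predictor are assumed for illustration, not from the lecture): a predictor that simply memorises the training labels has zero training error but can have the largest possible test error.
\begin{verbatim}
# Illustration of overfitting (toy data, all names assumed):
# a predictor that memorises the training labels has zero
# training error but can have a large test error.
train = [(1, 0), (2, 1), (3, 0)]
test  = [(1, 1), (2, 0), (3, 1)]   # noisy labels: same x, different y

memory = dict(train)
f = lambda x: memory[x]            # memorise the training set

train_err = sum(f(x) != y for x, y in train) / len(train)
test_err  = sum(f(x) != y for x, y in test) / len(test)
print(train_err, test_err)         # 0.0 1.0: the training error misleads
\end{verbatim}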
Why does this happen?\\
Because there is \textbf{noise in the data}.\\
\subsection{Noise in the data}
Noise in the data: $y_t$ is not deterministically associated with $x_t$.\\\\
The same datapoint may appear several times, possibly with different labels, in the
training set and in the test set; this is misleading, since the labels in the two sets
need not coincide. Minimising the training error can then take me away from the
predictor that minimises the test error (see the sketch after the list below).\\
Why is this the case?
\begin{itemize}
\item Some \textbf{human in the loop}: labels assigned by people (e.g.\ whether an image
contains a certain object; humans are not objective, and different people may have
different opinions).
\item \textbf{Lack of information}: in weather prediction I want to predict tomorrow's
weather. Weather is determined by a large, complicated system: knowing today's
humidity, it is difficult to say for sure whether it will rain tomorrow.
\end{itemize}
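A small sketch of the effect of noise (toy data, assumed): if the same $x$ appears with two different labels, no deterministic predictor can fit both copies, so some error is unavoidable.
\begin{verbatim}
# Toy check (assumed data): with noise, the same x carries
# different labels, so no deterministic predictor fits everything.
data = [(5, 0), (5, 1)]                 # same input, conflicting labels

for name, f in [("always 0", lambda x: 0), ("always 1", lambda x: 1)]:
    errors = sum(f(x) != y for x, y in data)
    print(name, errors)                 # each predictor errs on one copy
\end{verbatim}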
When the data are not noisy, I should be fine.
\\
\textbf{Labels are not noisy}\\\\
Fix a test set and a training set.
$$\exists f^*\in F \quad\textit{such that}\qquad y'_t = f^*(x'_t)\qquad\forall(x'_t,y'_t)\quad\textit{in the test set}$$
$$y_t = f^*(x_t)\qquad\forall(x_t,y_t)\quad\textit{in the training set}$$
\\
Think of a problem in which we have 5 datapoints (vectors):\\
$
\vec{x}_1,\dots,\vec{x}_5\qquad\textit{in some space } X
$
\\
We have a binary classification problem $Y =\{0,1\}$
\\
$
\{\vec{x}_1,\dots,\vec{x}_5\}\subset X \qquad Y= \{0,1\}
$
\\
$F$ contains all possible classifiers: $2^5=32$ functions $f: \{\vec{x}_1,\dots,\vec{x}_5\}\rightarrow\{0,1\}$
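As a sketch of this finite class (indices $0,\dots,4$ stand in for the vectors $\vec{x}_1,\dots,\vec{x}_5$; the labelling is assumed for illustration), we can enumerate all $2^5 = 32$ classifiers and check that, whatever the true labels are, exactly one $f \in F$ fits them all, so ERM over this $F$ always achieves zero training error.
\begin{verbatim}
from itertools import product

# Sketch of this finite class (indices 0..4 stand in for the
# vectors x_1, ..., x_5; names and labelling are assumed).
points = list(range(5))
# one classifier per labelling of the 5 points: |F| = 2^5 = 32
F = [dict(zip(points, labels)) for labels in product([0, 1], repeat=5)]
print(len(F))                           # 32

# whatever the true labels are, exactly one f in F fits them all,
# so ERM over this F always achieves zero training error here
true_labels = [0, 1, 1, 0, 1]           # an assumed labelling
fits = [f for f in F
        if all(f[x] == y for x, y in zip(points, true_labels))]
print(len(fits))                        # 1
\end{verbatim}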