\documentclass[../main.tex]{subfiles}
\begin{document}

\chapter{Lecture 14 - 28-04-2020}
\section{Linear Regression}
Yesterday we looked at the problem of empirical risk minimisation for a linear classifier. The 0-1 loss is not good: it is discontinuous, jumping from 0 to 1, and it is difficult to optimise. Maybe with linear regression we are luckier.
\\
Our data points are of the form $(x,y)$ with $x \in \barra{R}^d$ and $y \in \barra{R}$ (regression), and we use the square loss $(\hat{y}-y)^2$.
\\
We are able to pick a much nicer loss function, and we can optimise it in an easier way.
\\
\subsection{The problem of linear regression}
Instead of predicting $-1$ or $+1$, we leave the linear value as it is:
$$h(x) = w^T \ x \qquad w \in \barra{R}^d \qquad x = (x_1, ..., x_d, 1) $$
\\
$$
\hat{w} = arg \min_{w \in \barra{R}^d} \frac{1}{m} \sum_{t=1}^{m} (w^T \ x_t- y_t ) ^2 \qquad \textit{ERM for } \ (x_1,y_1) ... (x_m, y_m)
$$
How to compute the minimum? \\
We use the vector $v$ of linear predictions\\
$v = (w^T x_1, \dots, w^T x_m )$
\\
and the vector of real-valued labels\\
$y = (y_1, \dots, y_m) $ where $v, y \in \barra{R}^m$
\\
$$
\sum_{t=1}^{m} (w^T x_t - y_t ) ^2 \ = \ \| v - y\|^2
$$
Let $s$ be the $m \times d$ matrix whose rows are the data points:
$$
s^T = \left[ x_1, \dots , x_m \right] \quad (d \times m)
\qquad
v = s \, w =
\begin{bmatrix}
x^T_1 \\ \vdots \\ x^T_m
\end{bmatrix}
\begin{bmatrix}
w
\end{bmatrix}
$$
So:
$$
\| v - y\|^2 = \| sw - y\|^2
$$
$$
\hat{w} = arg \min_{w \in \barra{R}^d} \| sw - y \| ^2 \qquad \textbf{where $s$ is the design matrix}
$$
$$
F (w) = \| sw - y \|^2 \qquad \bred{is convex}
$$
$$
\nabla F(w) = 2 \, s^T \left( sw - y \right) = 0 \qquad \Longrightarrow \qquad s^T\ s \, w = s^T y
$$
where $s^T$ is $d \times m$ and $s$ is $m \times d$ (in general $ d \neq m$).
\\
If $s^T \ s$ is invertible (non-singular), then
$ \hat{w} = (s^T \ s)^{-1} \ s^T \ y $\\
and this is called the \bred{Ordinary Least Squares (OLS) solution}.
\\\\
One can check that $s^T\ s$ is non-singular if and only if $x_1, \dots , x_m$ span $\barra{R}^d$.
\\\\
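As a sanity check, the OLS formula above can be computed numerically. A minimal sketch, assuming NumPy and using made-up toy data (noiseless labels, so OLS recovers the true weights):

```python
import numpy as np

# Hypothetical toy data: m = 5 points in d = 2 dimensions, with a constant
# feature 1 appended to each x, as in the notes.
rng = np.random.default_rng(0)
s = np.hstack([rng.standard_normal((5, 2)), np.ones((5, 1))])  # design matrix, m x d
w_true = np.array([2.0, -1.0, 0.5])
y = s @ w_true                       # noiseless labels, so OLS recovers w_true

# OLS: w_hat = (s^T s)^{-1} s^T y, computed by solving the normal
# equations s^T s w = s^T y rather than forming the inverse explicitly.
w_hat = np.linalg.solve(s.T @ s, s.T @ y)
```

Solving the normal equations directly is both faster and numerically more stable than computing the inverse.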
However, $s^T \cdot s$ may not always be invertible. Moreover, linear regression is a high-bias solution: ERM may underfit, since linear predictors introduce a big bias.
\\
$ \hat{w} = ( s^T \cdot s)^{-1} \cdot s^T \cdot y $ is also very unstable: it can change a lot when the dataset is perturbed.
\\
This phenomenon is called \bred{instability}: it contributes to the variance error.
\\
Still, it is a good model to see what happens before trying more sophisticated models.
\\
Even when $s^T \cdot s$ is invertible, the solution can be unstable. But there is an easy fix!
\\\\
\subsection{Ridge regression}
We want to stabilise our solution: a singular $s^T \cdot s$ is a problem.
\\\\
We modify the objective as follows:
$$
\hat{w} = arg \min_w \| s \cdot w - y \|^2 \quad \rightsquigarrow \quad \hat{w}_\alpha = arg \min_w \left(\| s \, w - y \|^2 + \alpha \cdot \| w \|^2 \right)
$$
where $\alpha > 0$ is the \textbf{regularisation coefficient}.
\\\\
$
\hat{w}_\alpha \rightarrow \hat{w}
$ for $\alpha \rightarrow 0$
\\
$
\hat{w}_\alpha \rightarrow (0, \dots, 0)
$ for $\alpha \rightarrow \infty$\\
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img1.JPG}
\caption{}
%\label{fig:}
\end{figure}\\
$\hat{w}_\alpha$ has more bias than $\hat{w}$, but also less variance.
\\
$$\nabla \left( \| s \, w - y \|^2 + \alpha \, \| w \|^2 \right) \ = \ 2 \, \left( s^T \, s \, w - s^T \, y \right) + 2 \, \alpha \, w = 0$$
$$
\left(s^T \, s + \alpha \, I \right) \, w = s^T \, y
$$
$$
(d \times m) \, (m \times d) \ (d \times d) \ (d \times m) \qquad (d \times m) \ (m \times 1)
$$
where $I$ is the $d \times d$ identity matrix.
\\
$$
\hat{w}_\alpha = \left( s^T \, s + \alpha \, I \right)^{-1} \, s^T \, y
$$
If $\lambda_1, \dots, \lambda_d \geq 0$ are the eigenvalues of $s^T \, s$, then
\\
$ \lambda_1 + \alpha, \dots, \lambda_d + \alpha > 0 $ are the eigenvalues of $s^T \, s+ \alpha \, I$.
\\
In this way we make the matrix positive definite.\\
We can always compute the inverse, and the solution is more stable; stable means it \bred{does not overfit} as much.
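The ridge closed form is equally easy to compute. A minimal sketch, assuming NumPy and hypothetical random data, which checks that the solution satisfies the zero-gradient condition derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 5, 3
s = rng.standard_normal((m, d))      # hypothetical design matrix
y = rng.standard_normal(m)
alpha = 0.1                          # regularisation coefficient

# Ridge: w_alpha = (s^T s + alpha I)^{-1} s^T y.
# For alpha > 0 the matrix s^T s + alpha I is positive definite,
# hence always invertible.
w_alpha = np.linalg.solve(s.T @ s + alpha * np.eye(d), s.T @ y)

# Gradient of the regularised objective at w_alpha (should vanish):
grad = 2 * (s.T @ s @ w_alpha - s.T @ y) + 2 * alpha * w_alpha
```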
\section{Perceptron}
Now we want to talk about algorithms. \\
Data here are processed in a sequential fashion, one by one.\\
Each datapoint is processed in constant time $ \Theta \left( d \right)$\\
(check whether $y_t \, w^T x_t \leq 0$ and, in that case, update $ w \leftarrow w + y_t \, x_t$)\\
and the linear model can be stored in $\Theta (d)$ space.
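The constant-time check-and-update step can be written explicitly. A minimal sketch, assuming NumPy; the two-example stream below is made up for illustration:

```python
import numpy as np

def perceptron_step(w, x, y):
    """One Perceptron update in Theta(d) time: if (x, y) is
    misclassified (y * w^T x <= 0), move w toward y * x."""
    if y * (w @ x) <= 0:
        return w + y * x
    return w

# Hypothetical stream of examples (last coordinate is the constant feature 1).
stream = [(np.array([1.0, 2.0, 1.0]), 1),
          (np.array([-2.0, 1.0, 1.0]), -1)]
w = np.zeros(3)
for x, y in stream:
    w = perceptron_step(w, x, y)
```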
\\
Sequential processing scales well with the number of datapoints.
\\ It is also good at dealing with scenarios where new data are generated at all times.
\\ There are several such scenarios:
\begin{itemize}
\item Sensor data
\item Financial data
\item User logs
\end{itemize}
So sequential learning is good when we have lots of data, and in scenarios where data arrives in a stream, as with sensors.
\\ We call it \bred{Online learning}
\\
\subsection{Online Learning}
It is a learning protocol, and we can think of it like batch learning.
We have a class $H$ of predictors and a loss function $\ell$, and we have an algorithm that outputs an initial default predictor $h_1 \in H$.
\\\\
For $t = 1,2, \dots$\\
1) The next example $(x_t, y_t)$ is observed \\
2) The loss $\ell ( h_t(x_t), y_t)$ is observed \qquad $(y_t \, w^T \, x_t \leq 0 )$
\\
3) The algorithm updates $h_t$ generating $h_{t+1}$ \qquad $(w \leftarrow w + y_t \, x_t)$
\\\\
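The protocol in steps 1)-3) can be sketched as a generic loop with a pluggable loss and update. A minimal sketch, assuming NumPy, instantiated here with the Perceptron's mistake check and update on a made-up stream:

```python
import numpy as np

def run_protocol(stream, w_init, loss, update):
    """Online learning protocol: for t = 1, 2, ...
    1) observe (x_t, y_t); 2) pay loss ell(h_t(x_t), y_t);
    3) update h_t -> h_{t+1}, using only h_t and (x_t, y_t)."""
    w = np.array(w_init, dtype=float)
    total_loss = 0.0
    for x, y in stream:
        total_loss += loss(w, x, y)   # the loss is charged before the update
        w = update(w, x, y)
    return w, total_loss

# Perceptron instance: zero-one mistake loss, additive update on mistakes.
mistake = lambda w, x, y: float(y * (w @ x) <= 0)
perc_update = lambda w, x, y: w + y * x if y * (w @ x) <= 0 else w

stream = [(np.array([1.0, 1.0]), 1),
          (np.array([-1.0, 1.0]), -1),
          (np.array([1.0, 1.0]), 1)]
w_final, mistakes = run_protocol(stream, np.zeros(2), mistake, perc_update)
```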
The algorithm generates a sequence $h_1, h_2, \dots$ of models.\\
It could be that $h_{t+1} = h_t$ occasionally.
\\
The update $h_t \rightarrow h_{t+1}$ is \textbf{local} (it only uses $h_t$ and $(x_t, y_t)$).
\\
In a batch setting, instead, we would retrain on the whole training set every time a new example arrives:
$$
(x_1,y_1) \rightarrow A \rightarrow h_2
$$
$$
(x_1,y_1) (x_2,y_2) \rightarrow A \rightarrow h_3
$$
With an online learning algorithm, instead, we can just look at the updates:
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img2.JPG}
\caption{}
%\label{fig:}
\end{figure}\\
This is a more efficient way, and each update can be done in constant time.
Batch learning usually outputs a single predictor, while online learning uses a sequence of predictors.
\\\\
How do we evaluate an online learning algorithm $A$?
We cannot use a single model; instead we use a notion called \bred{sequential risk}. \\
Suppose the algorithm generates $h_1, h_2, \dots$ on some data sequence.
\\
$$
\frac{1}{T} \ \sum_{t=1}^{T} \ell(h_t(x_t), y_t) \qquad \textit{as a function of } T
$$
This measures the average loss on the next incoming example.
\newpage
I would like something like this:
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img3.JPG}
\caption{}
%\label{fig:}
\end{figure}\\\\
We need to fix the sequence of data: we absorb the example into the loss of the predictor.
$$
\ell(h_t(x_t), y_t) \longrightarrow \ell_t(h_t)
$$
Now we can write the sequential risk of the algorithm minus the sequential risk of the best predictor in $H$ (up to time $T$):
$$
\frac{1}{T} \sum_{t=1}^{T} \ell_t(h_t) - \min_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \ell_t(h)
$$
\\
\bred{This is a sequential analogue of the variance error} $\longrightarrow$ it is called the \textbf{Regret}.
\\
$$
h^*_T = arg \min_{h \in H} \frac{1}{T} \sum_{t} \ell_t(h) \qquad \frac{1}{T} \sum_t \ell_t(h_t) - \frac{1}{T} \sum_t \ell_t(h_T^*)
$$
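With the notation $\ell_t(h)$, the regret is straightforward to compute for a finite class $H$. A minimal sketch with made-up losses, assuming NumPy:

```python
import numpy as np

# Hypothetical losses ell_t(h): rows are rounds t = 1..T, columns are the
# predictors of a finite class H with 3 elements.
losses = np.array([[0.9, 0.1, 0.5],
                   [0.8, 0.2, 0.4],
                   [0.7, 0.3, 0.6],
                   [0.6, 0.4, 0.5]])
played = [0, 1, 1, 1]        # h_t: which predictor the algorithm used at round t
T = len(played)

alg_risk = np.mean([losses[t, played[t]] for t in range(T)])  # (1/T) sum_t ell_t(h_t)
best_risk = losses.mean(axis=0).min()   # min_h (1/T) sum_t ell_t(h) = risk of h*_T
regret = alg_risk - best_risk
```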
\newpage
\subsection{Online Gradient Descent (OGD)}
It is an example of an online learning algorithm. \\
In optimisation (think of one dimension first), we want to minimise a function whose gradient we can compute at every point. \\
We start from a point and take the derivative: from the derivative we can see whether the function is decreasing or increasing.\\
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img4.JPG}
\caption{}
%\label{fig:}
\end{figure}\\
$$
f \ \textit{convex} \qquad \min_x f(x) \qquad f : \barra{R}^d \rightarrow \barra{R}
$$
$$
x_{t+1} = x_t - \eta \nabla f(x_t) \qquad \eta > 0
$$
$$
w_{t+1} = w_t - \eta \, \nabla \ell_t(w_t)
$$
where $\eta$ is the learning rate.
$$
h(x) = w^T \, x \qquad \ell_t(w) = \ell( w^T \, x_t, y_t) \qquad \textit{for instance } \ \ell(w^T \, x_t, y_t) = (w^T \, x_t - y_t)^2
$$
Assumption: $\ell_t$ is convex (so we can optimise easily) and differentiable (so we can compute the gradient).
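Putting the pieces together, OGD on the square loss is a few lines of code. A minimal sketch, assuming NumPy, with a made-up stream repeating one example so that the iterates visibly approach the minimiser:

```python
import numpy as np

def ogd(stream, d, eta=0.1):
    """Online Gradient Descent: w_{t+1} = w_t - eta * grad ell_t(w_t),
    here with ell_t(w) = (w^T x_t - y_t)^2, which is convex and differentiable."""
    w = np.zeros(d)
    for x, y in stream:
        grad = 2.0 * (w @ x - y) * x   # gradient of the square loss at w
        w = w - eta * grad
    return w

# Hypothetical stream repeating one example: the first coordinate of w
# should approach the target value 1.
stream = [(np.array([1.0, 0.0]), 1.0)] * 50
w = ogd(stream, d=2, eta=0.1)
```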
\end{document} |