\documentclass [../main.tex] { subfiles}
\begin { document}
\chapter { Lecture 14 - 28-04-2020}
\section { Linear Regression}
Yesterday we looked at the problem of empirical risk minimisation for a linear classifier. The 0-1 loss is not good: it is discontinuous, jumping from 0 to 1, and it is difficult to optimise. Maybe with linear regression we are luckier.
\\
Our data points are of the form $ ( x, y ) $ with $ x \in \barra { R } ^ d $ and $ y \in \barra { R } $ (regression), and we use the square loss $ ( \hat { y } - y ) ^ 2 $ .
\\
We are able to pick a much nicer loss function, and we can optimise it more easily.
\\
\subsection { The problem of linear regression}
Instead of thresholding the output to $ -1 $ or $ 1 $ we just leave it as it is.
$$ h ( x ) = w ^ T \ x \qquad w \in \barra { R } ^ d \qquad x = ( x _ 1 , ..., x _ d, 1 ) $$
\\
$$
\hat { w} = arg \min _ { w \in \barra { R} ^ d} \frac { 1} { m} \sum _ { t=1} ^ { m} (w^ T \ x_ t- y_ t ) ^ 2 \qquad \textit { ERM for } \ (x_ 1,y_ 1) ... (x_ m, y_ m)
$$
How to compute the minimum? \\
We use the vector $ v $ of linear predictions\\
$ v = ( w ^ T x _ 1 , ..., w ^ T x _ m ) $
\\
and a vector of real valued labels\\
$ y = ( y _ 1 , ..., y _ m ) $ where $ v, y \in \barra { R } ^ m $
\\
$$
\sum _ { t=1} ^ { m} (w^ T x_ t - y_ t ) ^ 2 \ = \ \| v - y\| ^ 2
$$
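A quick numerical check of this identity on random toy data (all names below are illustrative, not from the lecture):

```python
import numpy as np

# The training error as a sum of squares equals the squared Euclidean
# distance between the prediction vector v and the label vector y.
rng = np.random.default_rng(1)
m, d = 4, 3
X = rng.normal(size=(m, d))        # rows are the points x_t^T
w = rng.normal(size=d)
y = rng.normal(size=m)

v = X @ w                          # vector of linear predictions
sum_of_squares = np.sum((X @ w - y) ** 2)
squared_norm = np.linalg.norm(v - y) ** 2
```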
Let $ s $ be the $ m \times d $ matrix whose rows are the data points $ x ^ T _ t $ (the \textbf { design matrix} ).
$$
s^ T = \left [ x_ 1, ... , x_ m \right ] \quad (d \times m)
\qquad
v = s \, w =
\begin { bmatrix}
x^ T_ 1 \\ \vdots \\ x^ T_ m
\end { bmatrix}
\begin { bmatrix}
\\
w
\\ \\
\end { bmatrix}
$$
So:
$$
\ \| v - y\| ^ 2 = \| sw - y\| ^ 2
$$
$$
\hat { w} = arg \min _ { w \in \barra { R} ^ d} \| sw - y \| ^ 2 \qquad \textbf { where $ s $ is the design matrix}
$$
$$
F (w) = \| sw - y \| ^ 2 \qquad \bred { is convex}
$$
$$
\nabla F(w) = 2 \, s^ T \left ( s \, w - y \right ) = 0 \qquad \Longrightarrow \qquad s^ T \, s \, w = s^ T \, y
$$
where $ s ^ T $ is $ d \times m $ and $ s $ is $ m \times d $ (in general $ d \neq m $ )
\\
If $ s ^ T \, s $ is invertible (non-singular), then
$ \hat { w } = ( s ^ T \, s ) ^ { - 1 } \, s ^ T \, y $ \\
This is called the \bred { ordinary least squares (OLS) solution} .
\\ \\
One can check that $ s ^ T \, s $ is non-singular if and only if $ x _ 1 , ... , x _ m $ span $ \barra { R } ^ d $ .
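The closed form can be computed directly with NumPy; here is a minimal sketch on a hypothetical toy dataset (the data and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Toy data: m = 5 points in d = 2, the second feature is the constant 1.
S = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0],
              [5.0, 1.0]])                 # design matrix, rows are x_t^T
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # labels: y_t = 2 * first feature

# OLS closed form w_hat = (S^T S)^{-1} S^T y; solving the linear system
# is numerically preferable to forming the inverse explicitly.
w_hat = np.linalg.solve(S.T @ S, S.T @ y)
# w_hat is approximately [2, 0]: slope 2, intercept 0.
```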
\\ \\
However, $ s ^ T \cdot s $ may not always be invertible. Moreover, linear regression is a high-bias solution: ERM may underfit, since linear predictors introduce a big bias.
\\
$ \hat { w } = ( s ^ T \cdot s ) ^ { - 1 } \cdot s ^ T \cdot y $ is very unstable: it can change a lot when the dataset is perturbed.
\\
This fact is called \bred { instability} : variance error
\\
It is a good model to see what happens, and then try more sophisticated models.
\\
Even when $ s ^ T \, s $ is invertible, the solution still suffers from this instability. But there is an easy fix!
\\ \\
\subsection { Ridge regression}
We want to stabilise our solution. If $ s ^ T \cdot s $ is singular (or close to singular), that is a problem.
\\ \\
We are going to change the objective to something like this:
$$
\hat { w} = arg \min _ w \| s \cdot w - y \| ^ 2 \quad \rightsquigarrow \hat { w} _ \alpha = arg \min _ w \left (\| s \, w - y \| ^ 2 + \alpha \cdot \| w \| ^ 2 \right )
$$
where $ \alpha \, \| w \| ^ 2 $ is the \textbf { regularisation term} and $ \alpha > 0 $ controls its strength.
\\ \\
$
\hat { w} _ \alpha \rightarrow \hat { w}
$ for $ \alpha \rightarrow 0$
\\
$
\hat { w} _ \alpha \rightarrow (0,..., 0)
$ for $ \alpha \rightarrow \infty $ \\
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img1.JPG}
\caption { }
%\label{fig:}
\end { figure} \\
\\
$ \hat { w } _ \alpha $ has more bias than $ \hat { w } $ , but also less variance
\\
$$ \nabla \left ( \| s \, w - y \| ^ 2 + \alpha \, \| w \| ^ 2 \right ) \ = \ 2 \, \left ( s ^ T \, s \, w - s ^ T \, y \right ) + 2 \, \alpha \, w = 0 $$
$$
\left (s^ T \, s + \alpha \, I \right ) \, w = s^ T \, y
$$
$$
\underbrace { (d \times m) \, (m \times d)} _ { s^ T \, s} \quad \underbrace { (d \times d)} _ { \alpha \, I} \qquad \underbrace { (d \times m) \, (m \times 1)} _ { s^ T \, y}
$$
where $ I $ is the $ d \times d $ identity matrix
\\
$$
\hat { w} _ \alpha = \left ( s^ T \, s + \alpha \, I \right )^ { -1} \, s^ T \, y
$$
where $ \lambda _ 1 ,..., \lambda _ d \geq 0 $ are the eigenvalues of $ s ^ T \, s $ (which is positive semidefinite)
\\
$ \lambda _ 1 + \alpha ,..., \lambda _ d + \alpha > 0 $ are the eigenvalues of $ s ^ T \, s + \alpha \, I $
\\ In this way we make the matrix positive definite.\\
We can always compute the inverse, and the solution is more stable; here stable means it \bred { does not overfit} .
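A minimal NumPy sketch of the ridge solution under these formulas (the random data and the names `S`, `y`, `alpha` are hypothetical):

```python
import numpy as np

# Hypothetical random data (seeded for reproducibility).
rng = np.random.default_rng(0)
m, d = 10, 3
S = rng.normal(size=(m, d))    # design matrix
y = rng.normal(size=m)

alpha = 1.0
# Ridge closed form: S^T S + alpha I is positive definite for alpha > 0,
# so the linear system is always solvable.
w_ridge = np.linalg.solve(S.T @ S + alpha * np.eye(d), S.T @ y)

# As alpha -> 0 the ridge solution approaches OLS, and regularisation
# shrinks the norm of w.
w_small = np.linalg.solve(S.T @ S + 1e-10 * np.eye(d), S.T @ y)
w_ols = np.linalg.lstsq(S, y, rcond=None)[0]
```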
\section { Perceptron}
Now we want to talk about algorithms. \\
Data here are processed in a sequential fashion one by one.\\
Each datapoint is processed in constant time $ \Theta \left ( d \right ) $ \\
(check $ y _ t \, w ^ T x _ t \leq 0 $ and, in case of a mistake, update $ w \leftarrow w + y _ t \, x _ t $ )
and the linear model can be stored in $ \Theta ( d ) $ space.
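The sequential processing described above can be sketched as follows (toy data; `perceptron_update` is a hypothetical helper name, and labels are in $ \{ -1, +1 \} $ ):

```python
import numpy as np

def perceptron_update(w, x, y):
    """One O(d) perceptron step (hypothetical helper name).
    Labels y are in {-1, +1}; x includes the constant 1 feature."""
    if y * (w @ x) <= 0:       # mistake (or prediction on the boundary)
        w = w + y * x          # update: w <- w + y_t x_t
    return w

# Toy linearly separable stream: label = sign of the first coordinate.
stream = [(np.array([1.0, 1.0]), 1),
          (np.array([-2.0, 1.0]), -1),
          (np.array([3.0, 1.0]), 1),
          (np.array([-1.0, 1.0]), -1)]

w = np.zeros(2)
for x_t, y_t in stream:
    w = perceptron_update(w, x_t, y_t)
# The final w classifies every point in this stream correctly.
```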
\\
Sequential processing scales well with the number of datapoints.
\\ But it also deals well with scenarios where new data are generated all the time.
\\ Several scenarios, like:
\begin { itemize}
\item Sensor data
\item Financial data
\item Logs of users
\end { itemize}
So sequential learning is good when we have lots of data, and in scenarios where data come in continuously, like sensor streams.
\\ We call it \bred { Online learning}
\\
\subsection { Online Learning }
It is a learning protocol, and we can think of it as the sequential counterpart of batch learning.
We have a class $ H $ of predictors and a loss function $ \ell $ , and an algorithm that outputs an initial default predictor $ h _ 1 \in H $ .
\\ \\
For $ t = 1 , 2 ... $ \\
1) Next example $ ( x _ t, y _ t ) $ is observed \\
2) The loss $ \ell ( h _ t ( x _ t ) , y _ t ) $ is observed \qquad $ ( y _ t \, w ^ T \, x _ t \leq 0 ) $
\\
3) The algorithm updates $ h _ t $ generating $ h _ { t + 1 } $ \qquad $ ( w \leftarrow w + y _ t \, x _ t ) $
\\ \\
The algorithm generates a sequence $ h _ 1 , h _ 2 , ... $ of models\\
It could be that $ h _ { t + 1 } = h _ t $ occasionally
\\
The update $ h _ t \rightarrow h _ { t + 1 } $ is \textbf { local} (it only uses $ h _ t $ and $ ( x _ t, y _ t ) $ )
\\
This is the batch way of doing it: every time a new example arrives, we retrain on all the data collected so far.
$$
(x_ 1,y_ 1) \rightarrow A \rightarrow h_ 2
$$
$$
(x_ 1,y_ 1) (x_ 2,y_ 2) \rightarrow A \rightarrow h_ 3
$$
But with an online learning algorithm I can just look at the updates:
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img2.JPG}
\caption { }
%\label{fig:}
\end { figure} \\
This is a more efficient way, and each update can be done in constant time.
Batch learning usually outputs a single predictor, while online learning uses a sequence of predictors.
\\ \\
How do I evaluate an online learning algorithm A?
I cannot evaluate a single model; instead we use a method called \bred { Sequential Risk} . \\
Suppose that I have $ h _ 1 , h _ 2 ... $ on some data sequence.
\\
$$
\frac { 1} { T} \ \sum _ { t=1} ^ { T} \ell (h_ t(x_ t), y_ t) \qquad \textit { as a function of } T
$$
The loss on the next incoming example.
\newpage
I would like something like this:
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img3.JPG}
\caption { }
%\label{fig:}
\end { figure} \\ \\
We need to fix the sequence of data: I absorb the example into the loss of the predictor.
$$
\ell (h_ t(x), y_ t) \longrightarrow \ell _ t(h_ t)
$$
I can write the sequential risk of the algorithm:
$$
\frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h_ t) - \min _ { h \in H} \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h)
$$
So: the sequential risk of the algorithm minus the sequential risk of the best predictor in $ H $ (up to time $ T $ ).
\\
\bred { This is a sequential analogue of the variance error.} $ \longrightarrow $ It is called the \textbf { Regret} .
\\
$$
h^ *_ T = arg \min _ { h \in H} \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h) \qquad \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h_ t) - \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h_ T^ *)
$$
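A small numerical illustration of regret, assuming a finite class $ H $ of constant predictors and square loss (the data, the class, and the running-mean learner below are hypothetical choices, not from the lecture):

```python
import numpy as np

# Finite class of constant predictors and a short data sequence.
H = [0.0, 0.5, 1.0]                    # candidate constant predictions
y_seq = [0.0, 1.0, 1.0, 0.0, 1.0]      # labels y_1, ..., y_T

def loss(p, y):
    return (p - y) ** 2                # square loss

# The online learner predicts the running mean of past labels (a simple,
# hypothetical algorithm); alg_losses[t] is its loss l_t(h_t) at time t.
alg_losses, past = [], []
for y in y_seq:
    h_t = np.mean(past) if past else 0.0
    alg_losses.append(loss(h_t, y))
    past.append(y)

T = len(y_seq)
best_total = min(sum(loss(h, y) for y in y_seq) for h in H)
# Regret: sequential risk of the algorithm minus that of the best h in H.
regret = sum(alg_losses) / T - best_total / T
```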
\newpage
\subsection { Online Gradient Descent (OGD)}
It is an example of an online learning algorithm. \\
In optimisation, say in one dimension, we want to minimise a function, and I can compute the gradient at every point. \\
We start from a point and compute the derivative: from the derivative I can see whether the function is decreasing or increasing.\\
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img4.JPG}
\caption { }
%\label{fig:}
\end { figure} \\
$$
f \ \textit { convex} \qquad \min _ x f(x) \qquad f : \barra { R} ^ d \rightarrow \barra { R}
$$
$$
x_ { t+1} = x_ t - \eta \, \nabla f(x_ t) \qquad \eta > 0
$$
$$
w_ { t+1} = w_ t - \eta \, \nabla \ell _ t(w_ t)
$$
where $ \eta $ is the learning rate.
$$
h(x) = w^ T \, x \qquad \ell _ t(w) = \ell ( w^ T \, x_ t, y_ t) \qquad \textit { for instance } \ \ell (w^ T \, x_ t, y_ t) = (w^ T \, x_ t - y_ t)^ 2
$$
Assumption: $ \ell _ t $ is convex (so that optimisation is easy) and differentiable (so that we can compute the gradient).
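A minimal OGD sketch under these assumptions, using the square loss from above (the data stream and learning rate are hypothetical):

```python
import numpy as np

# Square loss l_t(w) = (w^T x_t - y_t)^2, with gradient 2 (w^T x_t - y_t) x_t.
# OGD update: w_{t+1} = w_t - eta * grad l_t(w_t).
eta = 0.1
w = np.zeros(2)
stream = [(np.array([1.0, 0.0]), 1.0),
          (np.array([0.0, 1.0]), -1.0)] * 50   # the toy sequence repeats

for x_t, y_t in stream:
    grad = 2 * (w @ x_t - y_t) * x_t           # gradient of the square loss
    w = w - eta * grad                         # descent step (note the minus)

# w converges toward [1, -1], which fits both examples exactly.
```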
\end { document}