\documentclass [../main.tex] { subfiles}
\begin { document}
\chapter { Lecture 14 - 28-04-2020}
\section { Linear Regression}
Yesterday we looked at the problem of empirical risk minimisation for a linear classifier. The 0-1 loss is not good: it is discontinuous, jumping from 0 to 1, and it is difficult to optimise. Maybe with linear regression we are luckier.
\\
Our data points are of the form $ ( x, y ) $ with $ x \in \barra { R } ^ d $ and $ y \in \barra { R } $ (regression), and we use the square loss $ ( \hat { y } - y ) ^ 2 $ .
\\
We are able to pick a much nicer loss function, and we can optimise it more easily.
\\
\subsection { The problem of linear regression}
Instead of thresholding the output to $ -1 $ or $ 1 $ we just leave it as it is.
$$ h ( x ) = w ^ T \ x \qquad w \in \barra { R } ^ d \qquad x = ( x _ 1 , ..., x _ d, 1 ) $$
\\
$$
\hat { w} = arg \min _ { w \in \barra { R} ^ d} \frac { 1} { m} \sum _ { t=1} ^ { m} (w^ T \ x_ t- y_ t ) ^ 2 \qquad \textit { ERM for } \ (x_ 1,y_ 1) ... (x_ m, y_ m)
$$
How to compute the minimum? \\
We use the vector $ v $ of linear predictions\\
$ v = ( w ^ T x _ 1 , ..., w ^ T x _ m ) $
\\
and a vector of real valued labels\\
$ y = ( y _ 1 , ..., y _ m ) $ where $ v, y \in \barra { R } ^ m $
\\
$$
\sum _ { t=1} ^ { m} (w^ T x_ t - y_ t ) ^ 2 \ = \ \| v - y\| ^ 2
$$
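A quick numerical check of this identity on random toy data (all names below are illustrative, not from the lecture):

```python
import numpy as np

# The training error as a sum of squares equals the squared Euclidean
# distance between the prediction vector v and the label vector y.
rng = np.random.default_rng(1)
m, d = 4, 3
X = rng.normal(size=(m, d))        # rows are the points x_t^T
w = rng.normal(size=d)
y = rng.normal(size=m)

v = X @ w                          # vector of linear predictions
sum_of_squares = np.sum((X @ w - y) ** 2)
squared_norm = np.linalg.norm(v - y) ** 2
```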
Let $ s $ be the $ m \times d $ matrix whose rows are the data points $ x ^ T _ t $ (the \textbf { design matrix} ).
$$
s^ T = \left [ x_ 1, ... , x_ m \right ] \quad (d \times m)
\qquad
v = s \, w =
\begin { bmatrix}
x^ T_ 1 \\ \vdots \\ x^ T_ m
\end { bmatrix}
\begin { bmatrix}
\\
w
\\ \\
\end { bmatrix}
$$
So:
$$
\ \| v - y\| ^ 2 = \| sw - y\| ^ 2
$$
$$
\hat { w} = arg \min _ { w \in \barra { R} ^ d} \| sw - y \| ^ 2 \qquad \textbf { where $ s $ is the design matrix}
$$
$$
F (w) = \| sw - y \| ^ 2 \qquad \bred { is convex}
$$
$$
\nabla F(w) = 2 \, s^ T \left ( s \, w - y \right ) = 0 \qquad \Longrightarrow \qquad s^ T \, s \, w = s^ T \, y
$$
where $ s ^ T $ is $ d \times m $ and $ s $ is $ m \times d $ (in general $ d \neq m $ )
\\
If $ s ^ T \, s $ is invertible (non-singular), then
$ \hat { w } = ( s ^ T \, s ) ^ { - 1 } \, s ^ T \, y $ \\
This is called the \bred { ordinary least squares (OLS) solution} .
\\ \\
One can check that $ s ^ T \, s $ is non-singular if and only if $ x _ 1 , ... , x _ m $ span $ \barra { R } ^ d $ .
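The closed form can be computed directly with NumPy; here is a minimal sketch on a hypothetical toy dataset (the data and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Toy data: m = 5 points in d = 2, the second feature is the constant 1.
S = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0],
              [5.0, 1.0]])                 # design matrix, rows are x_t^T
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # labels: y_t = 2 * first feature

# OLS closed form w_hat = (S^T S)^{-1} S^T y; solving the linear system
# is numerically preferable to forming the inverse explicitly.
w_hat = np.linalg.solve(S.T @ S, S.T @ y)
# w_hat is approximately [2, 0]: slope 2, intercept 0.
```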
\\ \\
However, $ s ^ T \cdot s $ may not always be invertible. Moreover, linear regression is a high-bias solution: ERM may underfit, since linear predictors introduce a big bias.
\\
$ \hat { w } = ( s ^ T \cdot s ) ^ { - 1 } \cdot s ^ T \cdot y $ is very unstable: it can change a lot when the dataset is perturbed.
\\
This fact is called \bred { instability} : variance error
\\
It is a good model to see what happens, and then try more sophisticated models.
\\
Even when $ s ^ T \, s $ is invertible, the solution still suffers from this instability. But there is an easy fix!
\\ \\
\subsection { Ridge regression}
We want to stabilise our solution. If $ s ^ T \cdot s $ is singular (or close to singular), that is a problem.
\\ \\
We are going to change the objective to something like this:
$$
\hat { w} = arg \min _ w \| s \cdot w - y \| ^ 2 \quad \rightsquigarrow \hat { w} _ \alpha = arg \min _ w \left (\| s \, w - y \| ^ 2 + \alpha \cdot \| w \| ^ 2 \right )
$$
where $ \alpha \, \| w \| ^ 2 $ is the \textbf { regularisation term} and $ \alpha > 0 $ controls its strength.
\\ \\
$
\hat { w} _ \alpha \rightarrow \hat { w}
$ for $ \alpha \rightarrow 0$
\\
$
\hat { w} _ \alpha \rightarrow (0,..., 0)
$ for $ \alpha \rightarrow \infty $ \\
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img1.JPG}
\caption { }
%\label{fig:}
\end { figure} \\
\\
$ \hat { w } _ \alpha $ has more bias than $ \hat { w } $ , but also less variance
\\
$$ \nabla \left ( \| s \, w - y \| ^ 2 + \alpha \, \| w \| ^ 2 \right ) \ = \ 2 \, \left ( s ^ T \, s \, w - s ^ T \, y \right ) + 2 \, \alpha \, w = 0 $$
$$
\left (s^ T \, s + \alpha \, I \right ) \, w = s^ T \, y
$$
$$
\underbrace { (d \times m) \, (m \times d)} _ { s^ T \, s} \quad \underbrace { (d \times d)} _ { \alpha \, I} \qquad \underbrace { (d \times m) \, (m \times 1)} _ { s^ T \, y}
$$
where $ I $ is the $ d \times d $ identity matrix
\\
$$
\hat { w} _ \alpha = \left ( s^ T \, s + \alpha \, I \right )^ { -1} \, s^ T \, y
$$
where $ \lambda _ 1 ,..., \lambda _ d \geq 0 $ are the eigenvalues of $ s ^ T \, s $ (which is positive semidefinite)
\\
$ \lambda _ 1 + \alpha ,..., \lambda _ d + \alpha > 0 $ are the eigenvalues of $ s ^ T \, s + \alpha \, I $
\\ In this way we make the matrix positive definite.\\
We can always compute the inverse, and the solution is more stable; here stable means it \bred { does not overfit} .
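A minimal NumPy sketch of the ridge solution under these formulas (the random data and the names `S`, `y`, `alpha` are hypothetical):

```python
import numpy as np

# Hypothetical random data (seeded for reproducibility).
rng = np.random.default_rng(0)
m, d = 10, 3
S = rng.normal(size=(m, d))    # design matrix
y = rng.normal(size=m)

alpha = 1.0
# Ridge closed form: S^T S + alpha I is positive definite for alpha > 0,
# so the linear system is always solvable.
w_ridge = np.linalg.solve(S.T @ S + alpha * np.eye(d), S.T @ y)

# As alpha -> 0 the ridge solution approaches OLS, and regularisation
# shrinks the norm of w.
w_small = np.linalg.solve(S.T @ S + 1e-10 * np.eye(d), S.T @ y)
w_ols = np.linalg.lstsq(S, y, rcond=None)[0]
```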
\section { Perceptron}
Now we want to talk about algorithms. \\
Data here are processed in a sequential fashion one by one.\\
Each datapoint is processed in constant time $ \Theta \left ( d \right ) $ \\
(check $ y _ t \, w ^ T x _ t \leq 0 $ and, in case of a mistake, update $ w \leftarrow w + y _ t \, x _ t $ )
and the linear model can be stored in $ \Theta ( d ) $ space.
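The sequential processing described above can be sketched as follows (toy data; `perceptron_update` is a hypothetical helper name, and labels are in $ \{ -1, +1 \} $ ):

```python
import numpy as np

def perceptron_update(w, x, y):
    """One O(d) perceptron step (hypothetical helper name).
    Labels y are in {-1, +1}; x includes the constant 1 feature."""
    if y * (w @ x) <= 0:       # mistake (or prediction on the boundary)
        w = w + y * x          # update: w <- w + y_t x_t
    return w

# Toy linearly separable stream: label = sign of the first coordinate.
stream = [(np.array([1.0, 1.0]), 1),
          (np.array([-2.0, 1.0]), -1),
          (np.array([3.0, 1.0]), 1),
          (np.array([-1.0, 1.0]), -1)]

w = np.zeros(2)
for x_t, y_t in stream:
    w = perceptron_update(w, x_t, y_t)
# The final w classifies every point in this stream correctly.
```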
\\
Sequential processing scales well with the number of datapoints.
\\ But it also deals well with scenarios where new data are generated all the time.
\\ Several scenarios, like:
\begin { itemize}
\item Sensor data
\item Financial data
\item Logs of users
\end { itemize}
So sequential learning is good when we have lots of data, and in scenarios where data come in continuously, like sensor streams.
\\ We call it \bred { Online learning}
\\
\subsection { Online Learning }
It is a learning protocol, and we can think of it as the sequential counterpart of batch learning.
We have a class $ H $ of predictors and a loss function $ \ell $ , and an algorithm that outputs an initial default predictor $ h _ 1 \in H $ .
\\ \\
For $ t = 1 , 2 ... $ \\
1) Next example $ ( x _ t, y _ t ) $ is observed \\
2) The loss $ \ell ( h _ t ( x _ t ) , y _ t ) $ is observed \qquad $ ( y _ t \, w ^ T \, x _ t \leq 0 ) $
\\
3) The algorithm updates $ h _ t $ generating $ h _ { t + 1 } $ \qquad $ ( w \leftarrow w + y _ t \, x _ t ) $
\\ \\
The algorithm generates a sequence $ h _ 1 , h _ 2 , ... $ of models\\
It could be that $ h _ { t + 1 } = h _ t $ occasionally
\\
The update $ h _ t \rightarrow h _ { t + 1 } $ is \textbf { local} (it only uses $ h _ t $ and $ ( x _ t, y _ t ) $ )
\\
This is the batch way of doing it: every time a new example arrives, we retrain on all the data collected so far.
$$
(x_ 1,y_ 1) \rightarrow A \rightarrow h_ 2
$$
$$
(x_ 1,y_ 1) (x_ 2,y_ 2) \rightarrow A \rightarrow h_ 3
$$
But with an online learning algorithm I can just look at the updates:
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img2.JPG}
\caption { }
%\label{fig:}
\end { figure} \\
This is a more efficient way, and each update can be done in constant time.
Batch learning usually outputs a single predictor, while online learning uses a sequence of predictors.
\\ \\
How do I evaluate an online learning algorithm A?
I cannot evaluate a single model; instead we use a method called \bred { Sequential Risk} . \\
Suppose that I have $ h _ 1 , h _ 2 ... $ on some data sequence.
\\
$$
\frac { 1} { T} \ \sum _ { t=1} ^ { T} \ell (h_ t(x_ t), y_ t) \qquad \textit { as a function of } T
$$
The loss on the next incoming example.
\newpage
I would like something like this:
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img3.JPG}
\caption { }
%\label{fig:}
\end { figure} \\ \\
We need to fix the sequence of data: I absorb the example into the loss of the predictor.
$$
\ell (h_ t(x), y_ t) \longrightarrow \ell _ t(h_ t)
$$
I can write the sequential risk of the algorithm:
$$
\frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h_ t) - \min _ { h \in H} \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h)
$$
So: the sequential risk of the algorithm minus the sequential risk of the best predictor in $ H $ (up to time $ T $ ).
\\
\bred { This is a sequential analogue of the variance error.} $ \longrightarrow $ It is called the \textbf { Regret} .
\\
$$
h^ *_ T = arg \min _ { h \in H} \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h) \qquad \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h_ t) - \frac { 1} { T} \sum _ { t=1} ^ { T} \ell _ t(h_ T^ *)
$$
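A small numerical illustration of regret, assuming a finite class $ H $ of constant predictors and square loss (the data, the class, and the running-mean learner below are hypothetical choices, not from the lecture):

```python
import numpy as np

# Finite class of constant predictors and a short data sequence.
H = [0.0, 0.5, 1.0]                    # candidate constant predictions
y_seq = [0.0, 1.0, 1.0, 0.0, 1.0]      # labels y_1, ..., y_T

def loss(p, y):
    return (p - y) ** 2                # square loss

# The online learner predicts the running mean of past labels (a simple,
# hypothetical algorithm); alg_losses[t] is its loss l_t(h_t) at time t.
alg_losses, past = [], []
for y in y_seq:
    h_t = np.mean(past) if past else 0.0
    alg_losses.append(loss(h_t, y))
    past.append(y)

T = len(y_seq)
best_total = min(sum(loss(h, y) for y in y_seq) for h in H)
# Regret: sequential risk of the algorithm minus that of the best h in H.
regret = sum(alg_losses) / T - best_total / T
```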
\newpage
\subsection { Online Gradient Descent (OGD)}
It is an example of an online learning algorithm. \\
In optimisation, say in one dimension, we want to minimise a function, and I can compute the gradient at every point. \\
We start from a point and compute the derivative: from the derivative I can see whether the function is decreasing or increasing.\\
\begin { figure} [h]
\centering
\includegraphics [width=0.3\linewidth] { ../img/lez14-img4.JPG}
\caption { }
%\label{fig:}
\end { figure} \\
$$
f \ \textit { convex} \qquad \min _ x f(x) \qquad f : \barra { R} ^ d \rightarrow \barra { R}
$$
$$
x_ { t+1} = x_ t - \eta \, \nabla f(x_ t) \qquad \eta > 0
$$
$$
w_ { t+1} = w_ t - \eta \, \nabla \ell _ t(w_ t)
$$
where $ \eta $ is the learning rate.
$$
h(x) = w^ T \, x \qquad \ell _ t(w) = \ell ( w^ T \, x_ t, y_ t) \qquad \textit { for instance } \ \ell (w^ T \, x_ t, y_ t) = (w^ T \, x_ t - y_ t)^ 2
$$
Assumption: $ \ell _ t $ is convex (so that optimisation is easy) and differentiable (so that we can compute the gradient).
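A minimal OGD sketch under these assumptions, using the square loss from above (the data stream and learning rate are hypothetical):

```python
import numpy as np

# Square loss l_t(w) = (w^T x_t - y_t)^2, with gradient 2 (w^T x_t - y_t) x_t.
# OGD update: w_{t+1} = w_t - eta * grad l_t(w_t).
eta = 0.1
w = np.zeros(2)
stream = [(np.array([1.0, 0.0]), 1.0),
          (np.array([0.0, 1.0]), -1.0)] * 50   # the toy sequence repeats

for x_t, y_t in stream:
    grad = 2 * (w @ x_t - y_t) * x_t           # gradient of the square loss
    w = w - eta * grad                         # descent step (note the minus)

# w converges toward [1, -1], which fits both examples exactly.
```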
\end { document}