\chapter{Lecture 14 - 28-04-2020}
\section{Linear Regression}
Yesterday we looked at the problem of empirical risk minimisation for a linear classifier. The 0-1 loss is not good: it is discontinuous, jumping from 0 to 1, and it is difficult to optimise. Maybe with linear regression we are luckier.
Our data points are of the form $(x,y)$ with $x \in \barra{R}^d$ and $y \in \barra{R}$ (regression), and we use the square loss $(\hat{y}-y)^2$.
This is a much nicer function and we can optimise it more easily.
\subsection{The problem of linear regression}
Instead of thresholding the prediction to $-1$ or $1$, we leave $w^T x$ as it is.
$$h(x) = w^T \ x \qquad w \in \barra{R}^d \qquad x = (x_1, \dots, x_d, 1) $$
$$\hat{w} = \arg \min_{w \in \barra{R}^d} \frac{1}{m} \sum_{t=1}^{m} (w^T \ x_t- y_t ) ^2 \qquad \textit{ERM for } \ (x_1,y_1), \dots, (x_m, y_m)$$
How do we compute the minimum? \\
We use the vector $v$ of linear predictions\\
$v = (w^T x_1, \dots, w^T x_m )$
and the vector of real-valued labels\\
$y = (y_1, \dots, y_m) $ where $v, y \in \barra{R}^m$
$$\sum_{t=1}^{m} (w^T x_t - y_t ) ^2 \ = \ \| v - y\|^2$$
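Let $s$ be the $m \times d$ matrix whose rows are the data points:
$$s^T = \left[ x_1, \dots , x_m \right] \quad (d \times m) \qquad v = s w = \left[ \begin{array}{c} x^T_1 \\ \vdots \\ x^T_m \end{array} \right] w$$
$$\| v - y\|^2 = \| sw - y\|^2$$
$$\hat{w} = \arg \min_{w \in \barra{R}^d} \| sw - y \| ^2 \qquad \textbf{where $s$ is the design matrix}$$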
$$F (w) = \| sw - y \|^2 \qquad \bred{is convex}$$
$$\nabla F(w) = 2 s^T \left( sw - y \right) = 0 \quad \Longleftrightarrow \quad s^T\ s w = s^T y$$
where $s^T$ is $d \times m$ and $s$ is $m \times d$, with $d \neq m$ in general.
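
As a sketch of where this gradient comes from (a standard expansion that the lecture does not spell out), one can write the objective as a quadratic form:
$$F(w) = (sw - y)^T (sw - y) = w^T s^T s \, w - 2 \, y^T s \, w + y^T y$$
Differentiating term by term gives
$$\nabla F(w) = 2 \, s^T s \, w - 2 \, s^T y = 2 \, s^T \left( sw - y \right)$$
Convexity can be checked in the same way: the Hessian is $2 \, s^T s$, and for every $z \in \barra{R}^d$ we have $z^T s^T s \, z = \| s z \|^2 \geq 0$.
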
If $s^T \ s$ is invertible (non-singular), then
$ \hat{w} = (s^T \ s)^{-1} \ s^T \ y $\\
and this is called the \bred{least squares solution (OLS)}.
One can check that $s^T\ s$ is non-singular exactly when $x_1, \dots , x_m$ span $\barra{R}^d$.
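
A small worked example may help here. The three data points are made up for illustration, with $d = 2$ after appending the constant feature $1$: the points are $(0,1)$, $(1,1)$, $(2,1)$ and the labels are $y = (1, 3, 5)$.
$$s = \left[ \begin{array}{cc} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{array} \right] \qquad s^T s = \left[ \begin{array}{cc} 5 & 3 \\ 3 & 3 \end{array} \right] \qquad s^T y = \left[ \begin{array}{c} 13 \\ 9 \end{array} \right]$$
$$\hat{w} = (s^T s)^{-1} s^T y = \frac{1}{6} \left[ \begin{array}{cc} 3 & -3 \\ -3 & 5 \end{array} \right] \left[ \begin{array}{c} 13 \\ 9 \end{array} \right] = \left[ \begin{array}{c} 2 \\ 1 \end{array} \right]$$
So a point $(a, 1)$ receives the prediction $2a + 1$. In this toy case the fit is exact, because the three points lie on a line.
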
However, $s^T s$ is not always invertible. Moreover, linear regression is a high-bias solution: ERM may underfit, since linear predictors introduce a large bias.
The solution $ \hat{w} = ( s^T s)^{-1} s^T y $ is also very unstable: it can change a lot when the dataset is perturbed.
This is called \bred{instability}, and it contributes to the variance error.
It is a good model to start with, to see what happens before trying more sophisticated models.
Even when $s^T s$ is invertible we still have to deal with the instability. But there is an easy fix!
\subsection{Ridge regression}
We want to stabilise our solution. A singular (or nearly singular) $s^T s$ is a problem.
We change the objective as follows:
$$\hat{w} = \arg \min_w \| s \, w - y \|^2 \quad \rightsquigarrow \quad \hat{w}_\alpha = \arg \min_w \left(\| s \, w - y \|^2 + \alpha \, \| w \|^2 \right)$$
where $\alpha > 0$ is the \textbf{regularisation parameter} and $\alpha \, \| w \|^2$ is the regularisation term.
$\hat{w}_\alpha \rightarrow \hat{w}$ for $\alpha \rightarrow 0$ \\
$\hat{w}_\alpha \rightarrow (0, \dots, 0)$ for $\alpha \rightarrow \infty$\\
$\hat{w}_\alpha$ has more bias than $\hat{w}$, but also less variance
$$\nabla \left( \| s \, w - y \|^2 + \alpha \, \| w \|^2 \right) \ = \ 2 \, \left( s^T \, s \, w - s^T \, y \right) + 2 \, \alpha \, w = 0$$
$$\left(s^T \, s + \alpha \, I \right) \, w = s^T \, y$$
where $I$ is the $d \times d$ identity, $s^T s$ is $(d \times m) \, (m \times d) = d \times d$, and $s^T y$ is $(d \times m) \, (m \times 1) = d \times 1$.
$$\hat{w}_\alpha = \left( s^T \, s + \alpha \, I \right)^{-1} \, s^T \, y$$
If $\lambda_1, \dots, \lambda_d \geq 0$ are the eigenvalues of $s^T \, s$, then
$\lambda_1 + \alpha, \dots, \lambda_d + \alpha > 0$ are the eigenvalues of $s^T \, s+ \alpha I$.
\\In this way we make the matrix positive definite.\\
We can always compute the inverse, and the solution is more stable, where stable means that it \bred{does not overfit}.
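
To see the effect of $\alpha$ on a singular problem, here is a small sketch with made-up data: a single point $x_1 = (1,1)$ with label $y_1 = 2$, so that $m = 1$ and $d = 2$.
$$s^T s = \left[ \begin{array}{cc} 1 & 1 \\ 1 & 1 \end{array} \right] \qquad \textit{eigenvalues } 2 \textit{ and } 0 \textit{ (singular)}$$
$$s^T s + \alpha I = \left[ \begin{array}{cc} 1 + \alpha & 1 \\ 1 & 1 + \alpha \end{array} \right] \qquad \textit{eigenvalues } 2 + \alpha \textit{ and } \alpha$$
$$\hat{w}_\alpha = \frac{1}{\alpha (\alpha + 2)} \left[ \begin{array}{cc} 1 + \alpha & -1 \\ -1 & 1 + \alpha \end{array} \right] \left[ \begin{array}{c} 2 \\ 2 \end{array} \right] = \frac{2}{\alpha + 2} \left[ \begin{array}{c} 1 \\ 1 \end{array} \right]$$
For any $\alpha > 0$ the inverse exists. As $\alpha \rightarrow \infty$ the solution shrinks to $(0,0)$, and as $\alpha \rightarrow 0$ it tends to $(1,1)$, which is one of the infinitely many minimisers of the unregularised objective for this data.
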
Now we want to talk about algorithms. \\
Here data are processed sequentially, one point at a time.\\
Each datapoint is processed in constant time $ \Theta \left( d \right)$\\
(check whether $y_t \, w^T x_t \leq 0$ and, if so, update $ w \leftarrow w + y_t \, x_t$)
and the linear model can be stored in $\Theta (d)$ space.
Sequential processing scales well with the number of datapoints.
\\ It is also good at dealing with scenarios where new data are generated all the time.
\\ Examples of such scenarios:
\begin{itemize}
\item Sensor data
\item Financial data
\item User logs
\end{itemize}
So sequential learning is good when we have a lot of data, and in scenarios in which data arrive as a stream, as with sensors.
\\ We call it \bred{Online learning}
\subsection{Online Learning }
It is a learning protocol, and it is useful to compare it with batch learning.
We have a class $H$ of predictors and a loss function $\ell$, and we have an algorithm that outputs an initial default predictor $h_1 \in H$.
For $t = 1,2, \dots$\\
1) The next example $(x_t, y_t)$ is observed \\
2) The loss $\ell ( h_t(x_t), y_t)$ is observed \qquad $(y_t \, w^T \, x_t \leq 0 )$ \\
3) The algorithm updates $h_t$, generating $h_{t+1}$ \qquad $(w \leftarrow w + y_t \, x_t)$ \\
The algorithm generates a sequence $h_1, h_2, \dots$ of models\\
It could be that $h_{t+1} = h_t$ occasionally
The update $h_t \rightarrow h_{t+1}$ is \textbf{local} (it only uses $h_t$ and $(x_t, y_t)$)
In batch learning, the algorithm $A$ takes the whole training set and generates a new predictor each time an example is added:
$$(x_1,y_1) \rightarrow A \rightarrow h_2$$
$$(x_1,y_1), (x_2,y_2) \rightarrow A \rightarrow h_3$$
With an online learning algorithm, instead, I only need to look at the updates: $h_{t+1}$ is computed from $h_t$ and $(x_t, y_t)$ alone.
This is much more efficient, and each update can be done in constant time.
Batch learning usually produces a single predictor, while online learning uses a sequence of predictors.
How do I evaluate an online learning algorithm A?
I cannot use a single model; instead we use a measure called \bred{Sequential Risk}. \\
Suppose that I run $h_1, h_2, \dots$ on some data sequence.
$$\frac{1}{T} \ \sum_{t=1}^{T} \ell(h_t(x_t), y_t) \qquad \textit{as a function of } T$$
Each term is the loss of the current predictor on the next incoming example.
I would like something like this:
We need to fix the sequence of data: I absorb the example into the loss of the predictor.
$$\ell(h_t(x_t), y_t) \longrightarrow \ell_t(h_t)$$
I can then compare the sequential risk of the algorithm with that of the best predictor:
$$\frac{1}{T} \sum_{t=1}^{T} \ell_t(h_t) - \min_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \ell_t(h)$$
This is the sequential risk of the algorithm minus the sequential risk of the best predictor in $H$ (up to $T$).
\bred{This is a sequential analogue of the variance error} and it is called \textbf{Regret}.
$$h^*_T = \arg \min_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \ell_t(h) \qquad \frac{1}{T} \sum_{t=1}^{T} \ell_t(h_t) - \frac{1}{T} \sum_{t=1}^{T} \ell_t(h_T^*)$$
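
As a purely illustrative computation (the numbers are invented, not taken from any dataset), take $T = 3$ and suppose the algorithm's predictors incur the losses $\ell_1(h_1) = 1$, $\ell_2(h_2) = 0.5$, $\ell_3(h_3) = 0$, while the best fixed predictor $h_3^*$ incurs $0.2$, $0.1$, $0.3$ on the same three examples. The regret is then
$$\frac{1}{3} \left( 1 + 0.5 + 0 \right) - \frac{1}{3} \left( 0.2 + 0.1 + 0.3 \right) = 0.5 - 0.2 = 0.3$$
On average the algorithm pays $0.3$ more per step than the best fixed predictor chosen in hindsight, even though its last predictor does better on the last example.
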
\subsection{Online Gradient Descent (OGD)}
It is an example of an online learning algorithm. \\
In optimisation, suppose we are in one dimension and we want to minimise a function whose gradient I can compute at every point. \\
We start from a point and compute the derivative: its sign tells me whether the function is increasing or decreasing there, so I move in the direction of decrease.
$$f \ \textit{convex} \qquad \min_x f(x) \qquad f : \barra{R}^d \rightarrow \barra{R}$$
$$x_{t+1} = x_t - \eta \nabla f(x_t) \qquad \eta > 0$$
$$w_{t+1} = w_t - \eta \, \nabla \ell_t(w_t)$$
where $\eta$ is the learning rate.
$$h(x) = w^T \, x \qquad \ell_t(w) = \ell( w^T \, x_t, y_t) \qquad \textit{for instance } \ \ell(w^T \, x_t, y_t) = (w^T \, x_t - y_t)^2$$
Assumption: $\ell_t$ is convex (to do optimisation easily) and differentiable (to compute the gradient).
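
For the square loss the update can be written out explicitly. The numbers below are assumptions chosen for illustration ($w_1 = (0,0)$, $x_1 = (1,1)$, $y_1 = 2$, $\eta = 0.1$), and the step is taken in the descent direction $-\nabla \ell_t$:
$$\nabla \ell_t(w) = 2 \, (w^T x_t - y_t) \, x_t \qquad w_{t+1} = w_t - 2 \, \eta \, (w_t^T x_t - y_t) \, x_t$$
$$w_2 = (0,0) - 2 \cdot 0.1 \cdot (0 - 2) \cdot (1,1) = (0.4, 0.4)$$
$$\ell_1(w_1) = (0 - 2)^2 = 4 \qquad \ell_1(w_2) = (0.8 - 2)^2 = 1.44$$
So a single step reduces the loss on that example from $4$ to $1.44$.
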
\end{document}