% mirror of https://github.com/Andreaierardi/Master-DataScience-Notes.git (synced 2025-01-27 11:47:36 +01:00)
\documentclass[../main.tex]{subfiles}
\begin{document}
\chapter{Lecture 14 - 28-04-2020}
\section{Linear Regression}
Yesterday we looked at the problem of empirical risk minimisation for a linear classifier. The 0-1 loss is a poor objective: it is discontinuous, jumping from 0 to 1, and it is difficult to optimise. Maybe with linear regression we are luckier.
\\
Our data points are of the form $(x,y)$ with $x \in \barra{R}^d$ and, since this is regression, $y \in \barra{R}$. We use the square loss $(\hat{y}-y)^2$.
\\
This is a much nicer function, and we can optimise it more easily.
\\
\subsection{The problem of linear regression}
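Instead of thresholding the prediction to $-1$ or $1$, we just leave it as it is.
$$h(x) = w^T \ x \qquad w \in \barra{R}^d \qquad x = (x_1, ..., x_d, 1) $$
The constant feature $1$ appended to $x$ plays the role of the bias term; with a slight abuse of notation we keep writing $d$ for the dimension of the extended vector.
\\
$$
\hat{w} = \arg\min_{w \in \barra{R}^d} \frac{1}{m} \sum_{t=1}^{m} (w^T \ x_t- y_t ) ^2 \qquad \textit{ERM for } \ (x_1,y_1), ..., (x_m, y_m)
$$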
How do we compute the minimum? \\
We use the vector $v$ of linear predictions\\
$v = (w^T x_1, ..., w^T x_m )$
\\
and the vector of real-valued labels\\
$y = (y_1, ..., y_m) $ where $v, y \in \barra{R}^m$
\\
$$
\sum_{t=1}^{m} (w^T x_t - y_t ) ^2 \ = \ \| v - y\|^2
$$
Let $s$ be the $m \times d$ matrix whose rows are the data points, so that $s^T$ is $d \times m$:
$$
s^T = \left[ x_1, ... , x_m \right]
\qquad
v = s w =
\begin{bmatrix}
x^T_1 \\ \vdots \\ x^T_m
\end{bmatrix}
w
$$
So:
$$
\| v - y\|^2 = \| sw - y\|^2
$$
$$
\hat{w} = \arg\min_{w \in \barra{R}^d} \| sw - y \| ^2 \qquad \textbf{where $s$ is the design matrix}
$$
$$
F (w) = \| sw - y \|^2 \qquad \bred{is convex}
$$
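To see why $F$ is convex, one can expand the squared norm. The following is only a short derivation sketch, using nothing beyond the definitions of $s$, $w$ and $y$ given so far:
$$
F(w) = (sw - y)^T (sw - y) = w^T s^T s \, w - 2 \, y^T s \, w + y^T y
$$
$$
\nabla F(w) = 2 \, s^T s \, w - 2 \, s^T y \qquad \nabla^2 F(w) = 2 \, s^T s
$$
For every $u \in \barra{R}^d$ we have $u^T s^T s \, u = \| s u \|^2 \geq 0$, so the Hessian is positive semidefinite and $F$ is convex.
\\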
Since $F$ is convex, we find its minimum by setting the gradient to zero:
$$
\nabla F(w) = 2 s^T \left( sw - y \right) = 0 \qquad \Longrightarrow \qquad s^T\ s w = s^T y
$$
where $s^T$ is $d \times m$ and $s$ is $m \times d$, with $d \neq m$ in general.
\\
If $s^T \ s$ is invertible (non-singular), then
$ \hat{w} = (s^T \ s)^{-1} \ s^T \ y $\\
This is called the \bred{ordinary least squares (OLS) solution}.
\\\\
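As a small illustration of the closed form, consider a made-up dataset (the numbers are hypothetical and chosen only to keep the arithmetic simple): $m = 3$ points with one feature plus the constant feature $1$, so $d = 2$.
$$
s = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix} \qquad y = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \qquad s^T s = \begin{bmatrix} 14 & 6 \\ 6 & 3 \end{bmatrix} \qquad s^T y = \begin{bmatrix} 11 \\ 5 \end{bmatrix}
$$
Since $\det(s^T s) = 14 \cdot 3 - 6 \cdot 6 = 6 \neq 0$, the matrix is invertible and
$$
\hat{w} = (s^T s)^{-1} s^T y = \frac{1}{6} \begin{bmatrix} 3 & -6 \\ -6 & 14 \end{bmatrix} \begin{bmatrix} 11 \\ 5 \end{bmatrix} = \begin{bmatrix} 1/2 \\ 2/3 \end{bmatrix}
$$
so the fitted predictor on this toy dataset is $h(x) = \frac{1}{2} \, x_1 + \frac{2}{3}$.
\\\\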
One can check that $s^T\ s$ is non-singular if and only if $x_1, ... , x_m$ span $\barra{R}^d$.
\\\\
However, $s^T s$ is not always invertible. Moreover, linear regression is a high-bias solution: ERM may underfit, since linear predictors introduce a large bias.
\\
In addition, $ \hat{w} = ( s^T s)^{-1} s^T y $ can be very unstable: it can change a lot when the dataset is perturbed.
\\
This phenomenon is called \bred{instability} and it corresponds to the variance error.
\\
Linear regression is still a good first model: we use it to see what happens and then try more sophisticated models.
\\
Even when $s^T s$ is invertible, $\hat{w}$ can be unstable (a claim we would still have to prove). But there is an easy fix!
\\\\
\subsection{Ridge regression}
We want to stabilise our solution. We also want to handle the case in which $s^T s$ is singular, which is a problem for OLS.
\\\\
We change the objective as follows:
$$
\hat{w} = \arg\min_w \| s \, w - y \|^2 \quad \rightsquigarrow \quad \hat{w}_\alpha = \arg\min_w \left(\| s \, w - y \|^2 + \alpha \, \| w \|^2 \right)
$$
where $\alpha > 0$ is the \textbf{regularisation parameter}.
\\\\
$\hat{w}_\alpha \rightarrow \hat{w}$ for $\alpha \rightarrow 0$
\\
$\hat{w}_\alpha \rightarrow (0,..., 0)$ for $\alpha \rightarrow \infty$
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img1.JPG}
\caption{}
%\label{fig:}
\end{figure}
$\hat{w}_\alpha$ has more bias than $\hat{w}$, but also less variance.
\\
$$\nabla \left( \| s \, w - y \|^2 + \alpha \, \| w \|^2 \right) \ = \ 2 \, \left( s^T \, s \, w - s^T \, y \right) + 2 \, \alpha \, w = 0$$
$$
\left(s^T \, s + \alpha \, I \right) \, w = s^T \, y
$$
where $I$ is the $d \times d$ identity. The dimensions match: $s^T s$ is $(d \times m) \, (m \times d) = d \times d$, $w$ is $d \times 1$, and $s^T y$ is $(d \times m) \, (m \times 1) = d \times 1$.
\\
$$
\hat{w}_\alpha = \left( s^T \, s + \alpha \, I \right)^{-1} \, s^T \, y
$$
Let $\lambda_1,..., \lambda_d \geq 0$ be the eigenvalues of $s^T \, s$, which is positive semidefinite.
\\
Then $ \lambda_1 + \alpha,..., \lambda_d + \alpha > 0 $ are the eigenvalues of $s^T \, s+ \alpha I$.
\\In this way we make the matrix positive definite.\\
We can therefore always compute the inverse, and the solution is more stable, where stable means that it \bred{does not overfit}.
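\\
To make the effect of $\alpha$ concrete, here is a small worked example on made-up data (hypothetical numbers, chosen only for easy arithmetic), with $d = 2$, $m = 3$ and $\alpha = 1$:
$$
s = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix} \qquad y = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \qquad s^T s + \alpha I = \begin{bmatrix} 15 & 6 \\ 6 & 4 \end{bmatrix} \qquad s^T y = \begin{bmatrix} 11 \\ 5 \end{bmatrix}
$$
$$
\hat{w}_\alpha = \frac{1}{24} \begin{bmatrix} 4 & -6 \\ -6 & 15 \end{bmatrix} \begin{bmatrix} 11 \\ 5 \end{bmatrix} = \begin{bmatrix} 7/12 \\ 3/8 \end{bmatrix}
$$
On the same data, the formula with $\alpha = 0$ gives $\hat{w} = (1/2, \, 2/3)$. The regularised solution has a smaller norm, $\| \hat{w}_\alpha \|^2 = 277/576 < 400/576 = \| \hat{w} \|^2$, which is the shrinkage towards $(0, ..., 0)$ that the penalty $\alpha \, \| w \|^2$ is meant to produce.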
\section{Perceptron}
Now we want to talk about algorithms. \\
Here the data are processed sequentially, one example at a time.\\
Each datapoint is processed in time $ \Theta \left( d \right)$, which is constant with respect to the number of datapoints\\
(check whether $y_t \, w^T x_t \leq 0$ and, if so, update $ w \leftarrow w + y_t \, x_t$)
and the linear model can be stored in $\Theta (d)$ space.
\\
Sequential processing scales well with the number of datapoints.
\\ It is also good at dealing with scenarios where new data are generated all the time.
\\ Examples of such scenarios:
\begin{itemize}
\item Sensor data
\item Financial data
\item User logs
\end{itemize}
So sequential learning is good when we have a lot of data, and in scenarios in which the data arrive a little at a time, as with sensors.
\\ We call it \bred{Online learning}
\\
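As an illustration of the check-and-update rule, here are three steps on made-up data (hypothetical numbers, $d = 2$), starting from $w = (0, 0)$:
\begin{itemize}
\item $x_1 = (1, 2)$, $y_1 = -1$: $y_1 \, w^T x_1 = 0 \leq 0$, so we update $w \leftarrow (0, 0) + (-1) \, (1, 2) = (-1, -2)$
\item $x_2 = (2, 1)$, $y_2 = +1$: $y_2 \, w^T x_2 = -4 \leq 0$, so we update $w \leftarrow (-1, -2) + (2, 1) = (1, -1)$
\item $x_3 = (1, 2)$, $y_3 = -1$: $y_3 \, w^T x_3 = -(1 - 2) = 1 > 0$, so $w$ is left unchanged
\end{itemize}
Each step costs one inner product and at most one vector addition, which is where the $\Theta(d)$ time per datapoint comes from.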
\subsection{Online Learning }
Online learning is a learning protocol, and we can think of it as a counterpart of batch learning.
We have a class $H$ of predictors and a loss function $\ell$, and we have an algorithm that outputs an initial default predictor $h_1 \in H$.
\\\\
For $t = 1,2, ...$\\
1) The next example $(x_t, y_t)$ is observed \\
2) The loss $\ell ( h_t(x_t), y_t)$ is observed \qquad (Perceptron: check $y_t \, w^T \, x_t \leq 0$)
\\
3) The algorithm updates $h_t$, generating $h_{t+1}$ \qquad (Perceptron: $w \leftarrow w + y_t \, x_t$)
\\\\
The algorithm generates a sequence $h_1, h_2, ...$ of models.\\
It could be that $h_{t+1} = h_t$ occasionally.
\\
The update $h_t \rightarrow h_{t+1}$ is \textbf{local} (it only uses $h_t$ and $(x_t, y_t)$).
\\
Compare this with a batch algorithm $A$, which takes the whole training set seen so far and generates a new predictor each time:
$$
(x_1,y_1) \rightarrow A \rightarrow h_2
$$
$$
(x_1,y_1) (x_2,y_2) \rightarrow A \rightarrow h_3
$$
With an online learning algorithm, instead, I only need to look at the updates:
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img2.JPG}
\caption{}
%\label{fig:}
\end{figure}
This is a more efficient approach, since each update can be done in constant time.
Batch learning usually produces a single predictor, while online learning uses a sequence of predictors.
\\\\
How do I evaluate an online learning algorithm $A$?
I cannot use the risk of a single model; instead we use a notion called \bred{Sequential Risk}. \\
Suppose that $A$ generates $h_1, h_2, ...$ on some data sequence.
\\
$$
\frac{1}{T} \ \sum_{t=1}^{T} \ell(h_t(x_t), y_t) \qquad \textit{as a function of T}
$$
Each term is the loss of the current predictor $h_t$ on the next incoming example $(x_t, y_t)$.
\newpage
I would like something like this:
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img3.JPG}
\caption{}
%\label{fig:}
\end{figure}
We need to fix the sequence of data: I absorb the example into the loss of the predictor.
$$
\ell(h_t(x_t), y_t) \longrightarrow \ell_t(h_t)
$$
I can now compare the sequential risk of the algorithm with that of the best fixed predictor:
$$
\frac{1}{T} \sum_{t=1}^{T} \ell_t(h_t) - \min_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \ell_t(h)
$$
This is the sequential risk of the algorithm minus the sequential risk of the best predictor in $H$ (up to $T$).
\\
\bred{This is a sequential analogue of the variance error} $\longrightarrow$ it is called \textbf{Regret}.
\\
$$
h^*_T = \arg\min_{h \in H} \frac{1}{T} \sum_{t=1}^{T} \ell_t(h) \qquad \frac{1}{T} \sum_{t=1}^{T} \ell_t(h_t) - \frac{1}{T} \sum_{t=1}^{T} \ell_t(h_T^*)
$$
\newpage
\subsection{Online Gradient Descent (OGD)}
It is an example of an online learning algorithm. \\
In optimisation, consider first the one-dimensional case: we want to minimise a function whose derivative we can compute at every point. \\
We start from a point and compute the derivative there: its sign tells us whether the function is increasing or decreasing, so we move in the direction in which it decreases.\\
\begin{figure}[h]
\centering
\includegraphics[width=0.3\linewidth]{../img/lez14-img4.JPG}
\caption{}
%\label{fig:}
\end{figure}
$$
f \quad \textit{convex} \qquad \min_x f(x) \qquad f : \barra{R}^d \rightarrow \barra{R}
$$
Gradient descent moves against the gradient:
$$
x_{t+1} = x_t - \eta \nabla f(x_t) \qquad \eta > 0
$$
Online gradient descent applies the same update to the current loss $\ell_t$:
$$
w_{t+1} = w_t - \eta \, \nabla \ell_t(w_t)
$$
where $\eta$ is the learning rate.
$$
h(x) = w^T \, x \qquad \ell_t(w) = \ell( w^T \, x_t, y_t) \qquad \textit{for instance } \ \ell(w^T \, x_t, y_t) = (w^T \, x_t - y_t)^2
$$
Assumption: $\ell_t$ is convex (so that optimisation is easy) and differentiable (so that we can compute the gradient).
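\\
For the square loss the OGD step can be written out explicitly. This is only a sketch: the step is taken with a minus sign, that is against the gradient, since we want to decrease the loss, and the numbers in the example are hypothetical.
$$
\nabla \ell_t(w) = 2 \, (w^T x_t - y_t) \, x_t \qquad w_{t+1} = w_t - 2 \, \eta \, (w_t^T x_t - y_t) \, x_t
$$
For instance, take $\eta = 0.05$, $w_t = (0, 0)$, $x_t = (1, 2)$ and $y_t = 1$. Then $w_t^T x_t - y_t = -1$, the gradient is $(-2, -4)$ and
$$
w_{t+1} = (0, 0) - 0.05 \, (-2, -4) = (0.1, \, 0.2)
$$
The prediction on $x_t$ moves from $0$ to $0.5$, so the loss on this example drops from $1$ to $0.25$.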
\end{document}