\documentclass [../main.tex] { subfiles}
\begin { document}
\chapter { Lecture 19 - 18-05-2020}
$$
k(x,x') = < \phi (x), \phi (x')>
\qquad \phi : X \rightarrow H
$$
where, for instance, $ X = \barra { R } ^ 2 $ and $ H = \barra { R } ^ N $ .
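\\
As a concrete standard example of such a pair: take $ X = \barra { R} ^ 2 $ and the quadratic kernel $ k(x,x') = (x^ T x')^ 2 $ . An explicit feature map into $ H = \barra { R} ^ 3 $ is
$$
\phi (x) = \left( x_ 1^ 2, \ x_ 2^ 2, \ \sqrt { 2} \, x_ 1 x_ 2 \right) \qquad <\phi (x), \phi (x')> \ = \ x_ 1^ 2 (x'_ 1)^ 2 + x_ 2^ 2 (x'_ 2)^ 2 + 2 \, x_ 1 x_ 2 \, x'_ 1 x'_ 2 \ = \ (x^ T x')^ 2
$$
so the inner product in $ H $ can be computed without ever writing $ \phi $ explicitly.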
$$
H_ \delta = \{ \sum _ { i =1} ^ N \alpha _ i \, k_ \delta (x_ i, \cdot ), \ x_ 1,..., x_ N \in \barra { R} ^ d, \ \alpha _ 1, ..., \alpha _ N \in \barra { R} , \ N \in \barra { N } \}
$$
Inner product measures "similarities" between data points.
\\
$$
x^ T \, x' = \| x\| \, \| x'\| \, \cos \theta \qquad x, x' \in X
$$
Similarly, $ k(x,x') $ says how similar two structured objects (trees, documents, etc.) are.
\\
I would like to learn a predictor based on the notion of similarity.
\\
$$
k(x,x') = < \phi (x), \phi (x')>
$$
where $ <...> $ is the inner product.
\\
So we have Data $ \rightarrow $ Kernel $ \rightarrow $ Kernel learning algorithm
\\
Kernels offer a uniform interface to data, in such a way that the algorithm can learn from the data.
\\
Given $ K $ on $ X $ , I need to find a space $ H _ k $ and a map $ \phi _ k : X \rightarrow H _ k $
\\
together with an inner product $ <...> _ k $ s.t. $ k ( x,x' ) = < \phi _ k ( x ) , \phi _ k ( x' ) > _ k $
\\ \\
\bred { Theorem}
\\
Given $ K: X \times X \rightarrow \barra { R } $ , symmetric
\\
Then $ K $ is a Kernel iff $ \forall m \in \barra { N } $ and $ \forall x _ 1 ,...,x _ m \in X $
\\
the $ m \times m $ matrix $ K $ with $ K _ { ij } = k ( x _ i,x _ j ) $ is positive semidefinite, i.e.\\
$
\forall \alpha \in \barra { R} ^ m \qquad \alpha ^ T \, K \, \alpha \geq 0
$
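\\
For example, for the linear kernel $ k(x,x') = x^ T x' $ (a standard sanity check): with $ K_ { ij} = x_ i^ T x_ j $ we get
$$
\alpha ^ T \, K \, \alpha = \sum _ { i,j} \alpha _ i \, \alpha _ j \, x_ i^ T x_ j = \Big\| \sum _ i \alpha _ i \, x_ i \Big\| ^ 2 \geq 0
$$
so the condition of the theorem holds and the linear kernel is indeed a kernel.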
\\
In general, given a Kernel $ K $ there is no unique representation for $ \phi _ k $ and $ <...> _ k $ (inner product).
\\
However, there is a "canonical" representation:
$
\phi _ k(x) = K(x, \cdot )
$
$$
\phi _ k : X \rightarrow H \qquad H_ k = \{ \sum _ { i=1} ^ N \alpha _ i \, k (x_ i, \cdot ), \alpha _ 1,..., \alpha _ N \in \barra { R} , x_ 1,...,x_ N \in X, N \in \barra { N} \}
$$
We have to define an inner product like:
$$
<\phi _ k(x), \phi _ k(x')>_ k \ = \ k(x,x')
$$
This is the canonical representation of the feature map and its inner product.
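\\
For instance, with the linear kernel $ k(x,x') = x^ T x' $ the canonical space $ H_ k $ contains the functions
$$
g(x) = \sum _ i \alpha _ i \, k(x_ i, x) = w^ T \, x \qquad w = \sum _ i \alpha _ i \, x_ i
$$
so we recover the usual linear predictors.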
\\ \\
What happens when we use this mechanism to perform predictions?
\\
$
x \in \barra { R} ^ d , \ w \in \barra { R} ^ d , \ w^ T \, x \qquad \textit { in kernel space the role of } w \textit { is played by } g = \sum _ { i=1} ^ N \alpha _ i \, k (x_ i, \cdot )
$
$$
\phi _ k(x) \qquad g \in H_ k \qquad <g, \phi _ k(x)>_ k \ = \ <\sum _ i \alpha _ i k(x_ i, \cdot ), \phi _ k(x)> \ = $$
Using linearity of the inner product:
$$
= \ \sum _ i \alpha _ i <k(x_ i, \cdot ), k(x, \cdot ) >_ k \ = \ \sum _ i \alpha _ i <\phi (x_ i), \phi _ k(x)>_ k \ = \ \sum _ i \alpha _ i k(x_ i, x) = g(x)
$$
At the end we have:
$$
<g, \phi _ k(x)>_ k \ = \ g(x)
$$
\\ \\
Now, if i have two functions:
$$
f = \sum _ { i=1} ^ N \alpha _ i \, k(x_ i, \cdot ) \qquad g = \sum _ { j=1} ^ M \beta _ j \, k (x'_ j, \cdot ) \qquad f,g \in H_ k
$$
$$
<f,g>_ k = <\sum _ i \alpha _ i \, k(x_ i,\cdot ) , \sum _ j \beta _ j \, k(x'_ j, \cdot ) >_ k \ =
\ \sum _ i \sum _ j \alpha _ i \, \beta _ j <k(x_ i, \cdot ), k(x'_ j, \cdot )>_ k \ =
$$
$$
= \ \sum _ i \sum _ j \alpha _ i \, \beta _ j \, k(x_ i, x'_ j)
$$
$$
\| f\| ^ 2 = <f,f>_ k = \sum _ { ij} \alpha _ i \, \alpha _ j \, k(x_ i, x_ j)
$$
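For example, with $ N = 2 $ and $ f = \alpha _ 1 \, k(x_ 1, \cdot ) + \alpha _ 2 \, k(x_ 2, \cdot ) $ this gives
$$
\| f\| ^ 2 = \alpha _ 1^ 2 \, k(x_ 1,x_ 1) + 2 \, \alpha _ 1 \alpha _ 2 \, k(x_ 1,x_ 2) + \alpha _ 2^ 2 \, k(x_ 2,x_ 2) = \alpha ^ T \, K \, \alpha \geq 0
$$
which is nonnegative exactly because the kernel matrix $ K $ is positive semidefinite.
\\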
Perceptron convergence theorem in kernel space:
$$
M \leq \| u\| ^ 2 \left( \max _ t \| x_ t\| ^ 2 \right) \qquad \forall u \in \barra { R} ^ d : \ y_ t \, u^ T \, x_ t \geq 1 \quad \rightsquigarrow \quad \forall g \in H_ k : \ y_ t \, g(x_ t) \geq 1
$$
we know that:
$$
\| x_ t \| ^ 2 \rightsquigarrow \| \phi _ k(x_ t)\| ^ 2_ k \ = \ <\phi _ k(x_ t), \phi _ k(x_ t) >_ k \ = \ k (x_ t,x_ t)
$$
so the mistake bound in kernel space becomes
$$
M \leq \| g\| ^ 2_ k \left( \max _ t \, k(x_ t, x_ t) \right) \qquad \forall g \in H_ k : \ y_ t \, g(x_ t) \geq 1
$$
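A sketch of how the Perceptron runs in kernel space (the standard kernelized update): keep the set $ S $ of indices of past mistakes, predict
$$
\hat { y} _ t = \mathrm { sgn} \Big( \sum _ { s \in S} y_ s \, k(x_ s, x_ t) \Big)
$$
and, whenever $ \hat { y} _ t \neq y_ t $ , add $ t $ to $ S $ . The predictor maintained in this way is exactly an element $ g = \sum _ { s \in S} y_ s \, k(x_ s, \cdot ) $ of $ H_ k $ .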
\\ \\
Ridge regression:
$$
w = \left ( \alpha \, I + S^ T \, S \right )^ { -1} \, S^ T \, y
$$
$ S $ is the $ m \times d $ matrix whose rows are the training points $ x _ 1 ,..., x _ m \in \barra { R } ^ d $
\\
$ y = ( y _ 1 ,...,y _ m ) \quad y _ t \in \barra { R } $ are the training labels and $ \alpha > 0 $ is the regularisation parameter.
$$
\left ( \alpha \, I + S^ T \, S \right )^ { -1} \, S^ T \ = \ S^ T \left ( \alpha \, I_ m + S\, S^ T\right )^ { -1}
$$
where the left-hand side multiplies a $ d \times d $ matrix by a $ d \times m $ one, the right-hand side a $ d \times m $ matrix by an $ m \times m $ one, and both products are $ d \times m $
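\\
A quick way to check the identity (a standard argument):
$$
\left( \alpha \, I_ d + S^ T S \right) S^ T = \alpha \, S^ T + S^ T S \, S^ T = S^ T \left( \alpha \, I_ m + S \, S^ T \right)
$$
and multiplying on the left by $ \left( \alpha \, I_ d + S^ T S \right)^ { -1} $ and on the right by $ \left( \alpha \, I_ m + S \, S^ T \right)^ { -1} $ gives the stated equality.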
$$
\left ( S \, S^ T\right )_ { ij} = x_ i^ T x_ j \qquad \rightsquigarrow \ <\phi (x_ i),\phi (x_ j)>_ k = k(x_ i, x_ j) = K_ { ij}
$$
$$
S^ T = \left [ x_1,...,x_m \right] \ \rightsquigarrow \ \left [ \ \phi_k(x_1),..., \phi_k(x_m) \ \right] = \left [ \ k(x_1, \cdot), ..., k(x_m, \cdot) \ \right] \ = \ k(\cdot )
$$
$$
k (\cdot )^ T \, \left ( \alpha \, I_ m + K \right )^ { -1} \, y \ = \ g
$$
where the factors are $ 1 \times m $ , $ m \times m $ and $ m \times 1 $ \\ \\
How to compute prediction?
$$
g(x) = y^ T \left ( \alpha \, I_ m + K \right )^ { -1} \, k(x)
$$
\qquad where the factors are $ 1 \times m $ , $ m \times m $ and $ m \times 1 $
\\
In fact, this is the evaluation of $ g $ at any point $ x $ .
\\
The drawback is that we pass from a $ d \times d $ matrix to an $ m \times m $ matrix, which can be huge. So this approach is not really efficient as it stands; we need additional "tricks" to get a more compact representation of the prediction.
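\\
As a rough comparison (standard operation counts), the two inversions cost
$$
\underbrace { \left( \alpha \, I + S^ T S \right)^ { -1} } _ { d \times d : \ O(d^ 3) \ \textit { time} , \ O(d^ 2) \ \textit { memory} } \qquad \textit { vs.} \qquad \underbrace { \left( \alpha \, I_ m + K \right)^ { -1} } _ { m \times m : \ O(m^ 3) \ \textit { time} , \ O(m^ 2) \ \textit { memory} }
$$
which is why a large training set makes the kernelized formulation expensive.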
\newpage
\section { Support Vector Machine (SVM)}
It is a linear predictor and a very popular one, because it has better performance than the Perceptron. We will see it for classification, but there are also versions for regression.
\\ \\
The idea here is that you want to come up with a hyperplane that is defined as the solution of an optimisation problem.
\\
We have a classification dataset $ ( x _ 1 ,y _ 1 ) ... ( x _ m,y _ m ) \qquad x _ t \in \barra { R } ^ d \quad y _ t \in \{ - 1 , 1 \} $ and it is linearly separable.
\\
Define $ w ^ * $ as the solution of the following optimisation problem:
$$
\min _ { w \in \barra { R} ^ d} \frac { 1} { 2} \| w \| ^ 2 \qquad s.t \quad y_ t \, w^ T \, x_ t \geq 1 \quad t = 1,2,...,m
$$
Geometrically, $ w ^ * $ corresponds to the maximum margin separating hyperplane:
$$
\gamma ^ * = \max _ { u: \| u\| =1} \ \min _ { t=1,...,m} \ y_ t \, u^ T \, x_ t
$$
\textbf { The $ u ^ * $ achieving $ \gamma ^ * $ is the maximum margin separator.} \\
\begin { figure} [h]
\centering
\includegraphics [width=0.4\linewidth] { ../img/lez19-img1.JPG}
\caption { Drawing of the SVM maximum margin separator}
%\label{fig:}
\end { figure} \\
So I want to maximise this distance.
$$
\max _ { \gamma > 0, \, u} \, \gamma ^ 2 \qquad s.t \quad \| u \| ^ 2 = 1 \qquad y_ t \, u^ T \, x_ t \geq \gamma \quad t=1,...,m
$$
So we can maximise instead of minimising.
\\
What is the relationship between these two formulations? The following theorem states their equivalence.
\\ \\
\bred { Theorem} :\\
For every linearly separable dataset $ ( x _ 1 ,y _ 1 ) ,..., ( x _ m,y _ m ) $ :\\
The max margin separator $ u ^ * $ satisfies $ u ^ * = \gamma ^ * \, w ^ * $ where $ w ^ * $ is the SVM solution and $ \gamma ^ * $ is the maximum margin.
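\\
A sketch of why the equivalence holds (the standard change of variable): take a unit vector $ u $ with margin $ \gamma $ , i.e. $ y_ t \, u^ T x_ t \geq \gamma $ for $ t = 1,...,m $ , and set $ w = u / \gamma $ . Then
$$
y_ t \, w^ T x_ t \geq 1 \quad t = 1,...,m \qquad \| w \| ^ 2 = \frac { 1} { \gamma ^ 2}
$$
so maximising the margin $ \gamma $ is the same as minimising $ \frac { 1} { 2} \| w \| ^ 2 $ under the SVM constraints, and the optimal solutions are related by $ u ^ * = \gamma ^ * \, w ^ * $ .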
\end { document}