Now, this gradient is exactly the gradient of the loss $\ell_t(w)=\left[-y_t \, w^T \, x_t \right]_+$
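To see this (a quick check, using the mistake condition $y_t\, w_t^T x_t \leq 0$ that appears in the formula below): when $y_t\, w_t^T x_t \leq 0$ a valid (sub)gradient of $\ell_t$ at $w_t$ is $-y_t\, x_t$, and when $y_t\, w_t^T x_t > 0$ the gradient is $0$; hence the OGD step $w_{t+1} = w_t + \eta\, y_t\, x_t$ on mistakes (and no change otherwise) is exactly the perceptron update.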
\\
The problem is that this loss is not an upper bound on the zero-one loss, so bounding it does not give me a bound on the number of mistakes.
\\\\How do I do it?
\\
What if I just shift the loss to the right by one? \\
Now the shifted loss is an upper bound on the zero-one loss. It is called the \bred{hinge loss} (a hinge being the piece that attaches a door to its frame, whose shape the plot of this loss resembles).\\
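Concretely, shifting the loss to the right by one gives (in the notation $h_t$ used below)
$$
h_t(w) \;=\; \left[\, 1 - y_t\, w^T x_t \,\right]_+ \;\geq\; I\{ y_t\, w^T x_t \leq 0 \},
$$
since on a mistake $y_t\, w^T x_t \leq 0$ implies $1 - y_t\, w^T x_t \geq 1$, and otherwise the right-hand side is $0$ while the hinge loss is at least $0$.\\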
$$
M_T = \sum_{t=1}^T I\{y_t\, w_t^T \, x_t \leq 0 \}\qquad\textbf{invariant with respect to $\eta >0$}
$$
It does not matter which $\eta$ we choose: the number of mistakes is going to be the same. This means that the state of the algorithm (which depends only on the mistakes) is going to be the same.
\\
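To see why (a short sketch, assuming as usual that the perceptron starts from $w_1 = 0$ and updates only on mistakes): at any round
$$
w_t \;=\; \eta \sum_{s < t \,:\; y_s\, w_s^T x_s \leq 0} y_s\, x_s ,
$$
so the sign of $w_t^T x_t$, and therefore every prediction and every mistake, is the same for every value of $\eta > 0$.\\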
I can run the perceptron with $\eta=1$ and pretend (in the analysis) that it was run with $\eta=\frac{\| U \|}{X \,\sqrt{M_T}}$
\\\\
We go back to the bound on $M_T$. We are actually free to choose any comparison vector $U$.
\\
If $(x_1, y_1),\dots,(x_T,y_T)$ is linearly separable then: \\$\exists\, U$ s.t.\ $y_t \, U^T x_t \geq 1\ \ \forall t$, which implies $h_t(U)=0\ \ \forall t$
\\
$$M_T \ \leq\ \left(\,\| U\|\, X \,\right)^2\qquad\text{(the \bred{perceptron convergence theorem})}$$
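As a quick numerical illustration (the numbers here are made up, only the formula comes from the theorem): if some $U$ with $\| U \| = 2$ satisfies $y_t\, U^T x_t \geq 1$ for all $t$, and $\| x_t \| \leq X = 3$, then $M_T \leq (2 \cdot 3)^2 = 36$, no matter how large $T$ is.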
$$
M_T \ \leq\ \min_{U \in\barra{R}^d}\left( \sum_{t=1}^T h_t(U) + \left( \|U\|\, X \right)^2 + \| U \|\, X \ \sqrt{\sum_{t=1}^T h_t(U)}\right)
$$
These are called \bred{oracle bounds}: it is as if the perceptron knew the best $U$ in hindsight.
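Note how the two bounds fit together (a short sanity check using only the statements above): in the linearly separable case we can pick the $U$ with $y_t\, U^T x_t \geq 1$ for all $t$, so that $h_t(U) = 0$ for every $t$, and the oracle bound reduces to
$$
M_T \ \leq\ 0 + \left( \| U \|\, X \right)^2 + \| U \|\, X \cdot 0 \;=\; \left( \| U \|\, X \right)^2 ,
$$
which is exactly the perceptron convergence theorem.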
\newpage
\subsection{Strongly convex loss functions}
We use this to analyse the whole class of algorithms that regularise ERM, of which the support vector machine is the main example. We want to explain what happens when we use support vector machines. For neural networks we cannot do this, since NNs are not convex and there is no way to ``convexify'' them: by convexifying we would lose the power of NNs.
\\
We said that $\ell_t$ has to be convex, but there are many types of convexity.
These two functions, for example, are both convex: the one on the left always has positive curvature, while the one on the right has zero curvature, since it is made of two straight lines and is not differentiable at the kink.
In other words, the Hessian on the left is positive definite, while on the right the Hessian is zero (where it is defined).
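For instance (these specific functions are just an illustration, not necessarily the ones in the figure): think of $\ell(w) = w^2$ for the left picture, where $\ell''(w) = 2 > 0$ everywhere, and $\ell(w) = |w|$ for the right one, which is made of two straight lines, has $\ell''(w) = 0$ away from $w = 0$, and is not differentiable at $w = 0$.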
\\
We are looking for \bred{strongly convex losses}.
\\
$\ell$ differentiable is $\sigma$-strongly convex if, for all $u$ and $w$,
$$
\ell(u) \;\geq\; \ell(w) + \nabla \ell(w)^T (u - w) + \frac{\sigma}{2}\, \| u - w \|^2 .
$$
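For example (a standard example, not spelled out in these notes): $\ell(w) = \frac{1}{2} \| w \|^2$ is $1$-strongly convex, while the hinge loss, being piecewise linear, is convex but not strongly convex; adding the regulariser $\frac{\sigma}{2} \| w \|^2$ to any convex loss, as in the SVM objective mentioned above, makes the sum $\sigma$-strongly convex.\\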
Next lecture we are going to show that we can run OGD on strongly convex functions and get a better bound: the regret is going to vanish much faster than in the case of simple convexity.
\\
You can prove that when the Hessian can be zero (plain convexity), the regret vanishes at a rate of $\frac{U \, G}{\sqrt{T}}$.
\\
We will show that with strong convexity OGD converges much faster, at a rate of $\frac{\ln T}{T}$.\\
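To get a feeling for the difference (plain arithmetic, with the constants set to $1$): for $T = 10^6$, the rate $\frac{1}{\sqrt{T}}$ gives $10^{-3}$, while $\frac{\ln T}{T} \approx \frac{13.8}{10^6} \approx 1.4 \cdot 10^{-5}$, roughly seventy times smaller.\\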
This is also what happens in optimisation, where we prefer strictly convex functions.