AdaDelta and Adam Algorithm

AdaDelta

AdaDelta is another variant of AdaGrad. Like RMSProp, it avoids relying too heavily on past gradients by using a leaky average, but in a more involved way. Here is how it works.

First, like RMSProp, we have:

$$S_t=\rho S_{t-1} + (1-\rho)g_t^2$$

but unlike RMSProp, we don't update $W$ directly with $S$. Instead, we have a pair of iterative equations:

$$\begin{cases} M_t = \rho M_{t-1}+(1-\rho)G_t^2\\ G_t = \dfrac{\sqrt{M_{t-1}+\epsilon}}{\sqrt{S_t+\epsilon}}\cdot g_t \end{cases}$$

If we substitute the expression for $M_{t-1}$ into the equation for $G_t$, we get a single equation:

$$G_t=\frac{\sqrt{\rho M_{t-2}+(1-\rho)G_{t-1}^2+\epsilon}}{\sqrt{S_t+\epsilon}}\cdot g_t$$

And we update W with G:

$$W_t = W_{t-1} - G_t$$

The iterative equations can be confusing. Within each step, the sequence of calculation and update is:

  1. the gradient ($g_t=\nabla W_t$)
  2. $S_t$
  3. $G_t$
  4. $M_t$
  5. $W_t$

In AdaDelta there is no learning rate $\eta$, which means we don't need to set that hyper-parameter ourselves.
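To make the order above concrete, here is a minimal NumPy sketch of one AdaDelta step; the function name `adadelta_step` and the toy usage are my own illustrative choices, not from any particular library.

```python
import numpy as np

def adadelta_step(W, grad, S, M, rho=0.9, eps=1e-6):
    """One AdaDelta update, following the order: g_t, S_t, G_t, M_t, W_t."""
    g = grad(W)                                    # 1. gradient g_t
    S = rho * S + (1 - rho) * g ** 2               # 2. S_t: leaky average of g_t^2
    G = np.sqrt(M + eps) / np.sqrt(S + eps) * g    # 3. G_t: rescaled step, uses M_{t-1}
    M = rho * M + (1 - rho) * G ** 2               # 4. M_t: leaky average of G_t^2
    W = W - G                                      # 5. W_t: note there is no learning rate
    return W, S, M

# toy usage: minimize f(W) = ||W||^2, so grad(W) = 2W
W = np.array([1.0, -2.0])
S = np.zeros_like(W)
M = np.zeros_like(W)
for _ in range(1000):
    W, S, M = adadelta_step(W, lambda w: 2 * w, S, M)
print(W)  # gradually approaches [0, 0]; AdaDelta's step size ramps up slowly from eps
```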

Adam

Adam is an algorithm that combines RMSProp with the momentum method:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

$$v_t = \beta_2 v_{t-1}+(1-\beta_2)g_t^2$$

It is suggested to set $\beta_1$ to 0.9 and $\beta_2$ to 0.999. Because $m_t$ and $v_t$ start from zero, they are biased toward zero in the early steps; to correct the discrepancy between the expectation of $v_t$ and that of $g_t^2$, we correct $v_t$:

$$\hat{v}_t=\frac{v_t}{1-\beta_2^t}$$
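To see the effect, take the first step with the usual zero initialization $v_0=0$: we get $v_1=(1-\beta_2)g_1^2$, so

$$\hat{v}_1=\frac{(1-\beta_2)g_1^2}{1-\beta_2^1}=g_1^2$$

which is on the right scale, whereas the uncorrected $v_1$ is shrunk by a factor of $1-\beta_2=0.001$.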

Likewise, we correct $m_t$ by:

$$\hat{m}_t=\frac{m_t}{1-\beta_1^t}$$

And now we can update $W$ by:

$$W_t=W_{t-1}-\eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$

For a derivation of this bias correction, the authors give a proof in the original paper: https://arxiv.org/pdf/1412.6980v9.pdf?
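As a concrete illustration of the whole update, here is a minimal NumPy sketch of one Adam step with bias correction; the function name `adam_step` and the toy usage are illustrative choices, not a reference implementation.

```python
import numpy as np

def adam_step(W, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts from 1), with bias correction."""
    m = beta1 * m + (1 - beta1) * g            # momentum-style first moment m_t
    v = beta2 * v + (1 - beta2) * g ** 2       # RMSProp-style second moment v_t
    m_hat = m / (1 - beta1 ** t)               # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)               # bias-corrected v_t
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# toy usage: minimize f(W) = ||W||^2, whose gradient is 2W
W = np.array([1.0, -2.0])
m = np.zeros_like(W)
v = np.zeros_like(W)
for t in range(1, 501):
    W, m, v = adam_step(W, 2 * W, m, v, t, lr=0.05)
print(W)  # should end up near [0, 0]
```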

Yogi

We can rewrite the $v_t$ formula in Adam as:

$$v_t=v_{t-1}+(1-\beta_2)(g_t^2-v_{t-1})$$

If the gradient is very large, the increment $(1-\beta_2)(g_t^2-v_{t-1})$ can blow up, and the Adam algorithm may fail to converge. To fix this problem, the Yogi algorithm changes the update to:

$$v_t=v_{t-1}+(1-\beta_2)g_t^2\cdot\operatorname{sgn}(g_t^2-v_{t-1})$$
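To make the difference from Adam concrete, here is a minimal NumPy sketch of the two second-moment updates side by side; the function names are my own, and the rest of the step (bias correction and the $W$ update) is unchanged from Adam.

```python
import numpy as np

def adam_v_update(v, g, beta2=0.999):
    """Adam's second-moment update in incremental form: the step toward g^2
    is proportional to (g^2 - v), which can be huge when g^2 spikes."""
    return v + (1 - beta2) * (g ** 2 - v)

def yogi_v_update(v, g, beta2=0.999):
    """Yogi keeps the direction sgn(g^2 - v) but caps the step size at (1 - beta2) * g^2."""
    return v + (1 - beta2) * g ** 2 * np.sign(g ** 2 - v)
```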

References

https://arxiv.org/pdf/1412.6980v9.pdf?

https://zh-v2.d2l.ai/d2l-zh-pytorch.pdf

https://blog.csdn.net/weixin_35344136/article/details/113041592

https://blog.csdn.net/ustbbsy/article/details/106930309

