AdaDelta and Adam Algorithm

AdaDelta

AdaDelta is another variant of AdaGrad. Like RMSProp, it avoids relying too heavily on past gradients by using a leaky average, but in a more involved way. Here is how it works.

First, like RMSProp, we have:

$$S_t=\rho S_{t-1} + (1-\rho)g_t^2$$

but unlike RMSProp, we don't update $W$ directly with $S$. Instead, we have a pair of iterative equations:

$$\begin{cases} M_t = \rho M_{t-1}+(1-\rho)G_t^2\\ G_t = \dfrac{\sqrt{M_{t-1}+\epsilon}}{\sqrt{S_t+\epsilon}}\cdot g_t \end{cases}$$

If we substitute the expression for $M_{t-1}$ into the equation for $G_t$, we get a single equation:

$$G_t=\frac{\sqrt{\rho M_{t-2}+(1-\rho)G_{t-1}^2+\epsilon}}{\sqrt{S_t+\epsilon}}\cdot g_t$$

And we update W with G:

$$W_t = W_{t-1} - G_t$$

The iterative equations can be confusing. Within each step, the sequence of calculation and update is:

  1. the gradient ($g_t=\nabla W_t$)
  2. $S_t$
  3. $G_t$
  4. $M_t$
  5. $W_t$

In AdaDelta there is no learning rate $\eta$, which means we don't need to set that hyper-parameter ourselves.
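To make the order above concrete, here is a minimal NumPy sketch of one AdaDelta step; the function name `adadelta_step` and the toy usage are my own illustrative choices, not from any particular library.

```python
import numpy as np

def adadelta_step(W, grad, S, M, rho=0.9, eps=1e-6):
    """One AdaDelta update, following the order: g_t, S_t, G_t, M_t, W_t."""
    g = grad(W)                                    # 1. gradient g_t
    S = rho * S + (1 - rho) * g ** 2               # 2. S_t: leaky average of g_t^2
    G = np.sqrt(M + eps) / np.sqrt(S + eps) * g    # 3. G_t: rescaled step, uses M_{t-1}
    M = rho * M + (1 - rho) * G ** 2               # 4. M_t: leaky average of G_t^2
    W = W - G                                      # 5. W_t: note there is no learning rate
    return W, S, M

# toy usage: minimize f(W) = ||W||^2, so grad(W) = 2W
W = np.array([1.0, -2.0])
S = np.zeros_like(W)
M = np.zeros_like(W)
for _ in range(1000):
    W, S, M = adadelta_step(W, lambda w: 2 * w, S, M)
print(W)  # gradually approaches [0, 0]; AdaDelta's step size ramps up slowly from eps
```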

Adam

Adam is an algorithm that combines RMSProp with the momentum method:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$

$$v_t = \beta_2 v_{t-1}+(1-\beta_2)g_t^2$$

It is suggested to set $\beta_1$ to 0.9 and $\beta_2$ to 0.999. Because $m_t$ and $v_t$ start from zero, they are biased toward zero in the early steps; to correct the discrepancy between the expectation of $v_t$ and that of $g_t^2$, we correct $v_t$:

$$\hat{v}_t=\frac{v_t}{1-\beta_2^t}$$
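To see the effect, take the first step with the usual zero initialization $v_0=0$: we get $v_1=(1-\beta_2)g_1^2$, so

$$\hat{v}_1=\frac{(1-\beta_2)g_1^2}{1-\beta_2^1}=g_1^2$$

which is on the right scale, whereas the uncorrected $v_1$ is shrunk by a factor of $1-\beta_2=0.001$.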

Likewise, we correct $m_t$ by:

$$\hat{m}_t=\frac{m_t}{1-\beta_1^t}$$

And now we can update $W$ by:

$$W_t=W_{t-1}-\eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$

For a derivation of this bias correction, the authors give a proof in the original paper: https://arxiv.org/pdf/1412.6980v9.pdf?
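As a concrete illustration of the whole update, here is a minimal NumPy sketch of one Adam step with bias correction; the function name `adam_step` and the toy usage are illustrative choices, not a reference implementation.

```python
import numpy as np

def adam_step(W, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts from 1), with bias correction."""
    m = beta1 * m + (1 - beta1) * g            # momentum-style first moment m_t
    v = beta2 * v + (1 - beta2) * g ** 2       # RMSProp-style second moment v_t
    m_hat = m / (1 - beta1 ** t)               # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)               # bias-corrected v_t
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# toy usage: minimize f(W) = ||W||^2, whose gradient is 2W
W = np.array([1.0, -2.0])
m = np.zeros_like(W)
v = np.zeros_like(W)
for t in range(1, 501):
    W, m, v = adam_step(W, 2 * W, m, v, t, lr=0.05)
print(W)  # should end up near [0, 0]
```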

Yogi

We can rewrite the $v_t$ formula in Adam as:

$$v_t=v_{t-1}+(1-\beta_2)(g_t^2-v_{t-1})$$

If the gradient is very large, the increment $(1-\beta_2)(g_t^2-v_{t-1})$ can blow up, and the Adam algorithm may fail to converge. To fix this problem, the Yogi algorithm changes the update to:

$$v_t=v_{t-1}+(1-\beta_2)g_t^2\cdot\operatorname{sgn}(g_t^2-v_{t-1})$$
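To make the difference from Adam concrete, here is a minimal NumPy sketch of the two second-moment updates side by side; the function names are my own, and the rest of the step (bias correction and the $W$ update) is unchanged from Adam.

```python
import numpy as np

def adam_v_update(v, g, beta2=0.999):
    """Adam's second-moment update in incremental form: the step toward g^2
    is proportional to (g^2 - v), which can be huge when g^2 spikes."""
    return v + (1 - beta2) * (g ** 2 - v)

def yogi_v_update(v, g, beta2=0.999):
    """Yogi keeps the direction sgn(g^2 - v) but caps the step size at (1 - beta2) * g^2."""
    return v + (1 - beta2) * g ** 2 * np.sign(g ** 2 - v)
```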

References

https://arxiv.org/pdf/1412.6980v9.pdf?

https://zh-v2.d2l.ai/d2l-zh-pytorch.pdf

https://blog.csdn.net/weixin_35344136/article/details/113041592

https://blog.csdn.net/ustbbsy/article/details/106930309

