A Summary of Optimization Algorithms
Frey July 04, 2019 [Algorithms] #NeuralNetworks #AdaGrad #RMSProp #Adam

AdaGrad (Adaptive Gradient)
$$ g_t = \nabla_\theta J(\theta_{t-1}) $$
$$ \theta_{t} = \theta_{t-1}-\alpha\cdot g_t/\sqrt{\sum_{i=1}^tg_i^2+\varepsilon} \ \ \ \ \ (\alpha=0.01,\varepsilon=10^{-8}) $$
AdaGrad divides the step by the accumulated squared gradients, so the effective learning rate adapts as the gradients change.
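As a quick illustration of the update rule above, here is a minimal NumPy sketch; the toy quadratic objective and the variable names are assumptions added for this example, not part of the original post.

```python
import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.01, eps=1e-8):
    # Accumulate squared gradients over all steps, then scale the update.
    accum = accum + grad ** 2
    theta = theta - alpha * grad / np.sqrt(accum + eps)
    return theta, accum

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)
for t in range(1, 101):
    grad = theta
    theta, accum = adagrad_step(theta, grad, accum)
```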
RMSProp
$$ g_t = \nabla_\theta J(\theta_{t-1}) $$
$$ v_t= \gamma v_{t-1}+(1-\gamma)g_t^2 \ \ \ \ \ (\gamma=0.9) $$
$$ \theta_{t} = \theta_{t-1}-\alpha\cdot g_t/(\sqrt{v_t} + \varepsilon) \ \ \ \ \ (\alpha=0.001,\varepsilon=10^{-8}) $$
RMSProp adjusts the learning rate with an exponential moving average of the squared gradients.
This overcomes AdaGrad's problem of the effective step size decaying too sharply as the squared gradients accumulate.
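A corresponding NumPy sketch of the RMSProp update, using the same toy quadratic as above (the objective and names are assumptions for illustration):

```python
import numpy as np

def rmsprop_step(theta, grad, v, alpha=0.001, gamma=0.9, eps=1e-8):
    # Exponential moving average of squared gradients instead of a full sum.
    v = gamma * v + (1 - gamma) * grad ** 2
    theta = theta - alpha * grad / (np.sqrt(v) + eps)
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for t in range(1, 101):
    grad = theta
    theta, v = rmsprop_step(theta, grad, v)
```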
The Adam Optimizer
$$ g_t = \nabla_\theta J(\theta_{t-1}) $$
$$ m_t = \beta_1m_{t-1}+(1-\beta_1)g_t \ \ \ \ \ (\beta_1=0.9,m_0=0) $$
$$ v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2 \ \ \ \ \ (\beta_2=0.999,v_0=0) $$
$$ \hat m_t = m_t/(1-\beta_1^t) $$
$$ \hat v_t = v_t/(1-\beta_2^t) $$
$$ \theta_t = \theta_{t-1}-\alpha \cdot \hat m_t/(\sqrt{\hat v_t} + \varepsilon) \ \ \ \ \ (\alpha=0.001,\varepsilon=10^{-8}) $$
Adam combines the first moment estimate of the gradient (its mean) with the second moment estimate (its uncentered variance) to compute the update step.
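A minimal NumPy sketch of one Adam step with bias correction, again on the toy quadratic (the objective and names are assumptions for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates with bias correction (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    grad = theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```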
Below are simulations of several gradient descent variants: stochastic gradient descent (SGD), mini-batch gradient descent (MSGD), and batch gradient descent (BGD).
All three compute the parameter gradients from the loss function and update the parameters; they differ only in how they use the training set.
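The three snippets below are runnable NumPy reconstructions of the original simulation code, which did not survive intact. The learning rates (0.5 for SGD, 0.3 for the other two), the batch size of 50, the 1000 passes for BGD, and the (100, 2) data shape come from the original comments; the toy linear-regression data, the variable names (x, y, w, b), the iteration budgets, and the early-stopping thresholds are assumptions added to make the sketches self-contained.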
```python
# Stochastic gradient descent (SGD): one randomly picked sample per update
import numpy as np

np.random.seed(0)
x = np.random.rand(100, 2)              # 100,2
w_true = np.array([[2.0], [-3.0]])      # assumed "true" weights for the toy data
y = x @ w_true + 0.5                    # assumed targets

lr = 0.5
w = np.random.rand(2, 1)
b = np.random.rand(1)

for epoch in range(100):                # assumed iteration budget
    i = np.random.randint(0, 100)       # a single random sample
    xi, yi = x[i:i+1], y[i:i+1]         # 1,2 and 1,1
    err = xi @ w + b - yi
    dw = xi.T @ err                     # 2,1
    db = err.sum()
    loss = 0.5 * (err ** 2).sum()
    w = w - lr * dw
    b = b - lr * db
    if loss < 1e-4:                     # assumed early-stopping condition
        break
    #print(loss)
```
```python
# Mini-batch gradient descent (MSGD): a random batch of samples per update
import numpy as np

np.random.seed(0)
x = np.random.rand(100, 2)              # 100,2
w_true = np.array([[2.0], [-3.0]])      # assumed "true" weights for the toy data
y = x @ w_true + 0.5                    # assumed targets

lr = 0.3
batch = 50
w = np.random.rand(2, 1)
b = np.random.rand(1)

for epoch in range(100):                # assumed iteration budget
    dw, db, loss = 0, 0, 0              # accumulators over the batch
    idx = np.random.choice(100, batch, replace=False)
    for i in idx:
        xi, yi = x[i:i+1], y[i:i+1]     # 1,2 and 1,1
        err = xi @ w + b - yi
        dw += xi.T @ err
        db += err.sum()
        loss += 0.5 * (err ** 2).sum()
    w = w - lr * dw / batch             # average the accumulated gradients
    b = b - lr * db / batch
    if loss / batch < 1e-4:             # assumed early-stopping condition
        break
    #print(loss)
```
```python
# Batch gradient descent (BGD): the whole training set per update
import numpy as np

np.random.seed(0)
x = np.random.rand(100, 2)              # 100,2
w_true = np.array([[2.0], [-3.0]])      # assumed "true" weights for the toy data
y = x @ w_true + 0.5                    # assumed targets

lr = 0.3
#ww = np.random.rand(2,5)
#bb = np.random.rand(2,1)
w = np.random.rand(2, 1)
b = np.random.rand(1)

n = 100
for epoch in range(1000):               # 1000 full passes over the data
    dw, db, loss = 0, 0, 0              # accumulators over the whole set
    for i in range(n):
        xi, yi = x[i:i+1], y[i:i+1]     # 1,2 and 1,1
        err = xi @ w + b - yi
        dw += xi.T @ err
        db += err.sum()
        loss += 0.5 * (err ** 2).sum()
    w = w - lr * dw / n                 # average the accumulated gradients
    b = b - lr * db / n
    if loss / n < 1e-4:                 # assumed early-stopping condition
        break
    #print(loss)
```