Mini-Batch Gradient Descent
Running gradient descent on the entire dataset is slow when the dataset is large. Instead, we can take gradient descent steps on portions of the training set: split the training set into multiple mini-batches and perform one gradient descent step per mini-batch. The figure contrasts Batch vs. Mini-Batch, i.e., the original algorithm that processes the whole training set per step versus the one that processes a single mini-batch per step. The mini-batches are denoted $X^{\{1\}}, \dots, X^{\{t\}}$, with matching labels $Y^{\{1\}}, \dots, Y^{\{t\}}$.
How mini-batch gradient descent is implemented:
As the figure shows, a single for-loop iterates over all the mini-batches, and each step computes only on $X^{\{t\}}$ and $Y^{\{t\}}$; one full pass of this loop is called an epoch. To run multiple epochs, simply wrap it in an outer for-loop over epochs, as in the sketch below.
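A hedged sketch of this double loop (`num_epochs`, `T`, the `*_batch` lists, and the helper functions are placeholder names, not from the course code):

```python
for epoch in range(num_epochs):            # optional outer loop: one iteration = one epoch
    for t in range(1, T + 1):              # inner loop: one pass over all T mini-batches
        AL, caches = forward_prop(X_batch[t], parameters)   # uses only X^{t} ...
        cost = compute_cost(AL, Y_batch[t])                 # ... and Y^{t}
        grads = backward_prop(AL, Y_batch[t], caches)
        parameters = update(parameters, grads, learning_rate)
```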
Understanding Mini-Batch Gradient Descent
We therefore need to choose the mini-batch size. Three cases arise: a size of $m$ (the whole training set) gives batch gradient descent, a size of 1 gives stochastic gradient descent, and any size in between gives mini-batch gradient descent.
The behavior of gradient descent under these three choices:
- Stochastic gradient descent: loses the speed advantage of vectorization, and the gradient oscillates wildly back and forth;
- Mini-batch: learns fastest, since it keeps vectorization without having to process the entire dataset per step;
- Batch: each iteration takes too long, since it must process the entire dataset.
So how do we choose a suitable mini-batch size between 1 and $m$?
- If the dataset is small, say $m \le 2000$, just use batch gradient descent;
- If the dataset is large, typical sizes are 64, 128, 256, 512, 1024, etc. (powers of 2; some experimentation is usually needed to find a good value);
- Make sure one mini-batch fits in the available CPU/GPU memory.
Exponentially Weighted Averages
First, recall the ordinary average: if we have 100 days of temperature readings and want their mean, we compute it directly as $\bar\theta = \frac{1}{100}\sum_{i=1}^{100}\theta_i$.
What is an exponentially weighted average, and how is it computed? Using the temperature example below with $v_t = \beta v_{t-1} + (1-\beta)\theta_t$ (and $v_0 = 0$), the resulting red line is much smoother than the original blue dots.
Changing $\beta$ changes the exponentially weighted average we obtain:
The larger $\beta$ is, the smoother the curve, but the less responsive it is to changes in temperature; the smaller $\beta$ is, the more closely the curve tracks the temperature, but at the cost of much more noise and sensitivity to outliers. A well-chosen $\beta$ therefore affects the algorithm's performance.
Understanding Exponentially Weighted Averages
The exponentially weighted average is a key step in several optimization algorithms. In essence, it multiplies each day's temperature by an exponentially decaying weight and sums the results. The figure below visualizes this:
As a rough rule of thumb, $v_t$ averages over about $\frac{1}{1-\beta}$ days: $\beta = 0.9$ corresponds to averaging roughly the last 10 days of temperature, and $\beta = 0.98$ to roughly the last 50 days.
The actual implementation:
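A minimal sketch of that running computation, assuming `temperatures` is a list of daily readings and `beta` is the weighting parameter (both hypothetical names):

```python
v = 0.0                                # v_0 = 0
for theta in temperatures:             # one update per day
    v = beta * v + (1 - beta) * theta  # overwrite v in place: O(1) memory
```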
As you can see, the implementation is simple and uses very little memory, so it is efficient. It is not the most accurate estimate: computing the true average of a window of recent temperatures usually gives a better one, but that requires storing all the recent readings, which costs much more memory. That is why deep learning typically uses exponentially weighted averages instead.
Bias Correction in Exponentially Weighted Averages
Bias correction is a trick that makes the averages computed by exponential weighting more accurate. In the figure, $\beta = 0.9$ gives the red line; with $\beta = 0.98$, directly applying the formula above, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$, actually produces the purple line, whereas the green line is what we really want. The purple curve starts far too low, and that is the situation we need to fix.
To see why, take the original formula with $v_0 = 0$. If the first day's temperature is 40, then $v_1 = 0.98 \times 0 + 0.02 \times 40 = 0.8$, which is far too small to reflect reality, and the subsequent values stay biased low as well. To get better estimates, especially early on, we report $\frac{v_t}{1-\beta^t}$ instead of $v_t$. When $t$ is small, the denominator is well below 1, which scales $v_t$ up and removes the bias; when $t$ is large, the denominator approaches 1 and the formula reduces to the original.
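A tiny numeric check, using the day-1 temperature of 40 from the example above plus two made-up follow-up readings:

```python
beta = 0.98
thetas = [40.0, 39.0, 41.0]   # day 1 from the text; days 2-3 are hypothetical
v = 0.0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    print(t, round(v, 3), round(v / (1 - beta ** t), 3))  # raw vs. bias-corrected
# the raw v starts at 0.8, far below 40; the corrected value tracks ~40 from day 1
```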
Gradient Descent with Momentum
Momentum is an optimization algorithm that almost always runs faster than standard gradient descent. The idea is simple: compute an exponentially weighted average of the gradients, and use that average to update the weights.
Suppose we run gradient descent on the cost function $J$ in the figure, where the red dot is the optimum and the blue dot is our starting point. Plain gradient descent follows the blue path, whose frequent up-and-down oscillations clearly slow it down. To avoid even larger oscillations (the purple path), we are also forced to use a small learning rate.
We can instead use gradient descent with momentum. On each iteration, after computing $dW$ and $db$ on the current mini-batch:
$$v_{dW} = \beta v_{dW} + (1-\beta)dW, \qquad v_{db} = \beta v_{db} + (1-\beta)db$$
$$W := W - \alpha v_{dW}, \qquad b := b - \alpha v_{db}$$
We can picture the whole process as a ball accelerating as it rolls downhill: taking the $v_{db}$ update as an example, $db$ plays the role of acceleration, the previous $v_{db}$ is the velocity, and $\beta$ acts like friction, preventing the ball from accelerating without bound. The ball gains momentum, hence the name gradient descent with momentum.
From the algorithm above, the hyperparameters are the learning rate $\alpha$ and the weighting parameter $\beta$; the most common value of $\beta$ is 0.9 (in gradient descent this amounts to averaging the gradients of roughly the last ten iterations, and 0.9 is a robust choice). In practice, bias correction is usually not needed when using gradient descent or Momentum. The velocities $v_{dW}$ and $v_{db}$ are initialized to zero vectors. Also, $v_{dW} = \beta v_{dW} + (1-\beta)dW$ is sometimes written in an alternative form, $v_{dW} = \beta v_{dW} + dW$, i.e., with the $1-\beta$ factor dropped (which implicitly rescales the effective learning rate). In most cases, though, the original form is more convenient and works a bit better.
RMSprop - Root Mean Square Prop
RMSprop is another optimization algorithm that can speed up gradient descent. As we saw in the previous section, plain gradient descent makes progress along the horizontal axis while oscillating back and forth along the vertical axis. We would therefore like the descent to move slowly in the vertical direction and quickly in the horizontal one, as the green line shows.
The update differs from momentum. Concretely, on each iteration, after computing $dW$ and $db$ on the current mini-batch:
$$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)(dW)^2, \qquad S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2$$
$$W := W - \alpha \frac{dW}{\sqrt{S_{dW}}}, \qquad b := b - \alpha \frac{db}{\sqrt{S_{db}}}$$
Note that in the averaging step the second term multiplies $(dW)^2$, the element-wise square (hence the "Square" in the name), and the parameter update subtracts the learning rate times $\frac{dW}{\sqrt{S_{dW}}}$, hence "Root Mean Square". With RMSprop updates, the descent oscillates less along the vertical axis while continuing to advance along the horizontal axis. We can also choose a larger learning rate without blowing up the vertical direction.
Note that because the next section combines Momentum and RMSprop, we write RMSprop's hyperparameter as $\beta_2$ to distinguish the two otherwise identically named $\beta$s. Also, to avoid dividing by zero when $\sqrt{S_{dW}}$ is close to 0, we add a tiny $\epsilon$ to the denominator for numerical stability; $\epsilon = 10^{-8}$ is a typical choice. A sketch of one step follows.
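A minimal sketch of one RMSprop step for a single parameter matrix (the function and its default values are illustrative, not from the course code):

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, beta2=0.999, learning_rate=0.01, eps=1e-8):
    """One RMSprop update for a single parameter matrix W."""
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2         # EWA of the squared gradient
    W = W - learning_rate * dW / (np.sqrt(S_dW) + eps)  # damped where gradients are large
    return W, S_dW
```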
To summarize: like Momentum, RMSprop damps the oscillations of gradient descent, including mini-batch gradient descent, and it lets us use a larger learning rate $\alpha$, thereby speeding up learning.
A bit of trivia: RMSprop was first proposed not in an academic paper but by Geoffrey Hinton in his Coursera course.
The Adam Optimization Algorithm
RMSprop and Adam are optimization algorithms that work well across many different neural network architectures; Adam essentially combines Momentum and RMSprop. On iteration $t$, after computing $dW$ and $db$ on the current mini-batch:
$$v_{dW} = \beta_1 v_{dW} + (1-\beta_1)dW, \qquad S_{dW} = \beta_2 S_{dW} + (1-\beta_2)(dW)^2$$
$$v^{corrected}_{dW} = \frac{v_{dW}}{1-\beta_1^t}, \qquad S^{corrected}_{dW} = \frac{S_{dW}}{1-\beta_2^t}$$
$$W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}} + \epsilon}$$
(and analogously for $b$).
To walk through the algorithm: first initialize the four accumulators ($v_{dW}, v_{db}, S_{dW}, S_{db}$) to zero. On each iteration, compute the gradients $dW$ and $db$ on the current mini-batch, then compute the quantities Momentum and RMSprop each need (note that the two exponentially weighted averages use different $\beta$s). Next, apply bias correction to both, and finally combine the Momentum and RMSprop terms in the update formula. This yields the Adam optimization algorithm.
The algorithm has several hyperparameters: the learning rate $\alpha$; $\beta_1$, commonly 0.9; $\beta_2$, for which 0.999 is the recommended default; and $\epsilon$, typically $10^{-8}$. For the last three we usually just use the default values; the one that genuinely needs tuning is the learning rate $\alpha$.
Why the name Adam? It stands for Adaptive Moment Estimation: $\beta_1$ computes the average of the derivatives $dW$, called the first moment, while $\beta_2$ computes the exponentially weighted average of their squares, called the second moment.
Learning Rate Decay
One way to speed up gradient descent is to slowly reduce the learning rate over time, known as learning rate decay. In the figure, the blue path uses mini-batches with a fixed learning rate, so in the end it keeps wandering around the optimum without ever settling into the optimal region; the green path instead starts with a larger learning rate and switches to smaller ones as it approaches convergence, letting it home in on the optimum.
Next, how to implement learning rate decay.
Method: first, recall what an epoch means: one epoch is one full pass over the entire training set, as visualized below:
The learning rate decay formula is:
$$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}} \cdot \alpha_0$$
This introduces two hyperparameters to tune, $\alpha_0$ and the decay rate decay_rate, and it makes the learning rate shrink steadily as the epoch number grows; a quick numeric check is shown below.
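A quick numeric check of the formula; $\alpha_0 = 0.2$ and decay_rate = 1 are illustrative values, not prescribed ones:

```python
alpha0, decay_rate = 0.2, 1.0
for epoch_num in range(1, 5):
    alpha = alpha0 / (1 + decay_rate * epoch_num)
    print(epoch_num, round(alpha, 4))   # 0.1, 0.0667, 0.05, 0.04
```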
Other methods exist, such as exponential decay: $\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$.
Or three further forms of decay: $\alpha = \frac{k}{\sqrt{\text{epoch\_num}}}\alpha_0$, $\alpha = \frac{k}{\sqrt{t}}\alpha_0$ (where $t$ is the mini-batch number), or a discrete staircase that drops the rate in steps at fixed intervals.
Some practitioners also tune the learning rate manually. Next week we will look at methods for searching efficiently over large numbers of hyperparameters.
The Problem of Local Optima
In the early days of deep learning research, people worried that optimization algorithms would get trapped in bad local optima. As the theory of deep learning has matured, our understanding of local optima has deepened.
When we train a neural network, a point of zero gradient is usually not a local optimum like the ones in this figure. In fact, zero-gradient points of the cost function are usually saddle points.
In a high-dimensional cost surface, a zero-gradient point can curve upward in some directions and downward in others. For such a point to be a local optimum, the function would have to curve the same way in every direction; in, say, a 20,000-dimensional space that is extremely unlikely, so a zero-gradient point is far more likely to be a saddle point, as in the right-hand figure. The lesson from the history of deep learning is that intuitions gained in low-dimensional spaces do not transfer to high-dimensional ones (local optima are common in low dimensions, but saddle points dominate in high dimensions).
Problems of plateaus: a plateau is a region where the derivative stays close to 0 for a long time. On such a flat surface it can take a very long time to crawl across to the edge of the plateau, and only a random perturbation to one side or the other eventually carries us off it. So while we are unlikely to get stuck in a local optimum in high-dimensional spaces, we may well need a long time to traverse a plateau.
This Week's Assignment
Import the packages:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
```
1- Gradient Descent
A simple exercise: implement `update_parameters_with_gd`, one step of gradient descent over all layers.

```python
def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(parameters) // 2  # number of layers in the neural networks

    # Update rule for each parameter; l counts from 0, so the keys use l+1
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]

    return parameters
```
Next, two code samples contrasting (Batch) Gradient Descent with Stochastic Gradient Descent:
(Batch) Gradient Descent:
```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)
```

Stochastic Gradient Descent:
```python
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation on a single example
        a, caches = forward_propagation(X[:, j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:, j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)
```
As you can see, implementing stochastic gradient descent takes three for-loops in total:
- Over the number of iterations
- Over the m training examples
- Over the layers (to update all parameters, from (W[1],b[1]) to (W[L],b[L]))
What we should remember:
- The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
- You have to tune a learning rate hyperparameter α.
- With a well-tuned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
2- Mini-Batch Gradient Descent
Implementing mini-batch gradient descent takes two steps: randomly shuffle the training examples, then partition them into mini-batches.
The function to implement is `random_mini_batches(X, Y, mini_batch_size=64, seed=0)`; a sketch follows.
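Only the signature is shown above, so here is a sketch of the shuffle-and-partition logic; it assumes `X` has shape (n_x, m) and `Y` has shape (1, m):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Return a list of random (mini_batch_X, mini_batch_Y) tuples."""
    np.random.seed(seed)
    m = X.shape[1]              # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y) with a common permutation so pairs stay aligned
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y), minus the end case
    num_complete_minibatches = m // mini_batch_size
    for k in range(num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    # Handle the end case (a final mini-batch smaller than mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    return mini_batches
```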
What we should remember:
- Shuffling and Partitioning are the two steps required to build mini-batches
- Powers of two are often chosen as the mini-batch size, e.g., 16, 32, 64, 128.
3- Momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will “oscillate” toward convergence. Using momentum can reduce these oscillations.
Momentum takes into account the past gradients to smooth out the update. We will store the ‘direction’ of the previous gradients in the variable v. Formally, this will be the exponentially weighted average of the gradient on previous steps.
`initialize_velocity(parameters)` just initializes the velocity dictionary `v` to zeros (sketch below):
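A sketch of that initializer, relying on the `np` import from the setup cell:

```python
def initialize_velocity(parameters):
    """Initialize the velocity v as zero arrays shaped like each gradient."""
    L = len(parameters) // 2
    v = {}
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
    return v
```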
Next, implement the Momentum update itself:

```python
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2

    for l in range(L):
        # Compute velocities as exponentially weighted averages of the gradients
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1 - beta) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1 - beta) * grads["db" + str(l+1)]
        # Update parameters using the velocities
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v["db" + str(l+1)]

    return parameters, v
```
Note:
- The velocity is initialized with zeros. So the algorithm will take a few iterations to “build up” velocity and start to take bigger steps.
- If β = 0, then this just becomes standard gradient descent without momentum.
How do we choose β?
- The larger the momentum β is, the smoother the update because the more we take the past gradients into account. But if β is too big, it could also smooth out the updates too much.
- Common values for β range from 0.8 to 0.999. If you don’t feel inclined to tune this, β=0.9 is often a reasonable default.
- Tuning the optimal β for your model might need trying several values to see what works best in terms of reducing the value of the cost function J.
What we should remember:
- Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
- You have to tune a momentum hyperparameter β and a learning rate α.
4- Adam
Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp (described in lecture) and Momentum.
How does Adam work?
- It calculates an exponentially weighted average of past gradients, and stores it in variables v (before bias correction) and v_corrected (with bias correction).
- It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables s (before bias correction) and s_corrected (with bias correction).
- It updates parameters in a direction based on combining information from “1” and “2”.
`initialize_adam(parameters)` initializes the Adam accumulators `v` and `s` to zeros (sketch below):
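A sketch of that initializer, mirroring `initialize_velocity` above:

```python
def initialize_adam(parameters):
    """Initialize v (first moment) and s (second moment) as zero arrays."""
    L = len(parameters) // 2
    v = {}
    s = {}
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
        s["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        s["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
    return v, s
```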
Next, update the parameters:

```python
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- Adam update counter (starts at 1), used for bias correction
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2
    v_corrected = {}
    s_corrected = {}

    for l in range(L):
        # Moving average of the gradients (Momentum part)
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads["db" + str(l+1)]
        # Bias correction for the first moment
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1 ** t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1 ** t)

        # Moving average of the squared gradients (RMSprop part)
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * grads["dW" + str(l+1)] ** 2
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * grads["db" + str(l+1)] ** 2
        # Bias correction for the second moment
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2 ** t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2 ** t)

        # Combine both to update W and b
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)

    return parameters, v, s
```
5- Model with Different Optimization Algorithms
Let's use the following "moons" dataset to test the different optimization methods. (The dataset is named "moons" because the data from each of the two classes looks a bit like a crescent-shaped moon.)

```python
train_X, train_Y = load_dataset()
```
The model, which can be run with any of the three optimizers:

```python
def model(X, Y, layers_dims, optimizer, learning_rate=0.007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(layers_dims)  # number of layers in the neural networks
    costs = []
    t = 0       # Adam update counter
    seed = 10

    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass  # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    # Optimization loop
    for i in range(num_epochs):
        # Define the random minibatches. We increment the seed to reshuffle the dataset differently after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:
            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch
            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)
            # Compute cost
            cost = compute_cost(a3, minibatch_Y)
            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)
            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1  # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t,
                                                               learning_rate, beta1, beta2, epsilon)

        # Print the cost every 1000 epochs
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
```
5.1- Mini-batch Gradient descent
Train the 3-layer model with `optimizer="gd"`; a sketch of the cell follows.
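A plausible reconstruction of that training cell, using the helpers imported from opt_utils (the plot title and axis limits are illustrative assumptions):

```python
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="gd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
```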
Result:
5.2- Mini-batch gradient descent with momentum
The training cell repeats the one above with `optimizer="momentum"` (and `beta=0.9`).
Result:
Because this example is relatively simple, the gains from using momentum are small; but for more complex problems you might see bigger gains.
5.3- Mini-batch with Adam model
The training cell repeats the one above with `optimizer="adam"`.
Result:
5.4- Summary
Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.
Adam on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you’ve seen that Adam converges a lot faster.
Some advantages of Adam include:
- Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
- Usually works well even with little tuning of hyperparameters (except α)
Link to the Adam paper: https://arxiv.org/pdf/1412.6980.pdf