Week 5 - Practical Aspects of Deep Learning

Train / Dev / Test Sets

When building and training deep neural networks in practice, we have to choose a number of hyperparameters, such as the ones below:
image.png

In practice, splitting the data into training, dev (validation), and test sets lets us iterate on the network much faster, and also lets us measure the algorithm's bias and variance more effectively, which in turn helps us pick the right methods to improve it. The figure shows that with relatively little data, typical splits are 70%/30% or 60%/20%/20%; in the big-data regime, the dev and test sets usually take only a tiny fraction.

image.png

Of course, we sometimes face the situation where the training set and the test set come from mismatched distributions.
image.png

In the example above, the training images are scraped from web pages and tend to be high-resolution, while the dev and test images are photos taken by users themselves, so the distributions are clearly different. The rule to follow in this situation is: make sure the dev and test sets come from the same distribution.

In addition, it is also acceptable in practice to have no test set at all and use only a training set and a dev set (which some people will then call the training set and the test set).

Bias/Variance

We can use the errors on the training set and the dev set to judge whether the algorithm's bias and variance are high or low, and then, depending on the case (high bias and high variance, low bias and high variance, and so on), decide what to do next to improve the algorithm.
image.png

The bias/variance diagnosis for the cat classifier below rests on two assumptions: 1. the base error, i.e. the optimal (Bayes) error, is small; 2. the train and dev sets come from the same distribution. Under those assumptions, we first look at the training error: if it is large, the model has high bias, otherwise low bias. We then look at the dev error: if it is much larger than the training error, the model has high variance, otherwise low variance.
image.png

Next, Professor Ng gave a concrete example of a classifier with both high bias and high variance:
image.png

Because this classifier is essentially a linear fit, it underfits the data, which means high bias; at the same time it overfits in places (the two wiggly parts in the figure), so it also has high variance. In high-dimensional classification this combination is quite common: some regions have high bias while others have high variance.

Basic Recipe for Machine Learning

In general, we first ask: is there high bias (training set performance)? If so, the options are to train a bigger or deeper network, train longer, or change the network architecture, until the model can at least fit the training data.

Next, we ask: is there high variance (dev set performance)? If so, the options are to get more data, apply regularization, or change the network architecture.

Once both bias and variance are reasonably low, the job is largely done. The "bias-variance trade-off" used to receive a lot of attention, but in modern deep learning it matters less, because we can usually reduce bias without hurting variance (and reduce variance, e.g. with regularization and more data, without hurting bias).

Regularization

When the model overfits, i.e. has high variance, the first remedy to reach for is regularization. The other option is to get more data, but that is often hard to do, whereas regularization can usually curb the overfitting.

Taking logistic regression as an example, the figure shows how the L2 and L1 regularization terms are written, where lambda is the regularization parameter; L2 regularization is used far more often than L1.
image.png
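
Written out (following the lecture slide; the exact constant in front of the L1 term varies between references), the L2-regularized logistic regression cost is

$J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\lVert w\rVert_2^2$, with $\lVert w\rVert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^{T}w$,

while L1 regularization uses $\lVert w\rVert_1 = \sum_j \lvert w_j\rvert$ in place of the squared norm, which tends to make $w$ sparse.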

For a neural network, the matrix norm used is called the Frobenius norm rather than the L2 norm; it is computed as the sum of the squares of all the entries of a weight matrix.
image.png

Also note that because the cost function now carries an extra regularization term, the computation of dW changes as well. The formula is shown in the figure; with the regularization term added, each update pushes the weights W towards smaller values, which is why this is called "weight decay".
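
To see where the name comes from (this just expands the regularized cost described above), the cost with the Frobenius penalty is

$J = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\lVert W^{[l]}\rVert_F^2$, where $\lVert W^{[l]}\rVert_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\big(W_{ij}^{[l]}\big)^2$,

so the gradient picks up an extra term, $dW^{[l]} = (\text{backprop term}) + \frac{\lambda}{m}W^{[l]}$, and the update becomes

$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \Big(1 - \frac{\alpha\lambda}{m}\Big)W^{[l]} - \alpha\,(\text{backprop term})$,

i.e. every step first shrinks $W^{[l]}$ by the factor $\big(1 - \frac{\alpha\lambda}{m}\big)$, hence "weight decay".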

Why Regularization Reduces Overfitting

Professor Ng gave two fairly intuitive explanations here.
image.png

As the figure shows, if the regularization parameter lambda is set large enough, W is pushed close to 0, so many hidden units have very little effect; the network effectively becomes much simpler, which reduces overfitting.

image.png

Similarly, as shown above, a larger lambda makes W smaller and hence Z smaller. When Z stays in a small range, the tanh activation is approximately linear, so the whole network behaves almost like a linear network, which again reduces overfitting.
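
A quick numerical illustration of that last point (my own snippet, not from the course): near 0, tanh is almost the identity, so a heavily regularized layer whose Z stays small behaves nearly linearly.

import numpy as np

z = np.array([-0.5, -0.1, -0.01, 0.01, 0.1, 0.5])
print(np.tanh(z))   # [-0.4621 -0.0997 -0.01    0.01    0.0997  0.4621]: close to z itself for small |z|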

Dropout Regularization

This part also draws on the summary at https://blog.csdn.net/u012328159/article/details/80210363

The most common tool against overfitting is L2 regularization, i.e. adding an L2 penalty to the cost function. Dropout regularization was proposed by Srivastava et al. in 2014: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". The idea is blunt and simple: in every layer of the network, randomly drop some of the units, as shown below:
image.png

To explain the implementation, Professor Ng used inverted dropout as the example. The key parameter is keep_prob, the keep probability (and 1 - keep_prob is the drop probability). For instance, keep_prob = 0.8 for some layer means that 80% of that layer's units are kept at random (i.e. 20% are dropped). The standard implementation technique is called inverted dropout; for layer 3 it looks like this:

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob   # boolean mask: True with probability keep_prob
a3 = np.multiply(a3, d3)                                     # shut down the dropped units
a3 = a3 / keep_prob                                          # rescale so the expected value of a3 is unchanged
z4 = np.dot(w4, a3) + b4                                     # the next layer uses the thinned-out a3

About the third line (a3 = a3 / keep_prob): since a fraction 1 - keep_prob of the units are zeroed out, the expected value of a3 would shrink by that factor, so we divide a3 by keep_prob to keep its expected value unchanged. That division is the "inverted" part of inverted dropout.

The two animations below illustrate the dropout process (material from Ng's deep learning course):
image.png

image.png

Another important point: dropout is not used at test time.

Understanding Dropout

Why can dropout reduce overfitting? Professor Ng offered two fairly intuitive explanations:

  • Because units are randomly dropped in every layer, the network being trained at any given moment is effectively much smaller than the full network, which partly explains why overfitting is reduced.
  • As in the simple single-layer network shown below, any input feature may be dropped, so the network cannot rely too heavily on any one feature (i.e. put a very large weight on it); it has to spread the weights out and keep them small, which is somewhat similar to L2 regularization and likewise helps reduce overfitting (it shrinks the weights).

In addition, different layers can be given different keep_prob values, somewhat like choosing the regularization parameter lambda per layer.

A short summary: if you worry that some layers are more prone to overfitting than others, you can give those layers a lower keep_prob; the downside is that cross-validation then has to search over more hyperparameters. An alternative is to apply dropout to some layers only. Dropout is used very heavily in computer vision, where we rarely have enough data and overfitting is common, so it is almost always applied there. Remember that dropout is one regularization method among others, and its purpose is to prevent overfitting. One drawback is that with dropout the cost function J is no longer well defined, so a common practice is to first turn dropout off, plot the cost to confirm it decreases monotonically, and only then switch dropout back on.

Other Regularization Methods

  1. Data augmentation

If we cannot get more data, then for images we can generate extra "fake" examples by flipping images horizontally or applying random rotations and crops. This is not as good as collecting a brand-new set of images, since the new examples are partly redundant, but it costs nothing compared with gathering more cat pictures. For digits, we can likewise apply small random rotations and distortions to enlarge the dataset. For example:
image.png

  2. Early stopping

We plot the training error and the dev-set error over the iterations, and use the dev error to stop training early. The resulting W is then not too large, which also reduces overfitting. The main drawback of early stopping is that it couples two tasks we would prefer to handle independently: optimizing the cost function and preventing overfitting. By stopping gradient descent early we also stop optimizing the cost function, so we are using one knob to address both problems at once.

image.png

We could instead try many different values of the regularization parameter lambda, but that means many more training runs and a much larger computational cost; early stopping avoids having to search over lambda. A rough sketch of the procedure is shown below.
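
As a sketch of what early stopping looks like in code (my own illustration, not part of the course; train_one_epoch and dev_error are hypothetical stand-ins for your training step and dev-set evaluation):

import copy

def train_with_early_stopping(params, train_one_epoch, dev_error, max_epochs=500, patience=10):
    # train_one_epoch(params) -> updated params; dev_error(params) -> scalar error on the dev set
    best_err, best_params, bad_epochs = float("inf"), copy.deepcopy(params), 0
    for epoch in range(max_epochs):
        params = train_one_epoch(params)
        err = dev_error(params)
        if err < best_err:                  # dev error improved: remember these weights
            best_err, best_params, bad_epochs = err, copy.deepcopy(params), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:      # no improvement for `patience` epochs: stop early
                break
    return best_params                      # roll back to the weights with the lowest dev error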

Normalizing Inputs

One way to speed up the training of a neural network is to normalize the inputs. Suppose the training set has two input features, so x is two-dimensional; the scatter plot of the dataset is shown below.
image.png

Normalization takes two steps: zero-centering the data and normalizing the variance.

Step 1: zero-center the data, i.e. subtract out the mean. Compute mu, the mean of the x(i):
image.png
Then set x = x - mu, which shifts the training set so that it has zero mean.
image.png

Step 2: normalize the variance.

As the figure above shows, feature x1 has a much larger variance than feature x2. Compute the (elementwise) variance sigma^2 of the zero-centered data:

Since the data were already zero-centered in the previous step, we now divide every example elementwise by the vector sigma^2 (as the slide writes it). The resulting distribution looks like this:
image.png

Finally, note that if we normalize the training features this way, the test set must be normalized with the same mu and sigma, rather than estimating separate mu and sigma on the training and test sets.

Open question: why divide by the variance rather than by the standard deviation? (Most references divide by the standard deviation sigma, which gives every feature unit variance; the sketch below follows that convention.)
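
Here is a minimal NumPy sketch of the two steps (my own code, using the standard-deviation convention; features are rows, examples are columns):

import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    mu = np.mean(X_train, axis=1, keepdims=True)      # per-feature mean, computed on the training set only
    sigma = np.std(X_train, axis=1, keepdims=True)    # per-feature standard deviation
    X_train_norm = (X_train - mu) / (sigma + eps)     # zero mean, unit variance
    X_test_norm = (X_test - mu) / (sigma + eps)       # reuse the SAME mu and sigma for the test set
    return X_train_norm, X_test_norm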

Why Normalize the Inputs?

image.png

After normalization the cost function is much easier to optimize and gradient descent reaches its target in fewer iterations, because all the features are on a comparable scale instead of differing wildly.

Vanishing/Exploding Gradients

In this section Professor Ng used a fairly simple example to explain the exploding- and vanishing-gradient problems.

image.png

For simplicity, every hidden layer has 2 units, the activation function is linear, and b = 0. In this example, if each W is set slightly larger than I (the identity matrix), the prediction y grows exponentially with depth, i.e. it explodes; if each W is slightly smaller than I, the result decays exponentially instead, i.e. it vanishes. This problem was for a long time an obstacle to training very deep networks.
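
A tiny numerical version of this example (my own snippet): 50 layers with linear activations, b = 0, and W slightly above or below the identity.

import numpy as np

x = np.ones((2, 1))
for scale in (1.5, 0.5):                 # W = 1.5*I explodes, W = 0.5*I vanishes
    W = scale * np.eye(2)
    a = x
    for _ in range(50):                  # 50 layers, linear activation, no bias
        a = W @ a
    print(scale, np.linalg.norm(a))      # roughly 9e8 for 1.5 and 1e-15 for 0.5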

Weight Initialization for Deep Networks

There is a partial remedy that helps: choosing the random initialization of the parameters more carefully.

In general, to keep Z from becoming too large or too small, the larger the number of inputs n, the smaller each w_i should be; a sensible choice is to give w_i a variance of 1/n, where n is the number of input features feeding into the unit. In practice we therefore write:
image.png
where n^{l-1} is the number of units feeding into layer l.

For ReLU activations a variance of 2/n usually works better, while the recommended scaling for tanh is shown below:
image.png

We could add one more multiplicative hyperparameter on top of this variance term, but tuning it is usually a low priority. With these initialization schemes we do noticeably reduce the exploding- and vanishing-gradient problems.
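
For reference, the two scalings written out for a single layer (a sketch; the 64 -> 32 layer sizes are arbitrary). The assignment's initialize_parameters_he further below does the same thing for a whole network.

import numpy as np

n_prev, n_curr = 64, 32                                              # arbitrary example sizes
W_he = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)       # ReLU layers: Var(w_i) = 2/n
W_xavier = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)   # tanh layers: Var(w_i) = 1/n
b = np.zeros((n_curr, 1))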

Numerical Approximation of Gradients

Before implementing gradient checking, let us first look at how to numerically approximate the gradient. In this lesson Professor Ng uses the notion of a slope to approximate the derivative and verifies with a concrete numerical example whether the gradient was computed correctly, as shown below:
image.png

When checking gradients, the two-sided formula below is more accurate:
image.png
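
Checking this on a small example, f(theta) = theta^3 at theta = 1, whose true derivative is 3:

f = lambda theta: theta ** 3
theta, eps = 1.0, 0.01
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # approximation error is O(eps^2)
one_sided = (f(theta + eps) - f(theta)) / eps               # approximation error is only O(eps)
print(two_sided, one_sided)                                 # 3.0001 vs 3.0301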

Gradient Checking

Gradient checking can save a lot of time by exposing bugs in backpropagation. Let's see how to use it to debug and verify that backprop is implemented correctly.

Suppose the network contains the parameters W1 and b1, ..., WL and bL. To perform gradient checking, we first reshape all of them into one giant vector: reshape each matrix W into a vector and concatenate everything into a huge vector theta. The cost J is a function of all the W's and b's, so it becomes J(theta). In the same way, we reshape the computed dW1 and db1, ..., dWL and dbL into a new vector dtheta with the same dimension as theta. The question now is: is dtheta really the gradient of the cost J(theta)?
image.png

Next comes the gradient-checking procedure itself, usually just called "grad check". Keep in mind that J is now a function of the parameter vector theta. Whatever the dimension of theta, we loop over every component i of theta and compute dtheta_approx(i) using the two-sided difference formula:
image.png

From the previous section on numerical approximation we know that dtheta_approx(i) should be close to dtheta(i), which is the partial derivative of the cost with respect to theta_i. Doing this for every i gives us two vectors: the approximation dtheta_approx and dtheta itself.
image.png

Use the formula above to compute the normalized distance between the two vectors, and compare it with the reference values on the right to judge whether the gradients are correct. When implementing a neural network we repeatedly run forward prop and backprop; if gradient checking reports a large value, we should suspect a bug and start debugging.

Notes on Implementing Gradient Checking

image.png

A few explanatory notes:

  1. Do not run gradient checking during training; use it only when debugging.
  2. If the check fails, inspect the individual components, such as a particular db[l] or dW[l], to locate the part that is wrong.
  3. Remember regularization: if the cost function includes a regularization term, the gradients used in the check must include the derivative of that term as well.
  4. Do not use gradient checking together with dropout, because dropout randomly deactivates hidden units; turn dropout off while checking.
  5. Run gradient checking at random initialization and, ideally, again after some training: it can happen (though rarely) that the implementation is only correct while W and b are close to 0.

This Week's Assignments

Initialization

A well-chosen initialization can:

- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error

Import packages and the planar dataset:

import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
from init_utils import sigmoid, relu, compute_loss, forward_propagation, backward_propagation
from init_utils import update_parameters, predict, load_dataset, plot_decision_boundary, predict_dec

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# load image dataset: blue/red dots in circles
train_X, train_Y, test_X, test_Y = load_dataset()

1-Neural Network model

def model(X, Y, learning_rate = 0.01, num_iterations = 15000, print_cost = True, initialization = "he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros","random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """

    grads = {}
    costs = [] # to keep track of the loss
    m = X.shape[1] # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]

    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)

        # Loss
        cost = compute_loss(a3, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

2-Zero initialization

def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    parameters = {}
    L = len(layers_dims) # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###
    return parameters

Run the following code to train your model on 15,000 iterations using zeros initialization.

parameters = model(train_X, train_Y, initialization = "zeros")
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Result:
image.png

image.png

The model is predicting 0 for every example.

In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with n[l]=1 for every layer, and the network is no more powerful than a linear classifier such as logistic regression.

Remember:

  • The weights W[l] should be initialized randomly to break symmetry.
  • It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is initialized randomly.

3- Random initialization

def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3) # This seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims) # integer representing the number of layers

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters

Result:
image.png

image.png

Observations

  • The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when log(a[3])=log(0), the loss goes to infinity.
  • Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm.
  • If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.

In summary:

  • Initializing weights to very large random values does not work well.
  • Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!

4- He initialization

Finally, try “He Initialization”; this is named for the first author of He et al., 2015.

image.png

def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    W1 -- weight matrix of shape (layers_dims[1], layers_dims[0])
                    b1 -- bias vector of shape (layers_dims[1], 1)
                    ...
                    WL -- weight matrix of shape (layers_dims[L], layers_dims[L-1])
                    bL -- bias vector of shape (layers_dims[L], 1)
    """

    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1 # integer representing the number of layers

    for l in range(1, L + 1):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
        ### END CODE HERE ###

    return parameters

Result:
image.png
image.png

Observation

  • The model with He initialization separates the blue and the red dots very well in a small number of iterations.

5- Conclusions

image.png

What we need to remember:

  • Different initializations lead to different results.
  • Random initialization is used to break symmetry and make sure different hidden units can learn different things.
  • Don’t initialize to values that are too large.
  • He initialization works well for networks with ReLU activations.

Regularization

import packages:

# import packages
import numpy as np
import matplotlib.pyplot as plt
from reg_utils import sigmoid, relu, plot_decision_boundary, initialize_parameters, load_2D_dataset, predict_dec
from reg_utils import compute_cost, predict, forward_propagation, backward_propagation, update_parameters
import sklearn
import sklearn.datasets
import scipy.io
from testCases import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

Problem Statement: You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France’s goal keeper should kick the ball so that the French team’s players can then hit it with their head.

image.png

They give you the following 2D dataset from France’s past 10 games.

train_X, train_Y, test_X, test_Y = load_2D_dataset()

image.png

Each dot corresponds to a position on the football field where a football player has hit the ball with his/her head after the French goal keeper has shot the ball from the left side of the football field.

  • If the dot is blue, it means the French player managed to hit the ball with his/her head
  • If the dot is red, it means the other team’s player hit the ball with their head

Your goal: Use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.

Analysis of the dataset: This dataset is a little noisy, but it looks like a diagonal line separating the upper left half (blue) from the lower right half (red) would work well.

You will first try a non-regularized model. Then you’ll learn how to regularize it and decide which model you will choose to solve the French Football Corporation’s problem.

1- Non-regularized model

You will use the following neural network (already implemented for you below). This model can be used:

  • in regularization mode — by setting the lambd input to a non-zero value. We use "lambd" instead of "lambda" because "lambda" is a reserved keyword in Python.
  • in dropout mode — by setting the keep_prob to a value less than one

You will first try the model without any regularization. Then, you will implement:

  • L2 regularization — functions: "compute_cost_with_regularization()" and "backward_propagation_with_regularization()"
  • Dropout — functions: "forward_propagation_with_dropout()" and "backward_propagation_with_dropout()"

In each part, you will run this model with the correct inputs so that it calls the functions you’ve implemented. Take a look at the code below to familiarize yourself with the model.

def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob - probability of keeping a neuron active during drop-out, scalar.

    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """

    grads = {}
    costs = [] # to keep track of the cost
    m = X.shape[1] # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]

    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1: # no dropout
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)

        # Cost function
        if lambd == 0: # no L2 regularization
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)

        # Backward propagation.
        assert(lambd == 0 or keep_prob == 1) # it is possible to use both L2 regularization and dropout,
                                             # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

Let’s train the model without any regularization, and observe the accuracy on the train/test sets.

parameters = model(train_X, train_Y)
print ("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

image.png

image.png

The non-regularized model is obviously overfitting the training set. It is fitting the noisy points! Let's now look at two techniques to reduce overfitting.

2- L2 Regularization

image.png

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) * (lambd / (2 * m))
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost
def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y

    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + W3 * (lambd / m)
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + W2 * (lambd / m)
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + W1 * (lambd / m)
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Train and test:

parameters = model(train_X, train_Y, lambd = 0.7)
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Result:
image.png
image.png

Observations:

  • The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
  • L2 regularization makes your decision boundary smoother. If $\lambda$ is too large, it is also possible to “oversmooth”, resulting in a model with high bias.

What is L2-regularization actually doing?:

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

What you should remember about the implications of L2 regularization:

  • The cost computation: A regularization term is added to the cost
  • The backpropagation function: There are extra terms in the gradients with respect to weight matrices
  • Weights end up smaller (“weight decay”): Weights are pushed to smaller values.

3- Dropout

Finally, dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.

When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

3.1- Forward propagation

Note that the random numbers here come from np.random.rand, which is uniform on [0, 1), not from np.random.randn.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    """
    Implements the forward propagation: LINEAR -> RELU + DROPOUT -> LINEAR -> RELU + DROPOUT -> LINEAR -> SIGMOID.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (20, 2)
                    b1 -- bias vector of shape (20, 1)
                    W2 -- weight matrix of shape (3, 20)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    A3 -- last activation value, output of the forward propagation, of shape (1,1)
    cache -- tuple, information stored for computing the backward propagation
    """

    np.random.seed(1)

    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)       # Steps 1-4 below correspond to the Steps 1-4 described above.
    D1 = np.random.rand(A1.shape[0], A1.shape[1])   # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = D1 < keep_prob                             # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                    # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                             # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])   # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = D2 < keep_prob                             # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                    # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                             # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

    return A3, cache

3.2- Backward propagation with dropout

Instruction: Backpropagation with dropout is actually quite easy. You will have to carry out 2 Steps:

  1. You had previously shut down some neurons during forward propagation, by applying a mask $D^{[1]}$ to A1. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask $D^{[1]}$ to dA1.
  2. During forward propagation, you had divided A1 by keep_prob. In backpropagation, you’ll therefore have to divide dA1 by keep_prob again (the calculus interpretation is that if $A^{[1]}$ is scaled by keep_prob, then its derivative $dA^{[1]}$ is also scaled by the same keep_prob).
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob - probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA2 = dA2 * D2              # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    dA1 = dA1 * D1              # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob       # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients

Train and test:

parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Result:
image.png
image.png

What you should remember about dropout:

  • Dropout is a regularization technique.
  • You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
  • Apply dropout both during forward and backward propagation.
  • During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
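
A quick numerical check of that scaling argument (my own snippet, not part of the assignment):

import numpy as np

np.random.seed(0)
a = np.random.rand(1000, 1000)                  # stand-in activations
keep_prob = 0.5
d = np.random.rand(*a.shape) < keep_prob        # dropout mask
a_drop = (a * d) / keep_prob                    # shut down units, then rescale by 1/keep_prob
print(a.mean(), a_drop.mean())                  # the two means come out nearly identical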

Conclusions

image.png

Note that regularization hurts training set performance! This is because it limits the ability of the network to overfit to the training set. But since it ultimately gives better test accuracy, it is helping your system.

What we need to remember:

  • Regularization will help you reduce overfitting.
  • Regularization will drive your weights to lower values.
  • L2 regularization and Dropout are two very effective regularization techniques.

Gradient Checking

import packages:

import numpy as np
from testCases import *
from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector

1) How does gradient checking work?

image.png

2) 1-dimensional gradient checking
image.png

def forward_propagation(x, theta):
    """
    Implement the linear forward propagation (compute J) presented in Figure 1 (J(theta) = theta * x)

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    J -- the value of function J, computed using the formula J(theta) = theta * x
    """

    ### START CODE HERE ### (approx. 1 line)
    J = theta * x
    ### END CODE HERE ###

    return J

def backward_propagation(x, theta):
    """
    Computes the derivative of J with respect to theta (see Figure 1).

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well

    Returns:
    dtheta -- the gradient of the cost with respect to theta
    """

    ### START CODE HERE ### (approx. 1 line)
    dtheta = x
    ### END CODE HERE ###

    return dtheta

image.png

def gradient_check(x, theta, epsilon = 1e-7):
    """
    Implement the backward propagation presented in Figure 1.

    Arguments:
    x -- a real-valued input
    theta -- our parameter, a real number as well
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Compute gradapprox using left side of formula (1). epsilon is small enough, you don't need to worry about the limit.
    ### START CODE HERE ### (approx. 5 lines)
    thetaplus = theta + epsilon                                       # Step 1
    thetaminus = theta - epsilon                                      # Step 2
    J_plus = forward_propagation(x, thetaplus)                        # Step 3
    J_minus = forward_propagation(x, thetaminus)                      # Step 4
    gradapprox = (J_plus - J_minus) / (2 * epsilon)                   # Step 5
    ### END CODE HERE ###

    # Check if gradapprox is close enough to the output of backward_propagation()
    ### START CODE HERE ### (approx. 1 line)
    grad = backward_propagation(x, theta)
    ### END CODE HERE ###

    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                     # Step 1'
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)   # Step 2'
    difference = numerator / denominator                              # Step 3'
    ### END CODE HERE ###

    if difference < 1e-7:
        print ("The gradient is correct!")
    else:
        print ("The gradient is wrong!")

    return difference

If the difference is smaller than 1e-7, then we can have high confidence that we’ve correctly computed the gradient in backward_propagation().

3) N-dimensional gradient checking

image.png

How do we gradient-check a multi-layer network?
image.png
image.png

def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
    grad -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters.
    x -- input datapoint, of shape (input size, 1)
    y -- true "label"
    epsilon -- tiny shift to the input to compute approximated gradient with formula(1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """

    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    for i in range(num_parameters):

        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function outputs two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        thetaplus = np.copy(parameters_values)                                          # Step 1
        thetaplus[i][0] = thetaplus[i][0] + epsilon                                     # Step 2
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaplus))     # Step 3
        ### END CODE HERE ###

        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        thetaminus = np.copy(parameters_values)                                         # Step 1
        thetaminus[i][0] = thetaminus[i][0] - epsilon                                   # Step 2 (perturb thetaminus, not thetaplus)
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(thetaminus))   # Step 3
        ### END CODE HERE ###

        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                                       # Step 1'
    denominator = np.linalg.norm(gradapprox) + np.linalg.norm(grad)                     # Step 2'
    difference = numerator / denominator                                                # Step 3'
    ### END CODE HERE ###

    if difference > 1e-7:
        print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference

When run, the difference comes out larger than 1e-7, which means there is an error in the backward-propagation function, and we have to go back to that function to hunt for the bug.

Note

  • Gradient Checking is slow. For this reason, we don’t run gradient checking at every iteration during training. Just a few times to check if the gradient is correct.
  • Gradient Checking, at least as we’ve presented it, doesn’t work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout.

What we need to remember

  • Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
  • Gradient checking is slow, so we don’t run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process.