Tuning process

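In deep learning, the hyperparameters we need to tune include the learning rate alpha, the Momentum parameter beta, the Adam parameters beta1/beta2/epsilon, the number of layers, the number of hidden units in each layer, the learning rate decay, the mini-batch size, and so on. Some of these matter more than others, and the learning rate is the most important one. In the course slide, red marks priority 1, orange priority 2 and purple priority 3; the Adam parameters are usually left at their default values.

Strategy 1: Try random values, but don't use a grid. A common habit is grid search, but it only works well when there are few hyperparameters. When training deep networks we sample random values instead. It is hard to know in advance which hyperparameters matter most, and there may be many of them, so with the same number of trials random sampling explores many more distinct values of the important hyperparameters than a grid does.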
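
To make the argument concrete, here is a minimal sketch comparing a 5x5 grid with 25 random trials. The two hyperparameters A and B and their ranges are hypothetical and chosen only for illustration. If A is the one that matters, the grid only ever tries 5 values of it, while random sampling tries 25.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Grid search: 5 values per hyperparameter gives 25 trials in total.
grid_a = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical values for hyperparameter A
grid_b = [1, 2, 3, 4, 5]            # hypothetical values for hyperparameter B
grid_trials = [(a, b) for a in grid_a for b in grid_b]

# Random search: the same 25 trials, each with a fresh draw for A and B.
random_trials = [(random.uniform(0.1, 0.5), random.uniform(1, 5))
                 for _ in range(len(grid_trials))]

distinct_a_grid = len({a for a, _ in grid_trials})
distinct_a_random = len({a for a, _ in random_trials})
print(distinct_a_grid, distinct_a_random)  # 5 25
```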

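Strategy 2: Coarse to fine. First sample randomly over a coarse range, then sample more densely at random around the few values that performed best.

Choosing an appropriate scale for hyperparameters

The previous section showed that sampling at random within the hyperparameter range makes the search more efficient. But random sampling does not mean sampling uniformly over the range of valid values; what matters is choosing an appropriate scale on which to explore each hyperparameter.

Some hyperparameters can be sampled uniformly at random, for example the number of hidden units or the number of layers.

Other hyperparameters are poorly suited to uniform sampling. Take the learning rate, and suppose we believe its smallest sensible value is 0.0001 and its largest is 1. Sampling uniformly on this interval puts about 90% of the trials between 0.1 and 1 and only about 10% between 0.0001 and 0.1. A better strategy is to search on a log scale: mark the points 0.0001, 0.001, 0.01, 0.1 and 1, and sample uniformly at random on that logarithmic axis, so that each decade gets the same share of trials.

In Python: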

r = -4 * np.random.rand()  # r is in [-4, 0]
alpha = pow(10, r)         # alpha is in [10^(-4), 1]
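
The difference between the two scales can be checked with a quick simulation. This is a sketch using only the standard library; the sample size and the helper name `share_below` are arbitrary choices made here. It measures what fraction of sampled learning rates falls below 0.1 under each scheme.

```python
import random

random.seed(1)
N = 100_000

def share_below(samples, threshold):
    """Fraction of samples that are smaller than threshold."""
    return sum(s < threshold for s in samples) / len(samples)

# Uniform on the linear scale vs. uniform on the log10 scale.
uniform = [random.uniform(0.0001, 1.0) for _ in range(N)]
log_scale = [10 ** random.uniform(-4, 0) for _ in range(N)]

share_uniform = share_below(uniform, 0.1)
share_log = share_below(log_scale, 0.1)
print(share_uniform, share_log)
```

With linear sampling roughly 10% of the draws land below 0.1; with log-scale sampling roughly 75% do (three of the four decades).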

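Another tricky case is beta, the parameter used to compute exponentially weighted averages. Suppose we believe beta lies somewhere between 0.9 and 0.999. Note that beta = 0.9 is roughly like averaging the temperature over the last 10 days, while beta = 0.999 is like averaging over the last 1000 values. So a linear search over 0.9 to 0.999, that is, uniform random sampling on this interval, is not appropriate.

The better approach is to work with 1 - beta, which ranges from 0.1 down to 0.001. Applying the same method as for the learning rate, sample r uniformly in [-3, -1], set 1 - beta = 10^r, and recover beta from that.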
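
A minimal sketch of this sampling rule, using only the standard library (the function name `sample_beta` is made up for this example):

```python
import random

def sample_beta(rng):
    """Draw beta so that 1 - beta is log-uniform on [0.001, 0.1]."""
    r = rng.uniform(-3, -1)  # exponent r in [-3, -1]
    return 1 - 10 ** r       # beta in [0.9, 0.999]

rng = random.Random(0)
betas = [sample_beta(rng) for _ in range(1000)]
print(min(betas), max(betas))
```

About half of the draws fall in [0.9, 0.99] and half in [0.99, 0.999], so the region close to 1 is explored as densely as the region close to 0.9.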

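Why not sample linearly? Because the result becomes far more sensitive to small changes in beta as beta approaches 1. Moving beta from 0.9 to 0.9005 barely changes anything (about 10 values are averaged either way), but moving it from 0.999 (about 1000 temperature readings) to 0.9995 (about 2000 readings) has a huge effect on the algorithm.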
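
The numbers above come from the rule of thumb that an exponentially weighted average with parameter beta averages over roughly 1 / (1 - beta) values. A short sketch (the helper name is chosen here for illustration):

```python
def effective_window(beta):
    """Approximate number of values averaged by an exponentially weighted average."""
    return 1.0 / (1.0 - beta)

for beta in (0.9, 0.9005, 0.999, 0.9995):
    print(beta, round(effective_window(beta), 2))
```

The same step of 0.0005 moves the window from about 10 to about 10.05 near beta = 0.9, but from about 1000 to about 2000 near beta = 0.999.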

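So we need to make the right scale decision when choosing hyperparameters.

Hyperparameter tuning in practice: Pandas vs. Caviar

So far we have covered a lot about how to search for good hyperparameters. Before wrapping up, here are some notes on how to organize the search process.

Deep learning is now applied in many different areas, and hyperparameter settings from one application area may carry over to another; the areas increasingly cross-fertilize. For example, Andrew Ng mentions that ideas developed in computer vision, such as ConvNets and ResNets, have also been applied successfully to speech recognition.

A healthy development in the field is that people in different application areas increasingly read deep learning papers from other areas and look across domains for inspiration.

As for hyperparameter settings, even if we work on a single problem, say logistics, and have found a good set of hyperparameters and keep developing the algorithm, the data may gradually change over several months, and those changes can make the original settings no longer work well. So we should re-test or re-evaluate our hyperparameters occasionally (Re-test hyperparameters occasionally), at least once every few months, to make sure we are still happy with the values.

Finally, there are two main schools of thought on how to search for hyperparameters. One is babysitting one model: watching a single model's performance day by day and adjusting its hyperparameters (such as the learning rate) as it trains, usually because we lack the compute to train many models. The other is training many models in parallel: train models with many different hyperparameter settings at the same time and pick the one that performs best.

The two approaches are like pandas versus caviar (fish eggs): a panda raises one cub at a time with great care, while a fish lays many eggs and a few of them do well. Which approach we use mostly depends on the computational resources we have.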

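Batch Norm (I don't fully understand this yet and need to revisit it)

Machine learning relies on an important assumption, the IID (independent and identically distributed) assumption: training data and test data follow the same distribution. This is a basic guarantee that a model fitted on the training data will also perform well on the test set. So what does Batch Norm do? During training of a deep network, Batch Norm keeps the distribution of the inputs to each layer stable.

Normalizing activations in a network

Earlier in the course we saw that normalizing the input features greatly speeds up learning the parameters W and b.

This raises the question of whether the inputs to each hidden layer should be normalized too. There is some debate about whether to normalize Z or A; Andrew Ng's default choice is to normalize Z.

Implementing Batch Norm: suppose we have the hidden unit values Z(1) to Z(m) of some layer, where the layer index [l] is dropped to simplify the notation. Batch Norm applies normalization not only to the network's inputs but also to the inputs of hidden layers.

In the slide, within each iteration we first compute the mean mu and the variance, and then the normalized Z values. But we do not want every hidden layer to be forced to the same mean and variance, so two learnable parameters, gamma and beta, are added to set the mean and variance of the result. We then use the new values (z_tilde) instead of the original Z in the rest of the computation.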
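
Written out, the steps described above are the standard Batch Norm equations for one layer over a mini-batch of m examples. The small constant epsilon is added for numerical stability, and the symbols follow the notes' z_norm and z_tilde naming:

```latex
\mu = \frac{1}{m}\sum_{i=1}^{m} z^{(i)}, \qquad
\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(z^{(i)} - \mu\right)^2
\\
z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \qquad
\tilde{z}^{(i)} = \gamma\, z^{(i)}_{\text{norm}} + \beta
```

Setting gamma to sqrt(sigma^2 + epsilon) and beta to mu recovers the original z, so the layer can undo the normalization if that turns out to be useful.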

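Fitting Batch Norm into a neural network

Next we fit Batch Norm into a neural network; the slide gives a simple diagram of the process.

The parameters of the whole network are therefore W[l], b[l], beta[l] and gamma[l] for every layer l.

beta[l] and gamma[l] can also be updated with gradient descent. Note that this beta is a completely different parameter from the beta used in the optimization algorithms (Momentum, Adam).

When using a deep learning framework, we usually do not need to implement the details of Batch Norm ourselves. In TensorFlow, for example, tf.nn.batch_normalization implements BN directly.

In practice, Batch Norm is usually used together with mini-batches; the slide gives a simple diagram of the process.

There is one detail to note. The parameters with Batch Norm are W[l], b[l], beta[l] and gamma[l]. Originally we compute Z[l] = W[l]a[l-1] + b[l], but with Batch Norm b[l] is cancelled out, because subtracting the mean removes any constant that was added to every example. So when using Batch Norm we can simply fix b[l] at 0 and never update it. Beyond that, just keep track of the parameter dimensions.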
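
The cancellation of b[l] is easy to verify numerically. Below is a minimal NumPy sketch; the shapes, the random data and the helper `batch_norm` are assumptions made for this illustration, not code from the course. It normalizes over the mini-batch axis with and without a bias term.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize each row (hidden unit) of z over the mini-batch axis."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))       # 3 hidden units, 4 inputs
a_prev = rng.standard_normal((4, 8))  # mini-batch of 8 examples
b = rng.standard_normal((3, 1))       # bias, one value per hidden unit
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))

with_bias = batch_norm(W @ a_prev + b, gamma, beta)
without_bias = batch_norm(W @ a_prev, gamma, beta)
print(np.allclose(with_bias, without_bias))  # True: b has no effect
```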

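The lecture then walks through the whole training procedure.

Note that db no longer needs to be computed. Batch Norm also works with other optimization algorithms, such as Adam.

Why does Batch Norm work?

One reason is the same as for normalizing input features: bringing all the values into a similar range speeds up learning.

Another reason concerns covariate shift: if we have learned a mapping from X to Y and the distribution of X changes, the learned function may need to change too.
In a deep network with many hidden layers, the parameters of every layer keep changing during training, so every hidden layer faces covariate shift: the distribution of its inputs keeps moving. This is the so-called "Internal Covariate Shift". "Internal" refers to the hidden layers of the deep network: the shift happens inside the network rather than only at the input layer. Batch Norm ensures that however the incoming values change, the mean and variance of each layer's inputs stay the same.

Batch Norm reduces the problem of shifting input values by making them more stable: even if the input distribution changes somewhat, it changes much less after normalization. When the input to the current layer changes, the later layers have less to adapt to. This weakens the coupling between what the earlier layers' parameters do and what the later layers' parameters do, so each layer can learn a little more independently of the others, which speeds up learning for the whole network.

Batch Norm also has a slight regularization effect. When Batch Norm is applied to mini-batches, the mean and variance are computed on each mini-batch rather than on the whole dataset, so they carry some noise. This noise acts a bit like dropout, which adds noise to each hidden layer's activations by keeping or dropping hidden units with some probability. A subtle, non-intuitive consequence is that a larger mini-batch, say 512 instead of 64, reduces the noise and therefore the regularization effect. This is a slightly odd property of Batch Norm as a regularizer: a larger mini-batch weakens it.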
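
The claim about mini-batch size can be illustrated with a small simulation. This is a sketch on synthetic data: the standard-normal values stand in for one hidden unit's z values, and the number of simulated batches is an arbitrary choice. It measures how much the mini-batch mean fluctuates for batches of 64 and 512.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)  # stand-in for one hidden unit's z values

def std_of_batch_means(values, batch_size, n_batches=2000):
    """Spread of the mini-batch mean across many randomly drawn mini-batches."""
    means = []
    for _ in range(n_batches):
        idx = rng.integers(0, len(values), size=batch_size)
        means.append(values[idx].mean())
    return float(np.std(means))

noise_64 = std_of_batch_means(z, 64)
noise_512 = std_of_batch_means(z, 512)
print(noise_64, noise_512)
```

The spread for batches of 64 is close to 1/sqrt(64) and the spread for 512 is close to 1/sqrt(512), so the larger batch injects noticeably less noise.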

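In general we do not use Batch Norm as a regularizer; we use it as a way to normalize the hidden layers and speed up learning.

Batch Norm at test time

Batch Norm processes data one mini-batch at a time, but at test time we may need to process (predict) examples one by one.

Recall the equations used for Batch Norm. Within one mini-batch, we average the Z(i) values to get the mean, compute the variance, compute z_norm(i), and finally rescale z_norm to get z_tilde. The mean and variance here are computed over the whole mini-batch. At test time we may not have a mini-batch to process together, and the mean and variance of a single example are meaningless, so we need another way to obtain them. In practice, to apply the network at test time we estimate the mean and variance separately; a typical Batch Norm implementation uses an exponentially weighted average taken across the mini-batches.

Suppose the training set is split into several mini-batches. Computing the mean mu and the variance of a given hidden layer on each mini-batch yields a different value per mini-batch, so we can track an exponentially weighted average of each, just as we did for the temperature example. At test time we use these weighted averages of the mean and variance to compute z_norm, and then use the beta and gamma learned during training to compute z_tilde for the test example.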
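
A minimal NumPy sketch of this idea for a single hidden unit follows. The toy data (mean 3, standard deviation 2), the averaging weight of 0.9 and all names are assumptions made for this illustration. The running estimates are updated from each mini-batch during training and then reused to normalize one example at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9                      # weight of the running average (assumed value)
running_mu, running_var = 0.0, 1.0  # running estimates for one hidden unit

# Training: update the running estimates from each mini-batch's statistics.
for _ in range(500):
    z_batch = 3.0 + 2.0 * rng.standard_normal(64)  # toy z values: mean 3, std 2
    running_mu = momentum * running_mu + (1 - momentum) * z_batch.mean()
    running_var = momentum * running_var + (1 - momentum) * z_batch.var()

def batch_norm_inference(z, gamma, beta, eps=1e-8):
    """Normalize a single example with the running statistics from training."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

z_tilde = batch_norm_inference(5.0, gamma=1.0, beta=0.0)
print(running_mu, running_var, z_tilde)
```

The running mean ends up near 3 and the running variance near 4, so a test value of 5 is mapped to roughly 1.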

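For a more detailed explanation of Batch Norm, see: https://www.cnblogs.com/guoyaohua/p/8724433.html

Softmax regression

All the classification so far has been binary. Next comes Softmax regression, which handles multi-class classification.

We use a capital C for the total number of classes the input can be assigned to; in the slide's example there are 4 classes: 0, 1, 2 and 3. To do multi-class classification with a neural network, we want the output layer to have one unit per class, each telling us the probability of that class. (Why can the output units correspond to the classes like this? I don't understand yet.)

To output several probabilities at once we need the Softmax function. Unlike the sigmoid and ReLU activations, which take a real number and return a real number, Softmax takes a vector and returns a vector. In the slide, Z[L] has shape (4,1) and the output a[L] also has shape (4,1). To compute it, first exponentiate each element of Z[L], then normalize by the sum so that the results add up to 1; these are the probabilities we want.

The right side of the slide gives a small worked example: with Z[L] = [5 2 -1 3].T, exponentiating gives t = [148.4, 7.4, 0.4, 20.1].T with sum 176.3, so the output probabilities are a[L] = [0.842, 0.042, 0.002, 0.114].T.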
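
The worked example can be reproduced in a few lines of standard-library Python. This is a sketch, and the function name `softmax` is chosen here for illustration:

```python
import math

def softmax(z):
    """Exponentiate each element, then normalize so the outputs sum to 1."""
    t = [math.exp(v) for v in z]
    total = sum(t)
    return [v / total for v in t]

a = softmax([5, 2, -1, 3])
print([round(v, 3) for v in a])  # [0.842, 0.042, 0.002, 0.114]
```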

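Next, the lecture shows examples of a network with no hidden layer combined with Softmax, to build intuition.

Even with no hidden layer, Softmax can learn linear decision boundaries; adding hidden layers gives more complex, non-linear boundaries.

Training a Softmax classifier

Understanding softmax

A temporary variable t is normalized to get the probabilities. A hard max looks at Z and puts 1 at the largest element and 0 everywhere else; Softmax is a gentler mapping from Z to probabilities. Softmax regression is in fact a generalization of logistic regression. With C = 2 the output layer gives two probabilities, but since they sum to 1, knowing one gives the other, so logistic regression only needs a single output. In that sense softmax regression generalizes logistic regression to more than two classes.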
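
The C = 2 case can be checked directly: the first softmax output equals the sigmoid of the difference of the two inputs. The sketch below uses arbitrary example values for z1 and z2:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z1, z2):
    """Softmax over two classes."""
    t1, t2 = math.exp(z1), math.exp(z2)
    return t1 / (t1 + t2), t2 / (t1 + t2)

z1, z2 = 1.7, -0.4  # arbitrary example values
p1, p2 = softmax2(z1, z2)
print(abs(p1 - sigmoid(z1 - z2)) < 1e-12)  # True
```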

Loss function

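Suppose the ground truth label is cat, that is y = [0 1 0 0].T, and the trained network outputs a[L] = y_hat = [0.3 0.2 0.1 0.4].T. This is not a good result, since the true class only gets probability 0.2. We need a loss function to measure the error.

In short, the loss function looks at the true class of each training example and tries to make the probability assigned to that class as high as possible. The left side of the slide shows the loss for a single example; the right side shows the cost over the whole training set, as a function of the parameters W and b.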
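
For reference, the standard cross-entropy loss used with Softmax, and the corresponding cost over m training examples, can be written as follows (notation follows the notes; the regularization term is omitted):

```latex
\mathcal{L}(\hat{y}, y) = -\sum_{j=1}^{C} y_j \log \hat{y}_j
\qquad
J(W, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
```

For the example above, only the true class contributes, so the loss is -log(0.2), about 1.61. It would approach 0 as the probability of the true class approaches 1.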

Gradient descent with softmax

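Since we can use a deep learning framework for the assignments from here on, Andrew Ng does not go through the derivation in detail…

The key step is the derivative of a with respect to z, which splits into two cases: the derivative of a_i with respect to z_j when i = j, and when i ≠ j.

With that in hand, take the derivative of the loss with respect to a and multiply (chain rule) to get the derivative of the loss with respect to z.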
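
For the cross-entropy loss, the two cases are da_i/dz_j = a_i (1 - a_i) when i = j and -a_i a_j when i ≠ j, and the chain rule collapses to the well-known result dL/dz = a - y. The sketch below, with helper names chosen for illustration, checks this against a numerical derivative.

```python
import math

def softmax(z):
    t = [math.exp(v) for v in z]
    s = sum(t)
    return [v / s for v in t]

def loss(z, y):
    """Cross-entropy loss of softmax(z) against the one-hot label y."""
    a = softmax(z)
    return -sum(yj * math.log(aj) for yj, aj in zip(y, a))

z = [5.0, 2.0, -1.0, 3.0]
y = [0.0, 1.0, 0.0, 0.0]

# Analytic gradient: dL/dz = a - y
analytic = [aj - yj for aj, yj in zip(softmax(z), y)]

# Numerical gradient by central differences
h = 1e-6
numeric = []
for j in range(len(z)):
    z_plus = z[:j] + [z[j] + h] + z[j + 1:]
    z_minus = z[:j] + [z[j] - h] + z[j + 1:]
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * h))

max_diff = max(abs(p - q) for p, q in zip(analytic, numeric))
print(max_diff < 1e-6)  # True
```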

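Deep learning frameworks

I still need to add PyTorch here; it is a must-learn as well. So the frameworks I want to master are TensorFlow and PyTorch.

Tensorflow

Here Andrew Ng gives some examples of the basic structure of a TensorFlow program.

First, take the cost function J(w) = w^2 - 10*w + 25. We want the w that minimizes J(w) (clearly w = 5, since J(w) = (w - 5)^2). The lecture shows a short TensorFlow program for this.

Note that the cost, written out with tf operations in the commented-out lines, can be written in the simpler form below them. The argument to tf.train.GradientDescentOptimizer is the learning rate, and w only changes when the training operation is run. After 1000 iterations the printed w is 4.9999886, very close to 5.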
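
As a rough plain-Python analogue of what the optimizer does, here is gradient descent on the same cost with a hand-written derivative. This is not the lecture's TensorFlow program; the learning rate of 0.01 and the starting point w = 0 are assumptions.

```python
def dJ_dw(w):
    """Derivative of J(w) = w**2 - 10*w + 25."""
    return 2 * w - 10

w = 0.0
learning_rate = 0.01  # assumed value for this sketch
for _ in range(1000):
    w -= learning_rate * dJ_dw(w)  # one gradient descent step

print(w)  # very close to 5
```

In TensorFlow the derivative does not have to be written by hand; it is derived from the definition of the cost.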

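To feed training data into the program, for example to supply the coefficients of this quadratic as input data, use tf.placeholder.

Note that the training data is passed in through the feed_dict argument.

One more thing worth noting: we usually write the session in the with form (shown on the right of the slide), which releases resources properly if an error occurs while the inner loop is running.

The heart of a TensorFlow program is computing the cost; TensorFlow then works out the derivatives and how to minimize the cost automatically.

Defining the cost is what lets TensorFlow build the computation graph.

The advantage of TensorFlow is that once forward propagation is expressed through this computation graph, using built-in functions that each come with their backward function, it can compute backpropagation automatically. (The nodes of a TensorFlow computation graph are operations.)

This week's assignment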

1- Exploring the Tensorflow Library

Import packages:

import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.python.framework import ops
from tf_utils import load_dataset, random_mini_batches, convert_to_one_hot, predict

%matplotlib inline
np.random.seed(1)

First, a simple loss function example:

y_hat = tf.constant(36, name="y_hat")
y = tf.constant(39, name="y")

loss = tf.Variable((y-y_hat)**2, name="loss")

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    print(sess.run(loss))

Writing and running programs in TensorFlow has the following steps:

Create Tensors (variables) that are not yet executed/evaluated.
Write operations between those Tensors.
Initialize your Tensors.
Create a Session.
Run the Session. This will run the operations you’d written above.

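Next, look at an example:

We did not get the result 20. Instead we got a description of a tensor: the result is a tensor that does not have the shape attribute and is of type "int32". All we did was add the operation to the computation graph; nothing has been computed yet. To actually multiply the two numbers, we need to create a session and run it.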

Summarize: remember to initialize your variables, create a session and run the operations inside the session.

Next, we need to know about placeholders. A placeholder is an object whose value you can specify only later. To specify values for a placeholder, you can pass in values by using a "feed dictionary" (feed_dict variable). Below, we created a placeholder for x. This allows us to pass in a number later when we run the session.

Here’s what’s happening: When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. The computation graph can have some placeholders whose values you will specify only later. Finally, when you run the session, you are telling TensorFlow to execute the computation graph.

1.1- Linear function

Let's start this programming exercise by computing the following equation: Y = WX + b, where W and X are random matrices and b is a random vector.

def linear_function():
    """
    Implements a linear function: 
            Initializes W to be a random tensor of shape (4,3)
            Initializes X to be a random tensor of shape (3,1)
            Initializes b to be a random tensor of shape (4,1)
    Returns: 
    result -- runs the session for Y = WX + b 
    """

    np.random.seed(1)
    ### START CODE HERE ### (4 lines of code)
    X = tf.constant(np.random.randn(3,1), name="X")
    W = tf.constant(np.random.randn(4,3), name="W")
    b = tf.constant(np.random.randn(4,1), name="b")
    Y = tf.add(tf.matmul(W,X),b)
    ### END CODE HERE ### 

    # Create the session using tf.Session() and run it with sess.run(...) on the variable you want to calculate
    ### START CODE HERE ###
    sess = tf.Session()
    result = sess.run(Y)
    ### END CODE HERE ### 
    
    # close the session 
    sess.close()

    return result

1.2- Computing the sigmoid

Arguments of tf.placeholder:

tf.placeholder(
    dtype,
    shape=None,
    name=None
)

Tensorflow offers a variety of commonly used neural network functions like tf.sigmoid and tf.softmax. For this exercise lets compute the sigmoid function of an input. You will do this exercise using a placeholder variable x. When running the session, you should use the feed dictionary to pass in the input z. In this exercise, you will have to (i) create a placeholder x, (ii) define the operations needed to compute the sigmoid using tf.sigmoid, and then (iii) run the session.

def sigmoid(z):
    """
    Computes the sigmoid of z
    
    Arguments:
    z -- input value, scalar or vector
    
    Returns: 
    results -- the sigmoid of z
    """
    
    ### START CODE HERE ### ( approx. 4 lines of code)
    # Create a placeholder for x. Name it 'x'.
    x = tf.placeholder(tf.float32, name="x")

    # compute sigmoid(x)
    sigmoid = tf.sigmoid(x)

    # Create a session, and run it. Please use the method 2 explained above. 
    # You should use a feed_dict to pass z's value to x. 
    with tf.Session() as sess:
        # Run session and call the output "result"
        result = sess.run(sigmoid, feed_dict={x:z})
    
    ### END CODE HERE ###
    
    return result

Summarize:

Create placeholders.
Specify the computation graph corresponding to operations you want to compute.
Create the session.
Run the session, using a feed dictionary if necessary to specify placeholder variables’ values.

1.3- Computing the Cost

You can also use a built-in function to compute the cost of your neural network. So instead of needing to write code to compute this as a function of a[2]_i and y(i) for i=1…m:

The function used here is tf.nn.sigmoid_cross_entropy_with_logits, called as tf.nn.sigmoid_cross_entropy_with_logits(logits = …, labels = …). The note in the docstring below points out: What we've been calling "z" and "y" in this class are respectively called "logits" and "labels" in the TensorFlow documentation. So logits will feed into z, and labels into y.

def cost(logits, labels):
    """
    Computes the cost using the sigmoid cross entropy
    
    Arguments:
    logits -- vector containing z, output of the last linear unit (before the final sigmoid activation)
    labels -- vector of labels y (1 or 0) 
    
    Note: What we've been calling "z" and "y" in this class are respectively called "logits" and "labels" 
    in the TensorFlow documentation. So logits will feed into z, and labels into y. 
    
    Returns:
    cost -- runs the session of the cost (formula (2))
    """
    
    ### START CODE HERE ### 
    
    # Create the placeholders for "logits" (z) and "labels" (y) (approx. 2 lines)
    z = tf.placeholder(tf.float32, name="z")
    y = tf.placeholder(tf.float32, name="y")
    
    # Use the loss function (approx. 1 line)
    cost = tf.nn.sigmoid_cross_entropy_with_logits(logits=z,labels=y)
    
    # Create a session (approx. 1 line). See method 1 above.
    sess = tf.Session()
    
    # Run the session (approx. 1 line).
    cost = sess.run(cost, feed_dict={z:logits, y:labels})
    
    # Close the session (approx. 1 line). See method 1 above.
    sess.close()
    
    ### END CODE HERE ###
    
    return cost
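
For intuition about what this built-in computes, here is a NumPy sketch of the element-wise sigmoid cross-entropy in its numerically stable form, max(z, 0) - z*y + log(1 + exp(-|z|)), compared with the textbook formula. The function name and the sample values are chosen for this illustration.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Element-wise -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))], computed stably."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(labels, dtype=float)
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([0.2, 0.4, 0.7, 0.9])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Textbook formula, which can overflow for large |z|
a = 1 / (1 + np.exp(-z))
naive = -(y * np.log(a) + (1 - y) * np.log(1 - a))

stable = sigmoid_cross_entropy(z, y)
print(np.allclose(stable, naive))  # True
```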

1.4- Using One Hot encodings

Many times in deep learning you will have a y vector with numbers ranging from 0 to C-1, where C is the number of classes. If C is for example 4, then you might have the following y vector which you will need to convert as follows:

This is called a “one hot” encoding, because in the converted representation exactly one element of each column is “hot” (meaning set to 1). To do this conversion in numpy, you might have to write a few lines of code. In tensorflow, you can use one line of code:

tf.one_hot(indices, depth, axis)

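Note that axis=0 puts the one-hot dimension first, so each column is the encoding of one label (as in the figure); axis=1 puts it last, so each row is the encoding of one label.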

def one_hot_matrix(labels, C):
    """
    Creates a matrix where the i-th row corresponds to the ith class number and the jth column
                     corresponds to the jth training example. So if example j had a label i. Then entry (i,j) 
                     will be 1. 
                     
    Arguments:
    labels -- vector containing the labels 
    C -- number of classes, the depth of the one hot dimension
    
    Returns: 
    one_hot -- one hot matrix
    """
    
    ### START CODE HERE ###
    # Create a tf.constant equal to C (depth), name it 'C'. (approx. 1 line)
    C = tf.constant(C,name="C")
    
    # Use tf.one_hot, be careful with the axis (approx. 1 line)
    one_hot_matrix = tf.one_hot(indices=labels,depth=C,axis=0)
    
    # Create the session (approx. 1 line)
    sess = tf.Session()
    
    # Run the session (approx. 1 line)
    one_hot = sess.run(one_hot_matrix)
    
    # Close the session (approx. 1 line). See method 1 above.
    sess.close()
    
    ### END CODE HERE ###
    
    return one_hot
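
The same column-wise layout can be sketched in NumPy, which makes the (C, m) shape easy to inspect. The helper name and the sample labels are chosen for this illustration.

```python
import numpy as np

def one_hot_columns(labels, C):
    """Return a (C, m) matrix whose j-th column is the one-hot code of labels[j]."""
    labels = np.asarray(labels)
    return np.eye(C)[labels].T  # pick rows of the identity, then transpose

labels = [1, 2, 3, 0, 2, 1]
encoded = one_hot_columns(labels, 4)
print(encoded)
```

Each column contains exactly one 1, in the row given by the label of that example.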

1.5- Initialize with zeros and ones

Now you will learn how to initialize a vector of zeros and ones. The function you will be calling is tf.ones(). To initialize with zeros you could use tf.zeros() instead. These functions take in a shape and return an array of dimension shape full of zeros and ones respectively.

def ones(shape):
    """
    Creates an array of ones of dimension shape
    
    Arguments:
    shape -- shape of the array you want to create
        
    Returns: 
    ones -- array containing only ones
    """
    
    ### START CODE HERE ###
    
    # Create "ones" tensor using tf.ones(...). (approx. 1 line)
    ones = tf.ones(shape)
    
    # Create the session (approx. 1 line)
    sess = tf.Session()
    
    # Run the session to compute 'ones' (approx. 1 line)
    ones = sess.run(ones)
    
    # Close the session (approx. 1 line). See method 1 above.
    sess.close()
    
    ### END CODE HERE ###
    return ones

2- Building your first neural network in tensorflow

In this part of the assignment you will build a neural network using tensorflow. Remember that there are two parts to implement a tensorflow model:

Create the computation graph
Run the graph

2.0- Problem statement: SIGNS Dataset

One afternoon, with some friends we decided to teach our computers to decipher sign language. We spent a few hours taking pictures in front of a white wall and came up with the following dataset. It’s now your job to build an algorithm that would facilitate communications from a speech-impaired person to someone who doesn’t understand sign language.

Training set: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number).
Test set: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number).

Note that this is a subset of the SIGNS dataset. The complete dataset contains many more signs. Here are examples for each number, and an explanation of how we represent the labels. These are the original pictures, before we lowered the image resolution to 64 by 64 pixels.

Run the following code to load the dataset.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

Change the index below and run the cell to visualize some examples in the dataset.

# Example of a picture
index = 0
plt.imshow(X_train_orig[index])
print("y = " + str(np.squeeze(Y_train_orig[:,index])))

As usual you flatten the image dataset, then normalize it by dividing by 255. On top of that, you will convert each label to a one-hot vector as shown in Figure 1. Run the cell below to do so.

# Flatten the training and test images
X_train_flatten = X_train_orig.reshape(X_train_orig.shape[0], -1).T
X_test_flatten = X_test_orig.reshape(X_test_orig.shape[0], -1).T
# Normalize the image vectors
X_train = X_train_flatten / 255.
X_test = X_test_flatten / 255.
# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6)
Y_test = convert_to_one_hot(Y_test_orig, 6)

Your goal is to build an algorithm capable of recognizing a sign with high accuracy. To do so, you are going to build a tensorflow model that is almost the same as one you have previously built in numpy for cat recognition (but now using a softmax output). It is a great occasion to compare your numpy implementation to the tensorflow one.

The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX. The SIGMOID output layer has been converted to a SOFTMAX. A SOFTMAX layer generalizes SIGMOID to when there are more than two classes.

2.1- Create placeholders

Your first task is to create placeholders for X and Y. This will allow you to later pass your training data in when you run your session.

def create_placeholders(n_x, n_y):
    """
    Creates the placeholders for the tensorflow session.
    
    Arguments:
    n_x -- scalar, size of an image vector (num_px * num_px * 3 = 64 * 64 * 3 = 12288)
    n_y -- scalar, number of classes (from 0 to 5, so -> 6)
    
    Returns:
    X -- placeholder for the data input, of shape [n_x, None] and dtype "float"
    Y -- placeholder for the input labels, of shape [n_y, None] and dtype "float"
    
    Tips:
    - You will use None because it lets us be flexible on the number of examples you will use for the placeholders.
      In fact, the number of examples during test/train is different.
    """

    ### START CODE HERE ### (approx. 2 lines)
    X = tf.placeholder(dtype="float",shape=(n_x,None))
    Y = tf.placeholder(dtype="float",shape=(n_y,None))
    ### END CODE HERE ###
    
    return X, Y

2.2- Initializing the parameters

Your second task is to initialize the parameters in tensorflow.

Exercise: Implement the function below to initialize the parameters in tensorflow. You are going to use Xavier Initialization for weights and Zero Initialization for biases. The shapes are given below. As an example, to help you, for W1 and b1 you could use:

W1 = tf.get_variable("W1", [25,12288], initializer = tf.contrib.layers.xavier_initializer(seed = 1))  # note: tf.contrib is removed in TensorFlow 2.0
b1 = tf.get_variable("b1", [25,1], initializer = tf.zeros_initializer())

First, the parameter names of tf.get_variable and their meanings:

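tf.get_variable(
    name, # name of the new or existing variable
    shape=None, # shape of the new or existing variable
    dtype=None, # type of the new or existing variable (defaults to DT_FLOAT)
    initializer=None, # initializer for the variable, if one is created
    regularizer=None,
    trainable=True, # if True, also add the variable to the graph collection GraphKeys.TRAINABLE_VARIABLES
    collections=None, # list of graph collections to add the variable to
    caching_device=None, # optional device string or function describing where the variable should be cached for reading
    partitioner=None, # optional callable that takes a fully defined TensorShape and the dtype of the variable to create, and returns a list of partitions for each axis
    validate_shape=True, # if False, allows the variable to be initialized with a value of unknown shape
    use_resource=None, # if False, creates a regular variable; if True, creates an experimental ResourceVariable with well-defined semantics
    custom_getter=None,
    constraint=None
)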

def initialize_parameters():
    """
    Initializes parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [25, 12288]
                        b1 : [25, 1]
                        W2 : [12, 25]
                        b2 : [12, 1]
                        W3 : [6, 12]
                        b3 : [6, 1]
    
    Returns:
    parameters -- a dictionary of tensors containing W1, b1, W2, b2, W3, b3
    """
    tf.set_random_seed(1)

    ### START CODE HERE ### (approx. 6 lines of code)
    W1 = tf.get_variable("W1", shape=(25,12288), initializer=tf.contrib.layers.xavier_initializer(seed=1))
    b1 = tf.get_variable("b1", shape=(25,1), initializer=tf.zeros_initializer())
    W2 = tf.get_variable("W2", shape=(12,25), initializer=tf.contrib.layers.xavier_initializer(seed=1))
    b2 = tf.get_variable("b2", shape=(12,1), initializer=tf.zeros_initializer())
    W3 = tf.get_variable("W3", shape=(6,12), initializer=tf.contrib.layers.xavier_initializer(seed=1))
    b3 = tf.get_variable("b3", shape=(6,1), initializer=tf.zeros_initializer())
    ### END CODE HERE ###

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2,
                  "W3": W3,
                  "b3": b3}
    return parameters

At this moment, the parameters haven’t been evaluated yet.
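
For intuition about what Xavier initialization does, here is a NumPy sketch of the uniform variant, which samples weights from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The helper name and the use of NumPy's random generator are assumptions made for this illustration, so the values will not match what TensorFlow produces for the same seed.

```python
import math
import numpy as np

def xavier_uniform(shape, seed=1):
    """Sample a 2-D weight matrix uniformly from [-limit, limit]."""
    rows, cols = shape
    limit = math.sqrt(6.0 / (rows + cols))  # limit = sqrt(6 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=shape)

W1 = xavier_uniform((25, 12288))
b1 = np.zeros((25, 1))  # zero initialization for the bias
limit_W1 = math.sqrt(6.0 / (25 + 12288))
print(W1.shape, b1.shape, float(np.abs(W1).max()) <= limit_W1)
```

Scaling the range by the layer sizes keeps the variance of the activations roughly constant from layer to layer at the start of training.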

2.3- Forward propagation in tensorflow

You will now implement the forward propagation module in tensorflow. The function will take in a dictionary of parameters and it will complete the forward pass. The functions you will be using are:

tf.add(…,…) to do an addition
tf.matmul(…,…) to do a matrix multiplication
tf.nn.relu(…) to apply the ReLU activation

def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX
    
    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """
    
    # Retrieve the parameters from the dictionary "parameters" 
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    W3 = parameters['W3']
    b3 = parameters['b3']
    
    ### START CODE HERE ### (approx. 5 lines)              # Numpy Equivalents:
    Z1 = tf.add(tf.matmul(W1,X),b1)                                          # Z1 = np.dot(W1, X) + b1
    A1 = tf.nn.relu(Z1)                                              # A1 = relu(Z1)
    Z2 = tf.add(tf.matmul(W2,A1),b2)                                              # Z2 = np.dot(W2, A1) + b2
    A2 = tf.nn.relu(Z2)                                             # A2 = relu(Z2)
    Z3 = tf.add(tf.matmul(W3,A2),b3)                                              # Z3 = np.dot(W3, A2) + b3
    ### END CODE HERE ###
    
    return Z3

You may have noticed that the forward propagation doesn't output any cache. You will understand why below, when we get to backpropagation.

2.4- Compute cost

As seen before, it is very easy to compute the cost using:

- tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...))

Question: Implement the cost function below.

It is important to know that the “logits” and “labels” inputs of tf.nn.softmax_cross_entropy_with_logits are expected to be of shape (number of examples, num_classes). We have thus transposed Z3 and Y for you.
Besides, tf.reduce_mean takes the mean over the examples. Note this!

Pay attention to the arguments of tf.nn.softmax_cross_entropy_with_logits:

logits: the output of the last layer of the network; with a batch its shape is [batch_size, num_classes], and for a single example it is num_classes.
labels: the true labels, with the same shape as logits.

def compute_cost(Z3, Y):
    """
    Computes the cost
    
    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    
    Returns:
    cost - Tensor of the cost function
    """

    # to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...)
    logits = tf.transpose(Z3)
    labels = tf.transpose(Y)

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

    return cost
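
To see why the transposes are needed and what the mean is taken over, here is a NumPy sketch of the same computation. The function name and the toy values are assumptions made for this illustration; the inputs start in the notes' (num_classes, number of examples) layout.

```python
import numpy as np

def softmax_cross_entropy_mean(logits, labels):
    """Mean cross-entropy for logits and one-hot labels of shape (m, C)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -(labels * log_probs).sum(axis=1)        # shape (m,)
    return per_example.mean()

# Toy values in the (C, m) layout used by the notes: 4 classes, 2 examples.
Z3 = np.array([[5.0, 2.0],
               [2.0, 1.0],
               [-1.0, 0.0],
               [3.0, 4.0]])
Y = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])

cost = softmax_cross_entropy_mean(Z3.T, Y.T)  # transpose to (m, C)
print(cost)
```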

2.5- Backward propagation & parameter update

This is where you become grateful to programming frameworks. All the backpropagation and the parameters update is taken care of in 1 line of code. It is very easy to incorporate this line in the model.

After you compute the cost function. You will create an “optimizer” object. You have to call this object along with the cost when running the tf.session. When called, it will perform an optimization on the given cost with the chosen method and learning rate.

For instance, for gradient descent the optimizer would be:

optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost)

To make the optimization you would do:

_, c = sess.run([optimizer, cost], feed_dict={X:minibatch_X, Y:minibatch_Y})

This computes the backpropagation by passing through the tensorflow graph in the reverse order. From cost to inputs.

Note When coding, we often use _ as a “throwaway” variable to store values that we won’t need to use later. Here, _ takes on the evaluated value of optimizer, which we don’t need (and c takes the value of the cost variable).

2.6- Building the model

def model(X_train, Y_train, X_test, Y_test, learning_rate=0.0001, num_epochs=1500, minibatch_size=32, print_cost=True):
    """
    Implements a three-layer tensorflow neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SOFTMAX.
    
    Arguments:
    X_train -- training set, of shape (input size = 12288, number of training examples = 1080)
    Y_train -- training labels, of shape (output size = 6, number of training examples = 1080)
    X_test -- test set, of shape (input size = 12288, number of test examples = 120)
    Y_test -- test set, of shape (output size = 6, number of test examples = 120)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    ops.reset_default_graph() # reset the default graph so that no other graph is left in memory and the model can be rerun cleanly
    tf.set_random_seed(1)
    seed = 3
    (n_x, m) = X_train.shape
    n_y = Y_train.shape[0]
    costs = []

    X, Y = create_placeholders(n_x, n_y)

    parameters = initialize_parameters()

    Z3 = forward_propagation(X, parameters)

    cost = compute_cost(Z3, Y)

    optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)

    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(num_epochs):
            epoch_cost = 0
            num_minibatches = int(m / minibatch_size)
            seed += 1
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)

            for minibatch in minibatches:
                (minibatch_X, minibatch_Y) = minibatch
                _, minibatch_cost = sess.run([optimizer, cost], feed_dict={X:minibatch_X,Y:minibatch_Y}) # passing a list runs optimizer and cost together
                epoch_cost += minibatch_cost / num_minibatches

            if print_cost == True and epoch % 100 == 0:
                print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            if print_cost == True and epoch % 5 == 0:
                costs.append(epoch_cost)

        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('iterations (per tens)')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # lets save the parameters in a variable
        parameters = sess.run(parameters)
        print ("Parameters have been trained!")

        # Calculate the correct predictions
        correct_prediction = tf.equal(tf.argmax(Z3), tf.argmax(Y))

        # Calculate accuracy on the test set
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) # tf.cast converts the dtype

        print("Train Accuracy:", accuracy.eval({X:X_train, Y:Y_train}))
        print("Test Accuracy:", accuracy.eval({X:X_test, Y:Y_test}))

        return parameters


# Train the model

parameters = model(X_train, Y_train, X_test, Y_test)

Result:
![image.png](https://upload-images.jianshu.io/upload_images/8636110-52fb43d5d3768581.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

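As we can see, the model overfits somewhat.

Additional notes on the functions used:
1. tf.equal(A, B) compares two matrices or vectors element-wise, returning True where the elements are equal and False otherwise; the result has the same shape as A.
2. tf.cast converts the data type, as in tf.cast(correct_prediction, "float"), where correct_prediction is originally bool and is converted to float.
3. tf.reduce_mean(A, axis=0) takes the mean; axis=0 averages down each column and axis=1 averages across each row.
4. accuracy.eval({X:X_train, Y:Y_train}) evaluates the tensor in the current default session; inside the with tf.Session() as sess: block of model() it is equivalent to:
``` py
sess.run(accuracy, feed_dict={X:X_train,Y:Y_train})
```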

Insights:

Your model seems big enough to fit the training set well. However, given the difference between train and test accuracy, you could try to add L2 or dropout regularization to reduce overfitting.
Think about the session as a block of code to train the model. Each time you run the session on a minibatch, it trains the parameters. In total you have run the session a large number of times (1500 epochs) until you obtained well trained parameters.

2.7- Reduce overfitting

Add dropout, keep_prob=0.9, Result:

The result is much worse. A likely reason is that the network is shallow and has few hidden units to begin with, so dropout is not a good fit here.

Summary

What you should remember:

Tensorflow is a programming framework used in deep learning
The two main object classes in tensorflow are Tensors and Operators.
When you code in tensorflow you have to take the following steps:
- Create a graph containing Tensors (Variables, Placeholders …) and Operations (tf.matmul, tf.add, …)
- Create a session
- Initialize the session
- Run the session to execute the graph
You can execute the graph multiple times as you’ve seen in model()
The backpropagation and optimization is automatically done when running the session on the “optimizer” object.