Week 11: Deep Convolutional Neural Networks

Why look at case studies?

Just as we learn programming by reading other people's code, studying examples of effective network components that others have built is a great way to improve. In fact, neural network architectures that perform well on computer vision tasks often transfer well to other tasks too.

This week's outline:
image.png

Although my own focus is NLP rather than CV, I expect studying this material will still give me some useful insights.

Classic Networks

LeNet-5

Paper: 1998, Gradient-based learning applied to document recognition

LeNet-5 was designed for grayscale images, so its input size is 32x32x1. Its structure is very similar to what we covered in the previous post. Since the paper dates from 1998, it uses average pooling (more common at the time) and no padding.
image.png

As the network gets deeper, the number of channels keeps increasing while n_H and n_W keep shrinking. This is one reason modern convolutional networks add padding.
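As a quick reminder, the output-size formula from last week's notes (restated here for reference) is:

$$n_{\text{out}} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1$$

With no padding (p = 0), each 5x5, stride-1 convolution in LeNet-5 shrinks a 32x32 input to 28x28, and each 2x2, stride-2 average pooling then halves that to 14x14.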

AlexNet

Paper: 2012, ImageNet classification with deep convolutional neural networks

image.png

AlexNet has the following characteristics:

  1. Similar to LeNet-5, but much larger
  2. Uses the ReLU activation function, whereas LeNet-5 used sigmoid and tanh
  3. Because computing power was limited at the time, AlexNet was trained across multiple GPUs, with different layers placed on different GPUs
  4. Also used Local Response Normalization (LRN), which is rarely used nowadays, so we skip it.

AlexNet is also very large, with roughly 60 million parameters in total.

VGG-16

VGG has relatively few hyperparameters; it is a simple network that focuses entirely on stacking convolutional layers. Its greatest strength is its simplified, uniform architecture.

image.png

As you can see, all the convolution filters and pooling layers use the same settings. Note that the "x2" and "x3" in the figure indicate how many convolutions are performed in a row. Although the network looks deep and has about 138 million parameters (quite large), its simplicity makes it easy to understand, which is why so many people favor it. Also notice that the image height and width shrink in a regular pattern (224->112->56->28->14->7) while the number of channels grows just as regularly (64->128->256->512), a regularity Professor Ng praised.
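To make the "x2" notation concrete, here is a minimal Keras sketch of just the first two VGG-16 blocks (Keras is chosen to match this week's assignments; this only illustrates the repeated CONV=3x3-same / POOL=2x2-stride-2 pattern, not the full network):

from keras.layers import Input, Conv2D, MaxPooling2D
from keras.models import Model

X_input = Input((224, 224, 3))
# [CONV 64] x2: 224x224x3 -> 224x224x64
X = Conv2D(64, (3, 3), padding='same', activation='relu')(X_input)
X = Conv2D(64, (3, 3), padding='same', activation='relu')(X)
# POOL: 224x224x64 -> 112x112x64
X = MaxPooling2D((2, 2), strides=(2, 2))(X)
# [CONV 128] x2: 112x112x64 -> 112x112x128
X = Conv2D(128, (3, 3), padding='same', activation='relu')(X)
X = Conv2D(128, (3, 3), padding='same', activation='relu')(X)
# POOL: 112x112x128 -> 56x56x128
X = MaxPooling2D((2, 2), strides=(2, 2))(X)

Model(inputs=X_input, outputs=X).summary()  # prints the shapes annotated above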

There is also a larger variant, VGG-19, but since VGG-16 performs nearly as well, most people use VGG-16.

Recommended paper-reading order: AlexNet -> VGG -> LeNet-5

Residual Networks (ResNet)

Paper: 2015, Deep residual learning for image recognition

ResNet is built out of residual blocks, so let's introduce the residual block first.

In an ordinary neural network, computing a[l+2] from a[l] goes through the following stages:
image.png
After receiving the input a[l], we pass through LINEAR -> ReLU to get the next layer's activation a[l+1], then repeat the same operations to get a[l+2]. In a residual network, this main path changes.

We copy a[l] to a deeper layer of the network and add it in just before the ReLU nonlinearity; this path is called a "shortcut" or "skip connection". So a[l] is injected after the LINEAR step but before the ReLU. Diagrammatically:
image.png
As shown, the expression for the new a[l+2] in the bottom right defines a residual block. In practice, a[l] can skip one or several layers, passing information to deeper layers of the network. The inventors of ResNet found that residual blocks make it possible to train much deeper networks, so building a ResNet simply means stacking many residual blocks into a deep network.
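Written out, these are the equations from the figure (with g the ReLU activation):

$$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}, \qquad a^{[l+1]} = g(z^{[l+1]})$$
$$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}, \qquad a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$$

The only change from a plain network is the extra $+\,a^{[l]}$ inside the final nonlinearity.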

image.png
As shown above, this residual network is formed by connecting 5 residual blocks.

If we train a plain network (no residual blocks) with a standard optimization algorithm, empirically we find that as depth increases, training error first decreases and then increases, even though in theory deeper networks should only achieve lower training error. For plain networks, the deeper they get, the harder they are for the optimizer to train, so training error grows. With ResNet, training error keeps decreasing as the network gets deeper. This helps with the vanishing and exploding gradient problems and lets us train much deeper networks while maintaining good performance.

Why does ResNet work well?

Here is an example that explains why ResNet performs well, or at least shows how to build deeper ResNets without hurting training-set performance. In general, a network has to do well on the training set before it can do well on a hold-out cross-validation set or the dev/test set.

When training plain networks, we find that deeper networks do worse on the training set, so we usually avoid very deep networks. That rule does not apply to ResNet. Consider the following example:
image.png
As shown, we add two layers to the Big NN together with a residual block, so the output a[l+2] is the output of the residual block. Notice that if we use L2 regularization or weight decay, it shrinks the value of W[l+2]. If W[l+2] = 0 and b[l+2] = 0, those terms vanish and g(a[l]) = a[l].
image.png
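Written out, the derivation in the figure is:

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

so if $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$, this collapses to $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$, since ReLU leaves a non-negative activation unchanged.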
This means that even after adding these two layers, the network is no worse than the simpler network, because learning the identity function is easy for a residual block. Whether the residual block is added in the middle or at the end of the network, it won't hurt performance. Professor Ng argues that residual networks work well because these residual blocks can easily learn the identity function (which is hard for plain deep networks), so we can be sure performance won't degrade, and it may even improve.

Another detail worth discussing: the addition assumes z[l+2] and a[l] have the same dimensions, so ResNet uses many "same" convolutions to keep the dimension of a[l] equal to the dimension of z[l+2].

Next, a comparison of the two networks.
image.png
Notice that ResNet's filters are almost all the same 3x3 "same" convolutions, which keeps the dimensions of z[l+2] and a[l] equal.

Network in Network and 1x1 convolutions

Paper: Network in Network, 2013

One particularly helpful idea when designing network architectures is the 1x1 convolution.

image.png
For a 6x6x1 input, a 1x1 convolution merely multiplies every number in the image by the single number in the filter, which achieves little. But for a 6x6x32 image, a 1x1 filter is much more useful. Concretely, the 1x1 filter walks over each of the 36 positions, takes the product of the 32 numbers in the input at that position with the 32 numbers in the filter, sums them, and then applies a ReLU nonlinearity, as illustrated here:
image.png
That is, multiply corresponding numbers, add them up, then apply ReLU.

So fundamentally, a 1x1 convolution applies a fully connected network to each of these 36 positions: it takes the 32 input numbers and outputs as many numbers as there are filters, denoted n_C[l+1]; repeating this over the 36 positions yields an output of shape 6x6x#filters. This method is usually called 1x1 convolution, and sometimes Network in Network.

Here is an application of Network in Network. Suppose we have a 28x28x192 input. We can shrink its height and width with a pooling layer, and if the number of channels is large, we can shrink that with a 1x1 convolution.
image.png

Keeping the number of channels unchanged is also possible:
image.png
And of course we can increase the number of channels as well.

So with the simple operation of a 1x1 convolution, we can shrink, preserve, or even increase the number of channels in the input, as the sketch below shows. In the next section we'll see how 1x1 convolutions help build the Inception network.
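A small sketch of these three cases (using the Keras layers from this week's assignments; the filter counts are only illustrative):

from keras.layers import Input, Conv2D
from keras.models import Model

X_input = Input((28, 28, 192))
shrink = Conv2D(32,  (1, 1), activation='relu')(X_input)  # 28x28x192 -> 28x28x32
keep   = Conv2D(192, (1, 1), activation='relu')(X_input)  # 28x28x192 -> 28x28x192
grow   = Conv2D(256, (1, 1), activation='relu')(X_input)  # 28x28x192 -> 28x28x256

Model(inputs=X_input, outputs=[shrink, keep, grow]).summary()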

Google's Inception Network

Paper: Going deeper with convolutions, 2014

When building a convolutional layer, you have to decide whether the filter should be 1x1, 3x3, or 5x5, and whether to add a pooling layer. The Inception network makes those decisions for you; it complicates the architecture, but it performs remarkably well.

The core Inception module

The core module of the Inception network looks like this:
image.png

The basic idea is that instead of hand-picking a filter size or deciding whether to pool, you let the network decide: add all the candidate operations, concatenate their outputs, and let the network learn for itself which parameters and filter combinations it needs.

The problem of computational cost

image.png
The total number of multiplications is the number of multiplications per output value (5x5x192) times the number of output values (28x28x32), which comes to about 120 million. Even on modern hardware, 120 million multiplications is expensive.

Using 1x1 convolution

image.png

image.png

The idea is to compress the large input layer on the left into a smaller intermediate layer (the bottleneck layer), then apply a convolution to produce the desired output size. Now let's look at the computational cost.

The first convolution costs the number of output values (28x28x16) times the multiplications per output value (192), about 2.4 million. The second convolution costs the number of output values (28x28x32) times the multiplications per output value (5x5x16), about 10 million. The total, 12.4 million, is roughly a tenth of the original cost. The number of additions is similar to the number of multiplications, which is why we only count multiplications.
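A quick sanity check of this arithmetic in Python:

# Direct 5x5 convolution: 28x28x32 output values, 5x5x192 multiplications each
direct   = 28 * 28 * 32 * 5 * 5 * 192   # 120,422,400  (~120M)

# Bottleneck version: 1x1 conv down to 16 channels, then 5x5 conv up to 32
conv_1x1 = 28 * 28 * 16 * 1 * 1 * 192   #   2,408,448  (~2.4M)
conv_5x5 = 28 * 28 * 32 * 5 * 5 * 16    #  10,035,200  (~10M)

print(direct, conv_1x1 + conv_5x5)      # 120422400 12443648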

To summarize: if you don't want to decide whether a layer should use 1x1, 3x3, or 5x5 filters, or whether to pool, the Inception module lets you apply them all and concatenate the results. And it turns out that as long as the bottleneck layer is built sensibly, you can shrink the representation significantly without hurting performance, saving a great deal of computation.

The full architecture

In the previous sections we covered the basic building block of the Inception network. In this section we'll see how to combine these modules into the full Inception network.

A single Inception module:
image.png
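A minimal Keras sketch of such a module (the filter counts are illustrative; each branch keeps the 28x28 spatial size with 'same' padding so the outputs can be concatenated along the channels axis):

from keras.layers import Input, Conv2D, MaxPooling2D, concatenate
from keras.models import Model

X_input = Input((28, 28, 192))

branch1 = Conv2D(64, (1, 1), padding='same', activation='relu')(X_input)   # 1x1 branch

branch2 = Conv2D(96, (1, 1), padding='same', activation='relu')(X_input)   # 1x1 bottleneck...
branch2 = Conv2D(128, (3, 3), padding='same', activation='relu')(branch2)  # ...then 3x3

branch3 = Conv2D(16, (1, 1), padding='same', activation='relu')(X_input)   # 1x1 bottleneck...
branch3 = Conv2D(32, (5, 5), padding='same', activation='relu')(branch3)   # ...then 5x5

branch4 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(X_input)    # pooling...
branch4 = Conv2D(32, (1, 1), padding='same', activation='relu')(branch4)   # ...then 1x1 to control channels

# Concatenate along the channels axis: 64+128+32+32 = 256 channels
X = concatenate([branch1, branch2, branch3, branch4], axis=3)
model = Model(inputs=X_input, outputs=X)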

The Inception network simply chains many of these Inception modules together:
image.png
You can see many Inception modules, plus some extra max-pooling layers that change the height and width. So the full network really is just many Inception modules combined at different positions.
image.png

The Inception network also has some extra side branches. These branches take a hidden layer and use it to make a prediction: they pass it through a few fully connected layers and then a softmax that predicts the output label. This ensures that even the hidden units and intermediate layers are computing features good enough to classify an image; it has a regularizing effect on the Inception network and helps prevent overfitting.
image.png

To conclude: once you understand the Inception module, you understand the Inception network, which is essentially Inception modules chained one after another. Since the module was introduced, researchers have kept developing it, producing many newer versions; when you read recent Inception papers you'll see these variants work just as well, for example Inception V2, V3, and V4, and there is even a version that adds skip connections (as in ResNet) with excellent results. But all the variants build on the same basic idea: connect many Inception modules together in some way.

Next, we'll look at how to actually use these algorithms to build your own computer vision system.

Using open-source implementations

So far we have studied several highly effective neural network and ConvNet architectures. Here is some practical advice on using them, starting with open-source implementations.

Many of these networks turn out to be intricate and hard to replicate, because tuning details such as learning rate decay affect performance. Fortunately, many deep learning researchers open-source their work on GitHub, so if you read a paper whose results you'd like to apply, the usual move is to look for an open-source implementation online, which beats reimplementing it yourself.

A common workflow, then: pick an architecture you like, find an open-source implementation, download it from GitHub, and adapt it.

Transfer Learning

If you are building a computer vision application, you will usually make faster progress by downloading weights that someone else has already trained than by training from scratch with randomly initialized weights. You then transfer the pretrained model to the task you care about, using transfer learning to move knowledge from a large public dataset to your own problem.

A small dataset

An example: suppose we want to classify cat pictures into Tigger, Misty, and Neither, but we don't have many cat pictures, i.e., our dataset is small. Professor Ng recommends downloading an open-source implementation of a network, and downloading not just the code but also the weights.

For instance, most networks trained on ImageNet, which has 1000 classes, end in a softmax classifier over those 1000 classes. We can remove that final layer and create our own softmax unit.
image.png
As shown, we usually keep the earlier pretrained weights fixed (frozen) and train only the softmax layer's weights. With someone else's pretrained weights we can often get good performance even on a small dataset. Conveniently, many deep learning frameworks support this: depending on the framework, you might set something like trainableParameter=0 or freeze=1 so that the pretrained weights do not participate in training, i.e., the framework lets you specify whether to train the weights of a particular layer.
image.png

Another trick to speed up training: since the frozen part of the network never changes, we can precompute the activations of the layer feeding into the softmax for every example, save them to disk, and then train only the softmax classifier on these saved features.
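A minimal sketch of this workflow in Keras, assuming we start from the pretrained ResNet50 in keras.applications (any pretrained base would work the same way):

from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Pretrained convolutional base, without the original 1000-way ImageNet softmax
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze all pretrained layers so only the new head is trained
for layer in base.layers:
    layer.trainable = False

# New 3-way softmax head for Tigger / Misty / Neither
X = GlobalAveragePooling2D()(base.output)
predictions = Dense(3, activation='softmax')(X)

model = Model(inputs=base.input, outputs=predictions)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])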

A larger dataset

As a rule of thumb, if you have a larger labeled dataset, you can freeze fewer layers and train the later layers. The pattern: the bigger your dataset, the fewer layers you freeze and the more layers you can train, as sketched below.
image.png
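Continuing the hypothetical sketch above, unfreezing the last few layers is just a change to the trainable flags (how many layers to unfreeze is a judgment call that depends on your dataset size):

# Freeze only the earlier layers; train the last 10 layers plus the new head
for layer in base.layers[:-10]:
    layer.trainable = False
for layer in base.layers[-10:]:
    layer.trainable = True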

And if your dataset is very large, you can use the open-source network and its weights merely as an initialization and then train the whole network (adapting the output layer to your needs, of course).

Data Augmentation

In practice, more data helps most computer vision tasks, unlike some other domains where extra data sometimes brings little benefit. Today, the main constraint in computer vision is that we cannot get enough data, which means data augmentation is likely to help when training computer vision models.

image.png

image.png
(Color shifting mainly addresses variations in lighting.)

Implementing distortion during training

image.png
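A sketch of how on-the-fly distortions might look with Keras's ImageDataGenerator (the specific distortion settings here are arbitrary examples, not recommendations):

from keras.preprocessing.image import ImageDataGenerator

# CPU threads apply random distortions while the model trains on the previous batch
datagen = ImageDataGenerator(
    rotation_range=20,         # random rotations
    width_shift_range=0.1,     # random horizontal shifts
    height_shift_range=0.1,    # random vertical shifts
    horizontal_flip=True,      # random mirroring
    channel_shift_range=30.0)  # a simple form of color shifting

# Assumes X_train, Y_train as in the assignments below
model.fit_generator(datagen.flow(X_train, Y_train, batch_size=32),
                    steps_per_epoch=len(X_train) // 32, epochs=10)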

The current state of computer vision

image.png

These techniques are usually not used in real production systems, only in competitions or to boost benchmark results:
image.png

image.png

This week's assignments

Keras Tutorial - the Happy House

Keras is more restrictive than the lower-level frameworks, so there are some very complex models that you can implement in TensorFlow but only with greater difficulty in Keras. That being said, Keras will work fine for many common models.

Import packages:

import numpy as np
#import tensorflow as tf
from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from kt_utils import *

import keras.backend as K
K.set_image_data_format('channels_last')
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

%matplotlib inline

1- The Happy House

For your next vacation, you decided to spend a week with five of your friends from school. It is a very convenient house with many things to do nearby. But the most important benefit is that everybody has committed to be happy when they are in the house. So anyone wanting to enter the house must prove their current state of happiness.

As a deep learning expert, to make sure the "Happy" rule is strictly applied, you are going to build an algorithm that uses pictures from the front door camera to check if the person is happy or not. The door should open only if the person is happy.

You have gathered pictures of your friends and yourself, taken by the front-door camera. The dataset is labeled.
image.png

Run the following code to normalize the dataset and learn about its shapes.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Reshape
Y_train = Y_train_orig.T
Y_test = Y_test_orig.T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))

Details of the “Happy” dataset:

  • Images are of shape (64,64,3)
  • Training: 600 pictures
  • Test: 150 pictures

2- Building a model in Keras

Keras is very good for rapid prototyping. In just a short time you will be able to build a model that achieves outstanding results.

Here is an example of a model in Keras:

def model(input_shape):
    # Define the input placeholder as a tensor with shape input_shape.
    X_input = Input(input_shape)

    # Zero-Padding: pads the border of X_input with zeroes
    X = ZeroPadding2D((3, 3))(X_input)

    # CONV -> BN -> RELU Block applied to X
    X = Conv2D(32, (7, 7), strides=(1, 1), name='conv0')(X)
    X = BatchNormalization(axis=3, name='bn0')(X)
    X = Activation('relu')(X)

    # MAXPOOL
    X = MaxPooling2D((2, 2), name='max_pool')(X)

    # FLATTEN X (means convert it to a vector) + FULLYCONNECTED
    X = Flatten()(X)
    X = Dense(1, activation='sigmoid', name='fc')(X)

    # Create model. This creates your Keras model instance, you'll use this instance to train/test the model.
    model = Model(inputs=X_input, outputs=X, name='HappyModel')

    return model

Note that Keras uses a different convention with variable names than we’ve previously used with numpy and TensorFlow. In particular, rather than creating and assigning a new variable on each step of forward propagation such as X, Z1, A1, Z2, A2, etc. for the computations for the different layers, in Keras code each line above just reassigns X to a new value using X = .... In other words, during each step of forward propagation, we are just writing the latest value in the computation into the same variable X. The only exception was X_input, which we kept separate and did not overwrite, since we needed it at the end to create the Keras model instance (model = Model(inputs = X_input, ...) above).

Exercise: Implement a HappyModel(). This assignment is more open-ended than most. We suggest that you start by implementing a model using the architecture we suggest, and run through the rest of this assignment using that as your initial model. But after that, come back and take initiative to try out other model architectures. For example, you might take inspiration from the model above, but then vary the network architecture and hyperparameters however you wish. You can also use other functions such as AveragePooling2D(), GlobalMaxPooling2D(), Dropout().

Note: You have to be careful with your data’s shapes. Use what you’ve learned in the videos to make sure your convolutional, pooling and fully-connected layers are adapted to the volumes you’re applying them to.

def HappyModel(input_shape):
    """
    Implementation of the HappyModel.

    Arguments:
    input_shape -- shape of the images of the dataset

    Returns:
    model -- a Model() instance in Keras
    """
    ### START CODE HERE ###
    # Feel free to use the suggested outline in the text above to get started, and run through the whole
    # exercise (including the later portions of this notebook) once. Then come back and try out other
    # network architectures as well.
    X_input = Input(shape=input_shape)

    X = ZeroPadding2D(padding=(1, 1))(X_input)
    X = Conv2D(8, kernel_size=(3, 3), strides=(1, 1))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)
    X = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(X)

    X = ZeroPadding2D(padding=(1, 1))(X)
    X = Conv2D(16, kernel_size=(3, 3), strides=(1, 1))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)
    X = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(X)

    X = ZeroPadding2D(padding=(1, 1))(X)
    X = Conv2D(32, kernel_size=(3, 3), strides=(1, 1))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)
    X = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid')(X)

    # FC
    X = Flatten()(X)
    Y = Dense(1, activation='sigmoid')(X)

    model = Model(inputs=X_input, outputs=Y, name='HappyModel')

    return model

You have now built a function to describe your model. To train and test this model, there are four steps in Keras:

  1. Create the model by calling the function above
  2. Compile the model by calling model.compile(optimizer = "...", loss = "...", metrics = ["accuracy"])
  3. Train the model on train data by calling model.fit(x = ..., y = ..., epochs = ..., batch_size = ...)
  4. Test the model on test data by calling model.evaluate(x = ..., y = ...)

If you want to know more about model.compile(), model.fit(), model.evaluate() and their arguments, refer to the official Keras documentation.

### START CODE HERE ### (1 line)
from keras import optimizers  # needed for the Adam optimizer used below

# Step 1: create the model.
happyModel = HappyModel((64, 64, 3))

# Step 2: compile the model to configure the learning process.
happyModel.compile(optimizer=optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0),
                   loss='binary_crossentropy', metrics=['accuracy'])

# Step 3: train the model.
happyModel.fit(x=X_train, y=Y_train, batch_size=16, epochs=20)

# Step 4: test/evaluate the model.
preds = happyModel.evaluate(x=X_test, y=Y_test)

print(preds)
print("Loss = " + str(preds[0]))
print("Test Accuracy = " + str(preds[1]))
### END CODE HERE ###

The test output is:
image.png

3- Conclusion

What we would like you to remember from this assignment:

  • Keras is a tool we recommend for rapid prototyping. It allows you to quickly try out different model architectures. Are there any applications of deep learning to your daily life that you’d like to implement using Keras?
  • Remember how to code a model in Keras and the four steps leading to the evaluation of your model on the test set. Create->Compile->Fit/Train->Evaluate/Test.

Test with your own image (Optional)

### START CODE HERE ###
img_path = 'images/my_image.jpg'
### END CODE HERE ###
img = image.load_img(img_path, target_size=(64, 64))
imshow(img)

x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

print(happyModel.predict(x))
5- Other useful functions in Keras (Optional)

Two other basic features of Keras that you'll find useful are:

  • model.summary(): prints the details of your layers in a table with the sizes of its inputs/outputs
  • plot_model(): plots your graph in a nice layout. You can even save it as ".png" using SVG() if you'd like to share it on social media ;). It is saved in "File" then "Open..." in the upper bar of the notebook.

image.png
image.png

Working through this is a good way to consolidate your convolution dimension calculations.

Residual Network

In this assignment, you will:

  • Implement the basic building blocks of ResNets.
  • Put together these building blocks to implement and train a state-of-the-art neural network for image classification.

Import packages:
import numpy as np
import tensorflow as tf
from keras import layers
from keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from keras.models import Model, load_model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from resnets_utils import *
from keras.initializers import glorot_uniform
import scipy.misc
from matplotlib.pyplot import imshow
%matplotlib inline

import keras.backend as K
K.set_image_data_format('channels_last')
K.set_learning_phase(1)

1- The problem of very deep neural networks

Last week, you built your first convolutional neural network. In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow. More specifically, during gradient descent, as you backprop from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and “explode” to take very large values).

During training, you might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:

image.png

You are now going to solve this problem by building a Residual Network!

2- Building a Residual Network

In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly backpropagated to earlier layers:

image.png

The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, you can form a very deep network.

We also saw in lecture that having ResNet blocks with the shortcut also makes it very easy for one of the blocks to learn an identity function. This means that you can stack on additional ResNet blocks with little risk of harming training set performance. (There is also some evidence that the ease of learning an identity function—even more than skip connections helping with vanishing gradients—accounts for ResNets’ remarkable performance.)

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same or different. You are going to implement both of them.

2.1- The identity block

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a[l]) has the same dimension as the output activation (say a[l+2]).
image.png

In this exercise, you’ll actually implement a slightly more powerful version of this identity block, in which the skip connection “skips over” 3 hidden layers rather than 2 layers. It looks like this:

Here’re the individual steps.

First component of main path:

  • The first CONV2D has $F_1$ filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2a'. Use 0 as the seed for the random initialization.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has $F_2$ filters of shape $(f,f)$ and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'. Use 0 as the seed for the random initialization.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has $F_3$ filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'. Use 0 as the seed for the random initialization.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Final step:

  • The shortcut and the input are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.
def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block as defined in Figure 4

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network

    Returns:
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    """
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value. You'll need this later to add back to the main path.
    X_shortcut = X

    # First component of main path
    # The `filters` argument here is the number of convolution kernels.
    X = Conv2D(filters=F1, kernel_size=(1,1), strides=(1,1), padding='valid', name=conv_name_base+'2a', kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base+'2a')(X)
    X = Activation('relu')(X)

    # Second component of main path (≈3 lines)
    X = Conv2D(filters=F2, kernel_size=(f,f), strides=(1,1), padding='same', name=conv_name_base+'2b', kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base+'2b')(X)
    X = Activation('relu')(X)

    # Third component of main path (≈2 lines)
    X = Conv2D(filters=F3, kernel_size=(1,1), strides=(1,1), padding='valid', name=conv_name_base+'2c', kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base+'2c')(X)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X

Test:

tf.reset_default_graph()
with tf.Session() as sess:
    np.random.seed(1)
    A_prev = tf.placeholder("float", [3, 4, 4, 6])
    X = np.random.randn(3, 4, 4, 6)
    A = identity_block(A_prev, f=2, filters=[2, 4, 6], stage=1, block='a')
    sess.run(tf.global_variables_initializer())
    out = sess.run([A], feed_dict={A_prev: X, K.learning_phase(): 0})
    print("out = " + str(out[0][1][1][0]))

2.2- The convolutional block

You’ve implemented the ResNet identity block. Next, the ResNet “convolutional block” is the other type of block. You can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:

image.png

The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. (This plays a similar role as the matrix $W_s$ discussed in lecture.) For example, to reduce the activation dimensions’s height and width by a factor of 2, you can use a 1x1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.

The details of the convolutional block are as follows.

First component of main path:

  • The first CONV2D has $F_1$ filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '2a'.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has $F_2$ filters of (f,f) and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has $F_3$ filters of (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Shortcut path:

  • The CONV2D has $F_3$ filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '1'.
  • The BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '1'.

Final step:

  • The shortcut and the main path values are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Exercise: Implement the convolutional block. We have implemented the first component of the main path; you should implement the rest. As before, always use 0 as the seed for the random initialization, to ensure consistency with our grader.

def convolutional_block(X, f, filters, stage, block, s=2):
    """
    Implementation of the convolutional block as defined in Figure 4

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- integer, specifying the stride of the first CONV in the main path and of the shortcut CONV

    Returns:
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    """

    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value. You'll need this later to add back to the main path.
    X_shortcut = X

    ##### MAIN PATH #####
    # First component of main path
    X = Conv2D(filters=F1, kernel_size=(1,1), strides=(s,s), padding='valid', name=conv_name_base+'2a', kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base+'2a')(X)
    X = Activation('relu')(X)

    # Second component of main path (≈3 lines)
    X = Conv2D(filters=F2, kernel_size=(f,f), strides=(1,1), padding='same', name=conv_name_base+'2b', kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base+'2b')(X)
    X = Activation('relu')(X)

    # Third component of main path (≈2 lines)
    X = Conv2D(filters=F3, kernel_size=(1,1), strides=(1,1), padding='valid', name=conv_name_base+'2c', kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base+'2c')(X)

    ##### SHORTCUT PATH #### (≈2 lines)
    X_shortcut = Conv2D(filters=F3, kernel_size=(1,1), strides=(s,s), padding='valid', name=conv_name_base+'1', kernel_initializer=glorot_uniform(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis=3, name=bn_name_base+'1')(X_shortcut)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X

3- Building your first ResNet model (50 layers)

You now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. “ID BLOCK” in the diagram stands for “Identity block,” and “ID BLOCK x3” means you should stack 3 identity blocks together.

image.png

The details of this ResNet-50 model are:

  • Zero-padding pads the input with a pad of (3,3)
  • Stage 1:
    • The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is “conv1”.
    • BatchNorm is applied to the channels axis of the input.
    • MaxPooling uses a (3,3) window and a (2,2) stride.
  • Stage 2:
    • The convolutional block uses three sets of filters of size [64,64,256], “f” is 3, “s” is 1 and the block is “a”.
    • The 2 identity blocks use three sets of filters of size [64,64,256], “f” is 3 and the blocks are “b” and “c”.
  • Stage 3:
    • The convolutional block uses three sets of filters of size [128,128,512], “f” is 3, “s” is 2 and the block is “a”.
    • The 3 identity blocks use three sets of filters of size [128,128,512], “f” is 3 and the blocks are “b”, “c” and “d”.
  • Stage 4:
    • The convolutional block uses three sets of filters of size [256, 256, 1024], “f” is 3, “s” is 2 and the block is “a”.
    • The 5 identity blocks use three sets of filters of size [256, 256, 1024], “f” is 3 and the blocks are “b”, “c”, “d”, “e” and “f”.
  • Stage 5:
    • The convolutional block uses three sets of filters of size [512, 512, 2048], “f” is 3, “s” is 2 and the block is “a”.
    • The 2 identity blocks use three sets of filters of size [256, 256, 2048], “f” is 3 and the blocks are “b” and “c”.
  • The 2D Average Pooling uses a window of shape (2,2) and its name is “avg_pool”.
  • The flatten doesn’t have any hyperparameters or name.
  • The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

Exercise: Implement the ResNet with 50 layers described in the figure above. We have implemented Stages 1 and 2. Please implement the rest. (The syntax for implementing Stages 3-5 should be quite similar to that of Stage 2.) Make sure you follow the naming convention in the text above.

You’ll need to use AveragePooling2D for the average pooling step; the other functions used in the code below (Conv2D, BatchNormalization, MaxPooling2D, and the two blocks you implemented above) have already been introduced.

def ResNet50(input_shape=(64, 64, 3), classes=6):
    """
    Implementation of the popular ResNet50 with the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)

    # Stage 1
    X = Conv2D(64, kernel_size=(7,7), strides=(2,2), name="conv1", kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name="bn_conv1")(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3,3), strides=(2,2))(X)

    # Stage 2
    X = convolutional_block(X, f=3, filters=[64,64,256], stage=2, block='a', s=1)
    X = identity_block(X, 3, [64,64,256], stage=2, block='b')
    X = identity_block(X, 3, [64,64,256], stage=2, block='c')

    # Stage 3
    X = convolutional_block(X, f=3, filters=[128,128,512], stage=3, block='a', s=2)
    X = identity_block(X, 3, [128,128,512], stage=3, block='b')
    X = identity_block(X, 3, [128,128,512], stage=3, block='c')
    X = identity_block(X, 3, [128,128,512], stage=3, block='d')

    # Stage 4
    X = convolutional_block(X, f=3, filters=[256,256,1024], stage=4, block='a', s=2)
    X = identity_block(X, 3, [256,256,1024], stage=4, block='b')
    X = identity_block(X, 3, [256,256,1024], stage=4, block='c')
    X = identity_block(X, 3, [256,256,1024], stage=4, block='d')
    X = identity_block(X, 3, [256,256,1024], stage=4, block='e')
    X = identity_block(X, 3, [256,256,1024], stage=4, block='f')

    # Stage 5
    X = convolutional_block(X, f=3, filters=[512,512,2048], stage=5, block='a', s=2)
    X = identity_block(X, 3, [256,256,2048], stage=5, block='b')
    X = identity_block(X, 3, [256,256,2048], stage=5, block='c')

    # AVGPOOL
    X = AveragePooling2D(pool_size=(2,2), name='avg_pool')(X)

    # Output layer (per the spec above: flatten, then a softmax over `classes` outputs)
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer=glorot_uniform(seed=0))(X)

    # Create model
    model = Model(inputs=X_input, outputs=X, name='ResNet50')

    return model

Next come the same four steps as before:

model = ResNet50(input_shape = (64, 64, 3), classes = 6)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs = 20, batch_size = 32)
preds = model.evaluate(X_test, Y_test)

ResNet50 is a powerful model for image classification when it is trained for an adequate number of iterations. We hope you can use what you’ve learnt and apply it to your own classification problem to achieve state-of-the-art accuracy.