Week 2 - Neural Networks Basics

Next we move on to the basics of neural networks. I have studied most of this material before, so this serves as a review.

Binary Classification

An image is represented by its red, green and blue channels; from these inputs we predict whether the image is a cat.
image.png

The cat image is unrolled into an input feature vector:
image.png

The corresponding notation:
image.png

Note that X is formed by stacking the column vector of each example side by side (one column per example).

Logistic Regression

Logistic regression is used for binary classification problems. Note that P(y=1|x) is the probability that y equals 1 given the input x. For the output shown in the figure below, we could start with a plain linear function, but since we want the output to be a probability and w^T x + b can be greater than 1 or less than 0, we apply the sigmoid function to squash the value into (0, 1).
image.png
The part in red is an alternative notation that treats b as one of the parameters, theta_0. When implementing neural networks it is better to use the notation in blue, keeping w and b separate.

A clearer illustration:
image.png
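To make the output formula concrete, here is a minimal numpy sketch (the variable names and toy numbers are my own, not from the lecture) computing a = sigmoid(w^T x + b) for a single example:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy example: n_x = 3 features, a single example x
w = np.array([[0.2], [-0.5], [0.1]])   # shape (3, 1)
b = 0.4
x = np.array([[1.0], [2.0], [3.0]])    # shape (3, 1)

z = np.dot(w.T, x) + b                 # (1, 1) array holding w^T x + b
a = sigmoid(z)                         # predicted probability P(y=1|x), in (0, 1)
print(a)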

Logistic Regression Cost Function

image.png
The derivation is shown in the figure above. Notice that the first loss function we might write down is the squared error, but the resulting optimization problem is non-convex and has multiple local optima, so gradient descent may fail to find the global optimum. That is why the loss function needs more thought.

  1. The loss function is defined as L(y', y) = -(y log y' + (1-y) log(1-y')). When y=1 we want y' to be as large as possible; when y=0 we want y' to be as small as possible. The loss function applies to a single training example.
  2. The cost function, as written in the figure above, is the average loss over the entire training set. Our goal is to find parameters w and b that minimize the cost function. A small numerical sketch follows this list.
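The sketch below evaluates the loss on individual examples and the cost as their average; the toy predictions and labels are made up for illustration:

import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# cost = average loss over a toy "training set"
Y_hat = np.array([0.9, 0.2, 0.7, 0.4])   # predictions a^(i)
Y     = np.array([1,   0,   1,   0  ])   # labels y^(i)
cost = np.mean(loss(Y_hat, Y))
print(cost)   # smaller when the predictions agree with the labels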

Gradient Descent

With gradient descent, the parameters can be initialized randomly; because the corresponding cost function is convex, we will always reach the global optimum.
image.png

The update rule:
image.png
Pay attention to the derivative notation: when the function has two or more variables, the curly partial symbol ∂ is used; with a single variable, d is enough. When implementing the derivatives in code, we simply write them as dw and db.

Computation Graph

A simple example of a computation graph:
image.png

Here, the computation graph is the left-to-right computation drawn with the blue arrows.

Derivatives with a Computation Graph

In this section Professor Ng uses a simple example to introduce the chain rule and the idea of backpropagation, and also notes that in code the derivative with respect to some variable can simply be written as da, dvar, and so on. We first compute dv, then use dv to obtain da and du, and then from du we go on to compute db and dc; that is backpropagation.
image.png

image.png

So a computation graph computes the cost function J with a left-to-right (forward) pass, and then computes the derivatives with a right-to-left (backward) pass.
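To make the two passes concrete, here is a small numeric sketch using the lecture's example J = 3(a + bc) with a = 5, b = 3, c = 2 (values as I recall them from the video):

# forward pass (left to right)
a, b, c = 5, 3, 2
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# backward pass (right to left), applying the chain rule
dv = 3             # dJ/dv
da = dv * 1        # dJ/da = dJ/dv * dv/da = 3
du = dv * 1        # dJ/du = 3
db = du * c        # dJ/db = dJ/du * du/db = 6
dc = du * b        # dJ/dc = 9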

Logistic Regression Gradient Descent

This section walks through gradient descent for logistic regression on a single example. As in the figure, the derivatives are computed from right to left using the chain rule.
image.png

Gradient Descent on m Examples

First, recall the logistic regression cost function:
image.png

Suppose we want dw1; the corresponding formula is shown below. Notice that we accumulate the dw contribution of every example and finally take the average.
image.png

This leads to the following update procedure:
image.png

However, this procedure has two drawbacks: it loops over all examples (i = 1, 2, ..., m) and it loops over all features (dw1, dw2, ...). In other words it needs two for-loops, and explicit for-loops in code are inefficient. The solution is vectorization.
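For reference, here is a sketch of the non-vectorized version with both loops written out explicitly (two features, as in the lecture; the toy data and variable names are mine):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: n = 2 features, m = 4 examples stored as columns of X
X = np.array([[1.0, 2.0, 0.5, 1.5],
              [0.0, 1.0, 1.0, 2.0]])
Y = np.array([0, 1, 0, 1])
w = np.zeros(2)
b = 0.0
m = X.shape[1]

J = 0.0
dw1, dw2, db = 0.0, 0.0, 0.0
for i in range(m):                        # loop 1: over the m examples
    z = w[0] * X[0, i] + w[1] * X[1, i] + b
    a = sigmoid(z)
    J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    dz = a - Y[i]
    dw1 += X[0, i] * dz                   # loop 2 over features, written out line by line
    dw2 += X[1, i] * dz
    db += dz
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m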

Vectorization

Vectorization is essentially the art of getting rid of explicit for-loops in your code.
image.png

In the figure, the left side is the non-vectorized version, which implements the matrix product with a for-loop, while the right side is the vectorized version using Python's numpy library. Below, both versions are implemented in a Jupyter notebook to compare their running times:
image.png

Clearly the vectorized computation is much faster. Scalable deep learning is usually run on GPUs, whereas the Jupyter notebook here runs on a CPU. GPUs excel at SIMD instructions (Single Instruction, Multiple Data), and CPUs are not bad at them either, which is why numpy's vectorization speeds up the code so much. This gives us a rule of thumb: avoid explicit for-loops.
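A sketch along the lines of the lecture's timing demo (the exact numbers depend on your machine):

import time
import numpy as np

n = 1000000
a = np.random.rand(n)
b = np.random.rand(n)

tic = time.time()
c = np.dot(a, b)                 # vectorized inner product
toc = time.time()
print("Vectorized:", 1000 * (toc - tic), "ms")

tic = time.time()
c = 0.0
for i in range(n):               # explicit for-loop version of the same inner product
    c += a[i] * b[i]
toc = time.time()
print("For loop:  ", 1000 * (toc - tic), "ms")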

More examples

Neural network programming guideline: Whenever possible, avoid explicit for-loops.

  1. A matrix-vector multiplication:
    image.png

  2. Taking the exponential of every element of a vector:
    image.png

These examples teach us: when implementing code, first check whether a numpy built-in function can do the job instead of a for-loop.

Next, we vectorize the loop over the features in logistic regression:
image.png

Vectorizing Logistic Regression

Forward propagation

image.png

As the figure shows, Z and A can be obtained as matrices in a vectorized way, i.e. w^T x^(i) + b and a^(i) are computed for all examples at once, without a for-loop.
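A minimal sketch of the vectorized forward pass (shapes and toy data are my own):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# X stacks the m examples as columns: shape (n_x, m)
n_x, m = 3, 5
X = np.random.randn(n_x, m)
w = np.random.randn(n_x, 1)
b = 0.1

Z = np.dot(w.T, X) + b     # shape (1, m); b is broadcast across the m columns
A = sigmoid(Z)             # shape (1, m): all m activations at once, no for-loop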

Gradient descent

First, Professor Ng uses vectorization to compute the gradients db and dw of the parameters b and w; note that the left side is the for-loop version and the right side is the vectorized version.
image.png

Then comes the fully vectorized logistic regression algorithm, with X participating in the computation as a matrix:
image.png

Notice that the outer loop over gradient descent iterations still needs a for-loop; that part cannot be vectorized.
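Putting the pieces together, one version of the vectorized algorithm might look like the sketch below (toy data and names are mine); only the outer loop over iterations remains:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, m = 3, 5
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)   # toy 0/1 labels, shape (1, m)
w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.01                              # learning rate

for _ in range(1000):                     # this loop over iterations is not vectorized away
    Z = np.dot(w.T, X) + b                # (1, m)
    A = sigmoid(Z)                        # (1, m)
    dZ = A - Y                            # (1, m)
    dw = np.dot(X, dZ.T) / m              # (n_x, 1)
    db = np.sum(dZ) / m                   # scalar
    w = w - alpha * dw
    b = b - alpha * db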

Broadcasting in Python

First, Professor Ng gives an example of computing percentages to illustrate what broadcasting does:
image.png

The corresponding Python code is shown below. Notice that we first call the sum function, where axis=0 sums down the columns and axis=1 sums across the rows; broadcasting happens in the percentage computation. The instructor mentions that the reshape can be omitted here, but keeping it also helps guarantee that the matrix dimensions are what we expect.
image.png
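A sketch of that calculation (placeholder calorie numbers in the spirit of the slide, not necessarily its exact values):

import numpy as np

# rows: carbs / protein / fat, columns: four foods (placeholder numbers)
A = np.array([[56.0,   0.0,  4.4, 68.0],
              [ 1.2, 104.0, 52.0,  8.0],
              [ 1.8, 135.0, 99.0,  0.9]])

cal = A.sum(axis=0)                       # axis=0: sum down the columns -> shape (4,)
percentage = 100 * A / cal.reshape(1, 4)  # (3,4) / (1,4): broadcasting stretches the row
print(percentage)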

More broadcasting examples:
image.png

Some general broadcasting rules:
image.png
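A few concrete cases of these rules, as a quick sketch:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)

# (2,3) + scalar: the scalar is stretched to (2,3)
print(a + 100)

# (2,3) + (1,3): the row vector is copied down the 2 rows
print(a + np.array([[10, 20, 30]]))

# (2,3) + (2,1): the column vector is copied across the 3 columns
print(a + np.array([[100], [200]]))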

A note on python/numpy vectors

Eliminate rank-1 arrays from your code

When creating vectors, do not write this:

a = np.random.randn(5)
# a.shape == (5,)  # a rank-1 array: neither a row vector nor a column vector

If we want to turn the a above into a proper vector, we can use the reshape function:

a = a.reshape((5,1))
a = a.reshape((1,5))

Instead, write it like this:

a = np.random.randn(5, 1)  # a.shape == (5, 1), a column vector
assert(a.shape == (5, 1))  # make sure this really is a column vector; asserts are cheap to execute

a = np.random.randn(1, 5)  # a.shape == (1, 5), a row vector

Don't be shy about using reshape or assert to make sure your dimensions are correct.

Justifying the Logistic Regression Cost Function

Where the loss function comes from: minimizing L(a, y) is in fact the same as maximizing log P(y|x).
image.png

And for the cost function over the whole training set:
image.png

Assuming the training examples are independent and identically distributed, maximum likelihood estimation shows that minimizing the cost function is the same as maximizing the likelihood.
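A compact version of the argument, written out in LaTeX (this only restates the lecture's derivation):

\hat{y} = P(y=1 \mid x) \;\Rightarrow\; P(y \mid x) = \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y}

\log P(y \mid x) = y\log\hat{y} + (1-y)\log(1-\hat{y}) = -\mathcal{L}(\hat{y}, y)

\log \prod_{i=1}^{m} P\!\left(y^{(i)} \mid x^{(i)}\right)
  = \sum_{i=1}^{m} \log P\!\left(y^{(i)} \mid x^{(i)}\right)
  = -\sum_{i=1}^{m} \mathcal{L}\!\left(\hat{y}^{(i)}, y^{(i)}\right)

\text{So minimizing } J(w,b) = \tfrac{1}{m}\sum_{i=1}^{m} \mathcal{L}\!\left(\hat{y}^{(i)}, y^{(i)}\right)
\text{ maximizes the log-likelihood (the factor } \tfrac{1}{m} \text{ does not change the maximizer).}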

This Week's Assignments

Following other people's write-ups of the assignments, I also copy the problems here!

Part 1: Python Basics with Numpy (optional assignment)

What we need to Remember:

  • np.exp(x) works for any np.array x and applies the exponential function to every coordinate
  • the sigmoid function and its gradient

    # GRADED FUNCTION: sigmoid

    import numpy as np # this means you can access numpy functions by writing np.function() instead of numpy.function()

    def sigmoid(x):
        """
        Compute the sigmoid of x

        Arguments:
        x -- A scalar or numpy array of any size

        Return:
        s -- sigmoid(x)
        """
        ### START CODE HERE ### (≈ 1 line of code)
        s = 1 / (1 + np.exp(-x))
        ### END CODE HERE ###

        return s

    def sigmoid_derivative(x):
        """
        Compute the gradient (also called the slope or derivative) of the sigmoid function with respect to its input x.
        You can store the output of the sigmoid function into variables and then use it to calculate the gradient.

        Arguments:
        x -- A scalar or numpy array

        Return:
        ds -- Your computed gradient.
        """
        ### START CODE HERE ### (≈ 2 lines of code)
        s = sigmoid(x)
        ds = s * (1 - s)
        ### END CODE HERE ###

        return ds
  • image2vector is commonly used in deep learning

    # GRADED FUNCTION: image2vector
    def image2vector(image):
        """
        Argument:
        image -- a numpy array of shape (length, height, depth)

        Returns:
        v -- a vector of shape (length*height*depth, 1)
        """
        ### START CODE HERE ### (≈ 1 line of code)
        v = image.reshape(image.shape[0] * image.shape[1] * image.shape[2], 1)
        ### END CODE HERE ###

        return v
  • np.reshape is widely used. In the future, you’ll see that keeping your matrix/vector dimensions straight will go toward eliminating a lot of bugs.

  • numpy has efficient built-in functions
  • broadcasting is extremely useful
  • Note that np.dot() performs a matrix-matrix or matrix-vector multiplication. This is different from np.multiply() and the * operator (which is equivalent to .* in Matlab/Octave), which performs an element-wise multiplication.
  • np.dot(x, x) is the sum of the products of corresponding elements, i.e. the inner product.
  • Vectorization is very important in deep learning. It provides computational efficiency and clarity.
  • You have reviewed the L1 and L2 loss.

    def L1(yhat, y):
        """
        Arguments:
        yhat -- vector of size m (predicted labels)
        y -- vector of size m (true labels)

        Returns:
        loss -- the value of the L1 loss function defined above
        """
        ### START CODE HERE ### (≈ 1 line of code)
        loss = np.sum(abs(y - yhat))
        ### END CODE HERE ###

        return loss

    def L2(yhat, y):
        """
        Arguments:
        yhat -- vector of size m (predicted labels)
        y -- vector of size m (true labels)

        Returns:
        loss -- the value of the L2 loss function defined above
        """
        ### START CODE HERE ### (≈ 1 line of code)
        loss = np.dot(y - yhat, y - yhat)
        ### END CODE HERE ###

        return loss
  • You are familiar with many numpy functions such as np.sum, np.dot, np.multiply, np.maximum, etc…

np.dot(), np.outer(), np.multiply(), *

  1. When np.dot() is given rank-1 arrays, it multiplies corresponding elements and sums them (an inner product); for higher-rank arrays it performs matrix multiplication. Note that a matrix-matrix product gives a rank-2 result, while a matrix-vector product gives a rank-1 result.
  2. np.multiply() multiplies corresponding entries of its inputs; the output has the same shape as the inputs.
  3. np.outer() computes the outer product of two vectors: each element of the first vector is multiplied by every element of the second vector to form one row of the result.
  4. For numpy arrays, * multiplies corresponding elements (this is exactly what the cost function computation uses, not np.dot!); for numpy matrix objects, * performs matrix multiplication. A small sketch of these operations follows this list.
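The sketch below shows the four operations side by side on concrete arrays:

import numpy as np

x = np.array([1, 2, 3])         # rank-1 array
y = np.array([4, 5, 6])
A = np.array([[1, 2], [3, 4]])  # 2-D array
B = np.array([[5, 6], [7, 8]])

print(np.dot(x, y))        # 32: element-wise products summed (inner product)
print(np.dot(A, B))        # 2x2 matrix product
print(np.outer(x, y))      # 3x3 outer product: entry (i, j) is x[i] * y[j]
print(np.multiply(x, y))   # [ 4 10 18]: element-wise
print(x * y)               # same as np.multiply for ndarrays

# for the legacy np.matrix type, * is matrix multiplication;
# with plain arrays, prefer np.dot (or @) for matrix products
print(np.matrix(A) * np.matrix(B))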

Part 2: Logistic Regression with a Neural Network mindset

This part of the assignment implements a logistic regression algorithm to recognize whether a picture is a cat.

Dataset preprocessing

Very often the bugs we run into are about matrix/vector dimensions, so we must be sure the dimensions we set up are correct. While writing code, check them from time to time with X.shape.

A small trick with X.reshape(): to flatten a matrix X of shape (a, b, c, d) into a matrix X_flatten of shape (b*c*d, a), use the line below. Afterwards, each column of the matrix is one example.

X_flatten = X.reshape(X.shape[0], -1).T      # X.T is the transpose of X
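A quick sanity check of what this line does, using a made-up toy shape:

import numpy as np

X = np.random.randn(4, 2, 2, 3)             # 4 "images" of shape (2, 2, 3)
X_flatten = X.reshape(X.shape[0], -1).T     # -1 lets numpy infer 2*2*3 = 12
print(X.shape)          # (4, 2, 2, 3)
print(X_flatten.shape)  # (12, 4): each column is one flattened example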

What we need to remember:
Common steps for pre-processing a new dataset are:

  • Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, …)
  • Reshape the datasets such that each example is now a vector of size(num_px*num_px*3,1)
  • “Standardize” the data

General Architecture of the learning algorithm

image.png

image.png

Key steps:

  • Initialize the parameters of the model
  • Learn the parameters for the model by minimizing the cost
  • Use the learned parameters to make predictions (on the test set)
  • Analyse the results and conclude

Building the parts of the algorithm

Main steps:

  1. Define the model structure (such as number of input features)
  2. Initialize the model’s parameters
  3. Loop:
    • Calculate current loss (forward propagation)
    • Calculate current gradient (backward propagation)
    • Update parameters (gradient descent)

Helper functions-Sigmoid

def sigmoid(z):
    """
    Compute the sigmoid of z

    Arguments:
    z -- A scalar or numpy array of any size.

    Return:
    s -- sigmoid(z)
    """

    ### START CODE HERE ### (≈ 1 line of code)
    s = 1 / (1 + np.exp(-z))
    ### END CODE HERE ###

    return s

Initializing parameters

def initialize_with_zeros(dim):
    """
    This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.

    Argument:
    dim -- size of the w vector we want (or number of parameters in this case)

    Returns:
    w -- initialized vector of shape (dim, 1)
    b -- initialized scalar (corresponds to the bias)
    """

    ### START CODE HERE ### (≈ 1 line of code)
    w = np.zeros((dim, 1))
    b = 0
    ### END CODE HERE ###

    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))

    return w, b

Forward and Backward propagation

The formulas we will use:
image.png

Here is something I used to mix up: when to use np.dot, np.multiply and *. Now it is clear: ordinary matrix multiplication uses np.dot, while element-wise multiplication, as in the cost function computation in the figure above, uses np.multiply or *. Once this is clear, it's hard to get wrong.

def propagate(w, b, X, Y):
    """
    Implement the cost function and its gradient for the propagation explained above

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)

    Return:
    cost -- negative log-likelihood cost for logistic regression
    dw -- gradient of the loss with respect to w, thus same shape as w
    db -- gradient of the loss with respect to b, thus same shape as b

    Tips:
    - Write your code step by step for the propagation. np.log(), np.dot()
    """

    m = X.shape[1]  # number of examples

    # FORWARD PROPAGATION (FROM X TO COST)
    ### START CODE HERE ### (≈ 2 lines of code)
    A = sigmoid(np.dot(w.T, X) + b)
    cost = -(1.0 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    ### END CODE HERE ###

    # BACKWARD PROPAGATION (TO FIND GRAD)
    ### START CODE HERE ### (≈ 2 lines of code)
    dw = np.dot(X, (A - Y).T) / m
    db = np.sum(A - Y) / m
    ### END CODE HERE ###
    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())

    grads = {"dw": dw,
             "db": db}
    return grads, cost

Optimization - updating the parameters

Using the update rule: theta = theta - alpha * dtheta

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
    """
    This function optimizes w and b by running a gradient descent algorithm

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the loss every 100 steps

    Returns:
    params -- dictionary containing the weights w and bias b
    grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
    costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.

    Tips:
    You basically need to write down two steps and iterate through them:
    1) Calculate the cost and the gradient for the current parameters. Use propagate().
    2) Update the parameters using gradient descent rule for w and b.
    """

    costs = []

    for i in range(num_iterations):

        # Cost and gradient calculation (≈ 1-4 lines of code)
        ### START CODE HERE ###
        grads, cost = propagate(w, b, X, Y)
        ### END CODE HERE ###

        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]

        # update rule (≈ 2 lines of code)
        ### START CODE HERE ###
        w = w - learning_rate * dw
        b = b - learning_rate * db
        ### END CODE HERE ###

        # Record the costs
        if i % 100 == 0:
            costs.append(cost)

        # Print the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    params = {"w": w,
              "b": b}

    grads = {"dw": dw,
             "db": db}

    return params, grads, costs

Prediction

def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)

    Arguments:
    w -- weights, a numpy array of size (num_px * num_px * 3, 1)
    b -- bias, a scalar
    X -- data of size (num_px * num_px * 3, number of examples)

    Returns:
    Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
    '''

    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)  # make sure the dimensions are right

    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    ### START CODE HERE ### (≈ 1 line of code)
    A = sigmoid(np.dot(w.T, X) + b)
    ### END CODE HERE ###

    for i in range(A.shape[1]):

        # Convert probabilities A[0,i] to actual predictions p[0,i]
        ### START CODE HERE ### (≈ 4 lines of code)
        if A[0][i] > 0.5:
            Y_prediction[0][i] = 1
        else:
            Y_prediction[0][i] = 0
        ### END CODE HERE ###

    assert(Y_prediction.shape == (1, m))

    return Y_prediction

Merge all functions into a model

def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):
    """
    Builds the logistic regression model by calling the function you've implemented previously

    Arguments:
    X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
    Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
    X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
    Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
    num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
    learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
    print_cost -- Set to true to print the cost every 100 iterations

    Returns:
    d -- dictionary containing information about the model.
    """

    ### START CODE HERE ###

    # initialize parameters with zeros (≈ 1 line of code)
    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent (≈ 1 line of code)
    params, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)

    # Retrieve parameters w and b from dictionary "params"
    w = params["w"]
    b = params["b"]

    # Predict test/train set examples (≈ 2 lines of code)
    Y_prediction_train = predict(w, b, X_train)
    Y_prediction_test = predict(w, b, X_test)

    ### END CODE HERE ###

    # Print train/test Errors
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test,
         "Y_prediction_train": Y_prediction_train,
         "w": w,
         "b": b,
         "learning_rate": learning_rate,
         "num_iterations": num_iterations}

    return d

d = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)  # train the model

A further look at the learning rate

If the learning rate is too large we may “overshoot” the optimal value. Similarly, if it is too small we will need too many iterations to converge to the best values. That’s why it is crucial to use a well-tuned learning rate.

In deep learning, we usually recommend that you:

  1. Choose the learning rate that better minimizes the cost function.
  2. If your model overfits, use other techniques to reduce overfitting. (We’ll talk about this in later videos.)

Summary

From this assignment we learned:

  1. Preprocessing the dataset is important.
  2. You implemented each function separately: initialize(), propagate(), optimize(). Then you built a model().
  3. Tuning the learning rate (which is an example of a “hyperparameter”) can make a big difference to the algorithm. You will see more examples of this later in this course!