为什么用序列模型：Why sequence models?

首先，我们看几个序列数据的例子:

所有的这些问题都可以作为使用标签数据(X,Y)作为训练集的监督学习。不过从上图，我们也可以知道这里的序列数据是非常不同的。有些输入输出都是序列，但长度不同；有的只有输入或输出才是序列，等等。

数学符号：Notation

Motivation Example

假设我们想要建立一个能够自动识别句中人名位置的序列模型。所以这就是一个命名实体识别问题，常用于搜索引擎。比如，索引过去24小时内所有新闻报道提及的人名。命名实体识别系统可以用来查找不同类型的文本中的人名、公司名、时间、地点、国家名、货币名等等。

假设我们的输入输出设置如下：

其中，y用一个向量对应句子中的每个单词，如果该单词为人名，那么为1，否则为0。从技术层面角度，这并不是最好的输出形式，还有其他更加复杂的输出形式。它不仅能够表明输入词是否是人名的一部分，还能够告诉你这个人名从句子哪里开始和结束。

我们将用x<1>,x<2>,…,x<9>来索引句子中单词的位置，用x<\t>表示序列的中间位置。这里的t意味着它们是时序序列，输出同样的用y<1>,y<2>,…y<9>来索引位置。另外，我们用T_y表示句子的长度。为了表示训练样本i的序列中第t个单词(元素)，我们用X(i)<\t>来表示，再用Tx(i)表示第i个训练样本的输入序列长度，这种表示方法对输出序列也成立。

Representing words

接下来探讨怎样表示一个句子里单个的词。首先，我们可能会做一张词汇表，将我们要表示的词按字典顺序放入。比如，这里有个词汇量为10000的词汇表，这对自然处理语言应用来说是非常小规模的。

接下来，我们可以用one-hot表示每个单词，这个每个单词用one-hot向量来表示：

我们的目标是，用这样的表示方式表示X，用序列模型在X和目标输出Y之间学习建立一个映射，我们会把它当做监督学习的问题来做。

另一个需要注意的问题是，如果我们遇到了一个不在单词表中的单词，我们就创建一个新的标记"<UNK>"，来表示这个单词不在词汇表中。

循环神经网络模型

Why not a standard network?

A standard network:

Problems:

Inputs, outputs can be different lengths in different examples.
Doesn’t share features learned across different positions of text.类似卷积神经网络中所学到的，我们希望将部分图片里学到的内容快速推广到图片的其他部分，而我们希望对序列数据也有相似的效果。

另外，用一个更好的representation能够让我们减少模型中的参数数量。

Recurrent Neural Network

RNN的基本结构如下：

首先顺序读取序列中的第一个单词，并且尝试预测一个输出值；再读取第二个单词的时候，我们不仅仅利用这个单词进行预测，而且会用到前一个隐藏层的输出作为当前层的输入来预测。就这样按照顺序进行下去。另外，我们通常为设置一个伪激活值a<0>作为RNN的最初始输入，通常为0向量。另外在本例中，Tx=Ty，如果不等，则需要对网络结构进行调整。

循环神经网络是从左向右扫描数据，同时每个时间步的参数也是共享的。我们用W_ax来表示从x<1>到隐藏层的连接的一系列参数，并且每个时间步所使用的都是相同的参数W_ax。而激活值，也就是水平联系，是由参数W_aa决定的，同时每一个时间步都使用相同的参数W_aa。同样的，输出都由参数W_ya决定。（对这里的参数名称做一个解释，比如W_ax，表示这个参数会乘以X来得到a，类似这样感知的理解就好，看图就能够明白了）

注意：最右侧循环图可以理解为左边图的简化版本，但难以理解。

在这种RNN中，当我们预测y<3>时，不仅要使用x<3>的信息，还要使用来自x<1>和x<2>的信息。但这个RNN的一个缺点是它只使用了这个序列中之前的信息来做出预测。即当预测y<3>时，它没有用到如x<4>、x<5>、x<6>等等的信息。这就造成了一个问题：

为了判断Teddy是否是人名的一部分，只使用前面的两个单词来预测是远远不够的，即之给定前面的三个单词，不可能确切知道Teddy是否是人名的一部分。

因此这个特定结构神经网络的缺点是，它在某一时刻的预测仅使用了从序列中之前时刻的输入信息，并没有使用序列之后地信息。这个问题我们会在Bidirectional RNN(双向RNN)中得到解答。

Forward Propagation

我们将网络结构更清晰的表示如下：

用等式表示为：

我们注意到，式子可以在最后很有规律地表示出其结果。另外，激活函数可以是不同的，比如在RNN中，g1往往选择为tanh，也可以为relu，而g2需要根据我们y的输出值来确定（如果是二分类问题，则选择为sigmoid，如本例的命名实体识别问题；如果是多分类，可以选择softmax等等）。

Simplified RNN notation

在上一小节，我们得到了以下公式：

我们将a<\t>的式子简写作：

其中，具体的解释如下：

将y<\t>写作：

此时，W_y和b_y的下标只有y，这表示会输出什么类型的量，所以W_y是计算y类型的量的权重矩阵；而上面的W_a和b_a表示这些参数是用来计算a类型输出的。

Back propagation through time

我们已经学习了循环神经网络的基础结构，在本节，我们将了解反向传播是怎样在循环神经网络中运行的。

我们已知的前向传播过程：

接下来，为了计算反向传播，我们还需要一个损失函数。

由于类似二元分类，因此我们选取的损失函数为交叉熵损失函数。

将损失函数和反向传播表示到网络中：

其中最为重要的递归操作为从右往左的梯度计算。

不同类型的RNN

本文参考Andrej Karpathy的博客: The Unreasonable Effectiveness of Recurrent Neural Networks（对应翻译版：https://blog.csdn.net/menc15/article/details/78775010 ）

在上一节，我们介绍的RNN结构里Tx和Ty是相等的。但是在实际应用中，Tx和Ty不一定相等。本节会进行介绍。

常见序列数据：

Examples of RNN architectures

many-to-many(输入输出长度相同), many-to-one(例如情感分类), one-to-one(普通神经网络):

one-to-many(音乐生成), many-to-many(输入输出长度不同，如翻译，用的是encoder-decoder模型):

另外还有一个attention模型，之后会讲解。

Summary of RNN

语言模型和序列生成

一个语言模型能够计算出句子的可能性，如语言识别中，举一个简单的例子：

因此，语言模型会计算出某个特定句子出现的概率是多少。这个模型是两种系统的基本组成部分，即语音识别系统和机器翻译系统，它能正确输出最接近的句子。而语言模型做的基本工作就是输入一个句子，准确地说是一个文本序列，然后会估计该序列中各个单词出现的可能性。

Language modeling with an RNN

首先我们需要一个训练集，这个训练集可能是一个大型的英文语料库，也可能是其他我们需要的语言的语料库。

第一件事需要进行tokenize，分解文本流为词，或将其转化为序列（比如用one-hot向量表示每一个词）。
另一件可能要做的事是定义句子的结尾，一般的做法是增加一个额外的标记，叫做"<EOS>"。这样能够让我们清楚一个句子什么时候结束。因此EOS标记可以添加到训练集中每一个句子的结尾。（注意句号可以去掉，或者当做一个单词也计入词典）
另一件是如果我们的训练集里有一些词并不在词典里（这个词典可能是10,000个常见的英文单词），那么将不在里面的单词用"<UNK>"取代，即用"<UNK>"代替未知词。我们只针对"<UNK>"来建立概率模型。
下一步，我们要构建一个RNN来构建这些序列的概率模型。

RNN model

由图上的RNN模型可知，我们首先输入x<1>和a<0>均为零向量，然后开始尝试输出第一个词y_hat<1>。随后，我们将y_hat<1>和a<1>作为计算a<2>的输入，接着通过softmax函数计算出第二个预测词y_hat<2>（这时候，我们所求的是使P(?|Cats)最大的单词，这样依次计算下去…

我们使用的是Softmax函数作为输出层的激活函数，因此选用的loss function为交叉熵损失函数。其中，y_i<\t>为真实的输出，而y_hat_i<\t>则为预测的单词输出，然后再将所有时刻的loss相加，得到最后总的loss function。

假设现在有一个新句子，为了简单起见，假设其含有是三个单词。那么第一层Softmax会计算P(y<1>)，第二层Softmax计算P(y<2>|y<1>)，第三层Softmax计算P(y<3>|y<1>,y<2>)。从而整个句子的概率为：

对新序列采样：sampling novel sequences

在训练完一个序列模型之后，要想了解这个模型学到了什么，一种非正式的方法就是进行一次新序列采样。

Sampling a sequence from a trained RNN

我们记得一个序列模型模拟了任意特定单词序列的概率，而我们要做的是对这个概率分布进行采样，来生成一个新的单词序列。

已知我们训练时所用的RNN模型如下：

然而为了采样，我们会做一些不同的事情。第一步要做的是对我们想要模型生成的第一个词进行采样，于是我们输入x<1>=0,a<0>=0，因此现在我们的第一个时间步得到的是所有可能的输出，即经过Softmax层后得到的概率。然后根据这个Softmax的分布进行随机采样，使用numpy命令如np.random.choice来对第一个词进行采样；接下来我们进入第二个时间步，把刚刚采样的y<1>传递到下一个位置作为输入，接着Softmax层会预测第二个词；这样依次进行…

什么时候一个句子算结束呢？一个方法是如果代表句子结尾的标识"<EOS>"在词典中，那么我们可以一直进行采样直到得到"<EOS>"，这代表我们已经抵达结尾，可以停止随机采样；另一个方法是词典中没有这个标识，那么我们可以决定从20个或100个或其他词中进行采样，直到达到所设定的时间步。

这种方法可能会出现预测出"<UNK>"的情况。如果我们想要避免这种情况，那么可以在"<UNK>"出现时就继续在剩下的词中进行重采样，直到得到一个不是"<UNK>"的单词；当然，如果我们不介意有未知标识的产生，也可以不理会。

以上就是我们从RNN语言模型中生成一个随机选择的句子。

Character-level language model

在之前我们所建立的都是基于词汇的RNN模型，也就是说，字典中的词都是英语单词。根据我们的实际应用，我们还可以构建一个基于字符的RNN模型。这时，我们的字典中不再是单词，而是常见字符。

此时输入输出都是单个字符，而不再是单独的词汇。

基于字符的语言模型的优点是我们不必担心会出现未知的标识，比如基于字符的语言模型会把’Mao’视为概率非0的序列，而基于词汇的语言模型，如果’Mao’不在字典中，那么我们只能把它当做未知标识。

然后基于字符的语言模型的一个明显缺点是我们最后会得到太长的序列。所以基于字符的语言模型在捕捉句子中的依赖关系，也就是句子较前部分如何影响较后部分，不如基于词汇的语言模型，并且它的计算成本也会很大。

因此在自然语言领域中，大多数应用是使用基于词汇的语言模型。在随着计算能力的提高，在一些特殊情况下，人们也会开始使用基于字符的模型，但这也需要更昂贵的计算成本。

下面展示一些采样后的结果：

Vanishing gradients with RNNs

在前面的学习我们已经了解了RNN是如何工作的，并且知道如何将RNN应用到具体问题上，比如命名实体识别(name entity recognition)、语言模型等等。然后基本的RNN模型有一个很大的问题，也就是梯度消失的问题。

下面解释梯度消失。首先给出两个句子，这两个句子有着长期的依赖，也就是很前面的单词对句子很后面的单词有影响。(cat对应was，cats对应were)

The cat, which already ate…, was full.
The cats, which already ate…, were full.
但是目前我们见到的基本RNN模型不擅长捕获这种长期依赖效应。在之前的讨论，我们知道在训练很深的网络时，我们讨论了梯度消失的问题。如果深度很深，那么从输出y得到的梯度将很难传播回去，很难影响到前面层的权重。

而RNN也有着同样的问题，所以其反向传播也比较困难。因为同样的梯度消失的问题，后面层的输出误差很难影响前面层的计算。这就意味着，实际上能难让一个神经网络能够意识到它要记住看到的是单数名词还是复数名词，然后在序列后面生成依赖单复数形式。也就是说，基本的RNN模型里，一个单词只能被其前面的几个单词所影响。这就是RNN的一个缺点。

如果出现了梯度爆炸的情况，一般很容易发现，因为梯度可能会出现如NaN(数值溢出之类的现象。一个解决方法就是进行梯度修剪(gradient clipping)，也就是观察我们的梯度向量，如果它大于某个阈值，那么我们缩放梯度向量，保证它不会太大。

但梯度消失是更难解决的。

另外复习:

ReLu激活函数的主要贡献是：
- 解决了梯度消失、爆炸的问题
- 计算方便、计算速度快
- 加速了网络的训练
BatchNorm: 本质上是解决反向传播过程中的梯度问题。反向传播式子中有w的存在，所以w的大小影响了梯度的消失和爆炸。BatchNorm就是通过对每一层的输出规范化为均值和方差一致的方法，消除了w带来的放大缩小的影响，进而解决梯度消失和爆炸的问题。
残差结构：解决了梯度消失过快的问题，因此即使构建的网络很深层也不必担心。

参考博客：https://blog.csdn.net/qq_25737169/article/details/78847691

GRU: Gated Recurrent Unit

参考论文：

On the properties of neural machine translation: Encoder-decoder approaches, 2014.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014.

我们已经学习了基础的RNN模型的运行机制，在本节，我们会学习门控循环单元GPU。它改变了RNN的隐藏层，使其更好地捕捉深层连接，并改善了梯度消失问题。

RNN unit

GRU(simplified)

GRU设置了一个新的变量C，为记忆细胞(memory cell)，记忆细胞的作用是提供了记忆的能力，比如一只猫是单数还是复数。

在GRU中，c<\t>等于激活值a<\t>。（因为在LSTM中，两个值并不相等，因此为避免混淆，这里采用两种不同的名称表示，即便他们的值是一样的。）

接下来写出GPU(简化版本)中的关键公式：

其中，第一个式子表示记忆细胞的候选值，我们用tanh计算出来。

而GPU的关键思想就是第二个式子，也就是门（用”Γ”代替，另外小写u表示”update”)。为了思考GRU的工作机制，我们思考门的原理。可以看到它是用一个sigmoid函数来计算，对大多数输入来说，sigmoid值在大多数情况下都接近0或者1。

接下来看第三个式子。如果当Γ_u=1时，说明我们将c<\t>设置为计算的候选值c_tilda<\t>。而如果Γ_u=0，说明我们将c<\t>设置为c<\t-1>，这也说明，Γ_u越小，则前一时刻的状态信息带入越多。而针对我们之前所说的猫吃饱没的例子，我们应该将cat和was之间的所有Γ_u都设置为0，即不进行更新，只设置为旧的值。这样到了was的时候，神经网络还能记得cats的信息。

我们将GRU图示化，以便更好理解：

而GRU的优点是，通过门来决定，当我们从左到右扫描一个句子的时候，这个时机是应该更新还是不更新记忆细胞。由于sigmoid值很可能取0，因此我们可以通过门的设定来维持记忆细胞的值，即c<\t>=c<\t-1>，而因为Γ_u很接近0，从而不会有梯度消失的问题了。因为Γ_u很接近0，所以c<\t>几乎等于c<\t-1>，而且c<\t>的值也很好地被维持了，即使经过很多很多的时间步。所以这就是缓解梯度消失问题的关键，因此允许神经网络运行在非常庞大的依赖词上，比如cat和was单词，即便它们被很多单词分隔，也仍然能够运行。

下面补充一点实现的细节。在上面有关记忆细胞候选值的式子里，c<\t>可以是一个向量。如果我们有一个100维的隐藏的激活值，那么c<\t>也是100维的，c_tilda<\t>也是相同的维度，从而Γ_u也是相同的维度。因此说明c<\t>的式子实际上是元素对应的乘积。

而这100维的记忆细胞里，我们只更新需要更新的比特。

当然，在实际应用中Γ_u不会真的等于0或1，有时候它是0到1的中间值，但这对于直观思考是很方便地。而元素对应的成绩做的是告诉GRU单元哪个记忆细胞的向量维度在每个时间步要做更新，因此我们可以选择保持一些比特不变，而去更新其他的比特。

Full GRU

针对完整版本的GRU，我们添加了一个变量Γ_r，这个Γ_r告诉我们计算出的c<\t-1>与c<\t>的候选值的相关性有多大。

为什么需要Γ_r呢？这是因为多年来研究者们试验过很多不同可能的方法来设计这些单元，尝试让神经网络有更深层的连接，尝试产生更大范围的影响，并且解决梯度消失的问题。

参考其他博客的解释

为了让自己更加深刻的理解GRU，接下来还参考了以下博文：https://www.cnblogs.com/jiangxinyang/p/9376021.html

图中的zt和rt分别表示更新门和重置门。更新门用于控制前一时刻的状态信息被带入到当前状态中的程度，更新门的值越大说明前一时刻的状态信息带入越多。重置门控制前一状态有多少信息被写入到当前的候选集 h_tilda_t 上，重置门越小，前一状态的信息被写入的越少。

GRU的前向传播：
GRU的训练过程：从前向传播过程中的公式可以看出要学习的参数有Wr、Wz、Wh、Wo。其中前三个参数都是拼接的（因为后先的向量也是拼接的），所以在训练的过程中需要将他们分割出来：

输出层的输入和输出层的输出分别为（注意到输出的激活函数是sigmoid）：

损失函数（这里用的是平方损失函数）：

接下来求偏导（我没有进行推导）：

其中各中间参数为：

在算出了对各参数的偏导之后，就可以更新参数，依次迭代直到损失收敛。

长短时记忆网络：LSTM

论文标题：Long shot-term memory, 1997.

在上一节，我们学习了GRU，它能够让我们在序列中学习非常深的连接(long range connection)。LSTM单元也能做到这一点。

下面是GRU和LSTM的公式对照。

我们注意到，LSTM不再有a<\t>=c<\t>。和GRU一样，有一个更新门，而LSTM的新特性是不只有一个更新门Γ_u控制，而新增加了一个遗忘门Γ_f。然后增加了一个输出门来用于输出a<\t>。

接下来将LSTM表示为图形：

接下来我们将多个LSTM单元按顺序连接起来，可以发现一件有趣的事：我们会发现在连接的图中有一条线（即所画的红线），这条线显示了只要我们正确地设置了遗忘门和更新门，LSTM是能够很容易地将c<0>的值一直传递到右边，比如c<3>=c<0>。这就是为什么LSTM和GRU非常擅长于长时间记忆某个值，即使经过很长的时间步，依旧能够保持住存在于记忆细胞中的某个值。

然而，这里介绍的和一般使用的版本不太一样。最常用的版本(LSTM的变体：Recurrent Nets that Time and Count, 2000)是我们的门值不仅取决于a<\t-1>和x<\t>，有时候人们也会偷窥一下c<\t-1>的值，这叫做peephole connection(窥视孔连接)。它的意思是，门值不仅取决于a<\t-1>和x<\t>，也取决于上一个记忆细胞的值。

那么我们应该什么时候用GRU，什么时候用LSTM？这里没有统一的准则。尽管我们先介绍了GRU，但在深度学习历史上，LSTM是更早出现的，而GRU是近几年才发明出来的，它可能源于在更加复杂的LSTM模型中做出的简化。研究者们在很多不同问题上尝试了这两种模型，看看在不同问题不同算法中哪个模型更好。

GRU的优点是更加简单，所以更容易创建一个更大的网络，而且它只有两个门，在计算性上也运行得更快，可以扩大模型的规模；而LSTM更加强大和灵活，因为它有三个门而不是两个。如果我们必须选择一个来使用，那么LSTM应该会作为默认选择来尝试。

LSTM反向传播推导(未验证)

门求偏导：

参数求偏导：

最后，计算隐藏状态、记忆状态和输入的偏导数：

双向RNN：Bidirectional RNN(BRNN)

到目前为止我们已经学习了RNN模型的关键构件，但还有两个方法能够让我们构建更好的模型。其中一个方法就是双向RNN模型，这个模型能够让我们在序列的某点处不仅可以获取之前的信息，还可以获取之后的信息；第二个方法是深层RNN，我们将在下一节讲解。

为了了解RNN的动机，我们先看一下之前在命名实体识别中见过多次的神经网络。这个网络的问题是，在判断Teddy是否是人名的一部分时，光看句子的前面部分是不够的。为了判断y<3>是0还是1，除了前三个单词，我们还需要更多的信息。所以这是一个单向的RNN。无论里面的单元是GRU、LSTM，这个结论总是成立的，即这是一个单向的RNN神经网络。

下面我们来解释BRNN的工作原理。为了简单，我们用四个输入或者说只有四个单词的句子，这样输入只有x<1>到x<4>。然后我们有四个前向的循环单元，这四个循环单元都有一个当前输入x<\t>，进而得到预测的y_hat<\t>。这是最初的结构：

接下来，我们要增加一个反向循环层。

这样，网络就形成了一个无环图(acyclic graph)。给定一个输入序列x<1>到x<4>，这个序列首先计算前向的a<1>，然后计算前向的a<2>，接着a<3>、a<4>。而反向序列从计算a’<4>开始反向进行，计算反向的a’<3>，直到计算完成。再把所有的激活值都计算完成后，就可以得到预测结果了。（这里我们用a<1>表示前向激活值，a’<1>表示后向激活值）

而预测结果的计算公式如下式子：

通过这种方式，我们的预测不仅利用了过去的信息，也利用了未来的信息。而神经网络里的基础单元不仅仅可以是标准RNN单元，也可以是GRU单元或者LSTM单元。事实上，很多的NLP问题，有LSTM单元的双向RNN模型是用得最多的。所以如果有NLP问题，并且文本句子都是完整的，首先需要标定这些句子，那么一个有LSTM单元的双向RNN模型，既有前向过程也有反向过程，将是不错的首选。

BPNN不仅能用于基本的RNN结构，也能用于GRU和LSTM。通过这些改变，我们可以用一个RNN或GRU或LSTM构建的模型，并且能够预测任意位置，即使在句子的中间位置，因为模型能够考虑整个句子的信息。而这个双向RNN网络模型的缺点是我们需要完整的数据序列，才能够预测任意位置。比如我们需要构建一个语音识别系统，那么BRNN需要考虑整个语音表达。也就是说，我们需要等某个人说完，获取整个语音表达才能处理这段语音，并进一步做语音识别。因此，对于实际的语音识别的应用，通常会有更加复杂的模块，而不仅仅用我们见过的标准BRNN模型。当然，对于很多自然语言处理的应用，如果我们总是可以获取整个句子，那么标准的BRNN模型实际上会很有效。

深层循环神经网络: Deep RNNs

目前我们学到的不同RNN的版本，每个都可以独当一面。但是要学习非常复杂的函数时，通常我们会把RNN的多个层堆叠在一起构建更深的模型。

我们知道标准的神经网络结构如下：

下面给出一个具有三个隐层的RNN网络：

我们下面来讲解一下a[2]<3>应该怎么计算。激活值a[2]<3>有两个输入，一个从左边传来，一个从下边传来，那么计算的时候，用激活函数作用于权重矩阵如下：

另外，Wa^[2]和ba^[2]也用于同一层的计算，同理第一层有参数Wa^[1]和ba^[1]参与计算。对于RNN来说，能够有三层已经很多了，一般我们设置为1或者2。（注意，time step为深度，这个为层数num_layers）

当然，有一种结构也会比较常见。即我们将输出去掉，替换成一些深层，而这些层并没有水平连接，只是一个深层网络，然后用来预测y<1>，同样也加深层网络来预测y<2>，等等。

而这些单元既可以是标准的RNN单元，也可以是GRU或者LSTM单元。通常，深层的RNN训练需要很多计算资源，需要很长的时间。

本周作业

第一个作业里有关基本RNN和LSTM的反向传播的部分，没有写完，就先放着了，以后有机会再重新填坑。

Building your Recurrent Neural Network - Step by Step

Recurrent Neural Networks (RNN) are very effective for Natural Language Processing and other sequence tasks because they have “memory”. They can read inputs x⟨t⟩ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a uni-directional RNN to take information from the past to process later inputs. A bidirection RNN can take context from both the past and the future.

一些符号标记：

1-Forward propagation for the basic Recurrent Neural Network

Later this week, you will generate music using an RNN. The basic RNN that you will implement has the structure below. In this example, Tx = Ty.

Here’s how you can implement an RNN:

Steps:

Implement the calculations needed for one time-step of the RNN.
Implement a loop over $T_x$ time-steps in order to process all the inputs, one at a time.

1.1- RNN cell

A Recurrent neural network can be seen as the repetition of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell.

一个RNN cell处理一个词，而下面代码将其向量化，所以一个RNN cell可以同时处理m个样本中的词。

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described in Figure (2)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m). n_x为词向量的长度，m为样本个数
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    # compute next activation state using the formula given above
    a_next = np.tanh(np.dot(Waa,a_prev)+np.dot(Wax,xt)+ba)
    # compute output of the current cell using the formula given above
    yt_pred = softmax(np.dot(Wya,a_next)+by)

    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)

    return a_next, yt_pred, cache

测试：

np.random.seed(1)
xt = np.random.randn(3,10)
a_prev = np.random.randn(5,10)
Waa = np.random.randn(5,5)
Wax = np.random.randn(5,3)
Wya = np.random.randn(2,5)
ba = np.random.randn(5,1)
by = np.random.randn(2,1)
parameters = {"Waa":Waa, "Wax": Wax, "Wya": Wya, "ba": ba, "by": by}

a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)

1.2- RNN forward pass

You can see an RNN as the repetition of the cell you’ve just built. If your input sequence or data is carried over 10 time steps, then you will copy the RNN cell 10 times. Each cell takes as input the hidden state from the previous cell(a<\t-1>) and the current time-step’s input data(x<\t>). It outputs a hidden state(a<\t>) and a prediciton(y<\t>) for this time-step.

Exercise: Code the forward propagation of the RNN described in Figure(3).

Instructions:

Create a vector of zeros (a) that will store all the hidden states computed by the RNN.
Initialize the “next” hidden state as a0 (initial hidden state).
Start looping over each time step, your incremental index is t:
- Update the “next” hidden state and the cache by running rnn_step_forward
- Store the “next” hidden state in a(t^th position)
- Store the prediction in y
- Add the cache to the list of caches

Return a, y and caches

def rnn_forward(x, a0, parameters):
    '''
    Implement the forward propagation of the recurrent neural network described in Figure(3).

    Arguments:
        x -- Input data for every time-step, of shape(n_x, m, T_x).
        a0 -- Initial hidden state, of shape (n_a, m).
        parameters -- python dictionary containing:
                Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                ba --  Bias numpy array of shape (n_a, 1)
                by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of caches, x)
    '''

    # Initialze "caches" which will contain the list of all caches
    caches = []

    # Retrieve dimensions from shapes of x and Wy
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape

    # initialize "a" and "y" with zeros
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))

    a_next = a0

    for t in range(T_x):
        a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t],a_next, parameters)
        a[:,:,t] = a_next
        y_pred[:,:,t] = yt_pred
        caches.append(cache)

    # store values needed for backward propagation in cache
    caches = (caches, x)
    return a, y_pred, caches

You’ve successfully built the forward propagation of a recurrent neural network from scratch. This will work well enough for some applications, but it suffers from vanishing gradient problems. So it works best when each output y<\t> can be estimated using mainly “local” context(meaning information from inputs x<\t’>) where t’ is not too far from t.

2-backward propagation for the basic Recurrent Neural Network

2.1- Basic RNN backward pass

We start by computing the backward pass for the basic RNN-cell.

注意，上图中右边求导数的公式只计算了相对于a_next的梯度，完整的公式需要乘上a_next相对于cost function J的梯度。

def rnn_cell_backward(da_next, cache):
    """
    Implements the backward pass for the RNN-cell (single time-step).

    Arguments:
    da_next -- Gradient of loss with respect to next hidden state
    cache -- python dictionary containing useful values (output of rnn_step_forward())

    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradients of input data, of shape (n_x, m)
                        da_prev -- Gradients of previous hidden state, of shape (n_a, m)
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dba -- Gradients of bias vector, of shape (n_a, 1)
    """
    # Retrieve values from cache
    (a_next, a_prev, xt, parameters) = cache

    # Retrieve values from parameters
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    # compute the gradient of tanh with respect to a_next
    dtanh = (1-a_next*a_next)*da_next
    dxt = np.dot(Wax.T, dtanh)
    dWax = np.dot(dtanh, xt.T)
    da_prev = np.dot(Waa.T, dtanh)
    dWaa = np.dot(dtanh, a_prev.T)
    dba = np.sum(dtanh, axis=1, keepdims=True)
    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    return gradients

Backward pass through the RNN

好难，先占坑。

Computing the gradients of the cost with respect to a<\t> at every time-step t is useful because it is what helps the gradient backpropagate to the previous RNN-cell To do so, you need to iterate through all the time steps starting at the end, and at each step, you increment the overall dba, dWaa, dWax and you store dx.

def rnn_backward(da, caches):
    """
    Implement the backward pass for a RNN over an entire sequence of input data.

    Arguments:
    da -- Upstream gradients of all hidden states, of shape (n_a, m, T_x)
    caches -- tuple containing information from the forward pass (rnn_forward)
    
    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradient w.r.t. the input data, numpy-array of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t the initial hidden state, numpy-array of shape (n_a, m)
                        dWax -- Gradient w.r.t the input's weight matrix, numpy-array of shape (n_a, n_x)
                        dWaa -- Gradient w.r.t the hidden state's weight matrix, numpy-arrayof shape (n_a, n_a)
                        dba -- Gradient w.r.t the bias, of shape (n_a, 1)
    """
    # Retrieve values from the first cache (t=1) of caches (≈2 lines)
   
    
    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    
    
    # initialize the gradients with the right sizes (≈6 lines)

3- Forward propagation of Long Short-Term Memory(LSTM) network

This following figure shows the operations of an LSTM-cell.

Similar to the RNN example above, you will start by implementing the LSTM cell for a single time-step. Then you can iteratively call it from inside a for-loop to have it process an input with T_x time-steps.

About the gates

Forget gate: For the sake of this illustration, lets assume we are reading words in a piece of text, and want use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need to find a way to get rid of our previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this:

Here, Wf are weights that govern the forget gate’s behavior. We concatenate [a⟨t−1⟩,x⟨t⟩] and multiply by Wf. The equation above results in a vector Γ⟨t⟩f with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state c⟨t−1⟩. So if one of the values of Γ⟨t⟩fΓf⟨t⟩ is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of c⟨t−1⟩. If one of the values is 1, then it will keep the information.
Update gate: Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate:
Updating the cell: To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is:
Output gate: To decide which outputs we will use, we will use the following two formulas:

3.1- LSTM cell

Exercise: Implement the LSTM cell described in the Figure (3).

Instructions:

直接按照公式计算即可，注意concat（矩阵拼接）的写法。可以用np.vstack((a,b))，表示竖直拼接。

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Implement a single forward step of the LSTM-cell as described in Figure (4)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the save gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the save gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc --  Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the focus gate, numpy array of shape (n_a, n_a + n_x)
                        bo --  Bias of the focus gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, xt, parameters)
    
    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilda),
          c stands for the memory value
    """

    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"]
    bf = parameters["bf"]
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    Wy = parameters["Wy"]
    by = parameters["by"]
    
    # Retrieve dimensions from shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    ### START CODE HERE ###
    # Concatenate a_prev and xt (≈3 lines)
    # concat = np.zeros([n_x+n_a, m])
    # concat[:n_a, :] = a_prev
    # concat[n_a:, :] = xt
    concat = np.vstack((a_prev, xt))
    # Compute values for ft, it, cct, c_next, ot, a_next using the formulas given figure (4) (≈6 lines)
    ft = sigmoid(np.dot(Wf, concat) + bf)
    it = sigmoid(np.dot(Wi, concat) + bi)
    cct = np.tanh(np.dot(Wc, concat) + bc)
    c_next = ft*c_prev + it*cct
    ot = sigmoid(np.dot(Wo, concat) + bo)
    a_next = ot * np.tanh(c_next)
    # Compute prediction of the LSTM cell (≈1 line)
    
    ### END CODE HERE ###

    # store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

3.2- Forward pass for LSTM

Now that you have implemented one step of an LSTM, you can now iterate this over this using a for-loop to process a sequence of Tx inputs.

def lstm_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network using an LSTM-cell described in Figure (3).

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the save gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the save gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the focus gate, numpy array of shape (n_a, n_a + n_x)
                        bo -- Bias of the focus gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)
    """
    # Initialize "caches", which will track the list of all the caches
    caches = []
    
    ### START CODE HERE ###
    # Retrieve dimensions from shapes of xt and Wy (≈2 lines)
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wy"].shape
    
    # initialize "a", "c" and "y" with zeros (≈3 lines)
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))
    
    # Initialize a_next and c_next (≈2 lines)
    a_next = a0
    c_next = np.zeros((n_a, m))
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, next memory state, compute the prediction, get the cache (≈1 line)
        a_next, c_next, yt_pred, cache = lstm_cell_forward(x[:,:,t],a_next,c_next,parameters)
        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next
        # Save the value of the prediction in y (≈1 line)
        y[:,:,t] = yt_pred
        # Save the value of the next cell state (≈1 line)
        c[:,:,t] = c_next
        # Append the cache into caches (≈1 line)
        caches.append(cache)
        
    ### END CODE HERE ###
    
    # store values needed for backward propagation in cache
    caches = (caches, x)

    return a, y, c, caches

4- Backward propagation for LSTM

占坑。