Week 13 - Natural Language Processing and Word Embeddings

Word Embeddings

Previously we represented each word as a one-hot vector, written for example o_3455 (the subscript is the word's index in the vocabulary), but this representation has a serious drawback.
image.png
For example, apple and pear are clearly related, yet a neural network fed one-hot vectors cannot capture that similarity: the inner product of any two different one-hot vectors is 0, so every pair of distinct words is equally far apart. Intuitively, though, the distance between apple and pear should be much smaller than the distance between apple and a country name.
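
A minimal numpy sketch of this point; the dense "feature" vectors below are made up purely for illustration:

import numpy as np

vocab_size = 5

def one_hot(i, n=vocab_size):
    v = np.zeros(n)
    v[i] = 1.0
    return v

o_apple, o_pear, o_country = one_hot(0), one_hot(1), one_hot(2)

# Any two distinct one-hot vectors are orthogonal, so all pairwise similarities are 0.
print(np.dot(o_apple, o_pear), np.dot(o_apple, o_country))       # 0.0 0.0

# With (made-up) dense feature vectors, apple and pear can end up close together.
cos = lambda u, v: np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
e_apple   = np.array([0.90, 0.10, 0.05])    # hypothetical features, e.g. fruit-ness, royalty, size
e_pear    = np.array([0.85, 0.00, 0.10])
e_country = np.array([0.00, 0.20, 0.90])
print(cos(e_apple, e_pear) > cos(e_apple, e_country))             # True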

We therefore represent each word with a featurized vector instead. As a simple example, suppose we describe every word by a set of features; this gives a new embedding representation, which we write as e_5391. With this representation the distance between apple and orange becomes small, and an algorithm can discover that apple and orange are more similar to each other than apple and a country.
image.png

Of course, the feature vectors we actually learn are more abstract than a handful of named features. If we learn, say, 300-dimensional embeddings, we usually map the 300-dimensional vectors down to a 2-D space so they can be visualized, typically with the t-SNE algorithm.
image.png
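
A sketch of such a visualization with scikit-learn's t-SNE; the embedding matrix and word list here are random placeholders standing in for learned embeddings:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

E = np.random.randn(1000, 300)                    # placeholder: (vocab_size, 300) embedding matrix
words = [f"word_{i}" for i in range(1000)]        # placeholder vocabulary

# Non-linear map from 300-D down to 2-D for plotting.
coords = TSNE(n_components=2, init="random", perplexity=30).fit_transform(E)

plt.scatter(coords[:, 0], coords[:, 1], s=2)
for i in range(0, 1000, 100):                     # label a few points
    plt.annotate(words[i], coords[i])
plt.show()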

Using word embeddings

Start with a named entity recognition example. Suppose our training set is small and does not even contain the word durian; we can then plug in word vectors that were trained beforehand.
Embedding-learning algorithms train on very large text corpora, so even with a small task-specific training set, results are still reasonable if we use pre-trained word vectors. This is a form of transfer learning: word vectors learned from a large corpus are transferred to our own task.
image.png
In this example, the word farmer later in the sentence tells us that the earlier word is a person's name, so a bidirectional RNN is the better choice.

The steps for transfer learning with word embeddings are summarized below.

  1. Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.
  2. Transfer the embeddings to the new task, which has a smaller training set (say, 100k words). For example, represent each word with a 300-dimensional embedding; the benefit is that a much lower-dimensional, denser feature vector replaces the original 10,000-dimensional one-hot vector.
  3. Optional: continue to fine-tune the word embeddings with the new data. This is usually worthwhile only when the new dataset is fairly large.

Word embeddings help most when the task's training set is relatively small, which is why they are so widely used in NLP, e.g. in named entity recognition, text summarization, parsing and co-reference resolution, all standard NLP tasks. They are used less for language modeling and machine translation, because for those tasks we usually have plenty of data anyway. As with transfer learning in general, transferring from task A to task B pays off only when A has a lot of data and B has little.

It is also worth contrasting this with face recognition. There we train a Siamese network that learns a 128-dimensional encoding of any face and compare encodings to decide whether two images show the same person; whatever face is fed in, the network returns an encoding. In NLP, by contrast, we work with a fixed vocabulary, and words outside it are mapped to a special unknown-word token such as "<UNK>".

Properties of word embeddings

One property of word embeddings is that they support analogy reasoning. We want embeddings to capture a word's features, so that given a question such as "man is to woman as king is to what?", the embeddings let us derive the answer.
image.png
As shown in the figure, if we compute e_man - e_woman and e_king - e_queen, the main difference captured by both vectors turns out to be gender. So the analogy is answered as follows: when asked "man is to woman as king is to ?", the algorithm computes e_man - e_woman and then searches for the word whose vector makes e_man - e_woman ≈ e_king - e_?. The relation holds approximately when the missing word is queen. This idea helped many researchers build a deeper understanding of word embeddings.

Paper: Linguistic regularities in continuous space word representations, 2013.

To carry out the analogy, we find the word with the largest similarity:
image.png
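
The figure's criterion can be written as a search over the vocabulary (reconstructed from the description above), where sim is usually cosine similarity:

$$w^{*} = \arg\max_{w}\ \mathrm{sim}\left(e_{w},\ e_{king} - e_{man} + e_{woman}\right)$$
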
Also note that the 2-D space produced by t-SNE will not necessarily show a parallelogram like the one in the left figure, because t-SNE is a highly non-linear mapping.

Commonly used similarity functions (the formulas are written out after the list):

  1. Cosine similarity (the usual choice for comparing word embeddings):
    image.png
  2. Euclidean distance:
    image.png
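
For reference, these are the two formulas the images above stand for (the Euclidean distance is sometimes used in squared form); note that cosine similarity grows as vectors become more similar, whereas distance shrinks:

$$\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert_2 \, \lVert v \rVert_2}, \qquad d(u, v) = \lVert u - v \rVert_2$$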

By training on a large corpus, embedding algorithms discover analogy-reasoning patterns like the following:
image.png

The embedding matrix

Multiplying the embedding matrix E by a word's one-hot vector gives that word's embedding, i.e. E · o_j = e_j. In practice, though, we never do this matrix multiplication; we simply look up the corresponding column of E, which is far more efficient, and libraries provide this directly, e.g. keras.layers.Embedding. When learning, E is initialized randomly and then learned by gradient descent.
image.png
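
A small sketch contrasting the multiplication with the lookup (shapes are illustrative: a 10,000-word vocabulary with 300-dimensional embeddings, and the word index is arbitrary):

import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.randn(emb_dim, vocab_size)    # embedding matrix, one column per word

j = 6257                                    # index of some word, say "orange"
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                # its one-hot vector

e_via_matmul = E @ o_j                      # O(emb_dim * vocab_size) multiplication
e_via_lookup = E[:, j]                      # direct column lookup -- what embedding layers actually do

assert np.allclose(e_via_matmul, e_via_lookup)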

Learning word embeddings: a concrete algorithm

Paper: A neural probabilistic language model, 2003.

Here is one of the earliest successful NLP models for learning the embedding matrix E: given, say, the previous four words, predict which word comes next:
image.png

Other context/target pairs:

Earlier the algorithm predicted the word juice; we call it the target word, and it is learned from its context (the last 4 words). Researchers have tried many different kinds of context. If the goal is to build a language model, the natural context is the few words preceding the target; but if the goal is only to learn word embeddings rather than a language model, other contexts work just as well.

For example, we can pose a learning problem whose context is the four words to the left and the four to the right of the target: the algorithm sees "a glass of orange" and "to go along with" and must predict the word in between, so the embeddings of the 4 left words and the 4 right words are fed to a neural network that predicts the middle word. Alternatively, the context can be just the single preceding word, used to predict the next word, or even one nearby word, e.g. using glass to predict juice.
image.png

This idea of using nearby words as context is exactly the skip-gram model of Word2Vec, introduced next.

Word2Vec

Paper: Efficient estimation of word representations in vector space, 2013.

Suppose we are given a sentence:
image.png
In the skip-gram model we construct a supervised learning problem by extracting context-target pairs, and the context is not necessarily the n words immediately before the target. Instead we pick a word at random as the context word, say orange, and then pick the target word at random within some window around it (say within 5 or 10 words on either side), e.g. juice.
image.png
So the supervised problem is: given the context word, predict a target word chosen at random within ±10 (or ±5) words of it. The point of setting up this problem is not to solve it well for its own sake; we only use it as a vehicle for learning good word embeddings.
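
A minimal sketch of generating such (context, target) pairs from a tokenized sentence (the window size and the number of pairs per context word are illustrative choices):

import random

def skipgram_pairs(tokens, window=5, pairs_per_context=1):
    """Pick each position as a context word, then sample a target within +/- window of it."""
    pairs = []
    for c in range(len(tokens)):
        for _ in range(pairs_per_context):
            lo, hi = max(0, c - window), min(len(tokens), c + window + 1)
            t = random.choice([i for i in range(lo, hi) if i != c])
            pairs.append((tokens[c], tokens[t]))
    return pairs

sentence = "i want a glass of orange juice to go along with my cereal".split()
print(skipgram_pairs(sentence, window=5)[:5])    # e.g. [('i', 'glass'), ('want', 'a'), ...]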

The skip-gram model

Suppose the vocabulary contains 10,000 words. Given a context word as input, the model has to predict the target word. (The name skip-gram comes from the fact that the target to be predicted may sit several positions to the left or right of the context word.)
image.png

The network works as follows: the context word's one-hot vector is multiplied by the embedding matrix to give e_c, which is then passed through a softmax unit that outputs a distribution over possible target words. The parameters are therefore the embedding matrix itself plus the parameters of the softmax unit. The network structure and the softmax function are:
image.png

The corresponding loss is shown below; note that y is a one-hot vector while y_hat is the predicted softmax distribution:
image.png
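
A sketch of the forward pass and loss for one (context, target) pair, with E holding the embeddings and theta the softmax parameters (the initialization and the indices are placeholders); it makes visible why the denominator is the bottleneck:

import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.randn(emb_dim, vocab_size) * 0.01       # word embeddings, one column per word
theta = np.random.randn(emb_dim, vocab_size) * 0.01   # softmax parameters, one column per target word

def skipgram_softmax_loss(c, t):
    e_c = E[:, c]                                 # embedding of the context word
    logits = theta.T @ e_c                        # (vocab_size,)
    p = np.exp(logits) / np.sum(np.exp(logits))   # denominator sums over the whole vocabulary -- expensive
    return -np.log(p[t])                          # cross-entropy against the one-hot target

print(skipgram_softmax_loss(c=6257, t=4834))      # arbitrary indices, for illustration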

The skip-gram model has a practical problem, specifically in the softmax: every time we compute a probability we must sum over the entire vocabulary in the denominator, and when the vocabulary is large this sum becomes quite slow.
image.png
One remedy is the hierarchical softmax classifier: instead of deciding among all classes at once, a tree of binary classifiers (binary logistic regressions) first tells us which part of the vocabulary the word falls into. A Huffman tree is commonly used, so the problem reduces to a sequence of binary classifications. In practice the hierarchical softmax is built so that common words sit near the top of the tree and rare words sit deeper, i.e. an unbalanced binary tree (various heuristics exist). A more detailed explanation: https://www.cnblogs.com/pinard/p/7243513.html
image.png

Next, how do we sample the context c? (Once the context is chosen, the target t is sampled within, say, ±10 words of it.) One option is to sample uniformly at random over the corpus, but then words like the/a/of/and come up constantly while words like apple/durian rarely do, which is undesirable (most updates would be spent on very frequent words). In practice, heuristics are used to balance the sampling of frequent and less frequent words.
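
One such heuristic is the subsampling rule from the Word2Vec paper, which keeps very frequent words only occasionally; the sketch below uses the commonly quoted form with threshold t = 1e-5 (treat the exact formula and constant as assumptions rather than the only choice):

import numpy as np

def keep_probability(word_freq, t=1e-5):
    """word_freq: fraction of corpus tokens that are this word.
    Very frequent words are rarely kept; rare words are almost always kept."""
    return min(1.0, np.sqrt(t / word_freq))

print(keep_probability(0.05))    # a word like "the": kept only ~1.4% of the time
print(keep_probability(1e-6))    # a rare word: always kept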

The CBOW model (Continuous Bag-of-Words)

CBOW takes the words surrounding a middle word as input and uses them to predict that middle word.

Negative sampling

image.png
Negative sampling solves the computational problem of the skip-gram model nicely. We construct a new supervised learning problem: given a pair of words, such as orange and juice, predict whether they form a context-target pair. Here orange/juice is a positive example, labeled 1, while orange/king is a negative example, labeled 0. So we first sample a genuine context word and target word, giving the positive example in the first row of the table; then, keeping the same context word, we pick a few words at random from the dictionary as negative examples (it does not matter if a randomly chosen word happens to fall within the context window). The supervised problem is then: given such a word pair as input, predict the label y.
image.png
In other words, the classifier has to distinguish pairs obtained by sampling two nearby words from pairs obtained by random sampling; telling these two sampling procedures apart is exactly what we want it to learn.

That is how the training set is generated. How should K (the number of negative examples per positive example) be chosen? For small datasets K = 5-20 works well; for large datasets K = 2-5. In this example we take K = 4.
image.png
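
A sketch of building one such training block: one positive row plus K random negatives (the toy vocabulary and uniform random choice are placeholders; a better sampling distribution is discussed below):

import random

def negative_sampling_rows(context, target, vocab, k=4):
    """Return (context, word, label) rows: the true pair labelled 1, plus k random words labelled 0."""
    rows = [(context, target, 1)]
    for _ in range(k):
        rows.append((context, random.choice(vocab), 0))   # may occasionally hit a real neighbour; that's fine
    return rows

vocab = ["the", "of", "orange", "juice", "king", "book", "durian"]   # toy vocabulary
for row in negative_sampling_rows("orange", "juice", vocab, k=4):
    print(row)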

The corresponding model: originally we used a softmax classifier, but it is too expensive to compute.
image.png
So we train with negative sampling instead. The inputs and outputs then become:
image.png
We therefore use a binary logistic regression classifier to decide whether a pair is a positive or a negative example.
image.png
The optimized skip-gram model as a whole looks like this: given the context word orange, multiply by the embedding matrix E to obtain its embedding vector; this feeds 10,000 possible logistic regression problems, one of which decides whether the target word is juice, another whether it is king, and so on. Think of them as 10,000 binary classifiers, but we do not train all of them on every iteration: each iteration trains only the K + 1 classifiers corresponding to the sampled words (1 positive and K negative), so each iteration is much cheaper than the full softmax.
image.png

One remaining detail: how do we choose the negative samples? We could sample candidate target words according to their empirical frequency in the corpus (but then the, of, and and so on get sampled over and over), or, at the other extreme, sample uniformly with probability 1 over the vocabulary size. The authors found empirically that something in between works best: sample each word with probability proportional to its frequency raised to the power 3/4, normalized over the vocabulary.
image.png
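
The 3/4-power distribution as a small numpy sketch (the word counts are made up):

import numpy as np

words  = np.array(["the", "of", "orange", "juice", "durian"])
counts = np.array([5000.0, 3000.0, 40.0, 25.0, 2.0])     # toy corpus counts

freq = counts / counts.sum()
p_neg = freq ** 0.75
p_neg /= p_neg.sum()                                     # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

print(dict(zip(words, p_neg.round(3))))
print(np.random.choice(words, size=4, p=p_neg))          # draw K = 4 negative samples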

GloVe word vectors

Paper: GloVe: Global Vectors for Word Representation, 2014.

Previously we picked pairs of words that appear close together in the corpus, i.e. context and target words; what GloVe does is make this relationship explicit. Let X_ij be the number of times word i appears in the context of word j; here i and j play the same roles as t and c, so X_ij corresponds to X_tc. Depending on how context and target are defined we may have X_ij = X_ji: if the context is any word within a symmetric window to the left or right, the relation is symmetric, but if the context is always the word immediately before the target, it is not.
image.png

For GloVe we define context and target as any two words that appear near each other, say within 10 words of each other; X_ij is then a counter of how often words i and j occur close together.

The GloVe model minimizes the squared difference between the two sides:
image.png

The inner product in the formula expresses how related the two words are, i.e. how strongly t and c (or i and j) are associated, in other words how often they co-occur, which is what X_ij measures. We then solve for the parameters theta and e by minimizing the objective with gradient descent. One detail: when X_ij = 0, log 0 is undefined, so an extra weighting term f(X_ij) is added.
image.png
When X_ij = 0 we adopt the convention 0 · log 0 = 0, so the sum effectively runs only over word pairs that co-occur at least once. The weighting f(X_ij) has a second role: some English words are extremely frequent (this, is, of, a, ...), often called stop words, and there is a continuum between very frequent and very rare words; rare words such as durian should still be taken into account, just not weighted like the most common words. The weighting function f(X_ij) therefore gives words of different frequencies appropriate weights (see the GloVe paper for the exact choice).
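
A sketch of the objective with a weighting function of the shape used in the GloVe paper, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise; x_max = 100 and alpha = 0.75 are the paper's reported defaults, taken here as assumptions:

import numpy as np

def glove_weight(X, x_max=100.0, alpha=0.75):
    # 0 where X == 0 (so log X is never needed there), capped at 1 for very frequent pairs
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)

def glove_loss(X, theta, e, b_t, b_c):
    """X: (V, V) co-occurrence counts; theta, e: (V, d) parameters; b_t, b_c: (V,) bias terms."""
    logX = np.log(np.where(X > 0, X, 1.0))               # placeholder where X == 0; its weight is 0 anyway
    diff = theta @ e.T + b_t[:, None] + b_c[None, :] - logX
    return np.sum(glove_weight(X) * diff ** 2)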

A final interesting point about this algorithm is that theta and e play completely symmetric roles. One way to train it is therefore to initialize theta and e in the same way, minimize the objective with gradient descent, and once every word has been processed, average the two, so that for a given word w we take:
image.png

A note on the featurization view of embeddings

Earlier we motivated word embeddings with a simple table of hand-designed features like the one below. In practice, however, it is hard to interpret the individual dimensions of a learned embedding, i.e. hard to say which axis corresponds to gender and so on.
image.png

For example, consider the GloVe objective we just studied:
image.png
Using a little linear algebra, the inner product term can be rewritten as:
image.png
This shows we cannot guarantee that the learned axes correspond to axes a human would find easy to interpret. The first learned feature might be some mixture of gender, royalty, age, food, whether the word is a noun or an action verb, and everything else, so it is hard to pick out independent components.
image.png

Even under such an arbitrary (invertible) linear transformation of the feature axes, the parallelogram map used for analogies still works; so despite a potentially arbitrary linear transformation of the underlying features, we can still learn embeddings that solve analogy problems via the parallelogram construction.

Sentiment classification

Problem statement: typically we classify reviews:
image.png

One of the biggest challenges in sentiment classification is that the labeled training set may be limited. Training sets from 10,000 up to 100,000 words are common, and sometimes there are fewer than 10,000 words; word embeddings give a noticeable boost, especially when the training set is small.

A simple sentiment classification model:
image.png
Note that the averaging operation lets the model handle reviews of any length.

The problem with this model is that it ignores word order. In particular, for a negative review like "Completely lacking in good taste, good service, and good ambience.", the word good appears many times, so the averaged (or summed) embedding carries a strong sense of good and the classifier may well call the review positive. A better approach is to use an RNN for sentiment classification.

RNN for sentiment classification

We can use the following RNN model:
image.png
This is the many-to-one architecture introduced earlier. Because it takes word order into account, it performs noticeably better.

Debiasing word embeddings

Paper: Man is to computer programmer as woman is to homemaker? Debiasing word embeddings, 2016.

Machine learning and AI systems are increasingly trusted to assist with, or even make, extremely important decisions, so we want them to be as free as possible of unintended forms of bias such as gender bias or ethnic bias. This section shows some ways to reduce or eliminate these forms of bias in word embeddings.

Typical examples of bias in word embeddings:
image.png

Specifically, word embeddings can reflect gender, ethnicity, age, sexual orientation and other biases of the text used to train the model; these biases often correlate with socioeconomic status.

Suppose the embeddings of the words below are plotted in the plane as follows:
image.png

Step 1 is to identify the bias direction, for example by averaging difference vectors (see the sketch after the figure):
image.png
(Averaging is actually an over-simplification; the original paper instead uses singular value decomposition to determine the bias direction.)
image.png
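
A sketch of the simple averaging version (word_to_vec_map is the same dictionary used in the assignment below; the definitional pairs are examples):

import numpy as np

def bias_direction(pairs, word_to_vec_map):
    """Average the difference vectors of definitional pairs, e.g. (woman, man), (mother, father)."""
    diffs = [word_to_vec_map[a] - word_to_vec_map[b] for a, b in pairs]
    g = np.mean(diffs, axis=0)
    return g / np.linalg.norm(g)

# g = bias_direction([("woman", "man"), ("mother", "father"), ("girl", "boy")], word_to_vec_map)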

Step 2 is to Neutralize: for every word that is not definitional (i.e. gender is not part of its meaning), project away the bias component, e.g. babysitter and doctor in the figure. For such words we remove or shrink their component along the bias direction, reducing or eliminating their gender-stereotyped tendency.

Step 3 is to Equalize pairs. For word pairs such as grandmother/grandfather or girl/boy, we want the only difference to be gender, and we want both words in a pair to end up at the same distance from neutral words such as babysitter and doctor.
image.png

Finally, the paper's authors train a classifier to decide which words are definitional with respect to gender and which are not. It turns out that most English words are not definitionally gendered; only a small subset of words is not gender-neutral.

This week's assignments

Operations on word vectors

1- Cosine similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.sqrt(np.sum(u * u))
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.sqrt(np.sum(v * v))
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###

    return cosine_similarity

2- Word analogy task

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
        best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """

    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    ### START CODE HERE ###
    # Get the word embeddings e_a, e_b and e_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###

    words = word_to_vec_map.keys()
    max_cosine_sim = -100   # Initialize max_cosine_sim to a large negative number
    best_word = None        # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:         # w is a string
        # to avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue

        ### START CODE HERE ###
        # Compute cosine similarity between (e_b - e_a) and (e_w - e_c) (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        # If the cosine_sim is more than the max_cosine_sim seen so far,
        # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###

    return best_word

3- Debiasing word vectors (OPTIONAL/UNGRADED)

3.1- Neutralize bias for non-gender specific words

image.png

image.png

def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis.
    This function ensures that gender neutral words are zero in the gender subspace.

    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.

    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """

    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]

    # Compute e_biascomponent using the formula given above: the projection of e onto g. (≈ 1 line)
    e_biascomponent = np.dot(e, g) / np.square(np.linalg.norm(g)) * g

    # Neutralize e by subtracting e_biascomponent from it
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###

    return e_debiased

3.2- Equalization algorithm for gender-specific words

Next, let's see how debiasing can also be applied to word pairs such as "actress" and "actor." Equalization is applied to pairs of words that you might want to differ only through the gender property. As a concrete example, suppose that "actress" is closer to "babysit" than "actor." By applying neutralization to "babysit" we can reduce the gender stereotype associated with babysitting. But this still does not guarantee that "actor" and "actress" are equidistant from "babysit." The equalization algorithm takes care of this.

image.png

The derivation of the linear algebra to do this is a bit more complex. (See Bolukbasi et al., 2016 for details.) But the key equations are:
image.png

def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.

    Arguments:
        pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor")
        bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
        word_to_vec_map -- dictionary mapping words to their corresponding vectors

    Returns
        e_1 -- word vector corresponding to the first word
        e_2 -- word vector corresponding to the second word
    """

    ### START CODE HERE ###
    # Step 1: Select the word vector representation of each word in "pair". Use word_to_vec_map. (≈ 2 lines)
    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]

    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.dot(mu, bias_axis) / np.sum(bias_axis**2) * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Compute the projections of e_w1 and e_w2 over the bias axis (≈ 2 lines)
    e_w1B = np.dot(e_w1, bias_axis) / np.sum(bias_axis**2) * bias_axis
    e_w2B = np.dot(e_w2, bias_axis) / np.sum(bias_axis**2) * bias_axis

    # Step 5: Adjust the bias part of e_w1B and e_w2B using the formulas given in the figure above (≈ 2 lines)
    corrected_e_w1B = np.sqrt(np.abs(1 - np.sum(mu_orth**2))) * ((e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B))
    corrected_e_w2B = np.sqrt(np.abs(1 - np.sum(mu_orth**2))) * ((e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B))

    # Step 6: Debias by adding the corrected bias components back to mu_orth (≈ 2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth
    ### END CODE HERE ###

    return e1, e2

Please feel free to play with the input words in the cell above, to apply equalization to other pairs of words.

These debiasing algorithms are very helpful for reducing bias, but are not perfect and do not eliminate all traces of bias. For example, one weakness of this implementation was that the bias direction g was defined using only the pair of words woman and man. As discussed earlier, if g were defined by computing g1 = e_woman − e_man; g2 = e_mother − e_father; g3 = e_girl − e_boy; and so on and averaging over them, you would obtain a better estimate of the "gender" dimension in the 50-dimensional word embedding space. Feel free to play with such variants as well.

Emojify!

1- Baseline model: Emojifier-V1

1.1- Dataset EMOJISET

Let’s start by building a simple baseline classifier.

You have a tiny dataset (X, Y) where:

  • X contains 127 sentences (strings)
  • Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence

image.png

1.2- Overview of the Emojifier-V1

In this part, you are going to implement a baseline model called “Emojifier-v1”.
image.png

To get our labels into a format suitable for training a softmax classifier, let's convert Y from its current shape (m, 1) into a "one-hot representation" (m, 5), where each row is a one-hot vector giving the label of one example. You can do so using the next code snippet. Here, Y_oh stands for "Y-one-hot" in the variable names Y_oh_train and Y_oh_test:

Y_oh_train = convert_to_one_hot(Y_train, C = 5)
Y_oh_test = convert_to_one_hot(Y_test, C = 5)

1.3- Implementing Emojifier-V1

As shown in Figure (2), the first step is to convert an input sentence into its word vector representations, which are then averaged together. As in the previous exercise, we will use pretrained 50-dimensional GloVe embeddings. Run the following cell to load word_to_vec_map, which contains all the vector representations.

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

Averaging the word vectors:

def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the GloVe representation of each word
    and averages its value into a single vector encoding the meaning of the sentence.

    Arguments:
        sentence -- string, one training example from X
        word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation

    Returns:
        avg -- average vector encoding information about the sentence, numpy-array of shape (50,)
    """

    ### START CODE HERE ###
    # Step 1: Split sentence into list of lower case words (≈ 1 line)
    words = sentence.lower().split()

    # Initialize the average word vector, should have the same shape as your word vectors.
    avg = np.zeros((50,))

    # Step 2: average the word vectors. You can loop over the words in the list "words".
    for w in words:
        avg += word_to_vec_map[w]
    avg = avg / len(words)
    ### END CODE HERE ###

    return avg

image.png
The model:

def model(X, Y, word_to_vec_map, learning_rate=0.01, num_iterations=400):
    """
    Model to train word vector representations in numpy.

    Arguments:
        X -- input data, numpy array of sentences as strings, of shape (m, 1)
        Y -- labels, numpy array of integers between 0 and 4, numpy-array of shape (m, 1)
        word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
        learning_rate -- learning rate for the stochastic gradient descent algorithm
        num_iterations -- number of iterations

    Returns:
        pred -- vector of predictions, numpy-array of shape (m, 1)
        W -- weight matrix of the softmax layer, of shape (n_y, n_h)
        b -- bias of the softmax layer, of shape (n_y,)
    """

    np.random.seed(1)

    # Define number of training examples
    m = Y.shape[0]    # number of training examples
    n_y = 5           # number of classes
    n_h = 50          # dimensions of the GloVe vectors

    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))

    # Convert Y to Y_onehot with n_y classes
    Y_oh = convert_to_one_hot(Y, C=n_y)

    # Optimization loop
    for t in range(num_iterations):    # Loop over the number of iterations
        for i in range(m):             # Loop over the training examples

            ### START CODE HERE ### (≈ 4 lines of code)
            # Average the word vectors of the words from the i'th training example
            avg = sentence_to_avg(X[i], word_to_vec_map)

            # Forward propagate the avg through the softmax layer
            z = np.dot(W, avg) + b
            a = softmax(z)

            # Compute cost using the i'th training label's one hot representation and "a" (the output of the softmax)
            cost = -np.sum(Y_oh[i] * np.log(a))
            ### END CODE HERE ###

            # Compute gradients
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y, 1), avg.reshape(1, n_h))
            db = dz

            # Update parameters with Stochastic Gradient Descent
            W = W - learning_rate * dW
            b = b - learning_rate * db

        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map)

    return pred, W, b

1.4- Examining test set performance

image.png

image.png

Printing the confusion matrix can also help understand which classes are more difficult for your model. A confusion matrix shows how often an example whose label is one class (“actual” class) is mislabeled by the algorithm with a different class (“predicted” class).
image.png

What you should remember from this part:

  • Even with just 127 training examples, you can get a reasonably good model for Emojifying. This is due to the generalization power that word vectors give you.
  • Emojify-V1 will perform poorly on sentences such as “This movie is not good and not enjoyable” because it doesn’t understand combinations of words—it just averages all the words’ embedding vectors together, without paying attention to the ordering of words. You will build a better algorithm in the next part.

2- Emojifier-V2: Using LSTMs in Keras:

import numpy as np
np.random.seed(0)
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
np.random.seed(1)

2.1- Overview of the model

image.png

2.2- Keras and mini-batching

In this exercise, we want to train Keras using mini-batches. However, most deep learning frameworks require that all sequences in the same mini-batch have the same length. This is what allows vectorization to work: If you had a 3-word sentence and a 4-word sentence, then the computations needed for them are different (one takes 3 steps of an LSTM, one takes 4 steps) so it’s just not possible to do them both at the same time.

The common solution to this is to use padding. Specifically, set a maximum sequence length, and pad all sequences to the same length. For example, if the maximum sequence length is 20, we could pad every sentence with "0"s so that each input sentence is of length 20. Thus, a sentence "i love you" would be represented as (e_i, e_love, e_you, 0⃗, 0⃗, …, 0⃗). In this example, any sentence longer than 20 words would have to be truncated. One simple way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set.
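
For reference, Keras ships a helper that does exactly this padding (a minimal sketch; the sentences_to_indices function implemented below does the same conversion by hand):

from keras.preprocessing.sequence import pad_sequences

sequences = [[12, 7, 39], [4, 18, 7, 251, 9]]               # sentences already converted to word indices
padded = pad_sequences(sequences, maxlen=20, padding='post', value=0)
print(padded.shape)                                          # (2, 20) -- zero-padded on the right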

2.3- The Embedding layer

In Keras, the embedding matrix is represented as a “layer”, and maps positive integers (indices corresponding to words) into dense vectors of fixed size (the embedding vectors). It can be trained or initialized with a pretrained embedding. In this part, you will learn how to create an Embedding() layer in Keras, initialize it with the GloVe 50-dimensional vectors loaded earlier in the notebook. Because our training set is quite small, we will not update the word embeddings but will instead leave their values fixed. But in the code below, we’ll show you how Keras allows you to either train or leave fixed this layer.

The Embedding() layer takes an integer matrix of size (batch size, max input length) as input. This corresponds to sentences converted into lists of indices (integers), as shown in the figure below.

image.png

The largest integer (i.e. word index) in the input should be no larger than the vocabulary size. The layer outputs an array of shape (batch size, max input length, dimension of word vectors).

The first step is to convert all your training sentences into lists of indices, and then zero-pad all these lists so that their length is the length of the longest sentence.

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()` (described in Figure 4).

    Arguments:
        X -- array of sentences (strings), of shape (m, 1)
        word_to_index -- a dictionary mapping each word to its index
        max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this.

    Returns:
        X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """

    m = X.shape[0]    # number of training examples

    ### START CODE HERE ###
    # Initialize X_indices as a numpy matrix of zeros and the correct shape
    X_indices = np.zeros((m, max_len))

    for i in range(m):    # loop over training examples

        # Convert the ith training sentence to lower case and split it into words. You should get a list of words.
        sentence_words = X[i].lower().split()

        # Initialize j to 0
        j = 0

        # Loop over the words of sentence_words
        for w in sentence_words:

            # Set the (i,j)th entry of X_indices to the index of the correct word.
            X_indices[i, j] = word_to_index[w]
            # Increment j to j + 1
            j = j + 1
    ### END CODE HERE ###

    return X_indices

Build the Embedding() layer in Keras, using pre-trained word vectors.

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

    Arguments:
        word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
        word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
        embedding_layer -- pretrained layer Keras instance
    """

    vocab_len = len(word_to_index) + 1                 # Keras Embedding requires vocab size + 1
    emb_dim = word_to_vec_map["cucumber"].shape[0]     # dimensionality of the GloVe vectors (= 50)

    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))        # embedding matrix for the whole vocabulary

    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define Keras embedding layer with the correct output/input sizes, make it non-trainable.
    # Use Embedding(...). Make sure to set trainable=False.
    embedding_layer = Embedding(input_dim=vocab_len, output_dim=emb_dim, trainable=False)
    ### END CODE HERE ###

    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))

    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])

    return embedding_layer

2.4- Building the Emojifier-V2

image.png

Pay attention to the LSTM arguments, in particular return_sequences.

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.

    Arguments:
        input_shape -- shape of the input, usually (max_len,)
        word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
        word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
        model -- a model instance in Keras
    """

    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(input_shape, dtype='int32')

    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)

    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)

    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(units=128, return_sequences=True)(embeddings)

    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)

    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(units=128, return_sequences=False)(X)

    # Add dropout with a probability of 0.5
    X = Dropout(rate=0.5)(X)

    # Propagate X through a Dense layer to get back a batch of 5-dimensional vectors.
    X = Dense(units=5)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)

    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices, outputs=X)

    return model

image.png

Next steps:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)

# Train the model
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)

X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)

What we should remember:

  • If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that may not even have appeared in your training set.
  • Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
    • To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
    • An Embedding() layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If however your labeled dataset is small, it’s usually not worth trying to train a large pre-trained set of embeddings.
    • LSTM() has a flag called return_sequences to decide if you would like to return every hidden state or only the last one.
    • You can use Dropout() right after LSTM() to regularize your network.

Summary: I'm still not very fluent with Keras. I also feel my programming skills need more practice, especially with deep learning frameworks.