10.4 Encoder-Decoder Sequence-to-Sequence Architectures

We have seen in figure 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in figure 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in figures 10.3, 10.4, 10.10, and 10.11 how an RNN can map an input sequence to an output sequence of the same length.

Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture, for learning to generate an output sequence (y^{(1)}, ..., y^{(n_y)}) given an input sequence (x^{(1)}, x^{(2)}, ..., x^{(n_x)}). It is composed of an encoder RNN that reads the input sequence and a decoder RNN that generates the output sequence (or computes the probability of a given output sequence). The final hidden state of the encoder RNN is used to compute a generally fixed-size context variable C, which represents a semantic summary of the input sequence and is given as input to the decoder RNN.

Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related).

We often call the input to the RNN the “context.” We want to produce a representation of this context, C. The context C might be a vector or a sequence of vectors that summarizes the input sequence X = (x^{(1)}, ..., x^{(n_x)}).

THINKING: How can a bidirectional RNN be combined with the LSTMs of an encoder-decoder sequence-to-sequence architecture?

The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence was first proposed by Cho et al. (2014a) and shortly after by Sutskever et al. (2014), who independently developed that architecture and were the first to obtain state-of-the-art translation using this approach. The former system is based on scoring proposals generated by another machine translation system, while the latter uses a standalone recurrent network to generate the translations. These authors respectively called this architecture, illustrated in figure 10.12, the encoder-decoder or sequence-to-sequence architecture. The idea is very simple: (1) an encoder or reader or input RNN processes the input sequence and emits the context C, usually as a simple function of its final hidden state; (2) a decoder or writer or output RNN is conditioned on that fixed-length vector (just as in figure 10.9) to generate the output sequence Y = (y^{(1)}, ..., y^{(n_y)}). The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths n_x and n_y can vary from each other, while previous architectures constrained n_x = n_y = τ. In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of \log P(y^{(1)}, ..., y^{(n_y)} \mid x^{(1)}, ..., x^{(n_x)}) over all the pairs of x and y sequences in the training set. The final state h_{n_x} of the encoder RNN is typically used as the representation C of the input sequence that is provided as input to the decoder RNN.
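As a concrete (though simplified) illustration of this idea, the following PyTorch-style sketch wires an encoder RNN and a decoder RNN together, using the encoder's final hidden state as the context C and training both jointly with a cross-entropy objective. The class name, hyperparameters, and GRU choice are illustrative assumptions, not something prescribed by the text.

```python
# Minimal encoder-decoder (seq2seq) sketch; names and sizes are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        # Encoder ("reader") RNN: consumes the input sequence x^(1..n_x).
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Decoder ("writer") RNN: generates the output sequence y^(1..n_y).
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, x, y_in):
        # x: (batch, n_x) source token ids; y_in: (batch, n_y) target ids
        # shifted right for teacher forcing. n_x and n_y may differ.
        _, h_final = self.encoder(self.src_embed(x))   # (1, batch, hidden)
        # Context C = final encoder state, used here as the decoder's initial state.
        dec_out, _ = self.decoder(self.tgt_embed(y_in), h_final)
        return self.out(dec_out)                        # (batch, n_y, tgt_vocab)

# Joint training maximizes log P(y | x), i.e. minimizes cross-entropy:
#   logits = model(x, y_in)
#   loss = nn.functional.cross_entropy(
#       logits.reshape(-1, logits.size(-1)), y_out.reshape(-1))
```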

If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN as described in section 10.2.4. As we have seen, there are at least two ways for a vector-to-sequence RNN to receive input. The input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step. These two ways can also be combined.
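The sketch below illustrates both ways of feeding a vector-valued context C to the decoder, as well as their combination: C initializes the decoder's hidden state and is also concatenated to the input at every time step. All names and dimensions here are assumptions made for the example.

```python
# Two ways (combined) for a vector-to-sequence decoder to receive the context C.
import torch
import torch.nn as nn

class ContextDecoder(nn.Module):
    def __init__(self, vocab, emb_dim=64, hidden_dim=128, context_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        # (1) C as the initial hidden state: project C to the decoder state size.
        self.init_state = nn.Linear(context_dim, hidden_dim)
        # (2) C as an extra input at every step: concatenate C with each y^(t-1).
        self.rnn = nn.GRU(emb_dim + context_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab)

    def forward(self, y_in, C):
        # y_in: (batch, n_y) previous output tokens; C: (batch, context_dim)
        h0 = torch.tanh(self.init_state(C)).unsqueeze(0)        # way (1)
        C_steps = C.unsqueeze(1).expand(-1, y_in.size(1), -1)   # way (2)
        inp = torch.cat([self.embed(y_in), C_steps], dim=-1)
        out, _ = self.rnn(inp, h0)
        return self.out(out)
```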

There is no constraint that the encoder's hidden layer must have the same size as the decoder's.
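For example, when the two hidden sizes differ, a learned projection can bridge the encoder's final state to the decoder's initial state. This is one common choice, shown here only as an assumption for illustration:

```python
# Bridging different encoder/decoder hidden sizes (illustrative, not prescribed).
import torch
import torch.nn as nn

enc_hidden, dec_hidden = 256, 512            # the two sizes need not match
bridge = nn.Linear(enc_hidden, dec_hidden)
# h_enc: (1, batch, enc_hidden) final encoder state -> decoder initial state
# h_dec0 = torch.tanh(bridge(h_enc))
```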

One clear limitation of this architecture arises when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make C a variable-length sequence rather than a fixed-size vector. Additionally, they introduced an attention mechanism that learns to associate elements of the sequence C with elements of the output sequence. See the section on attention mechanisms for more details.
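To make the idea concrete, the following is a minimal sketch of an additive attention module in the spirit of Bahdanau et al. (2015): the context is now the whole sequence of encoder states, and at each output step the decoder computes a learned weighting over it. This is a simplified illustration; the parameter names and dimensions are assumptions, not the authors' exact formulation.

```python
# Minimal additive (Bahdanau-style) attention sketch; names are illustrative.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, n_x, enc_dim) -- the sequence-valued context C
        # dec_state:  (batch, dec_dim)      -- current decoder hidden state
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                                   # (batch, n_x)
        weights = torch.softmax(scores, dim=-1)          # attention over input positions
        context = (weights.unsqueeze(-1) * enc_states).sum(dim=1)  # (batch, enc_dim)
        return context, weights
```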