
12.4.5 Neural Machine Translation

Machine translation is the task of reading a sentence in one natural language and emitting a sentence with the equivalent meaning in another language. Machine translation systems often involve many components. At a high level, there is often one component that proposes many candidate translations. Many of these translations will not be grammatical due to differences between the languages. For example, many languages put adjectives after nouns, so when translated to English directly they yield phrases such as “apple red.” The proposal mechanism suggests many variants of the translation, ideally including “red apple.” A second component of the translation system, a language model, evaluates the proposed translations, and can score “red apple” as better than “apple red.”
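To make the division of labor concrete, here is a toy sketch (in Python) of the reranking step: the translation component proposes candidates, and a language-model score picks the more fluent one. The bigram probabilities and the `lm_log_prob` function are made-up stand-ins for a real language model, not part of any cited system.

```python
import math

def lm_log_prob(sentence):
    """Stand-in language model: a lookup of made-up bigram log-probabilities."""
    bigram_logp = {("red", "apple"): math.log(0.010),
                   ("apple", "red"): math.log(0.001)}
    words = sentence.split()
    # Sum bigram log-probabilities; unseen bigrams get a small floor probability.
    return sum(bigram_logp.get(bg, math.log(1e-6)) for bg in zip(words, words[1:]))

# Candidates proposed by the (hypothetical) translation component.
candidates = ["apple red", "red apple"]
print(max(candidates, key=lm_log_prob))   # -> "red apple"
```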

The earliest use of neural networks for machine translation was to upgrade the language model of a translation system by using a neural language model (Schwenk et al., 2006; Schwenk, 2010). Previously, most machine translation systems had used an n-gram model for this component. The n-gram-based models used for machine translation include not just traditional back-off n-gram models (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999) but also maximum entropy language models (Berger et al., 1996), in which an affine-softmax layer predicts the next word given the presence of frequent n-grams in the context.

Traditional language models simply report the probability of a natural language sentence. Because machine translation involves producing an output sentence given an input sentence, it makes sense to extend the natural language model to be conditional. As described in section 6.2.1.1, it is straightforward to extend a model that defines a marginal distribution over some variable to define a conditional distribution over that variable given a context C, where C might be a single variable or a list of variables. Devlin et al. (2014) beat the state of the art on some statistical machine translation benchmarks by using an MLP to score a phrase t_1, t_2, \ldots, t_k in the target language given a phrase s_1, s_2, \ldots, s_n in the source language. The MLP estimates

$$ P(t_1, t_2, \ldots, t_k \mid s_1, s_2, \ldots, s_n). $$

The estimate formed by this MLP replaces the estimate provided by conditional n-gram models.
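The sketch below (Python/NumPy) illustrates how an MLP with an affine-softmax output layer can assign such a conditional score to a fixed-length target phrase given a fixed-length source phrase. The sizes, the random parameters, and the `phrase_score` function are illustrative assumptions, not the architecture of Devlin et al. (2014); in practice the parameters would be learned by maximizing log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
vocab_size, embed_dim, hidden_dim = 1000, 32, 64
k, n = 3, 3          # fixed target / source phrase lengths

# Embedding tables and MLP parameters (random stand-ins for learned weights).
E_src = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
E_tgt = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
W1 = rng.normal(scale=0.1, size=((k + n) * embed_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))
b2 = np.zeros(vocab_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phrase_score(source_ids, target_ids):
    """Return log P(t_1..t_k | s_1..s_n) under a toy fixed-window MLP.

    Each target word is predicted by an affine-softmax layer applied to the
    concatenated embeddings of the source phrase and the previous target
    words (padded to a fixed window), so the score is conditional on the source.
    """
    log_p = 0.0
    for i, t in enumerate(target_ids):
        prefix = [0] * (k - i) + list(target_ids[:i])        # pad to length k
        x = np.concatenate([E_src[source_ids].ravel(), E_tgt[prefix].ravel()])
        h = np.tanh(x @ W1 + b1)
        p = softmax(h @ W2 + b2)
        log_p += np.log(p[t])
    return log_p

print(phrase_score([5, 17, 42], [7, 8, 9]))   # toy word ids; higher is better
```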

A drawback of the MLP-based approach is that it requires the sequences to be preprocessed to be of fixed length. To make the translation more flexible, we would like to use a model that can accommodate variable-length inputs and variable-length outputs. An RNN provides this ability. Section 10.2.4 describes several ways of constructing an RNN that represents a conditional distribution over a sequence given some input, and section 10.4 describes how to accomplish this conditioning when the input is a sequence. In all cases, one model first reads the input sequence and emits a data structure that summarizes the input sequence. We call this summary the “context” C. The context C may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce C may be an RNN (Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN, then reads the context C and generates a sentence in the target language. This general idea of an encoder-decoder framework for machine translation is illustrated in figure 12.5.
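A minimal sketch of the encoder-decoder idea, again in Python/NumPy: an encoder RNN reads the source sentence and summarizes it as a single context vector C, and a decoder RNN generates target words conditioned on C. The vanilla tanh cell, the greedy decoding loop, and all names and sizes are assumptions made for brevity, not the exact architecture of any cited system (which typically use gated cells and beam search).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes; all parameters below are random stand-ins for learned weights.
vocab, d_emb, d_hid = 1000, 32, 64
E = rng.normal(scale=0.1, size=(vocab, d_emb))          # shared embedding table
W_xh = rng.normal(scale=0.1, size=(d_emb, d_hid))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_hid, vocab))       # decoder output layer

def rnn_step(x, h):
    """One step of a vanilla (tanh) RNN cell."""
    return np.tanh(x @ W_xh + h @ W_hh)

def encode(source_ids):
    """Read the source sentence and summarize it as a context vector C."""
    h = np.zeros(d_hid)
    for s in source_ids:
        h = rnn_step(E[s], h)
    return h                                             # C: one fixed-size vector

def decode(C, max_len=10, bos=1, eos=2):
    """Greedily emit target words conditioned on the context C."""
    h, y, out = C, bos, []
    for _ in range(max_len):
        h = rnn_step(E[y], h)
        y = int(np.argmax(h @ W_hy))                     # greedy choice for simplicity
        if y == eos:
            break
        out.append(y)
    return out

print(decode(encode([5, 17, 42])))
```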

12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data

Figure 12.6: A modern attention mechanism, as introduced by Bahdanau et al. (2015), is essentially a weighted average. A context vector c is formed by taking a weighted average of feature vectors h^{(t)} with weights α^{(t)}. In some applications, the feature vectors h are hidden units of a neural network, but they may also be raw input to the model. The weights α^{(t)} are produced by the model itself. They are usually values in the interval [0, 1] and are intended to concentrate around just one h^{(t)} so that the weighted average approximates reading that one specific time step precisely. The weights α^{(t)} are usually produced by applying a softmax function to relevance scores emitted by another portion of the model. The attention mechanism is more expensive computationally than directly indexing the desired h^{(t)}, but direct indexing cannot be trained with gradient descent. The attention mechanism based on weighted averages is a smooth, differentiable approximation that can be trained with existing optimization algorithms.
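The weighted average described in figure 12.6 can be written in a few lines. In the sketch below (Python/NumPy), a bilinear score is used as a stand-in for the small MLP that Bahdanau et al. (2015) use to produce relevance scores; the names, shapes, and random data are illustrative only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(H, query, W_score):
    """Weighted-average attention over a sequence of feature vectors.

    H       : (T, d) array, one feature vector h^(t) per time step
    query   : (q,) array, e.g. a decoder state asking "where to look"
    W_score : (q, d) array, a toy bilinear relevance scorer (assumption)
    Returns the context vector c and the weights alpha.
    """
    scores = H @ (W_score.T @ query)      # one relevance score per time step, shape (T,)
    alpha = softmax(scores)               # weights in [0, 1] that sum to 1
    c = alpha @ H                         # weighted average of the h^(t)
    return c, alpha

# Toy usage with random features.
rng = np.random.default_rng(2)
H = rng.normal(size=(6, 64))              # 6 time steps of 64-d features
query = rng.normal(size=32)
W_score = rng.normal(size=(32, 64))
c, alpha = attention_context(H, query, W_score)
print(alpha.round(2), c.shape)
```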

We can think of an attention-based system as having three components:

  1. A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position.
  2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.
  3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).

The third component generates the translated sentence.
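The three components map onto code roughly as follows. The self-contained sketch below (Python/NumPy) uses `read` for the reader that produces one feature vector per source position, the stacked hidden states as the memory, and `exploit` for the decoder that attends over that memory at each step. All parameter shapes, the tanh cells, and the bilinear attention scorer are simplifying assumptions, not a reference implementation of any cited system.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative sizes and random stand-ins for learned parameters.
vocab, d_emb, d_hid = 1000, 32, 64
E = rng.normal(scale=0.1, size=(vocab, d_emb))
W_enc = rng.normal(scale=0.1, size=(d_emb, d_hid))
W_dec = rng.normal(scale=0.1, size=(d_emb + d_hid, d_hid))   # decoder input includes the context
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_att = rng.normal(scale=0.1, size=(d_hid, d_hid))           # toy bilinear relevance scorer
W_out = rng.normal(scale=0.1, size=(d_hid, vocab))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def read(source_ids):
    """Component 1: the reader turns source words into one feature vector per position."""
    h, memory = np.zeros(d_hid), []
    for s in source_ids:
        h = np.tanh(E[s] @ W_enc + h @ W_hh)
        memory.append(h)
    return np.stack(memory)               # Component 2: the memory, shape (T, d_hid)

def exploit(memory, max_len=10, bos=1, eos=2):
    """Component 3: generate target words, softly attending to the memory at each step."""
    h, y, out = np.zeros(d_hid), bos, []
    for _ in range(max_len):
        alpha = softmax(memory @ (W_att @ h))                 # attention weights over the memory
        c = alpha @ memory                                    # soft retrieval of a memory element
        h = np.tanh(np.concatenate([E[y], c]) @ W_dec + h @ W_hh)
        y = int(np.argmax(h @ W_out))                         # greedy choice for simplicity
        if y == eos:
            break
        out.append(y)
    return out

print(exploit(read([5, 17, 42])))
```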

When words in a sentence written in one language are aligned with corresponding words in a translated sentence in another language, it becomes possible to relate the corresponding word embeddings. Earlier work showed that one could learn a kind of translation matrix relating the word embeddings in one language with the word embeddings in another (Kočiský et al., 2014), yielding lower alignment error rates than traditional approaches based on the frequency counts in the phrase table. There is even earlier work on learning cross-lingual word vectors (Klementiev et al., 2012). Many extensions to this approach are possible. For example, more efficient cross-lingual alignment (Gouws et al., 2014) allows training on larger datasets.
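As a minimal sketch of the translation-matrix idea, the code below (Python/NumPy) fits a linear map W from source-language embeddings to target-language embeddings of aligned word pairs by ordinary least squares, then translates a source word by nearest-neighbor search in the target space. The toy random embeddings and the least-squares fit are assumptions for illustration, not the estimation procedure of the cited work.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy embeddings: 200 aligned word pairs, 50-d source and 40-d target vectors.
n_pairs, d_src, d_tgt = 200, 50, 40
X = rng.normal(size=(n_pairs, d_src))                    # source embeddings of aligned words
Z = X @ rng.normal(size=(d_src, d_tgt)) + 0.01 * rng.normal(size=(n_pairs, d_tgt))

# Fit a linear translation matrix W minimizing ||X W - Z||^2 (ordinary least squares).
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(src_vec, target_vocab_embeddings):
    """Map a source embedding into the target space; return the nearest target word index."""
    mapped = src_vec @ W
    sims = target_vocab_embeddings @ mapped / (
        np.linalg.norm(target_vocab_embeddings, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return int(np.argmax(sims))

print(translate(X[0], Z))   # with this toy construction, the nearest neighbor is typically index 0
```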