10.3 Bidirectional RNNs
The recurrent networks considered so far have a causal structure, in which the state at time $t$ captures information only from past and present inputs. However, in many applications we want to output a prediction of $y^{(t)}$ that may depend on the whole input sequence. For example, in speech recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation, and may even depend on the next few words because of the linguistic dependencies between nearby words: if two interpretations of the current word are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate them. The same is true of handwriting recognition and many other sequence-to-sequence learning tasks, described in the next section.
Bidirectional recurrent neural networks (or bidirectional RNNs) were invented to address that need (Schuster and Paliwal, 1997). They have been extremely successful (Graves, 2012) in applications where that need arises, such as handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013), and bioinformatics (Baldi et al., 1999).
As the name suggests, bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. Figure 10.11 illustrates the typical bidirectional RNN, with $h^{(t)}$ standing for the state of the sub-RNN that moves forward through time and $g^{(t)}$ standing for the state of the sub-RNN that moves backward through time. This allows the output units $o^{(t)}$ to compute a representation that depends on both the past and the future but is most sensitive to the input values around time $t$, without having to specify a fixed-size window around $t$ (as one would have to do with a feedforward network, a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).
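To make the computation concrete, the following is a minimal NumPy sketch of a bidirectional RNN forward pass, assuming tanh hidden units and an affine output; the parameter names ($W_f$, $U_f$, $b_f$ for the forward sub-RNN, $W_b$, $U_b$, $b_b$ for the backward sub-RNN, and $V_f$, $V_b$, $c$ for the output layer) are illustrative choices, not notation taken from the text.

```python
# Minimal sketch of a bidirectional RNN forward pass (assumed tanh units,
# affine output); parameter names are illustrative, not prescribed.
import numpy as np

def bidirectional_rnn(x, W_f, U_f, b_f, W_b, U_b, b_b, V_f, V_b, c):
    """x: array of shape (T, input_dim); returns outputs of shape (T, output_dim)."""
    T = x.shape[0]
    hidden_dim = b_f.shape[0]
    h = np.zeros((T, hidden_dim))  # forward states h^{(t)}
    g = np.zeros((T, hidden_dim))  # backward states g^{(t)}

    # Forward sub-RNN: moves from the start of the sequence to the end.
    h_prev = np.zeros(hidden_dim)
    for t in range(T):
        h[t] = np.tanh(W_f @ h_prev + U_f @ x[t] + b_f)
        h_prev = h[t]

    # Backward sub-RNN: moves from the end of the sequence to the start.
    g_next = np.zeros(hidden_dim)
    for t in reversed(range(T)):
        g[t] = np.tanh(W_b @ g_next + U_b @ x[t] + b_b)
        g_next = g[t]

    # Each output o^{(t)} depends on the past (via h) and the future (via g).
    o = np.array([V_f @ h[t] + V_b @ g[t] + c for t in range(T)])
    return o
```

In this sketch the output at time $t$ reads the forward state (summarizing inputs up to $t$) and the backward state (summarizing inputs from $t$ onward), which is what lets the prediction depend on the whole sequence while remaining most sensitive to inputs near $t$.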
This idea can be naturally extended to two-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right. At each point $(i, j)$ of a 2-D grid, an output $O_{i,j}$ could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNN is able to learn to carry that information. Compared to a convolutional network, RNNs applied to images are typically more expensive but allow for long-range lateral interactions between features in the same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the forward propagation equations for such RNNs may be written in a form that shows they use a convolution that computes the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the lateral interactions.
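A simplified sketch of the four-direction idea, again assuming tanh recurrences: each directional sub-RNN sweeps the grid along one axis (down, up, right, or left), and the output at each grid point combines the four directional states. The helper names (directional_scan, four_direction_rnn) and parameter shapes are hypothetical; the variants in the cited work differ in how the sweeps are coupled and stacked, which this sketch omits for brevity.

```python
# Simplified sketch of four directional RNNs over a 2-D grid; names and
# shapes are illustrative assumptions, not a prescribed implementation.
import numpy as np

def directional_scan(x, W, U, b, axis, reverse):
    """Run a tanh recurrence along `axis` of x (shape (H, W_grid, input_dim)),
    vectorized over the other axis; returns states of shape (H, W_grid, k)."""
    H, W_grid, _ = x.shape
    k = b.shape[0]
    s = np.zeros((H, W_grid, k))
    length = x.shape[axis]
    order = reversed(range(length)) if reverse else range(length)
    prev = None
    for idx in order:
        sl = (idx, slice(None)) if axis == 0 else (slice(None), idx)
        inp = x[sl]                                   # one row or one column of inputs
        carry = prev if prev is not None else np.zeros((inp.shape[0], k))
        new = np.tanh(inp @ U.T + carry @ W.T + b)    # recurrence along the chosen axis
        s[sl] = new
        prev = new
    return s

def four_direction_rnn(x, params, V, c):
    """params: list of four (W, U, b) tuples, one per direction."""
    scans = [
        directional_scan(x, *params[0], axis=0, reverse=False),  # downward sweep
        directional_scan(x, *params[1], axis=0, reverse=True),   # upward sweep
        directional_scan(x, *params[2], axis=1, reverse=False),  # rightward sweep
        directional_scan(x, *params[3], axis=1, reverse=True),   # leftward sweep
    ]
    combined = np.concatenate(scans, axis=-1)         # (H, W_grid, 4k)
    return combined @ V.T + c                         # output O_{i,j} at each grid point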