Recurrent Neural Networks: A Beginner’s Guide
What is a Recurrent Neural Network?
A recurrent neural network (RNN) is a type of artificial neural network that is used to process sequential data.
Sequential data is data that has a temporal order, such as text, speech, or music. RNNs are able to learn the temporal dependencies between the elements of a sequence, which allows them to perform tasks such as language modeling, speech recognition, and machine translation.
Recurrent Neural Network Architecture: An Overview
The basic architecture of an RNN consists of a hidden layer that is applied once per time step of the sequence. At each step the hidden layer produces an output and feeds its hidden state back into itself at the next step; unrolled over time, this looks like a chain of copies of the same layer linked by feedback connections. These feedback connections allow the network to remember information from previous time steps, which is essential for processing sequential data.
The output of the hidden layer at the current time step is used to make a prediction for that step. The network is trained by adjusting the weights of these connections so that its predictions become more accurate.
How do Recurrent Neural Networks work?
RNNs work by learning the temporal dependencies between the elements of a sequence. They do this through their recurrent (feedback) connections, which let the network retain a memory of what it has seen so far, and they learn to use this stored information to make predictions at the current time step.
An RNN is trained using a technique called backpropagation through time (BPTT), a variant of standard backpropagation adapted to recurrent networks. BPTT unrolls the network over the time steps of the sequence and propagates the error from the network's output back through those time steps, updating the connection weights as it goes.
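To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The parameter names (`W_xh`, `W_hh`, `b_h`) are illustrative rather than taken from any particular library. Each step combines the current input with the previous hidden state; BPTT propagates the error back through exactly these steps.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 5

# Illustrative parameters for h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the feedback connection)
b_h = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))  # a toy input sequence
h = np.zeros(hidden_size)                       # initial hidden state

for t in range(seq_len):
    # The same weights are reused at every time step (parameter sharing);
    # the previous hidden state h carries information from earlier steps forward.
    h = np.tanh(W_xh @ x_seq[t] + W_hh @ h + b_h)

print(h)  # final hidden state, summarizing the whole sequence
```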
Types of Recurrent Neural Networks
1. One-to-one
The most straightforward type of RNN is one-to-one, which maps a single input to a single output. It has fixed input and output sizes and behaves like a standard neural network. A typical application is Image Classification.
2. One-to-Many
One-to-many is a type of RNN that produces multiple outputs from a single input. The input size is fixed, and the model generates a sequence of outputs. Typical applications include Music Generation and Image Captioning.
3. Many-to-one
Many-to-one RNNs condense a sequence of inputs into a single output, with a series of hidden states learning the features of the whole sequence. Sentiment Analysis is a common example of this type of recurrent neural network (see the code sketch after this list).
4. Many-to-many
Many-to-many is used to generate a sequence of outputs from a sequence of inputs. It is further divided into the following two subcategories:
- Equal size: the input and output sequences have the same length.
- Unequal size: the input and output sequences have different lengths. Machine Translation is a typical application.
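As a rough sketch of these input/output patterns (assuming PyTorch is available; the layer sizes are arbitrary), the same recurrent layer can be read in a many-to-one fashion by using only the final hidden state, or in a many-to-many (equal size) fashion by using the output at every time step.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 5, 4)           # batch of 2 sequences, 5 time steps, 4 features each

out, h_n = rnn(x)                  # out: (2, 5, 8) - one output per time step
                                   # h_n: (1, 2, 8) - final hidden state

# Many-to-one (e.g. Sentiment Analysis): classify using only the final hidden state.
many_to_one = nn.Linear(8, 2)(h_n[-1])   # shape (2, 2): one prediction per sequence

# Many-to-many, equal size (e.g. per-step tagging): predict at every time step.
many_to_many = nn.Linear(8, 3)(out)      # shape (2, 5, 3): one prediction per step
```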
RNN’s Challenges
Training an RNN, like training any neural network, starts by defining a loss function that measures the error (deviation) between the predicted value and the ground truth. In the forward pass, the input features are passed through the hidden layers, each with its own activation function, and the output is predicted; the total loss is then computed, which completes the forward pass. The second part of training is the backward pass, in which the various derivatives are calculated. This becomes considerably more complex in recurrent neural networks processing sequential data, because the model must backpropagate the gradients not only through all the hidden layers but also through time: at each time step, it has to sum up the contributions of all previous time steps up to the current one.
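A minimal sketch of one such training step, assuming PyTorch (whose autograd performs backpropagation through time automatically when `backward()` is called on a loss computed from an RNN's output); the gradient of the hidden-to-hidden weights accumulates contributions from every time step.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

x = torch.randn(2, 10, 4)           # 2 sequences of 10 time steps
target = torch.randn(2, 1)

# Forward pass: run the whole sequence, predict from the final hidden state, compute the loss.
out, h_n = rnn(x)
pred = head(h_n[-1])
loss = nn.functional.mse_loss(pred, target)

# Backward pass: backpropagation through time. The gradient of the
# hidden-to-hidden weights sums contributions from all 10 time steps.
loss.backward()
print(rnn.weight_hh_l0.grad.shape)  # torch.Size([8, 8])

# A single gradient-descent update (illustrative learning rate).
with torch.no_grad():
    for p in list(rnn.parameters()) + list(head.parameters()):
        p -= 0.01 * p.grad
```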
1. Exploding gradients
In some cases the gradients keep growing and become enormous exponentially fast, causing very large weight updates and making gradient descent diverge, so the training process becomes very unstable. This problem is called the exploding gradient.
2. Vanishing gradients
In other cases, as backpropagation advances from the output layer toward the input layer, the gradient goes to zero exponentially fast. This leaves the weights of the initial (lower) layers nearly unchanged and makes it difficult to learn long-range dependencies; as a result, gradient descent never converges to the optimum. This problem is called the vanishing gradient.
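Both problems can be illustrated with a small NumPy experiment. This is a deliberate simplification rather than a full BPTT computation: the backward pass is reduced to repeatedly multiplying a gradient vector by the transposed recurrent weight matrix, chosen here as a scaled orthogonal matrix so that every backward step scales the gradient norm by exactly the chosen factor.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 50

def gradient_norm(scale):
    """Norm of a gradient vector after being pushed back through `steps` time steps."""
    Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
    W_hh = scale * Q                   # simplified recurrent Jacobian
    grad = np.ones(hidden_size)
    for _ in range(steps):
        grad = W_hh.T @ grad           # one backward step through time
    return np.linalg.norm(grad)

print(gradient_norm(1.5))  # ~1.5**50 times larger: exploding gradient
print(gradient_norm(0.5))  # ~0.5**50 times smaller, essentially zero: vanishing gradient
```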
Recurrent Neural Network Architectures
1. Bidirectional recurrent neural networks (BRNN)
A typical RNN relies on past and present events. However, there can be situations where a prediction depends on past, present, and future events.
For example, predicting a word in a sentence might require us to look into the future, i.e., a word in a sentence could depend on a word that comes later. Such linguistic dependencies are common in many text prediction tasks.
Thus, capturing and analyzing both past and future events is helpful.
To enable both forward (past) and reverse (future) traversal of the input, bidirectional RNNs, or BRNNs, are used. A BRNN is a combination of two RNNs: one moves forward from the start of the data sequence, and the other moves backward from the end of the sequence. The outputs of the two RNNs are usually concatenated at each time step, though other options such as summation exist. The individual network blocks in a BRNN can be traditional RNN, GRU, or LSTM cells, depending on the use case.
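A brief sketch of this concatenation, assuming PyTorch's built-in `bidirectional` option on `nn.RNN` (sizes are arbitrary): the output at each time step is the forward-direction hidden state concatenated with the backward-direction one, so it is twice the hidden size.

```python
import torch
import torch.nn as nn

# One forward RNN and one backward RNN, run over the same sequence.
brnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 4)   # 2 sequences, 5 time steps, 4 features
out, h_n = brnn(x)

print(out.shape)  # (2, 5, 16): forward and backward states concatenated at each step
print(h_n.shape)  # (2, 2, 8): final hidden state of each direction
```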
2. Gated Recurrent Units (GRU)
There are scenarios where learning only from the immediately preceding data in a sequence is insufficient. Consider trying to predict a sentence based on another sentence that was introduced a while back in a book or article; in that case, remembering both the immediately preceding data and the much earlier data is crucial. Because of its parameter-sharing mechanism, an RNN uses the same weights at every time step, so during backpropagation the gradient either explodes or vanishes, and the network learns very little from data that is far from the current position.
A GRU uses an update gate and a reset gate. These are two vectors that decide what information should be passed on to the output. What makes them special is that they can be trained to keep long-term information without washing it out over time, and to drop information that is irrelevant to the prediction (a minimal cell sketch follows the list below).
- The update gate determines how much of the previous information is passed along to the next state.
- The reset gate is used by the model to decide how much of the past information to forget.
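Here is a minimal NumPy sketch of a single GRU step. The weight names (`W_*`, `U_*`, `b_*`) are illustrative rather than taken from any library, and references differ on whether the update gate scales the old state or the candidate; the version below keeps the old state with weight `z`, matching the description above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, params):
    """One GRU time step (illustrative weight names)."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z = sigmoid(W_z @ x + U_z @ h + b_z)               # update gate
    r = sigmoid(W_r @ x + U_r @ h + b_r)               # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h) + b_h)   # candidate state (reset applied to the past)
    return z * h + (1.0 - z) * h_tilde                 # blend old state and candidate

# Toy usage with random parameters.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
params = [rng.normal(scale=0.1, size=s) for s in
          [(hidden_size, input_size), (hidden_size, hidden_size), (hidden_size,)] * 3]
h = np.zeros(hidden_size)
for t in range(5):
    h = gru_step(rng.normal(size=input_size), h, params)
print(h)
```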
3. Long Short-Term Memory (LSTM)
An LSTM is another variant of recurrent neural network that is capable of learning long-term dependencies. Unlike a vanilla RNN, whose block contains a single simple layer, an LSTM block performs several additional operations. Using input, output, and forget gates, it remembers the crucial information and forgets the unnecessary information it encounters as it processes the sequence (see the sketch after the list below).
- The input gate decides which values from the input should be used to update the memory.
- The forget gate decides which details should be discarded from the block's memory.
- The output gate uses the input and the block's memory to decide what to output.
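Below is a minimal NumPy sketch of one LSTM step showing how the three gates interact with the block's memory (the cell state); as above, the weight names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, params):
    """One LSTM time step with input (i), forget (f), and output (o) gates."""
    W_i, U_i, b_i, W_f, U_f, b_f, W_o, U_o, b_o, W_c, U_c, b_c = params
    i = sigmoid(W_i @ x + U_i @ h + b_i)        # input gate: what to write to memory
    f = sigmoid(W_f @ x + U_f @ h + b_f)        # forget gate: what to discard from memory
    o = sigmoid(W_o @ x + U_o @ h + b_o)        # output gate: what to expose as output
    c_tilde = np.tanh(W_c @ x + U_c @ h + b_c)  # candidate memory content
    c = f * c + i * c_tilde                     # update the cell memory
    h = o * np.tanh(c)                          # new hidden state / block output
    return h, c

# Toy usage with random parameters.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
params = [rng.normal(scale=0.1, size=s) for s in
          [(hidden_size, input_size), (hidden_size, hidden_size), (hidden_size,)] * 4]
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for t in range(5):
    h, c = lstm_step(rng.normal(size=input_size), h, c, params)
print(h)
```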
The key difference between the GRU and the LSTM is that the GRU has two gates (reset and update) while the LSTM has three (input, output, and forget). The GRU is less complex than the LSTM because it has fewer gates, so the GRU is often preferred when the dataset is small, while the LSTM is preferred for larger datasets.
Conclusion
- Recurrent Neural Networks, or RNNs, are a specialized class of neural networks used to process sequential data.
- Modeling sequential data requires retaining what has been learned from previous elements of the sequence. An RNN learns and remembers this information and uses it, together with the current input, to make its decisions.
- It implements parameter sharing by using the same weights at every time step, and, together with its feedback loops, this allows it to accommodate sequential data of varying lengths.
- One main limitation of RNNs is that the gradient either explodes or vanishes, so the network does not learn much from data that is far away from the current position.
- RNNs have a short-term memory problem. Specialized versions of the RNN, such as the LSTM and GRU, were created to overcome it.
- Another limitation of the RNN is that it processes inputs in strict temporal order, so the current input has context from previous inputs but not from future ones.
- Bidirectional RNN (BRNN) duplicates the RNN processing chain so that inputs are processed in both forward and reverse time order.
- RNNs are widely used in domains and applications such as Machine Translation, Speech Recognition, Generating Image Descriptions, Video Tagging, and Text Summarization.