RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), and GRU (Gated Recurrent Unit) are three important neural network architectures for processing sequential data. They are widely used in NLP tasks, each with unique characteristics and suitable scenarios.
RNN (Recurrent Neural Network)
Basic Principle
- Basic architecture for processing sequential data
- Pass information through hidden states
- Output at each time step depends on current input and previous hidden state
Forward Propagation
- h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
- y_t = W_hy · h_t + b_y
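As a concrete illustration, the two equations map directly onto a minimal recurrent loop. This is a hedged sketch rather than library code: the function name, argument names, and weight shapes are assumptions chosen to match the formulas above.

```python
import torch

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over x_seq of shape (seq_len, input_dim)."""
    h = torch.zeros(W_hh.shape[0])          # initial hidden state h_0
    outputs = []
    for x_t in x_seq:
        # h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
        h = torch.tanh(W_hh @ h + W_xh @ x_t + b_h)
        # y_t = W_hy · h_t + b_y
        outputs.append(W_hy @ h + b_y)
    return torch.stack(outputs), h
```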
Advantages
- Simple structure, easy to understand
- Relatively few parameters
- Suitable for variable-length sequences
- Theoretically can capture dependencies of arbitrary length
Disadvantages
- Gradient vanishing: Gradients shrink exponentially over long sequences, so early time steps receive almost no learning signal (illustrated in the sketch after this list)
- Gradient exploding: Gradients can grow uncontrollably during backpropagation through time
- Cannot effectively capture long-range dependencies
- Difficult to train, slow convergence
- Cannot be parallelized
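The vanishing-gradient point can be made concrete with a small experiment: backpropagate through nn.RNN and inspect the gradient that reaches the first time step as the sequence gets longer. The dimensions below are arbitrary illustrative choices, not values from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

for seq_len in [5, 20, 50, 100]:
    x = torch.randn(1, seq_len, 16, requires_grad=True)
    _, h_n = rnn(x)
    h_n.sum().backward()
    # The gradient flowing back to the very first input typically shrinks
    # rapidly as seq_len grows -- the vanishing gradient effect.
    print(seq_len, x.grad[0, 0].norm().item())
```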
Application Scenarios
- Short text classification
- Simple sequence labeling
- Time series prediction
LSTM (Long Short-Term Memory)
Basic Principle
- Mitigates the vanishing gradient problem of vanilla RNNs
- Introduces gating mechanisms to control information flow
- Can remember important information for long periods
Core Components
1. Forget Gate
- Decides what information to discard
- Formula: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
2. Input Gate
- Decides what new information to store
- Formula: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
3. Candidate Memory Cell
- Generates candidate values
- Formula: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
4. Memory Cell Update
- Updates cell state
- Formula: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
5. Output Gate
- Decides what information to output
- Formula: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
- h_t = o_t ⊙ tanh(C_t)
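The gate equations above translate almost line-for-line into code. The single-step sketch below is for illustration only (the function name, argument names, and weight shapes are assumptions); in practice nn.LSTM or nn.LSTMCell would be used.

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step; each W_* has shape (hidden_dim, hidden_dim + input_dim)."""
    hx = torch.cat([h_prev, x_t])            # concatenation [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ hx + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ hx + b_i)      # input gate
    c_tilde = torch.tanh(W_C @ hx + b_C)     # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde       # memory cell update
    o_t = torch.sigmoid(W_o @ hx + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)              # new hidden state
    return h_t, c_t
```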
Advantages
- Largely mitigates the vanishing gradient problem
- Can capture long-range dependencies
- Flexible information flow control through gating
- Excellent performance on long sequence tasks
Disadvantages
- Large number of parameters (4x RNN)
- High computational complexity
- Long training time
- Still cannot be parallelized
Application Scenarios
- Machine translation
- Text summarization
- Long text classification
- Speech recognition
GRU (Gated Recurrent Unit)
Basic Principle
- Simplified version of LSTM
- Reduces number of gates
- Maintains long-range dependency capability
Core Components
1. Reset Gate
- Controls influence of previous hidden state
- Formula: r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
2. Update Gate
- Controls information update
- Formula: z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
3. Candidate Hidden State
- Generates candidate values
- Formula: h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)
4. Hidden State Update
- Updates hidden state
- Formula: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
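For comparison with the LSTM sketch above, here is the corresponding single-step GRU sketch (again with assumed names and shapes; nn.GRU or nn.GRUCell would be used in practice).

```python
import torch

def gru_cell(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step; each W_* has shape (hidden_dim, hidden_dim + input_dim)."""
    hx = torch.cat([h_prev, x_t])              # [h_{t-1}, x_t]
    r_t = torch.sigmoid(W_r @ hx + b_r)        # reset gate
    z_t = torch.sigmoid(W_z @ hx + b_z)        # update gate
    h_tilde = torch.tanh(W_h @ torch.cat([r_t * h_prev, x_t]) + b_h)  # candidate
    h_t = (1 - z_t) * h_prev + z_t * h_tilde   # hidden state update
    return h_t
```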
Advantages
- About 25% fewer parameters than LSTM (three weight/bias sets per layer instead of four)
- Higher computational efficiency
- Faster training speed
- Performance comparable to LSTM on some tasks
Disadvantages
- Slightly lower expressiveness than LSTM
- May not perform as well as LSTM on very long sequences
- Has received less theoretical analysis than LSTM
Application Scenarios
- Real-time applications
- Resource-constrained environments
- Medium-length sequence tasks
Comparison of the Three
Parameter Count
- RNN: Minimum
- GRU: Medium (about 3x RNN)
- LSTM: Maximum (about 4x RNN)
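These ratios can be sanity-checked directly in PyTorch by counting the parameters of single-layer modules with identical sizes; the sizes below are arbitrary illustrative choices.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

rnn  = nn.RNN(input_size=128, hidden_size=256)
gru  = nn.GRU(input_size=128, hidden_size=256)
lstm = nn.LSTM(input_size=128, hidden_size=256)

# GRU holds 3 sets of gate weights and LSTM 4, so the counts come out
# roughly 3x and 4x those of the vanilla RNN.
print(n_params(rnn), n_params(gru), n_params(lstm))
```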
Computational Complexity
- RNN: One input-to-hidden and one hidden-to-hidden matrix multiplication per time step (lowest cost)
- GRU: Roughly 3x the per-step cost of a vanilla RNN
- LSTM: Roughly 4x the per-step cost of a vanilla RNN
Long-range Dependencies
- RNN: Poor (gradient vanishing)
- GRU: Good
- LSTM: Best
Training Speed
- RNN: Fast (but may not converge)
- GRU: Fast
- LSTM: Slow
Parallelization Capability
- None can be parallelized across time steps (each step depends on the previous hidden state)
- This is the main difference from Transformer
Selection Recommendations
Choose RNN When
- Sequence is very short (< 10 time steps)
- Extremely limited computational resources
- Need rapid prototyping
- Simple task without long-range dependencies
Choose LSTM When
- Sequence is very long (> 100 time steps)
- Need to precisely capture long-range dependencies
- Sufficient computational resources
- Complex tasks like machine translation
Choose GRU When
- Medium-length sequence (10-100 time steps)
- Need to balance performance and efficiency
- Limited computational resources
- Real-time applications
Practical Tips
1. Initialization
- Use appropriate initialization methods
- Xavier/Glorot initialization
- He initialization
2. Regularization
- Dropout (applied between stacked layers; variational/recurrent dropout for recurrent connections)
- Gradient clipping to prevent gradient explosion (see the training sketch after this list)
- L2 regularization
3. Optimization
- Use Adam or RMSprop optimizers
- Learning rate scheduling
- Gradient clipping threshold
4. Architecture Design
- Bidirectional RNN/LSTM/GRU
- Multi-layer stacking
- Combine with attention mechanism
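A hedged sketch of how several of these tips fit together in PyTorch; the model, dimensions, and hyperparameters are placeholders rather than recommendations from the text.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
                dropout=0.2,          # dropout between the stacked layers
                bidirectional=True)   # bidirectional variant

# Xavier/Glorot initialization for the (2-D) weight matrices
for name, param in model.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

def train_step(x, target, loss_fn):
    optimizer.zero_grad()
    output, _ = model(x)
    loss = loss_fn(output, target)
    loss.backward()
    # Gradient clipping guards against exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# scheduler.step() would typically be called once per epoch.
```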
Comparison with Transformer
Transformer Advantages
- Fully parallelizable
- Better at capturing long-range dependencies
- Stronger expressiveness
- Easier to scale
RNN Series Advantages
- Higher parameter efficiency
- More friendly to small datasets
- Smaller memory footprint during inference
- More suitable for streaming processing
Selection Recommendations
- Large dataset + large compute: Transformer
- Small dataset + limited resources: RNN series
- Real-time streaming: RNN series
- Offline batch processing: Transformer
Latest Developments
1. Improved RNN Architectures
- SRU (Simple Recurrent Unit)
- QRNN (Quasi-Recurrent Neural Network)
- IndRNN (Independently Recurrent Neural Network)
2. Hybrid Architectures
- RNN + Attention
- RNN + Transformer
- Hierarchical RNN
3. Efficient Variants
- LightRNN
- Skim-RNN
- Dynamic computation RNN
Code Examples
LSTM Implementation (PyTorch)
```python
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        x = self.embedding(x)              # (batch, seq_len, embed_dim)
        output, (h_n, c_n) = self.lstm(x)  # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])            # classify from the last layer's final hidden state
```
GRU Implementation (PyTorch)
```python
class GRUModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        x = self.embedding(x)      # (batch, seq_len, embed_dim)
        output, h_n = self.gru(x)  # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])    # classify from the last layer's final hidden state
```
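A minimal usage sketch for either model; the vocabulary size, dimensions, and batch shape below are illustrative assumptions.

```python
import torch

model = LSTMModel(vocab_size=10000, embed_dim=128, hidden_dim=256, num_layers=2)

# Batch of 4 sequences, each 20 token ids long
x = torch.randint(0, 10000, (4, 20))
logits = model(x)        # shape: (4, 2) -- two-class scores
print(logits.shape)
```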
Summary
- RNN: Basic architecture, suitable for short sequences
- LSTM: Powerful but complex, suitable for long sequences
- GRU: Simplified LSTM, balances performance and efficiency
- Transformer: Modern standard, suitable for large-scale tasks
The choice of architecture depends on task requirements, data scale, and computational resources. In practice, it's recommended to start with simple models and gradually try more complex architectures.