RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), and GRU (Gated Recurrent Unit) are three important neural network architectures for processing sequential data. They are widely used in NLP tasks, each with unique characteristics and suitable scenarios.
RNN (Recurrent Neural Network)
Basic Principle
- Basic architecture for processing sequential data
- Pass information through hidden states
- Output at each time step depends on current input and previous hidden state
Forward Propagation
- h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
- y_t = W_hy · h_t + b_y
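As a concrete illustration, the two equations map directly onto a minimal recurrent loop. This is a hedged sketch rather than library code: the function name, argument names, and weight shapes are assumptions chosen to match the formulas above.

```python
import torch

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over x_seq of shape (seq_len, input_dim)."""
    h = torch.zeros(W_hh.shape[0])          # initial hidden state h_0
    outputs = []
    for x_t in x_seq:
        # h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
        h = torch.tanh(W_hh @ h + W_xh @ x_t + b_h)
        # y_t = W_hy · h_t + b_y
        outputs.append(W_hy @ h + b_y)
    return torch.stack(outputs), h
```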
Advantages
- Simple structure, easy to understand
- Relatively few parameters
- Suitable for variable-length sequences
- Theoretically can capture dependencies of arbitrary length
Disadvantages
- Gradient vanishing: Gradients shrink exponentially over long sequences, so early time steps receive almost no learning signal (illustrated in the sketch after this list)
- Gradient exploding: Gradients can grow uncontrollably during backpropagation through time
- Cannot effectively capture long-range dependencies
- Difficult to train, slow convergence
- Cannot be parallelized
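The vanishing-gradient point can be made concrete with a small experiment: backpropagate through nn.RNN and inspect the gradient that reaches the first time step as the sequence gets longer. The dimensions below are arbitrary illustrative choices, not values from the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)

for seq_len in [5, 20, 50, 100]:
    x = torch.randn(1, seq_len, 16, requires_grad=True)
    _, h_n = rnn(x)
    h_n.sum().backward()
    # The gradient flowing back to the very first input typically shrinks
    # rapidly as seq_len grows -- the vanishing gradient effect.
    print(seq_len, x.grad[0, 0].norm().item())
```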
Application Scenarios
- Short text classification
- Simple sequence labeling
- Time series prediction
LSTM (Long Short-Term Memory)
Basic Principle
- Mitigates the vanishing gradient problem of vanilla RNNs
- Introduces gating mechanisms to control information flow
- Can remember important information for long periods
Core Components
1. Forget Gate
- Decides what information to discard
- Formula: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
2. Input Gate
- Decides what new information to store
- Formula: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
3. Candidate Memory Cell
- Generates candidate values
- Formula: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
4. Memory Cell Update
- Updates cell state
- Formula: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
5. Output Gate
- Decides what information to output
- Formula: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
- h_t = o_t ⊙ tanh(C_t)
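The gate equations above translate almost line-for-line into code. The single-step sketch below is for illustration only (the function name, argument names, and weight shapes are assumptions); in practice nn.LSTM or nn.LSTMCell would be used.

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step; each W_* has shape (hidden_dim, hidden_dim + input_dim)."""
    hx = torch.cat([h_prev, x_t])            # concatenation [h_{t-1}, x_t]
    f_t = torch.sigmoid(W_f @ hx + b_f)      # forget gate
    i_t = torch.sigmoid(W_i @ hx + b_i)      # input gate
    c_tilde = torch.tanh(W_C @ hx + b_C)     # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde       # memory cell update
    o_t = torch.sigmoid(W_o @ hx + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)              # new hidden state
    return h_t, c_t
```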
Advantages
- Largely mitigates the vanishing gradient problem
- Can capture long-range dependencies
- Flexible information flow control through gating
- Excellent performance on long sequence tasks
Disadvantages
- Large number of parameters (4x RNN)
- High computational complexity
- Long training time
- Still cannot be parallelized
Application Scenarios
- Machine translation
- Text summarization
- Long text classification
- Speech recognition
GRU (Gated Recurrent Unit)
Basic Principle
- Simplified version of LSTM
- Reduces number of gates
- Maintains long-range dependency capability
Core Components
1. Reset Gate
- Controls influence of previous hidden state
- Formula: r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
2. Update Gate
- Controls information update
- Formula: z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
3. Candidate Hidden State
- Generates candidate values
- Formula: h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)
4. Hidden State Update
- Updates hidden state
- Formula: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
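For comparison with the LSTM sketch above, here is the corresponding single-step GRU sketch (again with assumed names and shapes; nn.GRU or nn.GRUCell would be used in practice).

```python
import torch

def gru_cell(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU step; each W_* has shape (hidden_dim, hidden_dim + input_dim)."""
    hx = torch.cat([h_prev, x_t])              # [h_{t-1}, x_t]
    r_t = torch.sigmoid(W_r @ hx + b_r)        # reset gate
    z_t = torch.sigmoid(W_z @ hx + b_z)        # update gate
    h_tilde = torch.tanh(W_h @ torch.cat([r_t * h_prev, x_t]) + b_h)  # candidate
    h_t = (1 - z_t) * h_prev + z_t * h_tilde   # hidden state update
    return h_t
```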
Advantages
- About 25% fewer parameters than LSTM (three weight/bias sets per layer instead of four)
- Higher computational efficiency
- Faster training speed
- Performance comparable to LSTM on some tasks
Disadvantages
- Slightly lower expressiveness than LSTM
- May not perform as well as LSTM on very long sequences
- Has received less theoretical analysis than LSTM
Application Scenarios
- Real-time applications
- Resource-constrained environments
- Medium-length sequence tasks
Comparison of the Three
Parameter Count
- RNN: Minimum
- GRU: Medium (about 3x RNN)
- LSTM: Maximum (about 4x RNN)
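These ratios can be sanity-checked directly in PyTorch by counting the parameters of single-layer modules with identical sizes; the sizes below are arbitrary illustrative choices.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

rnn  = nn.RNN(input_size=128, hidden_size=256)
gru  = nn.GRU(input_size=128, hidden_size=256)
lstm = nn.LSTM(input_size=128, hidden_size=256)

# GRU holds 3 sets of gate weights and LSTM 4, so the counts come out
# roughly 3x and 4x those of the vanilla RNN.
print(n_params(rnn), n_params(gru), n_params(lstm))
```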
Computational Complexity
- RNN: One input-to-hidden and one hidden-to-hidden matrix multiplication per time step (lowest cost)
- GRU: Roughly 3x the per-step cost of a vanilla RNN
- LSTM: Roughly 4x the per-step cost of a vanilla RNN
Long-range Dependencies
- RNN: Poor (gradient vanishing)
- GRU: Good
- LSTM: Best
Training Speed
- RNN: Fast (but may not converge)
- GRU: Fast
- LSTM: Slow
Parallelization Capability
- None can be parallelized across time steps (each step depends on the previous hidden state)
- This is the main difference from Transformer
Selection Recommendations
Choose RNN When
- Sequence is very short (< 10 time steps)
- Extremely limited computational resources
- Need rapid prototyping
- Simple task without long-range dependencies
Choose LSTM When
- Sequence is very long (> 100 time steps)
- Need to precisely capture long-range dependencies
- Sufficient computational resources
- Complex tasks like machine translation
Choose GRU When
- Medium-length sequence (10-100 time steps)
- Need to balance performance and efficiency
- Limited computational resources
- Real-time applications
Practical Tips
1. Initialization
- Use appropriate initialization methods
- Xavier/Glorot initialization
- He initialization
2. Regularization
- Dropout (applied between stacked layers; variational/recurrent dropout for recurrent connections)
- Gradient clipping to prevent gradient explosion (see the training sketch after this list)
- L2 regularization
3. Optimization
- Use Adam or RMSprop optimizers
- Learning rate scheduling
- Gradient clipping threshold
4. Architecture Design
- Bidirectional RNN/LSTM/GRU
- Multi-layer stacking
- Combine with attention mechanism
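A hedged sketch of how several of these tips fit together in PyTorch; the model, dimensions, and hyperparameters are placeholders rather than recommendations from the text.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
                dropout=0.2,          # dropout between the stacked layers
                bidirectional=True)   # bidirectional variant

# Xavier/Glorot initialization for the (2-D) weight matrices
for name, param in model.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

def train_step(x, target, loss_fn):
    optimizer.zero_grad()
    output, _ = model(x)
    loss = loss_fn(output, target)
    loss.backward()
    # Gradient clipping guards against exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

# scheduler.step() would typically be called once per epoch.
```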
Comparison with Transformer
Transformer Advantages
- Fully parallelizable
- Better at capturing long-range dependencies
- Stronger expressiveness
- Easier to scale
RNN Series Advantages
- Higher parameter efficiency
- More friendly to small datasets
- Smaller memory footprint during inference
- More suitable for streaming processing
Selection Recommendations
- Large dataset + large compute: Transformer
- Small dataset + limited resources: RNN series
- Real-time streaming: RNN series
- Offline batch processing: Transformer
Latest Developments
1. Improved RNN Architectures
- SRU (Simple Recurrent Unit)
- QRNN (Quasi-Recurrent Neural Network)
- IndRNN (Independently Recurrent Neural Network)
2. Hybrid Architectures
- RNN + Attention
- RNN + Transformer
- Hierarchical RNN
3. Efficient Variants
- LightRNN
- Skim-RNN
- Dynamic computation RNN
Code Examples
LSTM Implementation (PyTorch)
```python
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        x = self.embedding(x)              # (batch, seq_len, embed_dim)
        output, (h_n, c_n) = self.lstm(x)  # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])            # classify from the last layer's final hidden state
```
GRU Implementation (PyTorch)
```python
class GRUModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        x = self.embedding(x)      # (batch, seq_len, embed_dim)
        output, h_n = self.gru(x)  # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])    # classify from the last layer's final hidden state
```
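A minimal usage sketch for either model; the vocabulary size, dimensions, and batch shape below are illustrative assumptions.

```python
import torch

model = LSTMModel(vocab_size=10000, embed_dim=128, hidden_dim=256, num_layers=2)

# Batch of 4 sequences, each 20 token ids long
x = torch.randint(0, 10000, (4, 20))
logits = model(x)        # shape: (4, 2) -- two-class scores
print(logits.shape)
```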
Summary
- RNN: Basic architecture, suitable for short sequences
- LSTM: Powerful but complex, suitable for long sequences
- GRU: Simplified LSTM, balances performance and efficiency
- Transformer: Modern standard, suitable for large-scale tasks
The choice of architecture depends on task requirements, data scale, and computational resources. In practice, it's recommended to start with simple models and gradually try more complex architectures.