The Transformer is a deep learning architecture built on the self-attention mechanism, proposed by Google in the 2017 paper "Attention Is All You Need", and it has revolutionized the NLP field.
Core Components
1. Self-Attention Mechanism
Self-attention allows the model to attend to all other words in the input sequence when processing each word, thereby capturing long-range dependencies.
Computation Steps:
- Generate Query (Q), Key (K), Value (V) vectors
- Calculate attention scores: Attention(Q, K, V) = softmax(QK^T / √d_k)V
- Scaling by √d_k (where d_k is the key dimension) keeps the dot products from growing too large and saturating the softmax, which would otherwise cause vanishing gradients (a minimal code sketch follows this list)
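To make these steps concrete, here is a minimal PyTorch sketch of scaled dot-product attention; the toy tensor sizes are assumptions chosen only for illustration.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5         # pairwise similarity scores
    if mask is not None:                                   # optional boolean mask
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                    # one distribution per query position
    return weights @ V, weights

# Toy example: one sequence of 3 tokens with d_k = d_v = 4
Q = K = V = torch.randn(3, 4)
out, w = attention(Q, K, V)
print(out.shape, w.shape)  # torch.Size([3, 4]) torch.Size([3, 3])
```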
Multi-Head Attention:
- Split Q, K, V into multiple heads
- Each head learns different attention patterns independently
- Concatenate the outputs of all heads and apply a final linear transformation (see the sketch below)
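A minimal multi-head attention sketch follows. It uses PyTorch's built-in F.scaled_dot_product_attention (PyTorch 2.0+) for the per-head computation, and the default sizes (d_model=512, 8 heads) are simply the "base" settings from the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention: project, split into heads, attend, concatenate."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q, self.w_k, self.w_v, self.w_o = (nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, q, k, v, attn_mask=None):
        B = q.size(0)
        # Project, then reshape to (batch, heads, seq_len, d_k) so each head attends independently
        def split(x, w):
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)  # softmax(QK^T/sqrt(d_k))V per head
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(out)  # concatenate the heads and apply the final linear projection

# Toy usage: batch of 2 sequences, 10 tokens each, d_model = 512
mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])
```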
2. Positional Encoding
Since the Transformer has no recurrent structure, it cannot capture sequence order by itself, so positional information must be injected explicitly.
Methods:
- Use sine and cosine functions to generate positional encodings
- Add the positional encoding to the word embeddings to form the model input
- The sinusoidal form makes it easier for the model to learn relative position relationships (a code sketch follows this list)
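A sketch of the sinusoidal encoding described above; max_len and d_model are illustrative values, and token_embeddings in the usage comment is a hypothetical name.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    position = torch.arange(max_len).unsqueeze(1)                                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# The encoding is added to the word embeddings before the first layer, e.g.:
# x = token_embeddings + pe[: token_embeddings.size(1)]   # token_embeddings is hypothetical
```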
3. Encoder-Decoder Structure
Encoder:
- A stack of multiple identical layers (6 in the original paper)
- Each layer contains multi-head self-attention and feed-forward neural network
- Uses residual connections and layer normalization around each sub-layer (see the sketch after this list)
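The sketch below assembles one encoder layer from PyTorch's nn.MultiheadAttention, with the post-layer-norm placement used in the original paper; the toy input at the end is only a shape check.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and a position-wise feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)           # self-attention: Q = K = V = x
        x = self.norm1(x + self.drop(attn_out))         # residual connection + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))      # same pattern around the FFN
        return x

# The encoder is a stack of identical layers (6 in the original paper)
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
x = torch.randn(2, 10, 512)                             # (batch, seq_len, d_model)
print(encoder(x).shape)  # torch.Size([2, 10, 512])
```

Many modern implementations instead place the layer norm before each sub-layer (pre-norm), which tends to train more stably at greater depth.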
Decoder:
- Contains an encoder-decoder (cross-)attention layer that attends over the encoder's output
- Masked self-attention prevents each position from attending to future positions (see the causal-mask sketch after this list)
- Generates the output sequence autoregressively, one token at a time
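For illustration, the kind of causal (look-ahead) mask used by the decoder's masked self-attention: positions marked False are set to -inf before the softmax, so a token cannot attend to later tokens.

```python
import torch

def causal_mask(seq_len):
    """Boolean lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```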
Key Advantages
1. Parallel Computation
- Unlike RNNs, which must process tokens one after another, the Transformer processes all positions of a sequence in parallel
- Significantly speeds up training on modern accelerators (autoregressive decoding at inference time remains sequential)
2. Long-range Dependencies
- Self-attention directly connects any two positions
- The path between any two positions has constant length, so dependencies do not weaken with distance (attention cost does, however, grow quadratically with sequence length)
3. Interpretability
- Visualizing attention weights shows which tokens the model focuses on
- Facilitates understanding and debugging
Application Scenarios
- Machine translation
- Text summarization
- Question answering systems
- Language model pre-training (BERT, GPT)
- Computer vision (Vision Transformer, ViT)
Variants and Improvements
BERT (Bidirectional Encoder Representations from Transformers)
- Uses only encoder part
- Bidirectional context understanding
- Masked language model pre-training
GPT (Generative Pre-trained Transformer)
- Uses only decoder part
- Autoregressive generation
- Large-scale pre-training
T5 (Text-to-Text Transfer Transformer)
- Unifies all NLP tasks as text-to-text format
- Encoder-decoder architecture
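For orientation, a short sketch that loads an encoder-only (BERT-style) and a decoder-only (GPT-style) model with the Hugging Face transformers library; it assumes the library is installed, and bert-base-uncased and gpt2 are simply well-known public checkpoints.

```python
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

# Encoder-only (BERT): predicts masked tokens using bidirectional context
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Decoder-only (GPT): predicts the next token autoregressively
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = gpt_tok("The Transformer architecture", return_tensors="pt")
out_ids = gpt.generate(**inputs, max_new_tokens=20)
print(gpt_tok.decode(out_ids[0]))
```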
Practical Points
Hyperparameter Tuning
- Number of attention heads: typically 8-16
- Hidden layer dimension: 512-2048
- Feed-forward network dimension: 4x hidden layer dimension
- Number of layers: 6-24 (an example configuration follows this list)
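As a reference point, a small configuration object with values in the ranges listed above; the defaults correspond to the "base" model of the original paper and are assumptions rather than universally best settings.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Illustrative values within the ranges above; defaults match the 'base' model."""
    num_layers: int = 6      # 6-24 layers
    d_model: int = 512       # hidden dimension, 512-2048
    num_heads: int = 8       # 8-16 attention heads
    d_ff: int = 2048         # feed-forward dimension, roughly 4x d_model
    dropout: float = 0.1

config = TransformerConfig()
print(config)
```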
Optimization Techniques
- Learning rate warm-up
- Label smoothing
- Dropout regularization
- Gradient clipping (a training-loop sketch combining these techniques follows this list)
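A sketch that wires these techniques together in PyTorch: the warm-up uses the inverse-square-root schedule from the original paper, label smoothing comes from CrossEntropyLoss's label_smoothing argument, and the placeholder model, warm-up length, and clipping norm are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                      # placeholder model, just for the sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

d_model, warmup_steps = 512, 4000
def lr_lambda(step):
    step = max(step, 1)
    # Linear warm-up followed by inverse-square-root decay (schedule from the original paper)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing

# Inside the training loop (data omitted here):
# loss = criterion(logits, targets)
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```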
Computation Optimization
- Mixed precision training (sketched, together with gradient accumulation, after this list)
- Gradient accumulation
- Model parallelization
- FlashAttention (a fused, memory-efficient attention implementation)
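A sketch of mixed precision training combined with gradient accumulation via torch.cuda.amp; the toy model, data, and accumulation factor are assumptions, and a CUDA device is assumed to be available.

```python
import torch
import torch.nn as nn

# Toy setup purely for illustration (a CUDA device is assumed)
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(32, 128).cuda(), torch.randint(0, 10, (32,)).cuda()) for _ in range(8)]

scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                   # gradient accumulation factor

for step, (x, y) in enumerate(loader):
    with torch.cuda.amp.autocast():               # forward pass runs in mixed precision
        loss = criterion(model(x), y) / accum_steps
    scaler.scale(loss).backward()                 # scale the loss to avoid fp16 gradient underflow
    if (step + 1) % accum_steps == 0:             # update weights every accum_steps mini-batches
        scaler.step(optimizer)                    # unscales gradients, then optimizer step
        scaler.update()
        optimizer.zero_grad()
```

Newer PyTorch releases expose the same functionality under torch.amp as well.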