Word vectors (word embeddings) map words into continuous vector spaces, enabling computers to process the semantic content of words numerically. They capture semantic and syntactic relationships between words and are a foundation of modern NLP.
Basic Concepts of Word Vectors
Definition
- Represent discrete words as continuous real-valued vectors
- Typically 50-1000 dimensions
- Semantically similar words are closer in vector space
Advantages
- Capture semantic similarity between words
- Dense, low-dimensional representations improve computational efficiency
- Support vector arithmetic (e.g., analogy reasoning)
- Avoid the sparsity problems of one-hot representations
Traditional Word Vector Methods
1. One-Hot Encoding
Principle
- Each word represented by a sparse vector
- Only one position is 1, others are 0
- Vector dimension equals vocabulary size
Disadvantages
- Curse of dimensionality: vector length equals vocabulary size and can reach hundreds of thousands
- Sparsity: all but one element is 0
- Cannot capture semantic relationships between words
- Cannot measure word similarity: any two distinct words are orthogonal, so their similarity is always zero
Example
```
Vocabulary: [I, like, NLP]
I:    [1, 0, 0]
like: [0, 1, 0]
NLP:  [0, 0, 1]
```
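A minimal Python sketch of the same toy vocabulary; nothing here comes from a specific library, the mapping is built by hand:

```python
# Toy one-hot encoding for the three-word vocabulary above.
vocabulary = ["I", "like", "NLP"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list with a 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("like"))  # [0, 1, 0]
```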
2. TF-IDF
Principle
- TF (Term Frequency): how often a word appears in a document
- IDF (Inverse Document Frequency): logarithm of the total number of documents divided by the number of documents containing the word; downweights words that appear everywhere
- TF-IDF = TF × IDF (see the computation sketch at the end of this section)
Advantages
- Considers word importance
- Suitable for information retrieval
Disadvantages
- Still sparse vectors
- Cannot capture semantics
- Ignores word order information
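To make the TF-IDF computation above concrete, here is a minimal sketch using only the Python standard library; the two toy documents are invented for illustration:

```python
import math

# Two toy documents (invented for illustration).
docs = [["I", "like", "NLP"], ["I", "like", "machine", "learning"]]

def tf_idf(word, doc, docs):
    """TF-IDF = term frequency in the document x inverse document frequency."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)   # documents containing the word
    idf = math.log(len(docs) / df)           # rarer words get a higher weight
    return tf * idf

print(tf_idf("NLP", docs[0], docs))   # > 0: "NLP" appears in only one document
print(tf_idf("like", docs[0], docs))  # 0.0: "like" appears in every document
```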
Modern Word Vector Methods
1. Word2Vec
Proposed by researchers at Google (Mikolov et al.) in 2013, Word2Vec includes two architectures:
CBOW (Continuous Bag-of-Words)
- Predict center word from context
- Fast, suitable for common words
- Averages the vectors of the context-window words to form the input
Skip-gram
- Predict context from center word
- Better performance on rare words
- Higher computational cost
Training Techniques
- Negative sampling: approximates the full softmax to speed up training
- Hierarchical softmax: reduces the cost of the output layer
- Subsampling: randomly discards very frequent words to balance word frequency
Example
```
king - man + woman ≈ queen
```
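The analogy above can be reproduced with a trained model; a minimal sketch assuming the gensim library (4.x). The tiny corpus is only there so the code runs; the analogy itself only emerges after training on large-scale text such as Wikipedia:

```python
from gensim.models import Word2Vec

# Tiny toy corpus just so the code runs end to end.
corpus = [["the", "king", "is", "a", "man"],
          ["the", "queen", "is", "a", "woman"]]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    negative=5,        # negative samples per positive example
)

# Vector arithmetic: king - man + woman ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```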
2. GloVe (Global Vectors for Word Representation)
Proposed by Stanford in 2014
Principle
- Combines global matrix factorization with local context-window methods
- Built on a word-word co-occurrence matrix
- Minimizes the difference between the dot product of word vectors and the logarithm of their co-occurrence count
Advantages
- Utilizes global statistical information
- Excellent performance on similarity tasks
- Fast training
Formula
```
J = Σ_{i,j} f(X_ij) · (w_i · w̃_j + b_i + b̃_j - log X_ij)²
```
where f(X_ij) is a weighting function that caps the influence of very frequent co-occurrences.
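A small numpy sketch of this weighted least-squares objective; the co-occurrence counts and initialization below are random placeholders, while x_max=100 and alpha=0.75 follow the original GloVe paper:

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """Weighted least-squares GloVe objective over a co-occurrence matrix X."""
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)   # weighting function f(X_ij)
    log_X = np.log(np.where(X > 0, X, 1.0))              # log counts (masked below)
    residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - log_X
    # Only pairs that actually co-occur (X_ij > 0) contribute to the loss.
    return np.sum(np.where(X > 0, f * residual ** 2, 0.0))

# Toy example: 4-word vocabulary, 10-dimensional vectors, random initialization.
rng = np.random.default_rng(0)
V, d = 4, 10
X = rng.integers(0, 5, size=(V, V)).astype(float)        # fake co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```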
3. FastText
Proposed by Facebook in 2016
Core Innovation
- Represents each word as a bag of character-level n-grams (subwords)
- A word's vector is the sum of its subword vectors
- Can build vectors for out-of-vocabulary (OOV) words from their subwords
Advantages
- Handle languages with rich morphology
- Robust to spelling errors
- Support multiple languages
Example
```
Subwords of "apple" (character 3-grams): <ap, app, ppl, ple, le>
```
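The decomposition above can be reproduced with a few lines of Python; this sketch extracts only 3-grams for brevity, while FastText itself uses several n-gram lengths:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
# FastText represents "apple" as the sum of these subword vectors,
# so even unseen (OOV) words can be composed from their n-grams.
```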
Context-Aware Word Vectors
1. ELMo (Embeddings from Language Models)
Features
- Bidirectional LSTM
- Dynamically generate word vectors based on context
- Same word has different representations in different contexts
Advantages
- Solve polysemy problem
- Capture complex semantics
Disadvantages
- High computational cost
- Sequential LSTM computation limits parallel training (unlike Transformers)
2. BERT (Bidirectional Encoder Representations from Transformers)
Features
- Based on Transformer
- Deep bidirectional context
- Pre-training + fine-tuning paradigm
Advantages
- Powerful contextual understanding
- Suitable for a wide range of NLP tasks
- Transfer learning: pre-trained representations can be fine-tuned for downstream tasks
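A minimal sketch of extracting contextual word vectors with a pre-trained BERT model, assuming PyTorch and the Hugging Face transformers library are installed; the example sentences are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word ("bank") gets different vectors in different contexts.
sentences = ["I deposited money at the bank.",
             "We sat on the bank of the river."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, sequence_length, hidden_size) contextual vectors.
print(outputs.last_hidden_state.shape)
```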
3. GPT (Generative Pre-trained Transformer)
Features
- Unidirectional (left-to-right)
- Autoregressive generation
- Large-scale pre-training
Advantages
- Powerful generation capability
- Few-shot learning
Applications of Word Vectors
1. Semantic Similarity Calculation
- Cosine similarity
- Euclidean distance
- Manhattan distance
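A quick numpy sketch of the three measures listed above, applied to two arbitrary example vectors:

```python
import numpy as np

v1 = np.array([0.2, 0.7, 0.1])   # example word vectors (values are arbitrary)
v2 = np.array([0.3, 0.6, 0.2])

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
euclidean = np.linalg.norm(v1 - v2)
manhattan = np.sum(np.abs(v1 - v2))

print(f"cosine={cosine:.3f}, euclidean={euclidean:.3f}, manhattan={manhattan:.3f}")
```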
2. Text Classification
- Represent sentences as average or weighted sum of word vectors
- Use as input to neural networks
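A minimal sketch of the averaging approach above; `word_vectors` stands in for any word-to-vector lookup (e.g. a trained Word2Vec model's `wv` attribute), and the toy table below is invented:

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim=100):
    """Average the vectors of in-vocabulary tokens; zero vector if none are known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy lookup table standing in for trained embeddings.
word_vectors = {"good": np.array([0.9, 0.1]), "movie": np.array([0.4, 0.5])}
print(sentence_vector(["good", "movie", "unknown"], word_vectors, dim=2))
# The resulting fixed-length vector can be fed to any classifier
# (logistic regression, an MLP, etc.) as the sentence representation.
```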
3. Named Entity Recognition
- Word vectors as features
- Combine with CRF and other models
4. Machine Translation
- Align source and target language word vectors
- Improve translation quality
5. Information Retrieval
- Vector representation of documents and queries
- Calculate relevance
Evaluation of Word Vectors
Intrinsic Evaluation
Word Similarity Tasks
- Human-annotated word pair similarity
- Calculate the correlation (typically Spearman) between word-vector similarity and the human annotations
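A minimal sketch of this evaluation, assuming scipy is available; the scores below are invented stand-ins for a real dataset such as WordSim-353:

```python
from scipy.stats import spearmanr

# Invented human similarity ratings and model cosine similarities for word pairs;
# in practice these come from a dataset such as WordSim-353.
human_scores = [9.2, 8.5, 1.3, 0.5]
model_scores = [0.83, 0.79, 0.10, 0.15]

correlation, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.3f}")  # higher = closer to human judgments
```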
Word Analogy Tasks
- Test vector arithmetic capabilities
- Example: king - man + woman ≈ queen
Common Datasets
- WordSim-353
- SimLex-999
- MEN
Extrinsic Evaluation
Downstream Task Performance
- Text classification
- Named entity recognition
- Sentiment analysis
- Question answering systems
Practical Recommendations
1. Choose Appropriate Word Vectors
- Pre-trained word vectors: use vectors trained on large-scale corpora
- Domain adaptation: continue training on domain corpora
- Context-aware: BERT, GPT and other pre-trained models
2. Dimension Selection
- 50-300 dimensions: suitable for most tasks
- Higher dimensions: may improve performance but increase computational cost
- Experimental validation: determine optimal dimensions through experiments
3. Training Data
- Large-scale corpora: Wikipedia, Common Crawl
- Domain corpora: domain-specific text
- Data quality: cleaning and preprocessing
4. Hyperparameter Tuning
- Window size: typically 5-10
- Minimum word frequency: filter low-frequency words
- Negative sampling count: 5-20
- Number of iterations: 10-100
Latest Developments
1. Multilingual Word Vectors
- MUSE: multilingual word vector alignment
- LASER: multilingual sentence embeddings
- XLM-R: multilingual pre-trained model
2. Contrastive Learning
- SimCSE: sentence embeddings based on contrastive learning
- E5: text embedding model
- BGE: general text embedding models from BAAI (strong Chinese support)
3. Large Language Models
- ChatGPT, GPT-4: powerful language understanding
- LLaMA: open-source large models
- ChatGLM: Chinese-optimized model
4. Multimodal Embeddings
- CLIP: image-text alignment
- ALIGN: large-scale vision-language alignment model
- Flamingo: multimodal few-shot learning