Word vectors (word embeddings) map words into continuous vector spaces, enabling computers to process the semantic content of words numerically. They capture semantic and syntactic relationships between words and are a foundation of modern NLP.
Basic Concepts of Word Vectors
Definition
- Represent discrete words as continuous real-valued vectors
- Typically 50-1000 dimensions
- Semantically similar words are closer in vector space
Advantages
- Capture semantic similarity between words
- Dense, low-dimensional representations improve computational efficiency
- Support vector arithmetic (e.g., analogy reasoning)
- Avoid the sparsity problems of one-hot representations
Traditional Word Vector Methods
1. One-Hot Encoding
Principle
- Each word represented by a sparse vector
- Only one position is 1, others are 0
- Vector dimension equals vocabulary size
Disadvantages
- Curse of dimensionality: vector length equals vocabulary size and can reach hundreds of thousands
- Sparsity: all but one element is 0
- Cannot capture semantic relationships between words
- Cannot measure word similarity: any two distinct words are orthogonal, so their similarity is always zero
Example
```
Vocabulary: [I, like, NLP]
I:    [1, 0, 0]
like: [0, 1, 0]
NLP:  [0, 0, 1]
```
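A minimal Python sketch of the same toy vocabulary; nothing here comes from a specific library, the mapping is built by hand:

```python
# Toy one-hot encoding for the three-word vocabulary above.
vocabulary = ["I", "like", "NLP"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list with a 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("like"))  # [0, 1, 0]
```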
2. TF-IDF
Principle
- TF (Term Frequency): how often a word appears in a document
- IDF (Inverse Document Frequency): logarithm of the total number of documents divided by the number of documents containing the word; downweights words that appear everywhere
- TF-IDF = TF × IDF (see the computation sketch at the end of this section)
Advantages
- Considers word importance
- Suitable for information retrieval
Disadvantages
- Still sparse vectors
- Cannot capture semantics
- Ignores word order information
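To make the TF-IDF computation above concrete, here is a minimal sketch using only the Python standard library; the two toy documents are invented for illustration:

```python
import math

# Two toy documents (invented for illustration).
docs = [["I", "like", "NLP"], ["I", "like", "machine", "learning"]]

def tf_idf(word, doc, docs):
    """TF-IDF = term frequency in the document x inverse document frequency."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)   # documents containing the word
    idf = math.log(len(docs) / df)           # rarer words get a higher weight
    return tf * idf

print(tf_idf("NLP", docs[0], docs))   # > 0: "NLP" appears in only one document
print(tf_idf("like", docs[0], docs))  # 0.0: "like" appears in every document
```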
Modern Word Vector Methods
1. Word2Vec
Proposed by researchers at Google (Mikolov et al.) in 2013, Word2Vec includes two architectures:
CBOW (Continuous Bag-of-Words)
- Predict center word from context
- Fast, suitable for common words
- Averages the vectors of the context-window words to form the input
Skip-gram
- Predict context from center word
- Better performance on rare words
- Higher computational cost
Training Techniques
- Negative sampling: approximates the full softmax to speed up training
- Hierarchical softmax: reduces the cost of the output layer
- Subsampling: randomly discards very frequent words to balance word frequency
Example
```
king - man + woman ≈ queen
```
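The analogy above can be reproduced with a trained model; a minimal sketch assuming the gensim library (4.x). The tiny corpus is only there so the code runs; the analogy itself only emerges after training on large-scale text such as Wikipedia:

```python
from gensim.models import Word2Vec

# Tiny toy corpus just so the code runs end to end.
corpus = [["the", "king", "is", "a", "man"],
          ["the", "queen", "is", "a", "woman"]]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    negative=5,        # negative samples per positive example
)

# Vector arithmetic: king - man + woman ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```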
2. GloVe (Global Vectors for Word Representation)
Proposed by Stanford in 2014
Principle
- Combines global matrix factorization with local context-window methods
- Built on a word-word co-occurrence matrix
- Minimizes the difference between the dot product of word vectors and the logarithm of their co-occurrence count
Advantages
- Utilizes global statistical information
- Excellent performance on similarity tasks
- Fast training
Formula
```
J = Σ_{i,j} f(X_ij) · (w_i · w̃_j + b_i + b̃_j - log X_ij)²
```
where f(X_ij) is a weighting function that caps the influence of very frequent co-occurrences.
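A small numpy sketch of this weighted least-squares objective; the co-occurrence counts and initialization below are random placeholders, while x_max=100 and alpha=0.75 follow the original GloVe paper:

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """Weighted least-squares GloVe objective over a co-occurrence matrix X."""
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0)   # weighting function f(X_ij)
    log_X = np.log(np.where(X > 0, X, 1.0))              # log counts (masked below)
    residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - log_X
    # Only pairs that actually co-occur (X_ij > 0) contribute to the loss.
    return np.sum(np.where(X > 0, f * residual ** 2, 0.0))

# Toy example: 4-word vocabulary, 10-dimensional vectors, random initialization.
rng = np.random.default_rng(0)
V, d = 4, 10
X = rng.integers(0, 5, size=(V, V)).astype(float)        # fake co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```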
3. FastText
Proposed by Facebook in 2016
Core Innovation
- Represents each word as a bag of character-level n-grams (subwords)
- A word's vector is the sum of its subword vectors
- Can build vectors for out-of-vocabulary (OOV) words from their subwords
Advantages
- Handle languages with rich morphology
- Robust to spelling errors
- Support multiple languages
Example
```
Subwords of "apple" (character 3-grams): <ap, app, ppl, ple, le>
```
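The decomposition above can be reproduced with a few lines of Python; this sketch extracts only 3-grams for brevity, while FastText itself uses several n-gram lengths:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
# FastText represents "apple" as the sum of these subword vectors,
# so even unseen (OOV) words can be composed from their n-grams.
```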
Context-Aware Word Vectors
1. ELMo (Embeddings from Language Models)
Features
- Bidirectional LSTM
- Dynamically generate word vectors based on context
- Same word has different representations in different contexts
Advantages
- Solve polysemy problem
- Capture complex semantics
Disadvantages
- High computational cost
- Sequential LSTM computation limits parallel training (unlike Transformers)
2. BERT (Bidirectional Encoder Representations from Transformers)
Features
- Based on Transformer
- Deep bidirectional context
- Pre-training + fine-tuning paradigm
Advantages
- Powerful contextual understanding
- Suitable for a wide range of NLP tasks
- Transfer learning: pre-trained representations can be fine-tuned for downstream tasks
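A minimal sketch of extracting contextual word vectors with a pre-trained BERT model, assuming PyTorch and the Hugging Face transformers library are installed; the example sentences are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The same word ("bank") gets different vectors in different contexts.
sentences = ["I deposited money at the bank.",
             "We sat on the bank of the river."]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, sequence_length, hidden_size) contextual vectors.
print(outputs.last_hidden_state.shape)
```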
3. GPT (Generative Pre-trained Transformer)
Features
- Unidirectional (left-to-right)
- Autoregressive generation
- Large-scale pre-training
Advantages
- Powerful generation capability
- Few-shot learning
Applications of Word Vectors
1. Semantic Similarity Calculation
- Cosine similarity
- Euclidean distance
- Manhattan distance
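A quick numpy sketch of the three measures listed above, applied to two arbitrary example vectors:

```python
import numpy as np

v1 = np.array([0.2, 0.7, 0.1])   # example word vectors (values are arbitrary)
v2 = np.array([0.3, 0.6, 0.2])

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
euclidean = np.linalg.norm(v1 - v2)
manhattan = np.sum(np.abs(v1 - v2))

print(f"cosine={cosine:.3f}, euclidean={euclidean:.3f}, manhattan={manhattan:.3f}")
```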
2. Text Classification
- Represent sentences as average or weighted sum of word vectors
- Use as input to neural networks
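A minimal sketch of the averaging approach above; `word_vectors` stands in for any word-to-vector lookup (e.g. a trained Word2Vec model's `wv` attribute), and the toy table below is invented:

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim=100):
    """Average the vectors of in-vocabulary tokens; zero vector if none are known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy lookup table standing in for trained embeddings.
word_vectors = {"good": np.array([0.9, 0.1]), "movie": np.array([0.4, 0.5])}
print(sentence_vector(["good", "movie", "unknown"], word_vectors, dim=2))
# The resulting fixed-length vector can be fed to any classifier
# (logistic regression, an MLP, etc.) as the sentence representation.
```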
3. Named Entity Recognition
- Word vectors as features
- Combine with CRF and other models
4. Machine Translation
- Align source and target language word vectors
- Improve translation quality
5. Information Retrieval
- Vector representation of documents and queries
- Calculate relevance
Evaluation of Word Vectors
Intrinsic Evaluation
Word Similarity Tasks
- Human-annotated word pair similarity
- Calculate the correlation (typically Spearman) between word-vector similarity and the human annotations
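A minimal sketch of this evaluation, assuming scipy is available; the scores below are invented stand-ins for a real dataset such as WordSim-353:

```python
from scipy.stats import spearmanr

# Invented human similarity ratings and model cosine similarities for word pairs;
# in practice these come from a dataset such as WordSim-353.
human_scores = [9.2, 8.5, 1.3, 0.5]
model_scores = [0.83, 0.79, 0.10, 0.15]

correlation, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.3f}")  # higher = closer to human judgments
```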
Word Analogy Tasks
- Test vector arithmetic capabilities
- Example: king - man + woman ≈ queen
Common Datasets
- WordSim-353
- SimLex-999
- MEN
Extrinsic Evaluation
Downstream Task Performance
- Text classification
- Named entity recognition
- Sentiment analysis
- Question answering systems
Practical Recommendations
1. Choose Appropriate Word Vectors
- Pre-trained word vectors: use vectors trained on large-scale corpora
- Domain adaptation: continue training on domain corpora
- Context-aware: BERT, GPT and other pre-trained models
2. Dimension Selection
- 50-300 dimensions: suitable for most tasks
- Higher dimensions: may improve performance but increase computational cost
- Experimental validation: determine optimal dimensions through experiments
3. Training Data
- Large-scale corpora: Wikipedia, Common Crawl
- Domain corpora: domain-specific text
- Data quality: cleaning and preprocessing
4. Hyperparameter Tuning
- Window size: typically 5-10
- Minimum word frequency: filter low-frequency words
- Negative sampling count: 5-20
- Number of iterations: 10-100
Latest Developments
1. Multilingual Word Vectors
- MUSE: multilingual word vector alignment
- LASER: multilingual sentence embeddings
- XLM-R: multilingual pre-trained model
2. Contrastive Learning
- SimCSE: sentence embeddings based on contrastive learning
- E5: text embedding model
- BGE: general text embedding models from BAAI (strong Chinese support)
3. Large Language Models
- ChatGPT, GPT-4: powerful language understanding
- LLaMA: open-source large models
- ChatGLM: Chinese-optimized model
4. Multimodal Embeddings
- CLIP: image-text alignment
- ALIGN: large-scale vision-language alignment model
- Flamingo: multimodal few-shot learning