
What are Word Vectors and What are Common Word Vector Methods in NLP?

February 18, 17:38

Word vectors (word embeddings) map words into a continuous vector space so that computers can represent and process their semantic content. Because they capture semantic and syntactic relationships between words, word vectors are a foundation of modern NLP.

Basic Concepts of Word Vectors

Definition

  • Represent discrete words as continuous real-valued vectors
  • Typically 50-1000 dimensions
  • Semantically similar words are closer in vector space

Advantages

  • Capture semantic similarity
  • Reduce dimensionality, improve computational efficiency
  • Support vector operations
  • Solve sparsity problems

Traditional Word Vector Methods

1. One-Hot Encoding

Principle

  • Each word represented by a sparse vector
  • Only one position is 1, others are 0
  • Vector dimension equals vocabulary size

Disadvantages

  • Curse of dimensionality: vector size equals vocabulary size, which can reach millions
  • Sparsity: all but one element are 0
  • Cannot capture semantic relationships
  • Word similarity is uninformative: any two distinct one-hot vectors are orthogonal

Example

text
Vocabulary: [I, like, NLP]
I:    [1, 0, 0]
like: [0, 1, 0]
NLP:  [0, 0, 1]
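
The same mapping as a minimal NumPy sketch (the three-word vocabulary is illustrative):

python
import numpy as np

vocab = ["I", "like", "NLP"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A |V|-dimensional vector with a single 1 at the word's index
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("like"))  # [0. 1. 0.]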

2. TF-IDF

Principle

  • TF (Term Frequency): how often a word occurs in a document
  • IDF (Inverse Document Frequency): log(total documents / documents containing the word); down-weights words that appear across many documents
  • TF-IDF = TF × IDF (see the sketch after the disadvantages list)

Advantages

  • Considers word importance
  • Suitable for information retrieval

Disadvantages

  • Still sparse vectors
  • Cannot capture semantics
  • Ignores word order information
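
A minimal sketch with scikit-learn's TfidfVectorizer, assuming a recent scikit-learn (the three-document corpus is made up for illustration; note the default tokenizer drops single-character tokens like "I"):

python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I like NLP",
    "I like deep learning",
    "NLP is fun",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray().round(2))                # one TF-IDF vector per document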

Modern Word Vector Methods

1. Word2Vec

Proposed by Google in 2013, Word2Vec includes two architectures:

CBOW (Continuous Bag-of-Words)

  • Predict center word from context
  • Fast, suitable for common words
  • Averages the word vectors in the context window

Skip-gram

  • Predict context from center word
  • Better performance on rare words
  • Higher computational cost

Training Techniques

  • Negative sampling: updates only a few sampled negative words instead of the full softmax, accelerating training
  • Hierarchical softmax: organizes the vocabulary in a binary tree, reducing per-step cost from O(|V|) to O(log |V|)
  • Subsampling: randomly drops very frequent words to balance word frequencies

Example

text
king - man + woman ≈ queen
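
A minimal gensim sketch of both architectures (the toy corpus is far too small to learn meaningful vectors; assume millions of sentences in practice):

python
from gensim.models import Word2Vec

# Tokenized corpus; in practice use a large collection of sentences
sentences = [
    ["i", "like", "nlp"],
    ["i", "like", "deep", "learning"],
    ["nlp", "is", "fun"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram; negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, negative=5)

print(model.wv["nlp"][:5])               # first 5 dimensions of a word vector
print(model.wv.similarity("nlp", "fun"))

# The analogy above, meaningful only with a large training corpus:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])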

2. GloVe (Global Vectors for Word Representation)

Proposed by Stanford in 2014

Principle

  • Combines global matrix factorization with local context windows
  • Built on a word-word co-occurrence matrix
  • Fits word-vector dot products (plus biases) to the log of co-occurrence counts

Advantages

  • Utilizes global statistical information
  • Excellent performance on similarity tasks
  • Fast training

Formula

text
J = ∑_{i,j} f(X_ij) · (w_i · w̃_j + b_i + b̃_j − log X_ij)²

where X_ij counts co-occurrences of words i and j, w̃_j and b̃_j are the context-word vector and bias, and f is a weighting function that caps the influence of very frequent pairs.
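
Pretrained GloVe vectors can be loaded through the gensim-data downloader; a sketch assuming internet access and the model name from the gensim-data catalog:

python
import gensim.downloader as api

# Downloads 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

print(glove.most_similar("king", topn=3))
print(glove.similarity("cat", "dog"))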

3. FastText

Proposed by Facebook in 2016

Core Innovation

  • Subword-based word vectors
  • Handle out-of-vocabulary (OOV) words
  • Consider character-level n-grams

Advantages

  • Handle languages with rich morphology
  • Robust to spelling errors
  • Support multiple languages

Example

text
"apple" with boundary markers -> "<apple>"
Character 3-gram subwords: <ap, app, ppl, ple, le>
(the whole word <apple> is also kept as one unit)
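
A minimal gensim FastText sketch showing OOV handling (tiny corpus for illustration only):

python
from gensim.models import FastText

sentences = [
    ["i", "like", "apples"],
    ["apples", "are", "tasty"],
]

# min_n/max_n set the character n-gram range used for subwords
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# "apple" never appears in training, yet gets a vector from its subwords
print(model.wv["apple"][:5])
print(model.wv.similarity("apple", "apples"))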

Context-Aware Word Vectors

1. ELMo (Embeddings from Language Models)

Features

  • Bidirectional LSTM
  • Dynamically generate word vectors based on context
  • Same word has different representations in different contexts

Advantages

  • Solve polysemy problem
  • Capture complex semantics

Disadvantages

  • High computational cost
  • LSTMs process tokens sequentially, limiting parallelism compared with Transformers

2. BERT (Bidirectional Encoder Representations from Transformers)

Features

  • Based on Transformer
  • Deep bidirectional context
  • Pre-training + fine-tuning paradigm

Advantages

  • Powerful context understanding
  • Suitable for various NLP tasks
  • Transfer learning across tasks
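
A sketch of extracting contextual vectors with Hugging Face transformers (assuming the transformers and torch packages are installed; note how "bank" gets different vectors in different sentences):

python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # Contextual vector of `word` (assumed to stay a single WordPiece token)
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("i deposited money at the bank", "bank")
v2 = word_vector("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0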

3. GPT (Generative Pre-trained Transformer)

Features

  • Unidirectional (left-to-right)
  • Autoregressive generation
  • Large-scale pre-training

Advantages

  • Powerful generation capability
  • Few-shot learning

Applications of Word Vectors

1. Semantic Similarity Calculation

  • Cosine similarity
  • Euclidean distance
  • Manhattan distance
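
Cosine similarity, the most common choice, in a few lines of NumPy:

python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a · b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, 0.5, 0.1])
b = np.array([0.3, 0.4, 0.0])
print(cosine_similarity(a, b))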

2. Text Classification

  • Represent sentences as average or weighted sum of word vectors
  • Use as input to neural networks
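
A sketch of the averaging baseline (`wv` stands for any word-to-vector mapping, e.g. a dict or a gensim KeyedVectors object; the name is illustrative):

python
import numpy as np

def sentence_vector(tokens, wv, dim=100):
    # Simple baseline: average the vectors of in-vocabulary words
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)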

3. Named Entity Recognition

  • Word vectors as features
  • Combine with CRF and other models

4. Machine Translation

  • Align source and target language word vectors
  • Improve translation quality

5. Information Retrieval

  • Vector representation of documents and queries
  • Calculate relevance

Evaluation of Word Vectors

Intrinsic Evaluation

Word Similarity Tasks

  • Human-annotated similarity scores for word pairs
  • Report the (Spearman) correlation between vector similarities and the human ratings

Word Analogy Tasks

  • Test vector arithmetic capabilities
  • Example: king - man + woman ≈ queen

Common Datasets

  • WordSim-353
  • SimLex-999
  • MEN
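
Both intrinsic tasks can be run directly in gensim; a sketch using pretrained GloVe vectors from the gensim-data catalog and the evaluation files bundled with gensim's test data:

python
import gensim.downloader as api
from gensim.test.utils import datapath

wv = api.load("glove-wiki-gigaword-100")

# Word similarity: correlation against WordSim-353 human judgments
print(wv.evaluate_word_pairs(datapath("wordsim353.tsv")))

# Word analogies: accuracy on questions like king - man + woman = queen
print(wv.evaluate_word_analogies(datapath("questions-words.txt"))[0])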

Extrinsic Evaluation

Downstream Task Performance

  • Text classification
  • Named entity recognition
  • Sentiment analysis
  • Question answering systems

Practical Recommendations

1. Choose Appropriate Word Vectors

  • Pre-trained word vectors: use vectors trained on large-scale corpora
  • Domain adaptation: continue training on domain corpora
  • Context-aware: BERT, GPT and other pre-trained models

2. Dimension Selection

  • 50-300 dimensions: suitable for most tasks
  • Higher dimensions: may improve performance but increase computational cost
  • Experimental validation: determine optimal dimensions through experiments

3. Training Data

  • Large-scale corpora: Wikipedia, Common Crawl
  • Domain corpora: domain-specific text
  • Data quality: cleaning and preprocessing

4. Hyperparameter Tuning

  • Window size: typically 5-10
  • Minimum word frequency: filter low-frequency words
  • Negative sampling count: 5-20
  • Number of iterations: 10-100
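
These knobs map onto gensim's Word2Vec constructor; the values below are common starting points, not universal optima (the corpus path is illustrative):

python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="corpus.txt",  # one whitespace-tokenized sentence per line
    vector_size=300,           # embedding dimension
    window=5,                  # context window size
    min_count=5,               # drop words seen fewer than 5 times
    negative=10,               # negative samples per positive example
    epochs=10,                 # passes over the corpus
    sg=1,                      # 1 = skip-gram, 0 = CBOW
)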

Latest Developments

1. Multilingual Word Vectors

  • MUSE: multilingual word vector alignment
  • LASER: multilingual sentence embeddings
  • XLM-R: multilingual pre-trained model

2. Contrastive Learning

  • SimCSE: sentence embeddings based on contrastive learning
  • E5: text embedding model
  • BGE: Chinese embedding model

3. Large Language Models

  • ChatGPT, GPT-4: powerful language understanding
  • LLaMA: open-source large models
  • ChatGLM: Chinese-optimized model

4. Multimodal Embeddings

  • CLIP: image-text alignment
  • ALIGN: large-scale visual-language model
  • Flamingo: multimodal few-shot learning
Tags: NLP