Evaluating NLP models requires choosing evaluation metrics and methods that match the task at hand. The sections below cover evaluation approaches for the most common NLP tasks.
Text Classification Tasks
Common Metrics
1. Accuracy
- Number of correctly predicted samples / Total samples
- Suitable for balanced datasets
- Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision
- Proportion of actual positive cases among predicted positives
- Measures how reliable the model's positive predictions are
- Formula: Precision = TP / (TP + FP)
3. Recall
- Proportion of correctly predicted cases among actual positives
- Focuses on completeness
- Formula: Recall = TP / (TP + FN)
4. F1 Score
- Harmonic mean of precision and recall
- Important metric balancing both
- Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
5. Macro and Micro Averages
- Macro average: unweighted mean of the per-class metrics
- Micro average: metric computed globally by pooling counts over all samples
- Commonly used in multi-class classification (see the scikit-learn sketch below)
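The metrics above map directly onto scikit-learn; a minimal sketch with toy labels:

```python
# Classification metrics with scikit-learn; y_true/y_pred are toy labels.
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_recall_fscore_support,
)

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

print("Accuracy:", accuracy_score(y_true, y_pred))

# Macro average: unweighted mean of the per-class precision/recall/F1.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print("Macro P/R/F1:", p, r, f1)

# Micro average: pool TP/FP/FN over all samples before computing the metric.
print("Micro F1:", f1_score(y_true, y_pred, average="micro"))
```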
Practical Recommendations
- Prioritize F1 score for imbalanced classes
- Use confusion matrix to analyze error types
- Choose appropriate metrics based on business requirements
Named Entity Recognition (NER)
Evaluation Methods
1. Entity-based Evaluation
- Precision: Correctly identified entities / Total identified entities
- Recall: Correctly identified entities / Total actual entities
- F1 score: Harmonic mean of precision and recall
2. Token-based Evaluation
- Evaluate tag correctness token by token
- Exact match: both entity boundaries and entity type must be correct
- Relaxed match: partial boundary overlap counts as correct if the entity type matches
3. Common Tools
- CoNLL evaluation script
- seqeval library
- spaCy evaluation tools
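A minimal sketch of entity-level scoring with the seqeval library, using toy BIO tag sequences:

```python
# Entity-level NER evaluation with seqeval; the BIO tag sequences are toy data.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"], ["O", "B-ORG", "I-ORG", "O"]]

# An entity counts as correct only if both its boundaries and its type
# exactly match the gold annotation.
print("Entity-level F1:", f1_score(y_true, y_pred))

# Per-entity-type precision / recall / F1 / support.
print(classification_report(y_true, y_pred))
```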
Practical Recommendations
- Distinguish evaluation by entity type
- Focus on boundary errors and type errors
- Use confusion matrix to analyze error patterns
Machine Translation
Automatic Evaluation Metrics
1. BLEU (Bilingual Evaluation Understudy)
- Based on n-gram matching
- Range: 0-1 (tools often report it scaled to 0-100), higher is better (see the sacrebleu sketch after this list)
- Considers precision and brevity penalty
- Formula: BLEU = BP × exp(∑w_n log p_n)
2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Mainly used for summarization evaluation
- ROUGE-N: Recall based on n-grams
- ROUGE-L: Based on longest common subsequence
3. METEOR
- Considers synonyms and morphological variations
- Balances precision and recall
- Correlates better with human judgments than BLEU
4. TER (Translation Error Rate)
- Edit distance from the hypothesis to the reference, normalized by reference length
- Counts insertions, deletions, substitutions, and shifts
- Lower is better
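A minimal corpus-level BLEU sketch with sacrebleu, using toy hypothesis and reference sentences (sacrebleu reports scores on a 0-100 scale):

```python
# Corpus-level BLEU with sacrebleu; hypotheses and references are toy sentences.
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a book on the desk"]
# One reference stream (inner list), aligned one-to-one with the hypotheses.
references = [["the cat is sitting on the mat", "a book lies on the desk"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)       # corpus BLEU on a 0-100 scale
print(bleu.precisions)  # n-gram precisions p_1 .. p_4
print(bleu.bp)          # brevity penalty
```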
Human Evaluation
- Fluency: Whether translation is natural and fluent
- Adequacy: Whether original meaning is fully conveyed
- Grammatical correctness: Whether it follows target language grammar
- Semantic consistency: Whether original semantics are maintained
Text Summarization
Evaluation Metrics
1. ROUGE Metrics (see the rouge-score sketch after this list)
- ROUGE-1: unigram overlap with the reference
- ROUGE-2: bigram overlap with the reference
- ROUGE-L: based on the longest common subsequence
- ROUGE-S: based on skip-bigrams
2. Content Coverage
- Whether key information is included
- Information completeness
- Factual accuracy
3. Fluency and Coherence
- Whether sentences are fluent
- Logical coherence
- Grammatical correctness
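A minimal ROUGE sketch with the rouge-score package, using toy reference and candidate summaries:

```python
# ROUGE-1 / ROUGE-2 / ROUGE-L with the rouge-score package; toy summaries.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the committee approved the new budget on friday"
candidate = "the new budget was approved on friday"

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    # Each entry carries precision, recall, and F-measure.
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F={s.fmeasure:.3f}")
```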
Practical Recommendations
- Combine automatic and human evaluation
- Focus on summary length and compression ratio
- Consider domain-specific metrics
Question Answering Systems
Extractive QA
1. Exact Match (EM)
- Answer matches exactly
- Strict metric
- Formula: EM = Number of completely correct answers / Total questions
2. F1 Score
- Word-level F1
- Allows partial correctness
- More lenient than EM (see the sketch after this list)
3. Position Accuracy
- Whether answer start position is correct
- Whether answer end position is correct
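A simplified sketch of Exact Match and token-level F1 in the spirit of the SQuAD metrics; the official script additionally lowercases and strips punctuation and articles, which is omitted here:

```python
# Simplified Exact Match and token-level F1 for extractive QA.
from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    # 1 if the answer string matches exactly, 0 otherwise.
    return int(prediction.strip() == gold.strip())

def token_f1(prediction: str, gold: str) -> float:
    # Word-level F1 over the overlapping tokens, allowing partial credit.
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris"))                    # 1
print(round(token_f1("in Paris France", "Paris"), 3))   # 0.5, partial credit
```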
Generative QA
1. BLEU/ROUGE
- Evaluate answer quality
- Similarity to reference answers
2. Semantic Similarity
- Calculate similarity using embedding models
- BERTScore, MoverScore, etc. (a BERTScore sketch follows this list)
3. Human Evaluation
- Answer relevance
- Answer accuracy
- Answer completeness
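A minimal BERTScore sketch, assuming the bert-score package is installed (it downloads a pretrained model on first use); the sentences are toy examples:

```python
# Semantic similarity between generated and reference answers with BERTScore.
# lang="en" selects a default English model.
from bert_score import score

candidates = ["The Eiffel Tower is located in Paris."]
references = ["The Eiffel Tower stands in Paris, France."]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```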
Sentiment Analysis
Evaluation Metrics
1. Classification Metrics
- Accuracy, precision, recall, F1
- Confusion matrix
- ROC curve and AUC (sketched below for the binary case)
2. Fine-grained Evaluation
- Polarity classification (positive/negative/neutral)
- Intensity classification (strong/medium/weak)
- Sentiment categories (happy, sad, angry, etc.)
3. Domain Adaptability
- Cross-domain performance
- Domain transfer capability
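A minimal sketch of ROC-AUC for a binary sentiment classifier with scikit-learn, using toy labels and predicted probabilities:

```python
# ROC-AUC for a binary sentiment classifier; toy labels and probabilities.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # 1 = positive sentiment
y_score = [0.9, 0.2, 0.65, 0.8, 0.3, 0.45, 0.7, 0.1]   # predicted P(positive)

print("AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points on the ROC curve
```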
Language Models
Perplexity
Definition
- Metric measuring model prediction ability
- Lower is better
- Formula: PP(W) = exp(-(1/N) ∑ log P(w_i | w_1, ..., w_{i-1}))
Calculation Methods
- Calculate based on test set
- Consider context window
- Exponential of negative average log probability
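A minimal sketch of the calculation from per-token log-probabilities; the values below are toy numbers standing in for log P(w_i | context) from a model:

```python
# Perplexity as the exponential of the negative average log-likelihood.
import math

log_probs = [-2.1, -0.7, -1.5, -3.2, -0.9]   # toy natural-log token probabilities

avg_neg_log_likelihood = -sum(log_probs) / len(log_probs)
perplexity = math.exp(avg_neg_log_likelihood)
print(f"Perplexity: {perplexity:.2f}")
```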
Limitations
- Doesn't directly reflect downstream task performance
- Not comparable across models with different vocabularies or tokenizers
- Requires a sufficiently large test set for stable estimates
Other Metrics
- Word Error Rate (WER)
- Character Error Rate (CER)
- BLEU (for generation tasks)
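A minimal sketch of WER and CER, here using the jiwer package (both can also be computed directly from Levenshtein edit distance); the sentences are toy examples:

```python
# Word and character error rates via edit distance, using the jiwer package.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))  # word-level edits / reference words
print("CER:", jiwer.cer(reference, hypothesis))  # character-level variant
```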
Evaluation Practices
Data Splitting
- Training, validation, and test sets
- Ensure consistent data distribution
- Use chronological splits for time series data
Cross-validation
- K-fold cross-validation
- Stratified cross-validation
- Time series cross-validation
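A minimal stratified k-fold sketch with scikit-learn, using a synthetic imbalanced dataset as a stand-in for real features and labels:

```python
# Stratified k-fold keeps the class distribution similar across folds,
# which matters for imbalanced labels; the dataset here is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print("Per-fold F1:", scores)
print("Mean F1:", scores.mean())
```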
Statistical Significance Testing
- t-test
- Wilcoxon signed-rank test
- Bootstrap method
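A minimal paired bootstrap sketch comparing two systems from their per-example scores; the score lists are toy values:

```python
# Paired bootstrap resampling: how often does system A beat system B when
# test examples are resampled with replacement?
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples  # fraction of resamples in which A outperforms B

scores_a = [0.80, 0.60, 0.90, 0.70, 0.75, 0.85, 0.65, 0.90]  # per-example scores, system A
scores_b = [0.70, 0.65, 0.85, 0.60, 0.70, 0.80, 0.60, 0.88]  # per-example scores, system B
print("P(A > B):", paired_bootstrap(scores_a, scores_b))
```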
Error Analysis
- Qualitative analysis of error cases
- Categorize error types
- Identify model weaknesses
Evaluation Tools and Libraries
Python Libraries
- scikit-learn: Classification metrics
- nltk: BLEU, METEOR
- sacrebleu: Standardized BLEU calculation
- rouge-score: ROUGE metrics
- seqeval: Sequence labeling evaluation
Online Evaluation Platforms
- GLUE: General Language Understanding Evaluation
- SuperGLUE: More challenging evaluation benchmark
- SQuAD: Question answering evaluation
- WMT: Machine translation evaluation
Best Practices
1. Choose Appropriate Metrics
- Select based on task type
- Consider business requirements
- Balance multiple metrics
2. Combine Automatic and Human Evaluation
- Automatic evaluation is fast but limited
- Human evaluation is accurate but costly
- Best results when combined
3. Focus on Generalization Ability
- Evaluate on multiple datasets
- Cross-domain testing
- Adversarial testing
4. Reproducibility
- Fix random seeds
- Record evaluation configuration
- Release evaluation code and data
5. Continuous Monitoring
- Production environment monitoring
- Data drift detection
- Performance degradation alerts