Text data preprocessing is a critical step in NLP projects that directly impacts model performance. High-quality preprocessing can improve model accuracy, accelerate training, and reduce noise.
Importance of Text Preprocessing
Why Preprocessing is Needed
- Raw text often contains noise such as HTML markup, URLs, and typos
- Data from different sources arrives in inconsistent formats
- Models require standardized input
- Clean, consistent input improves model performance and training efficiency
Preprocessing Goals
- Clean noisy data
- Standardize text format
- Extract useful features
- Reduce data dimensionality
Basic Text Cleaning
1. Remove Special Characters
HTML Tags
```python
from bs4 import BeautifulSoup

def remove_html(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()
```
URLs and Emails
```python
import re

def remove_urls_emails(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    return text
```
Special Symbols
```python
def remove_special_chars(text):
    # Keep only word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text
```
2. Handle Whitespace
Remove Extra Spaces
```python
def remove_extra_spaces(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text
```
Handle Newlines
```python
def normalize_newlines(text):
    text = text.replace('\r\n', '\n')
    text = text.replace('\r', '\n')
    return text
```
3. Handle Numbers
Number Normalization
```python
def normalize_numbers(text):
    # Replace every digit sequence with a placeholder token
    text = re.sub(r'\d+', '<NUM>', text)
    return text
```
Preserve Specific Numbers
```python
def preserve_specific_numbers(text):
    # Run before normalize_numbers so these patterns still contain digits.
    # Map years to a dedicated placeholder
    text = re.sub(r'\b(19|20)\d{2}\b', '<YEAR>', text)
    # Map US-style phone numbers (e.g. 555-123-4567) to a dedicated placeholder
    text = re.sub(r'\d{3}-\d{3}-\d{4}', '<PHONE>', text)
    return text
```
Text Normalization
1. Case Conversion
All Lowercase
```python
def to_lowercase(text):
    return text.lower()
```
Capitalize First Letter
```python
def capitalize_first(text):
    return text.capitalize()
```
Title Case
```python
def to_title_case(text):
    return text.title()
```
2. Spelling Correction
Using TextBlob
```python
from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    return str(blob.correct())
```
Using pyspellchecker
```python
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spelling(text):
    words = text.split()
    # correction() may return None for unknown words; fall back to the original word
    corrected = [spell.correction(word) or word for word in words]
    return ' '.join(corrected)
```
3. Contraction Expansion
Common Contractions
```python
contractions = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'s": " is",   # ambiguous: may also be a possessive
    "'d": " would",
    "'ll": " will",
    "'ve": " have",
    "'m": " am",
}

def expand_contractions(text):
    # Full forms like "can't" are listed first, so they are expanded
    # before the generic suffix rules apply
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    return text
```
Tokenization
1. Chinese Tokenization
Using jieba
```python
import jieba

def chinese_tokenization(text):
    return list(jieba.cut(text))
```
Using HanLP
```python
from pyhanlp import HanLP

def chinese_tokenization(text):
    # HanLP.segment returns Term objects; keep only the surface words
    return [term.word for term in HanLP.segment(text)]
```
Using LTP
```python
from ltp import LTP

ltp = LTP()

def chinese_tokenization(text):
    # LTP 4.0/4.1 API; newer releases use ltp.pipeline(..., tasks=["cws"])
    segments, _ = ltp.seg([text])
    return segments[0]
```
2. English Tokenization
Using NLTK
```python
from nltk.tokenize import word_tokenize

# requires the NLTK tokenizer models: nltk.download('punkt')
def english_tokenization(text):
    return word_tokenize(text)
```
Using spaCy
```python
import spacy

nlp = spacy.load('en_core_web_sm')

def english_tokenization(text):
    doc = nlp(text)
    return [token.text for token in doc]
```
Using BERT Tokenizer
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def bert_tokenization(text):
    return tokenizer.tokenize(text)
```
3. Subword Tokenization
WordPiece
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def wordpiece_tokenization(text):
    return tokenizer.tokenize(text)
```
BPE (Byte Pair Encoding)
```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def bpe_tokenization(text):
    return tokenizer.tokenize(text)
```
SentencePiece
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('model.model')

def sentencepiece_tokenization(text):
    return sp.encode_as_pieces(text)
```
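The snippet above assumes an already-trained model file (model.model). If you need to train one, the SentencePiece trainer can build it from a plain-text corpus; the corpus path and vocabulary size below are placeholder values:

```python
import sentencepiece as spm

# Train from a raw-text corpus with one sentence per line.
# 'corpus.txt', the model prefix, and vocab_size are illustrative.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='model',   # writes model.model and model.vocab
    vocab_size=8000,
    model_type='unigram'    # or 'bpe'
)
```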
Stopword Handling
1. Remove Stopwords
English Stopwords
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires: nltk.download('stopwords') and nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)
```
Chinese Stopwords
```python
import jieba

def load_stopwords(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)

stop_words = load_stopwords('chinese_stopwords.txt')

def remove_chinese_stopwords(text):
    words = jieba.cut(text)
    filtered = [word for word in words if word not in stop_words]
    return ' '.join(filtered)
```
2. Custom Stopwords
```python
def create_custom_stopwords():
    custom_stopwords = {
        '的', '了', '在', '是', '我', '有', '和', '就',
        '不', '人', '都', '一', '一个', '上', '也', '很',
        '到', '说', '要', '去', '你', '会', '着', '没有',
        '看', '好', '自己', '这'
    }
    return custom_stopwords
```
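In practice the custom list is usually merged into a base list rather than used alone. A brief sketch, reusing load_stopwords and chinese_stopwords.txt from the example above:

```python
import jieba

# Combine the file-based list with project-specific additions
stop_words = load_stopwords('chinese_stopwords.txt') | create_custom_stopwords()

def remove_stopwords_zh(text):
    return ' '.join(w for w in jieba.cut(text) if w not in stop_words)
```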
Lemmatization and Stemming
1. Lemmatization
Using NLTK
```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    words = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized)
```
Using spaCy
```python
import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatize(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])
```
2. Stemming
Porter Stemmer
```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def stem(text):
    words = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed)
```
Snowball Stemmer
```python
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer('english')

def stem(text):
    words = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed)
```
Text Augmentation
1. Synonym Replacement
```python
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import random

# requires: nltk.download('wordnet')
def synonym_replacement(text, n=1):
    words = word_tokenize(text)
    new_words = words.copy()
    for _ in range(n):
        word_to_replace = random.choice(words)
        synonyms = []
        for syn in wordnet.synsets(word_to_replace):
            for lemma in syn.lemmas():
                if lemma.name() != word_to_replace:
                    # WordNet joins multi-word lemmas with underscores
                    synonyms.append(lemma.name().replace('_', ' '))
        if synonyms and word_to_replace in new_words:
            replacement = random.choice(synonyms)
            idx = new_words.index(word_to_replace)
            new_words[idx] = replacement
    return ' '.join(new_words)
```
2. Random Deletion
```python
import random
from nltk.tokenize import word_tokenize

def random_deletion(text, p=0.1):
    words = word_tokenize(text)
    if len(words) == 1:
        return text
    new_words = []
    for word in words:
        # Keep each word with probability 1 - p
        if random.random() > p:
            new_words.append(word)
    if len(new_words) == 0:
        # Never return an empty string; keep one random word
        return random.choice(words)
    return ' '.join(new_words)
```
3. Random Swap
```python
import random
from nltk.tokenize import word_tokenize

def random_swap(text, n=1):
    words = word_tokenize(text)
    if len(words) < 2:
        return text
    for _ in range(n):
        idx1, idx2 = random.sample(range(len(words)), 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)
```
4. Back-Translation
```python
from googletrans import Translator

# googletrans relies on an unofficial web API; availability can vary
translator = Translator()

def back_translate(text, intermediate_lang='fr'):
    # Translate to the intermediate language
    translated = translator.translate(text, dest=intermediate_lang).text
    # Translate back to the original language
    back_translated = translator.translate(translated, dest='en').text
    return back_translated
```
Feature Extraction
1. TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
# texts: list of (preprocessed) document strings
tfidf_matrix = vectorizer.fit_transform(texts)
```
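A brief usage sketch: the fitted vocabulary can be inspected and reused to vectorize unseen documents.

```python
# Vocabulary learned during fit (scikit-learn >= 1.0)
print(vectorizer.get_feature_names_out()[:10])

# Reuse the fitted vocabulary on new documents
new_matrix = vectorizer.transform(['a new document to vectorize'])
print(new_matrix.shape)  # (1, n_features)
```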
2. N-gram
```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
# Bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
# Trigrams
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
```
3. Word Vectors
Word2Vec
```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

sentences = [word_tokenize(text) for text in texts]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
```
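After training, the learned vectors are available on model.wv; a short usage sketch (the query word 'data' is a placeholder and must appear in the training corpus):

```python
# Vector for a single word (raises KeyError if the word was not in the corpus)
vector = model.wv['data']
print(vector.shape)  # (100,) with vector_size=100

# Nearest neighbours in the embedding space
print(model.wv.most_similar('data', topn=5))
```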
GloVe
```python
from gensim.models import KeyedVectors

# Raw GloVe files have no word2vec header line; gensim >= 4.0 can load them with no_header=True
model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)
```
BERT Embeddings
```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single sentence vector
    return outputs.last_hidden_state.mean(dim=1)
```
Dataset Processing
1. Data Splitting
```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
# 0.25 of the remaining 80% equals 20% of the full data,
# giving a 60/20/20 train/validation/test split
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```
2. Stratified Sampling
```python
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(texts, labels):
    X_train = [texts[i] for i in train_index]
    X_test = [texts[i] for i in test_index]
    y_train = [labels[i] for i in train_index]
    y_test = [labels[i] for i in test_index]
```
3. Data Balancing
```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# imblearn expects a 2-D feature array, so wrap the raw texts in a single column
X_2d = np.array(X_train).reshape(-1, 1)

# Oversampling
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_2d, y_train)

# Undersampling
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X_2d, y_train)
```
Complete Preprocessing Pipeline
```python
import re
from bs4 import BeautifulSoup


class TextPreprocessor:
    def __init__(self, language='english', chinese_stopwords_path='chinese_stopwords.txt'):
        self.language = language
        self.chinese_stopwords_path = chinese_stopwords_path
        self.setup_tools()

    def setup_tools(self):
        # Import language-specific tools lazily and keep references on the
        # instance so the other methods can use them
        if self.language == 'english':
            from nltk.corpus import stopwords
            from nltk.stem import WordNetLemmatizer
            from nltk.tokenize import word_tokenize
            self.word_tokenize = word_tokenize
            self.stop_words = set(stopwords.words('english'))
            self.lemmatizer = WordNetLemmatizer()
        elif self.language == 'chinese':
            import jieba
            self.jieba = jieba
            self.stop_words = self.load_chinese_stopwords()

    def load_chinese_stopwords(self):
        # One stopword per line, UTF-8 (see the "Chinese Stopwords" example above)
        with open(self.chinese_stopwords_path, 'r', encoding='utf-8') as f:
            return set(line.strip() for line in f)

    def preprocess(self, text):
        text = self.clean_text(text)            # Cleaning
        text = self.normalize_text(text)        # Normalization
        tokens = self.tokenize(text)            # Tokenization
        tokens = self.remove_stopwords(tokens)  # Stopword removal
        tokens = self.lemmatize(tokens)         # Lemmatization
        return ' '.join(tokens)

    def clean_text(self, text):
        # Remove HTML tags
        text = BeautifulSoup(text, 'html.parser').get_text()
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Remove special characters
        text = re.sub(r'[^\w\s]', '', text)
        # Remove extra spaces
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def normalize_text(self, text):
        # Lowercase
        text = text.lower()
        # Expand contractions
        text = self.expand_contractions(text)
        return text

    def expand_contractions(self, text):
        # Uses the `contractions` dict from the "Contraction Expansion" section
        for contraction, expansion in contractions.items():
            text = text.replace(contraction, expansion)
        return text

    def tokenize(self, text):
        if self.language == 'english':
            return self.word_tokenize(text)
        elif self.language == 'chinese':
            return list(self.jieba.cut(text))

    def remove_stopwords(self, tokens):
        return [token for token in tokens if token.lower() not in self.stop_words]

    def lemmatize(self, tokens):
        if self.language == 'english':
            return [self.lemmatizer.lemmatize(token) for token in tokens]
        return tokens
```
Best Practices
1. Preprocessing Order
- Clean text (remove noise)
- Normalize (case, contractions)
- Tokenize
- Remove stopwords
- Lemmatize/Stem
- Feature extraction (a chained sketch follows this list)
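One way to chain the helper functions defined earlier in this order (a sketch, not the only valid ordering; texts is assumed to be a list of raw document strings):

```python
def preprocess_for_features(text):
    # 1. Clean (noise removal)
    text = remove_html(text)
    text = remove_urls_emails(text)
    # 2. Normalize -- expand contractions before stripping punctuation,
    #    so the apostrophes are still present
    text = to_lowercase(text)
    text = expand_contractions(text)
    text = remove_special_chars(text)
    text = remove_extra_spaces(text)
    # 3-5. Tokenize, remove stopwords, lemmatize
    text = remove_stopwords(text)   # tokenizes internally and rejoins
    text = lemmatize(text)
    return text

# 6. Feature extraction on the preprocessed corpus
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [preprocess_for_features(t) for t in texts]
features = TfidfVectorizer(max_features=1000).fit_transform(corpus)
```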
2. Avoid Over-processing
- Preserve information the task needs (e.g. casing and punctuation can matter for NER or sentiment)
- Match the preprocessing to the task and the downstream model
- Don't over-clean: aggressive cleaning can remove useful signal
3. Consistency
- Keep processing pipeline consistent
- Document all steps
- Ensure reproducibility (e.g. by recording the settings used, as sketched below)
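A minimal way to record the settings so the exact same pipeline can be re-applied at inference time (the keys and values below are illustrative):

```python
import json

preprocess_config = {
    'lowercase': True,
    'expand_contractions': True,
    'remove_stopwords': True,
    'lemmatize': True,
    'tokenizer': 'nltk_word_tokenize',
    'max_features': 1000,
}

# Store alongside the trained model and reload it before serving
with open('preprocess_config.json', 'w', encoding='utf-8') as f:
    json.dump(preprocess_config, f, indent=2)
```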
4. Performance Optimization
- Batch processing
- Parallelization
- Cache results (a combined sketch follows this list)
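One way to combine all three, reusing the TextPreprocessor class above (a sketch; joblib or a streaming dataset library are common alternatives):

```python
from functools import lru_cache
from multiprocessing import Pool

preprocessor = TextPreprocessor(language='english')

@lru_cache(maxsize=100_000)
def preprocess_one(text):
    # Cache: repeated documents are only processed once per worker
    return preprocessor.preprocess(text)

def preprocess_corpus(texts, processes=4, chunksize=1000):
    # Parallelize across processes; chunksize batches documents per task
    with Pool(processes=processes) as pool:
        return pool.map(preprocess_one, texts, chunksize=chunksize)
```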
Tools and Libraries
Python Libraries
- NLTK: Classic NLP tools
- spaCy: Industrial-grade NLP
- jieba: Chinese tokenization
- HanLP: Chinese NLP
- TextBlob: simple API for common tasks (e.g. spelling correction)
- gensim: word vectors and topic models
Pre-trained Models
- BERT: Contextual embeddings
- GPT: Generative language models
- T5: Text-to-text
Summary
Text data preprocessing is the foundation of NLP projects, directly impacting model performance. Choosing appropriate preprocessing methods requires considering task requirements, data characteristics, and model types. Through a systematic preprocessing pipeline, you can significantly improve model performance and training efficiency.