Text data preprocessing is a critical step in NLP projects that directly impacts model performance. High-quality preprocessing can improve model accuracy, accelerate training, and reduce noise.
Importance of Text Preprocessing
Why Preprocessing is Needed
- Raw text often contains noise such as HTML markup, URLs, and typos
- Data from different sources arrives in inconsistent formats
- Models require standardized input
- Clean, consistent input improves model performance and training efficiency
Preprocessing Goals
- Clean noisy data
- Standardize text format
- Extract useful features
- Reduce data dimensionality
Basic Text Cleaning
1. Remove Special Characters
HTML Tags
```python
from bs4 import BeautifulSoup

def remove_html(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()
```
URLs and Emails
```python
import re

def remove_urls_emails(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    return text
```
Special Symbols
```python
def remove_special_chars(text):
    # Keep only word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text
```
2. Handle Whitespace
Remove Extra Spaces
```python
def remove_extra_spaces(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text
```
Handle Newlines
```python
def normalize_newlines(text):
    text = text.replace('\r\n', '\n')
    text = text.replace('\r', '\n')
    return text
```
3. Handle Numbers
Number Normalization
```python
def normalize_numbers(text):
    # Replace every digit sequence with a placeholder token
    text = re.sub(r'\d+', '<NUM>', text)
    return text
```
Preserve Specific Numbers
```python
def preserve_specific_numbers(text):
    # Run before normalize_numbers so these patterns still contain digits.
    # Map years to a dedicated placeholder
    text = re.sub(r'\b(19|20)\d{2}\b', '<YEAR>', text)
    # Map US-style phone numbers (e.g. 555-123-4567) to a dedicated placeholder
    text = re.sub(r'\d{3}-\d{3}-\d{4}', '<PHONE>', text)
    return text
```
Text Normalization
1. Case Conversion
All Lowercase
```python
def to_lowercase(text):
    return text.lower()
```
Capitalize First Letter
```python
def capitalize_first(text):
    return text.capitalize()
```
Title Case
```python
def to_title_case(text):
    return text.title()
```
2. Spelling Correction
Using TextBlob
```python
from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    return str(blob.correct())
```
Using pyspellchecker
```python
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spelling(text):
    words = text.split()
    # correction() may return None for unknown words; fall back to the original word
    corrected = [spell.correction(word) or word for word in words]
    return ' '.join(corrected)
```
3. Contraction Expansion
Common Contractions
```python
contractions = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'s": " is",   # ambiguous: may also be a possessive
    "'d": " would",
    "'ll": " will",
    "'ve": " have",
    "'m": " am",
}

def expand_contractions(text):
    # Full forms like "can't" are listed first, so they are expanded
    # before the generic suffix rules apply
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    return text
```
Tokenization
1. Chinese Tokenization
Using jieba
```python
import jieba

def chinese_tokenization(text):
    return list(jieba.cut(text))
```
Using HanLP
```python
from pyhanlp import HanLP

def chinese_tokenization(text):
    # HanLP.segment returns Term objects; keep only the surface words
    return [term.word for term in HanLP.segment(text)]
```
Using LTP
```python
from ltp import LTP

ltp = LTP()

def chinese_tokenization(text):
    # LTP 4.0/4.1 API; newer releases use ltp.pipeline(..., tasks=["cws"])
    segments, _ = ltp.seg([text])
    return segments[0]
```
2. English Tokenization
Using NLTK
```python
from nltk.tokenize import word_tokenize

# requires the NLTK tokenizer models: nltk.download('punkt')
def english_tokenization(text):
    return word_tokenize(text)
```
Using spaCy
```python
import spacy

nlp = spacy.load('en_core_web_sm')

def english_tokenization(text):
    doc = nlp(text)
    return [token.text for token in doc]
```
Using BERT Tokenizer
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def bert_tokenization(text):
    return tokenizer.tokenize(text)
```
3. Subword Tokenization
WordPiece
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def wordpiece_tokenization(text):
    return tokenizer.tokenize(text)
```
BPE (Byte Pair Encoding)
```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def bpe_tokenization(text):
    return tokenizer.tokenize(text)
```
SentencePiece
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('model.model')

def sentencepiece_tokenization(text):
    return sp.encode_as_pieces(text)
```
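The snippet above assumes an already-trained model file (model.model). If you need to train one, the SentencePiece trainer can build it from a plain-text corpus; the corpus path and vocabulary size below are placeholder values:

```python
import sentencepiece as spm

# Train from a raw-text corpus with one sentence per line.
# 'corpus.txt', the model prefix, and vocab_size are illustrative.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='model',   # writes model.model and model.vocab
    vocab_size=8000,
    model_type='unigram'    # or 'bpe'
)
```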
Stopword Handling
1. Remove Stopwords
English Stopwords
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires: nltk.download('stopwords') and nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)
```
Chinese Stopwords
```python
import jieba

def load_stopwords(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)

stop_words = load_stopwords('chinese_stopwords.txt')

def remove_chinese_stopwords(text):
    words = jieba.cut(text)
    filtered = [word for word in words if word not in stop_words]
    return ' '.join(filtered)
```
2. Custom Stopwords
```python
def create_custom_stopwords():
    custom_stopwords = {
        '的', '了', '在', '是', '我', '有', '和', '就',
        '不', '人', '都', '一', '一个', '上', '也', '很',
        '到', '说', '要', '去', '你', '会', '着', '没有',
        '看', '好', '自己', '这'
    }
    return custom_stopwords
```
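In practice the custom list is usually merged into a base list rather than used alone. A brief sketch, reusing load_stopwords and chinese_stopwords.txt from the example above:

```python
import jieba

# Combine the file-based list with project-specific additions
stop_words = load_stopwords('chinese_stopwords.txt') | create_custom_stopwords()

def remove_stopwords_zh(text):
    return ' '.join(w for w in jieba.cut(text) if w not in stop_words)
```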
Lemmatization and Stemming
1. Lemmatization
Using NLTK
```python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    words = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized)
```
Using spaCy
```python
import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatize(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])
```
2. Stemming
Porter Stemmer
```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def stem(text):
    words = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed)
```
Snowball Stemmer
```python
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer('english')

def stem(text):
    words = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed)
```
Text Augmentation
1. Synonym Replacement
```python
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import random

# requires: nltk.download('wordnet')
def synonym_replacement(text, n=1):
    words = word_tokenize(text)
    new_words = words.copy()
    for _ in range(n):
        word_to_replace = random.choice(words)
        synonyms = []
        for syn in wordnet.synsets(word_to_replace):
            for lemma in syn.lemmas():
                if lemma.name() != word_to_replace:
                    # WordNet joins multi-word lemmas with underscores
                    synonyms.append(lemma.name().replace('_', ' '))
        if synonyms and word_to_replace in new_words:
            replacement = random.choice(synonyms)
            idx = new_words.index(word_to_replace)
            new_words[idx] = replacement
    return ' '.join(new_words)
```
2. Random Deletion
```python
import random
from nltk.tokenize import word_tokenize

def random_deletion(text, p=0.1):
    words = word_tokenize(text)
    if len(words) == 1:
        return text
    new_words = []
    for word in words:
        # Keep each word with probability 1 - p
        if random.random() > p:
            new_words.append(word)
    if len(new_words) == 0:
        # Never return an empty string; keep one random word
        return random.choice(words)
    return ' '.join(new_words)
```
3. Random Swap
```python
import random
from nltk.tokenize import word_tokenize

def random_swap(text, n=1):
    words = word_tokenize(text)
    if len(words) < 2:
        return text
    for _ in range(n):
        idx1, idx2 = random.sample(range(len(words)), 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)
```
4. Back-Translation
```python
from googletrans import Translator

# googletrans relies on an unofficial web API; availability can vary
translator = Translator()

def back_translate(text, intermediate_lang='fr'):
    # Translate to the intermediate language
    translated = translator.translate(text, dest=intermediate_lang).text
    # Translate back to the original language
    back_translated = translator.translate(translated, dest='en').text
    return back_translated
```
Feature Extraction
1. TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
# texts: list of (preprocessed) document strings
tfidf_matrix = vectorizer.fit_transform(texts)
```
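A brief usage sketch: the fitted vocabulary can be inspected and reused to vectorize unseen documents.

```python
# Vocabulary learned during fit (scikit-learn >= 1.0)
print(vectorizer.get_feature_names_out()[:10])

# Reuse the fitted vocabulary on new documents
new_matrix = vectorizer.transform(['a new document to vectorize'])
print(new_matrix.shape)  # (1, n_features)
```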
2. N-gram
```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
# Bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
# Trigrams
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
```
3. Word Vectors
Word2Vec
```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

sentences = [word_tokenize(text) for text in texts]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
```
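After training, the learned vectors are available on model.wv; a short usage sketch (the query word 'data' is a placeholder and must appear in the training corpus):

```python
# Vector for a single word (raises KeyError if the word was not in the corpus)
vector = model.wv['data']
print(vector.shape)  # (100,) with vector_size=100

# Nearest neighbours in the embedding space
print(model.wv.most_similar('data', topn=5))
```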
GloVe
```python
from gensim.models import KeyedVectors

# Raw GloVe files have no word2vec header line; gensim >= 4.0 can load them with no_header=True
model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)
```
BERT Embeddings
```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single sentence vector
    return outputs.last_hidden_state.mean(dim=1)
```
Dataset Processing
1. Data Splitting
```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
# 0.25 of the remaining 80% equals 20% of the full data,
# giving a 60/20/20 train/validation/test split
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```
2. Stratified Sampling
```python
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(texts, labels):
    X_train = [texts[i] for i in train_index]
    X_test = [texts[i] for i in test_index]
    y_train = [labels[i] for i in train_index]
    y_test = [labels[i] for i in test_index]
```
3. Data Balancing
```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# imblearn expects a 2-D feature array, so wrap the raw texts in a single column
X_2d = np.array(X_train).reshape(-1, 1)

# Oversampling
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X_2d, y_train)

# Undersampling
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X_2d, y_train)
```
Complete Preprocessing Pipeline
```python
import re
from bs4 import BeautifulSoup


class TextPreprocessor:
    def __init__(self, language='english', chinese_stopwords_path='chinese_stopwords.txt'):
        self.language = language
        self.chinese_stopwords_path = chinese_stopwords_path
        self.setup_tools()

    def setup_tools(self):
        # Import language-specific tools lazily and keep references on the
        # instance so the other methods can use them
        if self.language == 'english':
            from nltk.corpus import stopwords
            from nltk.stem import WordNetLemmatizer
            from nltk.tokenize import word_tokenize
            self.word_tokenize = word_tokenize
            self.stop_words = set(stopwords.words('english'))
            self.lemmatizer = WordNetLemmatizer()
        elif self.language == 'chinese':
            import jieba
            self.jieba = jieba
            self.stop_words = self.load_chinese_stopwords()

    def load_chinese_stopwords(self):
        # One stopword per line, UTF-8 (see the "Chinese Stopwords" example above)
        with open(self.chinese_stopwords_path, 'r', encoding='utf-8') as f:
            return set(line.strip() for line in f)

    def preprocess(self, text):
        text = self.clean_text(text)            # Cleaning
        text = self.normalize_text(text)        # Normalization
        tokens = self.tokenize(text)            # Tokenization
        tokens = self.remove_stopwords(tokens)  # Stopword removal
        tokens = self.lemmatize(tokens)         # Lemmatization
        return ' '.join(tokens)

    def clean_text(self, text):
        # Remove HTML tags
        text = BeautifulSoup(text, 'html.parser').get_text()
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Remove special characters
        text = re.sub(r'[^\w\s]', '', text)
        # Remove extra spaces
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def normalize_text(self, text):
        # Lowercase
        text = text.lower()
        # Expand contractions
        text = self.expand_contractions(text)
        return text

    def expand_contractions(self, text):
        # Uses the `contractions` dict from the "Contraction Expansion" section
        for contraction, expansion in contractions.items():
            text = text.replace(contraction, expansion)
        return text

    def tokenize(self, text):
        if self.language == 'english':
            return self.word_tokenize(text)
        elif self.language == 'chinese':
            return list(self.jieba.cut(text))

    def remove_stopwords(self, tokens):
        return [token for token in tokens if token.lower() not in self.stop_words]

    def lemmatize(self, tokens):
        if self.language == 'english':
            return [self.lemmatizer.lemmatize(token) for token in tokens]
        return tokens
```
Best Practices
1. Preprocessing Order
- Clean text (remove noise)
- Normalize (case, contractions)
- Tokenize
- Remove stopwords
- Lemmatize/Stem
- Feature extraction (a chained sketch follows this list)
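One way to chain the helper functions defined earlier in this order (a sketch, not the only valid ordering; texts is assumed to be a list of raw document strings):

```python
def preprocess_for_features(text):
    # 1. Clean (noise removal)
    text = remove_html(text)
    text = remove_urls_emails(text)
    # 2. Normalize -- expand contractions before stripping punctuation,
    #    so the apostrophes are still present
    text = to_lowercase(text)
    text = expand_contractions(text)
    text = remove_special_chars(text)
    text = remove_extra_spaces(text)
    # 3-5. Tokenize, remove stopwords, lemmatize
    text = remove_stopwords(text)   # tokenizes internally and rejoins
    text = lemmatize(text)
    return text

# 6. Feature extraction on the preprocessed corpus
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [preprocess_for_features(t) for t in texts]
features = TfidfVectorizer(max_features=1000).fit_transform(corpus)
```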
2. Avoid Over-processing
- Preserve information the task needs (e.g. casing and punctuation can matter for NER or sentiment)
- Match the preprocessing to the task and the downstream model
- Don't over-clean: aggressive cleaning can remove useful signal
3. Consistency
- Keep processing pipeline consistent
- Document all steps
- Ensure reproducibility (e.g. by recording the settings used, as sketched below)
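A minimal way to record the settings so the exact same pipeline can be re-applied at inference time (the keys and values below are illustrative):

```python
import json

preprocess_config = {
    'lowercase': True,
    'expand_contractions': True,
    'remove_stopwords': True,
    'lemmatize': True,
    'tokenizer': 'nltk_word_tokenize',
    'max_features': 1000,
}

# Store alongside the trained model and reload it before serving
with open('preprocess_config.json', 'w', encoding='utf-8') as f:
    json.dump(preprocess_config, f, indent=2)
```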
4. Performance Optimization
- Batch processing
- Parallelization
- Cache results (a combined sketch follows this list)
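One way to combine all three, reusing the TextPreprocessor class above (a sketch; joblib or a streaming dataset library are common alternatives):

```python
from functools import lru_cache
from multiprocessing import Pool

preprocessor = TextPreprocessor(language='english')

@lru_cache(maxsize=100_000)
def preprocess_one(text):
    # Cache: repeated documents are only processed once per worker
    return preprocessor.preprocess(text)

def preprocess_corpus(texts, processes=4, chunksize=1000):
    # Parallelize across processes; chunksize batches documents per task
    with Pool(processes=processes) as pool:
        return pool.map(preprocess_one, texts, chunksize=chunksize)
```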
Tools and Libraries
Python Libraries
- NLTK: Classic NLP tools
- spaCy: Industrial-grade NLP
- jieba: Chinese tokenization
- HanLP: Chinese NLP
- TextBlob: simple API for common tasks (e.g. spelling correction)
- gensim: word vectors and topic models
Pre-trained Models
- BERT: Contextual embeddings
- GPT: Generative language models
- T5: Text-to-text
Summary
Text data preprocessing is the foundation of NLP projects, directly impacting model performance. Choosing appropriate preprocessing methods requires considering task requirements, data characteristics, and model types. Through a systematic preprocessing pipeline, you can significantly improve model performance and training efficiency.