How to Perform Text Data Preprocessing with NLP?

February 18, 17:02

Text data preprocessing is a critical step in NLP projects that directly impacts model performance. High-quality preprocessing can improve model accuracy, accelerate training, and reduce noise.

Importance of Text Preprocessing

Why Preprocessing is Needed

  • Raw text contains a large amount of noise
  • Data from different sources comes in inconsistent formats
  • Models require standardized input
  • Good preprocessing improves model performance and training efficiency

Preprocessing Goals

  • Clean noisy data
  • Standardize text format
  • Extract useful features
  • Reduce data dimensionality

Basic Text Cleaning

1. Remove Special Characters

HTML Tags

python
from bs4 import BeautifulSoup

def remove_html(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

URLs and Emails

python
import re

def remove_urls_emails(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    return text

Special Symbols

python
def remove_special_chars(text):
    text = re.sub(r'[^\w\s]', '', text)
    return text

2. Handle Whitespace

Remove Extra Spaces

python
def remove_extra_spaces(text):
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

Handle Newlines

python
def normalize_newlines(text):
    text = text.replace('\r\n', '\n')
    text = text.replace('\r', '\n')
    return text

3. Handle Numbers

Number Normalization

python
def normalize_numbers(text):
    text = re.sub(r'\d+', '<NUM>', text)
    return text

Preserve Specific Numbers

python
def preserve_specific_numbers(text):
    # Map years to a dedicated <YEAR> placeholder instead of the generic <NUM>
    text = re.sub(r'\b(19|20)\d{2}\b', '<YEAR>', text)
    # Map US-style phone numbers to a dedicated <PHONE> placeholder
    text = re.sub(r'\d{3}-\d{3}-\d{4}', '<PHONE>', text)
    return text

Text Normalization

1. Case Conversion

All Lowercase

python
def to_lowercase(text):
    return text.lower()

Capitalize First Letter

python
def capitalize_first(text):
    return text.capitalize()

Title Case

python
def to_title_case(text):
    return text.title()

2. Spelling Correction

Using TextBlob

python
from textblob import TextBlob

def correct_spelling(text):
    blob = TextBlob(text)
    return str(blob.correct())

Using pyspellchecker

python
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spelling(text):
    words = text.split()
    # Fall back to the original word when no correction is found
    corrected = [spell.correction(word) or word for word in words]
    return ' '.join(corrected)

3. Contraction Expansion

Common Contractions

python
contractions = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'s": " is",
    "'d": " would",
    "'ll": " will",
    "'ve": " have",
    "'m": " am",
}

def expand_contractions(text):
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    return text
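
As a quick check of the function above, a sentence with both a full-word and a suffix contraction expands as expected (note that these simple suffix rules will also turn a possessive 's into "is", which may not always be desirable):

python
text = "They're happy, but she can't come"
print(expand_contractions(text))
# They are happy, but she cannot come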

Tokenization

1. Chinese Tokenization

Using jieba

python
import jieba

def chinese_tokenization(text):
    return list(jieba.cut(text))

Using HanLP

python
from pyhanlp import HanLP

def chinese_tokenization(text):
    return HanLP.segment(text)

Using LTP

python
from ltp import LTP

ltp = LTP()

def chinese_tokenization(text):
    return ltp.cut(text)

2. English Tokenization

Using NLTK

python
from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' tokenizer data

def english_tokenization(text):
    return word_tokenize(text)

Using spaCy

python
import spacy

nlp = spacy.load('en_core_web_sm')

def english_tokenization(text):
    doc = nlp(text)
    return [token.text for token in doc]

Using BERT Tokenizer

python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def bert_tokenization(text):
    return tokenizer.tokenize(text)

3. Subword Tokenization

WordPiece

python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def wordpiece_tokenization(text):
    return tokenizer.tokenize(text)

BPE (Byte Pair Encoding)

python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def bpe_tokenization(text):
    return tokenizer.tokenize(text)

SentencePiece

python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('model.model')  # path to a trained SentencePiece model

def sentencepiece_tokenization(text):
    return sp.encode_as_pieces(text)

Stopword Handling

1. Remove Stopwords

English Stopwords

python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

Chinese Stopwords

python
import jieba

def load_stopwords(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return set([line.strip() for line in f])

stop_words = load_stopwords('chinese_stopwords.txt')

def remove_chinese_stopwords(text):
    words = jieba.cut(text)
    filtered = [word for word in words if word not in stop_words]
    return ' '.join(filtered)

2. Custom Stopwords

python
def create_custom_stopwords():
    custom_stopwords = {
        '的', '了', '在', '是', '我', '有', '和', '就', '不', '人',
        '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去',
        '你', '会', '着', '没有', '看', '好', '自己', '这'
    }
    return custom_stopwords

Lemmatization and Stemming

1. Lemmatization

Using NLTK

python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    words = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized)

Using spaCy

python
import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatize(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

2. Stemming

Porter Stemmer

python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem(text):
    words = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed)

Snowball Stemmer

python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

def stem(text):
    words = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed)

Text Augmentation

1. Synonym Replacement

python
from nltk.corpus import wordnet
import random

def synonym_replacement(text, n=1):
    words = word_tokenize(text)
    new_words = words.copy()
    for _ in range(n):
        word_to_replace = random.choice(words)
        synonyms = []
        for syn in wordnet.synsets(word_to_replace):
            for lemma in syn.lemmas():
                if lemma.name() != word_to_replace:
                    # Multi-word lemmas use underscores, e.g. 'take_off'
                    synonyms.append(lemma.name().replace('_', ' '))
        if synonyms and word_to_replace in new_words:
            replacement = random.choice(synonyms)
            idx = new_words.index(word_to_replace)
            new_words[idx] = replacement
    return ' '.join(new_words)

2. Random Deletion

python
def random_deletion(text, p=0.1):
    words = word_tokenize(text)
    if len(words) == 1:
        return text
    new_words = []
    for word in words:
        if random.random() > p:
            new_words.append(word)
    if len(new_words) == 0:
        return random.choice(words)
    return ' '.join(new_words)

3. Random Swap

python
def random_swap(text, n=1):
    words = word_tokenize(text)
    for _ in range(n):
        if len(words) < 2:
            return text
        idx1, idx2 = random.sample(range(len(words)), 2)
        words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)

4. Back-Translation

python
from googletrans import Translator

translator = Translator()

def back_translate(text, intermediate_lang='fr'):
    # Translate to the intermediate language
    translated = translator.translate(text, dest=intermediate_lang).text
    # Translate back to the original language
    back_translated = translator.translate(translated, dest='en').text
    return back_translated

Feature Extraction

1. TF-IDF

python
from sklearn.feature_extraction.text import TfidfVectorizer

# texts: a list of document strings
vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = vectorizer.fit_transform(texts)

2. N-gram

python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
# Bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
# Trigrams
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))

3. Word Vectors

Word2Vec

python
from gensim.models import Word2Vec

sentences = [word_tokenize(text) for text in texts]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

GloVe

python
from gensim.models import KeyedVectors

# Raw GloVe files have no word2vec header line;
# no_header=True (gensim >= 4.0) lets them load directly
model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)

BERT Embeddings

python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single sentence vector
    return outputs.last_hidden_state.mean(dim=1)
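
As a quick sanity check of the function above (assuming the pretrained weights download successfully), bert-base-uncased yields 768-dimensional sentence vectors:

python
embedding = get_bert_embedding("Text preprocessing matters.")
print(embedding.shape)  # torch.Size([1, 768])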

Dataset Processing

1. Data Splitting

python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
# 0.25 of the remaining 80% gives a 60/20/20 train/val/test split
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

2. Stratified Sampling

python
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(texts, labels):
    X_train, X_test = [texts[i] for i in train_index], [texts[i] for i in test_index]
    y_train, y_test = [labels[i] for i in train_index], [labels[i] for i in test_index]

3. Data Balancing

python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# X_train should be a 2-D feature matrix (e.g. TF-IDF vectors), not raw strings

# Oversampling
oversampler = RandomOverSampler()
X_resampled, y_resampled = oversampler.fit_resample(X_train, y_train)

# Undersampling
undersampler = RandomUnderSampler()
X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)

Complete Preprocessing Pipeline

python
import re
import jieba
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

class TextPreprocessor:
    def __init__(self, language='english'):
        self.language = language
        self.setup_tools()

    def setup_tools(self):
        if self.language == 'english':
            self.stop_words = set(stopwords.words('english'))
            self.lemmatizer = WordNetLemmatizer()
        elif self.language == 'chinese':
            # load_stopwords is defined in the "Chinese Stopwords" section above
            self.stop_words = load_stopwords('chinese_stopwords.txt')

    def preprocess(self, text):
        text = self.clean_text(text)            # cleaning
        text = self.normalize_text(text)        # normalization
        tokens = self.tokenize(text)            # tokenization
        tokens = self.remove_stopwords(tokens)  # stopword removal
        tokens = self.lemmatize(tokens)         # lemmatization
        return ' '.join(tokens)

    def clean_text(self, text):
        text = BeautifulSoup(text, 'html.parser').get_text()  # remove HTML tags
        text = re.sub(r'http\S+', '', text)                   # remove URLs
        text = re.sub(r'[^\w\s]', '', text)                   # remove special characters
        text = re.sub(r'\s+', ' ', text)                      # collapse extra whitespace
        return text.strip()

    def normalize_text(self, text):
        text = text.lower()
        # expand_contractions is defined in the "Contraction Expansion" section above
        text = expand_contractions(text)
        return text

    def tokenize(self, text):
        if self.language == 'english':
            return word_tokenize(text)
        return list(jieba.cut(text))

    def remove_stopwords(self, tokens):
        return [token for token in tokens if token.lower() not in self.stop_words]

    def lemmatize(self, tokens):
        if self.language == 'english':
            return [self.lemmatizer.lemmatize(token) for token in tokens]
        return tokens
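
A quick usage sketch of the class above, assuming the required NLTK data (punkt, stopwords, wordnet) has already been downloaded and that expand_contractions and load_stopwords from the earlier sections are in scope:

python
preprocessor = TextPreprocessor(language='english')
raw = "<p>Check out https://example.com, it's GREAT!!!</p>"
print(preprocessor.preprocess(raw))
# -> "check great": markup, URL, punctuation and stopwords are stripped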

Best Practices

1. Preprocessing Order

  1. Clean text (remove noise)
  2. Normalize (case, contractions)
  3. Tokenize
  4. Remove stopwords
  5. Lemmatize/Stem
  6. Feature extraction (see the sketch below)
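
As a minimal end-to-end sketch of this order, the TextPreprocessor class above (steps 1-5) can feed directly into the TF-IDF vectorizer from the feature-extraction section (step 6); texts is assumed to be a list of raw document strings:

python
preprocessor = TextPreprocessor(language='english')
cleaned_texts = [preprocessor.preprocess(t) for t in texts]  # steps 1-5
vectorizer = TfidfVectorizer(max_features=1000)
features = vectorizer.fit_transform(cleaned_texts)           # step 6: feature extraction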

2. Avoid Over-processing

  • Preserve important information
  • Consider task requirements
  • Don't over-clean

3. Consistency

  • Keep the processing pipeline consistent between training and inference
  • Document all steps
  • Ensure reproducibility (fixed seeds, versioned configuration)

4. Performance Optimization

  • Batch processing
  • Parallelization
  • Cache results (see the sketch below)
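
A minimal sketch of these three ideas using only the standard library; it assumes the TextPreprocessor class defined above, and multiprocessing only pays off on reasonably large corpora:

python
from functools import lru_cache
from multiprocessing import Pool

# Module-level instance so worker processes can recreate or reuse it
_preprocessor = TextPreprocessor(language='english')

def preprocess_one(text):
    return _preprocessor.preprocess(text)

def preprocess_corpus(texts, workers=4):
    # chunksize batches the texts sent to each worker process;
    # call this from code guarded by: if __name__ == '__main__':
    with Pool(processes=workers) as pool:
        return pool.map(preprocess_one, texts, chunksize=100)

@lru_cache(maxsize=100_000)
def cached_preprocess(text):
    # Cache results when the corpus contains many duplicate strings
    return _preprocessor.preprocess(text)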

Tools and Libraries

Python Libraries

  • NLTK: Classic NLP tools
  • spaCy: Industrial-grade NLP
  • jieba: Chinese tokenization
  • HanLP: Chinese NLP
  • TextBlob: Simple and easy to use
  • gensim: Word vectors

Pre-trained Models

  • BERT: Contextual embeddings
  • GPT: Generation models
  • T5: Text-to-text

Summary

Text data preprocessing is the foundation of NLP projects, directly impacting model performance. Choosing appropriate preprocessing methods requires considering task requirements, data characteristics, and model types. Through a systematic preprocessing pipeline, you can significantly improve model performance and training efficiency.

Tags: NLP