The Bag of Words (BoW) model is one of the most fundamental text representation techniques in Natural Language Processing (NLP). It converts text (such as sentences or documents) into fixed-length vectors. The core idea of this model is to represent text using the occurrence counts of each word in the vocabulary, while ignoring word order and grammatical structure.
The main steps of the Bag of Words model include:
- Vocabulary Creation: First, collect all distinct words from all documents to build a vocabulary.
- Text Vectorization: Next, convert each document into a vector where the length matches the vocabulary size, and each element corresponds to the frequency of a specific word in the document.
For example, consider the following two sentences:
- Sentence 1: "I like watching movies"
- Sentence 2: "I don't like watching TV"
Assume the vocabulary is {"I", "like", "watching", "movies", "don't", "TV"} (taking each distinct token from the two sentences); then these sentences can be represented as:
- Vector 1: [1, 1, 1, 1, 0, 0] (corresponding to "I like watching movies")
- Vector 2: [1, 1, 1, 0, 1, 1] (corresponding to "I don't like watching TV")
Each number represents the occurrence count of the corresponding word in the sentence.
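The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization (so "watching" and "don't" are each kept as single tokens); real systems usually apply a proper tokenizer and normalization first.

```python
from collections import Counter

def build_vocabulary(documents):
    """Step 1: collect all distinct tokens across documents, in order of first appearance."""
    vocab = []
    for doc in documents:
        for token in doc.split():
            if token not in vocab:
                vocab.append(token)
    return vocab

def vectorize(document, vocab):
    """Step 2: map a document to a vector of per-word occurrence counts."""
    counts = Counter(document.split())
    return [counts[word] for word in vocab]

docs = ["I like watching movies", "I don't like watching TV"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
# vectors → [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 1]]
```

Running this reproduces the two vectors from the example, with the vocabulary ordered by first appearance.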
The Bag of Words model is very simple to implement, but it has some limitations:
- Ignoring word order: All text is reduced to word frequency counts, meaning the model cannot capture semantic information conveyed by word order.
- High dimensionality and sparsity: With a large vocabulary, each text becomes a long vector with many zero elements, resulting in inefficiencies in computation and storage.
- No semantic awareness: Because the model only counts surface word forms, it treats synonyms as unrelated words and gives a polysemous word the same representation in every context.
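The sparsity limitation is commonly mitigated by storing only the nonzero counts, for example as a word-to-count mapping rather than a long dense vector. A minimal sketch of this idea:

```python
from collections import Counter

def sparse_bow(document):
    """Represent a document by its nonzero word counts only,
    instead of a dense vector over the whole vocabulary."""
    return dict(Counter(document.split()))

sparse_bow("I like watching movies")
# → {'I': 1, 'like': 1, 'watching': 1, 'movies': 1}
```

With a vocabulary of tens of thousands of words, this keeps only the handful of entries a short document actually contains; sparse matrix libraries apply the same principle at scale.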
Despite these limitations, the Bag of Words model is widely applied in various NLP tasks, such as document classification and sentiment analysis, primarily due to its simplicity and ease of understanding. For tasks requiring richer semantic understanding, more expressive representations are typically used, such as TF-IDF weighting or word embeddings like Word2Vec.
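To make the TF-IDF extension concrete, here is a minimal sketch that reweights raw counts by inverse document frequency, so words appearing in many documents contribute less. This uses one common IDF variant, log(N/df) + 1; libraries differ in the exact smoothing they apply.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build TF-IDF vectors over whitespace tokens (one common IDF variant)."""
    docs = [d.split() for d in documents]
    vocab = sorted({token for d in docs for token in d})
    n_docs = len(docs)
    # document frequency: in how many documents does each word appear?
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    # inverse document frequency, dampened with log and shifted by 1
    idf = {w: math.log(n_docs / df[w]) + 1 for w in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[w] * idf[w] for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["I like watching movies", "I don't like watching TV"])
```

In the result, a word shared by both sentences (such as "I") receives a lower weight than a word unique to one sentence (such as "movies"), which is exactly the discrimination that raw counts lack.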