Bag Of Words in Machine Learning with Python

One of the simplest yet most effective and commonly used ways to represent text for machine learning is the bag of words representation. In this article, I will take you through the implementation of Bag Of Words in Machine Learning with the Python programming language.

Introduction To Bag Of Words in Machine Learning

Using the Bag Of Words representation, we discard most of the structure of the input text, such as chapters, paragraphs, sentences, and formatting, and only count how often each word appears in each text of the corpus.


Ignoring structure and counting only word occurrences leads to the mental image of the text being represented as a “bag.”

Computing the bag of words representation for a corpus of documents consists of the following three steps (a minimal pure-Python sketch of all three follows the list):

  1. Tokenization: Split each document into the words that appear in it (called tokens), for example by splitting on whitespace and punctuation.
  2. Vocabulary building: Collect a vocabulary of all the words that appear in any of the documents and number them (for example, in alphabetical order).
  3. Encoding: For each document, count how often each of the vocabulary words appears in that document.
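
Before turning to scikit-learn, here is a minimal pure-Python sketch of all three steps (the regular expression mimics scikit-learn's default tokenizer, which only keeps tokens of two or more characters; the variable names are my own):

import re

docs = ["The fool doth think he is wise,",
        "but the wise man knows himself to be a fool"]

# Step 1: tokenization -- lowercase and keep words of two or more characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# Step 2: vocabulary building -- collect all unique words and number them alphabetically
vocabulary = sorted(set(word for doc in tokenized for word in doc))

# Step 3: encoding -- count how often each vocabulary word appears in each document
counts = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
print(counts)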

There are a few subtleties involved in Steps 1 and 2, which I will discuss below. For now, let's see how we can apply this processing using scikit-learn. The image below illustrates the process on a string:

[Image: Process of Bag Of Words]

The output is a vector of word counts for each document. For each word in the vocabulary, we count how often it appears in each document. This means that our numeric representation has one feature for every unique word in the dataset.

Note that the order of the words in the original string has no bearing on the bag of words feature representation.
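
As a quick sanity check (an illustrative snippet, not part of the original example), two sentences containing the same words in a different order produce identical bag of words vectors:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
X = vect.fit_transform(["the fool is wise", "wise is the fool"])
print(X.toarray())
# [[1 1 1 1]
#  [1 1 1 1]]  <- both rows are identical despite the different word order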

Implementing Bag Of Words with Python

The bag of words representation is implemented in CountVectorizer, which is a transformer. Let's first apply it to a toy dataset consisting of two sample sentences to see it in action:

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

Next, we import and instantiate the CountVectorizer and fit it to our data as follows:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

Fitting the CountVectorizer consists of tokenizing the training data and building the vocabulary, which we can access through the vocabulary_ attribute:

print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))Code language: Python (python)
Vocabulary size: 13
Vocabulary content:
 {'the': 9, 'himself': 5, 'wise': 12, 'he': 4, 'doth': 2, 'to': 11, 'knows': 7,
 'man': 8, 'fool': 3, 'is': 6, 'be': 0, 'think': 10, 'but': 1}
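
The dictionary maps each word to its column index, but it prints in no particular order. A convenient way to see the words sorted by index is get_feature_names_out (available since scikit-learn 1.0; older versions provide get_feature_names instead):

# Vocabulary words in column order
print(vect.get_feature_names_out())
# ['be' 'but' 'doth' 'fool' 'he' 'himself' 'is' 'knows' 'man' 'the' 'think' 'to' 'wise']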

The vocabulary consists of 13 words, from "be" to "wise". To create the bag of words representation for the training data, we call the transform method:

bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))
bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>' with 16 stored elements in Compressed Sparse Row format>

The bag of words representation is stored in a SciPy sparse matrix, which only stores the nonzero entries. The matrix has shape 2 × 13, with one row for each of the two data points and one feature for each of the vocabulary words.

A sparse matrix is used because most documents contain only a small subset of the vocabulary words, which means that most entries in the feature array are 0. Think about how many different words can appear in a movie review compared to all the words in the English language (which the vocabulary models).
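
To see that only the nonzero entries are stored, we can print the sparse matrix directly; SciPy lists one (row, column) coordinate and count per stored element (the exact print format may vary across SciPy versions):

print(bag_of_words)
#   (0, 2)    1    <- document 0 contains "doth" (column 2) once
#   (0, 3)    1    <- document 0 contains "fool" (column 3) once
#   ... and so on, 16 stored entries in total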

Storing all of these zeros would be prohibitively wasteful of memory. To look at the actual contents of the sparse matrix, we can convert it to a "dense" NumPy array (which also stores all the 0 entries) using the toarray method:

print("Dense representation of bag_of_words:\n{}".format(
bag_of_words.toarray()))Code language: Python (python)
Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]

Conclusion

We can see that the word counts for each word are either 0 or 1; neither of the two strings in bards_words contains a word twice. Let's see how to read these feature vectors. The first string ("The fool doth think he is wise,") is represented as the first row, and it contains the first vocabulary word, "be", zero times.

It also contains the second vocabulary word, "but", zero times. It contains the third word, "doth", once, and so on. Looking at both rows, we can see that the fourth word, "fool", the tenth word, "the", and the thirteenth word, "wise", appear in both strings.
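
Rather than reading the columns off by hand, we can pair each count with its word; the following illustrative snippet reuses the vect and bag_of_words objects fitted above:

words = vect.get_feature_names_out()
for row in bag_of_words.toarray():
    # keep only the words that actually occur in this document
    nonzero = {str(word): int(count) for word, count in zip(words, row) if count > 0}
    print(nonzero)
# {'doth': 1, 'fool': 1, 'he': 1, 'is': 1, 'the': 1, 'think': 1, 'wise': 1}
# {'be': 1, 'but': 1, 'fool': 1, 'himself': 1, 'knows': 1, 'man': 1, 'the': 1, 'to': 1, 'wise': 1}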

Hope you liked this article on the implementation of Bag Of Words in Machine Learning using the Python programming language. Please feel free to ask your valuable questions in the comments section below.
