I recently shared an article on Bag of Words. Stopwords in machine learning are another way to get rid of uninformative words: words that are too frequent to be informative are simply rejected. In this article, I will introduce you to the concept of stopwords in machine learning.
Stopwords in Machine Learning
Stop words are commonly used words that are excluded from searches to help index and crawl web pages faster. Some examples of stop words are "a", "and", "but", "how", "or", and "what".
Certain extremely common words that seem to have little value in helping to select documents matching a user's need are sometimes excluded from the vocabulary entirely. These words are called stopwords.
The general strategy for determining a list of stopwords is to sort the terms by collection frequency (the total number of times each term appears in the document collection), take the most frequent terms, and filter them, often by hand, for their semantic content relative to the domain of the documents being indexed; the result is the stop list.
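To make that strategy concrete, here is a minimal sketch; the toy corpus docs and the cutoff of five terms are hypothetical choices for illustration:

from collections import Counter

# Hypothetical toy corpus; in practice this would be the collection to index.
docs = [
    "the movie was good",
    "the movie was bad and the plot was thin",
    "a good plot and good acting",
]

# Collection frequency: total number of occurrences of each term.
counts = Counter(token for doc in docs for token in doc.split())

# The most frequent terms become stopword candidates (cutoff chosen arbitrarily);
# in practice they would then be filtered by hand for semantic content.
stopword_candidates = [term for term, _ in counts.most_common(5)]
print(stopword_candidates)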
There are two main approaches: using a language-specific stop word list, or removing words that appear too frequently. scikit-learn has a built-in list of English stopwords in the feature_extraction.text module:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))
Number of stop words: 318
Every 10th stopword:
['above', 'elsewhere', 'into', 'well', 'rather', 'fifteen', 'had', 'enough', 'herein', 'should', 'third', 'although', 'more', 'this', 'none', 'seemed', 'nobody', 'seems', 'he', 'also', 'fill', 'anyone', 'anything', 'me', 'the', 'yet', 'go', 'seeming', 'front', 'beforehand', 'forty', 'i']
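ENGLISH_STOP_WORDS is a plain Python frozenset, so it can be extended like any set before being handed to a vectorizer. A quick sketch, where the extra words are hypothetical domain-specific additions:

# Augment the built-in list with domain-specific words (hypothetical choices);
# CountVectorizer accepts any custom list of strings as stop_words.
custom_stop_words = list(ENGLISH_STOP_WORDS.union(["movie", "film"]))
print("Custom list size: {}".format(len(custom_stop_words)))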
Removing the stopwords in the list can only decrease the number of features by the length of the list (here, 318), but it might lead to improved performance. Let's give it a try:
from sklearn.feature_extraction.text import CountVectorizer

# Specifying stop_words="english" uses the built-in list.
# We could also augment it and pass our own.
vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))
X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
	with 2149958 stored elements in Compressed Sparse Row format>
There are now 305 (27,271 - 26,966) fewer features in the dataset, which means that most, but not all, of the stopwords appeared in the training data. Let's run the grid search again:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# param_grid and y_train are the parameter grid and labels from the previous article's search.
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
Best cross-validation score: 0.88
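If you are curious which parameter setting achieved this score, you can inspect the result of the search; this assumes, as above, that param_grid comes from the earlier article:

# The parameter combination that achieved the best cross-validation score.
print("Best parameters: {}".format(grid.best_params_))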
Using the stopwords decreased the grid search performance slightly; not enough to worry about, but given that excluding 305 features out of more than 27,000 is unlikely to change performance or interpretability much, using this list does not seem worthwhile.
Fixed lists are most helpful for small datasets, which might not contain enough information for the model to determine which words are stopwords from the data itself. As an exercise, you can try the other approach, discarding frequently appearing words, by setting the max_df option of CountVectorizer and observing how it influences the number of features and the performance.
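As a starting point for that exercise, here is a minimal sketch; the max_df value of 0.15 is an arbitrary choice, meaning terms that appear in more than 15% of the documents are discarded:

# Discard terms appearing in more than 15% of documents (cutoff is arbitrary).
vect = CountVectorizer(min_df=5, max_df=0.15).fit(text_train)
X_train_maxdf = vect.transform(text_train)
print("X_train with max_df:\n{}".format(repr(X_train_maxdf)))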
I hope you liked this article on the concept of stopwords in machine learning. Feel free to ask your valuable questions in the comments section below.