Stemming in Machine Learning

In Machine Learning, stemming is widely used in tagging, indexing, SEO, web search and information retrieval. For example, a Google search for fish will also return results for fishing, as fish is the root of both words. In this article, I will take you through Stemming in Machine Learning with Python.

What is Stemming?

Stemming is the task of reducing inflected words to their root form. It maps a group of related words to the same stem, even when that stem is not itself a valid word in the language.
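
To make the definition concrete, here is a minimal sketch of a toy suffix-stripping stemmer in plain Python. It is not the Porter or Lancaster algorithm; the suffix list, the `toy_stem` name and the minimum stem length are assumptions chosen only for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter or Lancaster algorithm.
# The suffix list and the minimum stem length are illustrative assumptions.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["fish", "fishing", "fished", "fishes", "troubling", "troubled"]
stems = {word: toy_stem(word) for word in words}
print(stems)
```

The four fish words collapse to fish, while troubling and troubled collapse to troubl, a stem that is not an English word. A real stemmer uses far more careful rules than this sketch.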

Here I will introduce you to the process of stemming words and sentences in Machine Learning using natural language processing. To get started with the task, let’s import the libraries we need to stem words and sentences:

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizer data used by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

# One stemmer object per method
porter = PorterStemmer()
lancaster = LancasterStemmer()

Now, I will create two lists of words and a variable holding a few sentences that I will use to demonstrate stemming words and sentences:

l_words1 = ['cats', 'trouble', 'troubling', 'troubled']
l_words2 = ['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']

print(l_words1)
print(l_words2)

sentence = 'Hi, I am Aman Kharwal. I am a programmer from India, and I am here to guide you with Machine Learning for free. I hope you will learn a lot in your journey towards ML and AI with me.'
sentence
['cats', 'trouble', 'troubling', 'troubled'] 
['dogs', 'programming', 'programs', 'programmed', 'cakes', 'indices', 'matrices']
'Hi, I am Aman Kharwal. I am a programmer from India, and I am here to guide you with Machine Learning for free. I hope you will learn a lot in your journey towards ML and AI with me.'

The output above shows the two lists of words and the “sentence” variable that I will use in the rest of this article to explain stemming. Now I will take you through some popular methods of stemming using the lists and the variable defined above.

Stemming Words with Python

In Machine Learning, we have two popular methods to stem words: the Porter method and the Lancaster method. Now let’s go through both of these methods.

Porter Method

The Porter method strips common suffixes, so it keeps only the leading part of each word and can leave stems that are not English words, such as troubl. Such stems may look odd when you inspect them in further analysis, but the method is simple and effective.

Now let’s go through this method for stemming words by using our defined lists:

for word in l_words1:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
cats            -> cat
trouble         -> troubl
troubling       -> troubl
troubled        -> troubl

for word in l_words2:
    print(f'{word} \t -> {porter.stem(word)}'.expandtabs(15))
dogs            -> dog
programming     -> program
programs        -> program
programmed      -> program
cakes           -> cake
indices         -> indic
matrices        -> matric

Lancaster Method

The Lancaster method is a rule-based stemming method that chooses its rules from the last letter of a word and applies them repeatedly. It is heavier and more aggressive than the Porter method, so its stems are often shorter.

Now, let’s go through this method for stemming words by using our defined lists:

for word in l_words1:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
cats            -> cat
trouble         -> troubl
troubling       -> troubl
troubled        -> troubl

for word in l_words2:
    print(f'{word} \t -> {lancaster.stem(word)}'.expandtabs(15))
dogs            -> dog
programming     -> program
programs        -> program
programmed      -> program
cakes           -> cak
indices         -> ind
matrices        -> mat
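
The idea of looking up rules by a word's last letter and applying them repeatedly can be sketched in a few lines of plain Python. This is only a toy: the `RULES` table, the `last_letter_stem` name and the minimum length are made up for illustration and are not the real Lancaster rule set, which is much larger.

```python
# Toy rule table indexed by the LAST letter of the word. These rules are
# made up for illustration; they are not the real Lancaster rule set.
# Each rule is (suffix_to_remove, replacement, keep_going).
RULES = {
    "s": [("ies", "y", False), ("s", "", True)],
    "g": [("ing", "", True)],
    "d": [("ed", "", True)],
    "e": [("e", "", False)],
}


def last_letter_stem(word, min_len=3):
    """Apply rules repeatedly, looking them up by the word's current last letter."""
    word = word.lower()
    while True:
        for suffix, replacement, keep_going in RULES.get(word[-1:], []):
            candidate = word[: len(word) - len(suffix)] + replacement
            if word.endswith(suffix) and len(candidate) >= min_len:
                word = candidate
                break  # a rule fired; keep_going now belongs to that rule
        else:
            return word  # no rule matched: stemming is finished
        if not keep_going:
            return word


for word in ["cats", "trouble", "troubling", "troubled", "cakes"]:
    print(f"{word:<15} -> {last_letter_stem(word)}")
```

In this sketch cakes first loses its s and then its e, which shows how chaining rules leads to short stems such as cak.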

I hope you now understand how to stem words in Machine Learning using natural language processing. Next, I will move on to stemming sentences.

Stemming Sentences with Python

To stem sentences we use the same methods, but the application is a little different. First, we need to tokenize the sentence so that each word can be passed to the stemmer. Now let’s see how we can do this.

Tokenization

Tokenization is the process of dividing a text into a list of tokens. We can think of tokens as the pieces that make up a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph.
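
As a rough illustration of both levels of tokens, here is a minimal regex-based sketch. The function names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical, and the patterns are a simplification: they are not how NLTK's tokenizers work internally and will mishandle cases such as abbreviations.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


paragraph = "Stemming is simple. It maps words to a root!"
sentences = simple_sent_tokenize(paragraph)
print(sentences)
print(simple_word_tokenize(sentences[0]))
```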

Now, let’s tokenize the sentence that we defined above and get started with the task of stemming a sentence using Python:

tokenized_words = word_tokenize(sentence)
print(tokenized_words)
['Hi', ',', 'I', 'am', 'Aman', 'Kharwal', '.', 'I', 'am', 'a', 'programmer', 'from', 'India', ',', 'and', 'I', 'am', 'here', 'to', 'guide', 'you', 'with', 'Machine', 'Learning', 'for', 'free', '.', 'I', 'hope', 'you', 'will', 'learn', 'a', 'lot', 'in', 'your', 'journey', 'towards', 'ML', 'and', 'AI', 'with', 'me', '.']

As you can see, the sentence is now split into word and punctuation tokens. The next step is to stem each token. I will use both the methods that I introduced above, starting with the Porter method:

# Stem every token with the Porter method, then join the stems back into one string
stemmed_words = []
for word in tokenized_words:
    stemmed_words.append(porter.stem(word))
stemmed_sentence = " ".join(stemmed_words)
stemmed_sentence
Hi , I am aman kharwal . I am a programm from india , and I am here to guid you with machin learn for free . I hope you will learn a lot in your journey toward ML and AI with me .

The Porter method gave us a readable result. Now let’s do the same with the Lancaster method:

# Stem every token with the Lancaster method, then join the stems back into one string
stemmed_words = []
for word in tokenized_words:
    stemmed_words.append(lancaster.stem(word))
stemmed_sentence = " ".join(stemmed_words)
stemmed_sentence
hi , i am am kharw . i am a program from ind , and i am her to guid you with machin learn for fre . i hop you wil learn a lot in yo journey toward ml and ai with me .
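
The two loops above differ only in the stemmer they call, so they can be folded into one small helper that takes the stemming function as an argument. This is a sketch rather than part of NLTK: `stem_tokens` and `demo_stem` are hypothetical names, and `demo_stem` is a deliberately crude stand-in so the example runs without any library.

```python
def stem_tokens(tokens, stem_fn):
    """Apply stem_fn to every token and join the results with spaces."""
    return " ".join(stem_fn(token) for token in tokens)


# Stand-in stemmer so the sketch runs on its own: it only lowercases the
# token and drops a trailing "s". It is an assumption for the demo, not a
# real stemming algorithm.
def demo_stem(token):
    token = token.lower()
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


tokens = ["The", "cats", "chase", "dogs", "."]
result = stem_tokens(tokens, demo_stem)
print(result)
```

With the objects created earlier in this article, `stem_tokens(tokenized_words, porter.stem)` and `stem_tokens(tokenized_words, lancaster.stem)` should do the same work as the two loops.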

So we can clearly see that the two methods give different output. The Porter method is more conservative, while the Lancaster method stems more aggressively (for example, here becomes her and your becomes yo), so the right choice depends on your task. I hope you liked this article on Stemming words and sentences in Machine Learning using Python. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning and Python.

Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.
