Have you ever thought about how the autocorrect feature works in a smartphone keyboard? Almost every smartphone brand, irrespective of price, provides an autocorrect feature in its keyboard today. So let’s understand how the autocorrect feature works. In this article, I will take you through how to build autocorrect with Python.
Autocorrect with Python: How It Works
In the context of machine learning, autocorrect is based on natural language processing. As the name suggests, it is programmed to correct spelling mistakes and other errors while you type. So how does it work?
Before I get into the coding stuff, let’s understand how autocorrect works. Let’s say you type a word on your keyboard. If the word exists in the vocabulary of your smartphone, the keyboard assumes that you have written the right word. It doesn’t matter whether you write a name, a noun or any other word on the planet.
If the word exists in the history of the smartphone, it is treated as a correct word. What if the word doesn’t exist? If the word that you typed is not in the history of your smartphone, the autocorrect is programmed to find the most similar words in that history.
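The decision described above can be sketched in a few lines. This is only an illustration: the vocabulary is made up, the function name check_word is hypothetical, and difflib.get_close_matches from the standard library is used as a stand-in similarity measure, not the Jaccard approach built later in this article.

```python
from difflib import get_close_matches

# Hypothetical vocabulary standing in for the phone's word history.
vocabulary = {"nevertheless", "never", "the", "whale"}

def check_word(word, vocab, n=3):
    """Return the word itself if it is known, else the closest known words."""
    word = word.lower()
    if word in vocab:
        # Known word: assume the user typed what they meant.
        return [word]
    # Unknown word: fall back to the most similar known words.
    # difflib is a stand-in here, chosen only because it needs no install.
    return get_close_matches(word, sorted(vocab), n=n)

print(check_word("whale", vocabulary))
print(check_word("neverteless", vocabulary))
```

The first call finds the word in the vocabulary and returns it unchanged; the second is not a known word, so the closest known words are suggested instead.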
Build an Autocorrect with Python
I hope you now know what autocorrect is and how it works. Now let’s see how we can build an autocorrect feature with Python. Just as a smartphone uses its history to decide whether a typed word is correct or not, our autocorrect also needs a collection of words to work with.
So I will use the text from a book which you can easily download from here. Now let’s get started with the task to build an autocorrect with Python.
For this task, we need some libraries. The libraries that I am going to use are very common among machine learning practitioners, so you probably have all of them installed on your system already, except one. You need to install a library known as textdistance, which can be easily installed with the pip command: pip install textdistance.
Now let’s get started with this task by importing all the necessary packages and by reading our text file:
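A minimal sketch of this step, using only the standard library. The helper names tokenize and read_words are hypothetical, and the short inline sample stands in for the book file so that the sketch runs without any download; to use the real book, call read_words with the path where you saved it.

```python
import re
from pathlib import Path

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def read_words(path):
    """Read a UTF-8 text file and return its words (the path is up to you)."""
    return tokenize(Path(path).read_text(encoding="utf-8"))

# Inline sample (an assumption) so the sketch runs without the book file.
sample = "Moby Dick by Herman Melville 1851 Etymology Supplied by a"
words = tokenize(sample)

# The vocabulary is the set of unique words.
vocabulary = set(words)

print(f"The first ten words in the text are: {words[:10]}")
print(f"There are {len(vocabulary)} unique words in the vocabulary.")
```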
The first ten words in the text are: ['moby', 'dick', 'by', 'herman', 'melville', '1851', 'etymology', 'supplied', 'by', 'a']
There are 17140 unique words in the vocabulary.
In the above step, we made a list of words, and now we need to count how often each word occurs, which can be easily done with the Counter class from Python’s collections module:
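A small sketch of the counting step. The word list here is a made-up sample, not the book; with the full list of words from the book, the same calls produce a table like the one shown next.

```python
from collections import Counter

# Made-up sample; in the article this is the full list of words from the book.
words = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

# Counter maps each distinct word to the number of times it appears.
word_counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs.
print(word_counts.most_common(2))
```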
[('the', 14431), ('of', 6609), ('and', 6430), ('a', 4736), ('to', 4625), ('in', 4172), ('that', 3085), ('his', 2530), ('it', 2522), ('i', 2127)]
Relative Frequency of Words
Now we want the probability of occurrence of each word. This equals the relative frequency of the word, that is, its count divided by the total number of words in the text:
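As a sketch, the relative frequencies follow directly from the counts. The sample list and the name probs are assumptions made for illustration.

```python
from collections import Counter

# Made-up sample; in the article this is the full list of words from the book.
words = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
word_counts = Counter(words)

# Total number of word occurrences in the text.
total = sum(word_counts.values())

# Relative frequency of each word: its count divided by the total.
probs = {word: count / total for word, count in word_counts.items()}

print(probs["the"])
```

By construction the values in probs add up to 1, so they can be read as the probability of drawing each word at random from the text.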
Finding Similar Words
Now we will rank candidate words by their Jaccard distance to the typed word, computed on the 2-grams of the words (q-grams with q = 2, that is, pairs of adjacent letters). Next, we will return the 5 most similar words, ordered by similarity and then by probability:
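Here is a dependency-free sketch of that idea. It computes Jaccard similarity on sets of 2-grams in plain Python as a stand-in for the textdistance library, so the exact scores may differ from what textdistance reports. The function names, the tiny word list and the choice to return tuples are all assumptions made for illustration.

```python
from collections import Counter

def bigrams(word):
    """Return the set of 2-grams (adjacent letter pairs) in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard_similarity(a, b):
    """Shared 2-grams divided by all distinct 2-grams of both words."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        # Words too short to have 2-grams: similar only if identical.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def autocorrect(word, probs, n=5):
    """Return up to n (candidate, similarity, probability) tuples."""
    word = word.lower()
    if word in probs:
        # The word is in the vocabulary, so treat it as correct.
        return [(word, 1.0, probs[word])]
    scored = [(cand, jaccard_similarity(word, cand), p)
              for cand, p in probs.items()]
    # Highest similarity first; probability breaks ties.
    scored.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return scored[:n]

# Made-up sample standing in for the word list of the book.
words = ["never", "nevertheless", "the", "the", "less", "nevertheless"]
counts = Counter(words)
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}

for candidate, similarity, prob in autocorrect("neverteless", probs):
    print(candidate, round(similarity, 3), round(prob, 3))
```

Sorting by similarity first and probability second means that a frequent word only wins when two candidates look equally close to what was typed.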
Now, let’s find the similar words by using our autocorrect function:
Just as we took our words from a book, a smartphone keyboard starts with some words already present in its vocabulary and records more words as the user keeps typing.
I hope you liked this article on how to build an autocorrect feature with Python. Feel free to ask your valuable questions in the comments section below.