Next Word Prediction Model using Python

Next Word Prediction means predicting the most likely word or phrase that will come next in a sentence or text. It is like the built-in feature on many applications that suggests the next word as you type or speak. Next Word Prediction models are used in applications like messaging apps, search engines, virtual assistants, and autocorrect features on smartphones. So, if you want to learn how to build a Next Word Prediction model, this article is for you. In this article, I’ll take you through building a Next Word Prediction Model with Deep Learning using Python.

What is the Next Word Prediction Model & How to Build it?

Next word prediction is a language modelling task in Machine Learning that aims to predict the most probable word or sequence of words that follows a given input context. This task utilizes statistical patterns and linguistic structures to generate accurate predictions based on the context provided.

An example of next word prediction in action: the suggestion bar on an Apple iPhone’s keyboard.

The Next Word Prediction models have a range of applications across various industries. For example, when you start typing a message on your phone, it suggests the next word to speed up your typing. Similarly, search engines predict and show search suggestions as you type in the search bar. Next word prediction helps us communicate faster and more accurately by anticipating what we might say or search for.

To build a Next Word Prediction model:

  1. start by collecting a diverse dataset of text documents, 
  2. preprocess the data by cleaning and tokenizing it, 
  3. prepare the data by creating input-output pairs, 
  4. engineer features such as word embeddings, 
  5. select an appropriate model like an LSTM or GPT, 
  6. train the model on the dataset while adjusting hyperparameters,
  7. improve the model by experimenting with different techniques and architectures.

This iterative process allows businesses to develop accurate and efficient Next Word Prediction models that can be applied in various applications.

So the process of building a Next Word Prediction model starts by collecting textual data that can serve as the vocabulary for our model. For example, the way you type on your smartphone’s keyboard becomes the vocabulary of the keyboard’s next word prediction model. In the same way, we need textual data for our model. I found an ideal dataset for this task based on the text of a Sherlock Holmes book. You can download the dataset from here.

Next Word Prediction Model using Python

I hope you now know what a Next Word Prediction model is. In this section, I’ll take you through how to build a Next Word Prediction model using Python and Deep Learning. So, let’s start this task by importing the necessary Python libraries and the dataset:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Read the text file
with open('sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as file:
    text = file.read()

Now let’s tokenize the text to create a sequence of words:

tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

In the above code, the text is tokenized, which means it is divided into individual words or tokens. A ‘Tokenizer’ object is created to handle the tokenization process. The ‘fit_on_texts’ method of the tokenizer is called with the ‘text’ as input. This method analyzes the text and builds a vocabulary of unique words, assigning each word a numerical index starting from 1. The ‘total_words’ variable is then assigned the length of the word index plus one, so the vocabulary size also accounts for the reserved index 0 used for padding; this value will later determine the size of the embedding and output layers.
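To make this concrete, here is a small, purely illustrative sketch (the tiny corpus and the ‘demo’ variable are made up for this example and are not part of the article’s script) showing how the tokenizer assigns indices:

demo = Tokenizer()
demo.fit_on_texts(["the dog barked at the cat"])
print(demo.word_index)           # {'the': 1, 'dog': 2, 'barked': 3, 'at': 4, 'cat': 5}
print(len(demo.word_index) + 1)  # 6: vocabulary size including the reserved index 0

More frequent words get smaller indices, and the extra “+1” leaves room for index 0, which Keras uses for padding.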

Now let’s create input-output pairs by splitting the text into sequences of tokens and forming n-grams from the sequences:

input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In the above code, the text data is split into lines using the ‘\n’ character as a delimiter. For each line in the text, the ‘texts_to_sequences’ method of the tokenizer is used to convert the line into a sequence of numerical tokens based on the previously created vocabulary. The resulting token list is then iterated over with a for loop. On each iteration, a subsequence, or n-gram, of tokens is extracted, ranging from the beginning of the token list up to and including the current index ‘i’.

This n-gram sequence represents the input context, with the last token being the target or predicted word. This n-gram sequence is then appended to the ‘input_sequences’ list. This process is repeated for all lines in the text, generating multiple input-output sequences that will be used for training the next word prediction model.
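As a quick illustration (the token ids below are made up for this sketch), a line that tokenizes to four ids yields three n-gram sequences, each pairing a context with the token that follows it:

token_list = [12, 7, 45, 3]  # made-up token ids, for illustration only
for i in range(1, len(token_list)):
    print(token_list[:i+1])
# [12, 7]         -> context [12],         target 7
# [12, 7, 45]     -> context [12, 7],      target 45
# [12, 7, 45, 3]  -> context [12, 7, 45],  target 3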

Now let’s pad the input sequences to have equal length:

max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

In the above code, the input sequences are padded to ensure all sequences have the same length. The variable ‘max_sequence_len’ is assigned the maximum length among all the input sequences. The ‘pad_sequences’ function is used to pad or truncate the input sequences to match this maximum length.

The ‘pad_sequences’ function takes the ‘input_sequences’ list, sets the maximum length to ‘max_sequence_len’, and specifies that the padding should be added at the beginning of each sequence using the padding='pre' argument. Finally, the input sequences are converted into a numpy array to facilitate further processing.
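For instance (again with made-up token ids), pre-padding fills shorter sequences with zeros on the left so every row has the same length:

print(pad_sequences([[12, 7], [12, 7, 45, 3]], maxlen=4, padding='pre'))
# [[ 0  0 12  7]
#  [12  7 45  3]]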

Now let’s split the sequences into input and output:

X = input_sequences[:, :-1]
y = input_sequences[:, -1]

In the above code, the input sequences are split into two arrays, ‘X’ and ‘y’, to create the input and output for training the next word prediction model. The ‘X’ array is assigned the values of all rows in the ‘input_sequences’ array except for the last column. It means that ‘X’ contains all the tokens in each sequence except for the last one, representing the input context.

On the other hand, the ‘y’ array is assigned the values of the last column in the ‘input_sequences’ array, which represents the target or predicted word.
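To see the split on a single padded row (made-up values), the last column becomes the target and everything before it becomes the context:

row = np.array([[0, 0, 12, 7, 45]])
print(row[:, :-1])  # [[ 0  0 12  7]]  -> input context X
print(row[:, -1])   # [45]             -> target y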

Now let’s convert the output into one-hot encoded vectors:

y = np.array(tf.keras.utils.to_categorical(y, num_classes=total_words))

In the above code, we are converting the output array into a suitable format for training the model: each target word is represented as a binary (one-hot) vector whose length equals the vocabulary size, with a 1 at the index of the target word and 0s everywhere else.
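As a small illustration, with a vocabulary of 6 words a target index of 3 becomes a one-hot vector of length 6:

print(tf.keras.utils.to_categorical([3], num_classes=6))
# [[0. 0. 0. 1. 0. 0.]]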

Now let’s build a neural network architecture to train the model:

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))  # map token ids to 100-dimensional vectors
model.add(LSTM(150))  # recurrent layer with 150 units
model.add(Dense(total_words, activation='softmax'))  # probability distribution over the vocabulary
print(model.summary())
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_1 (Embedding)     (None, 17, 100)           820000    
                                                                 
 lstm_1 (LSTM)               (None, 150)               150600    
                                                                 
 dense_1 (Dense)             (None, 8200)              1238200   
                                                                 
=================================================================
Total params: 2,208,800
Trainable params: 2,208,800
Non-trainable params: 0
_________________________________________________________________
None

The code above defines the model architecture for the next word prediction model. The ‘Sequential’ model is created, which represents a linear stack of layers. The first layer added to the model is the ‘Embedding’ layer, which is responsible for converting the input sequences into dense vectors of fixed size. It takes three arguments:

  1. ‘total_words’, which represents the total number of distinct words in the vocabulary; 
  2. ‘100’, which denotes the dimensionality of the word embeddings; 
  3. and ‘input_length’, which specifies the length of the input sequences.

The next layer added is the ‘LSTM’ layer, a type of recurrent neural network (RNN) layer designed for capturing sequential dependencies in the data. It has 150 units, which is the dimensionality of its hidden state and memory cells.

Finally, the ‘Dense’ layer is added, which is a fully connected layer that produces the output predictions. It has ‘total_words’ units and uses the ‘softmax’ activation function to convert the predicted scores into probabilities, indicating the likelihood of each word being the next one in the sequence.

Now let’s compile and train the model:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=100, verbose=1)
Epoch 1/100
3010/3010 [==============================] - 62s 20ms/step - loss: 6.2408 - accuracy: 0.0756
Epoch 2/100
3010/3010 [==============================] - 61s 20ms/step - loss: 5.5185 - accuracy: 0.1238
Epoch 3/100
3010/3010 [==============================] - 60s 20ms/step - loss: 5.1323 - accuracy: 0.1472
Epoch 4/100
3010/3010 [==============================] - 60s 20ms/step - loss: 4.8025 - accuracy: 0.1643
Epoch 5/100
3010/3010 [==============================] - 60s 20ms/step - loss: 4.4973 - accuracy: 0.1834
Epoch 6/100
3010/3010 [==============================] - 61s 20ms/step - loss: 4.2105 - accuracy: 0.2027
Epoch 7/100
3010/3010 [==============================] - 60s 20ms/step - loss: 3.9381 - accuracy: 0.2285
Epoch 8/100
3010/3010 [==============================] - 61s 20ms/step - loss: 3.6836 - accuracy: 0.2582
Epoch 9/100
3010/3010 [==============================] - 60s 20ms/step - loss: 3.4395 - accuracy: 0.2916
Epoch 10/100
3010/3010 [==============================] - 60s 20ms/step - loss: 3.2134 - accuracy: 0.3253
Epoch 11/100
3010/3010 [==============================] - 60s 20ms/step - loss: 3.0053 - accuracy: 0.3600
Epoch 12/100
3010/3010 [==============================] - 60s 20ms/step - loss: 2.8086 - accuracy: 0.3956
Epoch 13/100
3010/3010 [==============================] - 60s 20ms/step - loss: 2.6304 - accuracy: 0.4284
...
Epoch 99/100
3010/3010 [==============================] - 74s 24ms/step - loss: 0.5163 - accuracy: 0.8645
Epoch 100/100
3010/3010 [==============================] - 70s 23ms/step - loss: 0.5154 - accuracy: 0.8652

In the above code, the model is being compiled and trained. The ‘compile’ method configures the model for training. The ‘loss’ parameter is set to ‘categorical_crossentropy’, a commonly used loss function for multi-class classification problems. The ‘optimizer’ parameter is set to ‘adam’, an optimization algorithm that adapts the learning rate during training.

The ‘metrics’ parameter is set to ‘accuracy’ to monitor the accuracy during training. After compiling the model, the ‘fit’ method is called to train the model on the input sequences ‘X’ and the corresponding output ‘y’. The ‘epochs’ parameter specifies the number of times the training process will iterate over the entire dataset. The ‘verbose’ parameter is set to ‘1’ to display the training process.

The above code can take more than an hour to execute, depending on your hardware. Once the code is executed, here’s how we can generate the next word predictions using our model:

seed_text = "I will leave if they"
next_words = 3

for _ in range(next_words):
    # Convert the current seed text into a sequence of token ids
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # Pad it to the same length as the training contexts
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Index of the most probable next word according to the model
    predicted = np.argmax(model.predict(token_list), axis=-1)
    output_word = ""
    # Map the predicted index back to its word in the vocabulary
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word

print(seed_text)
Output: I will leave if they have already married

The above code generates the next word predictions based on a given seed text. The ‘seed_text’ variable holds the initial text. The ‘next_words’ variable determines the number of predictions to be generated. Inside the for loop, the ‘seed_text’ is converted into a sequence of tokens using the tokenizer. The token sequence is padded to match the maximum sequence length.

The model predicts the next word by calling the ‘predict’ method on the model with the padded token sequence. The predicted word is obtained by finding the word with the highest probability score using ‘np.argmax’. Then, the predicted word is appended to the ‘seed_text’, and the process is repeated for the desired number of ‘next_words’. Finally, the ‘seed_text’ is printed, which contains the initial text followed by the generated predictions.
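If you want to reuse the trained model later without repeating the hour-long training, one option (an optional addition, not part of the original walkthrough; the file names below are just placeholders) is to save the model and the tokenizer to disk:

import pickle

model.save('next_word_model.keras')   # assumed file name; '.h5' also works with older Keras versions
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)          # assumed file name

Later, you can load them back with tf.keras.models.load_model('next_word_model.keras') and pickle.load before running the prediction loop again.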

So this is how you can build a Next Word Prediction model using Deep Learning and Python programming language.

Summary

Next word prediction is a language modelling task in Machine Learning that aims to predict the most probable word or sequence of words that follows a given input context. This task utilizes statistical patterns and linguistic structures to generate accurate predictions based on the context provided. I hope you liked this article on building a Next Word Prediction model using Python. Feel free to ask valuable questions in the comments section below.
