In machine learning, text generation is a type of language modelling problem. In this article, I will introduce you to a machine learning project on text generation with the Python programming language.
Introduction to Text Generation in Machine Learning
In machine learning, text generation is the central problem of several natural language processing tasks such as speech to text, conversational systems, and text synthesis. A trained text generation model learns the probability of a word occurring given the sequence of words that precedes it in the text.
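To make the idea of "probability of the next word" concrete, here is a minimal sketch that estimates it from raw counts. It is a deliberate simplification: it conditions on only the single previous word (a bigram model), whereas the model built later in this article conditions on the whole preceding sequence. The function name and the toy corpus are illustrative assumptions.

```python
from collections import Counter, defaultdict

def next_word_probabilities(sentences):
    """Estimate P(next word | previous word) from raw bigram counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        # Count every (previous word, next word) pair in the sentence.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Normalise the counts for each previous word into probabilities.
    return {
        prev: {w: c / sum(followers.values()) for w, c in followers.items()}
        for prev, followers in counts.items()
    }

# Toy corpus, made up for illustration only.
toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
probs = next_word_probabilities(toy_corpus)
print(probs["the"])  # "cat" follows "the" in 2 of the 3 sentences
```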
Machine learning models for generating text can be used at the character, sentence, or even paragraph level. In this article, I’ll explain how to build a machine learning model to generate natural language text by implementing and training an advanced recurrent neural network using the Python programming language.
Machine Learning Project on Text Generation with Python
In this section, I will take you through a machine learning project on text generation with the Python programming language. Here I will train a text generation model for the task of generating news headlines.
Let’s start this task by importing all the necessary Python libraries and the dataset:
In this step, I’ll first clean the text by removing punctuation and converting all words to lower case:
[' gop leadership poised to topple obamas pillars', 'fractured world tested the hope of a young president', 'little troublemakers', 'angela merkel russias next target', 'boots for a stranger on a bus', 'molder of navajo youth where a game is sacred', 'the affair season 3 episode 6 noah goes home', 'sprint and mr trumps fictional jobs', 'america becomes a stan', 'fighting diabetes and leading by example']
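The cleanup described above can be sketched as follows. This is a minimal version that assumes only ASCII punctuation needs to be removed; the function name `clean_text` and the sample input strings are illustrative and not taken from the dataset.

```python
import string

def clean_text(text):
    """Remove ASCII punctuation and convert the text to lower case."""
    # string.punctuation covers characters such as . , ' ! ? but not
    # typographic quotes or other non-ASCII symbols.
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return no_punct.lower()

# Illustrative raw headlines (assumed capitalisation and punctuation).
headlines = ["Sprint and Mr. Trump's Fictional Jobs", "Little Troublemakers"]
corpus = [clean_text(h) for h in headlines]
print(corpus)
```

If the raw headlines contain curly quotes or other non-ASCII characters, an extra normalisation step would be needed before this function.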
The next step is to generate sequences of N-gram tokens. A text generation model needs sequences as input because its goal is to predict the next word/token given the sequence of words/tokens that came before it. For this task, we need to tokenize the dataset.
Tokenization is the process of extracting tokens from a corpus. Python’s Keras library has a built-in tokenizer that can be used to obtain the tokens and their indices in the corpus. After this step, each text document in the dataset is converted into a sequence of tokens:
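The sketch below shows the idea without any dependencies. It is a stand-in for the Keras tokenizer, not its actual implementation: words are numbered from 1 in order of decreasing frequency, and index 0 is left free so it can later be used for padding. The helper names and the two-line corpus are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(corpus):
    """Map each word to an integer ID, most frequent first, starting at 1."""
    counts = Counter(word for line in corpus for word in line.split())
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def ngram_sequences(corpus, word_index):
    """Turn every line into its growing prefixes of token IDs."""
    sequences = []
    for line in corpus:
        tokens = [word_index[w] for w in line.split()]
        # Prefixes of length 2, 3, ...: the last ID of each prefix is the
        # word to predict, the IDs before it are the context.
        for end in range(2, len(tokens) + 1):
            sequences.append(tokens[:end])
    return sequences

corpus = ["little troublemakers", "little boots for a stranger"]
word_index = build_word_index(corpus)
sequences = ngram_sequences(corpus, word_index)
total_words = len(word_index) + 1  # +1 because index 0 is reserved

print(word_index)
print(sequences)
```

Each headline of n words contributes n - 1 training sequences, so even a modest corpus yields many examples.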
Padding the Sequences
We now have a dataset that contains sequences of tokens, but different sequences can have different lengths. So, before we start training the text generation model, we need to pad the sequences to make their lengths equal:
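A minimal, dependency-free sketch of this step is shown below. It assumes the padding value 0 is added on the left ("pre" padding), so that the most recent words stay next to the word being predicted, and it then splits each padded row into the context (predictors) and the final token (label). The function names are illustrative.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value to the longest length."""
    max_len = max(len(s) for s in sequences)
    padded = [[pad_value] * (max_len - len(s)) + list(s) for s in sequences]
    return padded, max_len

def split_predictors_label(padded):
    """Use all tokens but the last as input and the last token as target."""
    predictors = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return predictors, labels

sequences = [[1, 2], [1, 3, 4], [1, 3, 4, 5]]
padded, max_sequence_len = pad_pre(sequences)
predictors, labels = split_predictors_label(padded)

print(padded)      # every row now has max_sequence_len entries
print(predictors)  # context tokens
print(labels)      # next-token targets
```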
Using LSTM for Text Generation with Python
Unlike other RNNs, LSTMs have an additional state called the “cell state”, through which the network regulates the flow of information. The advantage of this state is that the model can remember or forget what it has learned more selectively. Now let’s train the LSTM model for the task of generating text with Python:
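To illustrate how the cell state lets an LSTM remember or forget selectively, here is a sketch of a single time step of the standard LSTM formulation, reduced to scalars. It is not the model trained in this article: a real LSTM layer uses learned weight matrices, while the weights below are hand-picked assumptions chosen to make the gates' effect visible.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step with scalar input, hidden state and cell state.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    g = math.tanh(pre("candidate"))  # the new information itself
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state
    return h, c

# Hand-picked weights: forget gate open, input gate closed -> "remember".
remember = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
            "candidate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}
# Forget gate closed, input gate open -> "overwrite".
overwrite = {"forget": (0.0, 0.0, -10.0), "input": (0.0, 0.0, 10.0),
             "candidate": (1.0, 0.0, 0.0), "output": (0.0, 0.0, 0.0)}

h_keep, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, w=remember)
h_new, c_new = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, w=overwrite)
print(c_keep)  # stays close to the previous cell state 0.8
print(c_new)   # close to tanh(1.0), the candidate value
```

With the first set of weights the previous cell state passes through almost unchanged; with the second it is replaced by the new candidate value.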
Now let’s fit the model:
model.fit(predictors, label, epochs=100, verbose=5)
Testing the Text Generation Model
Our machine learning model for the task of generating text with Python is now ready. Next, let’s write a function that predicts the next word based on the input words.
We will first tokenize the seed text, pad the resulting sequence, and pass it to the trained model to get the predicted word. Repeating this step and appending each predicted word to the text gives the predicted sequence:
United States On Paralysis Its A Workout
Preident Trump Fires We Be Mindful
Donald Trump Tweets Blacks Perceive A
India And China 3 Episode 7 Theres
New York Today A Trumpless Tower
Science And Technology Nam A Raid And A
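The prediction loop described above can be sketched as follows. To keep the example self-contained, the trained model is replaced by `predict_next`, a hypothetical callable that receives the most recent words and returns the next one. In the actual project, tokenizing, padding, and the call to the trained LSTM would happen inside that callable. The toy lookup table is made up for illustration.

```python
def generate_text(seed_text, next_words, predict_next, max_sequence_len):
    """Append `next_words` predicted words to `seed_text`, one at a time.

    Assumes max_sequence_len >= 2, so the context holds at least one word.
    """
    words = seed_text.lower().split()
    for _ in range(next_words):
        # Keep only as many recent words as the model's input can hold.
        context = words[-(max_sequence_len - 1):]
        words.append(predict_next(context))
    return " ".join(words).title()

# Stand-in predictor: looks up the last word in a tiny hand-made table.
toy_next = {"new": "york", "york": "today", "today": "a"}

def toy_predict(context):
    return toy_next.get(context[-1], "a")

print(generate_text("new york", 2, toy_predict, max_sequence_len=5))
```

Because each predicted word is fed back in as part of the next context, an early mistake can influence every word that follows it.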
As the generated headlines above show, the model produces output that looks pretty good. I hope you liked this article on a machine learning project on text generation with Python. Feel free to ask your questions in the comments section below.