Machine Translation is one of the most challenging tasks in Artificial Intelligence: it uses software to translate text or speech from one language to another. In this article, I will take you through Machine Translation using Neural networks.
By the end of this article, you will know how to develop a machine translation model using Neural networks and Python. I will use English sentences as the input and train the Machine Translation model to produce its output in French. Now let’s start by importing all the libraries that we need for this task:
import collections
import helper
import numpy as np
import project_tests as tests
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
I will first create a function to load the data and another function to test our models:
import os
from keras.models import Sequential

def load_data(path):
    """
    Load dataset
    """
    input_file = os.path.join(path)
    with open(input_file, "r") as f:
        data = f.read()
    return data.split('\n')

def _test_model(model, input_shape, output_sequence_length, french_vocab_size):
    if isinstance(model, Sequential):
        model = model.model
    assert model.input_shape == (None, *input_shape[1:]),\
        'Wrong input shape. Found input shape {} using parameter input_shape={}'.format(model.input_shape, input_shape)
    assert model.output_shape == (None, output_sequence_length, french_vocab_size),\
        'Wrong output shape. Found output shape {} using parameters output_sequence_length={} and french_vocab_size={}'\
        .format(model.output_shape, output_sequence_length, french_vocab_size)
    assert len(model.loss_functions) > 0,\
        'No loss function set. Apply the `compile` function to the model.'
    assert sparse_categorical_crossentropy in model.loss_functions,\
        'Not using `sparse_categorical_crossentropy` function for loss.'
Now let’s load the data and take a look at it. The dataset I am using here contains English phrases along with their French translations:
english_sentences = helper.load_data('data/small_vocab_en')
french_sentences = helper.load_data('data/small_vocab_fr')
print('Dataset Loaded')
Dataset Loaded
for sample_i in range(2):
    print('small_vocab_en Line {}: {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}: {}'.format(sample_i + 1, french_sentences[sample_i]))
small_vocab_en Line 1: new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1: new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2: the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2: les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
Since we are translating between languages, the complexity of this problem is largely determined by the complexity of the vocabulary: the more complex the vocabulary, the harder the problem. Let’s look at the data to see how complex a vocabulary we are dealing with:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])
print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')
1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"
Preprocessing the Data
In Machine Learning, whenever we deal with text values we first need to convert them into sequences of integers. Two primary steps are used for this: tokenization and padding. Let’s start with tokenization:
def tokenize(x):
    x_tk = Tokenizer(char_level=False)
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()

for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))
{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]
Now let’s use padding to make all the sequences the same length:
def pad(x, length=None):
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

tests.test_pad(pad)

# Pad the tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))
Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12 2 13 14 15 16 3 17]
  Output: [10 11 12 2 13 14 15 16 3 17]
Sequence 3 in x
  Input:  [18 19 3 20 21]
  Output: [18 19 3 20 21 0 0 0 0 0]
A Preprocessing Pipeline for Machine Translation
Now let’s define a preprocessing function that combines tokenization and padding into a single pipeline for the Machine Translation task, so we can reuse it later:
def preprocess(x, y):
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344
Training a Neural Network for Machine Translation
Now I will train a model using Neural networks. Let’s start by creating a helper function that converts the network’s output logits back into French text:
def logits_to_text(logits, tokenizer):
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')
`logits_to_text` function loaded.
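To see what this helper does, here is a small sanity check of my own (not part of the original pipeline): I build a toy tokenizer and fake one-hot logits, then let `logits_to_text` pick the most likely word at each time step:

# Sanity check for logits_to_text (toy example, not part of the pipeline)
toy_tokenizer = Tokenizer(char_level=False)
toy_tokenizer.fit_on_texts(['new jersey is snowy'])
vocab_size = len(toy_tokenizer.word_index) + 1  # +1 for the padding index 0

# Fake "logits" for 3 time steps: one-hot rows for word ids 1, 2 and the padding id 0
fake_logits = np.eye(vocab_size)[[1, 2, 0]]
print(logits_to_text(fake_logits, toy_tokenizer))
# prints something like: new jersey <PAD>

The argmax over the vocabulary axis simply selects the highest-scoring word index per time step, which is exactly how the model's predictions are decoded below.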
Now I will train a simple RNN model, which will act as a good baseline for translating English sequences into French:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences=True)(input_seq)
    logits = TimeDistributed(Dense(french_vocab_size))(rnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

tests.test_simple_model(simple_model)

tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 9s 82us/step - loss: 3.5162 - acc: 0.4027 - val_loss: nan - val_acc: 0.4516
Epoch 2/10
110288/110288 [==============================] - 7s 64us/step - loss: 2.4823 - acc: 0.4655 - val_loss: nan - val_acc: 0.4838
Epoch 3/10
110288/110288 [==============================] - 7s 63us/step - loss: 2.2427 - acc: 0.5016 - val_loss: nan - val_acc: 0.5082
Epoch 4/10
110288/110288 [==============================] - 7s 64us/step - loss: 2.0188 - acc: 0.5230 - val_loss: nan - val_acc: 0.5428
Epoch 5/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.8418 - acc: 0.5542 - val_loss: nan - val_acc: 0.5685
Epoch 6/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.7258 - acc: 0.5731 - val_loss: nan - val_acc: 0.5811
Epoch 7/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.6478 - acc: 0.5871 - val_loss: nan - val_acc: 0.5890
Epoch 8/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.5850 - acc: 0.5940 - val_loss: nan - val_acc: 0.5977
Epoch 9/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.5320 - acc: 0.5996 - val_loss: nan - val_acc: 0.6027
Epoch 10/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.4874 - acc: 0.6037 - val_loss: nan - val_acc: 0.6039

new jersey est parfois parfois en en et il est est en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
The simple RNN model gave us an accuracy of only about 60 per cent, so let’s use a more complex neural network to get better accuracy. I will now train an RNN with an embedding layer. An embedding represents each word as a vector in an n-dimensional space, where similar words end up close to each other; n here is the size of the embedding vectors:
from keras.models import Sequential

def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    rnn = GRU(64, return_sequences=True, activation="tanh")
    embedding = Embedding(french_vocab_size, 64, input_length=input_shape[1])
    logits = TimeDistributed(Dense(french_vocab_size, activation="softmax"))

    model = Sequential()
    # Embedding can only be used as the first layer --> Keras documentation
    model.add(embedding)
    model.add(rnn)
    model.add(logits)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

tests.test_embed_model(embed_model)

tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

embedded_model = embed_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
embedded_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

print(logits_to_text(embedded_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 8s 68us/step - loss: 3.7877 - acc: 0.4018 - val_loss: nan - val_acc: 0.4093
Epoch 2/10
110288/110288 [==============================] - 7s 65us/step - loss: 2.7258 - acc: 0.4382 - val_loss: nan - val_acc: 0.5152
Epoch 3/10
110288/110288 [==============================] - 7s 65us/step - loss: 2.0359 - acc: 0.5453 - val_loss: nan - val_acc: 0.6068
Epoch 4/10
110288/110288 [==============================] - 7s 65us/step - loss: 1.4586 - acc: 0.6558 - val_loss: nan - val_acc: 0.6967
Epoch 5/10
110288/110288 [==============================] - 7s 65us/step - loss: 1.1346 - acc: 0.7308 - val_loss: nan - val_acc: 0.7561
Epoch 6/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.9358 - acc: 0.7681 - val_loss: nan - val_acc: 0.7825
Epoch 7/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.8057 - acc: 0.7917 - val_loss: nan - val_acc: 0.7993
Epoch 8/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.7132 - acc: 0.8095 - val_loss: nan - val_acc: 0.8173
Epoch 9/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.6453 - acc: 0.8229 - val_loss: nan - val_acc: 0.8313
Epoch 10/10
110288/110288 [==============================] - 7s 64us/step - loss: 0.5893 - acc: 0.8355 - val_loss: nan - val_acc: 0.8401

new jersey est parfois calme au l'automne et il il est neigeux en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
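To see the "similar words are close together" idea in practice, here is a small sketch of my own (not part of the original tutorial) that pulls the learned 64-dimensional vectors out of the trained embedding layer and ranks the English words closest to a given one by cosine similarity. It assumes the Embedding layer is the first layer of `embedded_model` and that its rows are indexed by the English tokenizer's word indices:

# Sketch: inspect the learned embedding (assumes the layer layout described above)
embedding_weights = embedded_model.layers[0].get_weights()[0]  # shape: (vocab_size, 64)

def closest_words(word, tokenizer, weights, top_n=5):
    # Cosine similarity between `word` and every other row of the embedding matrix
    idx = tokenizer.word_index[word]
    vec = weights[idx]
    norms = np.linalg.norm(weights, axis=1) * np.linalg.norm(vec) + 1e-10
    sims = weights @ vec / norms
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    ranked = np.argsort(-sims)
    return [index_to_word[i] for i in ranked if i in index_to_word and i != idx][:top_n]

print(closest_words('autumn', english_tokenizer, embedding_weights))

If the embedding has learned anything useful, the neighbours of "autumn" should mostly be other seasons or month names from this small vocabulary.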
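Finally, as a closing illustration (again a sketch of my own, not taken from the training code above), this is roughly how you could translate a new English sentence with the trained embedding model. Note that any word outside the small training vocabulary is simply dropped by the tokenizer, since no out-of-vocabulary token was configured:

def translate(sentence, model, eng_tokenizer, fr_tokenizer, length):
    # Tokenize and pad the new sentence the same way the training data was prepared
    seq = eng_tokenizer.texts_to_sequences([sentence.lower()])
    seq = pad_sequences(seq, maxlen=length, padding='post')
    # Predict one French word distribution per time step and decode with argmax
    prediction = model.predict(seq[:1])[0]
    return logits_to_text(prediction, fr_tokenizer)

print(translate('new jersey is sometimes quiet during autumn',
                embedded_model, english_tokenizer, french_tokenizer,
                max_french_sequence_length))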
So our RNN model with embedding reached a very good accuracy of about 84 per cent. I hope you liked this article on Machine Translation using Neural networks and Python. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to read more amazing articles.