Machine Translation Model

Machine Translation is one of the most challenging tasks in Artificial Intelligence: it investigates the use of software to translate text or speech from one language to another. In this article, I will take you through Machine Translation using Neural Networks.

By the end of this article, you will know how to develop a machine translation model using Neural Networks and Python. I will use English as the input language and train the Machine Translation model to produce output in French. Now let’s start by importing all the libraries that we need for this task:

import collections
import helper
import numpy as np
import project_tests as tests
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

I will first create a function to load the data and another function to test our models:

import os
from keras.models import Sequential

def load_data(path):
    """ Load dataset """
    input_file = os.path.join(path)
    with open(input_file, "r") as f:
        data = f.read()
    return data.split('\n')

def _test_model(model, input_shape, output_sequence_length, french_vocab_size):
    if isinstance(model, Sequential):
        model = model.model
    assert model.input_shape == (None, *input_shape[1:]),\
        'Wrong input shape. Found input shape {} using parameter input_shape={}'.format(model.input_shape, input_shape)
    assert model.output_shape == (None, output_sequence_length, french_vocab_size),\
        'Wrong output shape. Found output shape {} using parameters output_sequence_length={} and french_vocab_size={}'\
        .format(model.output_shape, output_sequence_length, french_vocab_size)
    assert len(model.loss_functions) > 0,\
        'No loss function set. Apply the `compile` function to the model.'
    assert sparse_categorical_crossentropy in model.loss_functions,\
        'Not using `sparse_categorical_crossentropy` function for loss.'

Now let’s load the data and have a look at some insights. The dataset I am using here contains English phrases paired with their French translations:

english_sentences = helper.load_data('data/small_vocab_en')
french_sentences = helper.load_data('data/small_vocab_fr')
print('Dataset Loaded')

Dataset Loaded

for sample_i in range(2):
    print('small_vocab_en Line {}: {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}: {}'.format(sample_i + 1, french_sentences[sample_i]))
small_vocab_en Line 1: new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1: new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2: the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2: les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .

Since we are translating between languages, the difficulty of this problem is largely determined by the complexity of the vocabulary: the larger and more varied the vocabulary, the harder the problem becomes. Let’s look at the data to see what kind of vocabulary we are dealing with:

english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')
1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"

Preprocessing the Data

In Machine Learning, whenever we deal with text values, we first need to convert them into sequences of integers. The two primary methods for this are tokenization and padding. Let’s start with tokenization:

def tokenize(x):
    x_tk = Tokenizer(char_level=False)
    x_tk.fit_on_texts(x)
    return x_tk.texts_to_sequences(x), x_tk

text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)

print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))
{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]

Now let’s use padding to make all the sequences the same length:

def pad(x, length=None):
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post')

tests.test_pad(pad)

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))
Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]
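The pad function also accepts an explicit target length, which we will rely on later when padding the English sequences to the French maximum length so that input and output time steps line up. A quick sketch of my own (not from the original code), reusing the text_tokenized sequences from above:

# My own illustration: pad to a fixed length of 12 instead of the longest sequence in the batch
test_pad_12 = pad(text_tokenized, 12)
print(test_pad_12.shape)  # (3, 12): every sequence is now exactly 12 tokens long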

A Preprocessing Pipeline for Machine Translation

Now let’s define a preprocessing function that creates a pipeline for the Machine Translation task, so that we can reuse it later:

def preprocess(x, y):
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)
    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)
    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)
    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344
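If you want to double-check what the pipeline produced, you could inspect the array shapes. This is a small sanity check of my own, not part of the original code:

# My own sanity check of the preprocessed arrays
print(preproc_english_sentences.shape)  # (number of sentences, 15): padded English token ids
print(preproc_french_sentences.shape)   # (number of sentences, 21, 1): French ids with the extra axis that sparse_categorical_crossentropy expects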

Training a Neural Network for Machine Translation

Now I will train a model using Neural Networks. Let’s start by creating a helper function that converts the model’s output logits back into text:

def logits_to_text(logits, tokenizer):
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')
`logits_to_text` function loaded.

Now I will train a simple RNN model, which will act as a good baseline for translating English sequences into French:

def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    input_seq = Input(input_shape[1:])
    rnn = GRU(64, return_sequences=True)(input_seq)
    logits = TimeDistributed(Dense(french_vocab_size))(rnn)
    model = Model(input_seq, Activation('softmax')(logits))
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

tests.test_simple_model(simple_model)

tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 9s 82us/step - loss: 3.5162 - acc: 0.4027 - val_loss: nan - val_acc: 0.4516
Epoch 2/10
110288/110288 [==============================] - 7s 64us/step - loss: 2.4823 - acc: 0.4655 - val_loss: nan - val_acc: 0.4838
Epoch 3/10
110288/110288 [==============================] - 7s 63us/step - loss: 2.2427 - acc: 0.5016 - val_loss: nan - val_acc: 0.5082
Epoch 4/10
110288/110288 [==============================] - 7s 64us/step - loss: 2.0188 - acc: 0.5230 - val_loss: nan - val_acc: 0.5428
Epoch 5/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.8418 - acc: 0.5542 - val_loss: nan - val_acc: 0.5685
Epoch 6/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.7258 - acc: 0.5731 - val_loss: nan - val_acc: 0.5811
Epoch 7/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.6478 - acc: 0.5871 - val_loss: nan - val_acc: 0.5890
Epoch 8/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.5850 - acc: 0.5940 - val_loss: nan - val_acc: 0.5977
Epoch 9/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.5320 - acc: 0.5996 - val_loss: nan - val_acc: 0.6027
Epoch 10/10
110288/110288 [==============================] - 7s 64us/step - loss: 1.4874 - acc: 0.6037 - val_loss: nan - val_acc: 0.6039
new jersey est parfois parfois en en et il est est en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

The simple RNN model gave us an accuracy of only about 60 per cent, so let’s use a more complex neural network to train our model with better accuracy. I will now train the model using an RNN with an embedding layer. An embedding represents each word as a vector in n-dimensional space, where similar words end up with vectors that are close to each other; n here is the size of the embedding vectors:

from keras.models import Sequential

def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    learning_rate = 1e-3
    rnn = GRU(64, return_sequences=True, activation="tanh")
    embedding = Embedding(french_vocab_size, 64, input_length=input_shape[1])
    logits = TimeDistributed(Dense(french_vocab_size, activation="softmax"))

    model = Sequential()
    # Embedding can only be used as the first layer --> Keras Documentation
    model.add(embedding)
    model.add(rnn)
    model.add(logits)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

tests.test_embed_model(embed_model)

tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

embeded_model = embed_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
embeded_model.fit(tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2)

print(logits_to_text(embeded_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 8s 68us/step - loss: 3.7877 - acc: 0.4018 - val_loss: nan - val_acc: 0.4093
Epoch 2/10
110288/110288 [==============================] - 7s 65us/step - loss: 2.7258 - acc: 0.4382 - val_loss: nan - val_acc: 0.5152
Epoch 3/10
110288/110288 [==============================] - 7s 65us/step - loss: 2.0359 - acc: 0.5453 - val_loss: nan - val_acc: 0.6068
Epoch 4/10
110288/110288 [==============================] - 7s 65us/step - loss: 1.4586 - acc: 0.6558 - val_loss: nan - val_acc: 0.6967
Epoch 5/10
110288/110288 [==============================] - 7s 65us/step - loss: 1.1346 - acc: 0.7308 - val_loss: nan - val_acc: 0.7561
Epoch 6/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.9358 - acc: 0.7681 - val_loss: nan - val_acc: 0.7825
Epoch 7/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.8057 - acc: 0.7917 - val_loss: nan - val_acc: 0.7993
Epoch 8/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.7132 - acc: 0.8095 - val_loss: nan - val_acc: 0.8173
Epoch 9/10
110288/110288 [==============================] - 7s 65us/step - loss: 0.6453 - acc: 0.8229 - val_loss: nan - val_acc: 0.8313
Epoch 10/10
110288/110288 [==============================] - 7s 64us/step - loss: 0.5893 - acc: 0.8355 - val_loss: nan - val_acc: 0.8401
new jersey est parfois calme au l'automne et il il est neigeux en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
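If you want to try the trained embedding model on a sentence of your own, you can run it through the same tokenizer and padding used during training. Here is a minimal sketch of my own (the translate helper is hypothetical, not part of the original code), and it assumes the sentence only uses words from the small training vocabulary, since the Tokenizer silently drops any word it has not seen:

# Hypothetical helper (my addition): translate one English sentence with the trained model
def translate(sentence, model, x_tk, y_tk, length):
    # Convert the sentence to token ids, pad to the length the model was trained on,
    # then decode the predicted logits back into French words.
    seq = pad(x_tk.texts_to_sequences([sentence]), length)
    return logits_to_text(model.predict(seq)[0], y_tk)

print(translate('new jersey is sometimes quiet during autumn',
                embeded_model, english_tokenizer, french_tokenizer,
                max_french_sequence_length))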

Also, Read: Audio Feature Extraction in Machine Learning.

So our RNN model with embedding achieved a much better accuracy of about 84 per cent. I hope you liked this article on Machine Translation using Neural Networks and Python. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to read more amazing articles.

Aman Kharwal

I am a programmer from India, and I am here to guide you with Data Science, Machine Learning, Python, and C++ for free. I hope you will learn a lot in your journey towards Coding, Machine Learning and Artificial Intelligence with me.
