I recently shared an article on Audio Processing with Python, which showed how to process and manipulate an audio file with Python. If you have gone through it, you already know the basics of handling audio data in code. In this article, I will take you through speech classification with Machine Learning. Speech classification is one of the less explored areas of Artificial Intelligence. Here I will use the Speech MNIST dataset, which is a set of recorded spoken digits.
The dataset consists of over 1500 recordings from 3 speakers. Our task is to classify each recording, that is, to predict which digit is spoken in it. Now let’s load the dataset and get started. You can download the dataset from here.
Just like in other Machine Learning tasks, where we classify text or images, we start by exploring the data. Let’s have a look at what we are working with and what the dataset looks like:
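```python
import librosa

# DATA_DIR (the dataset folder) and random_file (one file name in it) are assumed to be defined already
wav, sr = librosa.load(DATA_DIR + random_file)
print('sr:', sr)
print('wav shape:', wav.shape)
```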
wav shape: (9609,)
The call returned two values: the first is the waveform as a NumPy array, and the second is the sampling rate of the recording. Sound is a continuous signal, so to store it digitally it is sampled at regular intervals, and those samples are what we use for speech classification. Here the sampling rate is 22050 samples per second, which is the rate librosa resamples to by default, and the waveform contains 9609 samples. Passing sr=None keeps the file’s native sampling rate instead. Now let’s compute the length of the audio as follows:
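```python
import librosa

# sr=None keeps the file's native sampling rate instead of resampling
wav, sr = librosa.load(DATA_DIR + random_file, sr=None)
print('sr:', sr)
print('wav shape:', wav.shape)
# Duration in seconds = number of samples / samples per second
print('length:', wav.shape[0] / sr, 'secs')
```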
wav shape: (3486,)
length: 0.43575 secs
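The length printed above is simply the number of samples divided by the sampling rate. Here is a quick check using only the numbers shown in this article. The 8000 Hz native rate is not printed in the output above; it is inferred from 3486 samples lasting 0.43575 seconds, and duration_seconds is a helper name introduced here for illustration.

```python
def duration_seconds(n_samples, sr):
    """Length of a recording in seconds: samples divided by samples per second."""
    return n_samples / sr

# Same recording, loaded two ways (8000 Hz is inferred, not printed above)
native = duration_seconds(3486, 8000)       # loaded with sr=None
resampled = duration_seconds(9609, 22050)   # loaded at librosa's default rate

print(native, resampled)
```

Both calls describe the same audio, so the two durations agree to within a fraction of a millisecond even though the arrays have very different sizes.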
Now let’s have a quick look at the actual sound:
```python
import matplotlib.pyplot as plt

plt.plot(wav)
```
As we can see, the signal looks very complex, and it is hard to spot any pattern in it at this scale. Let’s zoom in on the figure to get some insight:
```python
# Zoom in on 200 samples; these indices assume the 9609-sample array loaded at the default rate
plt.plot(wav[4000:4200])
```
This is easier to read, but it still does not tell us much. Let’s move on to the process of speech classification.
Speech Classification with Machine Learning
I will start by using the raw waveforms as they are, and then try to build a neural network that can classify the digit in each recording. First comes some data preparation, as we have over 1500 recordings in the dataset.
One challenge here is that we cannot split the data randomly into a training and a test set. A standard split, say 85% training and 15% test, will not work for us, because the same voices would end up in both sets and the test score would not tell us how the model handles a new speaker. So I will train the algorithm on two speakers and test it on the third; as mentioned above, the dataset contains the voices of 3 people, which makes this possible. Now let’s prepare the data:
```python
import os

import librosa
import numpy as np

X = []
y = []

# Truncate a recording to i samples, or right-pad it with zeros up to i samples
pad = lambda a, i: a[0:i] if a.shape[0] > i else np.hstack((a, np.zeros(i - a.shape[0])))

for fname in os.listdir(DATA_DIR):
    struct = fname.split('_')
    digit = struct[0]  # first field of the file name, used as the digit label
    wav, sr = librosa.load(DATA_DIR + fname)
    padded = pad(wav, 30000)
    X.append(padded)
    y.append(digit)

X = np.vstack(X)
y = np.array(y)

print('X:', X.shape)
print('y:', y.shape)
```
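The speaker-based split described before the loading code could be sketched as follows. This is only an illustration under stated assumptions: it assumes file names of the form digit_speaker_index.wav, the speaker and file names below are made up, and split_by_speaker is a hypothetical helper that is not part of the code above.

```python
def split_by_speaker(fnames, test_speaker):
    """Return (train, test) file lists, holding out every file of one speaker."""
    train, test = [], []
    for fname in fnames:
        speaker = fname.split('_')[1]  # assumed layout: digit_speaker_index.wav
        (test if speaker == test_speaker else train).append(fname)
    return train, test

# Made-up file names for three hypothetical speakers
fnames = ['0_anna_0.wav', '1_anna_3.wav', '0_ben_1.wav', '7_ben_2.wav', '3_cara_0.wav']
train_files, test_files = split_by_speaker(fnames, test_speaker='cara')

print(len(train_files), len(test_files))  # 4 1
```

Holding out a whole speaker means the test score reflects how well the model copes with a voice it has never heard, which a random split cannot measure.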
Before training a model, let’s look more closely at the patterns in sound waves:
```python
signal = np.cos(np.arange(0, 20, 0.2))
plt.plot(signal)
```
This is an elementary signal, and we can clearly see a pattern in it. We can change the wave by manipulating its amplitude and frequency:
```python
# Twice the amplitude and twice the frequency of the previous wave
signal = 2*np.cos(np.arange(0, 20, 0.2)*2)
plt.plot(signal)
```
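The effect of the two factors can also be checked numerically. In this small sketch, cosine_wave and count_sign_changes are hypothetical helpers introduced for illustration: the amplitude factor sets the height of the peaks, and the frequency factor sets how often the wave crosses zero.

```python
import numpy as np

t = np.arange(0, 20, 0.2)  # same sample positions as in the plots above

def cosine_wave(amplitude, rate):
    """Cosine with the given peak height and frequency multiplier."""
    return amplitude * np.cos(rate * t)

def count_sign_changes(x):
    """Number of times consecutive samples switch between positive and negative."""
    return int(np.sum(np.diff(np.sign(x)) != 0))

base = cosine_wave(1, 1)
louder_faster = cosine_wave(2, 2)

print('peak heights:', base.max(), louder_faster.max())
print('zero crossings:', count_sign_changes(base), count_sign_changes(louder_faster))
```

The second wave peaks at twice the height and crosses zero roughly twice as often over the same span.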
Now let’s add several waves with different amplitudes and frequencies together:
```python
cos1 = np.cos(np.arange(0, 20, 0.2))
cos2 = 2*np.cos(np.arange(0, 20, 0.2)*2)
cos3 = 8*np.cos(np.arange(0, 20, 0.2)*4)
signal = cos1 + cos2 + cos3
plt.plot(signal)
```
Now we need to decide on the sampling rate of our signal or, equivalently, on its length. Let’s say it lasts one second for convenience. We have 100 points, so our sampling rate is 100 Hz. To find the frequencies the signal is made of, I will apply the Fourier transform to it:
```python
# Keep the first 50 bins: for a real signal the second half mirrors the first
fft = np.fft.fft(signal)[:50]
fft = np.abs(fft)
plt.plot(fft)
```
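To make the link between FFT bins and frequencies concrete, here is a small sketch. It does not use the signal from the code above: the three components are given whole-number frequencies of 3, 6 and 12 Hz, chosen here purely for illustration so that each one lands exactly on one FFT bin. With a one-second signal sampled at 100 Hz, bin k corresponds to k Hz, and the peak height divided by half the number of samples recovers the amplitude.

```python
import numpy as np

sr = 100                  # samples per second, as assumed in the text
t = np.arange(sr) / sr    # one second of time stamps

# Illustrative components: (amplitude, frequency in Hz)
components = [(1.0, 3), (2.0, 6), (8.0, 12)]
signal = sum(a * np.cos(2 * np.pi * f * t) for a, f in components)

spectrum = np.abs(np.fft.rfft(signal))            # magnitudes for bins 0..50
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)    # frequency of each bin in Hz

peak_bins = np.sort(np.argsort(spectrum)[-3:])    # the three largest bins
peak_freqs = freqs[peak_bins]
peak_amps = 2 * spectrum[peak_bins] / len(signal)

print('frequencies (Hz):', peak_freqs)
print('amplitudes:', peak_amps)
```

When a component does not complete a whole number of cycles, as with the cosines used in the article, its energy spreads over neighbouring bins and the peaks look wider.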
The FFT plot shows three peaks at three different frequencies, one for each cosine we added, which is exactly what we were looking for. Now we need a way to extract the frequencies from a real recording, but as we all know, human speech is never constant; it changes over time. So for better speech classification, I will split each recording into short windows and look at which frequencies are present in each window:
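```python
import librosa.display

# Short-time Fourier transform, converted to decibels relative to the peak
D = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
librosa.display.specshow(D, y_axis='linear')
```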
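The windowing idea behind the spectrogram can be sketched with plain NumPy. This is a simplified illustration, not what librosa.stft does internally: it uses non-overlapping windows, whereas librosa uses overlapping ones, and the test signal (50 Hz for the first half second, 200 Hz for the second) is made up for the example.

```python
import numpy as np

sr = 1000                       # illustrative sampling rate
t = np.arange(sr) / sr          # one second of time stamps

# Made-up signal whose frequency changes halfway through
signal = np.where(t < 0.5, np.cos(2 * np.pi * 50 * t), np.cos(2 * np.pi * 200 * t))

frame_len = 100
frames = signal.reshape(-1, frame_len)        # 10 non-overlapping windows
window = np.hanning(frame_len)                # taper each window to reduce leakage
spec = np.abs(np.fft.rfft(frames * window, axis=1))

freqs = np.fft.rfftfreq(frame_len, d=1 / sr)  # frequency of each bin in Hz
dominant = freqs[spec.argmax(axis=1)]         # strongest frequency per window

print(dominant)
```

A single FFT over the whole second would show both frequencies but not when each one occurs; the per-window view keeps the timing.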
In the spectrogram, we can see different frequency levels in different time blocks. With this, we come to the end of this look at Speech Classification with Machine Learning. I hope you liked this article; feel free to ask your questions in the comments section below.