Audio feature extraction has been a significant focus of machine learning over the years. The most common form of data is text, where feature extraction is fairly straightforward. Feature extraction from images is already a more challenging task.
In this article I will walk through audio feature extraction, which is a more complicated task still. Feature extraction is the process of reducing the number of features in the data by creating new features from the existing ones; the new extracted features must summarize most of the information contained in the original data.
Audio Feature Extraction
Audio feature extraction plays a significant part in analyzing audio. The idea is to extract features powerful enough to characterize the complex nature of audio signals, which in the end helps to identify the discriminatory subspaces of audio and everything you need to analyze sound signals. Let's start by importing all the libraries we need for this task:
import tensorflow as tf
import numpy as np
import pandas as pd
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioFeatureExtraction
import matplotlib.pyplot as plt
import os
audioBasicIO is used to read an audio file and produce the sample data for the signal, while audioFeatureExtraction is responsible for computing all the features we need from that signal. (Note that this is the older pyAudioAnalysis API; in more recent releases the module was renamed to ShortTermFeatures.)
Now I will define a utility function that takes a file name as its argument:
def preProcess( fileName ):
    # Read the file: Fs is the sampling rate, x the raw signal
    [Fs, x] = audioBasicIO.readAudioFile( fileName )
    # Collapse stereo to mono by averaging the two channels
    if( len( x.shape ) > 1 and x.shape[1] == 2 ):
        x = np.mean( x, axis = 1, keepdims = True )
    else:
        x = x.reshape( x.shape[0], 1 )
    # Short-term feature extraction: 50 ms windows with a 25 ms step
    F, f_names = audioFeatureExtraction.stFeatureExtraction(
        x[ :, 0 ],
        Fs,
        0.050 * Fs,
        0.025 * Fs
    )
    return (f_names, F)
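To see what the function returns, here is a hypothetical call ("sample.wav" is a placeholder for one of your own files); pyAudioAnalysis computes 34 short-term features per window:

# Hypothetical usage; "sample.wav" stands in for one of your files
f_names, F = preProcess( "sample.wav" )
print( len( f_names ) )   # 34 short-term feature names
print( F.shape )          # (34, numberOfWindows)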
For this task I only want the chromagram features from the audio signals, so I will now pull them out of our function's output:
def getChromagram( audioData ):
    # Features 21-32 of stFeatureExtraction are the 12 chroma coefficients
    chromagram = audioData[ 21 ].reshape( 1, audioData[ 21 ].shape[0] )
    for i in range( 22, 33 ):
        temp_data = audioData[ i ].reshape( 1, audioData[ i ].shape[0] )
        # Stack each pitch class as a new row
        chromagram = np.vstack( [ chromagram, temp_data ] )
    return chromagram
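As a quick sanity check, reusing the hypothetical preProcess call from above, the stacked matrix should have one row per pitch class and one column per analysis window:

f_names, F = preProcess( "sample.wav" )   # hypothetical placeholder file
chromagram = getChromagram( F )
print( chromagram.shape )                 # (12, numberOfWindows)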
Now I will create a function that finds the dominant note in each window and turns those per-window notes into a normalized frequency histogram over the 12 pitch classes:
def getNoteFrequency( chromagram ):
    # Number of analysis windows in the clip
    numberOfWindows = chromagram.shape[1]
    # Index of the strongest pitch class in each window
    freqVal = chromagram.argmax( axis = 0 )
    # Histogram over the 12 pitch classes; pinning range to (0, 12)
    # keeps one bin per note regardless of which notes actually occur
    histogram, bin_edges = np.histogram( freqVal, bins = 12, range = ( 0, 12 ) )
    # Normalize by the window count so each row sums to 1
    normalized_hist = histogram.reshape( 1, 12 ).astype( float ) / numberOfWindows
    return normalized_hist
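A tiny synthetic check (entirely made-up data, just to illustrate the behavior): if every window peaks on pitch class 6, all of the histogram mass should land in bin 6.

fake_chroma = np.zeros( ( 12, 5 ) )   # 12 pitch classes, 5 windows
fake_chroma[ 6, : ] = 1.0             # pitch class 6 dominates every window
print( getNoteFrequency( fake_chroma ) )
# [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]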
Now I will create a function that iterates over the files in a directory and builds the dataset, using a pandas data frame to store the feature vectors:
fileList = []

def getDataset( filePath ):
    X = pd.DataFrame()
    columns = [ "G#", "G", "F#", "F", "E", "D#", "D", "C#", "C", "B", "A#", "A" ]
    for root, dirs, filenames in os.walk( filePath ):
        for file in filenames:
            fileList.append( file )
            # Full pipeline: read file -> chromagram -> note-frequency vector
            feature_name, features = preProcess( os.path.join( root, file ) )
            chromagram = getChromagram( features )
            noteFrequency = getNoteFrequency( chromagram )
            x_new = pd.Series( noteFrequency[ 0, : ] )
            X = pd.concat( [ X, x_new ], axis = 1 )
    # Transpose so that each row is a file and each column a pitch class
    data = X.T.copy()
    data.columns = columns
    data.index = [ i for i in range( 0, data.shape[ 0 ] ) ]
    return data

data = getDataset( filePath )   # filePath: directory containing the .wav files
data
|   | G# | G | F# | F | E | D# | D | C# | C | B | A# | A |
|---|----|---|----|---|---|----|---|----|---|---|----|---|
| 0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 1 | 0.991713 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.008287 |
| 2 | 0.021834 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.978166 |
| 3 | 0.015385 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.984615 |
| 4 | 0.090909 | 0.0 | 0.0 | 0.090909 | 0.0 | 0.0 | 0.545455 | 0.0 | 0.0 | 0.0 | 0.0 | 0.272727 |
| 5 | 0.027027 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.972973 |
| 6 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 7 | 0.097561 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.902439 |
| 8 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 9 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
In the data frame above, each row represents one audio file and each column one of the 12 pitch-class features. So we have 10 files with 12 features each.
Machine Learning Algorithm for Audio Feature Extraction
Here I will use the k-means clustering algorithm, so first I will define its hyperparameters: k is the number of clusters, and epochs is the number of iterations the algorithm will run for:
k = 3
epochs = 1000
Now I will write a function that selects the first k data points as the initial centroids:
def initializeCentroids( data, k ):
    # Use the first k data points as the starting centroids
    centroids = data[ 0: k ]
    return centroids
Now I will define tensors that act as placeholders for our data. Here X represents the data points, C the list of k centroids, and C_labels the index of the centroid assigned to each data point:
# Note: tf.placeholder and tf.Session are TensorFlow 1.x APIs
X = tf.placeholder( dtype = tf.float32 )
C = tf.placeholder( dtype = tf.float32 )
C_labels = tf.placeholder( dtype = tf.int32 )
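If your environment only has TensorFlow 2.x installed (an assumption about your setup, not part of the original code), you can still run this graph-style code through the compatibility layer:

# Drop-in replacement for the "import tensorflow as tf" line above
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()   # restore TF1-style graph execution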
Now I will build the assignment step of k-means: compute the squared distance from every data point to every centroid, then pick the closest centroid for each point:
# Broadcast the (n, d) data against the (k, d) centroids:
# expanded_vectors has shape (1, n, d), expanded_centroids (k, 1, d)
expanded_vectors = tf.expand_dims( X, 0 )
expanded_centroids = tf.expand_dims( C, 1 )
# Squared Euclidean distance, shape (k, n)
distance = tf.reduce_sum( tf.square( tf.subtract( expanded_vectors, expanded_centroids ) ), axis = 2 )
# For each data point, the index of its nearest centroid
getCentroidsOp = tf.argmin( distance, 0 )
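The broadcasting trick is easier to see in plain NumPy. A minimal sketch with three made-up 2-D points and two made-up centroids:

pts = np.array( [ [ 0., 0. ], [ 1., 1. ], [ 5., 5. ] ] )   # (n=3, d=2)
cts = np.array( [ [ 0., 0. ], [ 5., 5. ] ] )               # (k=2, d=2)
# (1, n, d) - (k, 1, d) broadcasts to (k, n, d); summing over d gives (k, n)
dist = ( ( pts[ None, :, : ] - cts[ :, None, : ] ) ** 2 ).sum( axis = 2 )
print( dist.argmin( axis = 0 ) )   # [0 0 1]: nearest centroid per point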
Now I will compute the new centroids, i.e. the mean of the data points assigned to each cluster:
# Per-cluster sums of the data points, grouped by their labels
sums = tf.unsorted_segment_sum( X, C_labels, k )
# Per-cluster counts, obtained by summing ones with the same grouping
counts = tf.unsorted_segment_sum( tf.ones_like( X ), C_labels, k )
# Mean per cluster = new centroid
reCalculateCentroidsOp = tf.divide( sums, counts )
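For illustration, here is the same update step in plain NumPy, run on the made-up points from the sketch above:

labels = np.array( [ 0, 0, 1 ] )                          # cluster index per point
sums = np.zeros( ( 2, 2 ) )
np.add.at( sums, labels, pts )                            # per-cluster sums
counts = np.bincount( labels, minlength = 2 )[ :, None ]  # per-cluster sizes
print( sums / counts )                                    # [[0.5 0.5] [5.  5. ]]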
Driver Code
Now I will write the driver code for our algorithm. It feeds the data into the two operations above and alternates the assignment and update steps for a fixed number of epochs:
data_labels = []
centroids = []
with tf.Session() as sess:
    sess.run( tf.global_variables_initializer() )
    centroids = initializeCentroids( data, k )
    for epoch in range( epochs ):
        # Assignment step: label each point with its nearest centroid
        data_labels = sess.run( getCentroidsOp, feed_dict = { X: data, C: centroids } )
        # Update step: recompute each centroid as the mean of its cluster
        centroids = sess.run( reCalculateCentroidsOp, feed_dict = { X: data, C_labels: data_labels } )
print( data_labels )
print( centroids )
[0 1 2 2 0 2 0 2 0 0]
[[0.01818182 0.         0.         0.01818182 0.         0.
  0.9090909  0.         0.         0.         0.         0.05454545]
 [0.9917127  0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.00828729]
 [0.04045167 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.95954835]]
Now we have trained the model. Let's have a look at which cluster each file ended up in:
final_labels = pd.DataFrame( { "Labels": data_labels, "File Names": fileList } )
final_labels
|   | File Names | Labels |
|---|------------|--------|
| 0 | c3_1.wav   | 0      |
| 1 | c4_1.wav   | 1      |
| 2 | c1_2.wav   | 2      |
| 3 | c2_2.wav   | 2      |
| 4 | c3_2.wav   | 0      |
| 5 | c2_3.wav   | 2      |
| 6 | c2_4.wav   | 0      |
| 7 | c1_1.wav   | 2      |
| 8 | c2_1.wav   | 0      |
| 9 | c4_2.wav   | 0      |
I hope you liked this article on audio feature extraction using the k-means clustering algorithm. Feel free to ask your questions in the comments section below, and you can follow me on Medium to read more articles like this.