Spectrogramify – Using machine learning to “see” sound

Machine learning has taken the world by storm these past few years, with advances in natural language processing, computer vision, and anomaly detection transforming the way we live.

One frontier that has yet to be extensively explored is sound. How can we get computers to hear things?

Our conversational AI chatbot, RICHA, automatically places calls to the customers of our client, but what happens when a call reaches voicemail? The bot should be able to identify this on the spot, terminate the call, and schedule a new one for later. At the moment, our team manually labels the recording of each call to determine whether it reached a voicemail or a human.

Problem Statement:

How can we automate the process of detecting a Voicemail given a sound file? 


Machine learning is incredibly good at classifying images. State-of-the-art models can tell pictures of ‘cats’ from pictures of ‘dogs’ with close to 99% accuracy. Through my previous research opportunities, I learned a very simple way to do the same for audio: transform the sounds into images!

Here I introduce Spectrogramify: an internal toolkit that turns any sound file into an image, which can later be used to train a neural network for binary (two-class) classification.

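In essence, a sound can be described along three dimensions: frequency band, intensity, and time. Typically, time runs along one axis, frequency along the other, and intensity is shown as the color of each pixel. This image is called a spectrogram: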

Can you tell which is Voicemail and which is Human?
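For readers curious how such an image is computed, here is a minimal sketch of a spectrogram built with a short-time Fourier transform in NumPy. It is not Spectrogramify's actual code: the function name, the frame length, the hop size, and the synthetic 440 Hz test tone are all assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=256, hop=128):
    """Return (freqs, times, power_db) for a 1-D signal via a short-time FFT.

    frame_len and hop are illustrative defaults, not values from the post.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames, one frame per row.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # FFT of each windowed frame; rfft keeps only non-negative frequencies.
    spectrum = np.fft.rfft(frames * window, axis=1)
    power = np.abs(spectrum) ** 2
    power_db = 10.0 * np.log10(power + 1e-10)  # small offset avoids log(0)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    # Transpose so rows are frequency bands and columns are time steps.
    return freqs, times, power_db.T

# Synthetic example (assumed data): one second of a 440 Hz tone at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
freqs, times, spec_db = spectrogram(tone, sample_rate)
# The frequency band with the highest average intensity.
peak_freq = freqs[spec_db.mean(axis=1).argmax()]
```

Each column of `spec_db` is one slice of time and each row is one frequency band, so the array can be saved directly as a grayscale image.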

Processing the Images

The first optimization I make is to normalize all the waveforms so that they are of equal volume and can be compared numerically. This is absolutely crucial because not all recordings have the same strength, owing to factors such as mobile signal strength, phone microphone quality, and how loudly the user speaks.
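One simple way to do this is peak normalization, sketched below. Whether Spectrogramify uses peak, RMS, or some other form of normalization is not stated here, so treat the method and the `peak_normalize` name as assumptions.

```python
import numpy as np

def peak_normalize(waveform, target_peak=1.0):
    """Scale a waveform so its largest absolute sample equals target_peak."""
    waveform = np.asarray(waveform, dtype=np.float64)
    peak = np.max(np.abs(waveform)) if waveform.size else 0.0
    if peak == 0.0:
        return waveform.copy()  # a silent clip has nothing to scale
    return waveform * (target_peak / peak)

# Two made-up recordings of the same sound at very different volumes.
shape = np.sin(np.linspace(0, 20, 1000))
quiet = 0.05 * shape
loud = 0.80 * shape

quiet_norm = peak_normalize(quiet)
loud_norm = peak_normalize(loud)
```

After normalization the quiet and loud versions are numerically the same signal, which is the property the comparison relies on.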

Next, we can crop out parts of the spectrogram to reduce the dimensionality of the training data. In essence, this allows the model to learn faster, since there are fewer input variables to correlate with the label.
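As a rough sketch of what cropping can look like, the snippet below keeps only a band of frequencies and the first few time frames. The 300 to 3400 Hz band (a common telephone voice band), the frame limit, and the `crop_spectrogram` name are assumptions for illustration, not the regions Spectrogramify actually crops.

```python
import numpy as np

def crop_spectrogram(spec, freqs, f_min=300.0, f_max=3400.0, max_frames=None):
    """Keep rows whose frequency lies in [f_min, f_max] and the first max_frames columns."""
    spec = np.asarray(spec)
    freqs = np.asarray(freqs)
    keep = (freqs >= f_min) & (freqs <= f_max)  # boolean mask over frequency rows
    cropped = spec[keep, :]
    if max_frames is not None:
        cropped = cropped[:, :max_frames]
    return cropped, freqs[keep]

# Placeholder spectrogram: 129 frequency bands (0 to 4000 Hz) by 61 time frames.
freqs = np.linspace(0, 4000, 129)
spec = np.random.default_rng(0).random((129, 61))

small, kept_freqs = crop_spectrogram(spec, freqs, max_frames=40)
reduction = small.size / spec.size  # fraction of the original inputs that remain
```

In this toy case the number of input values drops to roughly half, which is the kind of saving the paragraph above describes.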


Using all the manual labelling the team had already done, I was able to compile a labelled dataset of these generated image representations of the audio files.

From there on, it was a simple task of showing the neural network each image, letting it guess whether it was a voicemail or a human, and using the correctness or incorrectness of that guess to refine all future guesses. (Read up on Backpropagation if you’re interested!)

There were 870 labelled samples in total. About two thirds (roughly 67%) formed the training set used to train the model, and the remaining third (33%, or 287 samples) was held out as a testing set that the model never saw during training.
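The split can be sketched as a shuffled partition of sample indices. The `split_indices` helper and the fixed seed are hypothetical; only the 870 total and the 33% test share come from the figures above.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.33, seed=0):
    """Shuffle sample indices and split them into (train, test) arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(870)
n_train, n_test = len(train_idx), len(test_idx)  # 583 and 287
```

The 287 test samples agree with the total of the four confusion matrix cells reported further down (143 + 127 + 11 + 6).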

The Model

The model I chose was a Convolutional Neural Network with dropout layers.

CNNs are really good at classifying images because they first extract features using convolutional layers, and those features are then fed to a perceptron (fully connected layers) that makes the final prediction.
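To make "extracting features" concrete, here is a toy sketch of what a single convolutional filter computes. It is written in plain NumPy for clarity; the kernel and the tiny image are invented for illustration, and a real CNN learns its kernels during training rather than having them written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding) and sum the elementwise products.

    Like most deep learning frameworks, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    image = np.asarray(image, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r : r + kh, c : c + kw] * kernel)
    return out

# Toy "spectrogram": silent on the left, loud on the right.
image = np.zeros((4, 6))
image[:, 3:] = 1.0

# Hand-written kernel that responds where intensity rises from left to right.
edge_kernel = np.array([[-1.0, 1.0]])
feature_map = conv2d_valid(image, edge_kernel)
```

The resulting feature map is zero everywhere except the column where the sound starts, so this filter acts as a crude "onset detector". Stacking many learned filters is what lets the network pick out patterns such as a voicemail beep.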

Dropout layers randomly disable some of the neurons in the neural network at each training step. This forces the rest of the neurons to “pick up the slack” and encourages them to learn different functionality instead of relying on one another. This works to combat overfitting, which allows the model to generalize to data that isn’t exactly the same as what it was trained on.
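A minimal sketch of the idea, using the common "inverted dropout" formulation, is shown below. The 0.5 rate is an assumption for illustration; the rate used in the actual model is not stated in this post.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of activations during training.

    Surviving values are scaled by 1 / (1 - rate) so the expected
    activation stays the same; at inference time the input passes through.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = neuron kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((1000, 100))  # placeholder layer outputs

dropped = dropout(acts, rate=0.5, rng=rng)
at_inference = dropout(acts, rate=0.5, rng=rng, training=False)
```

During training about half of the values are zeroed and the rest are doubled, so the average activation stays close to its original value. At inference nothing is dropped.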


After running through the training set, we managed to achieve an accuracy of 94% on the test set, and 100% during our QA testing!

Errors in Machine Learning models are more nuanced than just a percentage. We need to look at the confusion matrix produced by testing the model on the unseen testing set:

                      Predicted Voicemail    Predicted Human
Actual Voicemail              143                    6
Actual Human                   11                  127

The four cases are:
a) 143 Voicemail recordings were correctly predicted to be Voicemail
b) 127 Human recordings were correctly predicted to be Human
c) 11 Human recordings were incorrectly predicted to be Voicemail
d) 6 Voicemail recordings were incorrectly predicted to be Human

Cases a and b are fairly self-explanatory: these are where the model correctly labelled the audio.

For case c, the consequence is simply that the human will get another call scheduled for some other time.

Case d is more problematic, as voicemails that are identified as humans will not have another call scheduled.

The overall goal is to reduce the number of misclassifications of type d.
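The four counts above can be turned into a few standard metrics. In the sketch below, "Voicemail" is treated as the positive class (an assumption made for naming only), so case d corresponds to a false negative and voicemail recall is the number that tracks it.

```python
# Counts taken from cases a) to d) above, with Voicemail as the positive class.
true_pos = 143   # a) Voicemail predicted as Voicemail
true_neg = 127   # b) Human predicted as Human
false_pos = 11   # c) Human predicted as Voicemail
false_neg = 6    # d) Voicemail predicted as Human

total = true_pos + true_neg + false_pos + false_neg

# Share of all test recordings that were labelled correctly.
accuracy = (true_pos + true_neg) / total
# Of all real voicemails, the share that was caught (penalized by case d).
voicemail_recall = true_pos / (true_pos + false_neg)
# Of all "Voicemail" predictions, the share that was right (penalized by case c).
voicemail_precision = true_pos / (true_pos + false_pos)

print(f"total={total} accuracy={accuracy:.3f} "
      f"recall={voicemail_recall:.3f} precision={voicemail_precision:.3f}")
```

Accuracy works out to about 94%, matching the headline figure, while voicemail recall is about 96%. Reducing type d errors means pushing that recall number up, even if it costs a few more type c mistakes.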


There is a multitude of enhancements that can be made, mostly pertaining to the preprocessing and dimensionality reduction of the generated spectrograms. State-of-the-art audio classification has an accuracy of 97%, which is quite a bit better than the 94% my custom model achieves. More research is needed to turn this into a watertight solution for one of the biggest woes of the RICHA project!

Find out more about our conversational AI chatbot, RICHA.
