Training Speech Recognition

Posted by Alex Billinger on September 1, 2020



Many people have difficulty with spoken words, whether they are hard of hearing or just checking voicemail in a noisy place. Speech recognition is a powerful tool here: training a computer to understand audio files and print out the words makes the same information accessible in a different format. Using technology to improve accessibility is a passion of mine, so I decided to develop my own speech recognition algorithm. My goal was not to create a better tool than what’s available on the market today, but to build an understanding of the process involved; being limited to my own computational resources meant I simply didn’t have the computing power to develop a highly accurate model.

My approach was to start with audio data, in the form of labeled phrases, from a variety of speakers. The data set I used can be found here: link. Since predicting whole phrases is not broadly applicable for speech recognition, the first step after getting the data into a Jupyter notebook was to split the phrases into individual words. To do this, I first cleaned up the audio by normalizing the volume and cutting out the silence. Then, using the text labels, I counted the syllables in each word and in the phrase as a whole, and allotted each word a slice of the audio proportional to its share of the total syllables. This method is approximate, but by checking the resulting clips, I found it was reasonably accurate. I then lowered the quality of each clip (its sampling rate) so that every clip had the same number of values. These values, converted to a NumPy array, were used for training the model.
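Here’s a rough sketch of that pipeline, assuming pydub for the audio handling; the syllable counter below is a crude vowel-cluster heuristic standing in for a proper dictionary lookup, so treat this as illustrative rather than my exact notebook code:

```python
# A minimal sketch of the preprocessing, assuming pydub for audio handling.
# The syllable counter is a crude vowel-cluster heuristic; a dictionary-based
# count would be more accurate.
import re
import numpy as np
from pydub import AudioSegment
from pydub.effects import normalize
from pydub.silence import detect_leading_silence

def count_syllables(word):
    # Approximate syllables as runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def trim_silence(seg, threshold=-40.0):
    # Cut leading and trailing silence below the dBFS threshold.
    start = detect_leading_silence(seg, silence_threshold=threshold)
    end = len(seg) - detect_leading_silence(seg.reverse(), silence_threshold=threshold)
    return seg[start:end]

def split_phrase(wav_path, transcript):
    # Normalize volume, strip silence, then give each word a slice of the
    # audio proportional to its share of the phrase's total syllables.
    audio = trim_silence(normalize(AudioSegment.from_wav(wav_path)))
    words = transcript.lower().split()
    counts = [count_syllables(w) for w in words]
    total = sum(counts)
    clips, start = [], 0.0
    for word, n in zip(words, counts):
        dur = len(audio) * n / total  # pydub measures length in milliseconds
        clips.append((word, audio[int(start):int(start + dur)]))
        start += dur
    return clips

def to_features(clip, frame_rate=4000, n_values=47000):
    # Downsample, then pad or truncate so every clip yields the same
    # number of values for the model.
    samples = np.array(clip.set_frame_rate(frame_rate).get_array_of_samples(),
                       dtype=np.float32)
    out = np.zeros(n_values, dtype=np.float32)
    out[:min(len(samples), n_values)] = samples[:n_values]
    return out
```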

Here’s a sample of what the process looks like for the phrase “learn to recognize omens and follow them the old king had said”, and the clip of that audio containing just the word “recognize”.
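With the sketch above, pulling out just that word would look something like this (the file name here is made up):

```python
# Hypothetical usage of the split_phrase sketch; the file name is made up.
clips = split_phrase("phrase_0001.wav",
                     "learn to recognize omens and follow them the old king had said")
recognize_clip = dict(clips)["recognize"]
recognize_clip.export("recognize.wav", format="wav")  # listen to spot-check the split
```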




For the text data, there were 8,000 unique words in total. I used the 1,000 most common as the labels for prediction, plus an [UNRECOGNIZED] category for everything else.
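Building that label set is straightforward; a sketch along these lines, assuming `transcripts` is the list of labeled phrases:

```python
from collections import Counter

# Sketch of the labeling scheme, assuming `transcripts` is the list of
# labeled phrases from the data set.
word_counts = Counter(w for phrase in transcripts for w in phrase.lower().split())
vocab = [w for w, _ in word_counts.most_common(1000)]
label_to_id = {w: i for i, w in enumerate(vocab)}
UNRECOGNIZED = len(vocab)  # catch-all id for the other ~7,000 words

def encode(word):
    return label_to_id.get(word, UNRECOGNIZED)
```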

This is where I started to run into hardware limitations. Even using 5,000 phrases (the training data set from Kaggle has 195,776), the resulting file was 0.7 GB (3000 × 47000 values!), which meant that even with this small fraction of the total data, my computer could take hours to train one model. And with slightly over 1,000 possible labels to predict, getting even reasonably high accuracy meant testing many combinations of parameters. I also consistently ran into overfitting, even with fairly heavy regularization.
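For a sense of what “fairly heavy regularization” means here, this is the shape of model I was experimenting with; the layer sizes and penalty strengths below are illustrative, not my exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Illustrative regularized classifier; layer sizes and penalties are
# examples, not the exact model from the project.
NUM_VALUES = 47000   # values per downsampled clip
NUM_CLASSES = 1001   # 1,000 common words + [UNRECOGNIZED]

model = tf.keras.Sequential([
    layers.Input(shape=(NUM_VALUES,)),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),  # L2 weight penalty
    layers.Dropout(0.5),                                     # drop half the units
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```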

To build on this project in the future, I would love to return to it with more computational resources. That would make it possible both to use more data (even my most data-intensive models used less than 10% of what was available) and to test more models. Another area I would like to improve is how the data is split: the current method is approximate, but more importantly, it only works on labeled data, which means my current model cannot parse a new, unlabeled phrase.

This project has really given me perspective on the scope and power of big data. While modern speech recognition tools are far from perfect, training a computer to recognize words is an incredibly complex and resource-intensive task. I look forward to seeing how accessibility tools like speech recognition improve in the future, and to contributing the data science skills I have developed to be a part of that improvement.

Footnote: for code and more information, my repository can be found here: link