We spent a week surveying the current state of the art in machine-learned speech recognition.
Evidently, the state of affairs changed in late 2014, when research on speech recognition from Baidu R&D surfaced. It was announced by the renowned AI expert Andrew Ng, then the head scientist at Baidu R&D.
The earlier methods are considered to be "complex multi-stage hand engineered pipelines", insightfully crafted and tuned with engineers' lexical expertise to perform well across a wide range of possible spoken texts. Nowadays, these methods seem to be sidelined by neural networks that are trained to do the same job, but that can generalize more effectively.
Recently, we have been tinkering with Mozilla's open source implementation of DeepSpeech (2, to be pedantic), for which Mozilla has taken on the building, maintenance and data collection required for a well-performing speech-to-text implementation. It was nice to find a code base that is so actively developed, and crafted to let non-experts get a taste of the area. Mozilla is also collecting voice donations, so you can add to the pool of utterances yourself.
We chose to investigate the code on Google Cloud, where we can test several different architectures on demand (CPUs and GPUs; well darn, we even thought we might make use of those fancy TPUs), stay closer to the data we might be crunching, and, well, enjoy the overall convenience. It is worth noting that DeepSpeech is not yet ready to be deployed as an on-demand ML instance, but given the bustle in the GitHub repo, it seems to be close.
So let's begin. You can use a computer running Linux, or a cloud instance, with more than 100 GB of disk space to play around with this code. For GPUs, we used V100s attached to our instances. We recommend Ubuntu for its larger user base. First, we need to attach a drive to our instance.
$ sudo lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0   10G  0 disk
└─sda1   8:1    0   10G  0 part /
sdb      8:16   0  500G  0 disk    ### This is the attached drive
$ sudo mkfs.ext4 -m 0 -F -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb
$ sudo mkdir -p /mnt/disks/1/
$ sudo mount -o discard,defaults /dev/sdb /mnt/disks/1/
$ sudo chmod a+w /mnt/disks/1
$ sudo cp /etc/fstab /etc/fstab.backup
$ sudo blkid /dev/sdb    ### Note down the UUID here!
$ sudo nano /etc/fstab
In /etc/fstab, we need to enter the UUID that we got from the blkid command, along with the following options: discard, defaults, nofail (for Ubuntu).
UUID=a0a5e175-e78c-45ae-8950-a08fb0cf5599 / ext4 defaults 1 1
UUID=1e9d7961-96f1-439d-a576-e3611372070d /mnt/disks/1 ext4 discard,defaults,nofail 0 2
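If you would rather not edit the file by hand, the entry for the data disk can be generated instead. The following is only a sketch: `fstab_entry` is a hypothetical helper name, and the device and mount point in the commented usage line are the ones used above.

```shell
# Hypothetical helper (not part of any tool): print an /etc/fstab entry
# for an ext4 data disk, using the same options as the entry above.
fstab_entry() {
  uuid="$1"
  mount_point="$2"
  printf 'UUID=%s %s ext4 discard,defaults,nofail 0 2\n' "$uuid" "$mount_point"
}

# Usage (left commented out, since it needs root and the attached disk):
# fstab_entry "$(sudo blkid -s UUID -o value /dev/sdb)" /mnt/disks/1 | sudo tee -a /etc/fstab
```

Appending with `tee -a` leaves the existing root entry untouched, and the backup copy taken earlier is still there if something goes wrong.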
Then, we need to install some packages, both for working with the data and as prerequisites for DeepSpeech.
sudo apt install git-lfs    # We need Git Large File Storage (Git LFS)
sudo apt-get install python3 python3-venv virtualenv
sudo apt-get install sox ffmpeg    ### One of them suffices, but I alternated between them
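Before moving on, it can be handy to confirm that the tools actually ended up on the PATH. A minimal sketch, assuming the binaries carry the same names as the packages above; `missing_tools` is a hypothetical helper name.

```shell
# Hypothetical helper: print every command from the argument list
# that cannot be found on the PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools installed above (binary names are an assumption).
missing="$(missing_tools git-lfs python3 virtualenv sox ffmpeg)"
if [ -n "$missing" ]; then
  echo "Missing tools:"
  echo "$missing"
else
  echo "All tools found"
fi
```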
Then, we start following the instructions on the GitHub repo:
I first chose to create the virtual environment:
virtualenv -p python3 $HOME/tmp/deepspeech-venv/
source $HOME/tmp/deepspeech-venv/bin/activate    ### Activate it, so that pip installs into the venv
Before going deeper, we can start by observing how DeepSpeech does at inference:
cd /mnt/disks/1/    ### The default disk you get is 10GBs
wget -O - https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz | tar xvfz -
pip3 install deepspeech
wget http://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0057_8k.wav
deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio OSR_us_000_0057_16k.wav
The deepspeech command above expects 16 kHz audio, so we first convert the downloaded 8 kHz file to a 16 kHz WAV, as DeepSpeech requires, and end up with an almost correct transcription. Of course, since the model has no presuppositions about the language other than the lexicon, we get a stream of words rather than sentences.
ffmpeg -i OSR_us_000_0057_8k.wav -acodec pcm_s16le -ac 1 -ar 16000 OSR_us_000_0057_16k.wav
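If you have more than one recording to convert, the same command can be wrapped in a small function. This is only a sketch: `to_16k_name` and `convert_to_16k` are hypothetical helper names, and the `_16k.wav` naming simply mirrors the file name used above.

```shell
# Hypothetical helper: derive the output name for a converted file.
# A trailing "_8k.wav" becomes "_16k.wav"; any other ".wav" gets a "_16k" suffix.
to_16k_name() {
  case "$1" in
    *_8k.wav) printf '%s_16k.wav\n' "${1%_8k.wav}" ;;
    *.wav)    printf '%s_16k.wav\n' "${1%.wav}" ;;
    *)        return 1 ;;
  esac
}

# Hypothetical helper: convert one file with the same ffmpeg flags as above
# (16-bit PCM, mono, 16 kHz). Not called here; see the usage line below.
convert_to_16k() {
  out="$(to_16k_name "$1")" || return 1
  ffmpeg -i "$1" -acodec pcm_s16le -ac 1 -ar 16000 "$out"
}

# Usage:
# convert_to_16k OSR_us_000_0057_8k.wav
```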
(deepspeech-venv) mehmet@instance-1:~$ deepspeech --model models/output_graph.pbmm --alphabet models/alphabet.txt --lm models/lm.binary --trie models/trie --audio OSR_us_000_0057_16k.wav
Loading model from file models/output_graph.pbmm
TensorFlow: v1.6.0-18-g5021473
DeepSpeech: v0.2.0-0-g009f9b6
2018-10-23 13:07:03.850668: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Loaded model in 0.00777s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 5.53s.
Running inference.
on the island the sea breeze is soft and mild the play began as soon as we sat down this will lead the world to more sound and fury and salt before you fry the egg the rush for funds reached its peak to day the birch looked stark white and lonesome the box is held by a bright red snapper to make pure ice you freeze water the first worm gets snapped early jumped fence and hurry up the banke
There is an effort in the DeepSpeech GitHub repository to support TensorFlow Lite, which would make things a lot more interesting for us, since it would enable inference on Google Edge TPU chips.