DeepSpeech is an open source speech recognition engine developed by Mozilla. It uses machine learning to convert speech to text. Since it relies on TensorFlow and Nvidia’s CUDA, it is a natural choice for the Jetson Nano, which was designed with a GPU to support this technology. Unfortunately, getting it running is not easy, so I thought I would write a helpful blog post with some tips.
First, the hard part of compiling DeepSpeech for the Jetson Nano has already been done for you. Go to https://github.com/domcross/DeepSpeech-for-Jetson-Nano/releases/tag/v0.6.0 and download the deepspeech-0.6.0-cp36-cp36m-linux_aarch64.whl and libdeepspeech.so files from that release page. That should be all the instruction you need. Unfortunately, it is not that easy.
Second, install the Python wheel from the file you downloaded. You cannot install this build of DeepSpeech without supplying that file:
sudo pip install deepspeech-0.6.0-cp36-cp36m-linux_aarch64.whl
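If pip rejects the wheel as not supported on this platform, the usual cause is a mismatch between the tags in the file name and the Python that pip belongs to. The following is a minimal sketch, not part of the DeepSpeech tooling, that splits the file name into its tags and compares them with the running interpreter. The helper names are made up for illustration, and the platform comparison is a rough check that only looks at the CPU architecture.

```python
import platform
import sys


def parse_wheel_tags(filename):
    """Split a wheel file name into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel file name: " + filename)
    # Layout: name-version[-build]-python-abi-platform
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def wheel_matches_interpreter(filename):
    """Rough check: same CPython version and same CPU architecture."""
    tags = parse_wheel_tags(filename)
    expected_python = "cp{}{}".format(*sys.version_info[:2])
    same_python = tags["python"] == expected_python
    same_arch = tags["platform"].endswith(platform.machine())
    return same_python and same_arch


wheel = "deepspeech-0.6.0-cp36-cp36m-linux_aarch64.whl"
print(parse_wheel_tags(wheel))
print("matches this interpreter:", wheel_matches_interpreter(wheel))
```

Run it with the same Python you intend to install into. If the result is False, pip may belong to a different Python version than the cp36 one this wheel targets.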
If you are not familiar with Linux, you may be wondering where to copy the libdeepspeech.so file. Run the following command to determine where to copy the libdeepspeech.so file:
This indicates that /usr/local/lib would be a good location so copy the file there:
sudo cp libdeepspeech.so /usr/local/lib
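To see which directories the dynamic linker is configured to search, you can read its configuration files. The sketch below assumes the Debian/Ubuntu layout, where extra directories are listed in files under /etc/ld.so.conf.d/. That path and the helper names are assumptions for illustration, and the built-in default directories such as /lib and /usr/lib will not appear in the result.

```python
import glob


def parse_ld_conf(text):
    """Return the directories listed in ld.so.conf-style text."""
    dirs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("include"):
            continue  # 'include' lines point at other config files
        dirs.append(line)
    return dirs


def configured_library_dirs(conf_glob="/etc/ld.so.conf.d/*.conf"):
    """Collect directories from every matching config file on this machine."""
    dirs = []
    for path in sorted(glob.glob(conf_glob)):
        with open(path) as handle:
            dirs.extend(parse_ld_conf(handle.read()))
    return dirs


print(configured_library_dirs())
```

If /usr/local/lib shows up in the printed list, the linker is already configured to look there.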
But just copying that file is not enough. You need to run another command so Linux knows about this new shared library:
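The usual way to refresh the shared library cache on Linux is sudo ldconfig; treat that as an assumption here. Afterwards, a small check like the following sketch can confirm whether the system is able to resolve the library. It uses Python's standard ctypes.util.find_library, and the wrapper name is made up for illustration.

```python
import ctypes.util


def find_shared_library(name):
    """Return the file name the system resolves for lib<name>, or None."""
    return ctypes.util.find_library(name)


resolved = find_shared_library("deepspeech")
if resolved:
    print("found:", resolved)
else:
    print("libdeepspeech is not visible to the loader yet")
```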
Finally run the following command to see if DeepSpeech is working:
rsrobbins@nvidia-ai:~$ deepspeech --version
TensorFlow:
DeepSpeech:
rsrobbins@nvidia-ai:~$
You are supposed to get version numbers for TensorFlow and DeepSpeech, but both are blank. At least you are not getting any errors. Next, you need to download the pre-trained English models from https://github.com/mozilla/DeepSpeech and extract them. The deepspeech-0.6.1-models.tar.gz file is 1.14 GB, so you might want to download it on a computer with a decent Internet connection and copy the file to your Jetson Nano.
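Once the archive is on the Nano, extraction can also be scripted. The following is a minimal sketch using Python's standard tarfile module. It assumes the archive unpacks into a single top-level directory containing output_graph.pbmm, lm.binary, and trie, which are the file names used in the transcription command in this post. The function names are made up for illustration.

```python
import os
import tarfile

EXPECTED_FILES = ("output_graph.pbmm", "lm.binary", "trie")


def extract_models(archive_path, dest_dir="."):
    """Extract the archive and return the top-level directory it created."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if not names:
            raise ValueError("archive is empty: " + archive_path)
        archive.extractall(dest_dir)
    # Assumes every member lives under one top-level directory.
    top_level = names[0].split("/")[0]
    return os.path.join(dest_dir, top_level)


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """List the expected files that are not present in model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]


archive_name = "deepspeech-0.6.1-models.tar.gz"
if os.path.exists(archive_name):
    model_dir = extract_models(archive_name)
    print("extracted to", model_dir)
    print("missing files:", missing_model_files(model_dir))
```

An empty "missing files" list means the three files the command line needs are in place.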
You can now transcribe an audio file:
rsrobbins@nvidia-ai:~$ cd deepspeech
rsrobbins@nvidia-ai:~/deepspeech$ deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio audio/2830-3980-0043.wav
Loading model from file deepspeech-0.6.1-models/output_graph.pbmm
TensorFlow:
DeepSpeech:
2020-02-29 14:46:19.470759: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 14:46:19.479426: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2020-02-29 14:46:19.479575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
2020-02-29 14:46:19.479619: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-02-29 14:46:19.479744: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2020-02-29 14:46:19.479900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2020-02-29 14:46:19.479978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2020-02-29 14:46:20.310523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 14:46:20.310602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2020-02-29 14:46:20.310635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2020-02-29 14:46:20.310884: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2020-02-29 14:46:20.311108: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2020-02-29 14:46:20.311283: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:972] ARM64 does not support NUMA - returning NUMA node zero
2020-02-29 14:46:20.311425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 704 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
Loaded model in 1.53s.
Loading language model from files deepspeech-0.6.1-models/lm.binary deepspeech-0.6.1-models/trie
Loaded language model in 0.0271s.
Running inference.
experience proof less
Inference took 7.315s for 1.975s audio file.
rsrobbins@nvidia-ai:~/deepspeech$
You might be wondering where the heck the text from the speech in the audio file is. This program does not have a very intuitive user interface. The transcribed text is actually in the output directly after “Running inference” and reads “experience proof less”. The demo WAV file has only three spoken words. The actual speech in the audio file is “experience proves this”, so the transcription is close but not exact.
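Since the transcript is buried in the log, it can help to pull it out programmatically. The sketch below assumes you have captured the program's combined output as one string and that the transcript is the line right after "Running inference.", as in the run above. The function name and the idea of reporting a real-time factor are additions for illustration, not features of DeepSpeech.

```python
import re

INFERENCE_RE = re.compile(r"Inference took ([\d.]+)s for ([\d.]+)s audio file")


def summarize_run(output):
    """Pull the transcript and timing out of captured deepspeech output."""
    lines = [line.strip() for line in output.splitlines()]
    transcript = None
    if "Running inference." in lines:
        index = lines.index("Running inference.") + 1
        if index < len(lines):
            transcript = lines[index]
    match = INFERENCE_RE.search(output)
    if match is None:
        return transcript, None
    inference_s, audio_s = float(match.group(1)), float(match.group(2))
    # Real-time factor: above 1.0 means slower than the audio plays back.
    return transcript, inference_s / audio_s


sample = """Loaded language model in 0.0271s.
Running inference.
experience proof less
Inference took 7.315s for 1.975s audio file."""

transcript, rtf = summarize_run(sample)
print(transcript)
print("real-time factor: {:.2f}".format(rtf))
```

For the run above this gives a real-time factor of about 3.7, meaning the Nano spent roughly 3.7 seconds per second of audio on this short clip.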
Although the demo audio files from Mozilla work well enough, you may need to install Sound eXchange (SoX) to support conversion of other audio files. DeepSpeech expects this to be installed. Naturally, there is no mention of this requirement in the documentation. Run this command to install SoX:
sudo apt-get install sox
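To tell in advance whether a recording will need conversion, you can inspect its header. The sketch below uses Python's standard wave module and assumes the model expects 16 kHz, mono, 16-bit WAV input; treat those values as an assumption and adjust the defaults if your model differs. The function name is made up for illustration.

```python
import os
import wave


def wav_needs_conversion(path, rate=16000, channels=1, sample_width=2):
    """Return True if the WAV file differs from the assumed model format."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() != rate
                or wav.getnchannels() != channels
                or wav.getsampwidth() != sample_width)  # width is in bytes


audio = "audio/2830-3980-0043.wav"
if os.path.exists(audio):
    print(audio, "needs conversion:", wav_needs_conversion(audio))
```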
One additional tip: run DeepSpeech with sudo if you get an error, and simply run it again if the GPU runs out of memory.
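The "run it again" advice can be automated with a small wrapper. This is only a sketch: it retries on any non-zero exit status, and whether deepspeech reports an out-of-memory failure through its exit status is an assumption. The function name is made up, and the demo uses a stand-in command so it runs anywhere.

```python
import subprocess
import sys


def run_with_retries(command, attempts=3):
    """Run command until it exits with status 0 or attempts run out."""
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        if result.returncode == 0:
            break
        print("attempt {} failed with status {}".format(
            attempt, result.returncode))
    return result


# Stand-in command; replace it with the deepspeech argument list.
demo = run_with_retries([sys.executable, "-c", "print('ok')"])
print(demo.returncode, demo.stdout.strip())
```

To use it for real, pass the deepspeech command as a list, for example ["deepspeech", "--model", ...] with the same arguments as in the transcription command above.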