Jul 13, 2023 · 3 min read

Audio Training 200x faster 🤯

When it comes to training AI models for audio-related tasks, a new kind of challenge presents itself. Let's take the example of a speech-to-text model. A model trained on a dataset of speakers of Indian origin probably won't work as well for speakers in the US, due to accent variations. As a result, to achieve a speech-to-text model with high accuracy and good generalisation across accents, the same model will have to be retrained on datasets from different geographical locations.

Similarly, for an audio classification model, audio samples in different environmental conditions might have to be collected for training, to make the model robust against different noise scenarios.

The crux of the problem is that audio-related AI training workloads generally require a large amount of training data, compute-intensive data preprocessing, and several possible iterations of retraining and fine-tuning of the model. Hence, days or even weeks of training!
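To give a feel for the preprocessing cost mentioned above, here is a minimal sketch of one common audio preprocessing step: turning a raw waveform into a log-magnitude spectrogram. It is an illustration only and not Scaletorch's pipeline; the function name `log_spectrogram`, the frame length, the hop size, the 44.1 kHz sample rate, and the synthetic 440 Hz test tone are all assumptions made for this example.

```python
import numpy as np

def log_spectrogram(waveform, frame_len=1024, hop=512, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # Magnitude of the real FFT of every frame, followed by log compression.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)

# Hypothetical input: a 5-second, 440 Hz tone sampled at 44.1 kHz (assumed values).
sample_rate = 44100
t = np.arange(5 * sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(clip)
print(spec.shape)  # (frames, frequency bins)
```

Even this simple transform runs hundreds of FFTs per clip, and it has to be repeated for every sample in every epoch unless the results are cached.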

Scaletorch DLOP (Deep Learning Offload Processor) is a groundbreaking technology that can revolutionise audio training by transparently speeding up your AI training workloads by up to 200x, with zero code or infrastructure changes. By leveraging Scaletorch DLOP, we can achieve impressive performance boosts, allowing for training and fine-tuning audio models on more data in less time.

In this blog, we'll explore the potential of Scaletorch DLOP and how it can significantly improve audio training.

Unleashing the Power of Scaletorch DLOP

Scaletorch DLOP works hand-in-hand with GPUs (and other deep learning accelerators, like TPUs and IPUs) to optimise and speed up audio-based AI training workloads. It doesn't require any changes to the training code or architecture setup, making it easy to integrate into existing workflows. Instead of using traditional techniques (like pruning, quantisation, selective backprop, etc.) that may affect accuracy, Scaletorch DLOP uses offloading and low-level programming to optimise training pipelines.

Some Benchmarks to Demonstrate Scaletorch DLOP Performance

Let's look at two important audio benchmarks that highlight the performance gains achieved with Scaletorch DLOP.

Environmental Sound Classification

Dataset Used: The ESC-50 dataset is a labelled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

The dataset consists of 5-second-long recordings organised into 50 semantic classes (with 40 examples per class), loosely arranged into 5 major categories such as animals, natural soundscapes and human sounds.
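The dataset figures above are easy to sanity-check with a little arithmetic. The sketch below only uses the numbers quoted in this post; the constant names are made up for illustration.

```python
# Figures quoted in the dataset description above.
NUM_CLASSES = 50
CLIPS_PER_CLASS = 40
CLIP_SECONDS = 5

# 50 classes x 40 clips each gives the 2000 recordings mentioned earlier.
total_clips = NUM_CLASSES * CLIPS_PER_CLASS

# Total audio duration across the whole dataset.
total_seconds = total_clips * CLIP_SECONDS
total_hours = total_seconds / 3600

print(f"{total_clips} clips, {total_seconds} s of audio (~{total_hours:.2f} h)")
```

So the full dataset is a little under three hours of audio, which makes it a convenient size for repeatable throughput benchmarks.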

1. ResNet 101 - 30x faster

Machine Config: We used a machine with 2 × Nvidia A100 (40GB)

Without DLOP:

Average throughput: 357.98 audio samples/sec

With DLOP:

Average throughput: 10593.03 audio samples/sec

With Scaletorch DLOP enabled, the training speed increased dramatically. On average, it processed ~10593 audio samples per second, which is roughly 30 times faster than the non-DLOP mode (10593.03 / 357.98 ≈ 29.6).
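For readers who want to reproduce the ratio, here is a small sketch that derives the speedup from the two throughput figures above. It also estimates how long one pass over all 2000 ESC-50 clips would take at each rate; that estimate assumes an epoch covers every clip exactly once and that throughput stays constant, neither of which is stated in the benchmark. The helper names are hypothetical.

```python
def speedup(baseline_sps, accelerated_sps):
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_sps / baseline_sps

def seconds_per_pass(num_samples, samples_per_sec):
    """Time to process num_samples once at a constant throughput."""
    return num_samples / samples_per_sec

# Throughput figures reported above for ResNet 101 (audio samples/sec).
baseline_sps = 357.98
dlop_sps = 10593.03

# Assumption: one pass over the full 2000-clip ESC-50 dataset.
dataset_size = 2000

ratio = speedup(baseline_sps, dlop_sps)
baseline_pass = seconds_per_pass(dataset_size, baseline_sps)
dlop_pass = seconds_per_pass(dataset_size, dlop_sps)

print(f"speedup: {ratio:.1f}x")
print(f"one pass: {baseline_pass:.2f} s without DLOP, {dlop_pass:.2f} s with DLOP")
```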

2. ResNet18 with Distributed Data Parallel (DDP) - 200x faster

Machine Config: We used a machine with 8 × Nvidia A100 (40GB), with PyTorch operating in DistributedDataParallel (DDP) mode

Without DLOP:

Average throughput: 55.33* audio samples/sec

With DLOP:

Average throughput: 11060.16* audio samples/sec

* Since DDP is used, the results shown are per GPU. For all 8 GPUs, these figures can be multiplied eightfold.

With Scaletorch DLOP enabled, the training speed increased roughly 200 times compared to the non-DLOP mode. This can significantly cut down on the training time.
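The per-GPU figures and the footnote can be combined in a short calculation. This sketch assumes throughput scales linearly across the 8 GPUs, as the footnote suggests; real multi-GPU scaling may fall short of that. The variable names are illustrative.

```python
NUM_GPUS = 8

# Per-GPU throughput reported above for ResNet18 with DDP (audio samples/sec).
baseline_per_gpu = 55.33
dlop_per_gpu = 11060.16

# Assumption from the footnote: aggregate throughput is per-GPU throughput x 8.
baseline_total = baseline_per_gpu * NUM_GPUS
dlop_total = dlop_per_gpu * NUM_GPUS

# The speedup is the same per GPU and in aggregate, since both sides scale by 8.
ddp_speedup = dlop_per_gpu / baseline_per_gpu

print(f"aggregate: {baseline_total:.2f} -> {dlop_total:.2f} samples/sec")
print(f"speedup: {ddp_speedup:.1f}x")
```

Under that assumption the machine as a whole goes from about 443 to about 88,481 samples per second, and the ratio works out to just under 200x.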

Unlocking New Possibilities

The remarkable performance boost offered by Scaletorch DLOP opens up exciting possibilities in audio training. Researchers and developers can now train audio models faster and work with larger datasets effortlessly. The reduced training time allows for more iterations and experimentation, leading to faster progress and innovation in audio-related AI applications.

Conclusion

Scaletorch DLOP is a powerful tool that can supercharge audio training. With throughput increases of up to 200 times, as seen in the ResNet18 benchmark, it enables faster training and better utilisation of compute resources. Integrating Scaletorch DLOP is a seamless process that requires no changes to the training code or model architecture.

As Scaletorch DLOP continues to evolve, we can expect even greater performance improvements, driving advancements in audio AI applications.

(Note: The information and benchmark results presented in this blog are based on available data and may vary depending on specific scenarios and configurations.)