
CV training on Steroids - 18x speed up

Updated: Apr 26

AI model complexity is expected to increase by 750x over the next 5 years, while accelerator performance is expected to increase by 6.5x-11x in the same period.
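As a rough illustration of that gap, the two projections can be divided. This is a back-of-the-envelope sketch, and it assumes (simplistically) that training time scales with model complexity divided by accelerator speed:

```python
# Figures quoted in the article: projected growth over the next 5 years.
model_complexity_growth = 750.0
accelerator_growth_low, accelerator_growth_high = 6.5, 11.0

# Assumption: training time ~ complexity / hardware speed, so the
# shortfall is simply the ratio of the two growth factors.
gap_best_case = model_complexity_growth / accelerator_growth_high
gap_worst_case = model_complexity_growth / accelerator_growth_low

print(f"Gap: {gap_best_case:.0f}x to {gap_worst_case:.0f}x")
```

Under that assumption, hardware improvements alone would leave training roughly 68x to 115x slower than today for the most demanding models.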


This disparity leads to longer AI model training times, which can last days or even months.



Let’s take a prominent example in the computer vision domain: Tesla AI. Tesla Autopilot is one of the most advanced computer vision solutions. Even with access to massive amounts of infrastructure and AI talent, it takes Tesla’s AI division around 70,000 GPU hours (roughly 8 years of compute on a single GPU) to train its Autopilot model.


For smaller companies with limited compute and talent resources, these numbers are practically impossible to compete with.


To address this challenge, the industry needs to speed up deep learning workloads and so reduce training time. ScaleTorch offers a solution that can increase PyTorch workload execution speeds by 10-200x without requiring code changes, using the same amount of GPU compute.


Here’s how it works.



How Can ScaleTorch Help With Your Computer Vision Training Speedup?


Scaletorch relies on its revolutionary Deep Learning Offload Processor (DLOP) technology, which works in conjunction with GPUs and other deep learning accelerators (such as TPUs and IPUs) to transparently speed up AI training by 10x-200x.


Scaletorch DLOP can accelerate a variety of training workloads, including computer vision, audio, video, medical imaging, sensory data, and software-defined radio. It achieves the speed-up by combining two techniques: offloading certain operations from GPUs or other deep learning accelerators (DLAs) to the DLOP, and processing the training pipeline asynchronously.


  • Scaletorch DLOP requires no change to the PyTorch script in order to accelerate training.

  • Scaletorch DLOP can seamlessly access training datasets from filesystems, object stores, and remote data sources.

  • Scaletorch DLOP does not require any change to the model itself, nor does it use techniques such as quantization, pruning, distillation, or selective backpropagation that would change the accuracy of the model.


Rather, Scaletorch DLOP uses offloading and low-level programming to fully leverage the capabilities of modern processors. It optimizes the AI training pipeline so that GPUs and DLAs are used to their maximum potential.
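To give a feel for why asynchronous pipeline processing helps, here is a minimal, generic sketch in plain Python. It is not Scaletorch's implementation: the function names, the simulated delays, and the queue size are all hypothetical stand-ins. It only shows the general idea of preparing the next batches in the background while the current batch is being trained on.

```python
import queue
import threading
import time


def load_and_preprocess(index):
    """Hypothetical stand-in for data-side work (decode, augment, ...)."""
    time.sleep(0.01)  # simulated I/O and preprocessing cost
    return [index] * 4  # a fake "batch"


def train_step(batch):
    """Hypothetical stand-in for the accelerator's forward/backward pass."""
    time.sleep(0.01)  # simulated compute cost
    return sum(batch)


def producer(num_batches, out_queue):
    """Prepare batches in the background and hand them to the trainer."""
    for i in range(num_batches):
        out_queue.put(load_and_preprocess(i))
    out_queue.put(None)  # sentinel: no more data


def run_pipelined(num_batches, prefetch=4):
    """Overlap data preparation with training via a bounded queue."""
    batches = queue.Queue(maxsize=prefetch)
    worker = threading.Thread(
        target=producer, args=(num_batches, batches), daemon=True
    )
    worker.start()
    results = []
    while True:
        batch = batches.get()
        if batch is None:
            break
        results.append(train_step(batch))
    worker.join()
    return results


def run_sequential(num_batches):
    """Baseline: the trainer waits while each batch is prepared."""
    return [train_step(load_and_preprocess(i)) for i in range(num_batches)]
```

Both functions return the same results for the same input, so the training outcome is unchanged; the pipelined version simply hides preparation time behind compute time instead of adding the two together.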


The Scaletorch DLOP is available as a virtual appliance that runs in AWS, GCP, and Azure, or as a physical appliance for on-premises implementations. Multiple Scaletorch DLOPs can be clustered together to provide scale-out offload capacity.


Some Benchmarks


To demonstrate the efficiency and performance gains of Scaletorch DLOP, we conducted a variety of benchmark tests with different datasets (spanning image, video, audio, medical imaging, and LiDAR data) and model architectures.


Of these tests, let us first consider the Charades dataset, which comprises 9,848 videos of daily indoor activities collected through Amazon Mechanical Turk. The specifications for the benchmark test are as follows.


Charades Benchmark | Object Detection


Dataset used: Charades, which contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos.


Model architecture: 3DResNet50


Machine config: We used a machine with 8 × NVIDIA A100 (40 GB) GPUs, with PyTorch operating in DataParallel mode.


We ran the same code with Scaletorch twice: once with DLOP enabled (in virtual appliance mode) and once with DLOP disabled. Here are the results from the benchmark tests, averaged over multiple runs and epochs.


Without DLOP:

  • Average throughput: 23.59 samples/sec

With DLOP:

  • Average throughput: 436.89 samples/sec
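The speed-up in the title follows directly from these two throughput figures. A quick calculation using only the reported numbers (the variable names are ours):

```python
# Average throughputs reported above, in samples per second.
throughput_without_dlop = 23.59
throughput_with_dlop = 436.89

# Speed-up is the ratio of the two throughputs.
speedup = throughput_with_dlop / throughput_without_dlop
print(f"Speed-up: {speedup:.1f}x")
```

This works out to a little over 18x on this benchmark, which is the figure quoted in the headline.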