Apr 3, 2023

CV training on Steroids - 18x speed up

AI model complexity is expected to increase by 750x over the next 5 years, while accelerator performance is expected to increase by 6.5x-11x in the same period.

This disparity leads to longer AI model training times, which can last days or even months.

Let’s take a prominent example from the computer vision domain: Tesla AI. Tesla Autopilot is one of the most advanced computer vision solutions. Even with access to massive amounts of infrastructure and AI talent, it takes the Tesla AI division around 70,000 GPU hours (roughly 8 years of compute on a single GPU) to train its Autopilot model.

For smaller companies with limited compute and talent, numbers like these are practically impossible to compete with.

To address this challenge, the industry needs to speed up deep learning workloads and cut training time. Scaletorch offers a solution that can increase PyTorch workload execution speeds by 10x-200x without requiring code changes, using the same amount of GPU compute.

Here’s how it works.

How Can Scaletorch Speed Up Your Computer Vision Training?

Scaletorch relies on its revolutionary Deep Learning Offload Processor (DLOP) technology, which works in conjunction with GPUs and other deep learning accelerators (DLAs, such as TPUs and IPUs) to transparently speed up AI training by 10x-200x.

Scaletorch DLOP can accelerate a variety of training workloads, such as computer vision, audio, video, medical imaging, sensory data and software-defined radio. It achieves the speedup by combining two things: offloading certain operations from GPUs or DLAs to the DLOP, and processing the training pipeline asynchronously.

Scaletorch DLOP requires no change to the PyTorch script in order to accelerate training.
Scaletorch DLOP can seamlessly access training datasets from filesystems, object stores and remote data sources.
Scaletorch DLOP does not require any change to the model itself, nor does it use techniques such as quantization, pruning, distillation or selective backpropagation that would change the accuracy of the model.

Rather, Scaletorch DLOP uses offloading and low-level programming to fully leverage the capabilities of modern processors. It optimizes the AI training pipeline so that GPUs and DLAs are utilized to their maximum potential.
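To make the idea of asynchronous pipeline processing concrete, here is a minimal, generic sketch in plain Python. It is not Scaletorch’s implementation, which this post does not describe in detail. The functions `load_batch` and `train_step` are hypothetical stand-ins (simple sleeps) for data preparation and for the accelerator-side training step. The sketch only shows the general principle: when data preparation runs in the background, the training step does not have to wait for it.

```python
import queue
import threading
import time


def load_batch(i):
    """Hypothetical stand-in for CPU-side loading and preprocessing."""
    time.sleep(0.02)  # pretend decoding/augmentation takes 20 ms
    return i


def train_step(batch):
    """Hypothetical stand-in for the accelerator-side training step."""
    time.sleep(0.02)  # pretend forward/backward takes 20 ms
    return batch


def run_sequential(num_batches):
    """Load, then train, one batch at a time: the stages never overlap."""
    return [train_step(load_batch(i)) for i in range(num_batches)]


def run_pipelined(num_batches, prefetch=4):
    """Prepare batches in a background thread while the main thread trains."""
    batches = queue.Queue(maxsize=prefetch)  # bounded buffer of ready batches
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            batches.put(load_batch(i))
        batches.put(done)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = batches.get()
        if batch is done:
            break
        results.append(train_step(batch))
    return results


def timed(fn, *args):
    """Return the function result together with its wall-clock time."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    _, seq_time = timed(run_sequential, 20)
    _, pipe_time = timed(run_pipelined, 20)
    print(f"sequential: {seq_time:.2f}s, pipelined: {pipe_time:.2f}s")
```

In this toy setup both stages take the same time, so overlapping them roughly halves the wall-clock time. The gain in a real pipeline depends on how much of each step is spent waiting on data.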

The Scaletorch DLOP is available as a virtual appliance that runs in AWS, GCP and Azure, or as a physical appliance for on-premise implementations. Multiple Scaletorch DLOPs can be clustered together to provide scale-out offload capacity.

Some Benchmarks

To demonstrate the efficiency and performance gains of Scaletorch DLOP, we conducted a variety of benchmark tests with different datasets (spanning image, video, audio, medical imaging and LiDAR data) and model architectures.

Let us first consider the Charades dataset, which is composed of 9,848 videos of daily indoor activities collected through Amazon Mechanical Turk. The following are the specifications for the benchmark test.

Charades Benchmark | Object Detection

Dataset used: Charades, which contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos.

Model architecture: 3DResNet50

Machine config: We used a machine with 8 × Nvidia A100 (40GB), with PyTorch operating in DataParallel mode.

We ran the same code with Scaletorch twice, once with DLOP enabled (in virtual appliance mode) and once with DLOP disabled. Here are the results from the benchmark tests, averaged over multiple runs and epochs.

Without DLOP:

Average throughput: 23.59 samples/sec

With DLOP:

Average throughput: 436.89 samples/sec

As you can see, for the Charades dataset we observe a ~18x increase in average throughput. A job that would previously have taken 18 hrs now finishes in roughly 1 hr, which can significantly speed up your research and development process. In terms of operational costs, an 8 × Nvidia A100 (40GB) machine on AWS costs $32.77/hr for on-demand instances. At that rate, the 18-hr job costs about $590 without DLOP, versus about $33 for the 1-hr run with DLOP.

That is a saving of ~$557 on a single training job (17 fewer hours at $32.77/hr). For a startup, these training time and cost reductions can be a huge game changer!
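The arithmetic behind these figures can be reproduced with a short sketch. It uses only the numbers quoted in this post (the two throughputs, the $32.77/hr on-demand rate and the illustrative 18-hr job). The function and variable names are ours, chosen for illustration.

```python
HOURLY_RATE_USD = 32.77      # on-demand rate quoted above for 8 x A100 (40GB) on AWS
BASELINE_THROUGHPUT = 23.59  # samples/sec without DLOP
DLOP_THROUGHPUT = 436.89     # samples/sec with DLOP


def speedup(baseline, accelerated):
    """Ratio of accelerated to baseline throughput."""
    return accelerated / baseline


def job_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Cost in USD of renting the machine for the given number of hours."""
    return hours * hourly_rate


factor = speedup(BASELINE_THROUGHPUT, DLOP_THROUGHPUT)

baseline_hours = 18.0                       # illustrative job length from the text
dlop_hours_exact = baseline_hours / factor  # slightly under one hour
dlop_hours_rounded = 1.0                    # the text rounds this to "roughly 1 hr"

savings = job_cost(baseline_hours) - job_cost(dlop_hours_rounded)

print(f"speedup:            {factor:.1f}x")
print(f"cost without DLOP:  ${job_cost(baseline_hours):.2f}")
print(f"cost with DLOP:     ${job_cost(dlop_hours_rounded):.2f}")
print(f"savings:            ${savings:.2f}")
```

The saving is computed against a full billed hour with DLOP, which matches the rounding used in the text.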

Let’s take a look at another Computer Vision benchmark result. For this test, we used the MICCAI Gleason 2019 Challenge dataset. The following are the specifications for the benchmark test.

MICCAI Gleason 2019 Challenge | 2D Medical Imaging

Dataset used: MICCAI Gleason 2019 Challenge dataset, a 2D medical imaging dataset consisting of TMA (tissue microarray) images, where the aim is to predict the Gleason grade for prostate cancer. There are 6 labels in total.

Model architecture: UNet2D

Machine config: We used a machine with 8 × Nvidia A100 (40GB), with PyTorch operating in DataParallel mode.

The training was run for a single epoch of 500,000 samples. After running the same training code with and without DLOP enabled, here are the benchmark results that we got.

Without DLOP:

Average throughput: 83 images/sec

With DLOP:

Average throughput: 810 images/sec

That is a ~10x speedup, again without requiring any code or configuration change at all. To put this in perspective, a job that would have cost you approx. 10 hrs of cloud training time now costs only about 1 hr. So Scaletorch DLOP not only significantly reduces your training time, it can also help you cut down on your cloud bills.
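As a rough cross-check, the time for the single benchmark epoch can be estimated from the sample count and the average throughputs reported above. This is a back-of-the-envelope sketch that assumes throughput stays constant over the whole epoch; the function name is ours.

```python
SAMPLES_PER_EPOCH = 500_000


def epoch_hours(num_samples, throughput_per_sec):
    """Hours needed to process num_samples at a constant throughput."""
    return num_samples / throughput_per_sec / 3600


without_dlop = epoch_hours(SAMPLES_PER_EPOCH, 83)   # images/sec without DLOP
with_dlop = epoch_hours(SAMPLES_PER_EPOCH, 810)     # images/sec with DLOP

print(f"without DLOP: {without_dlop:.2f} hrs")
print(f"with DLOP:    {with_dlop * 60:.1f} min")
print(f"speedup:      {without_dlop / with_dlop:.1f}x")
```

Under that assumption, the epoch takes a little over an hour and a half without DLOP and around ten minutes with it. The ratio is the same ~10x, so the 10 hrs versus 1 hr comparison holds for any job length scaled accordingly.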

To learn more about benchmark results across other data types, and to see if Scaletorch is the right fit for your organisation, you can schedule an onboarding call with us: https://calendly.com/d/dsw-ww6-pnc/onboarding-session
