New Technology

Deep Learning Offload Processor

To achieve the desired speedup for image, video, and audio training, we created the Deep Learning Offload Processor (DLOP). The DLOP complements GPUs and deep learning accelerators (DLAs). The speedup comes from splitting the AI training pipeline across a combination of GPU hosts and offloaders, and executing the entire pipeline in a non-blocking, asynchronous way.

The Problem We Are Solving

The rapid growth of AI and the excitement around foundation models have resulted in exponential growth in the demand for computing power to train AI models. The amount of compute used in the largest AI training runs has been doubling every 3-4 months.

This has led to long training times and, as a result, frustration among deep learning engineers and researchers. It has also created compute shortages and spiralling costs for training AI models.

DLOP aims to overcome this problem and speed up experiments for AI researchers by complementing existing GPUs with offloading techniques.

Classical Approach: Distributed Multi-Node Training

To scale and speed up training, the most commonly used approach is multi-node distributed training, in which multiple machines, each with multiple GPUs, are used.

❌ All machines need to have the same GPU type and the same number of GPUs.

❌ Multi-node training scales sub-linearly in performance. For example, going from 1 machine to 10 machines typically gives a 3x-5x speedup rather than the ideal 10x.

❌ GPUs are in short supply due to increased demand. They are also expensive and power hungry.
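To make the sub-linear scaling figure concrete, the short sketch below computes scaling efficiency, defined here as achieved speedup divided by the number of machines. The function name `scaling_efficiency` is our own illustration and not part of any Scaletorch or PyTorch API; the inputs are the 3x-5x on 10 machines example quoted above.

```python
def scaling_efficiency(speedup: float, num_machines: int) -> float:
    """Return the fraction of ideal (linear) speedup actually achieved."""
    if num_machines <= 0:
        raise ValueError("num_machines must be positive")
    return speedup / num_machines


# The 10-machine example from the text: 3x to 5x instead of the ideal 10x.
for speedup in (3.0, 5.0):
    efficiency = scaling_efficiency(speedup, 10)
    print(f"{speedup:.0f}x on 10 machines -> {efficiency:.0%} of linear scaling")
```

In other words, at 10 machines roughly half to two thirds of the added GPU capacity is not translating into faster training.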

AI Training Speed Up with Scaletorch Deep Learning Offload Processor

Scaletorch software functionally splits the PyTorch training process between a machine with multiple GPUs and DLOP machines, which are CPU-based or CPU+FPGA-based.

With the DLOP approach, only the core DNN training operations run on the machine that hosts the GPUs. All other functions are offloaded to stateless DLOP machines.
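The general idea of keeping the training loop busy while other work happens elsewhere can be sketched in a few lines of standard-library Python. This is only an illustration of a non-blocking producer/consumer pipeline, not Scaletorch's implementation: the thread pool stands in for remote offload machines, and `preprocess`, `train_step`, and `run_pipeline` are hypothetical names invented for this example.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def preprocess(sample: int) -> int:
    """Hypothetical stateless transform (stands in for decode/augment work)."""
    return sample * 2


def train_step(batch: list) -> int:
    """Hypothetical stand-in for the core training step on the GPU host."""
    return sum(batch)


def run_pipeline(samples, batch_size=4, num_workers=4, prefetch=2):
    """Prepare batches in the background while the training loop consumes them."""
    batches = queue.Queue(maxsize=prefetch)  # bounded queue limits memory use

    def produce():
        # The pool plays the role of the offload workers (an assumption).
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            for start in range(0, len(samples), batch_size):
                chunk = samples[start:start + batch_size]
                batches.put(list(pool.map(preprocess, chunk)))
        batches.put(None)  # sentinel: no more batches

    threading.Thread(target=produce, daemon=True).start()

    results = []
    while (batch := batches.get()) is not None:
        # Training on this batch overlaps with preparation of the next ones.
        results.append(train_step(batch))
    return results


print(run_pipeline(list(range(8))))
```

Because `preprocess` keeps no state between calls, any worker can handle any sample, which is the property that lets this kind of work move off the GPU host.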

✔️ No changes to model, code or infrastructure.

✔️ The speed-up depends on the model and dataset, and can reach up to 60x for vision workloads and 200x for audio workloads.

✔️ Doesn't need additional GPUs.

✔️ The CPU machines are completely stateless, so spot/preemptible instances can be used on AWS, Google Cloud, or Azure.

✔️ For on-premises deployments, DLOP hardware can simply be added to the existing network, with no changes to infrastructure.
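Why statelessness makes spot/preemptible instances practical can be shown with a small sketch: if a worker disappears mid-task, the same input can simply be resent to another worker, because there is no worker-side state to recover. Everything below (`Preempted`, `run_with_retry`, and the two toy workers) is hypothetical and written only for illustration; it does not describe how DLOP schedules work.

```python
class Preempted(Exception):
    """Hypothetical signal that a spot worker was reclaimed mid-task."""


def run_with_retry(task, payload, workers):
    """Try each worker in turn; safe only because the task is stateless."""
    last_error = None
    for worker in workers:
        try:
            return worker(task, payload)
        except Preempted as err:
            last_error = err  # nothing to restore: resend the same input
    raise RuntimeError("all workers were preempted") from last_error


def flaky_worker(task, payload):
    """Toy worker that is always reclaimed before finishing."""
    raise Preempted("instance reclaimed")


def healthy_worker(task, payload):
    """Toy worker that runs the task to completion."""
    return task(payload)


print(run_with_retry(lambda x: x + 1, 41, [flaky_worker, healthy_worker]))
```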

The DLOP does not make any changes or modifications to the model.

The deep learning researcher can use the same parameters (e.g., batch size) that they would normally use when training on a single machine with multiple GPUs.