Deep Learning Offload Processor


The DLOP is a technology that complements GPUs and DLAs to speed up AI model training by up to 200x.

Problem we are solving


The rapid growth of AI and the excitement around foundation models has resulted in exponential growth in the demand for computing power to train AI models. The amount of compute used in the largest AI training runs has been doubling every 3-4 months.

This has led to long training times and, as a result, frustration among deep learning engineers and researchers. It has also created compute shortages and spiralling costs for training AI models.

DLOP aims to overcome this problem and speed up experiments for AI researchers by complementing existing GPUs with offloading techniques.

AI Training Speed Up with
Scaletorch Deep Learning Offload Processor

Scaletorch DLOP functionally splits/distributes the training process between a machine with multiple GPUs and a large number of pure CPU machines. The CPU machines are completely stateless and hence one can use spot/preemptible instances when using AWS, Google Cloud or Azure. With the DLOP approach, only the core DNN training operations run on the machine that hosts the GPUs. All other functions are offloaded to stateless CPU machines.

Classical Approach:
Distributed Multi-Node

To scale and speed up training, the most commonly used approach is distributed data-parallel training, in which multiple machines, each with multiple GPUs, are used. All machines need to have the same GPU type and the same number of GPUs. Multi-node training scales sub-linearly in performance.
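The synchronous gradient averaging at the heart of distributed data-parallel training can be illustrated with a minimal sketch. This is a pure-Python toy (a single scalar weight, a simulated all-reduce over worker gradients), not the Scaletorch or PyTorch API; real systems use collective-communication libraries such as NCCL, and the all-reduce step is the synchronization point that makes scaling sub-linear.

```python
# Toy sketch of synchronous data-parallel training: each worker computes
# the gradient on its own data shard, then all workers average their
# gradients ("all-reduce") before applying the same update everywhere.

def local_gradient(w, shard):
    # Per-worker gradient of a squared-error loss sum((w*x - y)^2)
    # over that worker's shard of (x, y) pairs.
    return sum(2 * (w * x - y) * x for x, y in shard)

def allreduce_mean(grads):
    # Stand-in for a collective all-reduce: every worker ends up with
    # the mean gradient. This is where all workers must wait for each
    # other, which is why multi-node training scales sub-linearly.
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.01):
    grads = [local_gradient(w, s) for s in shards]  # parallel in reality
    g = allreduce_mean(grads)                       # synchronization point
    return w - lr * g

# Two "workers", data drawn from y = 2x, so w should converge to 2.0.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(100):
    w = train_step(w, shards)
print(w)  # converges toward 2.0
```

Note that every worker applies the identical averaged update, which is why all machines must look alike (same GPU type and count) and advance in lock-step.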

The DLOP creates an asynchronous pipeline between the machine with GPUs and the pure CPU machines so as to maximize performance.
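The asynchronous-pipeline idea can be sketched with a bounded queue between producer and consumer roles. This is an illustrative toy, not the Scaletorch implementation: threads stand in for stateless CPU machines, doubling a number stands in for decode/augment work, and summing a batch stands in for a GPU training step.

```python
# Sketch of an asynchronous offload pipeline: stateless "CPU worker"
# threads prepare batches ahead of time and feed a bounded queue, so
# the "GPU" consumer loop never stalls waiting on data preparation.
import queue
import threading

def cpu_worker(raw_batches, out_q):
    for batch in raw_batches:
        prepared = [x * 2 for x in batch]  # stand-in for decode/augment
        out_q.put(prepared)

def run_pipeline(raw_batches, num_workers=2):
    out_q = queue.Queue(maxsize=4)  # bounded queue applies backpressure
    # Shard the work across stateless workers; any worker could be
    # replaced (e.g. a lost spot instance) without losing training state.
    shards = [raw_batches[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=cpu_worker, args=(s, out_q))
               for s in shards]
    for t in threads:
        t.start()
    results = []
    for _ in range(len(raw_batches)):     # "GPU" consumer loop
        results.append(sum(out_q.get()))  # stand-in for a training step
    for t in threads:
        t.join()
    return sorted(results)  # arrival order varies; sort for determinism

print(run_pipeline([[1, 2], [3, 4], [5, 6]]))  # → [6, 14, 22]
```

Because the workers hold no training state, losing one mid-run costs only a batch of preparation work, which is what makes spot/preemptible CPU instances viable.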

The DLOP disaggregates AI model training across the machine with GPUs and a large number of purely CPU machines, resulting in a 10x-200x speed-up when training or fine-tuning an AI model.

AI Training Speed up using DLOP

Multiple processes are executed on the offload processors. As a result, training finishes earlier.

Classical Approach: Distributed Multi-Node Training

There exist large bubbles where GPUs sit idle. This also leads to a large amount of context switching and inefficient memory access.

The DLOP makes no changes or modifications to the model.

Deep learning researchers can use the same parameters (e.g. batch size) that they would normally use when training on a single machine with multiple GPUs.

Scaletorch DLOP accelerates a variety of training workloads.






Geospatial Data

Medical Imaging

Signal Processing

Satellite Imaging

Ready to Boost Up Your Training Performance?
Get Started with Scaletorch!