Organizations such as OpenAI, Meta AI Research and DeepMind have access to scalable compute as well as systems engineering talent, a key weapon in their arsenal that helps them churn out breakthroughs in pure and applied AI.
The need for AI compute power is expected to grow by 750x over the next 5 years, while hardware performance is expected to grow by only 12x over the same period. This gap of roughly 60x can only be bridged via software innovation.
Scaletorch is a team of PyTorch-obsessed software and deep learning engineers working round the clock to reduce AI training time and cost by an order of magnitude.
What Clients Are Saying
"Being the leading company in the industry, we needed a solution that could keep up with our high demands. Since we have our own infrastructure and do training on-premise, building a new infrastructure would've been a hefty and expensive task. So, we integrated Scaletorch into our training infrastructure. Now, our 100 GPUs are capable of doing the work of 500 GPUs."
- Leading E-commerce company worldwide
Palo Alto, CA, USA
Frequently Asked Questions
When evaluating a deep learning offload processor, various questions are likely to arise. Here are some of them:
How is the DLOP different from multi-node distributed training?
With multi-node distributed training, the data is split across multiple GPUs spread across multiple hosts. For good throughput, each host needs the same type and number of GPUs. Scaling with distributed data parallel is generally sub-linear and involves considerable systems engineering complexity. DLOP, on the other hand, offloads many aspects of the DNN training pipeline to CPU or CPU+FPGA machines on the same network as the GPU hosts; we call these machines offloaders. Offloaders are stateless and can therefore run on spot instances on clouds such as AWS, GCP and Azure. The recommended setup with DLOP is a single GPU host with multiple GPUs (preferably connected over NVLink) and multiple offload machines (CPU-only or CPU+FPGA).
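To make the idea of stateless offloaders concrete, here is a minimal sketch in plain Python. It is not Scaletorch's implementation: the `preprocess`, `offload_batch` and `train_step` names are hypothetical, and a local thread pool stands in for remote offloader machines. It only illustrates why a stateless step can be handed to any available worker.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(sample: int) -> float:
    """Hypothetical stateless step: the output depends only on the input."""
    return (sample - 128) / 128.0


def offload_batch(samples, num_offloaders=4):
    """Fan a batch out to a worker pool standing in for offloader machines."""
    with ThreadPoolExecutor(max_workers=num_offloaders) as pool:
        # map() returns results in input order, whichever worker ran them.
        return list(pool.map(preprocess, samples))


def train_step(batch):
    """Stand-in for the GPU-side forward/backward pass."""
    return sum(batch) / len(batch)


batch = offload_batch([0, 128, 256])
loss = train_step(batch)
```

Because `preprocess` keeps no state between calls, the result is the same for any number of workers, and work lost to a preempted worker could simply be resubmitted elsewhere. That property is what makes spot instances a reasonable fit.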
What is the Scaletorch Deep Learning Offload Processor (DLOP)?
The Scaletorch Deep Learning Offload Processor accelerates deep learning training by 10x-200x. It is available as a software appliance or as a software+hardware combo. It integrates with GPUs and other accelerators (TPUs, IPUs) without requiring changes to your PyTorch script or setup.
For what kinds of workloads does DLOP accelerate AI model training?
DLOP accelerates a wide variety of workloads, such as computer vision, audio, video, signal processing, 3D point clouds and LIDAR. DLOP currently does not accelerate NLP workloads or generative AI workloads (such as image generation or text-to-image).
Does the DLOP accelerate AI training by using methods like quantization, sparsity, pruning and distillation that change the model accuracy?
DLOP does not use techniques such as quantization, sparsity or pruning that change model accuracy. It accelerates AI model training by offloading parts of the training pipeline to CPU or CPU+FPGA hosts separate from the GPU hosts, and by executing the training pipeline asynchronously. On AWS, GCP and Azure, we use spot CPU instances. For on-premise workloads, we provide one or more hardware appliances with a large number of CPU cores or CPU+FPGA for offloading. Techniques such as quantization, pruning and distillation are orthogonal to what DLOP does and can be used in conjunction with DLOP for further training speedup and model compression.
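As a rough illustration of asynchronous pipeline execution in general, the sketch below prepares upcoming batches in a background thread while the consumer trains on the current one. It is an assumption-laden toy, not how DLOP works internally: `prefetch`, `prepare` and `train_step` are hypothetical names, and a bounded in-process queue stands in for the network between offloaders and the GPU host.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, prepare, depth=2):
    """Yield prepare(batch) for each batch, preparing ahead in the background."""
    buf = queue.Queue(maxsize=depth)  # bounded, so the producer cannot run far ahead

    def producer():
        try:
            for b in batches:
                buf.put(prepare(b))
        finally:
            buf.put(_DONE)  # always signal completion, even if prepare() fails

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def prepare(batch):
    """Hypothetical offloaded step, e.g. decoding or augmentation."""
    return [x * 2 for x in batch]


def train_step(batch):
    """Stand-in for the GPU-side training step."""
    return sum(batch)


losses = [train_step(b) for b in prefetch([[1, 2], [3, 4], [5]], prepare)]
```

The numerical results match those of a fully sequential loop; only the timing changes, which is why this kind of overlap leaves model accuracy untouched.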
Is DLOP only for training models from scratch, or is it also applicable to fine-tuning and transfer learning workloads?
DLOP works for both pre-training and fine-tuning/transfer learning scenarios. The speedup is much higher in fine-tuning/transfer learning scenarios than when training from scratch.
Do I need to change my training scripts or infrastructure?
Scaletorch requires no changes to your training scripts or infrastructure.