What does it mean to train like OMG-D?

Updated: Apr 27

These days we are seeing a flurry of amazing AI breakthroughs from the OMG-D companies: OpenAI, Meta AI, Google DeepMind, Microsoft. It is official - the AI arms race is in full swing.

This raises the question: why is it that only these big companies are in the news when it comes to interesting breakthroughs in AI? Is it because they have cornered the most talented and brightest researchers in the field?

There is no doubt that these companies have amazingly talented people and some of the finest minds in AI working for them. However, there are equally talented people and very bright minds doing research at universities, building startups and working for smaller companies.

A key factor in AI these days is the GPU compute power required to build and train AI models. FAANG companies have built huge internal clusters, making tens of thousands of GPUs available to their deep learning engineers and researchers.

Let’s look at some examples. Tesla runs a cluster of 7,360 A100 GPUs for its research on self-driving. Meta’s AI Research SuperCluster has 16,000 A100 GPUs. These clusters not only entail massive capital expenditure, but also require highly qualified staff to optimize and maintain. In return, they give AI research superpowers to their deep learning engineers and researchers.

But what if an AI engineer from a startup, or from a company with an AI research lab (where they don’t have access to 1000s of GPUs), wants to solve the same complex problems, apply the latest AI technologies, build more sophisticated models, and work with larger datasets? They have the knowledge, but the computing power is just not there. This is where Scaletorch comes into play.

At Scaletorch we found a way to speed up AI training by up to 30x on the same number of GPUs, without any code change. Now a single machine with 8 GPUs has the same computing ability as 30 machines with a total of 240 GPUs. For a video AI startup, Scaletorch reduced the time to train a single epoch from 4 hours to 13 minutes on the same number of GPUs.
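To put those numbers in perspective, here is a quick back-of-the-envelope check in Python. It is only a sketch: the helper names are ours, chosen for illustration, and the GPU-equivalence figure assumes ideal linear scaling, which real multi-node clusters rarely achieve.

```python
def speedup(baseline_minutes: float, optimized_minutes: float) -> float:
    """Return how many times faster the optimized run is."""
    return baseline_minutes / optimized_minutes


def equivalent_gpus(gpus: int, factor: float) -> float:
    """GPUs needed to match `gpus` sped up by `factor` (assumes ideal linear scaling)."""
    return gpus * factor


# The video AI startup example: one epoch went from 4 hours to 13 minutes.
epoch_speedup = speedup(4 * 60, 13)

# The headline claim: 8 GPUs running up to 30x faster.
gpu_equivalent = equivalent_gpus(8, 30)
machines = gpu_equivalent / 8  # machines with 8 GPUs each

print(f"Epoch speedup: {epoch_speedup:.1f}x")
print(f"8 GPUs at 30x ~ {gpu_equivalent:.0f} GPUs, or {machines:.0f} machines with 8 GPUs each")
```

So the epoch example works out to roughly an 18.5x speedup, which sits inside the "up to 30x" range quoted above.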

Sounds great! Next question: how is that possible?

At Scaletorch, we developed the Deep Learning Offload Processor (DLOP) engine.

The Scaletorch DLOP is a technology to speed up deep learning training. It is available purely as a software appliance or as a software+hardware combination.

The solution works in conjunction with GPUs and other deep learning accelerators (such as TPUs and IPUs) to transparently speed up AI training by 10x-200x without any changes to your PyTorch script or setup.

Compared with the classical approach, in which engineers use distributed multi-node training, DLOP doesn't require additional GPUs to scale. Instead, we run part of the pipeline on CPUs so that the GPUs are used only for DNN and matrix-related operations. This helps avoid both inconsistency across runs and a linear increase in cost.
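To make the general idea of offloading concrete, here is a minimal, generic sketch of overlapping CPU-side data preparation with accelerator compute. This is not the DLOP implementation, whose internals are not described here. Every name in it (cpu_preprocess, accelerator_step, producer, train) is hypothetical, and the "accelerator" step is simulated with plain Python so the example runs anywhere.

```python
import queue
import threading


def cpu_preprocess(sample):
    """Stand-in for CPU-side work such as decoding or augmentation."""
    return [x / 255.0 for x in sample]


def accelerator_step(batch):
    """Stand-in for the DNN and matrix operations that would run on a GPU."""
    return sum(sum(row) for row in batch)


def producer(samples, batch_size, out_q):
    """Prepare batches on the CPU and hand them over through a bounded queue."""
    batch = []
    for sample in samples:
        batch.append(cpu_preprocess(sample))
        if len(batch) == batch_size:
            out_q.put(batch)
            batch = []
    if batch:
        out_q.put(batch)  # final, possibly smaller, batch
    out_q.put(None)  # sentinel: no more data


def train(samples, batch_size=2, prefetch=4):
    """Consume prepared batches while the producer keeps working ahead."""
    q = queue.Queue(maxsize=prefetch)
    worker = threading.Thread(target=producer, args=(samples, batch_size, q), daemon=True)
    worker.start()
    results = []
    while True:
        batch = q.get()
        if batch is None:
            break
        results.append(accelerator_step(batch))
    worker.join()
    return results


samples = [[0, 255], [255, 255], [0, 0], [255, 0], [255, 255]]
losses = train(samples)
print(losses)
```

The bounded queue lets the CPU side run a few batches ahead, so the compute step is not left waiting for data. In a real training job the producer would typically run in separate processes rather than a thread, since Python threads do not execute CPU-bound work in parallel.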

Thus, the GPU performance looks like this:

This approach helps us train AI models 10x-200x faster, depending on the type of workload. Still don’t believe it? Just try it and see!