We are pleased to announce the BOLT Engine for training large deep learning models on any CPU. BOLT is an algorithmic accelerator for training deep learning models that can match or even surpass GPU-level performance on commodity CPU hardware. BOLT is the first engine where the reduction in computation is exponential. The BOLT algorithm trains neural networks using 1% or fewer of the FLOPs, unlike standard tricks such as quantization, pruning, and structured sparsity, which offer only a small constant-factor improvement. As a result, BOLT does not rely on any specialized instructions, and the speedups appear naturally on any CPU, be it Intel, AMD, or ARM. Even older commodity CPUs can train billion-parameter models faster than A100 GPUs. On top of that, the BOLT Engine can be invoked by changing just a few lines in existing Python machine learning pipelines. To know more about our technology, click here.
Here is a snapshot of the savings with BOLT on a 200M-parameter neural network trained on the Kaggle Amazon 670k recommendation dataset. We compare the time to finish one epoch for the BOLT Engine on CPU against TensorFlow, benchmarking on both an AMD CPU and an Intel CPU. To put things in perspective, we also show the performance of training the same network on TensorFlow accelerated with powerful A100 GPUs. As the figure shows, BOLT dramatically exceeds the performance of TensorFlow on the same CPU hardware. Quite remarkably, BOLT on refurbished desktops can even eclipse the performance of TensorFlow running on a powerful A100 GPU. More details are given here.
As explained in our other article, larger models need larger batch sizes for both speed and generalization. However, GPUs have limited memory and cannot scale to the needs of a billion-parameter network.
We notice that the top-of-the-line A100 GPU with 48 GB memory can barely accommodate a batch size of 256 for our 1.6-billion-parameter network. To run a batch size of 2048, we need to distribute the model and training over 4 A100 GPUs (on a machine that costs around $70,000). Even 2 A100s cannot accommodate the training.
With the BOLT Engine, we can effortlessly scale up to batch sizes of a few thousand with no change in model memory. Moreover, the BOLT experiments were run on an old dual Intel Gold V3 processor purchased for under $1,000. Running the same model on the same CPU with TensorFlow is 5x slower. More details here.
We package the BOLT Engine in a simple Python API that lets the user specify the network structure easily. The network object also has simple methods for retrieving the trained parameters of each layer as NumPy arrays, so the trained model can easily be ported to TensorFlow or PyTorch. We provide helper functions that perform this porting automatically. We also provide drop-in functionality: a user can specify a TensorFlow model and pass it to our script, which trains it with the BOLT Engine for acceleration and returns it, all with a 2-line code change.
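To make the porting step concrete, here is a minimal sketch of what moving per-layer NumPy parameters between frameworks can look like. It does not use the BOLT API: the `exported` dictionary, its layer names, and its shapes are hypothetical stand-ins for whatever arrays a trained network hands back, and the two-layer ReLU network is assumed purely for illustration. The sketch writes the arrays to a framework-neutral `.npz` archive, reads them back, and checks that a plain NumPy forward pass gives the same output before and after.

```python
import io

import numpy as np

# Hypothetical stand-in for per-layer parameters retrieved from a trained
# network. Names and shapes are illustrative only, not BOLT's actual output.
rng = np.random.default_rng(0)
exported = {
    "hidden_weights": rng.standard_normal((4, 8)),  # (inputs, hidden units)
    "hidden_biases": np.zeros(8),
    "output_weights": rng.standard_normal((8, 3)),  # (hidden units, outputs)
    "output_biases": np.zeros(3),
}


def forward(params, x):
    """Run an assumed two-layer ReLU network straight from the arrays."""
    hidden = np.maximum(x @ params["hidden_weights"] + params["hidden_biases"], 0.0)
    return hidden @ params["output_weights"] + params["output_biases"]


# Write every array to an in-memory .npz archive, then read it back.
buffer = io.BytesIO()
np.savez(buffer, **exported)
buffer.seek(0)
restored = {name: array for name, array in np.load(buffer).items()}

# The restored parameters should reproduce the original outputs exactly.
batch = rng.standard_normal((2, 4))
original_output = forward(exported, batch)
restored_output = forward(restored, batch)
```

When loading such arrays into a target framework, the weight layout matters: a Keras `Dense` kernel is stored as (inputs, units), while a PyTorch `nn.Linear` weight is stored as (out_features, in_features), so a transpose is typically needed for one of the two.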
BOLT is currently available for trial evaluation. Please fill out this form to get access.