AI & Deep Learning: Zero-Communication Model Parallelism for Distributed Extreme-Scale Deep Learning

**Speakers**
## Anshumali Shrivastava

Current deep learning architectures are growing larger in order to learn from complex datasets. Microsoft achieved a breakthrough in image recognition with a 152-layer neural network, popularly known as ResNet, containing around 60 million parameters, a model roughly ten times bigger than the previous best performer, Google's GoogLeNet. Last year, Google demonstrated the need for a 137-billion-parameter network, trained on specialized hardware, for language modeling. Quoting that paper: "such model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora."

The quest for a unified machine learning algorithm that can simultaneously generalize from diverse sources of information (i.e., transfer learning) has made it imperative to train astronomically sized neural networks with enormous computational and memory overheads. A basic information-theoretic argument implies that to assimilate all the information needed to map inputs to complex and diverse decisions, the capacity of the model cannot be small.

A simple back-of-the-envelope calculation shows that for a neural network with 137 billion weights, the model itself will require around 500 GB of memory. Training will need at least around 1.5 TB of working memory with popular optimizers like Adam.
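The calculation above can be reproduced directly; the figures are approximate and assume 32-bit (4-byte) floating-point parameters:

```python
# Back-of-the-envelope memory estimate for a 137-billion-parameter network.
num_params = 137e9
bytes_per_float = 4  # 32-bit floats

model_gb = num_params * bytes_per_float / 1e9
print(f"model weights: ~{model_gb:.0f} GB")        # ~548 GB, i.e. "around 500 GB"

# Adam keeps two extra per-parameter buffers (first and second moments),
# so the training state is roughly 3x the model size.
adam_tb = 3 * model_gb / 1e3
print(f"Adam training state: ~{adam_tb:.1f} TB")   # ~1.6 TB, i.e. "around 1.5 TB"
```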

Due to the growing size and complexity of networks, efficient algorithms for training massive deep networks in distributed and parallel environments are currently among the most sought-after problems in both academia and industry. Recently, Google released Mesh-TensorFlow for distributed training of neural networks. In distributed computing environments, the parameters of giant deep networks must be split across multiple nodes. However, this setup requires costly communication and synchronization between the parameter server and the processing nodes to transfer gradient and parameter updates. The sequential and dense nature of gradient updates prohibits any efficient splitting (sharding) of the neural network parameters across compute nodes. There is no clear way to avoid this costly synchronization without resorting to some ad-hoc partitioning of the network, which is not well understood and is likely to hurt performance. Synchronization is one of the most significant hurdles to scalability.

As a result, model parallelism and parameter sharding in distributed environments remain among the most sought-after problems in the HPC community.

In this talk, I will show a surprising connection between predicting the measurements of a count-min sketch, a popular streaming algorithm for finding frequent items, and predicting the class with maximum probability in a classifier such as a deep network. With this connection, we will show a provable and straightforward randomized algorithm for multi-class classification that requires resources logarithmic in the number of classes. The technique is generic and can decompose any given large network into logarithmically sized small networks that can be trained independently, without any communication. In practice, when applied to an industry-scale dataset with 100,000 classes, we obtain around twice the best-reported accuracy with a more than 100x reduction in computation and memory. Using the simple idea of hashing, we can train on the ODP dataset, with 100,000 classes and 400,000 features, to a classification accuracy of 19.28%, the best reported on this dataset. Before this work, the best-performing baseline was a one-vs-all classifier that requires 40 billion parameters (a 160 GB model) and achieves 9% accuracy. All prior work on the ODP dataset required sizable clusters and days of training; we achieve the same in hours on a single machine. This work is an ideal illustration of the power of randomized algorithms for ML, where randomization can reduce computation while increasing accuracy through implicit regularization.
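The count-min-sketch view of extreme classification can be sketched as follows. This is an illustrative toy, not the speaker's implementation: the bucket count `B`, the number of repetitions `R`, and the placeholder `meta_predict` models are all assumptions standing in for real, independently trained small networks.

```python
import numpy as np

rng = np.random.default_rng(0)
K, B, R = 1000, 32, 4                      # classes, buckets (B << K), repetitions

# R independent hash maps from the K classes into B buckets,
# playing the role of the count-min sketch's hash functions.
hashes = rng.integers(0, B, size=(R, K))

def meta_predict(r, x):
    """Placeholder for the r-th small classifier, trained with no
    communication on bucket labels hashes[r]; returns a distribution
    over its B buckets for input x."""
    logits = rng.standard_normal(B)        # stand-in for a real model's output
    e = np.exp(logits - logits.max())
    return e / e.sum()

def predict_class(x):
    # Count-min-style decoding: look up each class's bucket probability
    # in every repetition, average, and take the argmax over all K classes.
    scores = np.zeros(K)
    for r in range(R):
        p = meta_predict(r, x)
        scores += p[hashes[r]]             # fancy index: bucket score per class
    return int(np.argmax(scores / R))

print(predict_class(None))
```

Because each of the R small models trains against only its own hashed bucket labels, the models never exchange gradients or parameters, which is the source of the communication-free property.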

And, to sweeten the deal, the algorithm is embarrassingly parallelizable over GPUs. We get a perfectly linear speedup as the number of GPUs increases because the algorithm is provably communication-free. Our experiments show that we can train the ODP dataset in 7 hours on a single GPU, or in 15 minutes with 25 GPUs. Similarly, we can train classifiers over the fine-grained ImageNet dataset in 24 hours on a single GPU, which can be reduced to a little over 1 hour with 20 GPUs.

We provide the first framework for model parallelism that requires no communication. Further exploration could revolutionize distributed deep learning with large numbers of classes. We hope this method will be widely adopted.


Professor, Rice University; Founder, ThirdAI Corp

Anshumali Shrivastava's research focuses on large-scale machine learning, scalable and sustainable deep learning, and randomized algorithms for big data and graph mining.

Monday March 4, 2019 3:50pm - 4:10pm CST

Auditorium

AI & Deep Learning

**Host Organization** The Ken Kennedy Institute