There are two main approaches to training models across multiple devices: model parallelism, where the model is split across the devices, and data parallelism, where the model is replicated across every device and each replica is trained on a subset of the data. Let’s look at these two options closely to understand how training models across multiple devices works.
Training Models using Model Parallelism
So far we have trained each neural network on a single device. What if we want to train a single neural network across multiple devices? This requires chopping the model into separate chunks and running each chunk on a different device. Unfortunately, such model parallelism turns out to be pretty tricky, and it depends on the architecture of your neural network.
For fully connected networks, there is generally not much to be gained from this approach. Intuitively, it may seem that an easy way to split the model is to place each layer on a different device, but this does not work because each layer needs to wait for the output of the previous layer before it can do anything.
So perhaps you can slice the model vertically, for example with the left half of each layer on one device and the right half on another? This is slightly better, since both halves of each layer can indeed work in parallel, but the problem is that each half of the next layer requires the output of both halves, so there will be a lot of cross-device communication. Since cross-device communication is slow, this is likely to completely cancel out the benefit of the parallel computation.
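To make the communication cost of the vertical split concrete, here is a minimal sketch in plain Python. The two "devices" are only simulated with lists, and the helper names (matvec, split_layer_forward) and the small weight matrices are made up for illustration. It shows that each half of a layer can be computed independently, but the halves must be gathered before the next layer can start.

```python
# Toy illustration only: the two "devices" are simulated in a single process.

def matvec(weights, x):
    """Compute weights @ x, where weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def split_layer_forward(w_first_half, w_second_half, x):
    """Run one dense layer whose neurons are split across two devices.

    Each half owns some of the layer's neurons, but every neuron still
    needs the full input vector x, so x must be present on both devices.
    """
    out_first = matvec(w_first_half, x)    # would run on device 0
    out_second = matvec(w_second_half, x)  # would run on device 1, in parallel
    # Cross-device communication: both partial outputs must be gathered
    # into one vector before either half of the next layer can run.
    return out_first + out_second


# Layer 1 has 4 neurons and 2 inputs; layer 2 has 2 neurons and 4 inputs.
w1 = [[1, 0], [0, 1], [1, 1], [1, -1]]
w2 = [[1, 1, 1, 1], [1, 0, 0, 0]]
x = [2, 3]

h = split_layer_forward(w1[:2], w1[2:], x)  # gather needed after this layer
y = split_layer_forward(w2[:1], w2[1:], h)  # and again after this one
print(h, y)
```

The result is identical to running the unsplit layers. The only difference is where the work happens and the gather step required after every layer, which is exactly the traffic the paragraph above warns about.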
Model Parallelism with Neural Networks
Some neural network architectures, such as convolutional neural networks, contain layers that are only partially connected to the lower layers, so it is much easier to distribute chunks across devices in an efficient way.
Deep recurrent neural networks can be split a bit more efficiently across multiple GPUs. If you split the network horizontally by placing each layer on a different device, and you feed the network with an input sequence to process, then at the first time step only one device will be active, at the second step two will be active, and by the time the signal propagates to the output layer, all devices will be active simultaneously.
There is still a lot of cross-device communication going on, but since each cell may be fairly complex, the benefit of running multiple cells in parallel may outweigh the communication penalty. However, in practice, a regular stack of LSTM layers running on a single GPU runs much faster.
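The staggered activation of devices described above can be sketched as a simple schedule. This is an illustrative model rather than a real training loop: it assumes one recurrent layer per device and that every layer takes exactly one clock tick per time step. The function name active_layers is hypothetical.

```python
def active_layers(num_layers, seq_len):
    """Return, for each clock tick, the layers (devices) that are busy.

    Layer l can only process input step t once layer l - 1 has produced
    its output for step t, so layer l handles step t at tick t + l.
    """
    schedule = []
    for tick in range(seq_len + num_layers - 1):
        # Layer `layer` is working on input step (tick - layer), if that
        # step exists in the sequence.
        busy = [layer for layer in range(num_layers)
                if 0 <= tick - layer < seq_len]
        schedule.append(busy)
    return schedule


schedule = active_layers(num_layers=3, seq_len=5)
for tick, busy in enumerate(schedule):
    print(f"tick {tick}: active devices {busy}")
```

With three layers, only device 0 is busy at the first tick, devices 0 and 1 at the second, and all three from the third tick onward until the sequence starts to drain. The longer the sequence is relative to the depth of the stack, the larger the share of time in which every device is working.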
In short, model parallelism may speed up running or training some types of neural networks but not all, and it requires special care and tuning, such as making sure that devices that need to communicate the most run on the same machine. Let’s look at a much simpler and generally more efficient option: data parallelism.
Training Models using Data Parallelism
Another way to parallelize the training of a neural network is to replicate it on every device and run each training step simultaneously on all replicas, using a different mini-batch for each. The gradients computed by each replica are then averaged, and the result is used to update the model parameters. This is called data parallelism. There are many variants of this idea, so let’s look at the most important ones.
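As a minimal sketch of this idea, the plain-Python example below trains a one-parameter model y_hat = w * x with two simulated replicas. The model, the data, the learning rate, and the function names are all assumptions chosen for illustration. Each replica computes a gradient on its own mini-batch, the gradients are averaged, and the single averaged gradient updates the shared parameter.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, learning_rate):
    """One training step: per-replica gradients, averaged, then applied."""
    # On real hardware each of these gradients is computed on its own device.
    grads = [mse_gradient(w, shard) for shard in shards]
    mean_grad = sum(grads) / len(grads)
    return w - learning_rate * mean_grad


# Toy data generated from y = 2 * x, split into two equal mini-batches.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, learning_rate=0.05)
print(w)  # approaches 2.0
```

Because the two mini-batches have the same size, the averaged gradient equals the gradient computed on all four examples at once, so the replicas collectively behave like a single device with a larger batch.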
Data Parallelism using the Mirrored Strategy
Arguably the simplest approach is to completely mirror all the model parameters across all the GPUs and always apply the same parameter updates on every GPU. This way, all replicas always remain perfectly identical. This is called the mirrored strategy, and it turns out to be quite efficient, especially when using a single machine.
The tricky part when using this approach is to efficiently compute the mean of all the gradients from all the GPUs and distribute the result across all the GPUs. This can be done using an AllReduce algorithm, a class of algorithms where multiple nodes collaborate to efficiently perform a reduce operation (such as computing a sum or a mean) while ensuring that all nodes obtain the same final result. Fortunately, there are off-the-shelf implementations of such algorithms, as we will see.
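To give a feel for how such an algorithm can work, here is a small single-process simulation of ring AllReduce, one well-known member of this family. The article does not say which variant a given library uses, so treat this purely as an illustration. For simplicity it assumes each worker's gradient has exactly as many elements as there are workers (one element per chunk), and the function name is hypothetical.

```python
def ring_all_reduce_mean(worker_grads):
    """Simulate ring AllReduce and return every worker's copy of the mean.

    worker_grads holds one gradient list per worker. In this toy version
    each list must have as many elements as there are workers.
    """
    n = len(worker_grads)
    assert all(len(g) == n for g in worker_grads), "one chunk per worker"
    bufs = [list(g) for g in worker_grads]

    # Phase 1 (reduce-scatter): at each step, worker i sends one chunk to
    # its right-hand neighbour, which adds it to its own copy. After n - 1
    # steps, worker i holds the complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot the outgoing messages so all sends happen "at once".
        msgs = [((i - step) % n, bufs[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(msgs):
            bufs[(i + 1) % n][chunk] += value

    # Phase 2 (all-gather): the completed chunks travel around the ring,
    # overwriting the partial sums, until every worker has every chunk.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, bufs[i][(i + 1 - step) % n])
                for i in range(n)]
        for i, (chunk, value) in enumerate(msgs):
            bufs[(i + 1) % n][chunk] = value

    # Every buffer now holds the element-wise sum; divide to get the mean.
    return [[v / n for v in buf] for buf in bufs]


grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
result = ring_all_reduce_mean(grads)
print(result)
```

Each worker only ever talks to its neighbour in the ring, and at the end all workers hold the same averaged gradient, which is the property the mirrored strategy relies on to keep the replicas identical.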
Data Parallelism with Centralized Parameters
Another approach is to store the model parameters outside of the GPU devices performing the computations (called workers), for example on the CPU. In a distributed setup, you may place all the parameters on one or more CPU-only servers called parameter servers, whose only role is to host and update the parameters.
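The division of labor between workers and a parameter server can be sketched in a few lines of plain Python. Everything here is a hypothetical in-process stand-in: the ParameterServer class, its pull and push methods, the toy linear model, and the choice of synchronous updates (waiting for all workers before updating) are assumptions made for illustration, not a description of any particular framework.

```python
class ParameterServer:
    """Hypothetical stand-in for a server that only hosts and updates parameters."""

    def __init__(self, params, learning_rate):
        self.params = list(params)
        self.learning_rate = learning_rate

    def pull(self):
        """Workers fetch a copy of the current parameters."""
        return list(self.params)

    def push(self, grads_from_workers):
        """Average the workers' gradients and apply one update (synchronous)."""
        n = len(grads_from_workers)
        for j in range(len(self.params)):
            mean_grad = sum(g[j] for g in grads_from_workers) / n
            self.params[j] -= self.learning_rate * mean_grad


def worker_gradient(params, batch):
    """MSE gradient for y_hat = w * x + b on one worker's mini-batch."""
    w, b = params
    errors = [(w * x + b - y, x) for x, y in batch]
    grad_w = sum(2 * e * x for e, x in errors) / len(batch)
    grad_b = sum(2 * e for e, _ in errors) / len(batch)
    return [grad_w, grad_b]


# Toy data generated from y = 2 * x + 1, one mini-batch per worker.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
server = ParameterServer(params=[0.0, 0.0], learning_rate=0.05)

for _ in range(2000):
    params = server.pull()                                    # workers read
    grads = [worker_gradient(params, s) for s in shards]      # workers compute
    server.push(grads)                                        # server updates

print(server.params)  # approaches [2.0, 1.0]
```

The workers never modify the parameters themselves. They only read them and send gradients back, which is why the parameters can live on a CPU or on separate machines.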
I hope you liked this article on training models across multiple devices. Feel free to ask your questions in the comments section below.