Distributed Learning
The main reasons you want to do distributed learning:
- You want to train very large models that won't fit on a single device.
- You want to speed up model training that uses large datasets.
- You want to try out a large set of hyperparameters.
- You want to find a better-performing neural network architecture or model with AutoML.
The main approaches to distributing model training:
- Model Parallelism: You train different parts of the model on different devices.
- Data Parallelism: Worker machines read different data batches, compute gradients from those batches, and share the resulting updates. The final weights can be gathered on dedicated servers or simply exchanged between workers.
Most frequently, people want to speed up training with data parallelism. Deep learning tends to keep improving with more data, where traditional learning algorithms plateau sooner. But more data also means longer training times, increasingly so.
Data parallelism in a nutshell (a toy sketch follows this list):
- run multiple copies of your model training; each copy will:
- read a designated chunk of the training data
- run it through the full model
- compute the model updates, i.e. the gradients
- average the gradients among these copies
- update the main model weights and biases
- repeat
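Below is a minimal single-process sketch of that loop, simulating the workers sequentially with NumPy; the linear model, worker count, and hyperparameters are made up for illustration and not taken from any particular framework.

```python
import numpy as np

# Toy problem: linear regression with a squared-error loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=512)

w = np.zeros(10)      # the "main" model weights shared by every copy
n_workers = 4         # number of model copies (chosen for illustration)
lr = 0.1

for step in range(100):
    # Each worker reads its designated chunk of the current batch...
    batch_idx = rng.choice(len(X), size=128, replace=False)
    shards = np.array_split(batch_idx, n_workers)

    worker_grads = []
    for shard in shards:
        Xs, ys = X[shard], y[shard]
        # ...runs it through the full model and computes its gradients.
        grad = 2 * Xs.T @ (Xs @ w - ys) / len(shard)
        worker_grads.append(grad)

    # Average the gradients among the copies, then update the main weights.
    w -= lr * np.mean(worker_grads, axis=0)
```

In a real setup each copy would run on its own device, and the averaging step would be an all-reduce or a parameter-server exchange rather than a Python loop.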
Taking the gradient over a mini-batch is done because (see the sketch after this list):
- It can be efficiently computed by vectorizing the computations.
- It allows us to obtain a better approximation of the true gradient and thus makes us converge faster.
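As a small illustration of the first point, the NumPy check below (with a made-up linear model and squared-error loss) shows that averaging per-sample gradients in a Python loop gives the same result as a single vectorized expression over the whole mini-batch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 5))      # one mini-batch of 64 samples
y = rng.normal(size=64)
w = rng.normal(size=5)

# Per-sample gradients of the squared error, averaged explicitly.
looped = np.mean(
    [2 * (x @ w - t) * x for x, t in zip(X, y)], axis=0
)

# The same average computed in one vectorized matrix expression.
vectorized = 2 * X.T @ (X @ w - y) / len(X)

assert np.allclose(looped, vectorized)
```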
Data parallelism also streamlines experiments. Once training is much faster, it becomes feasible to use random search or other hyperparameter optimization approaches to explore the hyperparameter space.
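As a sketch of what that could look like, the loop below samples random configurations and hands each one to a hypothetical train_and_evaluate function, which would stand in for launching a data-parallel training run; the search ranges are arbitrary.

```python
import random

def train_and_evaluate(lr, batch_size):
    # Hypothetical stand-in for launching a data-parallel training run
    # and returning its validation score; replace with your own job launcher.
    return random.random()

best_score, best_config = float("-inf"), None
for trial in range(20):
    config = {
        "lr": 10 ** random.uniform(-4, -1),              # log-uniform learning rate
        "batch_size": random.choice([64, 128, 256, 512]),
    }
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```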
Data parallelism can be synchronous or asynchronous. If all workers communicate with the parameter servers at the same time, it is synchronous. If workers pull from and push to the parameter servers at their own pace, it is asynchronous.
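The toy simulation below contrasts the two modes with the parameter server reduced to a shared weight vector: in the synchronous round every worker computes its gradient from the same weights and the server applies one averaged update, while in the asynchronous round each worker pulls a copy of the weights, computes its gradient from that (possibly stale) copy, and pushes its update whenever it finishes. The linear model and shard layout are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(256, 8))
y = X @ rng.normal(size=8)
shards = np.array_split(np.arange(len(X)), 4)   # one data shard per worker
lr = 0.05

def grad(w, idx):
    Xs, ys = X[idx], y[idx]
    return 2 * Xs.T @ (Xs @ w - ys) / len(idx)

# Synchronous: all workers compute gradients from the same weights,
# the "server" averages them and applies one update per round.
w_sync = np.zeros(8)
for _ in range(50):
    w_sync -= lr * np.mean([grad(w_sync, idx) for idx in shards], axis=0)

# Asynchronous: each worker's gradient comes from the weights it pulled
# at the start of the round, but is applied to whatever the weights have
# become by the time that worker pushes its update.
w_async = np.zeros(8)
for _ in range(50):
    snapshot = w_async.copy()            # the (possibly stale) pulled weights
    for idx in shards:
        w_async -= lr * grad(snapshot, idx)
```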
The parameter server approach is not very scalable. With just one parameter server, that server becomes your bottleneck, capped by both its machine resources and the network. With multiple parameter servers, the network becomes the bottleneck. The parameter server approach reaches roughly 40-50% scaling efficiency, so you lose about half the compute you throw at it.
Model parallelism is hard to get right. The bottleneck is that deeper layers must wait for the first layers to finish during the forward pass, and the first layers must wait for the deeper layers during backpropagation. It works better for branching model architectures and when training on a single machine with multiple GPUs.
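For concreteness, here is a minimal PyTorch-style sketch of the single-machine, multi-GPU case: the model is split in two, the halves live on different devices, and the activations cross the device boundary at the split point. The layer sizes are arbitrary, and it assumes two GPUs named cuda:0 and cuda:1 are available.

```python
import torch
import torch.nn as nn

# Assumes two GPUs; swap both for torch.device("cpu") to run the same
# pattern on one device (then nothing actually runs in parallel).
dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on the first device...
        self.part1 = nn.Sequential(nn.Linear(784, 512), nn.ReLU()).to(dev0)
        # ...and the second half on the second device.
        self.part2 = nn.Sequential(nn.Linear(512, 10)).to(dev1)

    def forward(self, x):
        h = self.part1(x.to(dev0))
        # Activations are copied across devices at the split point; during
        # backpropagation the gradients flow back across the same boundary.
        return self.part2(h.to(dev1))

model = SplitModel()
out = model(torch.randn(32, 784))   # loss.backward() would then work as usual
```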