
Horovod

Updated at 2020-10-08 16:53

Horovod is a distributed training framework that plugs into TensorFlow/Keras.

  • Uses TensorFlow custom operation mechanism.
  • Uses Message Passing Interface (MPI) for worker discovery and work coordination.
  • Uses NVIDIA NCCL, a collective communication library optimized for GPUs, for the actual reduction.

Horovod's power comes from ring-allreduce. Ring-allreduce is a technique from high-performance computing, and in 2017 Baidu's Silicon Valley AI Lab demonstrated its benefits for machine learning. Horovod performs the ring-allreduce with the NVIDIA Collective Communications Library (NCCL), which provides an implementation optimized for GPUs.

Workers are arranged in a logical ring. Each node:
* Receives from exactly one node.
* Sends to exactly one node.
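
The ring pattern above can be sketched as a small single-process simulation. This is only an illustration of the textbook two-phase algorithm (scatter-reduce, then allgather), not how Horovod or NCCL implement it; the function name `ring_allreduce` and the list-of-lists representation are assumptions made for the example.

```python
def ring_allreduce(vectors):
    """Simulate a sum ring-allreduce; vectors[r] is the data held by worker r."""
    n = len(vectors)
    if n == 0:
        return []
    size = len(vectors[0])
    if any(len(v) != size for v in vectors):
        raise ValueError("all workers must hold vectors of the same length")

    data = [list(v) for v in vectors]  # work on copies
    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(c * size) // n for c in range(n + 1)]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def exchange(chunk_for_sender, combine):
        # Every worker sends one chunk to its right neighbour "simultaneously",
        # so all messages are captured before any of them is applied.
        messages = []
        for r in range(n):
            c = chunk_for_sender(r)
            messages.append((c, [data[r][i] for i in chunk(c)]))
        for r, (c, payload) in enumerate(messages):
            dst = (r + 1) % n
            for i, value in zip(chunk(c), payload):
                data[dst][i] = combine(data[dst][i], value)

    # Phase 1, scatter-reduce: after n - 1 steps worker r holds the
    # complete sum of chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda mine, received: mine + received)

    # Phase 2, allgather: the completed chunks travel around the ring
    # and overwrite the partial values on each worker.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda mine, received: received)

    return data


result = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])
print(result[0])  # [111, 222, 333, 444], and the same on every worker
```

Each worker sends 2(n - 1) messages of roughly size/n elements, always to the same neighbour, which is why the per-worker traffic stays nearly constant as workers are added.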

Message Passing Interface (MPI) is a message-passing standard for programming parallel computers and the dominant approach in high-performance computing today. MPI offers synchronization and communication between a set of processes in a language-independent way. Usually you start as many processes as you have CPUs/GPUs; with Horovod that means one process per GPU.
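
To make the "one process per GPU" idea concrete, here is a rough sketch of how processes could be numbered across machines. The host names, slot counts, and the `assign_workers` helper are hypothetical, and filling hosts in order is an assumption; the real placement is decided by the launcher. The fields mirror what Horovod reports through `hvd.rank()`, `hvd.local_rank()` and `hvd.size()`.

```python
from typing import Dict, List, NamedTuple


class Worker(NamedTuple):
    rank: int        # global process index, 0 .. size - 1
    local_rank: int  # index within its host; commonly used to pick the GPU
    host: str


def assign_workers(slots_per_host: Dict[str, int]) -> List[Worker]:
    """Assign one process per slot (GPU), filling hosts in the given order."""
    workers: List[Worker] = []
    for host, slots in slots_per_host.items():
        for local_rank in range(slots):
            workers.append(Worker(len(workers), local_rank, host))
    return workers


# Hypothetical cluster: two machines with four GPUs each.
workers = assign_workers({"server1": 4, "server2": 4})
for w in workers:
    print(f"rank {w.rank} of {len(workers)} -> {w.host}, GPU {w.local_rank}")
```

A layout like this corresponds to launching eight processes, for example with `horovodrun -np 8 -H server1:4,server2:4 python train.py`, where `train.py` stands for your own training script.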

Maximize the number of GPUs per machine. Network traffic is frequently the bottleneck for Horovod, and GPUs inside the same machine can exchange data without crossing the network.
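
A back-of-the-envelope estimate shows why traffic matters. In a ring-allreduce every worker sends about 2(N - 1)/N times the gradient size per reduction. The sketch below assumes a hypothetical model with 25 million float32 parameters and ignores compression, tensor fusion and any overlap with computation; `bytes_sent_per_worker` is a name made up for this example.

```python
def bytes_sent_per_worker(num_params, num_workers, bytes_per_param=4):
    """Approximate bytes each worker sends in one ring-allreduce of all gradients."""
    if num_workers < 2:
        return 0.0  # a single worker has nothing to exchange
    payload = num_params * bytes_per_param
    return 2 * (num_workers - 1) / num_workers * payload


# Hypothetical model size: 25 million float32 parameters (100 MB of gradients).
for n in (2, 4, 8, 16):
    mb = bytes_sent_per_worker(25_000_000, n) / 1e6
    print(f"{n:>2} workers: {mb:.1f} MB sent per worker per step")
```

The volume approaches twice the gradient size and is repeated on every training step, so it pays to keep as much of the ring as possible on fast links inside one machine.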

Sources