DRACO: Robust Distributed Training via Redundant Gradients
Distributed model training is vulnerable to adversarial compute nodes, i.e., nodes that send malicious updates to corrupt the global model stored at a parameter server (PS). To defend against such attacks, recent work suggests replacing bulk averaging at the PS with variants of the geometric median when aggregating distributed updates. Although median-based update rules are robust to adversarial nodes, their computational cost can be prohibitive in large-scale settings, and their convergence guarantees often require relatively strong assumptions.
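To make the median-based baseline concrete, the following is a minimal sketch of geometric-median aggregation via Weiszfeld's iterative algorithm. All names and parameters here are illustrative; the baselines referenced above may differ in detail.

```python
# Robust aggregation at the PS: replace the mean of the workers' gradients
# with (an approximation of) their geometric median, which a single large
# outlier cannot drag far from the honest cluster.

def geometric_median(points, iters=100, eps=1e-8):
    """Approximate the geometric median of equal-length vectors (Weiszfeld)."""
    dim = len(points[0])
    # Initialize at the coordinate-wise mean.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        denom = 0.0
        for p in points:
            dist = sum((p[i] - est[i]) ** 2 for i in range(dim)) ** 0.5
            w = 1.0 / max(dist, eps)  # inverse-distance weight
            denom += w
            for i in range(dim):
                num[i] += w * p[i]
        est = [num[i] / denom for i in range(dim)]
    return est

# Three honest gradients near (1, 1) and one adversarial outlier:
grads = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [100.0, -100.0]]
median = geometric_median(grads)  # stays near (1, 1), unlike the mean
```

Note that each Weiszfeld iteration costs a full pass over all updates, which hints at the computational burden mentioned above when models and clusters are large.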
In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients, which the parameter server then uses to eliminate the effects of adversarial updates. We present problem-independent robustness guarantees for DRACO and show that the model it produces is always identical to one trained with no adversaries. Extensive experiments on real datasets in a distributed environment show that DRACO is several times to orders of magnitude faster than median-based approaches.
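The redundancy idea can be illustrated with its simplest instance, a repetition code: each gradient task is assigned to r = 2s + 1 nodes, and the PS recovers the true gradient of every task by majority vote, so up to s adversarial nodes cannot corrupt the result. This is only a hedged sketch; the paper's actual encoding and decoding schemes are more communication-efficient, and all names below are hypothetical.

```python
# Repetition-code view of redundant gradients: replicate each task on
# r = 2s + 1 nodes, then decode at the PS by majority vote. Any s adversaries
# are outvoted, so the recovered gradient equals the adversary-free one.

from collections import Counter

def majority_decode(replicas):
    """Return the value reported by the majority of replicas for one task."""
    value, count = Counter(replicas).most_common(1)[0]
    return value

s = 1                 # number of adversarial nodes to tolerate
r = 2 * s + 1         # replication factor
true_gradient = 4.2   # the gradient an honest node would compute

# Replica reports for one task: one adversarial node lies.
reports = [true_gradient, true_gradient, -999.0]
assert len(reports) == r
recovered = majority_decode(reports)  # recovers 4.2 despite the adversary
```

Because the decoded gradient matches the honest computation exactly, training with this scheme produces the same model as an adversary-free run, at the cost of the redundancy factor r in compute.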