
arXiv: 1809.07599
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by sending only the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes have shown very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the improved scalability for distributed applications.
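The mechanism described in the abstract — applying only the k most significant entries of each update while keeping the discarded remainder in a local error memory that is added back at the next step — can be summarized in a short sketch. The Python code below is a minimal single-worker illustration written for this record; the function names (`top_k`, `sparsified_sgd_with_memory`, `grad_fn`) and the quadratic example are placeholders chosen here, not the authors' reference implementation.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    if k >= len(v):
        return v.copy()
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sparsified_sgd_with_memory(grad_fn, x0, lr, k, num_steps):
    """SGD where only the top-k entries of each scaled gradient are applied
    (i.e. communicated); the discarded mass is kept in an error memory and
    added back into the update at the next step (error compensation)."""
    x = x0.copy()
    memory = np.zeros_like(x0)            # accumulated compression error
    for _ in range(num_steps):
        g = grad_fn(x)                    # stochastic gradient at x
        update = memory + lr * g          # add back previously discarded error
        sparse_update = top_k(update, k)  # only k entries would be sent
        memory = update - sparse_update   # remember what was not applied
        x -= sparse_update
    return x

# Illustrative usage on a noisy quadratic f(x) = 0.5 * ||x||^2 (gradient x + noise)
rng = np.random.default_rng(0)
x_final = sparsified_sgd_with_memory(
    grad_fn=lambda x: x + 0.01 * rng.standard_normal(len(x)),
    x0=np.ones(1000), lr=0.1, k=10, num_steps=2000)
```

In this sketch only k of the d coordinates are updated (communicated) per step, yet the error memory ensures that no gradient information is permanently lost, which is the property the paper's analysis relies on.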
to appear at NIPS 2018
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML); FOS: Computer and information sciences; MSC classes: 68W40, 68W15, 90C25, 90C06; ACM classes: G.1.6; F.2.1; E.4
| Indicator | Description | Value |
| --- | --- | --- |
| Selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator. | 0 |
| Popularity | Reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average |
| Influence | Reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average |
| Impulse | Reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
