# MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms

- Published: 27 Nov 2018
- Publisher: IEEE

[1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, 2012, pp. 1223-1231.

[2] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.

[3] P. Watcharapichat, V. L. Morales, R. C. Fernandez, and P. Pietzuch, “Ako: Decentralised deep learning with partial gradient exchange,” in Proceedings of the Seventh ACM Symposium on Cloud Computing. ACM, 2016, pp. 84-97.

[4] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, “Geeps: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server,” in Proceedings of the Eleventh European Conference on Computer Systems. ACM, 2016, p. 4.

[5] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems, 2017, pp. 1707-1718.

[6] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2018.

[7] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 1508-1518.

[8] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and D. K. Panda, “Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with Nvidia GPUs,” in Parallel Processing (ICPP), 2013 42nd International Conference on. IEEE, 2013, pp. 80-89.

[9] M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, “Scalable reduction collectives with data partitioning-based multi-leader design,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2017, p. 64.

[10] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, “Scaffe: Co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters,” in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2017, pp. 193-205.

[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.

[12] H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing, “Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters,” in Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference. USENIX Association, 2017, pp. 181-193.

[13] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wójcik, “Re-architecting datacenter networks and stacks for low latency and high performance,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication. ACM, 2017, pp. 29-42.

[14] C. Guo, H. Wu, Z. Deng, G. Soni, J. Ye, J. Padhye, and M. Lipshteyn, “RDMA over commodity ethernet at scale,” in Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016, pp. 202-215.

[15] Y. You, A. Buluç, and J. Demmel, “Scaling deep learning on GPU and Knights Landing clusters,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2017, p. 9.
