Data-parallel distributed training of very large models beyond GPU capacity

Publication: Article, Preprint; 01 Jan 2018; Embargo end date: 01 Jan 2018; Publisher: arXiv

Authors: Matzek, Samuel; Grossman, Max; Cho, Minsik; Yusifov, Anar; Nelson, Bryant; Juneja, Amit

doi: 10.48550/arxiv.1811.12174

arXiv: http://arxiv.org/abs/1811.12174

Abstract

GPUs have limited memory and it is difficult to train wide and/or deep models that cause the training process to go out of memory. It is shown in this paper how an open source tool called Large Model Support (LMS) can utilize a high bandwidth NVLink connection between CPUs and GPUs to accomplish training of deep convolutional networks. LMS performs tensor swapping between CPU memory and GPU memory such that only a minimal number of tensors required in a training step are kept in the GPU memory. It is also shown how LMS can be combined with an MPI based distributed deep learning module to train models in a data-parallel fashion across multiple GPUs, such that each GPU is utilizing the CPU memory for tensor swapping. The hardware architecture that enables the high bandwidth GPU link with the CPU is discussed as well as the associated set of software tools that are available as the PowerAI package.
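
The tensor swapping described in the abstract can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: it does not use LMS, TensorFlow, or a GPU, and the class `SwapSimulator`, the function `run_training_step`, and the least-recently-used eviction rule are hypothetical choices made for illustration, not the policy the paper's tool applies. It models a device with a fixed memory budget, moves tensors to host memory when the budget would be exceeded, and brings them back when the backward pass needs them.

```python
from collections import OrderedDict


class SwapSimulator:
    """Toy model of a small device memory backed by a larger host memory.

    Hypothetical illustration only; not the LMS implementation.
    """

    def __init__(self, device_capacity):
        self.device_capacity = device_capacity
        self.on_device = OrderedDict()  # name -> size, least recently used first
        self.on_host = {}               # name -> size of swapped-out tensors
        self.swap_outs = 0
        self.swap_ins = 0
        self.peak_used = 0

    def _used(self):
        return sum(self.on_device.values())

    def _make_room(self, size):
        # Assumed policy: evict the least recently used tensor until `size` fits.
        while self._used() + size > self.device_capacity:
            if not self.on_device:
                raise MemoryError("tensor does not fit in device memory")
            victim, victim_size = self.on_device.popitem(last=False)
            self.on_host[victim] = victim_size
            self.swap_outs += 1

    def touch(self, name, size):
        """Make tensor `name` resident on the device before it is used."""
        if name in self.on_device:
            self.on_device.move_to_end(name)  # mark as most recently used
            return
        self._make_room(size)
        if name in self.on_host:
            # The tensor was swapped out earlier, so this is a swap-in.
            del self.on_host[name]
            self.swap_ins += 1
        self.on_device[name] = size
        self.peak_used = max(self.peak_used, self._used())

    def free(self, name):
        """Drop a tensor that is no longer needed in this step."""
        self.on_device.pop(name, None)
        self.on_host.pop(name, None)


def run_training_step(num_layers, tensor_size, capacity):
    """Simulate one forward and backward pass over `num_layers` activations."""
    sim = SwapSimulator(capacity)
    names = [f"act{i}" for i in range(num_layers)]
    for name in names:            # forward pass: each layer produces an activation
        sim.touch(name, tensor_size)
    for name in reversed(names):  # backward pass: activations are reused, then freed
        sim.touch(name, tensor_size)
        sim.free(name)
    return sim
```

For example, `run_training_step(num_layers=4, tensor_size=4, capacity=8)` needs 16 units of activation memory in total but never holds more than 8 on the simulated device; the two earliest activations are swapped out during the forward pass and swapped back in during the backward pass.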

Keywords

FOS: Computer and information sciences; Computer Science - Machine Learning; Computer Science - Distributed, Parallel, and Cluster Computing; Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Related research products (2)

- gloo software on GitHub (IsRelatedTo)
- 3DUnetCNN software on GitHub (IsRelatedTo)

Impact by BIP!

- citations: 0. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
