publication . Preprint . 2018

CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Garg, Rohan; Mohan, Apoorve; Sullivan, Michael; Cooperman, Gene
Open Access English
  • Published: 31 Jul 2018
Abstract
Unified Virtual Memory (UVM) was recently introduced on NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to that of older large-memory UNIX applications, which directly loaded and unloaded memory segments. Newer CUDA programs have started taking advantage of UVM for the same reason, superior programmability, that led UNIX applications long ago to assume the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider po...
Subjects
acm: Software_PROGRAMMINGTECHNIQUES
free text keywords: Computer Science - Distributed, Parallel, and Cluster Computing
Funded by
NSF| SI2-SSE: Enhancement and Support of DMTCP for Adaptive, Extensible Checkpoint-Restart
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 1440788
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Advanced Cyberinfrastructure
NSF| NSCI: SI2-SSE: An Extensible Model to Support Scalable Checkpoint-Restart for DMTCP Across Multiple Disciplines
Project
  • Funder: National Science Foundation (NSF)
  • Project Code: 1740218
  • Funding stream: Directorate for Computer & Information Science & Engineering | Division of Advanced Cyberinfrastructure