publication . Article . Preprint . 2018

Efficient Synchronization Primitives for GPUs

Stuart, Jeff A.; Owens, John D.;
Open Access English
  • Published: 19 Jun 2018
  • Publisher: eScholarship, University of California
  • Country: Mexico
Abstract
In this paper, we revisit the design of synchronization primitives---specifically barriers, mutexes, and semaphores---and how they apply to the GPU. Previous implementations are insufficient due to the discrepancies in hardware and programming model of the GPU and CPU. We create new implementations in CUDA and analyze the performance of spinning on the GPU, as well as a method of sleeping on the GPU, by running a set of memory-system benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class GPUs from NVIDIA. From our results we define higher-level principles that are valid for generic many-core processors, the most important of which is to lim...
Subjects
arXiv: Computer Science::Mathematical SoftwareComputer Science::PerformanceComputer Science::Graphics
ACM Computing Classification System: Software_PROGRAMMINGTECHNIQUES
free text keywords: cs.OS, cs.DC, cs.DS, cs.GR, D.4.1; I.3.2, D.4.1, I.3.2, Computer Science - Operating Systems, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, Computer Science - Graphics
19 references, page 1 of 2

[1] T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of High Performance Graphics 2009, pages 145-149, Aug. 2009. [OpenAIRE]

[2] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, Jan. 1990.

[3] T. E. Anderson, E. D. Lazowska, and H. M. Levy. The performance implications of thread management for shared-memory multiprocessors. IEEE Transactions on Computers, 38(12):1631-1644, Dec. 1989. [OpenAIRE]

[4] N. S. Arenstorf and H. F. Jordan. Comparing barrier algorithms. Parallel Computing, 12:157-170, 1989.

[5] L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshten. Optimal strategies for spinning and blocking. Technical report, 1993. [OpenAIRE]

[6] E. D. Brooks. The butterfly barrier. International Journal of Parallel Programming, 15:295-307, 1986.

[7] E. W. Dijkstra. Cooperating sequential processes. In F. Genuys, editor, Programming Languages: NATO Advanced Study Institute, pages 43-112. Academic Press, 1968. [OpenAIRE]

[8] J. Jenkins, I. Arkatkar, J. D. Owens, A. Choudhary, and N. F. Samatova. Lessons learned from exploring the backtracking paradigm on the GPU. In Euro-Par 2011: Proceedings of the 17th International European Conference on Parallel and Distributed Computing, volume 6853 of Lecture Notes in Computer Science, pages 425-437. Springer, Aug./Sept. 2011. [OpenAIRE]

[9] A. Kogan and E. Petrank. Wait-free queues with multiple enqueuers and dequeuers. In PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, pages 223-233, Feb. 2011.

[10] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computers, 9(1):21-65, Feb. 1991.

[11] R. M. Metcalfe and D. R. Boggs. Ethernet: Distributed packet switching for local computer networks. Communications of the ACM, 19(7):395-404, July 1976. [OpenAIRE]

[12] J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems, pages 22-30, 1982.

[13] R. Russell. Fuss, futexes and furwocks: Fast userlevel locking in linux. In Proceedings of the Ottawa Linux Symposium, pages 479-495, 2002.

[14] R. Shams and R. A. Kennedy. Efficient histogram algorithms for NVIDIA CUDA compatible devices. In Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS), pages 418-422, Gold Coast, Australia, Dec. 2007.

[15] S. Tzeng, A. Patney, and J. D. Owens. Task management for irregular-parallel workloads on the GPU. In Proceedings of High Performance Graphics 2010, pages 29-37, June 2010. [OpenAIRE]

19 references, page 1 of 2
Powered by OpenAIRE Research Graph
Any information missing or wrong?Report an Issue