T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of High Performance Graphics 2009, pages 145-149, Aug. 2009.
 T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, Jan. 1990.
 T. E. Anderson, E. D. Lazowska, and H. M. Levy. The performance implications of thread management for shared-memory multiprocessors. IEEE Transactions on Computers, 38(12):1631-1644, Dec. 1989.
 N. S. Arenstorf and H. F. Jordan. Comparing barrier algorithms. Parallel Computing, 12:157-170, 1989.
 L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshten. Optimal strategies for spinning and blocking. Technical report, 1993.
 E. D. Brooks. The butterfly barrier. International Journal of Parallel Programming, 15:295-307, 1986.
 E. W. Dijkstra. Cooperating sequential processes. In F. Genuys, editor, Programming Languages: NATO Advanced Study Institute, pages 43-112. Academic Press, 1968.
 J. Jenkins, I. Arkatkar, J. D. Owens, A. Choudhary, and N. F. Samatova. Lessons learned from exploring the backtracking paradigm on the GPU. In Euro-Par 2011: Proceedings of the 17th International European Conference on Parallel and Distributed Computing, volume 6853 of Lecture Notes in Computer Science, pages 425-437. Springer, Aug./Sept. 2011.
 A. Kogan and E. Petrank. Wait-free queues with multiple enqueuers and dequeuers. In PPoPP '11: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, pages 223-233, Feb. 2011.
 J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, Feb. 1991.
 R. M. Metcalfe and D. R. Boggs. Ethernet: Distributed packet switching for local computer networks. Communications of the ACM, 19(7):395-404, July 1976.
 J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems, pages 22-30, 1982.
 R. Russell. Fuss, futexes and furwocks: Fast userlevel locking in Linux. In Proceedings of the Ottawa Linux Symposium, pages 479-495, 2002.
 R. Shams and R. A. Kennedy. Efficient histogram algorithms for NVIDIA CUDA compatible devices. In Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS), pages 418-422, Gold Coast, Australia, Dec. 2007.
 S. Tzeng, A. Patney, and J. D. Owens. Task management for irregular-parallel workloads on the GPU. In Proceedings of High Performance Graphics 2010, pages 29-37, June 2010.