Scalable Task Parallelism for NUMA

Conference object, Contribution for newspaper or weekly magazine English OPEN
Drebes, Andi ; Pop, Antoniu ; Heydemann, Karine ; Cohen, Albert ; Drach, Nathalie (2016)
  • Publisher: HAL CCSD
  • Related identifiers: doi: 10.1145/2967938.2967946
  • Subject: Task-parallel programming | Memory allocation | NUMA | Scheduling | [ INFO.INFO-PL ] Computer Science [cs]/Programming Languages [cs.PL] | Data-flow programming

Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models for both computing and memory resources with high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of accesses to task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system, and placement information from the operating system. On a 192-core system with 24 NUMA nodes, our optimizations achieve above 94% locality (fraction of local memory accesses), up to 5× better performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
  • References (35)
    35 references, page 1 of 4

    [1] K. E. Batcher. Sorting networks and their applications. In Proceedings of the April 30{May 2, 1968, Spring joint Computer Conference, AFIPS '68 (Spring), pages 307{314, New York, NY, USA, 1968. ACM.

    [2] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. Nelson, and C. O ner. Extending OpenMP for NUMA machines. In Supercomputing. ACM/IEEE, Nov. 2000.

    [3] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An e cient multithreaded runtime system. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '95, pages 207{216, New York, NY, USA, 1995. ACM.

    [4] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. JACM, 46(5):720{748, Sept. 1999.

    [5] F. Broquedis, O. Aumage, B. Goglin, S. Thibault, P.-A. Wacrenier, and R. Namyst. Structuring the execution of openmp applications for multicore architectures. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1{10. IEEE.

    [6] F. Broquedis, N. Furmento, B. Goglin, P.-A. Wacrenier, and R. Namyst. ForestGOMP: An e cient OpenMP environment for NUMA architectures. International Journal of Parallel Programming, 38(5):418{439, 2010.

    [7] F. Broquedis, T. Gautier, and V. Danjean. LIBKOMP, an e cient OpenMP runtime system for both fork-join and data ow paradigms. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World, IWOMP, pages 102{115, Berlin, Heidelberg, 2012. Springer-Verlag.

    [8] Z. Budimlic, M. Burke, V. Cave, K. Knobe, G. Lowney, R. Newton, J. Palsberg, D. Peixotto, V. Sarkar, F. Schlimbach, and S. Tasirlar. Concurrent collections. Scienti c Programming, 18:203|217, 2010.

    [9] V. Cave, J. Zhao, J. Shirako, and V. Sarkar. Habanero-Java: The New Adventures of Old X10. In Proceedings of the 9th International Conference on Principles and Practice of Programming in Java, PPPJ '11, pages 51{61, New York, NY, USA, 2011. ACM.

    [10] P. Charles, C. Grotho , V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA '05, pages 519{538, New York, NY, USA, 2005. ACM.

  • Similar Research Results (3)
  • Metrics
    4
    views in OpenAIRE
    0
    views in local repository
    18
    downloads in local repository

    The information is available from the following content providers:

    From Number Of Views Number Of Downloads
    The University of Manchester - Institutional Repository - IRUS-UK 0 18
Share - Bookmark