Scheduling irregular parallel computations on hierarchical caches

Publication: Article, Conference object
Date: 04 Jun 2011
Publisher: ACM
Journal: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures

Authors: Guy E. Blelloch; Jeremy T. Fineman; Phillip B. Gibbons; Harsha Vardhan Simhadri

doi: 10.1145/1989493.1989553

Abstract

For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds.

The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors; that is, the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.
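
The "work efficient" claim can be illustrated with a small sketch. This is not the paper's analysis or scheduler; it only works through the arithmetic the abstract describes (total cache-miss cost divided by the number of processors, accurate up to a constant factor and only under the paper's assumptions). The `CacheLevel` type, the per-level miss counts, and the per-miss costs are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheLevel:
    """One level of a cache hierarchy (hypothetical illustration values)."""
    misses: int       # cache misses charged to this level by some analysis
    miss_cost: float  # cost of servicing one miss at this level


def total_miss_cost(levels):
    """Sum of (misses * cost per miss) over all levels of the hierarchy."""
    return sum(level.misses * level.miss_cost for level in levels)


def balanced_runtime_estimate(levels, processors):
    """Total miss cost divided evenly across processors.

    The abstract states the true runtime matches this quantity only
    within a constant factor, so treat the result as an order-of-magnitude
    estimate rather than a prediction.
    """
    if processors <= 0:
        raise ValueError("processors must be positive")
    return total_miss_cost(levels) / processors


# Hypothetical three-level hierarchy: misses get rarer but costlier
# as we move away from the processor.
levels = [
    CacheLevel(misses=1_000_000, miss_cost=1.0),
    CacheLevel(misses=50_000, miss_cost=10.0),
    CacheLevel(misses=2_000, miss_cost=100.0),
]

print(balanced_runtime_estimate(levels, processors=8))
```

Doubling `processors` halves the estimate, which is the sense in which the cost is evenly balanced; the abstract's final sentence says that work, depth, and sequential cache complexity alone cannot guarantee this behavior.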

Related Organizations

Carnegie Mellon University (United States)
Intel (United States)

Related research

The Data Locality of Work Stealing (2000)
The Power of Nested Parallelism in Big Data Processing – Hitting Three Flies with One Slap – (2021)
Heartbeat scheduling: provable efficiency for nested parallelism (2018)
NestedMP: Taming Complex Configuration Space of Degree of Parallelism for Nested-Parallel Programs (2014)
Disentanglement in nested-parallel programs (2019)

Impact by BIP!

Selected citations: 62
Popularity: Top 10%
Influence: Top 10%
Impulse: Top 10%