PaC-trees: supporting parallel and compressed purely-functional collections

Publication type: Article, Preprint, Conference object
Date: 09 Jun 2022
Embargo end date: 01 Jan 2022
Publisher: ACM
Journal: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation
Funded by: NSF | AF: Small: Shared-Memory ..., NSF | Collaborative Research: P..., NSF | AF: Small: New Directions... +2 projects

Authors: Laxman Dhulipala; Guy E. Blelloch; Yan Gu; Yihan Sun

doi: 10.1145/3519939.3523733, 10.48550/arxiv.2204.06077

arXiv: 2204.06077

Abstract

Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and lightweight snapshots. However, existing high-performance functional interfaces such as PAM, which are based on balanced purely-functional trees, incur large space overheads for large-scale data analysis due to storing every element in a separate node in a tree. This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks using arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC-trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient. Using PaC-trees we designed CPAM, a C++ library that implements the full functionality of PAM, while offering significant extra functionality for compression. CPAM consistently matches or outperforms PAM on a set of microbenchmarks on sets, maps, and sequences while using about a quarter of the space. On applications including inverted indices, 2D range queries, and 1D interval queries, CPAM is competitive with or faster than PAM, while using 2.1--7.8x less space. For static and streaming graph processing, CPAM offers 1.6x faster batch updates while using 1.3--2.6x less space than the state-of-the-art graph processing system Aspen.

This is a preliminary version of a paper that will appear at the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2022).

Keywords

FOS: Computer and information sciences; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Data Structures and Algorithms; Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)

Impact by BIP!

	selected citations	14
	popularity	Top 10%
	influence	Average
	impulse	Top 10%