Scalable Compression of Massive Data Collections on HPC Systems

Keywords: Big Data; Data Compression; Distributed Processing; HPC; Parallel Computing

Loris Belcastro; Paolo Ferragina; Giovanni Manzini; Fabrizio Marozzo; Domenico Talia; Paolo Trunfio

Archivio della ricerca della Scuola Superiore Sant'Anna

Conference object . 2025

License: CC BY

Full-Text: https://www.iris.sssup.it/request-item?handle=11382/581152&bitstreamId=d25fca1f-ece7-445d-a680-772c37ebf967

Data sources: Archivio della ricerca della Scuola Superiore Sant'Anna

https://doi.org/10.1007/978-3-...

Part of book or chapter of book . 2025 . Peer-reviewed

License: Springer Nature TDM

Data sources: Crossref

DBLP

Conference object

Data sources: DBLP

Publication . Part of book or chapter of book, Conference object . 23 Aug 2025 . Italy . English . Publisher: Springer Nature Switzerland

doi: 10.1007/978-3-031-99857-7_23

handle: 11382/581152

Abstract

The exponential growth of digital data poses a significant storage challenge, straining current storage systems in terms of cost, efficiency, maintainability, and available resources. For large-scale data archiving, highly efficient data compression techniques are vital for minimizing storage overhead, improving communication efficiency, and optimizing data retrieval performance. This paper presents a scalable parallel workflow designed to compress vast collections of files on high-performance computing (HPC) systems. Leveraging the Permute-Partition-Compress (PPC) paradigm, the proposed workflow optimizes both compression ratio and processing speed. By integrating a data clustering technique, our solution effectively addresses the challenges that large-scale data collections pose for compression efficiency and scalability. Experiments conducted on the Leonardo petascale supercomputer of CINECA (leonardo-supercomputer.cineca.eu) processed a subset of the Software Heritage archive consisting of about 49 million files of C++ code, totaling 1.1 TB. The experimental results show significant compression speedup and scalability.
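
The three PPC stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' workflow: the abstract does not describe the clustering technique, the block size, the compressor, or the HPC scheduling, so everything here is an assumption chosen for illustration. The function names (`permute`, `partition`, `compress_block`, `ppc`) are hypothetical, sorting by file extension and base name stands in for the clustering step, `zlib` stands in for the compressor, and a local thread pool stands in for distributed execution.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def permute(files):
    """Permute: order (path, bytes) pairs so likely-similar files are adjacent.

    Assumed stand-in heuristic: sort by extension, then by base name.
    """
    return sorted(
        files,
        key=lambda item: (os.path.splitext(item[0])[1], os.path.basename(item[0])),
    )


def partition(files, block_size):
    """Partition: split the ordered list into blocks of roughly block_size bytes."""
    blocks, current, size = [], [], 0
    for name, data in files:
        # Start a new block when adding this file would exceed the budget.
        if current and size + len(data) > block_size:
            blocks.append(current)
            current, size = [], 0
        current.append((name, data))
        size += len(data)
    if current:
        blocks.append(current)
    return blocks


def compress_block(block):
    """Compress: concatenate the files of one block and compress them together."""
    payload = b"".join(data for _, data in block)
    return zlib.compress(payload, 9)


def ppc(files, block_size=1 << 20, workers=4):
    """Run permute, partition, and compress; blocks are compressed independently."""
    blocks = partition(permute(files), block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(compress_block, blocks))
    return blocks, compressed


if __name__ == "__main__":
    sample = [
        ("a/x.cpp", b"int main(){return 0;}\n" * 50),
        ("b/y.h", b"#pragma once\n" * 50),
        ("c/x.cpp", b"int main(){return 1;}\n" * 50),
    ]
    blocks, compressed = ppc(sample, block_size=4096, workers=2)
    raw = sum(len(d) for b in blocks for _, d in b)
    print(len(blocks), "block(s):", raw, "->", sum(len(c) for c in compressed), "bytes")
```

Because each block is compressed independently, the compress stage parallelizes naturally, while placing similar files in the same block gives the compressor more redundancy to exploit. A usable archive would also need to store per-file offsets so that individual files can be extracted from a decompressed block.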

Country

Italy

Related Organizations

Sant'Anna School of Advanced Studies
Italy
University of Calabria
Italy
University of Pisa
Italy

Impact by BIP!

selected citations: 0
popularity: Average
influence: Average
impulse: Average
