
handle: 11382/581152
The exponential growth of digital data poses a significant storage challenge, straining current storage systems in terms of cost, efficiency, maintainability, and available resources. For large-scale data archiving, highly efficient data compression techniques are vital for minimizing storage overhead, communication efficiency, and optimizing data retrieval performance. This paper presents a scalable parallel workflow designed to compress vast collections of files on high-performance computing systems. Leveraging the Permute-Partition-Compress (PPC) paradigm, the proposed workflow optimizes both compression ratio and processing speed. By integrating a data clustering technique, our solution effectively addresses the challenges posed by large-scale data collections in terms of compression efficiency and scalability. Experiments were conducted on the Leonardo petascale supercomputer of CINECA (leonardo-supercomputer.cineca.eu), and processed a subset of the Software Heritage archive, consisting of about 49 million files of C++ code, totaling 1.1 TB of space. Experimental results show significant performance in both compression speedup and scalability.
Big Data; Data Compression; Distributed Processing; HPC; Parallel Computing
Big Data; Data Compression; Distributed Processing; HPC; Parallel Computing
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 0 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Average | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
