Powered by OpenAIRE graph
Journal of Systems and Software
Article . 2025 . Peer-reviewed
License: CC BY
Data sources: Crossref
INRIA2
Article . 2025
License: CC BY
Data sources: INRIA2
DBLP
Article
Data sources: DBLP


On the compressibility of large-scale source code datasets

Authors: Boffa, Antonio; Di Cosmo, Roberto; Ferragina, Paolo; Guerra, Andrea; Manzini, Giovanni; Vinciguerra, Giorgio; Zacchiroli, Stefano

Abstract

Storing ultra-large amounts of unstructured data (often called objects or blobs) is a fundamental task for several object-based storage engines, data warehouses, data-lake systems, and key-value stores. These systems cannot currently leverage similarities between objects, which could be vital in improving their space and time performance. An important use case in which we can expect the objects to be highly similar is the storage of large-scale versioned source code datasets, such as the Software Heritage Archive (Di Cosmo and Zacchiroli, 2017). This use case is particularly interesting given the extraordinary size (1.5 PiB), the variegated nature, and the high repetitiveness of the at-issue corpus.

In this paper we discuss and experiment with content- and context-based compression techniques for source-code collections that tailor known and novel tools to this setting in combination with state-of-the-art general-purpose compressors and the information coming from the Software Heritage Graph.

We experiment with our compressors over a random sample of the entire corpus, and four large samples of source code files written in different popular languages: C/C++, Java, JavaScript, and Python. We also consider two scenarios of usage for our compressors, called Backup and File-Access scenario, where the latter adds to the former the support for single file retrieval. As a net result, our experiments show (i) how much "compressible" each language is, (ii) which content- or context-based techniques compress better and are faster to (de)compress by possibly supporting individual file access, and (iii) the ultimate compressed size that, according to our estimate, our best solution could achieve in storing all the source code written in these languages and available in the Software Heritage Archive: namely, in 3 TiB (down from their original 78 TiB total size, with an average compression ratio of 4%).
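
The arithmetic behind the reported ratio, and the intuition for why similar objects compress better together, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy "versions" are synthetic, zlib stands in for the general-purpose compressors the abstract mentions, and compressing each file on its own versus all files in one stream is only a rough analogy for trading single-file access against space, not the paper's actual method.

```python
import zlib


def compression_ratio(compressed_size: float, original_size: float) -> float:
    """Compressed size as a fraction of the original size (lower is better)."""
    return compressed_size / original_size


# Ratio implied by the abstract's figures: 3 TiB down from 78 TiB.
reported = compression_ratio(3, 78)
print(f"reported ratio: {reported:.1%}")

# Hypothetical toy corpus: 20 near-identical revisions of one source file.
base = "".join(
    f"def handler_{i}(request):\n    return process(request, step={i})\n\n"
    for i in range(50)
)
versions = [(base + f"# revision {v}\nVERSION = {v}\n").encode() for v in range(20)]
original = sum(len(v) for v in versions)

# Each file compressed independently: any single file can be decompressed
# alone, but redundancy across revisions is not exploited.
separate = sum(len(zlib.compress(v, 9)) for v in versions)

# All files compressed as one stream: later revisions are encoded largely as
# back-references to earlier ones, at the cost of single-file access.
together = len(zlib.compress(b"".join(versions), 9))

print(f"separate: {compression_ratio(separate, original):.1%}")
print(f"together: {compression_ratio(together, original):.1%}")
```

On this toy input the single-stream variant is noticeably smaller because each revision fits within zlib's 32 KiB window; on real corpora the gain depends on how close together similar files are placed in the stream.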

Countries
Italy, France
Keywords

Data compression, Locality-sensitive hashing, Software Heritage, Source code, Storage systems, Version control systems, [INFO.INFO-SE] Computer Science [cs]/Software Engineering [cs.SE]

Impact by BIP!
  • selected citations: 1
    These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
  • popularity: Average
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
  • influence: Average
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
  • impulse: Average
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
Access routes: Green, hybrid
Related to Research communities
EGI : advanced computing for research