Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing

Creator: Diego Diaz
Keywords: FOS: Computer and information sciences, locally consistent parsing, Computer Science - Data Structures and Algorithms, Grammar compression, Data Structures and Algorithms (cs.DS), ddc:004, hashing

arXiv.org e-Print Archive

Preprint . 2024

Data sources: arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Conference object . 2025

License: CC BY

Data sources: Dagstuhl Research Online Publication Server

https://dx.doi.org/10.48550/ar...

Article . 2024

License: CC BY

Data sources: Datacite

DBLP

Article

Data sources: DBLP

DBLP

Conference object

Data sources: DBLP

http://dx.doi.org/10.48550/arX...

Other literature type . 2024

Data sources: European Union Open Data Portal

Publication type: Article, Preprint, Conference object, Other literature type
Publication date: 01 Jan 2024
Embargo end date: 01 Jan 2024
Country: Germany
Publisher: arXiv
Journal: CoRR, volume abs/2411.12439

Authors: Diego Diaz;

doi: 10.48550/arxiv.2411.12439

arXiv: 2411.12439

Abstract

We present a highly parallelizable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite information to compact the text. To enable parallelisation, we introduce a novel concept: stable local consistency. A grammar algorithm ALG is stable if, for any pattern $P$ occurring in a collection $\mathcal{T}=\{T_1, T_2, \ldots, T_k\}$, the instances $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in $\mathcal{T}$'s parse tree that remains the same across all occurrences of $P$. This feature is important to achieve compression, but it only holds if ALG synchronises the parsing of the strings, for instance, by defining a common set of nonterminal symbols for them. Stability removes the need for synchronisation during the parsing phase. Consequently, we can run $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ fully in parallel and then merge the resulting grammars into a single compressed output equivalent to $ALG(\mathcal{T})$. We implemented our ideas and tested them on massive datasets. Our results show that our method can process a diverse collection of bacterial genomes (7.9 TB) in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, and produces a compressed representation 85 times smaller than the original input.
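The parse-in-parallel-then-merge idea can be illustrated with a toy sketch. This is not the paper's algorithm or the lcg implementation: it is a single parsing round under simplifying assumptions, and the names `fingerprint`, `parse`, and `compress_collection` are hypothetical. Phrase breaks are placed where a symbol's fingerprint is a strict local minimum, so each break depends only on a three-symbol window, and each phrase is named by a hash of its content (BLAKE2b here, standing in for a fast hash), so independent parses agree on names without a shared symbol table.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fingerprint(data: bytes) -> int:
    """Deterministic 64-bit fingerprint of a byte string."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def parse(text: bytes):
    """One toy parsing round: returns (rules, sequence of phrase names)."""
    if not text:
        return {}, []
    fp = [fingerprint(text[i:i + 1]) for i in range(len(text))]
    # A break at i depends only on text[i-1], text[i], text[i+1].
    breaks = [i for i in range(1, len(text) - 1) if fp[i - 1] > fp[i] < fp[i + 1]]
    bounds = [0] + breaks + [len(text)]
    rules, seq = {}, []
    for start, end in zip(bounds, bounds[1:]):
        phrase = text[start:end]
        name = fingerprint(phrase)  # content-derived name, no shared state
        rules[name] = phrase
        seq.append(name)
    return rules, seq


def compress_collection(texts):
    """Parse every text independently, then merge the rule sets."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(parse, texts))
    merged = {}
    for rules, _ in results:
        merged.update(rules)  # identical phrases collapse to a single rule
    return merged, [seq for _, seq in results]


if __name__ == "__main__":
    rules, seqs = compress_collection([b"abracadabra", b"cadabra!"])
    print(len(rules), [len(s) for s in seqs])
```

In this sketch, the phrases that fall strictly inside an occurrence of a pattern receive the same names in every text that contains the pattern, which is what lets the merge step be a plain dictionary union. A real implementation would iterate the parsing over several rounds and handle hash collisions, both of which are omitted here.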

Country

Germany

Related Organizations

Leibniz Association
Germany
Schloss Dagstuhl – Leibniz Center for Informatics
Germany
University of Helsinki
Finland

Impact by BIP!

- Selected citations: 0
- Popularity: Average
- Influence: Average
- Impulse: Average

Related to Research communities

UArctic

4 Research products, page 1 of 1

lcg software on GitHub

zstd software on GitHub

xxHash software on GitHub

agc software on GitHub