A lossless reference-free sequence compression algorithm leveraging grammatical, statistical, and substitution rules

Publication: Article, 01 Jan 2025, English

Publisher: Oxford University Press (OUP)

Journal: Briefings in Functional Genomics, volume 24 (ISSN: 2041-2649, eISSN: 2041-2657)

Authors: Roy, Subhankar; Kumar Maity, Dilip; Mukhopadhyay, Anirban;

doi: 10.1093/bfgp/elae050

pmid: 39777449

pmc: PMC11735755

Abstract

Deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequence compressors for novel species frequently face challenges when processing large-scale raw, FASTA, or multi-FASTA structured data. For years, molecular sequence databases have favored the widely used general-purpose compressors Gzip and Zstd. Because these encoders do not exploit sequence-specific characteristics, they perform poorly on such data, and using them well depends on time-consuming parameter tuning. To address these limitations, we propose a reference-free, lossless sequence compressor called GraSS (Grammatical, Statistical, and Substitution Rule-Based). GraSS compresses sequences more effectively by taking advantage of certain characteristics seen in DNA and RNA sequences. It supports the raw, FASTA, and multi-FASTA formats commonly found in GenBank DNA and RNA files. We evaluate GraSS’s performance using ten benchmark DNA sequences with a reduced number of repeats, two highly repetitive RNA sequences, and fifteen raw DNA sequences. Test results indicate that the weighted average compression ratios (WACR) for DNA and RNA sequences are 4.5 and 19.6, respectively. Additionally, the total compression time (TCT) for the entire DNA sequence corpus is 246.8 seconds (s). These results demonstrate that the proposed compression method performs better than several advanced algorithms specifically designed to handle various levels of sequence redundancy. The decompression times, memory usage, and CPU usage are also very competitive. Contact: anirban@klyuniv.ac.in
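To make the reported metric concrete, the following is a minimal sketch of how a weighted average compression ratio could be computed for a small corpus, using Gzip from the Python standard library as the general-purpose baseline. The abstract does not define WACR, so the weighting used here (each file's ratio weighted by its share of the original bytes) is an assumption, and the function names `gzip_sizes` and `weighted_average_ratio` as well as the toy sequences are hypothetical. This is not the GraSS algorithm.

```python
import gzip


def gzip_sizes(sequence: str) -> tuple[int, int]:
    """Return (original_bytes, compressed_bytes) for one raw sequence."""
    raw = sequence.encode("ascii")
    return len(raw), len(gzip.compress(raw, compresslevel=9))


def weighted_average_ratio(sizes: list[tuple[int, int]]) -> float:
    """Average the per-file ratios, weighted by original size.

    Assumption: the weight of each file is its share of the total
    original bytes. The paper may define WACR differently.
    """
    total_original = sum(original for original, _ in sizes)
    return sum(
        (original / total_original) * (original / compressed)
        for original, compressed in sizes
    )


# Toy corpus for illustration only; not data from the paper.
corpus = {
    "repetitive": "ACGT" * 1000,
    "short": "ACGTTGCA" * 50,
}
sizes = [gzip_sizes(seq) for seq in corpus.values()]
print(f"Gzip weighted average ratio: {weighted_average_ratio(sizes):.2f}")
```

Under this weighting, large files dominate the average, so a compressor that does well only on small files will not score highly.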

Related Organizations

University of Kalyani
India

Keywords

Protocol Article, Sequence Analysis, DNA, Data Compression, Algorithms, Compression Algorithms

Impact by BIP!

selected citations: 6. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green

hybrid