
Abstract Deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) sequence compressors for novel species frequently face challenges when processing wide-scale raw, FASTA, or multi-FASTA structured data. For years, molecular sequence databases have favored the widely used general-purpose Gzip and Zstd compressors. The absence of sequence-specific characteristics in these encoders results in subpar performance, and their use depends on time-consuming parameter adjustments. To address these limitations, in this article, we propose a reference-free, lossless sequence compressor called GraSS (Grammatical, Statistical, and Substitution Rule-Based). GraSS compresses sequences more effectively by taking advantage of certain characteristics seen in DNA and RNA sequences. It supports various formats, including raw, FASTA, and multi-FASTA, commonly found in GenBank DNA and RNA files. We evaluate GraSS’s performance using ten benchmark DNA sequences with reduced number of repeats, two highly repetitive RNA sequences, and fifteen raw DNA sequences. Test results indicate that the weighted average compression ratios (WACR) for DNA and RNA sequences are 4.5 and 19.6, respectively. Additionally, the entire DNA sequence corpus has a total compression time (TCT) of 246.8 seconds (s). These results demonstrate that the proposed compression method performs better than several advanced algorithms specifically designed to handle various levels of sequence redundancy. The decompression times, memory usage, and CPU usage are also very competitive. Contact: anirban@klyuniv.ac.in
Protocol Article, Sequence Analysis, DNA, Data Compression, Algorithms, Compression Algorithms
Protocol Article, Sequence Analysis, DNA, Data Compression, Algorithms, Compression Algorithms
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 6 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Average | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
