
Efficient DNA-based storage systems offer substantial capacity and longevity at reduced cost, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) at most h consecutive identical bases (homopolymer constraint h), and 2) a GC ratio within [0.5 - c_GC, 0.5 + c_GC] (GC content constraint c_GC). Sequencing and synthesis errors tend to increase when these constraints are violated. In this research, we address a pure source coding problem in the context of DNA storage, considering both the homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity as the block length increases and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when h = 4 and c_GC = 0.05, the rate reached 1.988 bits per base, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub. We propose a variable-to-variable-length encoding method that achieves near-optimal rates without relying on the concatenation of short predefined sequences.
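The two constraints stated in the abstract can be made concrete with a short check. The following is a minimal sketch for illustration only, not the paper's encoder: the function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) and the small floating-point tolerance are assumptions introduced here, and the default values h = 4 and c_GC = 0.05 are taken from the example in the abstract.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = 0
    current = 0
    prev = None
    for base in seq:
        # Extend the current run, or start a new run of length 1.
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def gc_ratio(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int = 4, c_gc: float = 0.05) -> bool:
    """Check the homopolymer constraint h and the GC content constraint c_GC.

    The tolerance (hypothetical choice) guards against floating-point
    rounding at the interval boundaries.
    """
    tolerance = 1e-9
    run_ok = max_homopolymer_run(seq) <= h
    gc_ok = abs(gc_ratio(seq) - 0.5) <= c_gc + tolerance
    return run_ok and gc_ok
```

With the defaults, `satisfies_constraints("AAAACGCG")` holds, since the longest run has length 4 and exactly half the bases are G or C, whereas `"AAAAACGC"` fails the homopolymer constraint and `"AAAATTTT"` fails the GC content constraint.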
Base Composition, QH301-705.5, DNA storage, Research, Computer applications to medicine. Medical informatics, R858-859.7, Information Storage and Retrieval, DNA, Sequence Analysis, DNA, Variable-to-variable length code, GC content constraint, Biology (General), Homopolymer constraint, Algorithms
