
Genomes are strings over the letters A, C, G, T, which represent nucleotides, the building blocks of DNA. Ever more numerous and rapidly advancing genome sequencing devices are producing ultra-large amounts of genome sequence data; the sequencing data accrued so far are reaching the exabyte scale. The driving, urgent question is: how can we arrange and analyze these masses of data in a formally rigorous, computationally efficient and biomedically rewarding manner? Graph-based data structures have been noted to offer disruptive benefits over traditional sequence-based structures when representing pan-genomes, that is, sufficiently large, evolutionarily coherent collections of genomes. This idea has its immediate justification in the laws of genetics: evolutionarily closely related genomes differ in only relatively few letters, while sharing the majority of their sequence content. Graph-based pan-genome representations, which remove redundancies without discarding individual differences, therefore make eminent sense. In this project, we will put this paradigm shift, from sequence-based to graph-based representations of genomes, into full effect. As a result, we can expect a wealth of practically relevant advantages, the most fundamental of which concern the arrangement, analysis, compression, integration and exploitation of genome data. In addition, we will open up a significant source of inspiration for computer science itself.
In the EU alone, according to the Orphanet DB (https://pubmed.ncbi.nlm.nih.gov/31527858/), 30 million persons, 3.5-6% of the general population, are affected by one of the 6,172 different rare diseases (RDs), of which 72% are genetic and 70% affect children. The path to diagnosis for people suffering from an RD is burdensome and often severely delayed by a diagnostic odyssey. The lack of a timely diagnosis affects disease management, family planning, and the identification of potentially beneficial treatments and/or clinical trials. This unacceptable situation is at odds with the principle of equity for EU citizens and requires rapid, structured, and cost-effective corrective action. The Screen4Care (S4C) consortium will leverage advances in genomics and digital technologies to develop and pilot genetic newborn screening (NBS) and AI-guided symptom recognition algorithms, while accounting for all relevant legal, regulatory and ethical considerations. S4C aims to harmonize the results of existing efforts in a horizon scan that looks at the totality of the available data resources, diagnostic algorithms, and other initiatives with similar ultimate goals. The genetic NBS will interrogate 1) currently treatable RDs (TREAT-map gene panel) and 2) actionable RDs (ACT-map gene panel) in 18,000 newborns in 3 EU countries (Germany, Italy, and the Czech Republic). Further, S4C will offer whole genome sequencing (WGS) to early symptomatic babies who tested negative in the panel-based NBS, in order to identify known RDs that escaped NBS as well as novel genes and phenotypes. S4C will also provide two digital diagnosis support systems for RDs, based on features and symptom complexes: 1) a federated algorithm, based on machine learning (ML) and literature evidence, for continuous and automated screening of electronic health records (EHR), and 2) a meta symptom checker with virtual clinics for patients and healthcare professionals (HCP), offering the possibility of increased diagnostic accuracy and ongoing support.
Our ambitious goal is to evaluate the validity of our multi-pronged approach to shorten the time to diagnosis for all patients affected by RDs, improve value-based healthcare resource utilization, and hopefully reduce the suffering of millions of European citizens.