Cloud-native distributed genomic pileup operations

Publication: Article, Other literature type
Date: 30 Aug 2022
Country: United States
Language: English
Publisher: Oxford University Press (OUP)
Journal: Bioinformatics, volume 39 (ISSN: 1367-4803, eISSN: 1367-4811)

Authors: Marek Wiewiórka; Agnieszka Szmurło; Paweł Stankiewicz; Tomasz Gambin

doi: 10.1093/bioinformatics/btac804, 10.1101/2022.08.27.475646

pmid: 36515465

pmc: PMC9848050

Abstract

Motivation: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing of all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting reads-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes.

Results: Here, we present a scalable, distributed and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range join and coverage calculations, our package provides end-users with a unified SQL interface for convenient analyses of population-scale genomic data in an interactive way.

Availability and implementation: https://biodatageeks.github.io/sequila/
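To make the idea of processing alignment events from MD tags more concrete, here is a minimal single-machine sketch in Python. It is not SeQuiLa's code and says nothing about its partitioning or optimizations; it only illustrates, under simplifying assumptions, how CIGAR and MD together give per-position coverage and mismatch counts without loading the reference sequence. The function names (`aligned_columns`, `read_events`, `pileup`) and the tuple layout of a read are hypothetical, and deleted bases are simply not counted toward coverage here.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# One MD token is a match run, a deletion (^ plus reference bases), or a single mismatch.
MD_RE = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def aligned_columns(pos, cigar, seq):
    """Yield (ref_pos, read_base) for aligned bases and (ref_pos, None) for deleted ones."""
    ref, qry = pos, 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":          # consumes reference and read
            for i in range(n):
                yield ref + i, seq[qry + i]
            ref += n
            qry += n
        elif op == "D":          # consumes reference only, described by MD
            for i in range(n):
                yield ref + i, None
            ref += n
        elif op == "N":          # skipped region, not described by MD
            ref += n
        elif op in "IS":         # insertion / soft clip: consumes read only
            qry += n
        # H and P consume neither reference nor read


def read_events(pos, cigar, md, seq):
    """Return (ref_pos, ref_base, read_base) triples; read_base is None for deletions."""
    cols = list(aligned_columns(pos, cigar, seq))
    events, i = [], 0
    for matches, deletion, mismatch in MD_RE.findall(md):
        if matches:
            for ref_pos, base in cols[i:i + int(matches)]:
                events.append((ref_pos, base, base))  # match: reference equals read base
            i += int(matches)
        elif deletion:
            for ref_base in deletion:
                events.append((cols[i][0], ref_base.upper(), None))
                i += 1
        else:
            ref_pos, base = cols[i]
            events.append((ref_pos, mismatch.upper(), base))
            i += 1
    return events


def pileup(reads):
    """Aggregate reads given as (pos, cigar, md, seq) into per-position counts."""
    table = defaultdict(lambda: {"ref": None, "cov": 0, "alts": defaultdict(int)})
    for pos, cigar, md, seq in reads:
        for ref_pos, ref_base, read_base in read_events(pos, cigar, md, seq):
            if read_base is None:
                continue  # assumption of this sketch: deletions add no coverage
            row = table[ref_pos]
            row["ref"] = ref_base
            row["cov"] += 1
            if read_base != ref_base:
                row["alts"][read_base] += 1
    return table
```

For example, calling `pileup` on the three toy reads `(100, "5M", "2A2", "ACGTT")`, `(102, "2M1D2M", "2^T2", "ATGG")` and `(100, "2M1I2M", "4", "ACTAT")` gives coverage 3 at position 102 with reference base `A` and one alternative `G`. The sketch keeps everything in one dictionary for readability; a distributed implementation would additionally have to handle reads that span partition boundaries, which is the challenge the abstract describes.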

Country

United States

Related Organizations

Polish Academy of Sciences (Poland)
Warsaw University of Technology (Poland)
Institute of Computer Science (Poland)
Baylor College of Medicine (United States)
Texas Medical Center (United States)

Keywords

Original Paper, Medical Sciences, Genome, 000, Cell Phenomena, Life Sciences, Computational Biology, Genetics and Genomics, Genomics, Biomedical Informatics, 004, Medical Molecular Biology, Medical Specialties, Medicine and Health Sciences, and Immunity, Medical Genetics, Software, Algorithms, Biological Phenomena

Impact by BIP!

selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Access routes: Green, gold

Fields of Science

engineering and technology

medical engineering
