Publication: Article, Preprint
Published: 29 Aug 2022
Embargo end date: 01 Jan 2022
Publisher: ACM
Journal: Proceedings of the 51st International Conference on Parallel Processing

Authors: Guidi, Giulia; Raulet, Gabriel; Rokhsar, Daniel; Oliker, Leonid; Yelick, Katherine; Buluc, Aydin

doi: 10.1145/3545008.3545050, 10.48550/arxiv.2207.04350

arXiv: 2207.04350

Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

Abstract

De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing the computational demand and requires scalable, high-performance approaches. In this work, we present a novel distributed-memory algorithm that, starting from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome. Using a matrix abstraction, we mask branches in the string graph and compute connected components to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize the load imbalance in local assembly, i.e., the concatenation of sequences from a given contig. Based on the assignment obtained by partitioning, we apply an induced subgraph function to redistribute sequences between processes, resulting in a set of local sparse matrices. Finally, we traverse each matrix using depth-first search to concatenate sequences. Our algorithm shows good scaling, with parallel efficiency up to 80% on 128 nodes, while yielding uniform genome coverage and promising results in terms of assembly quality. Our contig generation algorithm localizes the assembly process to significantly reduce the amount of computation spent on this step. Our work is a step forward for efficient de novo long-read assembly of large genomes in distributed memory.
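The steps named in the abstract can be illustrated on a toy graph. The following is a minimal single-process sketch in plain Python, not the authors' distributed sparse-matrix implementation. Everything in it is an assumption made for illustration: the function names (`mask_branches`, `connected_components`, `partition_lpt`, `walk_chain`), the edge-list representation, the rule that a branch is a vertex with more than two neighbors, the choice to discard single-read components, and the greedy longest-processing-time heuristic standing in for multiway number partitioning. It returns the order of reads along each chain rather than concatenating actual sequences.

```python
from collections import defaultdict
import heapq

def mask_branches(n, edges):
    """Drop every edge incident to a branch vertex (assumed: degree > 2)."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [(u, v) for u, v in edges if deg[u] <= 2 and deg[v] <= 2]

def connected_components(n, edges):
    """Group vertices into components with union-find.
    Assumption: components of a single read are not reported as contigs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = defaultdict(list)
    for v in range(n):
        groups[find(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

def partition_lpt(sizes, nprocs):
    """Greedy heuristic for multiway number partitioning:
    assign the largest remaining item to the least-loaded process."""
    heap = [(0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    owner = [None] * len(sizes)
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, p = heapq.heappop(heap)
        owner[i] = p
        heapq.heappush(heap, (load + sizes[i], p))
    return owner

def walk_chain(component, edges):
    """Depth-first walk over the subgraph induced by one component,
    starting from a chain endpoint when one exists."""
    members = set(component)
    adj = defaultdict(list)
    for u, v in edges:
        if u in members and v in members:  # induced subgraph
            adj[u].append(v)
            adj[v].append(u)
    endpoints = [v for v in component if len(adj[v]) == 1]
    start = min(endpoints) if endpoints else min(component)
    order, stack, seen = [], [start], set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        stack.extend(w for w in adj[v] if w not in seen)
    return order

# Toy string graph: vertex 3 is a branch (three neighbors).
n = 8
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (5, 6), (6, 7)]

masked = mask_branches(n, edges)               # [(0, 1), (1, 2), (5, 6), (6, 7)]
contigs = connected_components(n, masked)      # [[0, 1, 2], [5, 6, 7]]
owner = partition_lpt([len(c) for c in contigs], nprocs=2)  # [0, 1]
chains = [walk_chain(c, masked) for c in contigs]           # [[0, 1, 2], [5, 6, 7]]
```

In this sketch the contig size used for load balancing is simply the number of reads; a real pipeline would more plausibly weight contigs by estimated sequence length, which is again an assumption rather than something the abstract specifies.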

ICPP22, August 29-September 1, 2022, Bordeaux, France

Related Organizations

University of California System, United States
University of California, Berkeley, United States
Lawrence Berkeley National Laboratory, United States
University of California, San Francisco, United States

Keywords

Genomics (q-bio.GN); FOS: Computer and information sciences; Computer Science - Distributed, Parallel, and Cluster Computing; FOS: Biological sciences; Quantitative Biology - Genomics; Distributed, Parallel, and Cluster Computing (cs.DC)

ELBA software on GitHub
IsRelatedTo

Impact by BIP!

	selected citations	2
	popularity	Average
	influence	Average
	impulse	Average