
pmid: 30816928
Abstract

Motivation: Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100–1000 GB sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.

Results: Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It achieves near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.

Availability and implementation: https://bitbucket.org/berkeleylab/jgi-sparc
Cluster Analysis (mesh), Bioinformatics, 3102 Bioinformatics and Computational Biology (for-2020), Metagenomics (mesh), Bioinformatics and Computational Biology, Bioengineering, 08 Information and Computing Sciences (for), 3105 Genetics (for-2020), Mathematical Sciences, Information and Computing Sciences, 31 Biological sciences (for-2020), Genetics, Cluster Analysis, Software (mesh), 46 Information and computing sciences (for-2020), Algorithms (mesh), 31 Biological Sciences (for-2020), Genetics (rcdc), Human Genome, Bioengineering (rcdc), High-Throughput Nucleotide Sequencing (mesh), High-Throughput Nucleotide Sequencing, DNA, Sequence Analysis, DNA, Biological Sciences, Human Genome (rcdc), 49 Mathematical sciences (for-2020), 004, 06 Biological Sciences (for), Bioinformatics (science-metrix), 01 Mathematical Sciences (for), DNA (mesh), Metagenomics, Sequence Analysis, Algorithms, Software
| Indicator | Value |
| --- | --- |
| selected citations | 16 |
| popularity | Top 10% |
| influence | Top 10% |
| impulse | Top 10% |
