Bamgineer: Introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets.

Article, Preprint English OPEN
Soroush Samadian ; Jeff P Bruce ; Trevor J Pugh (2018)
  • Publisher: Public Library of Science (PLoS)
  • Journal: PLoS Computational Biology, volume 14, issue 3 (issn: 1553-734X, eissn: 1553-7358)
  • Related identifiers: pmc: PMC5891060, doi: 10.1371/journal.pcbi.1006080, doi: 10.1101/119636
  • Subject: Computational Biology | Molecular biology | Research Article | Sequence Alignment | DNA sequence analysis | Genome Complexity | Genetic Loci | Genetics | Copy Number Variation | Genome Analysis | Molecular biology techniques | Genomics | Sequence analysis | Database and informatics methods | Alleles | Molecular Genetics | DNA sequencing | Sequencing techniques | Biology and life sciences | Research and analysis methods | Heredity | QH301-705.5 | Haplotypes | Genetic Mapping | Bioinformatics | Biology (General)

Somatic copy number variations (CNVs) play a crucial role in the development of many human cancers. The broad availability of next-generation sequencing data has enabled the development of algorithms that computationally infer CNV profiles from a variety of data types, including exome and targeted sequence data, currently the most prevalent types of cancer genomics data. However, systematic evaluation and comparison of these tools remains challenging due to a lack of ground truth reference sets. To address this need, we have developed Bamgineer, a tool written in Python that introduces user-defined, haplotype-phased, allele-specific copy number events into an existing Binary Alignment Map (BAM) file, with a focus on targeted and exome sequencing experiments. As input, the tool requires a read alignment file (BAM format), lists of non-overlapping genome coordinates at which to introduce gains and losses (BED file), and an optional file defining known haplotypes (VCF format). To improve runtime performance, Bamgineer introduces the desired CNVs in parallel, using queuing and parallel processing on a local machine or on a high-performance computing cluster. As proof-of-principle, we applied Bamgineer to a single high-coverage (mean: 220X) exome sequence file from a blood sample to simulate copy number profiles of 3 exemplar tumors from each of 10 tumor types at 5 tumor cellularity levels (20–100%, 150 BAM files in total). To demonstrate feasibility beyond exome data, we introduced read alignments to a targeted 5-gene cell-free DNA sequencing library to simulate EGFR amplifications at frequencies consistent with circulating tumor DNA (10, 1, 0.1 and 0.01%) while retaining the multimodal insert size distribution of the original data. We expect Bamgineer to be of use for the development and systematic benchmarking of CNV calling algorithms by users working with locally generated data for a variety of applications. The source code is freely available at http://github.com/pughlab/bamgineer.
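To make concrete what an allele-specific copy number event at a given tumor cellularity should look like in the simulated data, the following minimal Python sketch computes the expected read-depth ratio and B-allele fraction at a heterozygous site. It is not taken from the Bamgineer source code. It assumes a standard two-component mixture of tumor cells and diploid normal cells, and the names `expected_signal`, `copies_a`, `copies_b` and `cellularity` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedSignal:
    depth_ratio: float  # expected coverage relative to a diploid normal sample
    baf: float          # expected B-allele fraction at a heterozygous SNP


def expected_signal(copies_a: int, copies_b: int, cellularity: float) -> ExpectedSignal:
    """Expected signal for a tumor/normal mixture (illustrative assumption only).

    copies_a, copies_b: tumor copy number of each haplotype (hypothetical names).
    cellularity: fraction of cells in the sample that are tumor cells, in [0, 1].
    Normal cells are assumed to carry one copy of each haplotype.
    """
    if copies_a < 0 or copies_b < 0:
        raise ValueError("copy numbers must be non-negative")
    if not 0.0 <= cellularity <= 1.0:
        raise ValueError("cellularity must lie in [0, 1]")

    normal_fraction = 1.0 - cellularity
    # Average number of copies per cell across the mixed sample.
    mixed_total = cellularity * (copies_a + copies_b) + normal_fraction * 2
    mixed_b = cellularity * copies_b + normal_fraction * 1

    depth_ratio = mixed_total / 2.0
    # A homozygous deletion in a pure tumor leaves no reads, so BAF is undefined.
    baf = mixed_b / mixed_total if mixed_total > 0 else math.nan
    return ExpectedSignal(depth_ratio=depth_ratio, baf=baf)


if __name__ == "__main__":
    # Single-copy gain of haplotype A at the two ends of the 20-100% range.
    for purity in (0.2, 1.0):
        print(purity, expected_signal(copies_a=2, copies_b=1, cellularity=purity))
```

Under these assumptions, a single-copy gain in a pure tumor yields a depth ratio of 1.5 and a B-allele fraction of 1/3, and both values move toward the diploid baseline (1.0 and 0.5) as cellularity falls. This is why low-cellularity and circulating tumor DNA simulations are the harder benchmarks for CNV callers.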