Beware of mis-assembled genomes

Publication: Article, 25 Oct 2005, English
Publisher: Oxford University Press (OUP)
Journal: Bioinformatics, volume 21, pages 4320-4321 (issn: 1367-4803, eissn: 1367-4811)

Authors: Steven Salzberg; James A. Yorke;

doi: 10.1093/bioinformatics/bti769

pmid: 16332717

Abstract

With hundreds of genomes now in GenBank, researchers might be forgiven for assuming that genome sequence data are correct, at least at a large scale. Certainly there might be errors at some small rate, perhaps 1 in 50 000 or 100 000 bases (Schmutz et al., 2004; Read et al., 2002), but at a large scale these genomes are put together correctly, aren't they? Well, not always. We have been looking at the assemblies of large genomes for several years now, and for every ‘draft’ genome we look at, we find hundreds, and sometimes thousands, of mis-assemblies. These include regions where a genome is incorrectly re-arranged as well as places where large chunks of DNA sequence are simply deleted and the surrounding sequences just crunched together.

The source of most mis-assemblies is, as it has always been, repeats. Genomes vary in their repeat content, but we have learned that large genomes are filled with repeats of all shapes and sizes. To illustrate how these repeats result in sequences being ‘lost’ by an assembler, consider the situation in Figure 1. In the figure, the genome has two copies, R1 and R2, of a sequence that lie near one another, separated by a unique region shown in red. If R1 and R2 are long enough, then the assembler will not have any individual sequences (‘reads’) containing the entire repeat and its unique flanking sequences (the green and blue regions). The result will be that the genome assembly looks like the lower half of the figure, with a contiguous stretch of DNA (a contig) that has just one copy of the repeat, incorrectly jamming together the blue and green regions, and the red region will have no place to go. If this seems like a made-up example, it is not: we have observed that even the best assemblers today make exactly this mistake when assembling the Drosophila species currently being sequenced.
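The collapse described above can be sketched with a toy example. This is only an illustration under stated assumptions, not how any real assembler works: the sequences are random, the region lengths (60, 80, 40, 60), the read length of 50, and the names `random_dna` and `tiled_reads` are all hypothetical, and 'placing' a read is reduced to an exact substring match. It shows that when reads are shorter than the repeat, no read spans a full repeat copy, and that the collapsed contig (green, one repeat, blue) accounts for every read except those that touch the red region.

```python
import random

def random_dna(rng, n):
    """Return a random DNA string of length n (toy data, not a real genome)."""
    return "".join(rng.choice("ACGT") for _ in range(n))

def tiled_reads(genome, read_len):
    """Every substring of length read_len, as (start, sequence) pairs."""
    return [(i, genome[i:i + read_len])
            for i in range(len(genome) - read_len + 1)]

rng = random.Random(0)
# Hypothetical region lengths: green flank, repeat, red unique region, blue flank.
green, repeat, red, blue = (random_dna(rng, n) for n in (60, 80, 40, 60))

true_genome = green + repeat + red + repeat + blue  # upper half of the figure
collapsed = green + repeat + blue                   # lower half: one repeat copy

READ_LEN = 50  # deliberately shorter than the 80-base repeat
reads = tiled_reads(true_genome, READ_LEN)

# No read can contain a whole repeat copy, let alone its unique flanks.
spanning = [seq for _, seq in reads if repeat in seq]

# Reads that cannot be placed anywhere on the collapsed contig.
orphans = [(start, seq) for start, seq in reads if seq not in collapsed]

# Coordinates of the red region in the true genome.
red_start = len(green) + len(repeat)
red_end = red_start + len(red)

print("reads spanning a full repeat copy:", len(spanning))
print("reads with no place on the collapsed contig:", len(orphans))
print("all orphan reads overlap the red region:",
      all(s + READ_LEN > red_start and s < red_end for s, _ in orphans))
```

In this sketch the collapsed contig is shorter than the true genome by one repeat copy plus the red region, yet every read that lies outside the red region still fits it perfectly, which is why the error is easy to miss from the contig alone.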
Compressions such as this can easily total 1% or more of the genome, and the ‘orphan’ regions can be quite long, 5000–10 000 bp or more. We would also note that Drosophila is not a particularly difficult genome compared with many others currently under way. To those who might think (or argue) that the assembler they are using is not prone to such errors, we can only reply that we have seen these types of errors in all the major assemblers in use today (e.g. Arachne (Batzoglou et al., 2002; Jaffe et al., 2003), Celera Assembler (Myers et al., 2000), Jazz (Aparicio et al., 2002), Phusion (Mullikin and Ning, 2003), PCAP (Huang et al., 2003) and Atlas (Havlak et al., 2004)), in some cases after running the assemblers ourselves and in other cases after carefully examining the results of assemblies created by others.

We have developed software for improving assemblies that can detect at least some situations like the one shown above, although there is still no automated way of fixing these problems. However, the problem is often made much more difficult by the diploid nature of most large genomes, particularly the many mammalian genomes currently being sequenced by the NIH. The problem is this: the two copies of a chromosome are always slightly divergent, and this has led assembly groups (including ours) to develop methods for separating the two haplotypes from one another. But wherever there are tandem repeats in two or more copies, it can become extremely difficult to distinguish an incorrectly collapsed repeat (including situations such as that shown in Fig. 1) from true polymorphisms between the haplotypes.

A tremendous amount of genome analysis is built upon the framework of the DNA sequence itself: not only are genes and regulatory sites anchored in the sequence, but analyses of synteny, duplications and evolutionary relationships among species all depend on having the correct structure of the genome.
We need to devote more effort to making sure the basis for all these analyses does not turn out to be a house of cards.

Our group has created a website (http://cbcb.umd.edu/research/benchmark.shtml) for depositing reference assemblies: genomes for which the sequence is finished, and for which we can demonstrate how all the original data map to that finished sequence. The site also distinguishes the original whole-genome shotgun reads from any additional finishing reads. This small set of genomes, which thus far only includes bacteria, should be just the beginning: all assemblies need to be available so that others can check them and, if necessary, correct them.

Fortunately, NCBI has created a much larger resource to capture both draft and finished assemblies, the Assembly Archive (Salzberg et al., 2004). This archive captures the complete information about how a set of raw sequences maps to a genome assembly, whether that assembly is ‘draft’ or ‘finished’. After spending fifteen years and hundreds of millions of dollars on the human genome, the community has a near-complete draft sequence, but the evidence for that sequence (the underlying raw data and the assembly itself) is, amazingly, not available. Indeed, many of the original assemblies of parts of the human genome were done in the mid- and late-1990s, and are now lost. We can only hope that future genomes will not be needlessly lost now that there is a place to deposit them.

Are we arguing that all genomes should be finished? Actually, finishing does not necessarily address this problem at all. Finishing efforts are usually directed at closing gaps, not at fixing mis-assemblies, and therefore ‘finished’ genomes are very likely to contain errors of the type we are discussing. A better term for such genomes is ‘closed’: gaps are closed but sequence is not confirmed. We strongly suspect that many of the already-published finished genomes in GenBank today contain assembly errors.
To whom correspondence should be addressed. E-mail: salzberg@umd.edu

Related Organizations

University of Maryland, College Park
United States

Keywords

Computational Biology; Genomics; Databases, Nucleic Acid; Software; Repetitive Sequences, Nucleic Acid

Fields of Science

medical and health sciences

basic medicine
