
doi: 10.5353/th_b5351027
handle: 10722/208040
The development of next-generation sequencing technology enables us to obtain a vast number of short reads from metagenomic samples. In these samples, reads from different species are mixed together. Metagenomic binning has therefore been introduced to cluster reads from the same or closely related species, and metagenomic annotation to predict the taxonomic information of each read. Both metagenomic binning and annotation are critical steps in downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions. There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely low abundance; and (6) recovering low-abundance species. To solve these problems, I propose a two-round binning method, MetaCluster 5.0. The improvement achieved by MetaCluster 5.0 is based on three major observations. First, the frequency distributions of short q-mers (length-q substrings, with q = 4 or 5) in sufficiently long fragments sampled from the same genome are more similar than those in fragments sampled from different genomes. Second, sufficiently long w-mers (length-w substrings, with w ≈ 30) are usually unique to an individual genome. Third, the frequencies of k-mers (length-k substrings, with k ≈ 16) in the reads of a species are usually linearly proportional to the species' abundance.
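The first and third observations can be illustrated with a minimal Python sketch. This is not the MetaCluster 5.0 implementation: the function names `qmer_profile`, `manhattan`, and `kmer_counts` are hypothetical, and the Manhattan distance is only a stand-in, since the abstract does not say how similarity between q-mer distributions is measured.

```python
from collections import Counter
from itertools import product


def qmer_profile(seq, q=4):
    """Normalized q-mer frequency vector over the ACGT alphabet.

    Entries follow the lexicographic order of all 4**q possible q-mers.
    Windows containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + q] for i in range(len(seq) - q + 1))
    keys = ["".join(p) for p in product("ACGT", repeat=q)]
    total = sum(counts[k] for k in keys)
    if total == 0:
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


def manhattan(u, v):
    """Manhattan (L1) distance between two profiles (an assumed measure)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def kmer_counts(reads, k=16):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts
```

Under the first observation, `manhattan(qmer_profile(a), qmer_profile(b))` would tend to be smaller when the long fragments `a` and `b` come from the same genome. Under the third, the values returned by `kmer_counts` for k-mers belonging to one species would scale roughly with that species' abundance in the sample.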
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) an inability to annotate many reads; (2) less precise annotation of reads and more incorrect annotation of contigs; (3) poor handling of novel clades with limited reference genomes; (4) performance that is affected by variable genome sequence similarity between different clades; and (5) high time complexity. In this thesis, a novel ...
Nucleotide sequence - Data processing
