
Abstract

Motivation: When targeted to a barcoding region, high-throughput sequencing can be used to identify species or operational taxonomical units from environmental samples, and thus to study the diversity and structure of species communities. Although many methods provide confidence scores for assigning taxonomic affiliations, it is not straightforward to translate these values into unbiased probabilities. We present a probabilistic method for taxonomical classification (PROTAX) of DNA sequences. Given a pre-defined taxonomical tree structure that is partially populated by reference sequences, PROTAX distributes a total probability of one across the set of all possible outcomes. PROTAX accounts for species that are present in the taxonomy but do not have reference sequences, for the possibility of unknown taxonomical units, and for mislabeled reference sequences. PROTAX is based on a statistical multinomial regression model, and it can use any kind of sequence similarity measure or the outputs of other classifiers as predictors.

Results: We demonstrate the performance of PROTAX by using as predictors the output from BLAST, the phylogenetic classification software TIPP, and the RDP classifier. We show that PROTAX improves the predictions of the baseline implementations of the TIPP and RDP classifiers, and that it is able to combine complementary information provided by BLAST and TIPP, resulting in accurate and unbiased classifications even in very challenging cases such as 50% mislabeling of reference sequences.

Availability and implementation: Perl/R implementation of PROTAX is available at http://www.helsinki.fi/science/metapop/Software.htm.

Contact: panu.somervuo@helsinki.fi

Supplementary information: Supplementary data are available at Bioinformatics online.
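The idea of splitting a total probability of one over every possible outcome in a taxonomy can be illustrated with a small sketch. The Python below is not the PROTAX implementation (which is distributed as Perl/R). It is a minimal illustration that assumes a toy three-level tree, two hypothetical predictors per taxon (a flag for whether the taxon has a reference sequence, and a best sequence similarity), made-up regression weights, and a fixed score for the "unknown" outcome. At each node a multinomial logistic (softmax) model divides the incoming probability among the child taxa plus an "unknown" child, so the probabilities of all outcomes sum to one by construction.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def child_probabilities(children, weights, unknown_score=0.0):
    """Multinomial logistic model over the children of one node.

    children maps a child name to its predictor vector. An extra
    'unknown' outcome with a fixed score (an assumption of this sketch)
    stands for taxa that are missing from the taxonomy.
    """
    names = list(children) + ["unknown"]
    scores = [sum(w * x for w, x in zip(weights, feats))
              for feats in children.values()]
    scores.append(unknown_score)
    return dict(zip(names, softmax(scores)))


def decompose(tree, weights, node="root", prob=1.0, path=()):
    """Recursively split `prob` among all outcomes below `node`.

    Returns a dict mapping each outcome (a tuple of names from the root)
    to its probability.
    """
    children = tree.get(node)
    if not children:  # leaf taxon: the whole incoming mass stays here
        return {path: prob}
    outcomes = {}
    for name, p in child_probabilities(children, weights).items():
        child_path = path + (name,)
        if name == "unknown":
            outcomes[child_path] = prob * p
        else:
            outcomes.update(decompose(tree, weights, name, prob * p, child_path))
    return outcomes


# Hypothetical toy taxonomy. Predictors: [has_reference, best_similarity].
# A_sp2 is in the taxonomy but has no reference sequence.
tree = {
    "root": {"GenusA": [1.0, 0.95], "GenusB": [1.0, 0.60]},
    "GenusA": {"A_sp1": [1.0, 0.97], "A_sp2": [0.0, 0.0]},
    "GenusB": {"B_sp1": [1.0, 0.60]},
}
weights = [1.0, 5.0]  # made-up regression coefficients, not fitted values

outcomes = decompose(tree, weights)
for outcome, p in sorted(outcomes.items(), key=lambda kv: -kv[1]):
    print("/".join(outcome), round(p, 4))
```

In this sketch a species without reference sequences and the "unknown" outcome at each level both keep a nonzero share of the probability. In a real setting the weights would be estimated from training data rather than set by hand.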
SEQUENCES, IDENTIFICATION, ASSIGNMENT, FUNGI, RELIABILITY, MARKER GENES, DNA Barcoding, Taxonomic, RIBOSOMAL-RNA, Biochemistry, cell and molecular biology, Phylogeny, Software
