BMD-SRA: A Boosting Model for Differentiating Sequence Read Archive sequences

Publication: Conference object, Other literature type
Date: 12 Jun 2024
Language: English
Publisher: Zenodo
Funded by: DFG | unidentified

Authors: Bole, Martin

doi: 10.5281/zenodo.11615627, 10.5281/zenodo.11615628


Abstract

The number of sequence files deposited in the Sequence Read Archive (NCBI-SRA) has grown exponentially over the years, and with it the number of sequences annotated with an incorrect type. Because the submitted sequences are then used for genomic, metagenomic, and taxonomic studies, the research community needs a model that facilitates the collection of correctly annotated data. This study aimed to develop a boosting classification model, BMDSRA, that classifies input sequences into four sequence types: 1) Metagenomes, 2) Amplicons, 3) Single-Amplified Genomes (SAGs), and 4) Isolated-Genomes. To develop the Machine Learning (ML) algorithm, we gathered 3000 test samples for each sequence type; these samples were used for supervised ML. Metagenomes were collected from various manually curated metagenome databases (DBs) created by our team (Kasmanas et al., Nucleic Acids Research, 2020), with 750 samples from each. Amplicon samples were gathered from the Joint Genome Institute portal based on their library strategy. The SAG samples were collected by manually inspecting published research papers to confirm that they were sequenced from a single cell. The Isolated-Genomes were gathered from the SRA by searching for bacterial type-strain genomes across different taxonomic groups. BMDSRA reads a small portion of the sequence file using a sub-sampling approach (SRA Toolkit Development Team, https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) and extracts statistical features based on Shannon entropy, Tsallis entropy, and the Fourier z-curve. The extracted features were evaluated using the QPFS method (Soheili et al., Scientific Programming, 2020), and the reliability of the training data was tested with an outlier analysis. From the 119 generated features, we selected the 38 with the highest importance for developing the model. The outlier analysis showed that the SAG and Amplicon data sets were the most reliable, with few outliers.
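The entropy-based features named above can be illustrated with a minimal sketch. It assumes the features are computed over the relative frequencies of overlapping k-mers in a sub-sampled read; the abstract does not state the exact input distribution, the value of k, or the Tsallis index q, so the function names and the defaults `k=3` and `q=2.0` are hypothetical choices for illustration only, not the BMDSRA implementation.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k=3):
    """Relative frequencies of overlapping k-mers made only of A, C, G, T.

    k=3 is an assumed default; the abstract does not specify it.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet  # skip k-mers containing N etc.
    )
    total = sum(counts.values())
    return [c / total for c in counts.values()] if total else []


def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy: S_q = (1 - sum(p**q)) / (q - 1).

    q=2.0 is an assumed default. In the limit q -> 1 the expression
    reduces to Shannon entropy measured in nats.
    """
    if q == 1.0:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


if __name__ == "__main__":
    read = "ACGTACGTTTGACCA"  # toy read, not real SRA data
    probs = kmer_probabilities(read, k=2)
    print(shannon_entropy(probs), tsallis_entropy(probs, q=2.0))
```

A uniform distribution over four symbols gives a Shannon entropy of 2 bits and a Tsallis entropy (q = 2) of 0.75, which is a convenient sanity check for either function.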
The outliers from Metagenomes and Isolated-Genomes were subjected to further manual investigation. The model was created and evaluated using 5-fold cross-validation. The confusion matrix showed an overall accuracy of 92% (96% for SAGs, 95% for Amplicons, 92% for Metagenomes, and 85% for Isolated-Genomes). The false negatives from Isolated-Genomes classified as Metagenomes (7.6%) and SAGs (5.9%) are likely due to incorrect classification in the SRA. The false negatives from Metagenomes classified as Isolated-Genomes (6.7%) are potentially due to the downloading process from our DBs. BMDSRA can help researchers verify that the sequences they submit or collect from public repositories are correctly annotated. Further, the tool could also be used to select samples for meta-studies and to determine whether sequencing projects are well performed.
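The evaluation described above (5-fold cross-validation, then per-class and overall accuracy read from a confusion matrix) can be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: the helper names are hypothetical, the fold assignment is a simple unshuffled round-robin per class, and the toy matrix holds made-up counts, not the study's results.

```python
def stratified_folds(labels, n_folds=5):
    """Spread the samples of each class evenly over n_folds folds.

    Samples are assigned round-robin per class in input order (no
    shuffling), so each fold keeps roughly the original class balance.
    """
    folds = [[] for _ in range(n_folds)]
    seen = {}  # how many samples of each class have been placed so far
    for idx, label in enumerate(labels):
        k = seen.get(label, 0)
        folds[k % n_folds].append(idx)
        seen[label] = k + 1
    return folds


def per_class_accuracy(matrix, labels):
    """Fraction of each true class that was predicted correctly.

    matrix[i][j] is the number of samples of true class i predicted as
    class j; the off-diagonal entries of row i are its false negatives.
    """
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        result[label] = matrix[i][i] / row_total if row_total else 0.0
    return result


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    classes = ["Metagenome", "Amplicon", "SAG", "Isolated-Genome"]
    # Toy counts for illustration only (10 samples per class).
    toy = [
        [9, 0, 0, 1],
        [0, 10, 0, 0],
        [0, 1, 9, 0],
        [1, 0, 1, 8],
    ]
    print(per_class_accuracy(toy, classes))
    print(overall_accuracy(toy))
```

When every class has the same number of samples, the overall accuracy equals the mean of the per-class accuracies; the reported figures are consistent with this, since (96 + 95 + 92 + 85) / 4 = 92.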

Related Organizations

Helmholtz Association of German Research Centres, Germany
Leipzig University, Germany
Helmholtz Centre for Environmental Research, Germany

