Powered by OpenAIRE graph
ZENODO
Other literature type . 2024
License: CC BY
Data sources: ZENODO
ZENODO
Conference object . 2024
License: CC BY
Data sources: Datacite

BMD-SRA: A Boosting Model for Differentiating Sequence Read Archive sequences

Authors: Bole, Martin;

Abstract

The number of sequence files deposited in the Sequence Read Archive (NCBI-SRA) has grown exponentially over the years, and with it the number of sequences annotated with the wrong sequence type. Because the submitted sequences are reused in genomic, metagenomic, and taxonomic studies, the research community needs a model that facilitates the collection of correctly annotated data. This study aimed to develop a boosting classification model, BMDSRA, that classifies input sequences into four sequence types: 1) Metagenomes, 2) Amplicons, 3) Single-Amplified Genomes (SAGs), and 4) Isolated-Genomes. To develop the Machine Learning (ML) algorithm, we gathered 3000 test samples for each sequence type and used them for supervised ML. Metagenomes were collected from various metagenome databases (DBs) created and manually curated by our team (Kasmanas et al., Nucleic Acids Research, 2020), with 750 samples from each. Amplicon samples were gathered from the Joint Genome Institute portal based on their library strategy. The SAG samples were collected by manually inspecting published research papers for proof that they were sequenced from a single cell. The Isolated-Genomes were gathered from SRA by searching for bacterial type-strain genomes from different taxonomies. BMDSRA reads a small portion of the sequence file using a sub-sampling approach (SRA Toolkit Development Team, https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) and extracts statistical features based on Shannon entropy, Tsallis entropy, and the Fourier z-curve. The extracted features were evaluated using the QPFS method (Soheili et al., Scientific Programming, 2020), and the reliability of the training data was tested with an outlier analysis. From the 119 generated features, we chose the 38 with the highest importance for developing the model. The outlier analysis showed that the SAG and Amplicon data sets were the most reliable, with few outliers.
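
The entropy-based features named above can be illustrated with a minimal sketch. It is not the BMDSRA implementation: the choice of overlapping k-mer frequencies as the underlying distribution, the k values, the Tsallis index q = 2, and all function names are assumptions made only for illustration.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k):
    """Relative frequencies of the overlapping k-mers in seq (assumed input distribution)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p**q)) / (q - 1), defined here for q != 1."""
    if q == 1:
        raise ValueError("q must differ from 1; the q -> 1 limit is the Shannon entropy (in nats)")
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def entropy_features(seq, ks=(1, 2, 3), q=2.0):
    """Hypothetical feature vector: one Shannon and one Tsallis value per k."""
    features = {}
    for k in ks:
        probs = kmer_probabilities(seq, k)
        features[f"shannon_k{k}"] = shannon_entropy(probs)
        features[f"tsallis_k{k}"] = tsallis_entropy(probs, q)
    return features


if __name__ == "__main__":
    # A short made-up read, used only to show the call.
    print(entropy_features("ACGTACGTTTGACCA"))
```

A uniform distribution over the four nucleotides gives the maximum single-base Shannon entropy of 2 bits, while a homopolymer gives 0, so features of this kind summarize how evenly a read uses its alphabet.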
The outliers from Metagenomes and Isolated-Genomes were subjected to further manual investigation. The model was created and evaluated using 5-fold cross-validation. The confusion matrix showed an overall accuracy of 92% (96% for SAGs, 95% for Amplicons, 92% for Metagenomes, and 85% for Isolated-Genomes). The false negatives from Isolated-Genomes classified as Metagenomes (7.6%) and SAGs (5.9%) are likely due to incorrect classification in the SRA. The false negatives from Metagenomes classified as Isolated-Genomes (6.7%) are potentially due to the downloading process from our DBs. BMDSRA can help researchers verify that the sequences they submit or collect from public repositories are correctly annotated. Our tool could also be used to select samples for meta-studies and to determine whether sequencing projects are well performed.
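
As a rough illustration of how per-class and overall accuracy are read off a confusion matrix, the sketch below uses a small toy matrix whose counts are invented for the example and are not the study's data. The helper names are hypothetical. The last lines check one thing that follows from the abstract itself: if the four classes are equally sized (the abstract reports 3000 samples per type; that the evaluation was exactly balanced is an assumption), the overall accuracy is the mean of the per-class values.

```python
def per_class_accuracy(matrix):
    """Fraction of each true class (row) predicted correctly (diagonal / row sum)."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy counts, invented for illustration only: rows are true classes,
# columns are predicted classes.
toy = [
    [9, 1, 0],
    [2, 8, 0],
    [0, 3, 7],
]
print(per_class_accuracy(toy))
print(overall_accuracy(toy))

# Per-class accuracies (in percent) as reported in the abstract.
reported = {"SAGs": 96, "Amplicons": 95, "Metagenomes": 92, "Isolated-Genomes": 85}
# With equally sized classes (an assumption), overall accuracy is the plain mean.
balanced_mean = sum(reported.values()) / len(reported)
print(balanced_mean)
```

Under that balance assumption, the mean of the four reported per-class values agrees with the reported overall accuracy of 92%.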

  • Impact by BIP!
    selected citations: 0
    These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    popularity: Average
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    influence: Average
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    impulse: Average
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
Open Access route: Green