ZENODO
Other literature type . 2020
License: CC BY
Data sources: ZENODO
ZENODO
Presentation . 2020
License: CC BY
Data sources: Datacite

This Research product is the result of merged Research products in OpenAIRE.

Deep neural network embedding for efficient repository-scale analysis of hundreds of millions of mass spectra

Authors: Bittremieux, Wout; May, Damon H.; Bilmes, Jeffrey; Noble, William Stafford;

Abstract

Introduction

Despite an explosion of publicly available data in mass spectrometry proteomics repositories, peptide mass spectra are typically still analyzed by each laboratory in isolation, treating each experiment as if it has no relationship to any others. This approach fails to exploit the wealth of existing, previously analyzed mass spectrometry data. Here, we describe a deep neural network approach, called “GLEAMS”, which learns to embed spectra across an entire data repository into a low-dimensional space such that spectra generated by the same peptide are close to one another. This learned embedding captures latent properties of the spectra, and the low-dimensional space can be used for the efficient clustering and identification of hundreds of millions of spectra.

Methods

We have trained the GLEAMS deep neural network using peptide-spectrum assignments to embed spectra in a low-dimensional space. The neural network takes three feature types as input (precursor attributes, binned fragment intensities, and similarities to a set of reference spectra selected via submodular optimization) and consists of a combination of multiple convolutional and fully connected layers. The embedder is trained within a Siamese network, which contains two instances of the embedder with tied weights, by optimizing a contrastive loss function. This loss pulls positive training pairs (spectra corresponding to the same peptide) together and pushes negative training pairs (spectra corresponding to different peptides) away from each other.

Preliminary data

We have used GLEAMS to process 31 TB of human HCD proteomics data belonging to the MassIVE Knowledge Base dataset, corresponding to 666 million spectra derived from 220 publicly available experiments. After training the Siamese neural network, we observe that spectra generated by the same peptide lie close to each other in the embedded space.
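As an illustration of the contrastive objective named in the Methods, the following is a minimal sketch and not the GLEAMS implementation. It assumes the commonly used margin-based form of the contrastive loss with Euclidean distance; the abstract does not state the exact formulation, and the function name `contrastive_loss` and the `margin` value are hypothetical.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, same_peptide, margin=1.0):
    """Mean margin-based contrastive loss over a batch of embedding pairs.

    emb_a, emb_b: arrays of shape (n_pairs, dim), standing in for the outputs
        of the two tied-weight embedder instances.
    same_peptide: array of shape (n_pairs,), 1 for positive pairs and 0 for
        negative pairs.
    margin: hypothetical margin value; it is not given in the abstract.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    y = np.asarray(same_peptide, dtype=float)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(emb_a - emb_b, axis=1)

    # Positive pairs are penalized by their squared distance (pulled together).
    positive_term = y * dist ** 2
    # Negative pairs are penalized only while closer than the margin (pushed apart).
    negative_term = (1.0 - y) * np.maximum(0.0, margin - dist) ** 2

    return float(np.mean(positive_term + negative_term))
```

Under this form, a negative pair that is already farther apart than the margin contributes nothing to the loss, so training effort concentrates on different-peptide spectra that are still embedded close together.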
Additionally, the learned embeddings capture latent properties of the spectra, such as precursor mass and charge, and protein modifications correspond to translations in the latent space.

Next, we investigate the “dark matter” of the human proteome using our large-scale and heterogeneous public dataset. First, we use DBSCAN density-based clustering to group repeatedly observed embeddings corresponding to similar spectra. By propagating peptide labels within high-quality clusters containing spectra that correspond to a single peptide, we can achieve an 8% increase in identification rate. Second, clusters that only contain unidentified spectra are processed using the ANN-SoLo open modification spectral library search engine to identify modified peptides that are frequently observed but consistently remain unidentified. This allows us to achieve an additional 22% increase in identified spectra. As a result, this combined strategy achieves a 30% increase in identifications relative to the MassIVE-KB standard database search results at a repository scale, providing valuable new insights into previously unlabeled data.

In conclusion, the GLEAMS neural network is a powerful, scalable method that enables us to efficiently process hundreds of millions of MS/MS spectra and explore the dark human proteome at an unprecedented depth and scale.

Novel aspect

Repository-scale deep learning of hundreds of millions of spectra. Clustering and identifying the spectrum embeddings to investigate the dark proteome.
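The label propagation step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: cluster assignments are taken as given (for example from a DBSCAN run, where -1 conventionally marks noise), and a cluster is treated as "high-quality" when all of its identified spectra agree on one peptide. The abstract does not define its quality criterion, and the function name `propagate_labels` is hypothetical.

```python
from collections import defaultdict


def propagate_labels(cluster_ids, peptide_labels, noise_id=-1):
    """Copy a cluster's peptide label onto its unidentified spectra.

    cluster_ids: one cluster id per spectrum; noise_id marks unclustered spectra.
    peptide_labels: one peptide string per spectrum, or None if unidentified.
    Returns a new list of labels; the inputs are left unchanged.
    """
    # Collect the distinct peptides identified in each cluster.
    peptides_in_cluster = defaultdict(set)
    for cid, pep in zip(cluster_ids, peptide_labels):
        if cid != noise_id and pep is not None:
            peptides_in_cluster[cid].add(pep)

    propagated = list(peptide_labels)
    for i, (cid, pep) in enumerate(zip(cluster_ids, peptide_labels)):
        # Propagate only when the cluster's identified spectra agree on one peptide.
        if pep is None and len(peptides_in_cluster.get(cid, ())) == 1:
            propagated[i] = next(iter(peptides_in_cluster[cid]))
    return propagated


# Made-up example: cluster 0 is consistent, cluster 1 is mixed,
# cluster 2 contains only unidentified spectra, and -1 is noise.
clusters = [0, 0, 0, 1, 1, 1, -1, 2]
labels = ["PEPTIDEK", None, "PEPTIDEK", "AAAK", "CCCK", None, None, None]
result = propagate_labels(clusters, labels)
```

In this made-up example only the unidentified spectrum in cluster 0 receives a label. The spectrum in cluster 2 stays unidentified, which corresponds to the case the abstract hands over to the open modification search.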
