
Abstract

Spectral similarity is used as a proxy for structural similarity in many tandem mass spectrometry (MS/MS) based metabolomics analyses, such as library matching and molecular networking. Although weaknesses in the relationship between spectral similarity scores and true structural similarities have been described, little development of alternative scores has been undertaken. Here, we introduce Spec2Vec, a novel spectral similarity score inspired by a natural language processing algorithm, Word2Vec. Spec2Vec learns fragmental relationships within a large set of spectral data to derive abstract spectral embeddings that can be used to assess spectral similarities. Using data derived from GNPS MS/MS libraries, including spectra for nearly 13,000 unique molecules, we show that Spec2Vec scores correlate better with structural similarity than cosine-based scores. We demonstrate the advantages of Spec2Vec in library matching and molecular networking. Spec2Vec is also computationally more scalable, allowing structural analogue searches in large databases within seconds.
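To make the embedding idea concrete, the following is a minimal sketch, not the Spec2Vec implementation. It assumes that each peak is turned into a "word" by rounding its m/z value, that a spectrum embedding is the intensity-weighted sum of its word vectors, and that two embeddings are compared by cosine similarity. The function names, the two-decimal rounding, and the example spectra are all hypothetical, and the word vectors are random stand-ins for vectors that a Word2Vec model would learn from a large spectral library.

```python
import math
import random


def spectrum_to_document(peaks, decimals=2):
    """Turn (m/z, intensity) pairs into (word, intensity) pairs, e.g. 'peak@105.07'."""
    return [(f"peak@{mz:.{decimals}f}", intensity) for mz, intensity in peaks]


def toy_word_vectors(vocabulary, dim=8, seed=0):
    """Assumption: random vectors standing in for trained Word2Vec embeddings."""
    rng = random.Random(seed)
    return {word: [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for word in sorted(vocabulary)}


def spectrum_embedding(document, word_vectors):
    """Intensity-weighted sum of the vectors of all words known to the model."""
    dim = len(next(iter(word_vectors.values())))
    embedding = [0.0] * dim
    for word, intensity in document:
        vector = word_vectors.get(word)
        if vector is None:
            continue  # peaks unseen during training contribute nothing here
        for i in range(dim):
            embedding[i] += intensity * vector[i]
    return embedding


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Two made-up spectra that share two of their three fragment peaks.
doc_a = spectrum_to_document([(77.04, 0.3), (105.07, 1.0), (182.08, 0.5)])
doc_b = spectrum_to_document([(77.04, 0.2), (105.07, 1.0), (150.00, 0.1)])

vocabulary = {word for word, _ in doc_a + doc_b}
vectors = toy_word_vectors(vocabulary)

emb_a = spectrum_embedding(doc_a, vectors)
emb_b = spectrum_embedding(doc_b, vectors)
print("embedding similarity:", cosine(emb_a, emb_b))
```

With trained vectors, fragments that tend to co-occur across many spectra would end up close together in the embedding space, so two spectra could score as similar even when few of their peaks match exactly. The random vectors used here do not capture that effect; they only illustrate the data flow.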
ddc:004, Databases, Factual, QH301-705.5, Computational Biology, Reproducibility of Results, 004, Machine Learning, Tandem Mass Spectrometry, Life Science, Metabolomics, Computer Simulation, False Positive Reactions, Biology (General), Algorithms, Research Article, Gene Library, Natural Language Processing
| Indicator | Value | Description |
| --- | --- | --- |
| citations | 159 | This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). |
| popularity | Top 1% | This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. |
| influence | Top 10% | This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). |
| impulse | Top 1% | This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. |
