ZENODO
Dataset . 2024
License: CC BY
Data sources: ZENODO
ZENODO
Dataset . 2024
License: CC BY
Data sources: Datacite


Data for "SeaMoon: from protein language models to continuous structural heterogeneity"

Authors: Lombard, Valentin

Abstract

Datasets used for the development of SeaMoon: https://github.com/PhyloSofS-Team/seamoon.

This upload contains the following data:

precomputed_emb.tar.gz is a compressed archive containing the precomputed data used for training and testing the models of the SeaMoon method, in Torch .pt format. The file prefixes consist of two IDs, "ID1_ID2_", identifying the DANCE [1] protein conformational collection used for its generation. "ID1" represents the first member of the collection in alphabetical order, while "ID2" is the reference conformation for the structural alignment. The "ESM_data" or "ProstT5_data" suffixes designate the type of embeddings, generated by either ESM2 [2] or ProstT5 [3]. The dictionary contains the following keys:

emb: The per-residue embedding.
data: A tuple containing "ID2" (the reference), the amino acid sequence, and the coverage of the positions in the original DANCE collection.
eigvect: The eigenvectors of the covariance matrix of the "ID1_ID2" collection, centered on the reference conformation "ID2".
eigval: The associated eigenvalues.
ref: The coordinates of the C-alpha atoms of the reference conformation "ID2".

train_list.txt, train_list_5ref.txt, val_list.txt and test_list.txt contain the identifiers of the samples used for training and evaluating the SeaMoon models. In the "5ref" setting, we used up to 5 reference conformations per collection.

For details on SeaMoon see:
SeaMoon: Prediction of molecular motions based on language models
Valentin Lombard, Dan Timsit, Sergei Grudinin, Elodie Laine
bioRxiv 2024.09.23.614585; doi: https://doi.org/10.1101/2024.09.23.614585

For more information on data usage and generation please see https://github.com/PhyloSofS-Team/seamoon.

Abstract: How proteins move and deform determines their interactions with the environment and is thus of utmost importance for cellular functioning. Following the revolution in single protein 3D structure prediction, researchers have focused on repurposing or developing deep learning models for sampling alternative protein conformations. In this work, we explored whether continuous compact representations of protein motions could be predicted directly from protein sequences, without exploiting or sampling protein structures. Our approach, called SeaMoon, leverages protein Language Model (pLM) embeddings as input to a lightweight (~1M trainable parameters) convolutional neural network. SeaMoon achieves a success rate of up to 40% when assessed against ~1,000 collections of experimental conformations exhibiting a wide range of motions. SeaMoon captures motions not accessible to normal mode analysis, an unsupervised physics-based method relying solely on a protein structure's 3D geometry, and generalises to proteins that do not have any detectable sequence similarity to the training set. SeaMoon is easily retrainable with novel or updated pLMs.

[1] Lombard, V.; Grudinin, S.; Laine, E. Explaining Conformational Diversity in Protein Families through Molecular Motions. Scientific Data 2024, 11, 752.
[2] Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; Dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; Candido, S.; Rives, A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130.
[3] Heinzinger, M.; Weissenow, K.; Sanchez, J. G.; Henkel, A.; Steinegger, M.; Rost, B. ProstT5: Bilingual language model for protein sequence and structure. bioRxiv 2023, 2023–07.
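
The file naming and dictionary layout described for precomputed_emb.tar.gz can be illustrated with a minimal Python sketch. It is not taken from the SeaMoon repository. The helper names (parse_sample_name, load_sample), the ".pt" extension on each file, and the assumption that neither ID contains an underscore are all hypothetical; only the two suffixes and the five dictionary keys come from the description above.

```python
from pathlib import Path
from typing import NamedTuple

# Suffixes and keys as named in the dataset description. How the parts are
# joined in the real file names (and the ".pt" extension) is an assumption.
EMBEDDING_SUFFIXES = ("ESM_data", "ProstT5_data")
EXPECTED_KEYS = ("emb", "data", "eigvect", "eigval", "ref")


class SampleName(NamedTuple):
    id1: str        # first member of the collection in alphabetical order
    id2: str        # reference conformation used for the structural alignment
    embedding: str  # "ESM_data" or "ProstT5_data"


def parse_sample_name(path: str) -> SampleName:
    """Split 'ID1_ID2_<suffix>.pt' into its parts.

    Assumes (hypothetically) that neither ID contains an underscore.
    """
    stem = Path(path).stem
    for suffix in EMBEDDING_SUFFIXES:
        if stem.endswith("_" + suffix):
            prefix = stem[: -(len(suffix) + 1)]
            id1, sep, id2 = prefix.partition("_")
            if not sep or not id1 or not id2 or "_" in id2:
                raise ValueError(f"unexpected ID prefix: {prefix!r}")
            return SampleName(id1, id2, suffix)
    raise ValueError(f"unrecognised embedding suffix in {stem!r}")


def load_sample(path: str) -> dict:
    """Load one sample and check that the documented keys are present."""
    import torch  # imported lazily so name parsing works without PyTorch

    sample = torch.load(path, map_location="cpu")
    missing = [key for key in EXPECTED_KEYS if key not in sample]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    return sample
```

With a made-up file name such as "1abcA_2xyzB_ESM_data.pt", parse_sample_name would return the two IDs and the embedding type. load_sample only verifies that the five documented keys exist and makes no assumption about tensor shapes. Depending on the installed PyTorch version and on how the files were saved, torch.load may need additional arguments.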

Keywords

Proteins, Deep learning
