
Embeddings from protein language models predict conservation and variant effects

Research data: Dataset. 23 Aug 2021. Publisher: Zenodo

Authors: Marquet, Céline; Heinzinger, Michael; Olenyi, Tobias; Dallago, Christian; Erckert, Kyra; Bernhofer, Michael; Nechaev, Dmitrii;

doi: 10.5281/zenodo.5238536, 10.5281/zenodo.5238537


Abstract

For this work, we used protein language model representations (embeddings) to predict sequence conservation without multiple sequence alignments (MSAs). From single sequences, embeddings alone predicted residue conservation almost as accurately as ConSeq did using MSAs (two-state Matthews correlation coefficient, MCC, of 0.596±0.006 for ProtT5 embeddings vs. 0.608±0.006 for ConSeq).

ConSurf10k: dataset for the development of ProtT5cons

The method predicting residue conservation (ProtT5cons) used ConSurf-DB (Ben Chorin et al. 2020). This resource provided sequences and conservation for 89,673 proteins, for all of which experimental high-resolution three-dimensional (3D) structures were available in the Protein Data Bank (PDB) (Berman et al. 2000). As the standard of truth for the conservation prediction, we used the values from ConSurf-DB, which were generated using HMMER (Mistry et al. 2013), CD-HIT (Fu et al. 2012), and MAFFT-LINSi (Katoh and Standley 2013) to align proteins in the PDB (Burley et al. 2019). For proteins from families with over 50 proteins in the resulting MSA, an evolutionary rate at each residue position is computed and used along with the MSA to reconstruct a phylogenetic tree. The ConSurf-DB conservation scores ranged from 1 (most variable) to 9 (most conserved).

The PISCES server (Wang and Dunbrack 2003) was used to redundancy-reduce the data set such that no pair of proteins had more than 25% pairwise sequence identity. We removed proteins with resolutions >2.5 Å, those shorter than 40 residues, and those longer than 10,000 residues. The resulting data set (ConSurf10k) with 10,507 proteins (or domains) was randomly partitioned into training (9,392 sequences), cross-training/validation (555), and test (519) sets.
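The two-state MCC quoted above can be illustrated with a minimal sketch. The abstract does not say how the 1 to 9 ConSurf-DB scores are collapsed into two states, so the threshold used here (scores above 5 count as conserved) is an assumption, and the function names `binarize` and `two_state_mcc` are hypothetical rather than taken from the authors' code.

```python
import math
from typing import List, Sequence


def binarize(scores: Sequence[int], threshold: int = 5) -> List[int]:
    """Collapse 1..9 conservation scores into two states.

    ASSUMPTION: scores strictly above `threshold` are 'conserved' (1),
    all others 'variable' (0). The abstract does not state the cut-off.
    """
    return [1 if s > threshold else 0 for s in scores]


def two_state_mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when a row or column of the confusion
    # matrix is empty and the coefficient is otherwise undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example with made-up scores for a six-residue protein.
observed = binarize([9, 8, 2, 1, 6, 3])   # [1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 0]
mcc = two_state_mcc(observed, predicted)
```

An MCC of 1 means perfect agreement, 0 means no better than chance, and -1 means complete disagreement, which is why it is often preferred over plain accuracy when the two classes are unbalanced.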
Uploaded data:
- ConSuf10k_PDBid_seq_cons.fasta: FASTA file with PDB ID, sequence, and conservation annotation
- consurf10k_test_ids.txt: text file with the IDs of the test set
- consurf10k_train_ids.txt: text file with the IDs of the training set
- consurf10k_val_ids.txt: text file with the IDs of the cross-validation set
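The uploaded files could be combined into training, validation, and test sets roughly as sketched below. The exact file layout is not described in this record, so the sketch makes two loudly labeled assumptions: each FASTA record is three lines (a `>` header whose first token is the ID, a sequence line, and a conservation line with one character per residue), and each ID file holds one ID per line. All function names are hypothetical, and the parser may need adjusting to the real files.

```python
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str]  # (sequence, conservation annotation)


def parse_three_line_fasta(lines: Iterable[str]) -> Dict[str, Record]:
    """Parse records of the ASSUMED form: >ID / sequence / conservation."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected three non-empty lines per record")
    records: Dict[str, Record] = {}
    for i in range(0, len(cleaned), 3):
        header, seq, cons = cleaned[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"not a FASTA header: {header!r}")
        # ASSUMPTION: one conservation character per residue.
        if len(seq) != len(cons):
            raise ValueError(f"length mismatch in record {header!r}")
        records[header[1:].split()[0]] = (seq, cons)
    return records


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Read an ID list, ASSUMING one ID per line."""
    return {ln.strip() for ln in lines if ln.strip()}


def select(records: Dict[str, Record], ids: Set[str]) -> Dict[str, Record]:
    """Keep only the records whose ID appears in `ids`."""
    return {key: rec for key, rec in records.items() if key in ids}


# Toy in-memory data; the IDs and annotations are invented for illustration.
toy_fasta = [">1abc_A", "MKV", "915", ">2xyz_B", "GA", "37"]
toy_test_ids = ["2xyz_B\n"]
records = parse_three_line_fasta(toy_fasta)
test_set = select(records, read_ids(toy_test_ids))
```

With the real files, the same functions accept open file handles, for example `parse_three_line_fasta(open("ConSuf10k_PDBid_seq_cons.fasta"))` and `read_ids(open("consurf10k_test_ids.txt"))`, since iterating over a file yields its lines.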

References

- Ben Chorin A, Masrati G, Kessel A, Narunsky A, Sprinzak J, Lahav S, Ashkenazy H, Ben-Tal N (2020) ConSurf-DB: An accessible repository for the evolutionary conservation patterns of the majority of PDB proteins. Protein Science 29: 258-267. doi: 10.1002/pro.3779
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Research 28: 235-242. doi: 10.1093/nar/28.1.235
- Mistry J, Finn RD, Eddy SR, Bateman A, Punta M (2013) Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res 41: e121. doi: 10.1093/nar/gkt263
- Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28: 3150-2. doi: 10.1093/bioinformatics/bts565
- Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30: 772-80. doi: 10.1093/molbev/mst010
- Burley SK, Berman HM, Bhikadiya C, Bi C, Chen L, Di Costanzo L, Christie C, Dalenberg K, Duarte JM, Dutta S, Feng Z, Ghosh S, Goodsell DS, Green RK, Guranovic V, Guzenko D, Hudson BP, Kalro T, Liang Y, Lowe R, Namkoong H, Peisach E, Periskova I, Prlic A, Randle C, Rose A, Rose P, Sala R, Sekharan M, Shao C, Tan L, Tao YP, Valasatava Y, Voigt M, Westbrook J, Woo J, Yang H, Young J, Zhuravleva M, Zardecki C (2019) RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Research 47: D464-D474. doi: 10.1093/nar/gky1004
- Wang G, Dunbrack RL, Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19: 1589-1591. doi: 10.1093/bioinformatics/btg224

See details in the research paper.

Related research (1 research product): this dataset is a supplement to (IsSupplementTo) "Embeddings from protein language models predict conservation and variant effects" (2021).
