ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning

Ahmed Elnaggar; Michael Heinzinger; Christian Dallago; Ghalia Rehawi; Yu Wang; Llion Jones; Tom Gibbs; Tamas Feher; Christoph Angerer; Martin Steinegger; Debsindhu Bhowmik; Burkhard Rost

https://doi.org/10.1101/2020.0...

Article . 2020 . Peer-reviewed

License: CC BY NC ND

Data sources: Crossref

https://www.biorxiv.org/conten...

Article

License: CC BY NC ND

Data sources: UnpayWall

https://dx.doi.org/10.1101/202...

Other literature type

Data sources: Microsoft Academic Graph

Publication: Article, Other literature type. 12 Jul 2020. Publisher: Cold Spring Harbor Laboratory

doi: 10.1101/2020.07.12.199554

Abstract

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and on TPU Pods with up to 1024 cores. Dimensionality reduction revealed that the raw protein LM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second comprised per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and of membrane vs. water-soluble proteins (2-state accuracy Q2=91%). For the per-residue predictions, the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.
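The Q3, Q10 and Q2 figures quoted in the abstract are k-state accuracies: the share of items (residues or proteins) whose predicted class matches the observed one. The sketch below shows one way such a score could be computed. The function name `q_state_accuracy` and the toy label strings are hypothetical illustrations, not taken from the paper or from the ProtTrans repository.

```python
from typing import Sequence


def q_state_accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted label equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(observed) == 0:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 3-state secondary structure strings (H = helix, E = strand, C = other).
# These labels are made up for illustration; they are not data from the paper.
observed_ss = "CCHHHHEEEC"
predicted_ss = "CCHHHCEEEC"

q3 = q_state_accuracy(predicted_ss, observed_ss)
print(f"Q3 = {q3:.0%}")  # one mismatch in ten residues -> Q3 = 90%
```

The same function would apply to per-protein tasks by passing one label per protein instead of one per residue, for example ten localization classes for Q10 or two classes for Q2.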

Related Organizations

Nvidia
United States
Oak Ridge National Laboratory
United States
Google (United States)
United States
Seoul National University
Korea (Republic of)
Technical University of Munich
Germany

11 Research products, page 1 of 2

Protein language model embeddings and predictions of the human proteome
2021, IsSupplementedBy
ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning
2022, IsAmongTopNSimilarDocuments
ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing
2020, IsAmongTopNSimilarDocuments
albert software on GitHub
IsRelatedTo
text-to-text-transfer-transformer software on GitHub
IsRelatedTo
electron software on GitHub
IsRelatedTo
apex software on GitHub
IsRelatedTo
ProtTrans software on GitHub
IsRelatedTo
DeepLearningExamples software on GitHub
IsRelatedTo
bert software on GitHub
IsRelatedTo

Impact by BIP!

selected citations: 293. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Top 0.1%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 1%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 0.1%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.