Protein language model embeddings and predictions of the human proteome

Name: Protein language model embeddings and predictions of the human proteome
Keywords: protein embeddings, protein secondary structure, protein subcellular location, protein prediction, protein language models, human proteome

Dallago, Christian; Rost, Burkhard


ZENODO

Dataset . 2021

Data sources: Datacite

Research data (Dataset), 30 Jun 2021. Publisher: Zenodo. Funded by: DFG | unidentified

Authors: Dallago, Christian; Rost, Burkhard

doi: 10.5281/zenodo.5047019, 10.5281/zenodo.5047020

Abstract

Residue and sequence embeddings of the human proteome (SwissProt, organism Human, downloaded on 2021.06.09), computed with bio_embeddings (bioembeddings.com) using the ProtT5 embedder at full precision (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3).

Additionally:

- Sequence-level predictions of subcellular localization in 10 classes using Light Attention (LA) (https://www.biorxiv.org/content/10.1101/2021.04.25.441334v1)
- Residue-level three-state secondary structure predictions (alpha, sheet or other) using models reported in the ProtTrans paper (https://www.biorxiv.org/content/10.1101/2020.07.12.199554v3)

Files included:

- human.fasta --> FASTA-formatted human sequences from SwissProt.
- DSSP3_human_ProtT5Sec.fasta --> Secondary structure predictions in three states for each residue of each protein in human.fasta. "H" stands for Helix; "E" stands for Sheet; "C" stands for Other.
- subcell_human_LA_ProtT5.csv --> Subcellular location (10 states) and membrane-boundness (2 states) for each protein in human.fasta.
- embeddings_file.h5 --> Per-residue embeddings of the sequences in human.fasta. Each dataset in the .h5 file represents one protein sequence and contains a matrix of shape L x 1024, where L is the length of the protein sequence. Datasets are indexed using integers. The original sequence identifier (from the FASTA header) can be accessed through the "original_id" attribute. See https://docs.bioembeddings.com/v0.2.0/notebooks/open_embedding_file.html for information on how to open the file.
- reduced_embeddings_file.h5 --> Per-sequence embeddings of the sequences in human.fasta, obtained by mean-pooling the residue embeddings along the length dimension of the protein sequence. Each dataset in the .h5 file represents one protein sequence and contains a vector of size 1024, so every sequence embedding has the same dimension regardless of protein length.
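The file layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' own tooling: the function names `read_fasta`, `state_fractions` and `iter_embeddings` are hypothetical, the h5py package is assumed to be installed for the embedding files, and DSSP3_human_ProtT5Sec.fasta is assumed to follow the usual FASTA layout with one string of H/E/C states per header.

```python
from collections import Counter


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA-formatted file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def state_fractions(states):
    """Return the fraction of residues in each state: H (helix), E (sheet), C (other)."""
    counts = Counter(states)
    total = len(states)
    return {state: (counts[state] / total if total else 0.0) for state in "HEC"}


def iter_embeddings(path):
    """Yield (original_id, array) for every dataset in an embeddings .h5 file.

    Datasets are keyed by integers; the FASTA identifier is stored in the
    "original_id" attribute, as described in the abstract.
    """
    import h5py  # third-party; imported here so the FASTA helpers work without it

    with h5py.File(path, "r") as h5_file:
        for _index, dataset in h5_file.items():
            # dataset[()] reads the full array into memory:
            # shape (L, 1024) per residue, (1024,) for the reduced file.
            yield dataset.attrs["original_id"], dataset[()]
```

For example, `for original_id, vector in iter_embeddings("reduced_embeddings_file.h5")` walks the per-sequence vectors one protein at a time, and `state_fractions(states)` summarises each record returned by `read_fasta("DSSP3_human_ProtT5Sec.fasta")`. The generator form is a deliberate choice: the per-residue file holds one L x 1024 matrix per protein, so loading all of it into memory at once may not be practical.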

Related Organizations

Technical University of Munich
Germany

4 Research products

- Protein language model embeddings and predictions for the fly proteome (FlyBase), 2022 (IsAmongTopNSimilarDocuments)
- Light attention predicts protein location from the language of life, 2021 (IsSupplementTo)
- ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning, 2020 (IsSupplementTo)
- Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets, 2021 (IsSupplementTo)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.