ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Dataset . 2025
License: CC BY
Data sources: Datacite

Zero-Shot Protein Segmentation (ZPS) Data and Embeddings

Authors: Sangster, Ami G.; Dufault, Cameron; Qu, Haoning; Le, Denise; Forman-Kay, Julie; Moses, Alan

Abstract

uniprotkb_Human.txt
    A raw text file containing a downloaded copy of UniProtKB that includes all reviewed human protein sequences. We used annotations from this file to compare against ZPS predictions.

uniprotkb_Human_Sequences.fasta
    A FASTA file containing the reviewed human protein sequences. These are the sequences we used as input to ProtT5 to generate protein embeddings.

ZPS_Boundaries.tsv
    A tab-separated file containing the boundaries of protein segments defined by ZPS for reviewed human protein sequences. Protein boundaries use zero-based indexing.

ZPS_Segment_Embeddings.hdf5
    An HDF5 file containing segment embeddings for the human proteome. See "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) for the definition of segment embeddings. Segment boundaries in this file also use zero-based indexing.

evaluation_data.zip includes:

    disprot_functional_annotations.tsv
        Contains DisProt annotations labeled as "molecular_function" or "disorder_function" for the human proteome, taken from the 2025-06 DisProt release.

    disprot_functional_annotations_per_segment.tsv
        A parsed version of disprot_functional_annotations.tsv. It includes protein segment keys and their corresponding DisProt functional annotations. These labels were used for multi-label evaluations.

    protGPS_dataset.csv
        A copy of the dataset provided under DOI 10.5281/zenodo.14795444 in notebook/dataset.csv.

    ProtGPS_idmapping_2025_08_13.tsv
        The ID mapping data downloaded from UniProt to map gene names and UniProt IDs found in protGPS_dataset.csv to the UniProt IDs used in ZPS.

    protGPS_data_only_disordered_segments.tsv
        The parsed version of protGPS_dataset.csv. It includes protein IDs, the train/dev/test split, a list of labels attributed to the protein, and a list of segment keys that overlap with MobiDB disorder annotations. These labels were used for multi-label evaluations.

    uniprot_annotations_per_segment_multi-class.tsv
        A parsed version of uniprotkb_Human.txt. It includes protein segment keys, protein IDs, gene IDs, and the labels used in multi-class evaluations. Multi-class labels include:
            PROSITE_LABELS: labels of the top ~20 most commonly occurring protein domains as annotated by ProRule on UniProt.
            IDR_VS_DOMAIN_LABELS: labels include Disordered (as annotated by MobiDB via UniProt), ProRule (as annotated by ProRule via UniProt, indicating a domain), and Background (does not overlap with a MobiDB disorder annotation or a ProRule domain annotation).
            COMP_BIAS_LABLES: labels for compositional bias annotations (as annotated by MobiDB via UniProt).
            DISORDER_LABELS: for segments that overlap with a MobiDB disordered annotation (via UniProt), the name of the other overlapping annotation with the highest IoU.

    uniprot_annotations_per_segment_multi-label.tsv
        A parsed version of uniprotkb_Human.txt. It includes protein segment keys and the labels used in multi-label evaluations.

Protein segment keys are formatted as "UniProtID start-stop", where the start and stop positions reference the canonical protein sequence on UniProt and use zero-based indexing.

* See "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) for how annotations were transferred to protein segments.
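
As a rough illustration of the segment key format and the highest-IoU rule mentioned for DISORDER_LABELS, the following Python sketch parses a key and picks the annotation that overlaps a segment best. It is not the authors' code. It assumes that the UniProt ID and the range are separated by a single space, and that stop is exclusive (Python slice convention); the description only states that positions are zero-based, so check the stop convention against the files before relying on it. The names Segment, parse_segment_key, interval_iou, and best_overlapping_label are hypothetical, as are the example key and annotation tuples.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    uniprot_id: str
    start: int  # zero-based
    stop: int   # zero-based; treated as exclusive here (an assumption)


def parse_segment_key(key: str) -> Segment:
    """Parse a key of the form "UniProtID start-stop"."""
    # Split on the last space so the ID itself is never touched.
    uniprot_id, _, span = key.strip().rpartition(" ")
    start_text, _, stop_text = span.partition("-")
    if not uniprot_id or not start_text or not stop_text:
        raise ValueError(f"unexpected segment key: {key!r}")
    return Segment(uniprot_id, int(start_text), int(stop_text))


def interval_iou(a_start: int, a_stop: int, b_start: int, b_stop: int) -> float:
    """Intersection over union of two half-open intervals."""
    intersection = max(0, min(a_stop, b_stop) - max(a_start, b_start))
    union = (a_stop - a_start) + (b_stop - b_start) - intersection
    return intersection / union if union > 0 else 0.0


def best_overlapping_label(
    segment: Segment,
    annotations: Iterable[Tuple[str, int, int]],
) -> Optional[str]:
    """Return the label with the highest IoU, or None if nothing overlaps."""
    best_label, best_iou = None, 0.0
    for label, start, stop in annotations:
        iou = interval_iou(segment.start, segment.stop, start, stop)
        if iou > best_iou:
            best_label, best_iou = label, iou
    return best_label


# Made-up key and annotations, for illustration only.
example = parse_segment_key("P04637 0-10")
chosen = best_overlapping_label(example, [("A", 5, 15), ("B", 0, 8)])
```

With these made-up inputs, annotation "B" covers more of the segment than "A" and is selected. If stop turns out to be inclusive in the distributed files, adding 1 to each stop before calling interval_iou would adapt the sketch.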

  • BIP!
    Impact by BIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    0
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Average
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Average
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Average