
uniprotkb_Human.txt: A raw text file containing a downloaded copy of UniProtKB, including all reviewed human protein sequences. We compared annotations from this file to ZPS predictions.

uniprotkb_Human_Sequences.fasta: A FASTA file containing the reviewed human protein sequences. These are the sequences we used as input to ProtT5 to generate protein embeddings.

ZPS_Boundaries.tsv: A tab-separated file containing the boundaries of the protein segments defined by ZPS for reviewed human protein sequences. Boundaries use zero-based indexing.

ZPS_Segment_Embeddings.hdf5: An HDF5 file containing segment embeddings for the human proteome. See "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) for the definition of segment embeddings. Segment boundaries in this file also use zero-based indexing.

evaluation_data.zip includes:

  disprot_functional_annotations.tsv: DisProt annotations labeled as "molecular_function" or "disorder_function" for the human proteome, taken from the 2025-06 DisProt release.

  disprot_functional_annotations_per_segment.tsv: A parsed version of disprot_functional_annotations.tsv. It lists protein segment keys and their corresponding DisProt functional annotations. These labels were used for multi-label evaluations.

  protGPS_dataset.csv: A copy of the dataset provided at DOI 10.5281/zenodo.14795444 in notebook/dataset.csv.

  ProtGPS_idmapping_2025_08_13.tsv: The ID mapping data downloaded from UniProt to map the gene names and UniProt IDs found in protGPS_dataset.csv to the UniProt IDs used in ZPS.

  protGPS_data_only_disordered_segments.tsv: A parsed version of protGPS_dataset.csv. It lists protein IDs, the train/dev/test split, the labels attributed to each protein, and the segment keys that overlap with MobiDB disorder annotations. These labels were used for multi-label evaluations.

  uniprot_annotations_per_segment_multi-class.tsv: A parsed version of uniprotkb_Human.txt. It lists protein segment keys, protein IDs, gene IDs, and the labels used in multi-class evaluations. Multi-class labels include:

    PROSITE_LABELS: Labels of the top ~20 most commonly occurring protein domains, as annotated by ProRule on UniProt.

    IDR_VS_DOMAIN_LABELS: Labels are Disordered (as annotated by MobiDB via UniProt), ProRule (as annotated by ProRule via UniProt, indicating a domain), and Background (no overlap with a MobiDB disorder annotation or a ProRule domain annotation).

    COMP_BIAS_LABLES: Labels for compositional bias annotations (as annotated by MobiDB via UniProt).

    DISORDER_LABELS: For segments that overlap with a MobiDB disorder annotation (via UniProt), the name of the other overlapping annotation with the highest IoU (intersection over union).

  uniprot_annotations_per_segment_multi-label.tsv: A parsed version of uniprotkb_Human.txt. It lists protein segment keys and the labels used in multi-label evaluations.

Protein segment keys: Keys are formatted as "UniProtID start-stop", where the start and stop positions reference the canonical protein sequence on UniProt and use zero-based indexing.

* See "Zero-shot segmentation using embeddings from a language model identifies functional regions in the human proteome" (A. G. Sangster 2025) for how annotations were transferred to protein segments.
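
The segment key format and the IoU criterion described above can be illustrated with a short Python sketch. This is not code from the dataset authors. The helper names (`Segment`, `parse_segment_key`, `interval_iou`) and the example coordinates are hypothetical, and the sketch assumes that `stop` is exclusive (a half-open interval), which this description does not state. If `stop` is inclusive in the files, interval lengths must be computed as `stop - start + 1`.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    uniprot_id: str
    start: int  # zero-based start position
    stop: int   # ASSUMPTION: exclusive end (half-open interval)


def parse_segment_key(key: str) -> Segment:
    """Split a key of the form 'UniProtID start-stop' into its parts."""
    # Split on the last space so the ID is kept whole.
    uniprot_id, span = key.strip().rsplit(" ", 1)
    start, stop = span.split("-", 1)
    return Segment(uniprot_id, int(start), int(stop))


def interval_iou(a_start: int, a_stop: int, b_start: int, b_stop: int) -> float:
    """Intersection over union of two half-open intervals (assumed convention)."""
    intersection = max(0, min(a_stop, b_stop) - max(a_start, b_start))
    union = (a_stop - a_start) + (b_stop - b_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical key: the coordinates are made up for illustration.
segment = parse_segment_key("P04637 0-93")
print(segment)

# Overlap between the segment and a hypothetical annotation spanning 50-120.
print(interval_iou(segment.start, segment.stop, 50, 120))
```

Under the same assumption, a segment maps to the one-based, inclusive coordinates shown on UniProt as `(start + 1, stop)`.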
