ZENODO
Dataset . 2025
License: CC BY NC
Data sources: Datacite
2 versions available
FANTASIA V4.1 – LookUp Table – UniProt December 2025 – Experimental Evidence Code (Early Layers and Final Layers)

Authors: Rojas-Mendoza, Ana M.; Perez-Canales, Francisco M.; Dominguez-Rodriguez, Àlex


Abstract

Release: December 2025
System: Protein Information System (PIS v3.1.0)
Compatibility: FANTASIA V4.1

📖 Overview

This PostgreSQL database backup uses the pgvector extension to store high-dimensional protein embeddings. It contains precomputed embeddings and functional annotations from the UniProt December 2025 release, restricted to entries with experimental evidence codes only.

The lookup table was generated with PIS v3.1.0 (Protein Information System), an integrated platform for the automated extraction, processing, and management of protein-related data. PIS consolidates information from UniProt, PDB, and GOA, enabling efficient retrieval of sequences, structures, and annotations.

This release is designed for direct use with FANTASIA V4.1, an advanced pipeline for high-confidence functional annotation using Protein Language Models (PLMs). Unlike earlier releases, this dataset includes the Early Layers (layers 0–2) and the Final Layers (the last three layers) of each PLM, providing comprehensive embeddings for deep similarity search and GO term transfer.

🚫 Compatibility Notice

This database is not compatible with FANTASIA versions earlier than v4.1, nor with PIS versions earlier than v3.1.0. A tokenization inconsistency affecting the ProtT5-XL-UniRef50 model was corrected in this release. Because of this fix, ProtT5 embeddings produced with FANTASIA versions earlier than v4.1 will not match those stored in this lookup table. The incompatibility only affects workflows that use the ProtT5 model; however, we strongly recommend updating all components (FANTASIA, PIS, database) to ensure consistent behavior across all PLMs.
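The similarity search and GO term transfer mentioned above can be sketched in miniature. This is a minimal illustration of the general idea (cosine nearest-neighbor lookup followed by annotation transfer), not FANTASIA's actual implementation; all vectors and GO terms below are hypothetical toy data.

```python
# Sketch only: nearest-neighbor GO term transfer over toy embeddings.
# FANTASIA's real pipeline queries pgvector; this just shows the concept.
import numpy as np

def transfer_go_terms(query_emb, ref_embs, ref_go_terms, k=2):
    """Return the union of GO terms from the k most similar reference proteins."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = r @ q
    top = np.argsort(sims)[::-1][:k]   # indices of the k nearest references
    terms = set()
    for i in top:
        terms.update(ref_go_terms[i])
    return terms

# Toy reference set: 3 annotated proteins with 2-dimensional "embeddings".
refs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
go = [{"GO:0003824"}, {"GO:0003824", "GO:0005524"}, {"GO:0005634"}]
print(sorted(transfer_go_terms(np.array([1.0, 0.05]), refs, go, k=2)))
# -> ['GO:0003824', 'GO:0005524']
```

In the released database, the reference side of this search is the precomputed lookup table itself, and the distance computation is delegated to pgvector inside PostgreSQL.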
This lookup table serves as a ready-to-use reference for large-scale protein function transfer. FANTASIA:

- Loads multi-layer embeddings into memory
- Performs high-speed nearest-neighbor search in embedding space
- Transfers experimentally supported GO terms from annotated UniProt proteins

It provides a stable, optimized, and fully curated base for reproducible annotation workflows within the FANTASIA ecosystem.

📦 Embedding Coverage and Dataset Generation Details

📊 Layer Coverage by Model

Each of the five protein language models in this release includes six embedding layers: the Early Layers (layers 0–2) and the Final Layers (the last three layers of the model). This configuration provides both low-level and high-level representational information.

- ESM — 33 layers → included: 0, 1, 2, 31, 32, 33
- ProtT5 — 24 layers → included: 0, 1, 2, 22, 23, 24
- ProstT5 — 24 layers → included: 0, 1, 2, 22, 23, 24
- ANKH3-Large — 48 layers → included: 0, 1, 2, 46, 47, 48
- ESM3c — 33 layers → included: 0, 1, 2, 31, 32, 33

This standardized multi-layer extraction ensures balanced coverage for downstream comparative analysis.

📌 Core Dataset Statistics

- UniProt accessions: 127,546
- Protein records: 127,546
- Unique sequences: 124,397
- Total embeddings (5 models): 124,397 (includes 3,149 proteins with identical sequences due to isoforms/redundancy)
- Experimental GO annotations: 627,932
- Sequence redundancy: 2.47%

📈 Sequence Length Distribution (Unique Sequences)

The 124,397 unique sequences span a wide range:

- Minimum: 3 aa
- Maximum: 35,375 aa
- Mean: 587.44 aa
- Q1: 262 aa
- Median: 431 aa
- Q3: 694 aa

🖥️ Computational Infrastructure

All embeddings were generated on an NVIDIA GeForce RTX 3090 Ti (24 GB VRAM) hosted at the Computational Biology and Bioinformatics group (CABD). Previous lookup tables were created on CESGA Finisterrae III using A100 40 GB GPUs, which encountered memory limitations when processing long sequences, especially under shared-resource conditions.
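The layer-selection rule above (the first three layers plus the last three) follows the same pattern for every model, so the included indices can be derived from the model's total layer count alone. A minimal sketch, using the layer counts listed above:

```python
# Sketch: derive the six included layer indices from a model's total
# layer count (early = layers 0-2, final = the last three indices).
def included_layers(total_layers):
    return [0, 1, 2] + [total_layers - 2, total_layers - 1, total_layers]

# Layer counts for the five models in this release.
models = {"ESM": 33, "ProtT5": 24, "ProstT5": 24, "ANKH3-Large": 48, "ESM3c": 33}
for name, n in models.items():
    print(name, included_layers(n))
# e.g. ESM -> [0, 1, 2, 31, 32, 33]
```

This reproduces exactly the per-model index lists given in the coverage section.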
🚫 Missing Embeddings Overview

Model     Covered    Missing   Min length (aa)   Max length (aa)   Avg length (aa)
esm2      99.97%          35             9,563            35,375         18,095.40
esm3c     100%             –                 –                 –                 –
prott5    99.66%         420             1,557            35,375          6,018.19
prostt5   99.18%       1,022             1,557            35,375          4,568.18
ankh3     99.95%          61             7,962            35,375         14,091.56

This table summarizes, for each model:

- the percentage of successfully covered proteins
- the number of sequences that could not be embedded
- the minimum, maximum, and average lengths of the problematic sequences

📌 Commentary on the Missing Embeddings

The data shows that:

- Missing embeddings represent only 0–1% of the dataset, depending on the model.
- ESM3c achieved full coverage (100%) for all sequences, including the longest.
- The ProtT5-based models (prott5, prostt5) show the highest failure rates, due to the substantial memory requirements of long transformer contexts.
- All failures are due exclusively to extremely long sequences, frequently in the 10,000–35,000 aa range. These lengths exceed the practical VRAM capacity of most PLM inference pipelines (24 GB in this release).

📄 Additional Files Included

A companion file, missing_embeddings_per_model.csv, is provided, containing:

- the affected UniProt accessions
- the full sequence lengths
- the model-specific missing status

This file allows users to regenerate these embeddings on hardware with larger memory capacity (48–80 GB) or using architectures with efficient chunked attention.

🔬 Included GO Evidence Codes (Experimental Only)

Only GO annotations with experimental evidence are included:

- EXP — Inferred from Experiment
- IDA — Inferred from Direct Assay
- IPI — Inferred from Physical Interaction
- IMP — Inferred from Mutant Phenotype
- IGI — Inferred from Genetic Interaction
- IEP — Inferred from Expression Pattern
- TAS — Traceable Author Statement
- IC — Inferred by Curator

These evidence codes ensure that downstream analyses rely strictly on experimentally validated functional annotations.
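The evidence-code restriction above amounts to a simple whitelist filter over GO annotation records. A minimal sketch of that filter, using the eight codes listed; the sample annotation tuples here are hypothetical:

```python
# Sketch: keep only GO annotations whose evidence code is in the
# experimental set used by this release.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}

def keep_experimental(annotations):
    """Keep only (accession, go_term, evidence_code) tuples with experimental evidence."""
    return [a for a in annotations if a[2] in EXPERIMENTAL_CODES]

# Hypothetical example records (not taken from the dataset).
sample = [
    ("P12345", "GO:0003824", "IDA"),  # direct assay -> kept
    ("P12345", "GO:0005524", "IEA"),  # electronic annotation -> dropped
    ("Q67890", "GO:0005634", "TAS"),  # traceable author statement -> kept
]
print(keep_experimental(sample))
```

Note that purely computational codes such as IEA (Inferred from Electronic Annotation) are excluded by construction, which is what keeps the 627,932 retained annotations strictly experimental.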
