ZENODO
Dataset . 2026
License: CC BY
Data sources: Datacite

MHC-Diff 100K pMHC Structure Dataset (Multi-allele)

Authors: Frühbuß, David; Baakman, Coos; Teusink, Siem; Bekkers, Erik; Jegelka, Stefanie; Xue, Li C.

Abstract

# MHC-Diff 100K Dataset: Multi-Allele pMHC Structures

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

## Overview

This dataset contains **100,742 peptide-MHC class I (pMHC-I) structures** spanning 110 diverse HLA-I alleles and peptide lengths from 8 to 13 amino acids. It is designed for training and evaluating machine learning models for cross-allele pMHC structure prediction.

| Property | Value |
|----------|-------|
| **Total structures** | 100,742 |
| **X-ray structures** | 802 (from PDB and IMGT) |
| **PANDORA structures** | 99,940 (computationally modeled) |
| **MHC alleles** | 110 diverse HLA-I alleles |
| **Peptide lengths** | 8–13 amino acids |
| **Unique G-domains** | 286 |
| **Number of clusters** | 10 |
| **Total size** | ~47 GB |

## Data Sources

- **X-ray structures**: Experimental structures from the [Protein Data Bank (PDB)](https://www.rcsb.org/) and [IMGT/3Dstructure-DB](http://www.imgt.org/3Dstructure-DB/)
- **PANDORA structures**: Computationally modeled structures from the [PANDORA database](https://github.com/X-lab-3D/PANDORA)

## Clustering Strategy

MHC alleles were clustered using **hierarchical clustering** on G-domain sequences (the peptide-binding groove) with **BLOSUM62** similarity scores. This ensures that test alleles have low sequence similarity to training alleles, enabling rigorous evaluation of cross-allele generalization. The varying cluster sizes reflect the natural distribution of G-domain sequence families.
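To make the clustering step concrete, the following is a minimal sketch of agglomerative (hierarchical) clustering on a precomputed pairwise similarity matrix, such as one derived from BLOSUM62 alignment scores. The linkage criterion (average linkage here), the function name `average_linkage_clusters`, the labels `G1` to `G4`, and the similarity values are all illustrative assumptions; the README does not specify the exact procedure used for this dataset.

```python
from itertools import combinations


def average_linkage_clusters(names, similarity, n_clusters):
    """Merge the two most similar clusters until n_clusters remain.

    `similarity` is a symmetric matrix (list of lists); higher means more alike.
    Average linkage is an assumption, not the dataset authors' documented choice.
    """
    if not 1 <= n_clusters <= len(names):
        raise ValueError("n_clusters must be between 1 and len(names)")

    # Start with every sequence in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(names))]

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations() yields index pairs with x < y, so deleting y is safe.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    return [sorted(names[i] for i in cluster) for cluster in clusters]


# Hypothetical G-domain labels and made-up similarity scores, for illustration only.
toy_names = ["G1", "G2", "G3", "G4"]
toy_similarity = [
    [100, 90, 20, 10],
    [90, 100, 15, 12],
    [20, 15, 100, 85],
    [10, 12, 85, 100],
]

print(average_linkage_clusters(toy_names, toy_similarity, n_clusters=2))
# → [['G1', 'G2'], ['G3', 'G4']]
```

Holding out whole clusters, rather than individual structures, is what keeps held-out alleles dissimilar from the training alleles.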
### Cluster Composition

| Cluster | G-domains | PANDORA | X-ray | Total |
|---------|-----------|---------|-------|-------|
| 1 | 2 | 59 | 0 | 59 |
| 2 | 74 | 30,044 | 138 | 30,182 |
| 3 | 11 | 3,412 | 18 | 3,430 |
| 4 | 11 | 523 | 52 | 575 |
| 5 | 54 | 26,227 | 351 | 26,578 |
| 6 | 2 | 0 | 2 | 2 |
| 7 | 30 | 2,337 | 78 | 2,415 |
| 8 | 2 | 10,425 | 0 | 10,425 |
| 9 | 99 | 26,889 | 163 | 27,052 |
| 10 | 1 | 24 | 0 | 24 |

## Files

```
mhc-diff-100k-v1.0/
├── README.md                 # This file
├── LICENSE                   # CC-BY-4.0 license
├── SHA256SUMS                # Checksums for all files
├── samples.parquet           # Sample index (recommended)
├── samples.tsv.gz            # Sample index (alternative format)
├── split_recipes/            # Split definitions
│   ├── paper_split.json      # Train/val/test as used in the paper
│   ├── fold_cluster1.json    # Leave cluster 1 out
│   ├── ...
│   └── README.json           # Split recipe documentation
└── structures/               # HDF5 structure files
    ├── cluster_1.hdf5
    ├── cluster_2.hdf5.gz     # Gzip compressed (decompress before use)
    ├── ...
    └── cluster_10.hdf5
```

**Note:** `cluster_2.hdf5.gz` is gzip-compressed to reduce download size.
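Since the archive ships a `SHA256SUMS` file, downloads can be checked before use. The sketch below assumes that file follows the usual `sha256sum` output format (one `<hex digest>  <relative path>` pair per line); that format and the helper names `sha256_of` and `verify_checksums` are assumptions, not something the README confirms.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB HDF5 files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(sums_file, root):
    """Return the relative paths whose SHA-256 digest does not match.

    Assumes sha256sum-style lines: "<hex digest>  <relative path>".
    """
    mismatches = []
    for line in Path(sums_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        if sha256_of(Path(root) / name) != expected.lower():
            mismatches.append(name)
    return mismatches
```

Calling `verify_checksums("SHA256SUMS", ".")` from the dataset root would return an empty list if every listed file matches. Files that have since been decompressed or moved raise `FileNotFoundError`, so the check is best run immediately after download.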
Decompress `structures/cluster_2.hdf5.gz` before use:

```bash
gunzip structures/cluster_2.hdf5.gz
```

## Paper Split (Recommended)

| Split | Clusters | Structures | X-ray |
|-------|----------|------------|-------|
| **Train** | 1, 2, 5, 8, 9, 10 | 94,320 | 652 |
| **Validation** | 7 | 2,415 | 78 |
| **Test** | 3, 4, 6 | 4,007 | 72 |

## Data Format

### Sample Index (`samples.parquet`)

| Column | Description |
|--------|-------------|
| `sample_id` | Unique structure identifier |
| `cluster_id` | Cluster assignment (1-10) |
| `source` | `xray` or `pandora` |
| `structure_file` | HDF5 file containing the structure |

### HDF5 Structure Files

Each HDF5 file contains multiple structures indexed by `sample_id`:

**X-ray structures** (4-letter PDB codes):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    pdb_string = f['1AKJ'][()].decode('utf-8')  # Raw PDB format
```

**PANDORA structures** (IDs starting with `BA-`):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    entry = f['BA-100003']
    peptide_coords = entry['peptide']['atom14_gt_positions'][:, 1, :]  # Cα coords
    protein_coords = entry['protein']['atom14_gt_positions'][:, 1, :]  # Cα coords
    peptide_seq = entry['peptide']['aatype'][:]  # Amino acid indices (0-19)
```

## Usage

### Paper Split

```python
import pandas as pd
import json

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Load paper split
with open('split_recipes/paper_split.json') as f:
    split = json.load(f)

# Create splits
train = samples[samples['cluster_id'].isin(split['train_clusters'])]
val = samples[samples['cluster_id'].isin(split['validation_clusters'])]
test = samples[samples['cluster_id'].isin(split['test_clusters'])]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
```

### Leave-One-Cluster-Out Cross-Validation

```python
import pandas as pd
import json

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Each fold holds out one cluster as the test set
for cluster_id in range(1, 11):
    with open(f'split_recipes/fold_cluster{cluster_id}.json') as f:
        fold = json.load(f)
    train = samples[samples['cluster_id'].isin(fold['train_clusters'])]
    test = samples[samples['cluster_id'].isin(fold['test_clusters'])]
```

## Related Datasets

The **MHC-Diff 8K Dataset** is a subset of this dataset, focusing specifically on HLA-A\*02:01 with 9-mer peptides.

- **MHC-Diff 8K Dataset**: [Zenodo DOI to be added]

## Citation

If you use this dataset, please cite:

```bibtex
@article{fruhbuss2025mhcdiff,
  title={MHC-Diff: Fast and Accurate Peptide-MHC Structure Prediction via an Equivariant Diffusion Model},
  author={Fr{\"u}hbu{\ss}, David and Baakman, Coos and Teusink, Siem and Bekkers, Erik and Jegelka, Stefanie and Xue, Li},
  year={2025}
}
```

## References

1. Berman, H.M., et al. "The Protein Data Bank." *Nucleic Acids Research* 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235
2. Lefranc, M.-P., et al. "IMGT/3Dstructure-DB." *Nucleic Acids Research* 33(suppl 1), D593–D597 (2005). https://doi.org/10.1093/nar/gki010
3. Marzella, D.F., et al. "PANDORA: a fast, anchor-restrained modelling protocol for peptide:MHC complexes." *Frontiers in Immunology* 13, 878762 (2022). https://doi.org/10.3389/fimmu.2022.878762
4. Marzella, D.F., Crocioni, G., et al. "Geometric deep learning improves generalizability of MHC-bound peptide predictions." *Communications Biology* 7(1), 1661 (2024). https://doi.org/10.1038/s42003-024-07292-1

## License

This dataset is released under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).

## Contact

- Li Xue: Li.Xue@radboudumc.nl
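As a sanity check on the Paper Split table, the per-split totals can be recomputed from the per-cluster counts in the Cluster Composition table. The sketch below hard-codes those published counts; the names `CLUSTERS`, `PAPER_SPLIT`, and `split_counts` are illustrative and are not part of the dataset files.

```python
# (total structures, X-ray structures) per cluster, copied from the
# Cluster Composition table in this README.
CLUSTERS = {
    1: (59, 0), 2: (30182, 138), 3: (3430, 18), 4: (575, 52), 5: (26578, 351),
    6: (2, 2), 7: (2415, 78), 8: (10425, 0), 9: (27052, 163), 10: (24, 0),
}

# Cluster assignments from the Paper Split table.
PAPER_SPLIT = {
    "train": [1, 2, 5, 8, 9, 10],
    "validation": [7],
    "test": [3, 4, 6],
}


def split_counts(cluster_ids):
    """Sum total and X-ray structure counts over the given clusters."""
    total = sum(CLUSTERS[c][0] for c in cluster_ids)
    xray = sum(CLUSTERS[c][1] for c in cluster_ids)
    return total, xray


for name, cluster_ids in PAPER_SPLIT.items():
    total, xray = split_counts(cluster_ids)
    print(f"{name}: {total} structures, {xray} X-ray")
# → train: 94320 structures, 652 X-ray
# → validation: 2415 structures, 78 X-ray
# → test: 4007 structures, 72 X-ray
```

The same comparison can be run against `len(train)`, `len(val)`, and `len(test)` from the usage example to confirm that a local copy of `samples.parquet` is complete.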

Keywords

immunology, Deep Learning, Diffusion Model, peptide-MHC, structure prediction
