
# MHC-Diff 100K Dataset: Multi-Allele pMHC Structures

## Overview

This dataset contains **100,742 peptide-MHC class I (pMHC-I) structures** spanning 110 diverse HLA-I alleles and peptide lengths from 8 to 13 amino acids. It is designed for training and evaluating machine learning models for cross-allele pMHC structure prediction.

| Property | Value |
|----------|-------|
| **Total structures** | 100,742 |
| **X-ray structures** | 802 (from PDB and IMGT) |
| **PANDORA structures** | 99,940 (computationally modeled) |
| **MHC alleles** | 110 diverse HLA-I alleles |
| **Peptide lengths** | 8–13 amino acids |
| **Unique G-domains** | 286 |
| **Number of clusters** | 10 |
| **Total size** | ~47 GB |

## Data Sources

- **X-ray structures**: Experimental structures from the [Protein Data Bank (PDB)](https://www.rcsb.org/) and [IMGT/3Dstructure-DB](http://www.imgt.org/3Dstructure-DB/)
- **PANDORA structures**: Computationally modeled structures from the [PANDORA database](https://github.com/X-lab-3D/PANDORA)

## Clustering Strategy

MHC alleles were clustered using **hierarchical clustering** on G-domain sequences (the peptide-binding groove) with **BLOSUM62** similarity scores. Because the splits are defined at the cluster level, test alleles have low sequence similarity to training alleles, which enables rigorous evaluation of cross-allele generalization. The varying cluster sizes reflect the natural distribution of G-domain sequence families.
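As a rough illustration of the clustering idea described above, the sketch below runs average-linkage agglomerative clustering on a handful of made-up, pre-aligned toy sequences. It is not the procedure used to build this dataset: the sequences are invented, the match/mismatch scorer `toy_score` is a hypothetical stand-in for a BLOSUM62 lookup, and the linkage criterion, stopping rule, and all function names are assumptions.

```python
from itertools import combinations

# Hypothetical, pre-aligned toy sequences (NOT real G-domain sequences).
SEQS = {
    "a1": "GSHSMRYF",
    "a2": "GSHSMRYY",
    "b1": "KTWAQLDE",
    "b2": "KTWAQLDD",
    "c1": "PPLVNNCC",
}


def toy_score(x, y):
    """Stand-in for a BLOSUM62 lookup: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def similarity(s, t):
    """Sum of per-position substitution scores for two aligned sequences."""
    return sum(toy_score(x, y) for x, y in zip(s, t))


def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(sim[frozenset((i, j))] for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster(seqs, n_clusters):
    """Merge the two most similar clusters until n_clusters remain."""
    sim = {
        frozenset((i, j)): similarity(seqs[i], seqs[j])
        for i, j in combinations(seqs, 2)
    }
    clusters = [[name] for name in seqs]
    while len(clusters) > n_clusters:
        # Index pairs come out ordered (a < b), so deleting b keeps a valid.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], sim),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


clusters = cluster(SEQS, n_clusters=3)
print(clusters)
```

Holding out one of the resulting groups as a test set mirrors the cluster-level splits this dataset provides. For real sequences, `toy_score` would be replaced by an actual BLOSUM62 lookup, and the sequences would first need to be aligned.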
### Cluster Composition

| Cluster | G-domains | PANDORA | X-ray | Total |
|---------|-----------|---------|-------|-------|
| 1 | 2 | 59 | 0 | 59 |
| 2 | 74 | 30,044 | 138 | 30,182 |
| 3 | 11 | 3,412 | 18 | 3,430 |
| 4 | 11 | 523 | 52 | 575 |
| 5 | 54 | 26,227 | 351 | 26,578 |
| 6 | 2 | 0 | 2 | 2 |
| 7 | 30 | 2,337 | 78 | 2,415 |
| 8 | 2 | 10,425 | 0 | 10,425 |
| 9 | 99 | 26,889 | 163 | 27,052 |
| 10 | 1 | 24 | 0 | 24 |

## Files

```
mhc-diff-100k-v1.0/
├── README.md                  # This file
├── LICENSE                    # CC-BY-4.0 license
├── SHA256SUMS                 # Checksums for all files
├── samples.parquet            # Sample index (recommended)
├── samples.tsv.gz             # Sample index (alternative format)
├── split_recipes/             # Split definitions
│   ├── paper_split.json       # Train/val/test as used in the paper
│   ├── fold_cluster1.json     # Leave cluster 1 out
│   ├── ...
│   └── README.json            # Split recipe documentation
└── structures/                # HDF5 structure files
    ├── cluster_1.hdf5
    ├── cluster_2.hdf5.gz      # Gzip compressed (decompress before use)
    ├── ...
    └── cluster_10.hdf5
```

**Note:** `cluster_2.hdf5.gz` is gzip-compressed to reduce download size.
Decompress before use:

```bash
gunzip structures/cluster_2.hdf5.gz
```

## Paper Split (Recommended)

| Split | Clusters | Structures | X-ray |
|-------|----------|------------|-------|
| **Train** | 1, 2, 5, 8, 9, 10 | 94,320 | 652 |
| **Validation** | 7 | 2,415 | 78 |
| **Test** | 3, 4, 6 | 4,007 | 72 |

## Data Format

### Sample Index (`samples.parquet`)

| Column | Description |
|--------|-------------|
| `sample_id` | Unique structure identifier |
| `cluster_id` | Cluster assignment (1-10) |
| `source` | `xray` or `pandora` |
| `structure_file` | HDF5 file containing the structure |

### HDF5 Structure Files

Each HDF5 file contains multiple structures indexed by `sample_id`:

**X-ray structures** (4-letter PDB codes):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    pdb_string = f['1AKJ'][()].decode('utf-8')  # Raw PDB format
```

**PANDORA structures** (IDs starting with `BA-`):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    entry = f['BA-100003']
    peptide_coords = entry['peptide']['atom14_gt_positions'][:, 1, :]  # Cα coords
    protein_coords = entry['protein']['atom14_gt_positions'][:, 1, :]  # Cα coords
    peptide_seq = entry['peptide']['aatype'][:]  # Amino acid indices (0-19)
```

## Usage

### Paper Split

```python
import pandas as pd
import json

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Load paper split
with open('split_recipes/paper_split.json') as f:
    split = json.load(f)

# Create splits
train = samples[samples['cluster_id'].isin(split['train_clusters'])]
val = samples[samples['cluster_id'].isin(split['validation_clusters'])]
test = samples[samples['cluster_id'].isin(split['test_clusters'])]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
```

### Leave-One-Cluster-Out Cross-Validation

```python
import json
import pandas as pd

samples = pd.read_parquet('samples.parquet')

# One fold per cluster; each recipe lists its train and test clusters
for cluster_id in range(1, 11):
    with open(f'split_recipes/fold_cluster{cluster_id}.json') as f:
        fold = json.load(f)
    train = samples[samples['cluster_id'].isin(fold['train_clusters'])]
    test = samples[samples['cluster_id'].isin(fold['test_clusters'])]
```

## Related Datasets

The **MHC-Diff 8K Dataset** is a subset of this dataset, focusing specifically on HLA-A\*02:01 with 9-mer peptides.

- **MHC-Diff 8K Dataset**: [Zenodo DOI to be added]

## Citation

If you use this dataset, please cite:

```bibtex
@article{fruhbuss2025mhcdiff,
  title={MHC-Diff: Fast and Accurate Peptide-MHC Structure Prediction via an Equivariant Diffusion Model},
  author={Fr{\"u}hbu{\ss}, David and Baakman, Coos and Teusink, Siem and Bekkers, Erik and Jegelka, Stefanie and Xue, Li},
  year={2025}
}
```

## References

1. Berman, H.M., et al. "The Protein Data Bank." *Nucleic Acids Research* 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235
2. Lefranc, M.-P., et al. "IMGT/3Dstructure-DB." *Nucleic Acids Research* 33(suppl 1), D593–D597 (2005). https://doi.org/10.1093/nar/gki010
3. Marzella, D.F., et al. "PANDORA: a fast, anchor-restrained modelling protocol for peptide:MHC complexes." *Frontiers in Immunology* 13, 878762 (2022). https://doi.org/10.3389/fimmu.2022.878762
4. Marzella, D.F., Crocioni, G., et al. "Geometric deep learning improves generalizability of MHC-bound peptide predictions." *Communications Biology* 7(1), 1661 (2024). https://doi.org/10.1038/s42003-024-07292-1

## License

This dataset is released under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).

## Contact

- Li Xue: Li.Xue@radboudumc.nl
immunology, Deep Learning, Diffusion Model, peptide-MHC, structure prediction
