MHC-Diff 100K pMHC Structure Dataset (Multi-allele)

# MHC-Diff 100K Dataset: Multi-Allele pMHC Structures

[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

## Overview

This dataset contains **100,742 peptide-MHC class I (pMHC-I) structures** spanning 110 diverse HLA-I alleles and peptide lengths from 8 to 13 amino acids. It is designed for training and evaluating machine learning models for cross-allele pMHC structure prediction.

| Property | Value |
|----------|-------|
| **Total structures** | 100,742 |
| **X-ray structures** | 802 (from PDB and IMGT) |
| **PANDORA structures** | 99,940 (computationally modeled) |
| **MHC alleles** | 110 diverse HLA-I alleles |
| **Peptide lengths** | 8–13 amino acids |
| **Unique G-domains** | 286 |
| **Number of clusters** | 10 |
| **Total size** | ~47 GB |

## Data Sources

- **X-ray structures**: Experimental structures from the [Protein Data Bank (PDB)](https://www.rcsb.org/) and [IMGT/3Dstructure-DB](http://www.imgt.org/3Dstructure-DB/)
- **PANDORA structures**: Computationally modeled structures from the [PANDORA database](https://github.com/X-lab-3D/PANDORA)

## Clustering Strategy

MHC alleles were clustered using **hierarchical clustering** on G-domain sequences (the peptide-binding groove) with **BLOSUM62** similarity scores. This ensures that test alleles have low sequence similarity to training alleles, enabling rigorous evaluation of cross-allele generalization. The varying cluster sizes reflect the natural distribution of G-domain sequence families.

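To make the clustering idea concrete, here is a minimal, dependency-free sketch of agglomerative (hierarchical) clustering on pairwise sequence distances. It is an illustration under stated assumptions, not the procedure used to build this dataset: the similarity function is a plain fraction-identity score standing in for BLOSUM62, the average-linkage rule and the `n_clusters` cut-off are arbitrary choices, and the four sequences are toy strings rather than real G-domains.

```python
from itertools import combinations


def identity_similarity(a: str, b: str) -> float:
    """Fraction of matching positions in two pre-aligned, equal-length sequences.

    Stand-in for a BLOSUM62-based score (assumption, not the dataset's scoring).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def average_linkage_clusters(seqs: dict[str, str], n_clusters: int) -> list[set[str]]:
    """Merge the two closest clusters until only n_clusters remain."""
    # Pairwise distance between sequences: 1 - similarity.
    dist = {
        frozenset((i, j)): 1.0 - identity_similarity(seqs[i], seqs[j])
        for i, j in combinations(seqs, 2)
    }

    def avg_dist(pair):
        # Average linkage: mean distance over all cross-cluster sequence pairs.
        a, b = pair
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [{name} for name in seqs]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=avg_dist)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Toy, hypothetical sequences (not real HLA G-domains).
toy_gdomains = {
    "allele_A": "ACDEFGHIKL",
    "allele_B": "ACDEFGHIKM",
    "allele_C": "ACDEWYHIRL",
    "allele_D": "ACDEWYHIRV",
}
clusters = average_linkage_clusters(toy_gdomains, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Holding out an entire cluster for testing, rather than random structures, is what keeps near-identical G-domains from appearing on both sides of a split.
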
### Cluster Composition

| Cluster | G-domains | PANDORA | X-ray | Total |
|---------|-----------|---------|-------|-------|
| 1 | 2 | 59 | 0 | 59 |
| 2 | 74 | 30,044 | 138 | 30,182 |
| 3 | 11 | 3,412 | 18 | 3,430 |
| 4 | 11 | 523 | 52 | 575 |
| 5 | 54 | 26,227 | 351 | 26,578 |
| 6 | 2 | 0 | 2 | 2 |
| 7 | 30 | 2,337 | 78 | 2,415 |
| 8 | 2 | 10,425 | 0 | 10,425 |
| 9 | 99 | 26,889 | 163 | 27,052 |
| 10 | 1 | 24 | 0 | 24 |

## Files

```
mhc-diff-100k-v1.0/
├── README.md                # This file
├── LICENSE                  # CC-BY-4.0 license
├── SHA256SUMS               # Checksums for all files
├── samples.parquet          # Sample index (recommended)
├── samples.tsv.gz           # Sample index (alternative format)
├── split_recipes/           # Split definitions
│   ├── paper_split.json     # Train/val/test as used in the paper
│   ├── fold_cluster1.json   # Leave cluster 1 out
│   ├── ...
│   └── README.json          # Split recipe documentation
└── structures/              # HDF5 structure files
    ├── cluster_1.hdf5
    ├── cluster_2.hdf5.gz    # Gzip compressed (decompress before use)
    ├── ...
    └── cluster_10.hdf5
```

**Note:** `cluster_2.hdf5.gz` is gzip-compressed to reduce download size.

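Because the archive is large, it can be worth checking the download against `SHA256SUMS` before decompressing anything. The sketch below uses only the standard library. It assumes that `SHA256SUMS` follows the usual `sha256sum` layout (`<hex digest>  <relative path>` per line) and that the digests refer to the files as distributed; neither is confirmed here. `verify_checksums` is a hypothetical helper, not part of the dataset.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large HDF5 files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, sums_name: str = "SHA256SUMS") -> list[str]:
    """Return relative paths that are missing or whose digest does not match.

    Assumes one '<hex digest>  <relative path>' entry per line (sha256sum style).
    """
    failures = []
    for line in (root / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum prefixes '*' in binary mode
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

Calling `verify_checksums(Path("mhc-diff-100k-v1.0"))` returns an empty list when every listed file matches. If the digests were computed on the distributed files, a file that has already been decompressed would be reported as missing, so this check would need to run first.
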
Decompress before use:

```bash
gunzip structures/cluster_2.hdf5.gz
```

## Paper Split (Recommended)

| Split | Clusters | Structures | X-ray |
|-------|----------|------------|-------|
| **Train** | 1, 2, 5, 8, 9, 10 | 94,320 | 652 |
| **Validation** | 7 | 2,415 | 78 |
| **Test** | 3, 4, 6 | 4,007 | 72 |

## Data Format

### Sample Index (`samples.parquet`)

| Column | Description |
|--------|-------------|
| `sample_id` | Unique structure identifier |
| `cluster_id` | Cluster assignment (1–10) |
| `source` | `xray` or `pandora` |
| `structure_file` | HDF5 file containing the structure |

### HDF5 Structure Files

Each HDF5 file contains multiple structures indexed by `sample_id`:

**X-ray structures** (4-letter PDB codes):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    pdb_string = f['1AKJ'][()].decode('utf-8')  # Raw PDB format
```

**PANDORA structures** (IDs starting with `BA-`):

```python
import h5py

with h5py.File('cluster_2.hdf5', 'r') as f:
    entry = f['BA-100003']
    peptide_coords = entry['peptide']['atom14_gt_positions'][:, 1, :]  # Cα coords
    protein_coords = entry['protein']['atom14_gt_positions'][:, 1, :]  # Cα coords
    peptide_seq = entry['peptide']['aatype'][:]  # Amino acid indices (0-19)
```

## Usage

### Paper Split

```python
import pandas as pd
import json

# Load sample index
samples = pd.read_parquet('samples.parquet')

# Load paper split
with open('split_recipes/paper_split.json') as f:
    split = json.load(f)

# Create splits
train = samples[samples['cluster_id'].isin(split['train_clusters'])]
val = samples[samples['cluster_id'].isin(split['validation_clusters'])]
test = samples[samples['cluster_id'].isin(split['test_clusters'])]

print(f"Train: {len(train)}, Val: {len(val)}, Test: {len(test)}")
```

### Leave-One-Cluster-Out Cross-Validation

```python
import json
import pandas as pd

samples = pd.read_parquet('samples.parquet')

# One fold per cluster: the held-out cluster is the test set
for cluster_id in range(1, 11):
    with open(f'split_recipes/fold_cluster{cluster_id}.json') as f:
        fold = json.load(f)
    train = samples[samples['cluster_id'].isin(fold['train_clusters'])]
    test = samples[samples['cluster_id'].isin(fold['test_clusters'])]
```

## Related Datasets

The **MHC-Diff 8K Dataset** is a subset of this dataset, focusing specifically on HLA-A\*02:01 with 9-mer peptides.

- **MHC-Diff 8K Dataset**: [Zenodo DOI to be added]

## Citation

If you use this dataset, please cite:

```bibtex
@article{fruhbuss2025mhcdiff,
  title={MHC-Diff: Fast and Accurate Peptide-MHC Structure Prediction via an Equivariant Diffusion Model},
  author={Fr{\"u}hbu{\ss}, David and Baakman, Coos and Teusink, Siem and Bekkers, Erik and Jegelka, Stefanie and Xue, Li},
  year={2025}
}
```

## References

1. Berman, H.M., et al. "The Protein Data Bank." *Nucleic Acids Research* 28(1), 235–242 (2000). https://doi.org/10.1093/nar/28.1.235
2. Lefranc, M.-P., et al. "IMGT/3Dstructure-DB." *Nucleic Acids Research* 33(suppl 1), D593–D597 (2005). https://doi.org/10.1093/nar/gki010
3. Marzella, D.F., et al. "PANDORA: a fast, anchor-restrained modelling protocol for peptide:MHC complexes." *Frontiers in Immunology* 13, 878762 (2022). https://doi.org/10.3389/fimmu.2022.878762
4. Marzella, D.F., Crocioni, G., et al. "Geometric deep learning improves generalizability of MHC-bound peptide predictions." *Communications Biology* 7(1), 1661 (2024). https://doi.org/10.1038/s42003-024-07292-1

## License

This dataset is released under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).

## Contact

- Li Xue: Li.Xue@radboudumc.nl

Keywords

immunology, Deep Learning, Diffusion Model, peptide-MHC, structure prediction
