ANABAG: ANnotated Antibody AntiGen dataset

ANABAG (ANnotated AntiBody AntiGen) is a curated dataset of antibody–antigen complexes. It includes:

- 3D structural data (in various formats)
- Per-sequence and per-residue features
- Frequent updates (monthly on the GitHub repository)

The analysis and prediction of antibody–antigen (Ab–Ag) interactions often overlook critical structural features such as glycosylation and physicochemical conditions such as pH and salt concentration. The field also lacks standardized criteria for selecting complexes based on structural properties and sequence identity. Datasets are commonly built by removing redundancy with sequence identity thresholds, which can inadvertently exclude complexes that share identical sequences but adopt alternative binding modes. To enable more precise Ab–Ag modeling and antibody engineering, it is essential to incorporate richer structural and physical information into both physics-based and machine learning models.

To address these limitations, we present ANABAG, a new curated dataset of Ab–Ag complexes annotated at the residue level with UniProt sequence information and enriched with a wide range of structural and physicochemical features. The dataset allows flexible filtering of complexes using a variety of descriptors available at both the complex and residue levels. Selected features are ready to use in machine learning workflows, while the structural files are compatible with antibody design and docking pipelines like Rosetta or Haddock. The complete dataset is available on Zenodo, and all accompanying scripts and usage documentation can be accessed via GitHub.

Files Included

The dataset is provided as four archives to accommodate different computational requirements:

1. data.tar.gz (Full Dataset, ~30 GB)

   The complete ANABAG dataset, containing all biological units (BUs) with comprehensive features and structures:

   - Initial chain structures: renumbered, chain-standardized format with antigen (AG) first and antibody (AB) second
   - Formatted structures: same formatting as the initial chain structures except for the chains, which are standardized with AG as chain A and AB as chain B
   - Heteroatom files: identical to the initial chain structures, with all non-protein atoms included (cofactors, glycans, water, etc.)
   - Rosetta-processed data: energy-minimized structures (relax) and associated features

   Note: Some Rosetta calculations did not complete successfully; these BUs lack Rosetta-specific outputs.

2. light_version.tar.gz (Light Version, ~7 GB)

   A streamlined version for users who need core structural data without additional processing:

   - Initial chain version of each biological unit
   - Associated features and annotations
   - Excludes: heteroatom files, Rosetta features, and relaxed structures
   - Ideal for initial exploration and machine learning applications that don't require heteroatoms

3. formated_structures_only.tar.gz (Minimal Version, ~4 GB)

   The most compact version, containing essential structural information:

   - Initial chain version of each biological unit only
   - Suitable for quick access and an overview of available complexes
   - Recommended for users with limited storage or bandwidth

4. per_residue_files.tar.gz (Minimal Version, ~3 GB)

   The per-residue features:

   - per_residue_information_AG.tsv, containing all features for antigen residues
   - per_residue_information_AB.tsv, containing all features for antibody residues

Note: All structures (except heteroatom files) include modeled regions, where gaps of up to 12 residues were modeled using Modeller and Disgro. Each residue is annotated in the 'Stat_res_pdbm' column as either 'Modelled' or 'Solved', allowing users to filter on experimental vs. modeled content. The 'Distance_interface' column (in Ångströms) enables filtering of modeled residues (or any residue) by their proximity to the binding interface.

Usage and Tools

ANABAG can be used directly or through our companion tools available at: DSIMB/anabag-handler

These scripts enable users to:

- Filter biological units based on specific criteria (pH range, experimental technique, resolution, secondary structures, etc.)
- Extract subsets for specialized analyses
- Convert between different structural formats
- Generate machine learning-ready features

For detailed usage instructions and examples, please refer to the GitHub repository documentation.
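
As a minimal sketch of the per-residue filtering described in the note on modeled regions, the snippet below keeps residues with a given 'Stat_res_pdbm' status that lie within a distance cutoff of the interface. Only the column names 'Stat_res_pdbm' and 'Distance_interface' and the values 'Solved' and 'Modelled' come from the dataset description. The tab delimiter is inferred from the .tsv extension, and the function name, the 5 Å cutoff, the 'Residue' column, and the in-memory sample rows are illustrative assumptions, not part of ANABAG or anabag-handler.

```python
import csv
import io


def filter_residues(handle, max_distance=5.0, status="Solved"):
    """Return rows with the given status within max_distance (in Å) of the interface.

    `handle` is any text file object over a tab-separated table that has
    'Stat_res_pdbm' and 'Distance_interface' columns.
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Stat_res_pdbm"] != status:
            continue
        try:
            distance = float(row["Distance_interface"])
        except (TypeError, ValueError):
            # Skip rows whose distance is missing or not numeric.
            continue
        if distance <= max_distance:
            kept.append(row)
    return kept


# Hypothetical sample rows; the real files contain many more columns.
sample = io.StringIO(
    "Residue\tStat_res_pdbm\tDistance_interface\n"
    "ALA1\tSolved\t3.2\n"
    "GLY2\tModelled\t2.1\n"
    "SER3\tSolved\t14.8\n"
    "THR4\tSolved\tNA\n"
)
near_interface = filter_residues(sample)
print([row["Residue"] for row in near_interface])
```

To run this on the real data, one would pass a handle such as `open("per_residue_information_AG.tsv", newline="")` in place of the sample; the actual column layout should be checked against the repository documentation first.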

Keywords

Physical chemistry, Antigen-Antibody Complex, UniProt
