Project Description

Drug discovery pipelines increasingly rely on machine learning models to explore and evaluate large chemical spaces. Although including 3D complex information is considered beneficial, structure-based ML for affinity prediction suffers from data scarcity. We provide kinodata-3D, a dataset of ~138 000 docked complexes, to enable more robust training of 3D-based ML models for kinase activity prediction (see github.com/volkamerlab/kinodata-3D-affinity-prediction).

Dataset

1. Data

This dataset consists of three-dimensional protein-ligand complexes generated by computational docking with the OpenEye toolkit. The modeled proteins belong to the kinase family, for which a fair amount of structural data is available, i.e. co-crystallized protein-ligand complexes in the PDB, enriched through KLIFS annotations. This enables us to use template docking (OpenEye's POSIT functionality), in which ligand placement is guided by the pose of a similar co-crystallized ligand.

The kinase-ligand pairs to dock are sourced from binding assay data in the public ChEMBL archive, version 33. In particular, we use kinase activity data as curated by the OpenKinome kinodata project.

The final protein-ligand complexes are annotated with a predicted RMSD of the docked poses. The RMSD model is a simple neural network trained on a kinase-docking benchmark dataset, using ligand (fingerprint) similarity, docking score (Chemgauss4), and POSIT probability as inputs (see the kinodata-3D repository).

The final dataset contains 138 286 deduplicated kinase-ligand pairs in total, covering ~98 000 distinct compounds and ~271 distinct kinase structures.

2. File structure

The archive kinodata_3d.zip uses the following file structure:

data/raw
| kinodata_docked_with_rmsd.sdf.gz
| pocket_sequences.csv
| mol2/pocket
| | 1_pocket.mol2
| | ...

The file kinodata_docked_with_rmsd.sdf.gz contains the docked ligand poses together with the information on each protein-ligand pair inherited from kinodata.
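As an illustration of how the gzipped SDF file might be consumed, the following is a minimal sketch that uses only the Python standard library. It reads SD records, collects their data fields, and keeps poses whose predicted RMSD is below a cutoff. The field name passed as rmsd_field (here "predicted_rmsd") and the 2.0 Å cutoff are assumptions made for illustration; the actual field names should be checked in the file itself, and a cheminformatics toolkit is likely the better choice for real work.

```python
import gzip


def iter_sdf_records(handle):
    """Yield (mol_block, properties) for each record of an SD file.

    A record ends at a line containing '$$$$'. Data fields start with a
    header such as '> <name>', followed by value lines and a blank line.
    """
    mol_lines, props = [], {}
    key, in_mol_block = None, True
    for raw in handle:
        line = raw.rstrip("\n")
        if line.strip() == "$$$$":
            yield "\n".join(mol_lines), props
            mol_lines, props, key, in_mol_block = [], {}, None, True
        elif in_mol_block:
            mol_lines.append(line)
            if line.startswith("M  END"):
                in_mol_block = False
        elif line.startswith(">") and "<" in line:
            # Extract the field name between '<' and the last '>'.
            key = line[line.index("<") + 1 : line.rindex(">")]
            props[key] = ""
        elif key is not None:
            if line == "":
                key = None  # a blank line closes the current data field
            else:
                props[key] = f"{props[key]}\n{line}" if props[key] else line


def filter_by_predicted_rmsd(records, rmsd_field, max_rmsd):
    """Keep records whose rmsd_field parses as a float <= max_rmsd."""
    for mol_block, props in records:
        try:
            rmsd = float(props[rmsd_field])
        except (KeyError, ValueError):
            continue  # skip records without a usable RMSD annotation
        if rmsd <= max_rmsd:
            yield mol_block, props


def read_docked_poses(path, rmsd_field="predicted_rmsd", max_rmsd=2.0):
    """Load filtered poses from a gzipped SDF file.

    The default field name and cutoff are hypothetical placeholders.
    """
    with gzip.open(path, "rt") as handle:
        return list(
            filter_by_predicted_rmsd(iter_sdf_records(handle), rmsd_field, max_rmsd)
        )
```

A call such as read_docked_poses("data/raw/kinodata_docked_with_rmsd.sdf.gz") would then return a list of (mol_block, properties) pairs, provided the assumed field name matches the one used in the file.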
The protein pockets located in mol2/pocket are stored in the MOL2 file format. The pocket structures were sourced from KLIFS (klifs.net) and complement the ligand poses in the aforementioned SDF file. The files are named {klifs_structure_id}_pocket.mol2. The structure ID is given in the SDF file along with the ligand poses.

The file pocket_sequences.csv contains all KLIFS pocket sequences relevant to the kinodata-3D dataset.

3. Related code

The code used to create the poses can be found in the kinodata-3D repository. The docking pipeline makes heavy use of the kinoml framework, which in turn uses OpenEye's POSIT template docking implementation. Details of the original pipeline can also be found in the manuscript by Schaller et al. (2023), Benchmarking Cross-Docking Strategies for Structure-Informed Machine Learning in Kinase Drug Discovery, bioRxiv.
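To connect a docked pose to its pocket file as described under "File structure", a small sketch along the following lines may help. It builds the pocket path from a KLIFS structure ID using the {klifs_structure_id}_pocket.mol2 naming scheme and reads atom records from the @<TRIPOS>ATOM section of a MOL2 file. The helper names pocket_path and read_mol2_atoms are hypothetical, and placing mol2/pocket directly under data/raw is an assumption based on the file listing above.

```python
from pathlib import Path


def pocket_path(raw_dir, klifs_structure_id):
    """Build the path of a pocket file from its KLIFS structure ID.

    Assumes the layout <raw_dir>/mol2/pocket/{klifs_structure_id}_pocket.mol2.
    """
    return Path(raw_dir) / "mol2" / "pocket" / f"{klifs_structure_id}_pocket.mol2"


def read_mol2_atoms(lines):
    """Return (atom_name, x, y, z, atom_type) tuples from a MOL2 file.

    Only the @<TRIPOS>ATOM section is read; each atom line has the form
    'atom_id atom_name x y z atom_type [further optional columns]'.
    """
    atoms, in_atoms = [], False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("@<TRIPOS>"):
            # Track whether we are inside the ATOM section.
            in_atoms = stripped == "@<TRIPOS>ATOM"
            continue
        if in_atoms and stripped and not stripped.startswith("#"):
            fields = stripped.split()
            atoms.append(
                (
                    fields[1],
                    float(fields[2]),
                    float(fields[3]),
                    float(fields[4]),
                    fields[5],
                )
            )
    return atoms
```

For example, pocket_path("data/raw", 1) yields data/raw/mol2/pocket/1_pocket.mol2, which matches the first entry in the file listing; the opened file can then be passed line by line to read_mol2_atoms.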

Related Organizations

Saarland University
Germany
