Large-scale Docking Datasets for Machine Learning

Large-scale virtual screening has become a valuable tool for early-phase drug discovery. Recent expansions of commercial chemical space have made it computationally intractable to evaluate every compound in these libraries. Machine learning is one of the approaches that aim to prioritize specific subsets of these vast libraries. Putting such methods to the test requires access to large-scale datasets. To help the community benchmark their work, we share the docking scores from several ultralarge virtual screening campaigns. Each dataset contains canonical SMILES, compound identifiers, and docking scores. We docked two different chemical libraries against eight therapeutically relevant biological targets. The first dataset contains approximately 15.5 million molecules adhering to the "Rule-of-Four", whereas the second dataset consists of approximately 235 million "lead-like" molecules. The biological targets represent different classes of proteins and binding sites. More details on the datasets and our methods can be found in our GitHub repository (https://github.com/carlssonlab/conformalpredictor) and in our pre-print (https://doi.org/10.26434/chemrxiv-2023-w3x36). Please feel free to download and use these datasets for your own research purposes. We only ask that you cite our pre-print and the datasets appropriately if you use them in your work. Thank you for your interest in our research!
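Because each dataset holds millions of records with the same three fields (canonical SMILES, compound identifier, docking score), it is usually more practical to stream a file than to load it whole. The following is a minimal sketch of that idea, not part of the published workflow. It assumes a comma-separated file with a header row and the hypothetical column names `smiles`, `id`, and `score`, and it assumes that lower (more negative) scores are more favorable; check the actual file layout and score convention in the repository before adapting it. The sample rows are made-up placeholders, not values from the datasets.

```python
import csv
import heapq
import io


def top_scoring(rows, k):
    """Return the k rows with the lowest docking scores.

    Assumes each row is a dict with a "score" field and that lower
    scores are better (an assumption; verify against the dataset).
    """
    # heapq.nsmallest walks the iterable once and keeps only k rows in memory,
    # so it also works on a reader over a very large file.
    return heapq.nsmallest(k, rows, key=lambda row: float(row["score"]))


# Tiny in-memory stand-in for a downloaded file.
# Column names and all values are hypothetical placeholders.
sample = io.StringIO(
    "smiles,id,score\n"
    "CCO,CMPD-1,-12.5\n"
    "c1ccccc1,CMPD-2,-30.1\n"
    "CC(=O)O,CMPD-3,-21.7\n"
)

best = top_scoring(csv.DictReader(sample), k=2)
for row in best:
    print(row["id"], row["smiles"], row["score"])
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open(path, newline="")`, with `path` pointing to the downloaded dataset.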

Keywords

Virtual Screening, Chemical Space, Molecular Docking
