Background
Transformer-based AI models have shown outstanding performance in identifying druggable candidate molecules. In most cases, models are trained on massive databases of molecular information to capture the latent meaning of a given molecule. However, desirable candidate molecules must also be feasible to synthesize, have low toxicity, and be highly druggable. In this study, we injected prior knowledge of these desirable properties during the training process.

Methods
Using the PubChem database (100 M), we filtered druglike molecules based on the quantitative estimate of drug-likeness (QED) score and the Pfizer rule. With this dataset of druglike molecules, we trained both a molecular representation model (ChemBERTa) and a molecular generation model (MolGPT). The molecular representation model was evaluated by fine-tuning on the MoleculeNet benchmark datasets, and the molecular generation model was evaluated based on generated samples (10 K).

Results
Training with druglike molecules enabled the generation of molecules with desirable properties without any conditioning. Although the overall performance of the molecular representation model was not remarkable, its performance in predicting clinical toxicity exceeded that of conventional molecular representation models.

Conclusion
By training on a dataset of druglike molecules, our approach enables molecular representation models to predict clinical toxicity more precisely. Furthermore, it enables the molecular generation model to generate molecules with desirable druglike properties without any conditional generation procedure.

--------------------------------------------------

    import pickle

    # Load the pickled dataset of druglike molecules.
    # Only unpickle files that come from a trusted source.
    with open("druglike_molecules_QED.pkl", "rb") as f:
        data = pickle.load(f)
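
The filtering step described in Methods can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that "the Pfizer rule" refers to the Pfizer 3/75 rule (compounds with logP above 3 and TPSA below 75 are flagged as higher toxicity risk) and that QED, logP, and TPSA have already been computed for each molecule, for example with a cheminformatics toolkit. The QED cutoff of 0.5, the `Descriptors` record, the function names, and the example values are all hypothetical placeholders; the abstract does not state the thresholds actually used.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical cutoff: the abstract does not give the QED threshold used.
QED_MIN = 0.5


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (illustrative record type)."""
    name: str    # placeholder identifier, not a real compound
    qed: float   # quantitative estimate of drug-likeness, in [0, 1]
    logp: float  # calculated octanol-water partition coefficient
    tpsa: float  # topological polar surface area


def passes_pfizer_rule(d: Descriptors) -> bool:
    # Assumed 3/75 rule: reject only when logP > 3 AND TPSA < 75.
    return not (d.logp > 3.0 and d.tpsa < 75.0)


def is_druglike(d: Descriptors, qed_min: float = QED_MIN) -> bool:
    # A molecule is kept when it clears both the QED cutoff and the rule.
    return d.qed >= qed_min and passes_pfizer_rule(d)


def filter_druglike(records: Iterable[Descriptors],
                    qed_min: float = QED_MIN) -> List[Descriptors]:
    return [d for d in records if is_druglike(d, qed_min)]


# Made-up descriptor values, chosen only to exercise each branch.
examples = [
    Descriptors("mol_a", qed=0.8, logp=2.0, tpsa=90.0),   # passes both
    Descriptors("mol_b", qed=0.3, logp=2.0, tpsa=90.0),   # fails QED cutoff
    Descriptors("mol_c", qed=0.7, logp=4.5, tpsa=40.0),   # fails 3/75 rule
    Descriptors("mol_d", qed=0.6, logp=4.0, tpsa=100.0),  # high logP, high TPSA
]

kept = filter_druglike(examples)
print([d.name for d in kept])
```

Keeping the two criteria in separate functions makes it straightforward to change the QED cutoff or swap in a different rule without touching the rest of the filter.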

Related Organizations

Hanyang University
Korea (Republic of)

The Druglike molecule pretraining strategy for drug discovery
2023
