# Machine learning prediction of novel anthelmintics

This repository contains scripts that were used to obtain the findings reported in the study "Prediction and prioritisation of novel anthelmintic candidates from public databases by using deep learning and available bioactivity data sets" by Taki et al.

## Table of Contents

1. [Small-molecule bioactivity data used for training and validation](#1)
2. [Feature generation, model architecture and training](#2)
3. [Classification model](#3)
4. [Prediction of activities](#4)
5. [Clustering of compounds with predicted nematocidal activity](#5)
6. [Post-processing](#6)

## 1. Small-molecule bioactivity data used for training and validation

The dataset of 15,162 small-molecule compounds used for training and validation has been published as [DOI:10.5281/zenodo.10929251](https://doi.org/10.5281/zenodo.10929251).

## 2. Feature generation, model architecture and training

The scripts used are compiled in the folder [02_model_training](02_model_training).

| Script | Description | Input files |
|--------|-------------|-------------|
| [dl_mlp_class_run_training.py](02_model_training/dl_mlp_class_run_training.py) | Wrapper script to run [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) with variable network architectures | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) | Main script to perform training or prediction (`mode`) of MLPs with specified architecture | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| [mordred_descriptors.txt](02_model_training/mordred_descriptors.txt) | Line-separated list of Mordred descriptors | n/a |

## 3. Classification model

The best classification model is located in the folder [03_model](/03_model).

| File | Description |
|------|-------------|
| [m1002a_label_dictionary.json](03_model/m1002a_label_dictionary.json) | Classification labels used by the model |
| [m1002a_model_architecture.json](03_model/m1002a_model_architecture.json) | MLP architecture |
| [m1002a_model_weights.h5](03_model/m1002a_model_weights.h5) | The weights of the trained model |

## 4. Prediction of activities

Compounds were classified by activity label using the wrapper script [dl_mlp_class_run_prediction.py](04_activity_prediction/dl_mlp_class_run_prediction.py) and the main script [dl_mlp_class_v1.py](04_activity_prediction/dl_mlp_class_v1.py), both located in the folder [04_activity_prediction](04_activity_prediction).

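To illustrate how a classifier's raw output is turned into one of the activity labels, here is a minimal sketch. It is not taken from the repository: the layout of the label dictionary (class index as a string key), the helper names `softmax` and `decode_prediction`, and the example scores are all assumptions made for illustration. Only the label names `active`, `weakly active` and `none` come from this README.

```python
import json
import math

# Hypothetical label dictionary. The real m1002a_label_dictionary.json
# may use a different layout or a different index order.
LABEL_DICTIONARY_JSON = '{"0": "none", "1": "weakly active", "2": "active"}'


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps math.exp from overflowing.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(scores, label_dictionary):
    """Return the most probable label and its probability."""
    probabilities = softmax(scores)
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return label_dictionary[str(best_index)], probabilities[best_index]


label_dictionary = json.loads(LABEL_DICTIONARY_JSON)

# Made-up scores for a single compound, one value per class.
label, confidence = decode_prediction([0.2, 0.1, 2.3], label_dictionary)
print(label, round(confidence, 3))
```

If the model already ends in a softmax layer, the `softmax` call would be redundant and only the arg-max lookup would be needed.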
A dataset of 14.2 million compounds was downloaded from the [ZINC15 database](https://zinc15.docking.org) and used as a search library. The downloaded ZINC15 data is not included in this repository.

Files in folder [04_activity_prediction](04_activity_prediction):

| File | Description | Input files | Output files |
|------|-------------|-------------|--------------|
| [dl_mlp_class_run_prediction.py](04_activity_prediction/dl_mlp_class_run_prediction.py) | Wrapper script to run [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) in `prediction` mode | CSV-formatted file of library compounds with SMILES representation or h5-formatted file of encoded compounds; dictionary, architecture and weights of the model | n/a |
| [prepare_zinc_v3.py](04_activity_prediction/prepare_zinc_v3.py) | Script to encode compounds of the ZINC15 search library | CSV-formatted file of library compounds with SMILES representation | h5-formatted file of encoded compounds |
| [zinc_15_m1002a_active.csv.gz](04_activity_prediction/zinc_15_m1002a_active.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `active` | n/a | n/a |
| [zinc_15_m1002a_weak.csv.gz](04_activity_prediction/zinc_15_m1002a_weak.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `weakly active` | n/a | n/a |
| [zinc_15_m1002a_inactive.csv.gz](04_activity_prediction/zinc_15_m1002a_inactive.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `none` | n/a | n/a |

## 5. Clustering of compounds with predicted nematocidal activity

The scripts used in this step are located in the folder [05_clustering](05_clustering).

| File | Description | Input file | Output file |
|------|-------------|------------|-------------|
| [preprocess.py](05_clustering/preprocessing/preprocess.py) | Script that computes feature vectors of compounds following the approach of [Hadipour, H., Liu, C., Davis, R. et al.](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04667-1). It calls `mol2global` from [global_feature_generation.py](05_clustering/preprocessing/global_feature_generation.py), `mol2local` from [local_feature_generation.py](05_clustering/preprocessing/local_feature_generation.py) and [`combine_and_drop_features`](05_clustering/preprocessing/combine_and_drop_features.py) | [Feather](https://arrow.apache.org/docs/python/feather.html)-formatted file with SMILES representation of all compounds that need to be preprocessed | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contains the combined, final features for each compound |
| [vae/train.py](05_clustering/vae/train.py) | Script that trains a Variational Autoencoder (VAE), which is defined in [vae.py](05_clustering/vae/vae.py) | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format containing the final features for each compound | A [PyTorch model checkpoint](https://pytorch.org/tutorials/beginner/saving_loading_models.html) containing the hyperparameter configuration and weights |
| [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py) | Script that generates embeddings from preprocessed features using the trained VAE. The embeddings are then used for clustering. | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format containing the final features for each compound. The weights of the trained model are also needed, loaded from a given [checkpoint](05_clustering/vae/checkpoint/model.pt) | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contains the embeddings for all compounds |
| [vae/checkpoint/config.json](05_clustering/vae/checkpoint/config.json) | Hyperparameter configuration of the trained VAE model whose embeddings showed the best performance on the activity_prediction task | | |
| [vae/checkpoint/model.pt](05_clustering/vae/checkpoint/model.pt) | Checkpoint containing the weights of the trained VAE model whose embeddings showed the best performance on the activity_prediction task | | |
| [k_means.py](05_clustering/k_means.py) | Computes a label (1 to k) for each compound via k-means clustering and stores it along with evaluation metrics for different hyperparameters | File in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contains the embeddings for all compounds (output of [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py)) | Multiple files in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format. One file contains the computed label for each compound (1 to k); the other files contain the evaluation scores (silhouette, Davies-Bouldin, Calinski-Harabasz) for different hyperparameter configurations |
| [tsne.py](05_clustering/tsne.py) | Visualizes the k-means-clustered coordinates after reducing their dimensionality to 2D | Files in [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html) format that contain the computed label for each compound (1 to k) (output of [k_means.py](05_clustering/k_means.py)) and the embedding for each compound (output of [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py)). The path to a [Feather](https://arrow.apache.org/docs/python/feather.html)-formatted file containing the SMILES strings of all compounds in the correct order is also needed. | A [Feather](https://arrow.apache.org/docs/python/feather.html) file containing the t-SNE coordinates of every compound and a graphical visualization (can be saved in any image format) |

## 6. Post-processing

The scripts used in this step are located in the folder [06_post_processing](06_post_processing).

| File | Description |
|------|-------------|
| [molport_search.py](06_post_processing/molport_search.py) | Searches for availability of compounds for purchase on [Molport](https://www.molport.com/shop/index) |
| [patent_scraper.py](06_post_processing/patent_scraper.py) | Scrapes [Google Patents](https://patents.google.com) for a list of compounds and searches for keywords in the patent titles/snippets |
| [lipinski_checker.py](06_post_processing/lipinski_checker.py) | Checks a list of compounds for adherence to the Lipinski Rule of 5 |
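
As an illustration of the Rule of 5 check mentioned for `lipinski_checker.py`, here is a minimal sketch that applies the four standard thresholds to properties that have already been computed. It is not the repository's implementation: the `CompoundProperties` class, the function names, the tolerance of one violation and the example values are all assumptions. In practice the properties would come from a cheminformatics toolkit applied to each SMILES string.

```python
from dataclasses import dataclass


@dataclass
class CompoundProperties:
    """Precomputed properties of one compound (hypothetical container)."""
    molecular_weight: float  # in daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(props):
    """Return the names of the Rule of 5 criteria that the compound violates."""
    violations = []
    if props.molecular_weight > 500:
        violations.append("molecular_weight")
    if props.log_p > 5:
        violations.append("log_p")
    if props.h_bond_donors > 5:
        violations.append("h_bond_donors")
    if props.h_bond_acceptors > 10:
        violations.append("h_bond_acceptors")
    return violations


def passes_rule_of_five(props, max_violations=1):
    """Accept a compound with at most `max_violations` violated criteria.

    Allowing one violation is a common convention and an assumption here;
    set max_violations=0 for a strict check.
    """
    return len(lipinski_violations(props)) <= max_violations


# Made-up property values, for illustration only.
small = CompoundProperties(320.4, 2.1, 2, 5)
large = CompoundProperties(812.9, 6.3, 7, 14)

print(passes_rule_of_five(small), lipinski_violations(small))
print(passes_rule_of_five(large), lipinski_violations(large))
```

Returning the list of violated criteria, not just a boolean, makes it easier to see why a candidate was filtered out.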

Related Organizations

University of Melbourne
Australia
Max Rubner Institut
Germany

Keywords

nematodes, antiparasitics, drug discovery, Machine learning prediction
