ZENODO
Other literature type . 2024
License: CC BY
Data sources: Datacite

Machine learning prediction of novel anthelmintics

Authors: Taki, Aya; Kapp, Louis; Hall, Ross; Gasser, Robin; Hofmann, Andreas;

Abstract

# Machine learning prediction of novel anthelmintics

This repository contains the scripts used to obtain the findings reported in the study "Prediction and prioritisation of novel anthelmintic candidates from public databases by using deep learning and available bioactivity data sets" by Taki et al.

## Table of Contents

1. [Small-molecule bioactivity data used for training and validation](#1)
2. [Feature generation, model architecture and training](#2)
3. [Classification model](#3)
4. [Prediction of activities](#4)
5. [Clustering of compounds with predicted nematocidal activity](#5)
6. [Post-processing](#6)

## 1. Small-molecule bioactivity data used for training and validation

The dataset of 15,162 small-molecule compounds used for training and validation has been published as [DOI:10.5281/zenodo.10929251](https://doi.org/10.5281/zenodo.10929251).

## 2. Feature generation, model architecture and training

The scripts used are compiled in the folder [02_model_training](02_model_training).

| Script | Description | Input files |
|--------|-------------|-------------|
| [dl_mlp_class_run_training.py](02_model_training/dl_mlp_class_run_training.py) | Wrapper script to run [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) with variable network architectures | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) | Main script to perform training or prediction (`mode`) of MLPs with specified architecture | CSV-formatted file with compounds in SMILES notation and annotated labels; file with line-separated list of Mordred descriptors |
| [mordred_descriptors.txt](02_model_training/mordred_descriptors.txt) | Line-separated list of Mordred descriptors | n/a |

## 3. Classification model

The best classification model is located in the folder [03_model](03_model).

| File | Description |
|------|-------------|
| [m1002a_label_dictionary.json](03_model/m1002a_label_dictionary.json) | Classification labels used by the model |
| [m1002a_model_architecture.json](03_model/m1002a_model_architecture.json) | MLP architecture |
| [m1002a_model_weights.h5](03_model/m1002a_model_weights.h5) | The weights of the trained model |

## 4. Prediction of activities

Compounds were classified with respect to their activity labels using the wrapper script [dl_mlp_class_run_prediction.py](04_activity_prediction/dl_mlp_class_run_prediction.py) and the main script [dl_mlp_class_v1.py](04_activity_prediction/dl_mlp_class_v1.py), both located in the folder [04_activity_prediction](04_activity_prediction).
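As a rough illustration of the final classification step, the sketch below turns per-class probabilities into the activity labels used in this repository (`active`, `weakly active`, `none`). It is not taken from `dl_mlp_class_v1.py`. The function names are hypothetical, and the layout of the label dictionary (label name mapped to class index) is an assumption; the actual structure of `m1002a_label_dictionary.json` may differ.

```python
import json


def invert_label_dictionary(label_dict):
    """Return a class-index -> label-name mapping.

    Assumes the input maps label names to integer class indices.
    """
    return {int(index): name for name, index in label_dict.items()}


def assign_labels(probabilities, label_dict):
    """Pick the most probable label for each row of class probabilities."""
    index_to_label = invert_label_dictionary(label_dict)
    labels = []
    for row in probabilities:
        # Index of the largest probability in this row (argmax).
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(index_to_label[best_index])
    return labels


# Hypothetical dictionary content; the real file may be laid out differently.
example_dict = json.loads('{"none": 0, "weakly active": 1, "active": 2}')

# Made-up model outputs, one row of class probabilities per compound.
example_probs = [
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.3, 0.5, 0.2],
]

predicted = assign_labels(example_probs, example_dict)
print(predicted)
```

Taking the argmax is the simplest decision rule; whether the repository scripts apply additional probability thresholds is not stated here.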
A dataset of 14.2 million compounds was downloaded from the [ZINC15 database](https://zinc15.docking.org) and used as a search library. The downloaded ZINC15 data is not included in this repository.

Files in folder [04_activity_prediction](04_activity_prediction):

| File | Description | Input files | Output files |
|------|-------------|-------------|--------------|
| [dl_mlp_class_run_prediction.py](04_activity_prediction/dl_mlp_class_run_prediction.py) | Wrapper script to run [dl_mlp_class_v1.py](02_model_training/dl_mlp_class_v1.py) in `prediction` mode | CSV-formatted file of library compounds with SMILES representation or h5-formatted file of encoded compounds; dictionary, architecture and weights of the model | n/a |
| [prepare_zinc_v3.py](04_activity_prediction/prepare_zinc_v3.py) | Script to encode compounds of the ZINC15 search library | CSV-formatted file of library compounds with SMILES representation | h5-formatted file of encoded compounds |
| [zinc_15_m1002a_active.csv.gz](04_activity_prediction/zinc_15_m1002a_active.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `active` | n/a | n/a |
| [zinc_15_m1002a_weak.csv.gz](04_activity_prediction/zinc_15_m1002a_weak.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `weakly active` | n/a | n/a |
| [zinc_15_m1002a_inactive.csv.gz](04_activity_prediction/zinc_15_m1002a_inactive.csv.gz) | Compounds from the tested ZINC15 search library with predicted label `none` | n/a | n/a |

## 5. Clustering of compounds with predicted nematocidal activity

The scripts used in this step are located in the folder [05_clustering](05_clustering).

| File | Description | Input file | Output file |
|------|-------------|------------|-------------|
| [preprocess.py](05_clustering/preprocessing/preprocess.py) | Script that computes feature vectors of compounds following the approach of [Hadipour, H., Liu, C., Davis, R. et al.](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04667-1) It calls `mol2global` from [global_feature_generation.py](05_clustering/preprocessing/global_feature_generation.py), `mol2local` from [local_feature_generation.py](05_clustering/preprocessing/local_feature_generation.py) and [`combine_and_drop_features`](05_clustering/preprocessing/combine_and_drop_features.py) | [Feather](https://arrow.apache.org/docs/python/feather.html)-formatted file with the SMILES representation of all compounds to be preprocessed | [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)-formatted file that contains the combined, final features for each compound |
| [vae/train.py](05_clustering/vae/train.py) | Script that trains a Variational Autoencoder (VAE), which is defined in [vae.py](05_clustering/vae/vae.py) | [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)-formatted file containing the final features for each compound | A [PyTorch model checkpoint](https://pytorch.org/tutorials/beginner/saving_loading_models.html) containing the hyperparameter configuration and weights |
| [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py) | Script that generates embeddings from preprocessed features using the trained VAE. The embeddings are subsequently used for clustering. | [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)-formatted file containing the final features for each compound, plus the weights of the trained model from a given [checkpoint](05_clustering/vae/checkpoint/model.pt) | [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)-formatted file that contains the embeddings for all compounds |
| [vae/checkpoint/config.json](05_clustering/vae/checkpoint/config.json) | Hyperparameter configuration of the trained VAE model whose embeddings showed the best performance on the activity prediction task | n/a | n/a |
| [vae/checkpoint/model.pt](05_clustering/vae/checkpoint/model.pt) | Checkpoint containing the weights of the trained VAE model whose embeddings showed the best performance on the activity prediction task | n/a | n/a |
| [k_means.py](05_clustering/k_means.py) | Computes a cluster label (1 to k) for each compound via k-means clustering and stores it along with evaluation metrics for different hyperparameters | [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)-formatted file that contains the embeddings for all compounds (output of [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py)) | Multiple [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)-formatted files. One file contains the computed label (1 to k) for each compound; the other files contain the evaluation scores (silhouette, Davies-Bouldin, Calinski-Harabasz) for different hyperparameter configurations |
| [tsne.py](05_clustering/tsne.py) | Visualizes the k-means-clustered embeddings after reducing their dimensionality to 2D | [NPY](https://numpy.org/devdocs/reference/generated/numpy.lib.format.html)-formatted files containing the computed label (1 to k) for each compound (output of [k_means.py](05_clustering/k_means.py)) and the embedding for each compound (output of [vae/compute_embeddings.py](05_clustering/vae/compute_embeddings.py)). The path to a [Feather](https://arrow.apache.org/docs/python/feather.html)-formatted file containing the SMILES strings of all compounds in the correct order is also needed. | A [Feather](https://arrow.apache.org/docs/python/feather.html) file containing the t-SNE coordinates of every compound, and a graphical visualization (can be saved in any image format) |

## 6. Post-processing

The scripts used in this step are located in the folder [06_post_processing](06_post_processing).

| File | Description |
|------|-------------|
| [molport_search.py](06_post_processing/molport_search.py) | Searches for availability of compounds for purchase on [Molport](https://www.molport.com/shop/index) |
| [patent_scraper.py](06_post_processing/patent_scraper.py) | Scrapes [Google Patents](https://patents.google.com) for a list of compounds and searches for keywords in the patent titles/snippets |
| [lipinski_checker.py](06_post_processing/lipinski_checker.py) | Checks a list of compounds for adherence to Lipinski's Rule of 5 |
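
For readers unfamiliar with Lipinski's Rule of 5, the criteria can be sketched as follows. This is a minimal illustration, not the code of `lipinski_checker.py`. It assumes the four descriptors have already been computed by a cheminformatics toolkit, and the `Descriptors` class, the function names and the default of tolerating one violation (a common convention) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties of one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def count_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of 5 criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Return True if the compound violates at most `max_violations` criteria."""
    return count_violations(d) <= max_violations


# Made-up descriptor values for illustration only, not real compounds.
small = Descriptors(mol_weight=320.4, log_p=2.1, h_bond_donors=2, h_bond_acceptors=5)
bulky = Descriptors(mol_weight=712.9, log_p=6.3, h_bond_donors=6, h_bond_acceptors=12)

print(count_violations(small), passes_rule_of_five(small))
print(count_violations(bulky), passes_rule_of_five(bulky))
```

Setting `max_violations=0` gives the strict reading in which every criterion must hold; which variant the repository script applies is not stated in this README.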

Keywords

nematodes, antiparasitics, drug discovery, Machine learning prediction
