Chemical structures, Cell Painting and transcriptional profiles for compound bioactivity prediction.

Research data: Dataset. 16 Mar 2023. English. Publisher: Zenodo. Funded by: NIH | Extracting rich informati..., NIH | Extracting rich informati..., NSF | Collaborative Research: I...

Authors: Moshkov, Nikita; Becker, Tim; Yang, Kevin; Horvath, Peter; Dancik, Vlado; Wagner, Bridget K.; Clemons, Paul A.; +3 Authors

doi: 10.5281/zenodo.7729583 , 10.5281/zenodo.7729582

Abstract

This is the related data, both input and produced, for the paper "Predicting compound activity from phenotypic profiles and chemical structures". This data can be merged with the paper's GitHub repository for reproduction. Folders and files are described below:

├── assay_data
    ├── assay_matrix_discrete_270_assays.csv: Assay matrix with hits for assays (270) and compounds (16170). Note that this is the final file that we used to produce splits.
    ├── assay_metadata.csv: Assay metadata
    ├── broad_ids.txt: List of Broad IDs used in this study. This is an unfiltered list of compounds required by some analysis scripts.
    ├── smiles.txt: Same as broad_ids.txt, but SMILES strings.
├── feature_data (for 16978 compounds, can be masked with ./misc/compounds16978to16170.npy)
    ├── cp.npz: Classical chemical features
    ├── ge.npz: Gene expression features
    ├── ge_scale.npz: Gene expression scaled features
    ├── mo.npz: Morphology features (not batch corrected)
    ├── mobc.npz: Morphology features (batch corrected)
├── misc
    ├── compound_analysis.npz: Compounds in the dataset identified as PAINS
    ├── compounds16978to16170.npy: Used to filter features from the bigger set of compounds to the final one
    ├── fingerprints.npz: Calculated fingerprints of compounds, which were then used to calculate similarity
    ├── similarity_fingerprints.npz: Similarity matrix for compounds (16978)
    ├── population_normalized.csv.gz: Well-level morphological profiles that were used for batch correction
    ├── Table for PUMA: Excel file with additional data and plots
├── predictions
    ├── scaffold_median(mean)_AUC.csv: Aggregated median (mean) AUC scores over scaffold-based cross-validation splits. In the paper, median results were reported.
    ├── scaffold_median(mean)_EF.csv: Aggregated median (mean) enrichment factor (EF) over scaffold-based cross-validation splits. In the paper, median results were reported.
    ├── toprank_chemical_cv{}_hitsnorm.csv: These files are needed to create enrichment plots and contain hit rate and top rank hit rate.
    ├── Each folder here stands for an experiment type; the number in the folder name is the number of the split. Inside each folder there are the following elements:
        ├── predictions: Folder with predictions for each assay-compound pair for each modality
        ├── 2022_01_evaluation_all_data.csv: File with AUC scores for each assay for the test set in the split
        ├── 2022_01_evaluation_all_data_EF.csv: File with enrichment factor (EF) values for each assay for the test set in the split. These files exist only for *chemical* folders.
        ├── assay_matrix_discrete_train(test)_old_scaff.csv: Training and test subsets of data for the split. The first column contains broad_id.
        ├── assay_matrix_discrete_train(test)_old_scaff.csv: Same, but SMILES strings in the first column. These files are used as input to ChemProp!

    Experiments in this folder are the following:
    - chemical: Scaffold-based 5-fold cross-validation splits; the main results in the paper are reported with this series of experiments.
    - chemical_bal: Same splits as in chemical, but training was run with ChemProp built-in data balancing.
    - chemical_st: Same splits as in chemical, but separate models were trained for each assay.
    - CV: Random 5-fold cross-validation splits.
    - GE: 5-fold cross-validation splits based on same-size clustering of gene expression features.
    - MOBC: 5-fold cross-validation splits based on same-size clustering of batch-corrected morphology features.
    - random: 10 random splits, with ~80% of compounds in the training set and the rest in the test set.
├── splitting: This folder contains NumPy files that help to match compounds and features to create training and test sets for a split, and that can be reused in the analysis notebook for data preparation.
    ├── scaffold_based_split.npz: Splitting for scaffold-based splits.
    ├── random_split_{}.npz: Random split indices of test set compounds (10 files).
    ├── cross_validation_indicies.npz: Indices for random cross-validation splits
    ├── GE_clusters_size_constrained.npz: Indices of clusters of same-size clustering for gene expression features.
    ├── MOBC_clusters_size_constrained.npz: Indices of clusters of same-size clustering for batch-corrected morphology features.
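
The AUC and enrichment factor (EF) files described above hold per-assay scores for each split, which are then aggregated by median (or mean) over the cross-validation splits. The following is a minimal sketch of that kind of aggregation for a single assay, not the authors' code. It rests on assumptions that the record does not confirm: EF is taken here as the hit rate among the top-ranked fraction of compounds divided by the overall hit rate, the `top_fraction` value is arbitrary, and the labels and scores are made-up toy values rather than data from the files. The function names are hypothetical.

```python
from statistics import median


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (hit, non-hit) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one hit and one non-hit")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def enrichment_factor(labels, scores, top_fraction):
    """Assumed EF definition: hit rate in the top-ranked fraction / overall hit rate."""
    n = len(labels)
    k = max(1, round(n * top_fraction))  # number of top-ranked compounds to inspect
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(y for _, y in ranked[:k])
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("EF is undefined when the test set has no hits")
    return (top_hits / k) / (total_hits / n)


# Toy test sets for three splits of one assay: (binary hit labels, predicted scores).
splits = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    ([0, 1, 0, 1], [0.9, 0.8, 0.7, 0.1]),
]

auc_per_split = [roc_auc(y, s) for y, s in splits]
ef_per_split = [enrichment_factor(y, s, top_fraction=0.25) for y, s in splits]

# Aggregate over splits with the median, as the record says the paper reports.
median_auc = median(auc_per_split)
median_ef = median(ef_per_split)
print(median_auc, median_ef)
```

The median is less sensitive than the mean to a single split with an unusually easy or hard test set, which may be one reason to prefer it when the number of splits is small.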

This study was supported by a grant from the National Institutes of Health (R35 GM122547 to AEC), by the Broad Institute Schmidt Fellowship program (JCC) and by the National Science Foundation (NSF-DBI award 2134695 to JCC). NM and PH acknowledge support from the LENDULET BIOMAG Grant (2018–342), from TKP2021-EGA09, SYMMETRY-ERAPerMed, from CZI Deep Visual Proteomics, H2020-Fair-CHARM, from the ELKH-Excellence grant, and from OTKA-SNN 139455/ARRS N2-0136.

References

- Wawer, M. J. et al. Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling. Proceedings of the National Academy of Sciences 111, 10911–10916 (2014).
- Bray, M.-A. et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. Gigascience 6, 1–5 (2017).

Related Organizations

Broad Institute
United States
University of California, Berkeley
United States
MTA Biological Research Centre
Hungary

Keywords

Drug discovery, Cheminformatics, L1000, CellProfiler, Image-based profiling, Cell Painting, Gene expression

Impact by BIP!

- selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.