
In this paper, we describe the implementation and evaluation of a cluster‐based enrichment strategy for calling hits from a high‐throughput screen, using a typical cell‐based assay of 160,000 chemical compounds. Our focus is on the statistical properties of the prospective design choices made throughout the analysis: how to choose the number of clusters for optimal power, the choice of test statistic, the significance thresholds for clusters and the activity threshold for candidate hits, how to rank selected hits for carry‐forward to the confirmation screen, and how to identify confirmed hits in a data‐driven manner. Whereas the literature has previously focused on the choice of test statistic or chemical descriptors, our studies suggest that cluster size is the more important design choice. We recommend ranking clusters by enrichment odds ratio rather than by p‐value. Our conceptually simple test statistic identifies the same set of hits as the more complex scoring methods proposed in the literature. We prospectively confirm that such a cluster‐based approach can outperform the naive top X approach, and we estimate that we improved confirmation rates by about 31.5%, from 813 using the top X approach to 1187 using our cluster‐based method. Copyright © 2012 John Wiley & Sons, Ltd.
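The recommendation to rank clusters by enrichment odds ratio rather than by p‐value can be illustrated with a minimal sketch. The abstract does not specify the paper's test statistic, so everything below is an assumption made for illustration: the function name `cluster_enrichment`, the one‐sided hypergeometric (Fisher‐type) tail used as the p‐value, the 0.5 continuity correction on the odds ratio, and the toy data.

```python
from collections import Counter
from math import comb


def cluster_enrichment(cluster_ids, is_active):
    """Score each cluster for enrichment of actives (illustrative only).

    cluster_ids: one cluster label per compound.
    is_active:   one bool per compound (primary-screen activity call).
    Returns a list of dicts sorted by odds ratio, largest first.
    """
    n_total = len(cluster_ids)
    n_active = sum(is_active)
    sizes = Counter(cluster_ids)
    actives = Counter(c for c, act in zip(cluster_ids, is_active) if act)

    rows = []
    for label, size in sizes.items():
        a = actives.get(label, 0)            # active, inside cluster
        b = size - a                         # inactive, inside cluster
        c = n_active - a                     # active, outside cluster
        d = (n_total - size) - c             # inactive, outside cluster

        # Assumed 0.5 continuity correction so empty cells do not
        # produce a zero or infinite odds ratio.
        odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

        # One-sided hypergeometric tail: P(at least `a` actives in a
        # random draw of `size` compounds from the whole library).
        tail = sum(
            comb(n_active, k) * comb(n_total - n_active, size - k)
            for k in range(a, min(size, n_active) + 1)
        )
        p_value = tail / comb(n_total, size)

        rows.append({"cluster": label, "size": size, "actives": a,
                     "odds_ratio": odds_ratio, "p_value": p_value})

    # Rank by odds ratio, not by p-value.
    rows.sort(key=lambda r: r["odds_ratio"], reverse=True)
    return rows


# Toy library of 20 compounds in three hypothetical clusters.
cluster_ids = ["A"] * 4 + ["B"] * 6 + ["C"] * 10
is_active = ([True, True, True, False]      # A: 3 of 4 active
             + [True] + [False] * 5         # B: 1 of 6 active
             + [False] * 10)                # C: 0 of 10 active

ranking = cluster_enrichment(cluster_ids, is_active)
for row in ranking:
    print(row["cluster"], row["size"], row["actives"],
          round(row["odds_ratio"], 2), round(row["p_value"], 4))
```

In a screen with very large clusters, small enrichments can reach tiny p‐values, which is one reason an effect‐size ranking such as the odds ratio may order clusters differently from a p‐value ranking.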
top X, Data Interpretation, Statistics & Probability, Statistics, Antineoplastic Agents, Statistical, hit selection, high-throughput screening, High-Throughput Screening Assays, HTS hit selection, Murcko fragments, Research Design, Data Interpretation, Statistical, Drug Discovery, Public Health and Health Services, Odds Ratio, Cluster Analysis, Humans, False Positive Reactions, Prospective Studies, fingerprint descriptors, cluster analysis
