Performance reproducibility index for classification

Publication: Article, 06 Sep 2012, English

Publisher: Oxford University Press (OUP)

Journal: Bioinformatics, volume 28, pages 2824-2833 (ISSN: 1367-4803, eISSN: 1367-4811)

Authors: Mohammadmahdi R. Yousefi; Edward R. Dougherty

doi: 10.1093/bioinformatics/bts509

pmid: 22954625

pmc: PMC3476329

Abstract

Motivation: A common practice in biomarker discovery is to decide whether a large laboratory experiment should be carried out based on the results of a preliminary study on a small set of specimens. Consideration of the efficacy of this approach motivates the introduction of a probabilistic measure, for whether a classifier showing promising results in a small-sample preliminary study will perform similarly on a large independent sample. Given the error estimate from the preliminary study, if the probability of reproducible error is low, then there is really no purpose in substantially allocating more resources to a large follow-on study. Indeed, if the probability of the preliminary study providing likely reproducible results is small, then why even perform the preliminary study?

Results: This article introduces a reproducibility index for classification, measuring the probability that a sufficiently small error estimate on a small sample will motivate a large follow-on study. We provide a simulation study based on synthetic distribution models that possess known intrinsic classification difficulties and emulate real-world scenarios. We also set up similar simulations on four real datasets to show the consistency of results. The reproducibility indices for different distributional models, real datasets and classification schemes are empirically calculated. The effects of reporting and multiple-rule biases on the reproducibility index are also analyzed.

Availability: We have implemented in C code the synthetic data distribution model, classification rules, feature selection routine and error estimation methods. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi12a/. Supplementary simulation results are also included.

Contact: edward@ece.tamu.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.
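To make the idea of an empirically calculated reproducibility index concrete, the following is a minimal Monte Carlo sketch. It is not the authors' C implementation, and the abstract does not state the formal definition. The sketch assumes one plausible formalization: the probability that the true error lies within a tolerance rho of the small-sample estimate, given that the estimate is at most a threshold tau. The one-dimensional two-class Gaussian model, the nearest-mean classifier, the leave-one-out estimator and every function and parameter name below are illustrative assumptions.

```python
import math
import random


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def train(x0, x1):
    """Nearest-mean rule in 1-D: threshold at the midpoint of the sample means.

    sign = +1 means "predict class 1 when x > threshold"; -1 is the reverse.
    """
    m0 = sum(x0) / len(x0)
    m1 = sum(x1) / len(x1)
    return 0.5 * (m0 + m1), (1 if m1 >= m0 else -1)


def predict(x, threshold, sign):
    return 1 if sign * (x - threshold) > 0 else 0


def true_error(threshold, sign, delta):
    """Exact error for classes N(0,1) and N(delta,1) with equal priors (assumed model)."""
    err0 = 1.0 - normal_cdf(threshold)    # class 0 sample falls above the threshold
    err1 = normal_cdf(threshold - delta)  # class 1 sample falls below the threshold
    err = 0.5 * (err0 + err1)
    return err if sign > 0 else 1.0 - err


def loo_error(x0, x1):
    """Leave-one-out error estimate on the small training sample."""
    mistakes = 0
    for i in range(len(x0)):
        t, s = train(x0[:i] + x0[i + 1:], x1)
        mistakes += predict(x0[i], t, s) != 0
    for i in range(len(x1)):
        t, s = train(x0, x1[:i] + x1[i + 1:])
        mistakes += predict(x1[i], t, s) != 1
    return mistakes / (len(x0) + len(x1))


def reproducibility_index(n_per_class, delta, rho, tau, trials=2000, seed=0):
    """Estimate P(|true error - estimate| <= rho | estimate <= tau).

    Requires n_per_class >= 2. Returns None when no simulated preliminary
    study produces an estimate at or below tau.
    """
    rng = random.Random(seed)
    promising = 0
    reproducible = 0
    for _ in range(trials):
        x0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        x1 = [rng.gauss(delta, 1.0) for _ in range(n_per_class)]
        est = loo_error(x0, x1)
        if est > tau:
            continue  # this preliminary study would not motivate a follow-on study
        promising += 1
        t, s = train(x0, x1)
        if abs(true_error(t, s, delta) - est) <= rho:
            reproducible += 1
    return reproducible / promising if promising else None


if __name__ == "__main__":
    print(reproducibility_index(n_per_class=10, delta=1.5, rho=0.05, tau=0.2))
```

In this toy setting the true error is available in closed form, so no large independent sample has to be simulated. With real data, a large held-out set would play that role instead.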

Related Organizations

The University of Texas System
United States
Texas A&M University
United States
Translational Genomics Research Institute
United States

Keywords

Genetic Markers, Models, Statistical, Gene Expression Profiling, Reproducibility of Results, Pattern Recognition, Automated, Bias, Sample Size, Humans, Regression Analysis, Precision Medicine, Algorithms, Biomarkers, Software, Follow-Up Studies, Probability

Impact by BIP!

selected citations: 8. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: gold

Fields of Science (4)

engineering and technology

medical engineering
