k‐Nearest neighbors optimization‐based outlier removal

Publication: Article, 15 Dec 2014, English. Publisher: Wiley. Journal: Journal of Computational Chemistry, volume 36, pages 493-506 (ISSN: 0192-8651, eISSN: 1096-987X)

Authors: Abraham Yosipof; Hanoch Senderowitz;

doi: 10.1002/jcc.23803

pmid: 25503870


Abstract

Datasets of molecular compounds often contain outliers, that is, compounds that differ from the rest of the dataset. Outliers, while often interesting, may affect data interpretation, model generation, and decision making, and should therefore be removed from the dataset before modeling. Here, we describe a new method for the iterative identification and removal of outliers based on a k‐nearest neighbors optimization algorithm. We demonstrate for three different datasets that removing outliers with the new algorithm yields filtered datasets that are better than those provided by four alternative outlier removal procedures, as well as by random compound removal, in two important aspects: (1) they better maintain the diversity of the parent datasets; (2) they give rise to quantitative structure‐activity relationship (QSAR) models with much better prediction statistics. The new algorithm is, therefore, suitable for the pretreatment of datasets prior to QSAR modeling. © 2014 Wiley Periodicals, Inc.
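
The abstract does not spell out the algorithm, so the following is only a generic sketch of the broader idea of iterative, neighbor-based outlier removal, not the authors' method. It assumes compounds are rows of a numeric descriptor matrix, scores each compound by its mean Euclidean distance to its k nearest neighbors, and removes the highest-scoring compound one at a time, rescoring after each removal. The function names `knn_scores` and `iterative_knn_filter`, the scoring rule, and the parameters `k` and `n_remove` are all illustrative assumptions.

```python
import numpy as np


def knn_scores(X, k):
    """Mean Euclidean distance from each row of X to its k nearest other rows."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]  # k smallest distances per row
    return nearest.mean(axis=1)


def iterative_knn_filter(X, k=3, n_remove=1):
    """Remove up to n_remove rows, one per iteration, rescoring each time.

    Hypothetical helper: returns (kept_indices, removed_indices), both
    expressed as indices into the original X.
    """
    X = np.asarray(X, dtype=float)
    keep = np.arange(len(X))
    removed = []
    for _ in range(n_remove):
        if len(keep) <= k + 1:               # too few points left to score
            break
        scores = knn_scores(X[keep], k)
        worst = int(np.argmax(scores))       # most isolated remaining row
        removed.append(int(keep[worst]))
        keep = np.delete(keep, worst)
    return keep, removed


# Toy descriptor matrix: five clustered points and one distant point.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [10, 10]]
keep, removed = iterative_knn_filter(points, k=2, n_remove=1)
```

Rescoring after every removal is what makes the sketch iterative: once one isolated compound is gone, the neighbor distances of the remaining compounds can change, so a single ranking computed up front could flag different compounds.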

Related Organizations

Bar-Ilan University
Israel

Impact by BIP!

	selected citations: These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	27
	popularity: This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Top 10%
	influence: This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Top 10%
	impulse: This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Top 10%