
arXiv: 1704.00497
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a widely used clustering method. Here we show that UPGMA is a greedy heuristic for the normalized equidistant minimum evolution (NEME) problem, that is, finding a rooted tree that minimizes the minimum evolution score relative to the dissimilarity matrix among all rooted trees with the same leaf-set in which all leaves have the same distance to the root. We prove that the NEME problem is NP-hard. In addition, we present some heuristic and approximation algorithms for solving the NEME problem, including a polynomial time algorithm that yields a binary, rooted tree whose NEME score is within O(log^2 n) of the optimum. We expect that these results to eventually provide further insights into the behavior of the UPGMA algorithm.
29 pages, 8 figures
FOS: Computer and information sciences, Classification and discrimination; cluster analysis (statistical aspects), Discrete Mathematics (cs.DM), Populations and Evolution (q-bio.PE), Computational Complexity (cs.CC), balanced minimum evolution, 004, 510, minimum evolution, Computer Science - Computational Complexity, Problems related to evolution, FOS: Biological sciences, UPGMA, Analysis of algorithms, Quantitative Biology - Populations and Evolution, hierarchical clustering, Computer Science - Discrete Mathematics
FOS: Computer and information sciences, Classification and discrimination; cluster analysis (statistical aspects), Discrete Mathematics (cs.DM), Populations and Evolution (q-bio.PE), Computational Complexity (cs.CC), balanced minimum evolution, 004, 510, minimum evolution, Computer Science - Computational Complexity, Problems related to evolution, FOS: Biological sciences, UPGMA, Analysis of algorithms, Quantitative Biology - Populations and Evolution, hierarchical clustering, Computer Science - Discrete Mathematics
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 10 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Average |
