
arXiv: 1206.6927
Biclustering, the process of simultaneously clustering the rows and columns of a data matrix, is a popular and effective tool for finding structure in a high-dimensional dataset. Many biclustering procedures appear to work well in practice, but most do not have associated consistency guarantees. To address this shortcoming, we propose a new biclustering procedure based on profile likelihood. The procedure applies to a broad range of data modalities, including binary, count, and continuous observations. We prove that the procedure recovers the true row and column classes when the dimensions of the data matrix tend to infinity, even if the functional form of the data distribution is misspecified. The procedure requires a combinatorial search, which can be expensive in practice. Rather than performing this search directly, we propose a new heuristic optimization procedure based on the Kernighan-Lin heuristic, which has nice computational properties and performs well in simulations. We demonstrate our procedure with applications to congressional voting records and microarray analysis.
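To make the idea of searching over row and column labels concrete, the following is a minimal sketch and not the authors' implementation. It assumes Gaussian observations with a common variance, in which case maximizing the profile likelihood over block means is equivalent to maximizing the sum over blocks of (block size) × (block mean)². It also swaps in a plain greedy single-label reassignment for the Kernighan-Lin-style heuristic named in the abstract. The names `block_criterion` and `greedy_bicluster` are hypothetical.

```python
import numpy as np

def block_criterion(X, rows, cols, K, L):
    """Assumed Gaussian criterion: sum over blocks of n_kl * mean_kl**2 (higher is better)."""
    R = np.eye(K)[rows]                      # m x K one-hot row memberships
    C = np.eye(L)[cols]                      # n x L one-hot column memberships
    sums = R.T @ X @ C                       # K x L block sums
    counts = np.outer(R.sum(0), C.sum(0))    # K x L block sizes
    # Empty blocks contribute zero to the criterion.
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return float((counts * means ** 2).sum())

def greedy_bicluster(X, K, L, n_sweeps=20, seed=0):
    """Greedy local search: move one row or column label at a time while the criterion improves."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    rows = rng.integers(K, size=m)           # random initial row labels
    cols = rng.integers(L, size=n)           # random initial column labels
    best = block_criterion(X, rows, cols, K, L)
    for _ in range(n_sweeps):
        improved = False
        for labels, size, n_classes in ((rows, m, K), (cols, n, L)):
            for i in range(size):
                keep = labels[i]             # best label found so far for item i
                for new in range(n_classes):
                    if new == keep:
                        continue
                    labels[i] = new          # trial move (mutates rows or cols in place)
                    score = block_criterion(X, rows, cols, K, L)
                    if score > best + 1e-12:
                        best, keep, improved = score, new, True
                labels[i] = keep             # commit the best label, or revert
        if not improved:
            break                            # local optimum reached
    return rows, cols, best
```

Like any local search, this sketch can stop at a local optimum, so in practice one would likely run it from several random starts (different `seed` values) and keep the labeling with the highest criterion.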
40 pages, 11 figures; R package in development at https://github.com/patperry/biclustpl
FOS: Computer and information sciences, biological data analysis, Mathematics - Statistics Theory, Machine Learning (stat.ML), Statistics Theory (math.ST), biclustering, Applications of statistics to biology and medical sciences; meta analysis, Methodology (stat.ME), Statistics - Machine Learning, Asymptotic properties of nonparametric inference, FOS: Mathematics, profile likelihood, 62G20, Statistics - Methodology, Applications of statistics to social sciences, congressional voting, Classification and discrimination; cluster analysis (statistical aspects), block model, Biclustering, microarray data, 62-07, direct clustering, co-clustering
| selected citations | 6 |
| popularity | Top 10% |
| influence | Average |
| impulse | Top 10% |
