Statistical and machine learning methods for identifying clusters of variables: with applications in omics, ecology and psychology.

Marion, Rebecca

Dépôt Institutionel de l’Université catholique de Louvain et de l’Université Saint-Louis

Doctoral thesis . 2021

Data sources: Dépôt Institutionel de l’Université catholique de Louvain et de l’Université Saint-Louis

DBLP

Doctoral thesis . 2022

Data sources: DBLP

Publication . Doctoral thesis . 01 Jan 2021 . Belgium

Authors: Marion, Rebecca

handle: 2078.1/252866

Abstract

In many fields, researchers are confronted by datasets whose variables demonstrate grouping patterns. For example, in transcriptomics data, where the variables are gene expression levels, certain groups of genes are involved in the same biological processes, so their expression levels are highly correlated. For complex diseases, such as cancer or heart disease, entire groups of genes are expected to contribute to the development or progression of disease. Thus, identifying these variable groups, or "clusters," can be instrumental in uncovering the mechanisms of disease and developing targeted treatments. However, in practice, these variable clusters are not known in advance and must be learned from the data.

Clustering is a data analysis technique used to assign a set of objects to groups, or clusters, where similar objects are assigned to the same cluster and dissimilar objects to different clusters. While most work in the literature has focused on the problem of clustering observations (e.g. patients) given a set of variables (e.g. genes), this thesis proposes several statistical and machine learning methods for the problem of variable clustering. The objective of the thesis is to propose methods that can improve data analysis in contexts where the ultimate objective is to predict one or more targets (e.g. disease class) and identify clusters of predictor variables (e.g. genes, metabolites) that are most predictive of the target(s).

We explore three problems related to this theme, drawing on applications from the fields of metabolomics, genomics, ecology and psychology. First, we propose AdaCLV, a variable clustering method for pre-processing high-dimensional metabolomics data such that important clusters of variables can be identified with greater precision. Second, we investigate the added value of integrating the target variable (e.g. disease class) into the variable clustering process. We introduce Weighted SOS-NMF, a method that improves variable clustering and variable selection performance by supervising the clustering of variables with the target before a predictive model is fitted. Finally, we examine the case of supervised variable clustering for data with multiple, orthogonal targets. Inspired by a common research problem in ecology and psychology, we propose BIOT, a method for transforming the dimensions of the target matrix so that they can be accurately predicted by small clusters of predictor variables.

(SC - Sciences) -- UCL, 2021
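To make the distinction between clustering observations and clustering variables concrete, the following is a minimal sketch of a generic, unsupervised variable clustering step. It is not AdaCLV, Weighted SOS-NMF or BIOT: it uses ordinary average-linkage hierarchical clustering on a correlation-based distance, which is swapped in here purely for illustration. The function name `cluster_variables`, the simulated data and all parameter values are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_variables(X, n_clusters):
    """Group the columns of X (the variables) by absolute correlation.

    Hypothetical helper for illustration only; not a method from the thesis.
    """
    corr = np.corrcoef(X, rowvar=False)         # columns are the variables
    dist = 1.0 - np.abs(corr)                   # strongly correlated -> close
    dist = (dist + dist.T) / 2.0                # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)                 # remove rounding error on the diagonal
    condensed = squareform(dist, checks=False)  # condensed form expected by linkage
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Simulated data (an assumption): two hidden "processes" each drive three variables.
rng = np.random.default_rng(0)
n_obs = 200
latent = rng.normal(size=(n_obs, 2))
noise = 0.3 * rng.normal(size=(n_obs, 6))
# Columns 0-2 follow the first process, columns 3-5 the second.
X = np.repeat(latent, 3, axis=1) + noise

labels = cluster_variables(X, n_clusters=2)
print(labels)  # one cluster label per variable (column), not per observation
```

Note that the distance is computed between columns rather than rows, so the output assigns a cluster to each variable. The target variable plays no role here, whereas the supervised methods described in the abstract integrate it into the clustering.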

Country

Belgium

Related Organizations

Université Catholique de Louvain
Belgium

Keywords

Nonnegative matrix factorization, Variable selection, Matrix factorization, Prediction, Sparsity, Regression, Clustering, Variable clustering

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open Access route: Green

Related to Research communities

Cancer Research