
Exploration of dimension reduction techniques for clustering spatial transcriptomics data

Keywords: Spatial Domain Identification, Spatial Transcriptomics, Dimension Reduction, Mclust, Model Based Clustering

Publication: Conference object | 20 Nov 2024 | English | Publisher: Zenodo

Authors: Jia, Ruize; Hung, Ling-Hong; Yeung, Ka Yee

doi: 10.5281/zenodo.14188936, 10.5281/zenodo.14188935


Abstract

Spatial transcriptomics (ST) provides a spatially resolved, high-dimensional assessment of gene transcription. Spatial domain identification (SDI) is a critical task in ST because it enables a deeper understanding of tissue microenvironments and biological functions. SDI typically involves a clustering step to infer spatial domains, and existing methods use statistical or deep learning models to incorporate spatial information into that step. Among statistical methods, Giotto implements a Hidden Markov Random Field model to detect spatial domains with consistent gene expression patterns, while BayesSpace uses a Bayesian model that encourages neighboring spots to be grouped together [1]. Among deep learning methods, GraphST uses graph convolutional networks and self-supervised contrastive learning to reconstruct the gene expression matrix with spatial information [2]. SEDR adopts a variational graph autoencoder to produce embeddings that represent gene expression profiles with spatial information.

However, existing methods face two major limitations in the clustering process. First, they often rely on a hardcoded number of clusters and/or model type, even though ground truth annotations, such as the number of spatial domains, are generally not available in practice. Second, principal component analysis (PCA) is commonly used for dimension reduction of the gene expression matrix, but PCA captures the directions of greatest variance, which may not align with the features needed for clustering and can hinder accurate domain identification.

To tackle these limitations, we applied model-based clustering with various dimension reduction techniques. We benchmarked different clustering and dimension reduction methods on the dorsolateral prefrontal cortex reference dataset, which consists of 12 samples. Specifically, we experimented with mclust and substituted PCA with alternative dimensionality reduction techniques. Most importantly, we used the Bayesian Information Criterion (BIC) to select the best model and determine the optimal number of clusters. Clustering was performed on both the spatial embeddings and the spatially enhanced gene expression matrix, and the results were compared to external knowledge using the Adjusted Rand Index (ARI).

Our preliminary results suggest that selecting Spatially Variable Genes (SVG) may be a more effective dimension reduction approach than PCA. We explored several SVG selection methods, including Giotto KMeans, Giotto Rank, and Spark-X, to reduce the dimensions of GraphST's reconstructed gene expression matrix. On sample 151673, the best BIC-selected model using Giotto KMeans achieved an ARI of 0.6198, outperforming GraphST's default hardcoded model, which clusters PCA embeddings and achieved an ARI of 0.5993.
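
The abstract evaluates clusterings against external annotations with the Adjusted Rand Index. As a minimal sketch of how that score can be computed (this is not the authors' code; the function name `adjusted_rand_index` and the toy label vectors are assumptions made for illustration), the standard pair-counting formula can be written with the Python standard library alone:

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same spots.

    Illustrative helper (hypothetical name), using the standard
    pair-counting definition with chance correction.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: spots shared by each (true, predicted) cluster pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of spot pairs placed together in both, in truth, in prediction.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy labels, invented for illustration only (not data from the study).
annotated = ["L1", "L1", "L2", "L2"]
clustered = [2, 2, 1, 1]
print(adjusted_rand_index(annotated, clustered))
```

Because the score depends only on which spots are grouped together, cluster names do not need to match the annotation labels. A score of 1 means the two partitions are identical, and values near 0 indicate agreement no better than chance.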

Related Organizations

University of Mary
United States


Impact by BIP!

- Citations: 0
- Popularity: Average (the "current" impact or attention of an article in the research community at large, based on the underlying citation network)
- Influence: Average (the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
- Impulse: Average (the initial momentum of an article directly after its publication, based on the underlying citation network)

