ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders

(References for the tools used are available in the manuscript.)

Datasets

68K PBMC: To compare our results with the current state-of-the-art deep learning model, scGAN/cscGAN, we trained and evaluated our model on a dataset containing 68579 peripheral blood mononuclear cells (PBMCs) from a healthy donor (68K PBMC). 68K PBMC is an ideal dataset for evaluating generative models because of its distinct cell populations, data complexity, and size (Marouf et al. 2020). After pre-processing, the data contained 17789 genes. We then performed a balanced split on this data, which resulted in 6991 testing and 61588 training cells.

Brain Small: In addition to 68K PBMC, we used a randomly selected subset of a larger dataset called Brain Large (both by 10x Genomics). Brain Large contains approximately 1.3 million cells from the cortex, hippocampus, and subventricular zone of two embryonic day 18 mice. Brain Small has fewer cells than 68K PBMC and differs from it in complexity and organism. Both Brain Large and its subset (Brain Small) are available on the 10x Genomics portal. After pre-processing, Brain Small contained 17970 genes, which we then split (via a balanced split) into 1997 test cells and 18003 training cells.

NeuroCOVID: This dataset (Heming et al.) contains scRNAseq data of immune cells from the cerebrospinal fluid (CSF) of Neuro-COVID patients and of patients with non-inflammatory and autoimmune neurological diseases or with viral encephalitis. Our pre-processing resulted in data of dimensions 85414 cells x 22824 genes, which we split into testing and training subsets as described above.

Pre-Processing

We used the pipeline provided by Marouf et al. 2020 (scGAN) to pre-process the data. First, we removed genes that were expressed in fewer than 3 cells and cells that expressed fewer than 10 genes. Next, cells were normalized by total unique molecular identifier (UMI) counts and scaled to 20000 reads/cell. Then, we selected a "test set" (approximately 10% of each dataset).
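The filtering and normalization steps described under Pre-Processing can be sketched in plain Python as follows. This is a minimal illustration, not the scGAN pipeline itself: the function names (preprocess, split_test) are hypothetical, the order of the two filters (genes first, then cells) is an assumption, and the hold-out shown is a simple random 10% split rather than the balanced split used for the datasets above.

```python
import random


def preprocess(counts, min_cells=3, min_genes=10, target_sum=20000):
    """Filter and normalize a cells x genes matrix of UMI counts (list of lists).

    Hypothetical helper: filter order and tie-breaking are assumptions.
    """
    n_genes = len(counts[0]) if counts else 0

    # Keep genes that are expressed (count > 0) in at least `min_cells` cells.
    kept_genes = [
        g for g in range(n_genes)
        if sum(1 for row in counts if row[g] > 0) >= min_cells
    ]
    filtered = [[row[g] for g in kept_genes] for row in counts]

    # Keep cells that express at least `min_genes` of the remaining genes.
    kept_cells = [
        i for i, row in enumerate(filtered)
        if sum(1 for v in row if v > 0) >= min_genes
    ]

    # Normalize each kept cell by its total count and rescale to `target_sum`.
    normalized = []
    for i in kept_cells:
        row = filtered[i]
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([v * scale for v in row])

    return normalized, kept_genes, kept_cells


def split_test(n_cells, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split.

    Assumption: a plain random split stands in for the balanced split.
    """
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    n_test = round(n_cells * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])
```

With the defaults, preprocess(counts) applies the thresholds stated above (3 cells, 10 genes, 20000 reads/cell); the thresholds are left as parameters so the sketch can be tried on small toy matrices.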
Post-Processing

After generating a count matrix with a generative model (e.g. ACTIVA or scGAN), we add the gene names (from the real data) and save the result as a Scanpy/Seurat object. We then use Seurat to identify 3000 highly variable genes with the variance-stabilizing transformation (VST), which applies a negative binomial regression to identify outlier genes. The shared highly variable genes are then used for integration with Seurat (reference in the manuscript), which allows for biological feature overlap between different datasets so that the downstream analyses presented in this work can be performed. We next perform gene-level scaling, i.e. centering the mean of each feature to zero and scaling by the standard deviation. The feature space is then reduced to 50 principal components, followed by Uniform Manifold Approximation and Projection (UMAP) and t-distributed Stochastic Neighbor Embedding (t-SNE). As noted by Marouf et al., analysis with lower-dimensional representations has two main advantages: (i) most biologically relevant information is captured while noise is reduced, and (ii) statistically, it is more acceptable to use lower-dimensional embeddings in classification tasks when samples and features are of the same order of magnitude, which is often the case with scRNAseq datasets (such as the ones we used). Lastly, we use Scater to visualize the datasets.
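The gene-level scaling step (centering each feature to zero mean and dividing by its standard deviation) can be illustrated with a short plain-Python sketch. This is not the Seurat implementation used in this work: the function name scale_genes is hypothetical, and the use of the sample standard deviation and the handling of constant genes (set to zero) are assumptions made for the example.

```python
import math


def scale_genes(matrix):
    """Center each gene (column) to mean 0 and scale it by its standard deviation.

    `matrix` is a cells x genes list of lists. Assumptions: the sample
    standard deviation (n - 1 denominator) is used, and genes with zero
    variance are mapped to 0.0 instead of dividing by zero.
    """
    n_cells = len(matrix)
    n_genes = len(matrix[0]) if matrix else 0
    scaled = [[0.0] * n_genes for _ in range(n_cells)]

    for g in range(n_genes):
        column = [row[g] for row in matrix]
        mean = sum(column) / n_cells
        if n_cells > 1:
            variance = sum((v - mean) ** 2 for v in column) / (n_cells - 1)
        else:
            variance = 0.0
        sd = math.sqrt(variance)
        for i, v in enumerate(column):
            scaled[i][g] = (v - mean) / sd if sd > 0 else 0.0

    return scaled
```

Scaling in this way gives every gene the same weight before the principal component step, so that highly expressed genes do not dominate the leading components.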

Related Organizations

University of California, Merced
United States

Keywords

Deep Learning, scRNAseq, Generative Models
