CellO Data Sets

Overview

A permanent archive of the datasets used in the CellO manuscript (https://www.biorxiv.org/content/10.1101/634097v2).

Expression data

Quantified gene expression for all bulk RNA-seq samples used in this study is available as an HDF5 file (in log transcripts per million): bulk_log_tpm.h5

Each bulk RNA-seq experiment accession is mapped to a set of cell type labels from the Cell Ontology: bulk_labels.json

The single-cell data used in this study are also available as an HDF5 file (in log transcripts per million): single_cell_log_tpm.h5

Each single-cell experiment accession is mapped to a set of cell type labels from the Cell Ontology: single_cell_labels.json

Dataset partitions

We partitioned the bulk RNA-seq data into several subsets that were used for different purposes in the study.

The list of bulk RNA-seq samples used for training the classifier for evaluation on the bulk validation set (i.e. the pre-training set): pre_training_bulk_experiments.json

The list of bulk RNA-seq samples in the validation set: validation_bulk_experiments.json

The list of single-cell experiments in the test set used for evaluating CellO. These are all samples with cell type terms that also appear in the bulk RNA-seq data (i.e. the training data): test_single_cell_experiments.json

Technical variable annotations

We annotated 27,097 RNA-seq samples in the Sequence Read Archive (SRA) with technical variables in order to derive a set of primary, healthy, untreated samples (i.e. the datasets above).

Our annotations were based on a custom label hierarchy of technical variables: tags.json

The mapping from each SRA experiment accession to its set of technical variable labels: experiment_tags.json

Trained model coefficients

After training the binary classifiers for each cell type, the model coefficients can be used to investigate up- and downregulated genes in each cell type. Below, we post the model coefficients for the one-versus-rest binary classifiers (used in the Isotonic Regression and True Path Rule algorithms) as well as the coefficients for the classifiers in the Cascaded Logistic Regression algorithm. Each model was trained on the full set of bulk RNA-seq samples used in the study.

Each algorithm's cell type model coefficients are available in a tab-separated-value file.

One-versus-rest classifier coefficients: one_vs_rest_coefficients.tsv.gz

Cascaded logistic regression coefficients: cascaded_logistic_regression_coefficients.tsv.gz
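As an illustration of how the coefficient files might be used to look for up- and downregulated genes, the following is a minimal Python sketch that uses only the standard library. The file layout is an assumption and is not documented above: the sketch supposes a header row of gene identifiers and one row per cell type, with the cell type identifier in the first column. The function names read_coefficients and rank_genes are hypothetical helpers, not part of CellO. Inspect the actual files before relying on it; if they are transposed, the reader would need to be adapted.

```python
import csv
import gzip


def read_coefficients(path):
    """Read a gzipped TSV into {row_label: {column_label: coefficient}}.

    ASSUMPTION: the first row is a header whose first cell is a label for the
    row identifiers and whose remaining cells are gene identifiers; each later
    row starts with a cell type identifier. The real files may differ.
    """
    table = {}
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        genes = next(reader)[1:]
        for row in reader:
            if not row:
                continue  # skip blank lines
            table[row[0]] = {g: float(v) for g, v in zip(genes, row[1:])}
    return table


def rank_genes(coefficients, k=10):
    """Return the k most positive and the k most negative coefficients.

    Each result is a list of (gene, coefficient) pairs, strongest first.
    """
    items = list(coefficients.items())
    up = sorted(items, key=lambda item: item[1], reverse=True)[:k]
    down = sorted(items, key=lambda item: item[1])[:k]
    return up, down
```

Under these assumptions, calling read_coefficients("one_vs_rest_coefficients.tsv.gz") and passing one cell type's row to rank_genes would list the genes with the largest positive and negative weights. For a linear classifier on log TPM values, a large positive weight suggests that higher expression pushes the classifier toward that cell type, but a coefficient is not by itself evidence of differential expression.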
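The label and partition files described above can be combined to see which cell types are represented in a given subset. The sketch below assumes, without confirmation from the files themselves, that each labels file parses to a JSON object mapping an experiment accession to a list of Cell Ontology term identifiers, and that each partition file parses to a JSON list of accessions. The helper names are hypothetical.

```python
import json
from collections import Counter


def load_json(path):
    """Parse a JSON file and return the resulting Python object."""
    with open(path) as handle:
        return json.load(handle)


def labels_for_partition(labels, partition):
    """Keep only the accessions that belong to the given partition.

    ASSUMPTION: `labels` is a dict of accession -> list of term IDs and
    `partition` is a list of accessions.
    """
    wanted = set(partition)
    return {acc: sorted(terms) for acc, terms in labels.items() if acc in wanted}


def count_samples_per_label(subset):
    """Count how many accessions carry each cell type label."""
    counts = Counter()
    for terms in subset.values():
        counts.update(set(terms))  # count each label once per accession
    return counts
```

For example, if the assumptions hold, labels_for_partition(load_json("bulk_labels.json"), load_json("validation_bulk_experiments.json")) would give the labels of the validation samples, and count_samples_per_label on that result would give the number of validation samples per cell type.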

Related Organizations

Morgridge Institute for Research
United States
University of Wisconsin–Oshkosh
United States

Keywords

machine learning, Sequence Read Archive, cell type, CellO, RNA-seq
