Predicting Phenotypic Traits Using a Massive RNA-seq Dataset

Abstract

The included datasets are a compilation of all Arabidopsis thaliana RNA-seq data available from NCBI as of November 2022, processed to count data. In addition, the associated annotation files from the NCBI BioProject database and processed versions of this data are included. Data have been processed according to the "Data Description Methods" in the manuscript titled "Predicting Phenotypic Traits Using a Massive RNA-seq Dataset" (in publication). The associated methods can be found at this repository: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics. These datasets can be used for exploring machine learning methods for predicting both continuous (Age) and categorical (Tissue) phenotypic traits using gene expression. Additionally, the gene expression data can be used on its own for the investigation of gene expression in Arabidopsis thaliana.

Note to Researchers

This repository contains all of the datasets and information necessary to recreate the experiments in our paper. However, it may be that you are interested in our dataset for testing your own hypotheses or programs. If this is the case, we predict that you are looking for one or more of the following 5 datasets.

Note on File Compression

All files in this repository are compressed using bzip2 to conserve space and allow for easier file transfer. On Linux systems, a file can be decompressed with `bzip2 -d FILE_NAME`. For other systems (Windows and macOS), please consult your user manual.

Description

All Datasets:

Title: Gene Expression Count Data of all Arabidopsis thaliana data available from NCBI SRA as of November 2022
Abstract: Gene Expression Count data was created using the workflow GEMmaker. The resulting Gene Expression Matrix (GEM) was then normalized and thresholded. The 4 normalized files contain the same data under Trimmed Mean of M values (TMM), Median Ratios Normalization (MRN), Transcripts Per Kilobase Million (TPM), and No Normalization (NoNo). Each file is included both as a tsv and as a Python pickle. The tsv file is human readable, whereas the pickle file can be read into memory substantially faster. In the tsv format, each row represents a sample and each column represents a gene. NCBI_Nov2022_SRR_runinfo.csv is the starting file from NCBI, which reports SRR information for each sample.
Note 1 to Researchers: MRN normalization performed the best in our experiments and is likely what you want to use if you are doing additional experimentation with this dataset. Otherwise, start with NoNo and perform your own normalizations.
Note 2 to Researchers: The 54547 dataset will need to be thresholded prior to use. We include it in addition to the 32432 datasets in case you wish to try a different thresholding from the one outlined in our manuscript.
Author: John Anthony Hadish
Data Type: Gene Expression Count Data
Organism: Arabidopsis thaliana
Files:
NCBI_Nov2022_SRR_runinfo.csv - Arabidopsis RNA-seq SRA RunInfo retrieved from NCBI November 2022. This is the unprocessed data.
Dataset_54547_NoFilter_raw.pkl - Raw file before thresholding (".pkl" format). Same as NoNo normalization without thresholding.
Dataset_54547_NoFilter_raw.tsv - Raw file before thresholding (".tsv" format). Same as NoNo normalization without thresholding.
Dataset_32432_MRN.pkl - MRN normalized (".pkl" format)
Dataset_32432_MRN.tsv - MRN normalized (".tsv" format)
Dataset_32432_NoNo.pkl - NoNo normalized (".pkl" format)
Dataset_32432_NoNo.tsv - NoNo normalized (".tsv" format)
Dataset_32432_TMM.pkl - TMM normalized (".pkl" format)
Dataset_32432_TMM.tsv - TMM normalized (".tsv" format)
Dataset_32432_TPM.pkl - TPM normalized (".pkl" format)
Dataset_32432_TPM.tsv - TPM normalized (".tsv" format)

Title: Meta Data Arabidopsis Age and Tissue
Abstract: Meta Data for Age and Tissue after processing. In our experiment this was used as the response variable to gene expression. The two processed data frames share the columns "bio_sample", "bioproject_name", and "experiment". In addition to these processed datasets, NCBI_Nov2022_BioSample_data.tsv is the unprocessed starting material for these two data frames.
Author: John Anthony Hadish
Data Type: Metadata on phenotypes. ".tsv" format
Organism: Arabidopsis thaliana
Files:
NCBI_Nov2022_BioSample_data.tsv - Arabidopsis BioSample data retrieved from NCBI November 2022. This is the unprocessed data.
df_metadata_tissue.tsv - Tissue annotations for 24876 samples
df_metadata_age.tsv - Age annotations for 16078 samples. In addition to the shared columns, includes "days_age" (the age of the sample converted to days) and "annotation_age" (the unit in which the age was reported for this sample in the raw data file, e.g. "days", "weeks").

Title: Machine Learning Dataset for Arabidopsis thaliana Age
Abstract: The dataset used for machine learning on the phenotype Age, which is a combination of the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 elements holding the train and test splits.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations combined, split into train and test
Organism: Arabidopsis thaliana
Files:
Dataset_Age_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Age_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Age_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Age_TrainTestSplits_tpm.pkl - TPM normalized

Title: Machine Learning Dataset for Arabidopsis thaliana Tissue
Abstract: The dataset used for machine learning on the phenotype Tissue, which is a combination of the Gene Expression Matrix and the Annotation Matrix. Each file consists of a list of 4 elements holding the train and test splits. Saved as Python ".pkl" files.
Author: John Anthony Hadish
Data Type: Gene Expression Matrix and Annotations combined, split into train and test. Saved as Python ".pkl" files.
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_mrn.pkl - MRN normalized
Dataset_Tissue_TrainTestSplits_NoNo.pkl - NoNo normalized
Dataset_Tissue_TrainTestSplits_tmm.pkl - TMM normalized
Dataset_Tissue_TrainTestSplits_tpm.pkl - TPM normalized
Dataset_Tissue_TrainTestSplits_mrn_4category.pkl - MRN for the tissue-4 dataset

Title: BioProject Names
Abstract: Three-column file with BioProject name, BioSample name, and Experiment name
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
BioProject_Names_All.tsv

Title: Manuscript Supplemental Material
Abstract: Supplemental tables and figures described in the manuscript (included with the manuscript and here for convenience). Please see the manuscript for additional information.
Author: John Anthony Hadish
Data Type: ".tsv", ".png", ".pdf"
Organism: Arabidopsis thaliana
Files:
Supplemental_Figures.zip - Supplemental figures from the manuscript. Includes a description of each figure.
Supplemental_Tables.zip - Supplemental tables from the manuscript. Includes a description of each table.

Title: Splits of data for 3 experiments
Abstract: Two-column tsv files. The first column is the experiment (sample) name, and the second column indicates whether the sample is in the train or the test data. These files are not used by the scripts; they are included so that the splits can be reproduced even if the pickle format becomes unreadable in the future.
Author: John Anthony Hadish
Data Type: ".tsv"
Organism: Arabidopsis thaliana
Files:
Dataset_Tissue_TrainTestSplits_4category_namesOnly.tsv
Dataset_Tissue_TrainTestSplits_namesOnly.tsv
Dataset_Age_TrainTestSplits_namesOnly.tsv

Title: Git Code Repository
Abstract: A tar bz2 archive of the git repository containing all of the code created for this manuscript. The same code is also available on GitLab at: https://gitlab.com/ficklinlab-public/modeling-with-transcriptomics
Author: John Anthony Hadish
Data Type: Git repository, Python code
Files:
modeling-with-transcriptomics-main.tar.bz2 - Compressed Git repository of all code used in the paper.
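
As an illustration of working with the bzip2-compressed files, the two-column train/test split files described above (the "_namesOnly.tsv" files) could be read in Python without a separate `bzip2 -d` step. The following is a minimal sketch, not code from the repository. It assumes the file has no header row and that the second column holds a short label such as "train" or "test"; the exact labels are not stated in this description. The function name `read_split_names`, the demo file name, and the sample names are hypothetical placeholders.

```python
import bz2
import csv
import os
import tempfile
from collections import defaultdict


def read_split_names(path):
    """Group sample names by split label from a bzip2-compressed two-column TSV.

    Assumption: column 1 is the sample name, column 2 is the split label,
    and there is no header row (a header would appear as its own group).
    """
    groups = defaultdict(list)
    # bz2.open in text mode decompresses on the fly while reading.
    with bz2.open(path, mode="rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            sample, label = row[0], row[1]
            groups[label.strip().lower()].append(sample)
    return dict(groups)


# Tiny synthetic stand-in for a real split file (placeholder sample names).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_namesOnly.tsv.bz2")
with bz2.open(demo_path, mode="wt", newline="") as out:
    out.write("sample_a\ttrain\nsample_b\ttest\nsample_c\ttrain\n")

splits = read_split_names(demo_path)
print({label: len(names) for label, names in splits.items()})
```

To use this on a real file, pass the path of the downloaded compressed split file in place of `demo_path`, and check the labels it returns before relying on the "train" and "test" keys.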

Related Organizations

Washington State University
United States
