MarPRISM

Data used to develop and test MarPRISM, a model to predict the in situ trophic mode of marine protists.

To examine the in situ activity of protists, Lambert et al. (2022) developed a machine learning model to predict the trophic mode of marine protist species based on gene expression from metatranscriptomes. Recent studies (Groussman et al., 2023; Lasek-Nesselquist and Johnson, 2019; Van Vlierberghe et al., 2021) identified that a number of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) transcriptomes used for training the Lambert model have a low number of sequences and/or high contamination.

trainingData_withContam_withLowSeqs.csv.gz: training data used by Lambert et al. (2022); includes contaminated and low-sequence entries.

trainingDataMarPRISM.csv.gz: training data used for MarPRISM, with contaminated and low-sequence entries removed. Transcriptomes with fewer than 1200 total sequences, fewer than 500 total assigned Pfam domains, and/or greater than 50% contamination from non-target organisms were removed from the training data and used for testing.

trainingDataMarPRISM_binary.csv.gz: trainingDataMarPRISM.csv.gz with TPM values greater than 0 converted to 1, in order to determine whether a model could be built on the binary expression of Pfams rather than continuous expression values.

trainingData_micromonasMixToPhot.csv.gz: trainingDataMarPRISM.csv.gz with any transcriptomes from Micromonas strains that were originally labeled mixotrophic switched to phototrophic. This dataset was created to determine the effect of Micromonas in the training data, as well as the effect of permuting trophic mode labels in the training data, as there may be errors in some of the trophic mode labels.

The training datasets are unbalanced, with more phototrophic transcriptomes than heterotrophic and mixotrophic transcriptomes.
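The exclusion thresholds and the binary conversion described above can be illustrated with a minimal sketch. It is not the authors' code: the record type, its field names, and the example values are hypothetical, and only the three thresholds and the "TPM greater than 0 becomes 1" rule come from the description.

```python
from dataclasses import dataclass

# Thresholds as stated in the dataset description.
MIN_SEQUENCES = 1200
MIN_PFAMS = 500
MAX_CONTAM_PCT = 50.0


@dataclass
class TranscriptomeQC:
    # Hypothetical record; field names are illustrative, not the dataset's column names.
    name: str
    num_sequences: int
    num_pfams: int
    contam_pct: float


def fails_qc(t: TranscriptomeQC) -> bool:
    """True if any one of the three exclusion criteria is met."""
    return (
        t.num_sequences < MIN_SEQUENCES
        or t.num_pfams < MIN_PFAMS
        or t.contam_pct > MAX_CONTAM_PCT
    )


def binarize(tpm_by_pfam: dict) -> dict:
    """Convert TPM values greater than 0 to 1; zeros stay 0."""
    return {pfam: 1 if tpm > 0 else 0 for pfam, tpm in tpm_by_pfam.items()}


# Made-up example records, not real MMETSP entries.
records = [
    TranscriptomeQC("example_A", 25000, 3100, 2.5),
    TranscriptomeQC("example_B", 900, 3100, 2.5),     # too few sequences
    TranscriptomeQC("example_C", 25000, 3100, 71.0),  # heavy contamination
]
kept = [t.name for t in records if not fails_qc(t)]
held_out = [t.name for t in records if fails_qc(t)]
binary_row = binarize({"PF00001": 12.4, "PF00002": 0.0, "PF00003": 0.3})
```

In this sketch a transcriptome is held out when any single criterion is met, which matches the "and/or" wording of the description.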
Feature selection and hyperparameter search were therefore conducted on versions of the training datasets with phototrophic transcriptomes randomly undersampled.

trainingData_withContam_withLowSeqs_80phot.zip: 80 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingData_withContam_withLowSeqs.csv.gz
trainingData_withContam_withLowSeqs_100phot.zip: 100 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingData_withContam_withLowSeqs.csv.gz
trainingData_withContam_withLowSeqs_120phot.zip: 120 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingData_withContam_withLowSeqs.csv.gz
trainingData_withContam_withLowSeqs_140phot.zip: 140 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingData_withContam_withLowSeqs.csv.gz

trainingData_contamLowSeqsRemoved_50phot.zip: 50 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingDataMarPRISM.csv.gz
trainingData_contamLowSeqsRemoved_80phot.zip: 80 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingDataMarPRISM.csv.gz
trainingData_contamLowSeqsRemoved_100phot.zip: 100 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingDataMarPRISM.csv.gz
trainingData_contamLowSeqsRemoved_120phot.zip: 120 phototrophic transcriptomes, all of the mixotrophic and heterotrophic transcriptomes in trainingDataMarPRISM.csv.gz

trainingData_contamLowSeqsRemoved_binary_50phot.zip: same as trainingData_contamLowSeqsRemoved_50phot.zip but TPM values > 0 were converted to 1
trainingData_contamLowSeqsRemoved_binary_80phot.zip: same as trainingData_contamLowSeqsRemoved_80phot.zip but TPM values > 0 were converted to 1
trainingData_contamLowSeqsRemoved_binary_100phot.zip: same as trainingData_contamLowSeqsRemoved_100phot.zip but TPM values > 0 were converted to 1
trainingData_contamLowSeqsRemoved_binary_120phot.zip: same as trainingData_contamLowSeqsRemoved_120phot.zip but TPM values > 0 were converted to 1

trainingData_contamLowSeqsRemoved_micromonasMixToPhot_50phot.zip: after converting mixotrophy labels for any Micromonas strains in trainingDataMarPRISM.csv.gz to phototrophy, 50 phototrophic transcriptomes were randomly selected along with all of the mixotrophic and heterotrophic transcriptomes
trainingData_contamLowSeqsRemoved_micromonasMixToPhot_80phot.zip: after converting mixotrophy labels for any Micromonas strains in trainingDataMarPRISM.csv.gz to phototrophy, 80 phototrophic transcriptomes were randomly selected along with all of the mixotrophic and heterotrophic transcriptomes
trainingData_contamLowSeqsRemoved_micromonasMixToPhot_100phot.zip: after converting mixotrophy labels for any Micromonas strains in trainingDataMarPRISM.csv.gz to phototrophy, 100 phototrophic transcriptomes were randomly selected along with all of the mixotrophic and heterotrophic transcriptomes
trainingData_contamLowSeqsRemoved_micromonasMixToPhot_120phot.zip: after converting mixotrophy labels for any Micromonas strains in trainingDataMarPRISM.csv.gz to phototrophy, 120 phototrophic transcriptomes were randomly selected along with all of the mixotrophic and heterotrophic transcriptomes

Feature selection was run on the training datasets after undersampling phototrophic transcriptomes, for both XGBoost and Random Forest models.
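The undersampling behind the files listed above can be pictured with a short sketch: draw a fixed number of phototrophic transcriptomes at random and keep every mixotrophic and heterotrophic one. The row layout, the label strings ("Phot", "Mixo", "Het"), the function name, and the seed are assumptions made for illustration; the description does not state how the authors drew their samples.

```python
import random


def undersample_phototrophs(rows, n_phot, seed=0):
    """Keep n_phot randomly chosen phototrophic rows plus every other row.

    rows: list of dicts with a hypothetical "trophic_mode" key.
    """
    phot = [r for r in rows if r["trophic_mode"] == "Phot"]
    other = [r for r in rows if r["trophic_mode"] != "Phot"]
    if n_phot > len(phot):
        raise ValueError("n_phot exceeds the number of phototrophic rows")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.sample(phot, n_phot) + other


# Toy training set: 10 phototrophs, 3 mixotrophs, 2 heterotrophs.
rows = (
    [{"id": f"phot_{i}", "trophic_mode": "Phot"} for i in range(10)]
    + [{"id": f"mixo_{i}", "trophic_mode": "Mixo"} for i in range(3)]
    + [{"id": f"het_{i}", "trophic_mode": "Het"} for i in range(2)]
)
subset = undersample_phototrophs(rows, n_phot=4, seed=1)
```

Sampling without replacement and fixing the seed are design choices of this sketch, made so that a given subset can be regenerated.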
Feature Pfams from each feature selection run:

MarPRISM_featurePfams.csv.gz: MarPRISM feature Pfams, XGBoost model, contaminated and low-sequence entries removed from training data
Extracted_Pfams_contaminationLowSeqsRemoved_rfModel_rfFeatures.csv.gz: Random Forest model, contaminated and low-sequence entries removed from training data
Extracted_Pfams_contaminationLowSeqsRemoved_xgModel_xgRFFeatures.csv.gz: union of XGBoost and Random Forest feature Pfams, contaminated and low-sequence entries removed from training data
Extracted_Pfams_contaminationLowSeqsIncluded_xgModel_xgFeatures.csv.gz: XGBoost model, contaminated and low-sequence entries included in training data
Extracted_Pfams_contaminationLowSeqsIncluded_xgModel_xgFeatures.csv.gz: Random Forest model, contaminated and low-sequence entries included in training data
Extracted_Pfams_contaminationLowSeqsRemoved_xgModel_xgFeatures_binary.csv.gz: XGBoost model, contaminated and low-sequence entries removed from training data, TPM values > 0 were converted to 1
Extracted_Pfams_contaminationLowSeqsRemoved_xgModel_xgFeatures_micromonasMixToPhot.csv.gz: XGBoost model, contaminated and low-sequence entries removed from training data, mixotrophy labels for any Micromonas strains were converted to phototrophy

Transcriptomes not included in the training data and not from the MMETSP were used to test MarPRISM.

testTranscriptomes.csv.gz: transcripts per million by Pfam ID for these test transcriptomes. Some of these transcriptomes were processed and used for testing by Lambert et al. (2022), while others, from Pterosperma cristatum, Amphora coffeaeformis, Chaetoceros sp., and Cylindrotheca closterium, were newly added for testing. Transcripts per million for the latter transcriptomes were derived from publicly available salmon mappings.
Pfam annotations for the newly added transcriptomes were generated as follows: the publicly available assembled transcriptome for each species was six-frame translated with transeq; the longest reading frame (minimum length 100 amino acids) was selected for each contig; the selected reading frame of each contig was compared to the Pfam database version 34 with hmmsearch; and the Pfam annotation with the best bitscore for each contig was retained (e-value < 1e-05). Transcripts per million were summed by Pfam.

testTranscriptomes.xlsx.gz: accession IDs and references for these test transcriptomes; transcriptome culture conditions.

Transcriptomes that were excluded from the training data for MarPRISM due to high contamination or a low number of sequences were also used to test MarPRISM.

testTranscriptomes_MMETSP.csv.gz: transcripts per million by Pfam ID for these excluded transcriptomes
testTranscriptomes_MMETSP.xlsx.gz: MMETSP IDs and references for these excluded transcriptomes. Contamination and low-sequence entries were identified by Groussman et al. (2023), Lasek-Nesselquist and Johnson (2019), and Van Vlierberghe et al. (2021), and curated by Groussman et al. (2023). Identity of ribosomal sequences was analyzed by Groussman et al. (2023).

Columns:

qc_flag_Groussman: LOW_SEQS, fewer than 1,200 raw sequences; LOW_PFAMS, fewer than 500 Pfam domain annotations.
num_sequences_Groussman: number of sequences in the original sequence file.
num_pfams_Groussman: number of Pfam domains identified in protein sequences.
flag_Lasek: flag notes from Lasek-Nesselquist and Johnson, 2019; CONTAM NOTED, ciliate samples reported as contaminated in this study.
flag_VanVlierberghe: flag for a high level of estimated contamination, from 'flag_VanVlierberghe'; CONTAM_50PCT, contamination percentages over 50%.
flag_ribosomalContamination_Groussman: flag for a high level of estimated contamination, from 'ribosomal_contam_pct_Groussman'; CONTAM_50PCT, contamination percentages over 50%.
ribosomal_contam_pct_Groussman: percent of total ribosomal protein sequences with an inferred taxonomic identity in any lineage other than the recorded identity.
Ribosomal taxonomy of most abundant contaminant: for entries in which more than 50% of ribosomal protein sequences have an inferred taxonomic identity in any lineage other than the recorded identity, the taxonomic identity of the most abundant ribosomal protein sequences not identified as the recorded identity of the transcriptome in the MMETSP.
Expected taxonomy of transcriptome: recorded identity in MMETSP.
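The last two steps of the Pfam annotation procedure described above for the newly added test transcriptomes (retain the best-bitscore Pfam hit per contig with e-value < 1e-05, then sum transcripts per million by Pfam) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the tuple layout, function names, and example values are hypothetical, and parsing of real hmmsearch or salmon output is left out.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-5  # threshold stated in the description


def best_hit_per_contig(hits):
    """Return {contig: pfam} for the highest-bitscore hit passing the cutoff.

    hits: iterable of (contig, pfam, bitscore, evalue) tuples (hypothetical layout).
    """
    best = {}
    for contig, pfam, bitscore, evalue in hits:
        if evalue >= EVALUE_CUTOFF:
            continue  # discard hits that do not meet e-value < 1e-05
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (pfam, bitscore)
    return {contig: pfam for contig, (pfam, _) in best.items()}


def sum_tpm_by_pfam(contig_to_pfam, contig_tpm):
    """Sum per-contig TPM values over the Pfam assigned to each contig."""
    totals = defaultdict(float)
    for contig, pfam in contig_to_pfam.items():
        totals[pfam] += contig_tpm.get(contig, 0.0)
    return dict(totals)


# Made-up hits and TPM values for three contigs.
hits = [
    ("c1", "PF00001", 50.0, 1e-10),
    ("c1", "PF00002", 80.0, 1e-20),  # better bitscore, wins for c1
    ("c2", "PF00001", 30.0, 1e-7),
    ("c3", "PF00003", 12.0, 1e-3),   # fails the e-value cutoff
]
contig_tpm = {"c1": 10.0, "c2": 5.5, "c3": 2.0}
assignment = best_hit_per_contig(hits)
tpm_by_pfam = sum_tpm_by_pfam(assignment, contig_tpm)
```

Contigs without a passing hit, such as c3 here, contribute nothing to the per-Pfam totals in this sketch.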

Related Organizations

University of Mary
United States

Keywords

protists, Machine Learning, Micro-organism, mixotrophy, FOS: Earth and related environmental sciences, Oceanography, Aquatic micro-organism, biological oceanography
