Powered by OpenAIRE graph
ZENODO
Dataset . 2023
License: CC BY
Data sources: Datacite; ZENODO
ZENODO
Dataset . 2024
License: CC BY
Data sources: ZENODO
ZENODO
Dataset . 2023
License: CC BY
Data sources: ZENODO; Datacite
ZENODO
Dataset . 2024
License: CC BY
Data sources: Datacite

MarFERReT: an open-source, version-controlled reference library of marine microbial eukaryote functional genes

Authors: Groussman, Ryan D; Blaskowski, Stephen; Coesel, Sacha; Armbrust, E. Virginia

Abstract

Metatranscriptomics generates large volumes of sequence data about transcribed genes in natural environments. Taxonomic annotation of these datasets depends on availability of curated reference sequences. For marine microbial eukaryotes, current reference libraries are limited by gaps in sequenced organism diversity and barriers to updating libraries with new sequence data, resulting in taxonomic annotation of only about half of eukaryotic environmental transcripts. Here, we introduce version 1.0 of Marine Functional EukaRyotic Reference Taxa (MarFERReT), an updated marine microbial eukaryotic sequence library with a version-controlled framework designed for taxonomic annotation of eukaryotic metatranscriptomes. MarFERReT v1 contains reference sequences from 787 curated marine eukaryotic genomes and transcriptomes drawn from multiple sources, covering 444 species and 287 genera, totaling over 27 million protein sequences with an associated NCBI Taxonomy identifier and Pfam functional annotations.

MarFERReT is linked to a code repository hosting containerized build scripts, documentation on installation and use case examples, and information on new versions of MarFERReT: https://github.com/armbrustlab/marferret

The raw source data for the 902 candidate entries considered for MarFERReT v1, including the 787 accepted entries, are available for download from their respective online locations. The source URL for each of the entries is listed in MarFERReT.v1.entry_curation.csv, and detailed instructions and code for downloading the raw sequence data from source are available in the MarFERReT code repository.

The following MarFERReT data products are available in this repository:

MarFERReT.v1.entry_curation.csv

This CSV file contains information on each candidate entry considered for inclusion in MarFERReT, including the curation of NCBI Taxonomy IDs for MarFERReT entries and entry validation through hierarchical clustering of protein content and other supporting evidence. This file has one row for each of the 902 entries and contains the following columns:

  • candidate_id: Numeric identifier for the candidate entry
  • accepted: Acceptance into the final MarFERReT build (Y/N)
  • ref_id: Numeric identifier for accepted entries
  • marferret_name: Organism name in human and machine friendly format
  • tax_id: Verified NCBI taxID used in MarFERReT
  • NCBI_tax_name: Organism name in NCBI Taxonomy linked to taxID
  • original_taxID: Original NCBI taxID from entry data source metadata, if available
  • source_name: Full organismal name from entry source
  • alias: Additional identifiers for the entry, if available
  • data_source: Origin of the publicly available reference sequence entry
  • source_filename: Name of the original sequence file from the data source
  • source_reference: Shorthand reference
  • source_link: URL where the original sequence data and/or metadata was collected
  • ref_link: PubMed URL of the published reference for the entry, if available
  • ref_doi: DOI of entry data from source, if available
  • n_seqs_raw: Number of sequences in the original sequence file
  • status: Status of the final NCBI taxID (Assigned, Updated, or Unchanged)
  • taxID_notes: Notes on the original_taxID
  • FLAG: Type of flag for candidate entries withheld from final build
  • HCLUST_FLAG: Flag notes from hierarchical clustering analysis (see manuscript)
  • FLAG_Lasek-Nesselquist: Flag notes from Lasek-Nesselquist and Johnson (2019)
  • FLAG_VanVlierberghe: Flag notes from Van Vlierberghe et al. (2021)

MarFERReT.v1.metadata.csv

This CSV file contains rows for the 787 accepted entries in MarFERReT v1 with key metadata for each entry and values for internal process steps. The columns in this file contain the following information:

  • candidate_id: Identifier for candidate entries from MarFERReT.v1.entry_curation.csv
  • ref_id: Numeric identifier for accepted MarFERReT entries
  • marferret_name: A human and machine friendly string derived from the NCBI Taxonomy organism name
  • tax_id: The NCBI Taxonomy ID (taxID)
  • lineage: High-level taxonomic lineage of organism
  • data_type: Type of sequence data: transcriptome shotgun assemblies (TSA), gene models from assembled genomes (genome), and single-cell amplified genomes (SAG) or transcriptomes (SAT)
  • data_source: Online location of sequence data: the Zenodo data repository (Zenodo), the datadryad.org repository (datadryad.org), MMETSP re-assemblies on Zenodo (MMETSP), NCBI Genbank (NCBI), JGI Phycocosm (JGI-Phycocosm), the TARA Oceans portal on Genoscope (TARA), or the Roscoff Culture Collection through the METdb portal (RCC)
  • pub_year: Year of data release or publication of linked reference
  • source_filename: Name of the original sequence file from the data source
  • seq_type: Entry sequence data retrieved in nucleotide (nt) or amino acid (aa) alphabets
  • n_seqs_raw: Number of sequences in the original sequence file
  • aa_fasta: Internal name used for protein sequence files

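As a rough illustration of how the curation table relates to the accepted set, the sketch below filters curation rows on the `accepted` column. It assumes the CSV header uses the column names listed above verbatim and that acceptance is recorded as a literal `Y` or `N`; the function name `read_accepted` and the sample rows are hypothetical and the values are invented.

```python
import csv
import io


def read_accepted(handle):
    """Return the rows of an entry-curation CSV whose 'accepted' column is 'Y'."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row["accepted"].strip().upper() == "Y"]


# Invented sample using a subset of the documented columns (not real MarFERReT data).
SAMPLE = """candidate_id,accepted,ref_id,marferret_name,tax_id
1,Y,1,Example_organism_A,1111
2,N,,Example_organism_B,2222
3,Y,2,Example_organism_C,3333
"""

accepted = read_accepted(io.StringIO(SAMPLE))
accepted_names = [row["marferret_name"] for row in accepted]
print(accepted_names)
```

With the real file, the same function could be called on `open("MarFERReT.v1.entry_curation.csv", newline="")`; going by the description above, one would expect 787 of the 902 rows to pass the filter.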
MarFERReT.v1.proteins.faa.gz

This Gzip-compressed FASTA file contains the 27,788,088 final translated and clustered protein sequences for all 787 accepted MarFERReT entries. The sequence defline contains the unique identifier for the sequence and its reference (mftX, where ‘X’ is a ten-digit integer value).

MarFERReT.v1.taxonomies.tab.gz

This Gzip-compressed tab-separated file is formatted for interoperability with the DIAMOND protein alignment tool commonly used for downstream analyses (see Usage Notes) and contains some columns without any data. Each row contains an entry for one of the MarFERReT protein sequences in MarFERReT.v1.proteins.faa.gz. The columns in this file contain the following information:

  • accession: Not used (NA)
  • accession.version: The unique MarFERReT sequence identifier (‘mftX’)
  • taxid: The NCBI Taxonomy ID associated with this reference sequence
  • gi: Not used (NA)

MarFERReT.v1.proteins_info.tab.gz

This Gzip-compressed tab-separated file contains a row for each final MarFERReT protein sequence with the following columns:

  • aa_id: The unique identifier for each MarFERReT protein sequence
  • ref_id: The unique numeric identifier for each MarFERReT entry
  • source_defline: The original, unformatted sequence identifier

MarFERReT.candidate_entry_Pfam_annotations.tar.gz

This Gzip-compressed archive contains the raw HMMER3 output from the search of Pfam 34.0 HMM profiles against the full set of protein sequences from 899 of the candidate entries. The archive contains 899 files with the suffix ‘.Pfam34.domtblout.tab’ and a prefix with the ‘candidate_id’ and ‘marferret_name’ values from MarFERReT.v1.metadata.csv. The ‘domtblout.tab’ files are the output from hmmsearch using the --domtblout parameter, containing 3 header and 10 footer rows beginning with ‘#’ and rows for each hmmsearch match with 22 whitespace-delimited fields and a target sequence description (see the HMMER documentation for more information on the hmmsearch output file formats). The ‘target name’ (original sequence identifier from MarFERReT.v1.proteins_info.tab.gz), ‘query name’ (Pfam name), ‘accession’ (Pfam ID), ‘E-value’ and ‘score’ (full sequence match scores) are retained in downstream data products.

MarFERReT.v1.best_pfam.csv.gz

This Gzip-compressed CSV file contains the best-scoring Pfam annotation for intra-species clustered protein sequences from the 787 final MarFERReT entries, derived from the raw hmmsearch annotations in MarFERReT.candidate_entry_Pfam_annotations.tar.gz. This file contains the following fields:

  • aa_id: The unique MarFERReT protein sequence ID (‘mftX’)
  • ref_id: Identifier for validated MarFERReT entries
  • source_defline: Original FASTA sequence identifier
  • candidate_id: Identifier for entries validated for inclusion into MarFERReT
  • pfam_name: The shorthand Pfam protein family name
  • pfam_id: The Pfam identifier

MarFERReT.v1.entry_pfam_sums.csv.gz

This Gzip-compressed CSV file contains a reduced version of MarFERReT.v1.best_pfam.csv.gz, grouped by `ref_id` and `pfam_name` to summarize the number of sequences (`n_seqs`) with each unique ref_id-pfam_name pair. It contains the `ref_id`, `pfam_name` and `n_seqs` columns.

MarFERReT.core_genes.v1.csv

This CSV file contains the core transcribed gene (CTG) catalog derived from MarFERReT transcribed reference sequence data (see Methods) to be used in environmental metatranscriptome analysis in conjunction with other MarFERReT data products. The columns contain the following values:

  • lineage: Name of major marine microbial eukaryote lineage
  • n_species: Number of species under this lineage in MarFERReT
  • pfam_id: Pfam protein family identifier
  • frequency: Proportion of species (n_species) in lineage where pfam_id is observed
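To make the domtblout layout described above more concrete, here is a minimal sketch that reads hmmsearch `--domtblout` rows and keeps one hit per target sequence. Field positions follow the standard HMMER domain table (target name, query name, query accession, full-sequence E-value and score). Selecting the hit with the highest full-sequence score is an assumption about what "best-scoring" means here and is not taken from the MarFERReT build scripts; `PfamHit`, `parse_domtblout`, `best_hits` and the sample rows are hypothetical, with invented values.

```python
from typing import NamedTuple


class PfamHit(NamedTuple):
    target_name: str  # original sequence identifier
    pfam_name: str    # 'query name' column
    pfam_id: str      # query 'accession' column
    evalue: float     # full-sequence E-value
    score: float      # full-sequence bit score


def parse_domtblout(lines):
    """Yield one PfamHit per data row, skipping '#' header/footer rows and blanks."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 22 whitespace-delimited fields, then a free-text description.
        fields = line.split(None, 22)
        yield PfamHit(fields[0], fields[3], fields[4],
                      float(fields[6]), float(fields[7]))


def best_hits(hits):
    """Keep the hit with the highest full-sequence score for each target (assumed rule)."""
    best = {}
    for hit in hits:
        current = best.get(hit.target_name)
        if current is None or hit.score > current.score:
            best[hit.target_name] = hit
    return best


# Invented rows in domtblout layout (not real MarFERReT output).
SAMPLE = """
# target name  accession  tlen  query name  accession  qlen  E-value  score  bias ...
seq1 - 300 Pkinase PF00069.28 264 1.2e-50 170.5 0.1 1 1 2.0e-53 1.5e-50 170.2 0.1 1 260 10 280 8 285 0.95 hypothetical protein
seq1 - 300 PK_Tyr_Ser-Thr PF07714.20 259 3.0e-20 70.1 0.0 1 1 4.0e-23 3.5e-20 69.9 0.0 2 250 12 270 10 280 0.90 hypothetical protein
seq2 - 150 Actin PF00022.22 407 5.0e-30 100.3 0.0 1 1 6.0e-33 5.5e-30 100.1 0.0 5 140 3 145 1 150 0.92 -
"""

best = best_hits(parse_domtblout(SAMPLE.splitlines()))
for target, hit in sorted(best.items()):
    print(target, hit.pfam_name, hit.pfam_id, hit.score)
```

Note that a domtblout file has one row per domain, so a sequence with several domains of the same family repeats the same full-sequence score; the strict `>` comparison simply keeps the first such row.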

Keywords

protists, metatranscriptomics, annotation, marine microbiology

  • Impact (BIP!)
    citations: 2
    popularity: Top 10%
    influence: Average
    impulse: Average
  • Usage (OpenAIRE UsageCounts)
    views: 48
    downloads: 89