ZENODO
Dataset . 2024
License: CC BY
Data sources: ZENODO
Data sources: Datacite
The North Pacific Eukaryotic Gene Catalog: metatranscriptome assemblies with taxonomy, function and abundance annotations

Authors: Groussman, Mora; Blaskowski, Stephen; Coesel, Sacha; Armbrust, E. Virginia;

Abstract

Excerpts of key processing steps are sampled below, with links to the detailed code in the main GitHub code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog

This dataset continues the processing of the raw NPEGC Trinity de novo metatranscriptome assemblies, which were uploaded to a separate Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3

Processing and annotation of protein-level NPEGC metatranscripts is done in 5 steps:

1. Six-frame translation into protein sequences
2. Frame-selection of protein-coding translation frames
3. Clustering of protein sequences at 99% sequence identity
4. Taxonomic annotation against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library with DIAMOND
5. Functional annotation against Pfam 35.0 protein family HMM profiles using HMMER3

    # Define local NPEGC base directory here:
    NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"

    # Raw assemblies are located in the /assemblies/raw/ directory
    # for each of the metatranscriptome projects
    PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"

    # raw Trinity assemblies:
    RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"

Translation

We began processing the raw metatranscriptome assemblies by six-frame translation from nucleotide transcripts into three forward and three reverse reading frame translations, using the transeq function in the EMBOSS package. We added a cruise and sample prefix to the sequence IDs to ensure unique identification downstream (e.g., `>TRINITY_DN2064353_c0_g1_i1_1` becomes `>G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1` for the S09C1_3um sample in the G1PA assemblies).
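The ID prefixing described above can be sketched with a small shell helper. This is a minimal illustration and not the code from NPEGC.6tr_frame_selection_clustering.sh: the function name add_sample_prefix and the choice of awk are assumptions made here for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the NPEGC repository):
# prepend "<prefix>_" to every FASTA header read from stdin.
add_sample_prefix() {
    local prefix="$1"
    awk -v p="$prefix" '
        /^>/ { print ">" p "_" substr($0, 2); next }  # header line: insert the prefix after ">"
        { print }                                      # sequence line: pass through unchanged
    '
}

# Example usage (file names follow the pattern of the transeq excerpt;
# the ".prefixed" output name is an assumption):
#   add_sample_prefix "${PREFIX}" < 6tr/${PREFIX}.Trinity.6tr.fasta > 6tr/${PREFIX}.Trinity.6tr.prefixed.fasta
```

Passing the prefix through `awk -v` rather than interpolating it into a sed expression avoids problems if a prefix ever contains characters such as `/` or `&`.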
See NPEGC.6tr_frame_selection_clustering.sh for full code description.

Example of six-frame translation using transeq:

    transeq -auto -sformat pearson -frame 6 -sequence 6tr/${PREFIX}.Trinity.fasta -outseq 6tr/${PREFIX}.Trinity.6tr.fasta

Frame selection

We use a custom frame-selection python script, keep_longest_frame.py, to determine the longest coding length in each translated reading frame and retain the longest sequence (or multiple sequences if there is a tie) for downstream analyses. See NPEGC.6tr_frame_selection_clustering.sh for full code description.

Clustering by sequence identity

To reduce redundancy from identical and near-identical sequences, we cluster protein sequences at the 99% sequence identity level and retain the representative sequence of each cluster in a reduced-size FASTA output file. See NPEGC.6tr_frame_selection_clustering.sh for full code description of linclust/mmseqs clustering.

Sample of linclust clustering script: core mmseqs function

    function NPEGC_linclust {
        # make an index of the fasta file:
        $MMSEQS_DIR/mmseqs createdb $FASTA_PATH/$FASTA_FILE NPac.$STUDY.bf100.db
        # cluster sequences at $MIN_SEQ_ID
        $MMSEQS_DIR/mmseqs linclust NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac_tmp --min-seq-id ${MIN_SEQ_ID}
        # retrieve cluster representatives:
        $MMSEQS_DIR/mmseqs result2repseq NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac.${STUDY}.clusters.rep
        # generate flat FASTA output with cluster reps
        $MMSEQS_DIR/mmseqs result2flat NPac.${STUDY}.bf100.db NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.rep NPac.${STUDY}.bf100.id99.fasta --use-fasta-header
    }

Corresponding files uploaded to this repository: Gzip-compressed FASTA files after translation, frame-selection, and clustering at 99% sequence identity (*.bf100.id99.aa.fasta.gz)

    NPac.G1PA.bf100.id99.aa.fasta.gz
    NPac.G2PA.bf100.id99.aa.fasta.gz
    NPac.G3PA.bf100.id99.aa.fasta.gz
    NPac.G3PA_diel.bf100.id99.aa.fasta.gz
    NPac.D1PA.bf100.id99.aa.fasta.gz

MarFERReT + MARMICRODB taxonomic annotation with DIAMOND

Taxonomy was inferred for the NPEGC metatranscripts with the DIAMOND fast read alignment software against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library, a combined database of the MarFERReT v1.1 marine microbial eukaryote sequence library and the MARMICRODB v1.0 prokaryote-focused marine genome database. See NPEGC.diamond_taxonomy.log.sh for full description of DIAMOND annotation.

Excerpt of core DIAMOND function:

    function NPEGC_diamond {
        # FASTA filename for $STUDY
        FASTER_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
        # Output filename for LCA results in lca.tab file:
        LCA_TAB="NPac.${STUDY}.MarFERReT_v1.1_MMDB.lca.tab"
        echo "Beginning ${STUDY}"
        singularity exec --no-home --bind ${DATA_DIR} \
            "${CONTAINER_DIR}/diamond.sif" diamond blastp \
            -c 4 --threads $N_THREADS \
            --db $MFT_MMDB_DMND_DB -e $EVALUE --top 10 -f 102 \
            --memory-limit 110 \
            --query ${FASTER_FASTA} -o ${LCA_TAB} >> "${STUDY}.MarFERReT_v1.1_MMDB.log" 2>&1
    }

Corresponding files uploaded to this repository: Gzip-compressed DIAMOND lowest common ancestor predictions with NCBI Taxonomy against the combined MarFERReT + MARMICRODB taxonomic library (*.MarFERReT_v1.1_MMDB.lca.tab.gz)

    NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G2PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G3PA.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.G3PA_diel.MarFERReT_v1.1_MMDB.lca.tab.gz
    NPac.D1PA.MarFERReT_v1.1_MMDB.lca.tab.gz

Pfam 35.0 functional annotation using HMMER3

Clustered protein sequences were annotated against the Pfam 35.0 collection of 19,179 protein family Hidden Markov Models (HMMs) using HMMER 3.3.
Pfam annotation code is documented here: NPEGC.hmmer_function.sh

Excerpt of core hmmsearch function:

    function NPEGC_hmmer {
        # Define input FASTA
        INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
        # hmmsearch call:
        hmmsearch --cut_tc --cpu $NCORES --domtblout $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab $HMM_PROFILE ${INPUT_FASTA}
        # compress output file:
        gzip $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab
    }

Corresponding files uploaded to this repository: Gzip-compressed hmmsearch domain table files for Pfam35 queries (*.Pfam35.domtblout.tab.gz)

    G1PA.Pfam35.domtblout.tab.gz
    G2PA.Pfam35.domtblout.tab.gz
    G3PA.Pfam35.domtblout.tab.gz
    G3PA_diel.Pfam35.domtblout.tab.gz
    D1PA.Pfam35.domtblout.tab.gz
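As an illustration of how the domain table files listed above might be used downstream, the sketch below keeps the highest-scoring Pfam domain for each protein. It is not part of the NPEGC code: the function name best_pfam_per_protein and the output file name are assumptions, and it relies on the standard hmmsearch --domtblout layout (column 1: target sequence name, columns 4 and 5: HMM name and accession, column 14: per-domain bit score).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the NPEGC repository):
# read hmmsearch --domtblout text on stdin and print, for each protein,
# the Pfam name, Pfam accession and bit score of its best-scoring domain.
best_pfam_per_protein() {
    awk '
        /^#/ { next }                       # skip the commented header and footer lines
        {
            protein = $1                    # target sequence ID
            score = $14 + 0                 # per-domain bit score, forced to numeric
            if (!(protein in best) || score > best[protein]) {
                best[protein] = score
                hit[protein] = $4 "\t" $5 "\t" $14
            }
        }
        END {
            for (p in best) print p "\t" hit[p]
        }
    ' | LC_ALL=C sort
}

# Example usage (output file name is an assumption):
#   gzip -dc G1PA.Pfam35.domtblout.tab.gz | best_pfam_per_protein > G1PA.best_pfam.tsv
```

Because the hmmsearch call uses --cut_tc, every row in the table has already passed the Pfam trusted cutoffs, so this step only chooses among accepted domains; whether a single best domain per protein is the right summary depends on the analysis.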
