The North Pacific Eukaryotic Gene Catalog: metatranscriptome assemblies with taxonomy, function and abundance annotations

Research data (Dataset) | 25 Jan 2024 | Publisher: Zenodo

Authors: Groussman, Mora; Blaskowski, Stephen; Coesel, Sacha; Armbrust, E. Virginia

doi: 10.5281/zenodo.10472590, 10.5281/zenodo.10472589, 10.5281/zenodo.12630398

Abstract

Excerpts of key processing steps are sampled below, with links to the detailed code in the main GitHub code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog

This dataset continues the processing of the raw NPEGC Trinity de novo metatranscriptome assemblies, which were uploaded to a separate Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3

Processing and annotation of protein-level NPEGC metatranscripts is done in 5 steps:
1. Six-frame translation into protein sequences
2. Frame-selection of protein-coding translation frames
3. Clustering of protein sequences at 99% sequence identity
4. Taxonomic annotation against MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library with DIAMOND
5. Functional annotation against Pfam 35.0 protein family HMM profiles using HMMER3

    # Define local NPEGC base directory here:
    NPEGC_DIR="/mnt/nfs/projects/armbrust-metat"

    # Raw assemblies are located in the /assemblies/raw/ directory
    # for each of the metatranscriptome projects
    PROJECT_LIST="D1PA G1PA G2PA G3PA G3PA_diel"

    # raw Trinity assemblies:
    RAW_ASSEMBLY_DIR="${NPEGC_DIR}/${PROJECT}/assemblies/raw"

Translation

We began processing the raw metatranscriptome assemblies by six-frame translation of the nucleotide transcripts into three forward and three reverse reading frame translations, using the transeq function in the EMBOSS package. We then added a cruise and sample prefix to the sequence IDs to ensure unique identification downstream (e.g., `>TRINITY_DN2064353_c0_g1_i1_1` becomes `>G1PA_S09C1_3um_TRINITY_DN2064353_c0_g1_i1_1` for the S09C1_3um sample in the G1PA assemblies).
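To make the ID prefixing and the frame-selection step (step 2 in the list above) more concrete, the following is a minimal Python sketch. It is not the authors' keep_longest_frame.py. The function names (`add_prefix`, `parse_fasta`, `longest_orf_length`, `keep_longest_frames`) are hypothetical, and the sketch assumes that the six frames of a transcript share an ID up to a trailing `_1` to `_6` suffix, that stop codons appear as `*` in the translation, and that the whole translated frame is retained for the winning frame(s).

```python
from collections import defaultdict


def add_prefix(header, prefix):
    """Prepend a cruise/sample prefix to a FASTA header line."""
    return f">{prefix}_{header.lstrip('>')}"


def parse_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def longest_orf_length(protein):
    """Length of the longest stop-free stretch (assumes '*' marks stops)."""
    return max(len(segment) for segment in protein.split("*"))


def keep_longest_frames(records):
    """Keep, per transcript, the frame(s) with the longest stop-free stretch.

    Ties are all retained. Frames are grouped by the ID with its trailing
    '_<frame>' suffix removed (an assumption about the ID layout).
    """
    by_transcript = defaultdict(list)
    for seq_id, protein in records:
        transcript, _, _frame = seq_id.rpartition("_")
        by_transcript[transcript].append((seq_id, protein))

    kept = []
    for frames in by_transcript.values():
        best = max(longest_orf_length(p) for _, p in frames)
        kept.extend((i, p) for i, p in frames if longest_orf_length(p) == best)
    return kept


if __name__ == "__main__":
    # Toy input, invented for illustration only.
    toy = [
        ">TRINITY_DN1_c0_g1_i1_1", "MKT*LLA",
        ">TRINITY_DN1_c0_g1_i1_2", "MKTAYLLA",
    ]
    prefixed = [
        add_prefix(line, "G1PA_S09C1_3um") if line.startswith(">") else line
        for line in toy
    ]
    for seq_id, protein in keep_longest_frames(parse_fasta(prefixed)):
        print(f">{seq_id}\n{protein}")
```

Whether the real script keeps the full frame or only the longest stop-free segment is not stated here, so that choice should be checked against NPEGC.6tr_frame_selection_clustering.sh before reusing the sketch.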
See NPEGC.6tr_frame_selection_clustering.sh for full code description.

Example of six-frame translation using transeq:

    transeq -auto -sformat pearson -frame 6 -sequence 6tr/${PREFIX}.Trinity.fasta -outseq 6tr/${PREFIX}.Trinity.6tr.fasta

Frame selection

We use a custom frame-selection Python script, keep_longest_frame.py, to determine the longest coding length in each open reading frame and retain this sequence (or multiple sequences if there is a tie) for downstream analyses. See NPEGC.6tr_frame_selection_clustering.sh for full code description.

Clustering by sequence identity

To reduce redundancy from identical and near-identical sequences, we cluster protein sequences at the 99% sequence identity level and retain the representative sequence of each cluster in a reduced-size FASTA output file. See NPEGC.6tr_frame_selection_clustering.sh for full code description of linclust/mmseqs clustering.

Sample of linclust clustering script: core mmseqs function

    function NPEGC_linclust {
        # make an index of the fasta file:
        $MMSEQS_DIR/mmseqs createdb $FASTA_PATH/$FASTA_FILE NPac.$STUDY.bf100.db
        # cluster sequences at $MIN_SEQ_ID
        $MMSEQS_DIR/mmseqs linclust NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac_tmp --min-seq-id ${MIN_SEQ_ID}
        # retrieve cluster representatives:
        $MMSEQS_DIR/mmseqs result2repseq NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.db NPac.${STUDY}.clusters.rep
        # generate flat FASTA output with cluster reps
        $MMSEQS_DIR/mmseqs result2flat NPac.${STUDY}.bf100.db NPac.${STUDY}.bf100.db NPac.${STUDY}.clusters.rep NPac.${STUDY}.bf100.id99.fasta --use-fasta-header
    }

Corresponding files uploaded to this repository: Gzip-compressed FASTA files after translation, frame-selection, and clustering at 99% sequence identity (*.bf100.id99.aa.fasta.gz)
- NPac.G1PA.bf100.id99.aa.fasta.gz
- NPac.G2PA.bf100.id99.aa.fasta.gz
- NPac.G3PA.bf100.id99.aa.fasta.gz
- NPac.G3PA_diel.bf100.id99.aa.fasta.gz
- NPac.D1PA.bf100.id99.aa.fasta.gz

MarFERReT + MARMICRODB taxonomic annotation with DIAMOND

Taxonomy was inferred for the NPEGC metatranscripts with the DIAMOND fast read alignment software against the MarFERReT v1.1 + MARMICRODB v1.0 multi-kingdom marine reference protein sequence library, a combined database of the MarFERReT v1.1 marine microbial eukaryote sequence library and the MARMICRODB v1.0 prokaryote-focused marine genome database. See NPEGC.diamond_taxonomy.log.sh for full description of DIAMOND annotation.

Excerpt of core DIAMOND function:

    function NPEGC_diamond {
        # FASTA filename for $STUDY
        FASTER_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
        # Output filename for LCA results in lca.tab file:
        LCA_TAB="NPac.${STUDY}.MarFERReT_v1.1_MMDB.lca.tab"
        echo "Beginning ${STUDY}"
        singularity exec --no-home --bind ${DATA_DIR} \
            "${CONTAINER_DIR}/diamond.sif" diamond blastp \
            -c 4 --threads $N_THREADS \
            --db $MFT_MMDB_DMND_DB -e $EVALUE --top 10 -f 102 \
            --memory-limit 110 \
            --query ${FASTER_FASTA} -o ${LCA_TAB} >> "${STUDY}.MarFERReT_v1.1_MMDB.log" 2>&1
    }

Corresponding files uploaded to this repository: Gzip-compressed DIAMOND lowest common ancestor predictions with NCBI Taxonomy against a combined MarFERReT + MARMICRODB taxonomic library (*.MarFERReT_v1.1_MMDB.lca.tab.gz)
- NPac.G1PA.MarFERReT_v1.1_MMDB.lca.tab.gz
- NPac.G2PA.MarFERReT_v1.1_MMDB.lca.tab.gz
- NPac.G3PA.MarFERReT_v1.1_MMDB.lca.tab.gz
- NPac.G3PA_diel.MarFERReT_v1.1_MMDB.lca.tab.gz
- NPac.D1PA.MarFERReT_v1.1_MMDB.lca.tab.gz

Pfam 35.0 functional annotation using HMMER3

Clustered protein sequences were annotated against the Pfam 35.0 collection of 19,179 protein family Hidden Markov Models (HMMs) using HMMER 3.3.
Pfam annotation code is documented here: NPEGC.hmmer_function.sh

Excerpt of core hmmsearch function:

    function NPEGC_hmmer {
        # Define input FASTA
        INPUT_FASTA="NPac.${STUDY}.bf100.id99.aa.fasta"
        # hmmsearch call:
        hmmsearch --cut_tc --cpu $NCORES --domtblout $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab $HMM_PROFILE ${INPUT_FASTA}
        # compress output file:
        gzip $ANNOTATION_DIR/${STUDY}.Pfam35.domtblout.tab
    }

Corresponding files uploaded to this repository: Gzip-compressed hmmsearch domain table files for Pfam35 queries (*.Pfam35.domtblout.tab.gz)
- G1PA.Pfam35.domtblout.tab.gz
- G2PA.Pfam35.domtblout.tab.gz
- G3PA.Pfam35.domtblout.tab.gz
- G3PA_diel.Pfam35.domtblout.tab.gz
- D1PA.Pfam35.domtblout.tab.gz
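For readers who want to combine the two annotation outputs described above, here is a minimal Python sketch that joins DIAMOND LCA results with Pfam domain hits by protein sequence ID. It assumes the lca.tab files follow DIAMOND's `-f 102` layout (query ID, NCBI taxid, e-value, tab-separated, with taxid 0 meaning unclassified) and that the domtblout files follow the standard HMMER layout, where with hmmsearch the target name (column 1) is the protein sequence and the query accession (column 5) is the Pfam accession. The function names and the toy records are hypothetical and are not part of the NPEGC code.

```python
import csv
import io
from collections import defaultdict


def read_lca(handle):
    """Map sequence ID -> NCBI taxid from DIAMOND '-f 102' output."""
    taxid_by_seq = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        taxid_by_seq[row[0]] = int(row[1])
    return taxid_by_seq


def read_domtblout(handle):
    """Map sequence ID -> set of Pfam accessions from hmmsearch --domtblout."""
    pfams_by_seq = defaultdict(set)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # comment or empty line
        # 22 whitespace-delimited columns, then a free-text description.
        fields = line.split(None, 22)
        target_name, query_accession = fields[0], fields[4]
        pfams_by_seq[target_name].add(query_accession)
    return pfams_by_seq


def combine_annotations(lca_handle, domtblout_handle):
    """Return {seq_id: (taxid, sorted Pfam accessions)} for all annotated IDs.

    Sequences without a DIAMOND record get taxid 0; sequences without a
    Pfam hit get an empty list.
    """
    taxids = read_lca(lca_handle)
    pfams = read_domtblout(domtblout_handle)
    seq_ids = sorted(set(taxids) | set(pfams))
    return {s: (taxids.get(s, 0), sorted(pfams.get(s, ()))) for s in seq_ids}


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    lca = io.StringIO("seqA\t2864\t1e-50\nseqB\t0\t0\n")
    dom = io.StringIO(
        "# toy domtblout\n"
        "seqA - 250 Pkinase PF00069.28 264 1e-40 140.0 0.1 1 1 "
        "1e-43 1e-40 139.5 0.1 1 260 5 240 3 245 0.95 toy description\n"
    )
    for seq_id, (taxid, accessions) in combine_annotations(lca, dom).items():
        print(seq_id, taxid, ",".join(accessions) or "-")
```

Because the uploaded files are gzip-compressed, the handles would in practice come from something like `gzip.open(path, "rt")`. Note also that the DIAMOND files carry an `NPac.` prefix in their names while the Pfam files do not, so file pairing needs to account for that.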

Related Organizations

University of Mary
United States

Related research products (4):

- The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3 (2023, IsDerivedFrom)
- The North Pacific Eukaryotic Gene Catalog: clustered nucleotide metatranscripts and read counts (2024, IsSupplementedBy)
- Gradients 1-3 polyA-selected transcripts per million, Gradients 3 depth profile polyA-selected processed metatranscriptomes (2024, IsContinuedBy)
- The North Pacific Eukaryotic Gene Catalog: clustered nucleotide metatranscripts and read counts (2024, IsSupplementedBy)

Impact by BIP!

- Citations: 1 (an alternative to the "Influence" indicator, which also reflects the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
- Popularity: Average (the current attention an article receives in the research community at large, based on the underlying citation network)
- Influence: Average (the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
- Impulse: Average (the initial momentum of an article directly after its publication, based on the underlying citation network)
