ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO
ZENODO
Dataset . 2025
License: CC BY
Data sources: Datacite

Deep-learning-based annotation of 230 superasterid genomes reveals a harmonized dataset of 91,366 NLRs (v250214_91366)

Authors: Toghani, AmirAli; Sugihara, Yu; Kamoun, Sophien


Abstract

Plant nucleotide-binding leucine-rich repeat receptors (NLRs) are intracellular immune receptors crucial for pathogen recognition and immune responses. Despite their importance, NLRs are often challenging to annotate and frequently overlooked by standard annotation pipelines. To address the variability in NLR annotation accuracy across pipelines, we performed a harmonized de novo annotation of 230 high-quality superasterid genomes using the deep learning-based software Helixer (Holst et al. 2023), resulting in the annotation of 10,124,265 protein sequences. Additionally, we employed NLRtracker, which leverages InterProScan for domain identification, to detect NLR and NLR-associated sequences (Kourelis et al. 2021, Blum et al. 2025). Using the NLR definition from the RefPlantNLR dataset, we identified 91,366 NLRs, with counts ranging from 12 and 19 in the parasitic plants Cuscuta campestris and Orobanche coerulescens, respectively, to 2,804 in Solanum tuberosum (potato). Beyond NLR annotation, we provide genome annotations, including proteomes, coding nucleotide sequences (CDS), and GFF files generated by Helixer. This dataset offers a valuable resource for standardized comparative genomics and evolutionary studies across superasterids.

Available at Dryad: https://doi.org/10.5061/dryad.sxksn03d6

Methods

Helixer v0.3.2 (Stiehler et al. 2020; Holst et al. 2023) was executed using Singularity on genome FASTA files with the option '--lineage land_plant', which applies the default model (land_plant_v0.3_a_0080.h5) for land plants. Coding DNA sequences (CDS) and protein FASTA files were extracted from the output GFF files using GffRead v0.12.7 (Pertea and Pertea 2020) with the '-x' and '-y' options, respectively. The extracted protein sequences were then analyzed using NLRtracker (Kourelis et al. 2021), which integrates InterProScan v5.65-97.0 (Jones et al. 2014). BUSCO scores were generated using BUSCO v5.5.0 with the options '-m protein --lineage_dataset viridiplantae_odb10' (Manni et al. 2021).

Helixer output legend

Genome annotations are organized by phylogenetic order, following APG IV (The Angiosperm Phylogeny Group et al. 2016). Each order has its own subdirectory containing genome assembly FASTA, GFF annotations, CDS FASTA, protein FASTA, and NLRtracker output files. Additionally, two files containing compiled proteomes and CDS FASTA files with source assembly tags are provided.

NLRtracker output legend

File extension                 Description
*_NLRtracker.tsv               NLRtracker overview output with gene status.
*_NLR.lst                      Identifier list of NLRs.
*_NLR.gff3                     NLR annotation of motifs, domains, and regions in GFF3 format.
*_NLR.fasta                    NLR FASTA sequences.
*_NLR-associated.lst           Identifier list of NLR-associated genes.
*_NLR-associated.gff3          NLR-associated gene annotation of motifs, domains, and regions in GFF3 format.
*_NLR_associated.fasta         NLR-associated gene FASTA sequences.
*_NBARC.fasta                  NB-ARC domain FASTA sequences.
*_NBARC_deduplictated.fasta    Deduplicated NB-ARC domain FASTA sequences.
*_iTOL.txt                     Domain annotation file for iTOL.
*_iTOL_dedup.txt               Domain annotation file of the deduplicated sequences for iTOL.
*_Domains.tsv                  Full-length and domain sequences and metadata for all NLRtracker output.
interpro_result.gff            InterProScan output of the query proteome.

Recommended decompressing method for NLRtracker output files: "tar -xzvf"

Supplementary Data

Data S1. Species list and metadata.
Data S2. Per-genome sequence number statistics table for proteomes, total NLRs, and putative NLR types determined by NLRtracker, and proteome BUSCO scores.
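The annotation steps described in the Methods can be sketched as command lines. The following Python snippet only assembles the argument lists and does not run anything. The options '--lineage land_plant', '-x', '-y', and '-m protein --lineage_dataset viridiplantae_odb10' come from the Methods; the input/output flags, the file names, and the function name annotation_commands are assumptions added for illustration. The Singularity wrapper and the NLRtracker call are left out because the record does not give the container image or the exact invocation.

```python
import shlex


def annotation_commands(genome_fasta: str, prefix: str) -> list:
    """Build (but do not execute) argument lists for Helixer, GffRead and BUSCO.

    File names derived from `prefix` are hypothetical placeholders.
    """
    gff = f"{prefix}.gff3"
    cds = f"{prefix}_cds.fasta"
    proteins = f"{prefix}_proteins.fasta"

    helixer = [
        "Helixer.py",
        "--fasta-path", genome_fasta,     # assumed input flag
        "--lineage", "land_plant",        # option stated in the Methods
        "--gff-output-path", gff,         # assumed output flag
    ]
    gffread = [
        "gffread", gff,
        "-g", genome_fasta,               # genome needed to extract sequences
        "-x", cds,                        # CDS FASTA, as in the Methods
        "-y", proteins,                   # protein FASTA, as in the Methods
    ]
    busco = [
        "busco",
        "-i", proteins,
        "-m", "protein",                  # options stated in the Methods
        "--lineage_dataset", "viridiplantae_odb10",
        "-o", f"{prefix}_busco",          # assumed output name
    ]
    return [helixer, gffread, busco]


if __name__ == "__main__":
    for cmd in annotation_commands("genome.fasta", "sample"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them against the installed tool versions before running them, for example through subprocess.run inside the container.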
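As an illustration of how the NLRtracker identifier lists might be used, the sketch below subsets a protein FASTA to the entries named in an identifier list such as *_NLR.lst. It assumes one identifier per line in the list and that each FASTA record is identified by the first whitespace-delimited token of its header; neither is confirmed by the record. The function names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect non-empty identifiers, assuming one per line."""
    return {line.strip() for line in lines if line.strip()}


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Identifier = first token of the header (assumption).
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def subset_fasta(fasta_lines: Iterable[str], ids: Set[str]) -> Dict[str, str]:
    """Keep only the records whose identifier appears in `ids`."""
    return {name: seq for name, seq in parse_fasta(fasta_lines) if name in ids}


if __name__ == "__main__":
    # Toy data standing in for an identifier list and a proteome FASTA.
    toy_ids = read_ids(["geneA.1\n", "geneC.1\n"])
    toy_fasta = [">geneA.1 example\n", "MKT\n", "LLV\n", ">geneB.1\n", "MAA\n", ">geneC.1\n", "MSS\n"]
    for name, seq in subset_fasta(toy_fasta, toy_ids).items():
        print(name, len(seq))
```

With real files, open file handles can be passed in place of the toy lists, since both functions accept any iterable of lines.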
