
Background: Molecular profiling using high-throughput ’omics technologies has tremendously increased our ability to interrogate complex microbial communities at the molecular level. In the context of data reuse, the FAIRification of these extensive datasets is frequently treated as a secondary administrative task, addressed only after data analysis has been completed. However, this approach overlooks the benefits of early metadata integration, as the procedures for processing and analyzing raw data are primarily dictated by the underlying research design and experimental conditions. Gathering interoperable research metadata at the earliest stages creates a standardized basis for managing, processing, and analyzing data, enabling more efficient and reproducible FAIR workflows.

Results: The single containment principle was used to develop modular, containerized, reproducible workflows that support the FAIR principles for research software by systematically capturing standardized metadata for each data-processing step along with the resulting data products. Using defined mock metagenomic datasets as an example, we show that interoperable research metadata can drive such computational workflows. By processing raw data accordingly, machine-actionable provenance chains are created that enhance the reproducibility and reusability of the resulting data products.

Conclusions: A seamless integration of wet-lab experiments with computational investigations is essential for a FAIR end-to-end research process. Metadata-managed workflows prevent unnecessary data manipulation. Workflow provenance registration explicates the complex multi-step methods employed for data processing and analysis. Combining FAIR principles with data provenance registration enhances the reusability of omics datasets by promoting transparency and reproducibility.
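The modular, containerized workflow steps described above can be sketched as a CWL tool definition. The example below is illustrative only: the tool choice (fastp), container tag, and parameter names are assumptions, not taken from the actual workflow repository.

```yaml
#!/usr/bin/env cwl-runner
# Illustrative CWL CommandLineTool: one containerized, self-contained
# data-processing step (here, read quality trimming with fastp).
# Tool, container tag, and file names are hypothetical examples.
cwlVersion: v1.2
class: CommandLineTool
label: "Read quality trimming (illustrative example)"
requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/fastp:0.23.4--h5f740d0_0
baseCommand: [fastp]
inputs:
  forward_reads:
    type: File
    inputBinding:
      prefix: --in1
  reverse_reads:
    type: File
    inputBinding:
      prefix: --in2
arguments: ["--out1", "trimmed_1.fastq.gz", "--out2", "trimmed_2.fastq.gz"]
outputs:
  trimmed_forward:
    type: File
    outputBinding:
      glob: trimmed_1.fastq.gz
  trimmed_reverse:
    type: File
    outputBinding:
      glob: trimmed_2.fastq.gz
```

Because each step declares its container, inputs, and outputs explicitly, a workflow engine can record standardized metadata for every execution, which is what makes the downstream provenance chains machine-actionable.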
Data Availability

The datasets supporting the results of this article are available in the following repositories:

Test datasets: The mock community datasets (BMOCK12 and ZYMO) used for validation are available from their original publications [30,31].

Supplementary data files: The following supplementary files are deposited in the Zenodo repository [48] and are also included with this article:

Supplementary File S1: FAIR-DS experimental metadata in RDF/Turtle format, including ISA model structure and MIxS-compliant metadata for all mock communities.
Supplementary File S2: FAIR-DS experimental metadata in Excel format for human-readable access.
Supplementary File S3: MIMAG/MIxS-compliant metadata reports for all MAGs, including completeness, contamination, and taxonomic classification.
Supplementary File S4: CWL tool definition configuration files (YAML format) for all workflow runs.
Supplementary File S5: SPARQL query templates for extracting operational and quality metrics from GraphDB.
Supplementary File S6: Complete operational metadata for all workflow runs, including runtime statistics and tool execution times.
Supplementary File S7: Raw ANI matrices (pairwise values) for all three datasets.
Supplementary File S8: Complete workflow provenance data in RDF/Turtle format (PROV-O/CWLProv compliant). Filenames: ZYMO_EVEN_PROVENANCE.trig.gz, ZYMO_LOG_PROVENANCE.trig.gz, BMOCK12_PROVENANCE.trig.gz.
Supplementary File S9: Functional annotation data in RDF/Turtle format (GBOL ontology). Filenames: ZYMO_LOG_FUNCTIONAL_ANALYSIS.trig.gz, ZYMO_EVEN_FUNCTIONAL_ANALYSIS.trig.gz, BMOCK12_FUNCTIONAL_ANALYSIS.trig.gz.
Supplementary File S10: GBOL data model schema in Mermaid format.
Supplementary File S11: GBOL data model schema in ShEx (Shape Expressions) format.
Supplementary File S12: Binning reproducibility analysis for Bacillus subtilis in the ZYMO-LOG dataset, showing contig count variations across assembly strategies and replicate runs with SemiBin2.
Supplementary Figures S1–S3: ANI heatmaps for the ZYMO-EVEN, ZYMO-LOG, and BMOCK12 datasets.
Supplementary Figure S4: GBOL schema class diagram illustrating the structure of functional annotation data.

The RDF datasets (Supplementary Files S8 and S9) can be loaded into any RDF-compatible triple store and queried using standard SPARQL tools. Example SPARQL queries are provided in Supplementary File S5. The RDF data use standard ontologies (PROV-O [42], CWLProv [29], and GBOL), ensuring interoperability and enabling integration with other FAIR-compliant datasets. The complete GBOL data model schema is provided in Supplementary Files S10 and S11 and visualized in Supplementary Figure S4.

Workflow code and analysis notebooks: The workflow source code and Jupyter notebooks used for data analysis, figure generation, and table preparation are available on GitLab at: https://git.wur.nl/unlock/projects/FAIRwf4MicrobialCommunity

Workflows: The workflow definitions are available on WorkflowHub [49], and their source code is hosted on GitLab at: https://gitlab.com/m-unlock/cwl
Keywords: provenance chain, FAIR4RS, CWL, microbial community, Common Workflow Language, omics
