Provenance and data differencing for workflow reproducibility analysis

Publication: Article, Preprint, Other literature type
Date: 30 Apr 2013 (embargo end date: 01 Jan 2014)
Country: United Kingdom
Language: English
Publisher: Wiley
Journal: Concurrency and Computation: Practice and Experience, volume 28, pages 995-1015 (ISSN: 1532-0626, eISSN: 1532-0634)

Authors: Paolo Missier; Simon Woodman; Hugo Hiden; Paul Watson

doi: 10.1002/cpe.3035 , 10.48550/arxiv.1406.0905

arXiv: 1406.0905

Abstract

One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e‐Science, services, often choreographed through workflows, process data to generate results. Reproducing results is often not straightforward, as the computational objects may not be made available or may have been updated since the results were generated. For example, services are often updated to fix bugs or improve algorithms. This paper addresses these problems in three ways. Firstly, it introduces a new framework to clarify the range of meanings of ‘reproducibility’. Secondly, it describes a new algorithm, PDIFF, that compares workflow provenance traces to determine whether an experiment has been reproduced; the main innovation is that, if it has not, the specific point(s) of divergence are identified through graph analysis, assisting any researcher wishing to understand those differences. One key feature is support for user‐defined, semantic data comparison operators. Finally, the paper describes an implementation of PDIFF that leverages the e‐Science Central platform, which enacts workflows in the cloud. As well as automatically generating a provenance trace for consumption by PDIFF, the platform supports the storage and reuse of old versions of workflows, data and services; the paper shows how this can be exploited to achieve reproduction and reuse. Copyright © 2013 John Wiley & Sons, Ltd.
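The trace-comparison idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's PDIFF algorithm, whose details are not given on this page. It only shows the general technique of walking two provenance graphs backwards from an output and reporting where they start to differ. The trace layout (`Step`, `Trace`), the function `find_divergence`, the comparator `approx_equal`, and the example service names are all hypothetical and introduced here for illustration only.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    value: Any          # data item recorded at this node of the trace
    service: str        # hypothetical "name@version" label of the producing service
    inputs: List[str]   # ids of the data nodes this item was derived from


Trace = Dict[str, Step]
Comparator = Callable[[Any, Any], bool]


def approx_equal(x: Any, y: Any, tol: float = 1e-6) -> bool:
    """Example of a user-defined comparison: numbers match within a tolerance."""
    if isinstance(x, list) and isinstance(y, list):
        return len(x) == len(y) and all(approx_equal(p, q, tol) for p, q in zip(x, y))
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) <= tol
    return x == y


def find_divergence(a: Trace, b: Trace, node: str, same: Comparator) -> List[str]:
    """Walk both traces backwards from `node`; return the points where they diverge."""
    sa, sb = a[node], b[node]
    if same(sa.value, sb.value):
        return []  # equivalent data: nothing upstream needs explaining
    reasons: List[str] = []
    if sa.service != sb.service:
        reasons.append(f"{node}: service changed ({sa.service} -> {sb.service})")
    if sa.inputs != sb.inputs:
        # The graphs have different shapes here, so stop descending.
        reasons.append(f"{node}: different inputs ({sa.inputs} vs {sb.inputs})")
        return reasons
    for parent in sa.inputs:
        reasons.extend(find_divergence(a, b, parent, same))
    if not reasons:
        reasons.append(f"{node}: same service and inputs, different output")
    return list(dict.fromkeys(reasons))  # drop duplicates from shared ancestors


# Two made-up runs of the same workflow; the second uses a newer "cleaner" service.
run1: Trace = {
    "raw": Step([1.0, 2.0, 3.0], "input", []),
    "clean": Step([1.0, 2.0, 3.0], "cleaner@1", ["raw"]),
    "mean": Step(2.0, "averager@1", ["clean"]),
}
run2: Trace = {
    "raw": Step([1.0, 2.0, 3.0], "input", []),
    "clean": Step([1.0, 2.0], "cleaner@2", ["raw"]),
    "mean": Step(1.5, "averager@1", ["clean"]),
}

print(find_divergence(run1, run2, "mean", approx_equal))
```

In this made-up example the final outputs differ, and the walk attributes the difference to the changed version of the hypothetical `cleaner` service rather than to the averaging step. Swapping `approx_equal` for another function changes what counts as "the same" data, which is the role the abstract assigns to user-defined comparison operators.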

Country

United Kingdom

Related Organizations

Newcastle University
United Kingdom

Keywords

FOS: Computer and information sciences, Computer Science - Databases, Databases (cs.DB)

Impact by BIP!

selected citations (31): These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity (Top 10%): This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence (Top 10%): This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse (Top 10%): This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open Access routes: Green, bronze

Fields of Science

medical and health sciences

basic medicine