Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of existing provenance tracking solutions is that they cannot capture and integrate the provenance of domain and ML data processed across the multiple workflows in the lifecycle while keeping the capture overhead low. To address this problem, this paper contributes (i) a detailed characterization of provenance data in the ML lifecycle in CSE; (ii) a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and (iii) extensions to a system that tracks provenance from multiple workflows, which address the characteristics of ML and CSE and allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the Oil and Gas industry, along with its evaluation using 48 GPUs in parallel.
10 pages, 7 figures, Accepted at Workflows in Support of Large-scale Science (WORKS) co-located with the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2019, Denver, Colorado
I.2, FOS: Computer and information sciences, Computer Science - Machine Learning, J.2, C.4, H.2, 65Y05, 68P15, Databases (cs.DB), I.2; H.2; C.4; J.2, Machine Learning (cs.LG), Computational Science and Engineering, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Databases, [INFO.INFO-DB] Computer Science [cs]/Databases [cs.DB], Workflow Provenance, Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning Lifecycle