Checkpointing Workflows for Fail-Stop Errors

Article, Conference object, Report English OPEN
Han, Li; Canon, Louis-Claude; Casanova, Henri; Robert, Yves; Vivien, Frédéric;
  • Publisher: Institute of Electrical and Electronics Engineers
  • Related identifiers: doi: 10.1109/TC.2018.2801300
  • Subject: checkpoint | [INFO] Computer Science [cs] | workflow | [INFO.INFO-MO] Computer Science [cs]/Modeling and Simulation | [INFO.INFO-DS] Computer Science [cs]/Data Structures and Algorithms [cs.DS] | [INFO.INFO-ET] Computer Science [cs]/Emerging Technologies [cs.ET] | [INFO.INFO-CR] Computer Science [cs]/Cryptography and Security [cs.CR] | fail-stop error | [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC] | [INFO.INFO-PF] Computer Science [cs]/Performance [cs.PF] | [INFO.INFO-MA] Computer Science [cs]/Multiagent Systems [cs.MA] | [INFO.INFO-IU] Computer Science [cs]/Ubiquitous Computing | [INFO.INFO-SE] Computer Science [cs]/Software Engineering [cs.SE] | resilience

International audience; We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall e...
  • Related Research Results (1)
    Inferred by OpenAIRE
    Checkpointing Workflows for Fail-Stop Errors-code (2017)