Powered by OpenAIRE graph
ZENODO
Conference object . 2020
License: CC BY
Data sources: Datacite, ZENODO

Fault tolerance in the distributed execution of task graphs (Tolérance aux pannes dans l'exécution distribuée de graphes de tâches)

Authors: Lion, Romain

Abstract

The largest supercomputers bring together an ever-growing number of compute units, which raises the failure rate accordingly. Checkpoint/restart methods have been proposed so that the loss of an entire node does not force the application to restart from the beginning. These methods are, however, generally transparent and do not exploit what is known about the application's behavior. Conversely, the task-graph programming paradigm offers an opportunity to design much more judicious checkpoint/restart methods. We therefore propose an approach that saves only the useful data, consistently with the application's communications, supports local restart, and exposes a simple programming interface integrated into task-graph programming.
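To make the idea of saving only the useful data concrete, here is a minimal single-process sketch in Python. It is not the author's implementation and does not model MPI, StarPU, or multiple nodes: the task list, the `live_handles` helper, and the `run` function are hypothetical names introduced only for illustration. The sketch assumes a linear sequence of tasks, a single checkpoint taken before the simulated failure, and a restart that re-executes only the tasks submitted after that checkpoint.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical task description: (name, function, input handles, output handle).
Task = Tuple[str, Callable[..., int], List[str], str]


def live_handles(tasks: List[Task], next_index: int, data: Dict[str, int]) -> Set[str]:
    """Handles currently held that some task at or after next_index still reads."""
    needed: Set[str] = set()
    for _, _, inputs, _ in tasks[next_index:]:
        needed.update(inputs)
    return needed & set(data)


def run(tasks: List[Task], data: Dict[str, int], checkpoint_at: int,
        fail_at: Optional[int] = None):
    """Execute tasks in order, checkpoint once, and recover from one simulated failure.

    Assumes fail_at, when given, is greater than or equal to checkpoint_at.
    """
    data = dict(data)
    saved = None   # (index of the next task, copy of the live data)
    log = []       # names of executed tasks, re-executions included
    i = 0
    while i < len(tasks):
        if i == checkpoint_at and saved is None:
            live = live_handles(tasks, i, data)
            saved = (i, {h: data[h] for h in live})   # dead data is not saved
        if i == fail_at:
            fail_at = None             # the simulated failure happens only once
            i, snapshot = saved
            data = dict(snapshot)      # anything outside the checkpoint is lost
            continue
        name, func, inputs, output = tasks[i]
        data[output] = func(*(data[h] for h in inputs))
        log.append(name)
        i += 1
    return data, log, saved


tasks: List[Task] = [
    ("t0", lambda x: x + 1, ["x"], "a"),
    ("t1", lambda a: a * 2, ["a"], "b"),
    ("t2", lambda b: b + 10, ["b"], "c"),
    ("t3", lambda c: c * c, ["c"], "d"),
]

# Checkpoint before t2, then lose all in-memory data just before t3 runs.
final, log, saved = run(tasks, {"x": 1}, checkpoint_at=2, fail_at=3)
```

In this run the checkpoint holds only `b`, because `x` and `a` are no longer read by any remaining task, and recovery re-executes `t2` alone before continuing with `t3`. A distributed runtime would additionally have to keep the checkpoint consistent with messages exchanged between nodes, which this sketch leaves out.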

https://hal.inria.fr/hal-02296118

Country
France
Keywords

Runtime system, Fault tolerance, MPI, Task graph, [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]

  • Impact (BIP!): citations 0; popularity Average; influence Average; impulse Average
  • Usage (OpenAIRE UsageCounts): 117 views; 45 downloads
Funded by
EC| EXA2PRO
Project
EXA2PRO
Enhancing Programmability and boosting Performance Portability for Exascale Computing Systems
  • Funder: European Commission (EC)
  • Project Code: 801015
  • Funding stream: H2020 | RIA