D4.2 – Report on algorithms for exascale robustness (fault tolerance and large-scale communications) in QMC flagship codes
We expect exascale machines to enable quantum Monte Carlo (QMC) simulations of larger systems than those that can be treated today, that is, systems with more electrons and/or larger Configuration Interaction (CI) expansions. In this Work Package (WP), we investigate ways to overcome the new difficulties that will arise when running exascale simulations.
Exascale machines will often be used to run simulations that cannot run on smaller machines. The computed data will therefore be particularly valuable to users, and it must not be lost by accident during the simulation. In addition, an exascale machine is such a complex combination of hardware and software that system failures cannot reasonably be neglected when designing software for it. The first section of this document discusses the different strategies used to make simulations robust to system failures.