Talk – Mar 11: Fault-Tolerant Algorithms and Frameworks for Extreme-Scale Computing

Fault-Tolerant Algorithms and Frameworks for Extreme-Scale Computing

Speaker: Linda Stals, Australian National University

Time: Wednesday, March 11, 2020, 14:15
Room: 01.150-128 – Seminarraum (Cauerstraße 11)

Abstract

On future extreme-scale computers, faults will become increasingly common as the number of individual components grows without a compensating improvement in per-component reliability. Achieving resilience is expensive because it inevitably requires redundancy, and thus more system resources and additional energy. Traditional checkpointing techniques regularly collect data from all compute nodes and store it in backup memory, but this will be too expensive and too slow for extreme-scale computing.