Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods
Thesis title in Czech: | Detekce a oprava takzvaných bitových chyb v metodách pipelinových Krylovových podprostorů |
---|---|
Thesis title in English: | Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods |
Key words: | fault tolerance|iterative methods|computer science|numerical mathematics|errors|algorithms|high-performance computing|matrix computations|Krylov subspace methods |
English key words: | fault tolerance|iterative methods|computer science|numerical mathematics|errors|algorithms|high-performance computing|matrix computations|Krylov subspace methods |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | diploma thesis |
Thesis language: | angličtina |
Department: | Department of Numerical Mathematics (32-KNM) |
Supervisor: | Erin Claire Carson, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 14.04.2023 |
Date of assignment: | 26.04.2023 |
Confirmed by Study dept. on: | 03.05.2023 |
Date and time of defence: | 13.06.2024 09:00 |
Date of electronic submission: | 27.04.2024 |
Date of submission of printed version: | 27.04.2024 |
Date of proceeded defence: | 13.06.2024 |
Opponents: | prof. Ing. Miroslav Tůma, CSc. |
Advisors: | doc. RNDr. Petr Tichý, Ph.D. |
Guidelines |
The work involves developing fault-tolerant algorithms based on predict-and-recompute variants of pipelined Krylov subspace methods for solving linear systems Ax=b. The insight is that by monitoring the difference between the "predicted" and "recomputed" values of a certain quantity, one can potentially determine whether or not a silent error (e.g., a bit flip) has likely occurred. This project will involve 1) developing a method that detects when a silent error has likely occurred, 2) developing a method for correcting these errors and continuing execution of the method, and 3) writing MATLAB (or other high-level language, such as Python) implementations which simulate the injection of silent errors in predict-and-recompute pipelined Krylov subspace methods in order to evaluate the effectiveness of the developed detection and correction strategies.
Broad questions to answer via numerical experiments: + How reliably does your method correctly predict whether or not an error has occurred (rate of false positives and false negatives)? + What is the cost (overhead) of a false positive? If there is a false negative, is convergence destroyed, or can the method recover? The work will also involve doing a review of the literature on this topic. |
References |
* Tyler Chen and Erin Carson. Predict-and-recompute conjugate gradient variants. SIAM Journal on Scientific Computing, vol. 42, no. 5, pp. A3084-A3108, 2020.
* Gérard Meurant. Detection and correction of silent errors in the conjugate gradient algorithm. Numerical Algorithms, pp. 1-23, 2022. * Gérard Meurant. Multitasking the conjugate gradient method on the Cray X-MP/48, Parallel Comput., vol. 5, pp. 267--280, 1987. * Zizhong Chen. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In ACM SIGPLAN Notices, vol. 48, no. 8, pp. 167-176. ACM, 2013. * Mark Frederick Hoemmen, Michael Allen Heroux, Kurt Brian Ferreira, and Patrick G. Bridges. Fault-tolerant iterative methods via selective reliability. No. SAND2011-8603C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2011. * Daniel L. Boley, Richard P. Brent, Gene H. Golub, and Franklin T. Luk. Algorithmic fault tolerance using the Lanczos method. SIAM Journal on Matrix Analysis and Applications 13, no. 1 (1992): 312-332. |
Preliminary scope of work |
Error handling and fault tolerance is a hot topic in numerical mathematics, as on large, exascale-sized machines, the "Mean Time to Failure" is expected to be close to 0. Thus we want to investigate how a randomly injected error (say, a single bit flip) affects the execution of iterative methods for solving linear systems and how this can be remedied in an inexpensive way within the algorithms themselves. |
Preliminary scope of work in English |
Error handling and fault tolerance is a hot topic in numerical mathematics, as on large, exascale-sized machines, the "Mean Time to Failure" is expected to be close to 0. Thus we want to investigate how a randomly injected error (say, a single bit flip) affects the execution of iterative methods for solving linear systems and how this can be remedied in an inexpensive way within the algorithms themselves. |