Thesis (Selection of subject)

Your browser does not support JavaScript, or its support is disabled. Some features may not be available.

Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods

Thesis title in Czech:	Detekce a oprava takzvaných bitových chyb v metodách pipelinových Krylovových podprostorů
Thesis title in English:	Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods
Key words:	fault tolerance\|iterative methods\|computer science\|numerical mathematics\|errors\|algorithms\|high-performance computing\|matrix computations\|Krylov subspace methods
English key words:	fault tolerance\|iterative methods\|computer science\|numerical mathematics\|errors\|algorithms\|high-performance computing\|matrix computations\|Krylov subspace methods
Academic year of topic announcement:	2022/2023
Thesis type:	diploma thesis
Thesis language:	angličtina
Department:	Department of Numerical Mathematics (32-KNM)
Supervisor:	Erin Claire Carson, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	14.04.2023
Date of assignment:	26.04.2023
Confirmed by Study dept. on:	03.05.2023
Date and time of defence:	13.06.2024 09:00
Date of electronic submission:	27.04.2024
Date of submission of printed version:	27.04.2024
Date of proceeded defence:	13.06.2024
Opponents:	prof. Ing. Miroslav Tůma, CSc.



Advisors:	doc. RNDr. Petr Tichý, Ph.D.

Guidelines

The work involves developing fault-tolerant algorithms based on predict-and-recompute variants of pipelined Krylov subspace methods for solving linear systems Ax=b. The insight is that by monitoring the difference between the "predicted" and "recomputed" values of a certain quantity, one can potentially determine whether or not a silent error (e.g., a bit flip) has likely occurred. This project will involve 1) developing a method that detects when a silent error has likely occurred, 2) developing a method for correcting these errors and continuing execution of the method, and 3) writing MATLAB (or other high-level language, such as Python) implementations which simulate the injection of silent errors in predict-and-recompute pipelined Krylov subspace methods in order to evaluate the effectiveness of the developed detection and correction strategies.

Broad questions to answer via numerical experiments:
+ How reliably does your method correctly predict whether or not an error has occurred (rate of false positives and false negatives)?
+ What is the cost (overhead) of a false positive? If there is a false negative, is convergence destroyed, or can the method recover?

The work will also involve doing a review of the literature on this topic.

References

* Tyler Chen and Erin Carson. Predict-and-recompute conjugate gradient variants. SIAM Journal on Scientific Computing, vol. 42, no. 5, pp. A3084-A3108, 2020.
* Gérard Meurant. Detection and correction of silent errors in the conjugate gradient algorithm. Numerical Algorithms, pp. 1-23, 2022.
* Gérard Meurant. Multitasking the conjugate gradient method on the Cray X-MP/48, Parallel
Comput., vol. 5, pp. 267--280, 1987.
* Zizhong Chen. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In ACM SIGPLAN Notices, vol. 48, no. 8, pp. 167-176. ACM, 2013.
* Mark Frederick Hoemmen, Michael Allen Heroux, Kurt Brian Ferreira, and Patrick G. Bridges. Fault-tolerant iterative methods via selective reliability. No. SAND2011-8603C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2011.
* Daniel L. Boley, Richard P. Brent, Gene H. Golub, and Franklin T. Luk. Algorithmic fault tolerance using the Lanczos method. SIAM Journal on Matrix Analysis and Applications 13, no. 1 (1992): 312-332.

Preliminary scope of work

Error handling and fault tolerance is a hot topic in numerical mathematics, as on large, exascale-sized machines, the "Mean Time to Failure" is expected to be close to 0. Thus we want to investigate how a randomly injected error (say, a single bit flip) affects the execution of iterative methods for solving linear systems and how this can be remedied in an inexpensive way within the algorithms themselves.

Preliminary scope of work in English