Témata prací (Výběr práce)

Váš prohlížeč nepodporuje JavaScript nebo je jeho podpora vypnutá. Některé funkce nemusejí být dostupné.

Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods

Název práce v češtině:	Detekce a oprava takzvaných bitových chyb v metodách pipelinových Krylovových podprostorů
Název v anglickém jazyce:	Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods
Klíčová slova:	fault tolerance\|iterative methods\|computer science\|numerical mathematics\|errors\|algorithms\|high-performance computing\|matrix computations\|Krylov subspace methods
Klíčová slova anglicky:	fault tolerance\|iterative methods\|computer science\|numerical mathematics\|errors\|algorithms\|high-performance computing\|matrix computations\|Krylov subspace methods
Akademický rok vypsání:	2022/2023
Typ práce:	diplomová práce
Jazyk práce:	angličtina
Ústav:	Katedra numerické matematiky (32-KNM)
Vedoucí / školitel:	Erin Claire Carson, Ph.D.
Řešitel:	skrytý - zadáno a potvrzeno stud. odd.
Datum přihlášení:	14.04.2023
Datum zadání:	26.04.2023
Datum potvrzení stud. oddělením:	03.05.2023
Datum odevzdání elektronické podoby:	27.04.2024
Konzultanti:	doc. RNDr. Petr Tichý, Ph.D.

Zásady pro vypracování

The work involves developing fault-tolerant algorithms based on predict-and-recompute variants of pipelined Krylov subspace methods for solving linear systems Ax=b. The insight is that by monitoring the difference between the "predicted" and "recomputed" values of a certain quantity, one can potentially determine whether or not a silent error (e.g., a bit flip) has likely occurred. This project will involve 1) developing a method that detects when a silent error has likely occurred, 2) developing a method for correcting these errors and continuing execution of the method, and 3) writing MATLAB (or other high-level language, such as Python) implementations which simulate the injection of silent errors in predict-and-recompute pipelined Krylov subspace methods in order to evaluate the effectiveness of the developed detection and correction strategies.

Broad questions to answer via numerical experiments:
+ How reliably does your method correctly predict whether or not an error has occurred (rate of false positives and false negatives)?
+ What is the cost (overhead) of a false positive? If there is a false negative, is convergence destroyed, or can the method recover?

The work will also involve doing a review of the literature on this topic.

Seznam odborné literatury

* Tyler Chen and Erin Carson. Predict-and-recompute conjugate gradient variants. SIAM Journal on Scientific Computing, vol. 42, no. 5, pp. A3084-A3108, 2020.
* Gérard Meurant. Detection and correction of silent errors in the conjugate gradient algorithm. Numerical Algorithms, pp. 1-23, 2022.
* Gérard Meurant. Multitasking the conjugate gradient method on the Cray X-MP/48, Parallel
Comput., vol. 5, pp. 267--280, 1987.
* Zizhong Chen. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In ACM SIGPLAN Notices, vol. 48, no. 8, pp. 167-176. ACM, 2013.
* Mark Frederick Hoemmen, Michael Allen Heroux, Kurt Brian Ferreira, and Patrick G. Bridges. Fault-tolerant iterative methods via selective reliability. No. SAND2011-8603C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2011.
* Daniel L. Boley, Richard P. Brent, Gene H. Golub, and Franklin T. Luk. Algorithmic fault tolerance using the Lanczos method. SIAM Journal on Matrix Analysis and Applications 13, no. 1 (1992): 312-332.

Předběžná náplň práce

Error handling and fault tolerance is a hot topic in numerical mathematics, as on large, exascale-sized machines, the "Mean Time to Failure" is expected to be close to 0. Thus we want to investigate how a randomly injected error (say, a single bit flip) affects the execution of iterative methods for solving linear systems and how this can be remedied in an inexpensive way within the algorithms themselves.

Předběžná náplň práce v anglickém jazyce