Thesis topics (Thesis selection)
Thesis detail
Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods
Title in Czech: Detekce a oprava takzvaných bitových chyb v metodách pipelinových Krylovových podprostorů
Title in English: Detection and Correction of Silent Errors in Pipelined Krylov Subspace Methods
Keywords: fault tolerance|iterative methods|computer science|numerical mathematics|errors|algorithms|high-performance computing|matrix computations|Krylov subspace methods
Academic year of topic announcement: 2022/2023
Thesis type: diploma thesis
Thesis language: English
Department: Department of Numerical Mathematics (32-KNM)
Supervisor: Erin Claire Carson, Ph.D.
Author: hidden - assigned and confirmed by the Study Department
Date of registration: 14.04.2023
Date of assignment: 26.04.2023
Date of confirmation by the Study Department: 03.05.2023
Date of electronic submission: 27.04.2024
Advisors: doc. RNDr. Petr Tichý, Ph.D.
Guidelines
The work involves developing fault-tolerant algorithms based on predict-and-recompute variants of pipelined Krylov subspace methods for solving linear systems Ax=b. The key insight is that by monitoring the difference between the "predicted" and "recomputed" values of a certain quantity, one can potentially determine whether a silent error (e.g., a bit flip) has likely occurred. The project will involve 1) developing a method that detects when a silent error has likely occurred, 2) developing a method for correcting these errors and continuing execution, and 3) writing implementations in MATLAB (or another high-level language, such as Python) that simulate the injection of silent errors into predict-and-recompute pipelined Krylov subspace methods, in order to evaluate the effectiveness of the developed detection and correction strategies.
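
The detection idea can be illustrated with a minimal Python sketch. It is not the method to be developed in the thesis, and it uses plain (non-pipelined) conjugate gradients rather than a predict-and-recompute variant. The monitored quantity is assumed here to be the squared residual norm: it is predicted from the algebraic identity that follows from the update r_k = r_{k-1} - alpha*s, then recomputed from the stored residual. The test matrix, the tolerance `tol`, the fault location, and all function names are illustrative assumptions.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a float64 (bit 0 = lowest mantissa bit, bit 63 = sign)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def laplacian_matvec(v):
    """Apply tridiag(-1, 2, -1), a small SPD test matrix (an assumption)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

def cg_with_check(b, iters, fault=None, tol=1e-8):
    """Run CG from x0 = 0 and return the iterations flagged as suspicious.

    fault = (iteration, index, bit) flips one bit of the stored residual.
    """
    x = [0.0] * len(b)
    r = list(b)
    p = list(b)
    nu = dot(r, r)
    flagged = []
    for k in range(1, iters + 1):
        s = laplacian_matvec(p)
        alpha = nu / dot(p, s)
        # Predicted <r_k, r_k>, using r_k = r_{k-1} - alpha * s.
        nu_pred = nu - 2.0 * alpha * dot(r, s) + alpha * alpha * dot(s, s)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        if fault is not None and fault[0] == k:
            # Simulated silent error in one stored residual entry.
            r[fault[1]] = flip_bit(r[fault[1]], fault[2])
        nu_new = dot(r, r)  # recomputed value
        gap = abs(nu_pred - nu_new) / max(abs(nu_pred), abs(nu_new), 1e-300)
        if gap > tol:
            flagged.append(k)
        beta = nu_new / nu
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        nu = nu_new
    return flagged

b = [1.0] * 50
clean = cg_with_check(b, iters=8)
faulty = cg_with_check(b, iters=8, fault=(3, 10, 55))
print(clean, faulty)
```

This simple check has blind spots that a real detector would need to handle: a sign-bit flip leaves the squared norm unchanged, a flip that produces inf or NaN makes the gap NaN so the `>` comparison never fires, and an error in `s` enters both the predicted and the recomputed value consistently.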

Broad questions to answer via numerical experiments:
+ How reliably does your method correctly predict whether or not an error has occurred (rate of false positives and false negatives)?
+ What is the cost (overhead) of a false positive? If there is a false negative, is convergence destroyed, or can the method recover?
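
One possible way to tabulate the answers to the first question is sketched below in Python. It assumes each experiment is recorded as a pair (fault injected, detector flagged); the function name and the sample data are hypothetical and serve only to show the bookkeeping.

```python
def detection_rates(trials):
    """Return (false_positive_rate, false_negative_rate).

    trials: iterable of (fault_injected, detector_flagged) boolean pairs.
    A rate is None when there are no trials of the relevant kind.
    """
    trials = list(trials)
    clean = [flagged for injected, flagged in trials if not injected]
    faulty = [flagged for injected, flagged in trials if injected]
    # False positive: the detector fired although no fault was injected.
    fp_rate = sum(clean) / len(clean) if clean else None
    # False negative: a fault was injected but the detector stayed silent.
    fn_rate = sum(not f for f in faulty) / len(faulty) if faulty else None
    return fp_rate, fn_rate

# Made-up example data: four fault-free runs and four runs with a fault.
trials = [(False, False), (False, True), (False, False), (False, False),
          (True, True), (True, True), (True, False), (True, True)]
print(detection_rates(trials))  # (0.25, 0.25)
```

Normalizing false positives by the number of fault-free runs and false negatives by the number of faulty runs keeps the two rates independent of how often faults are injected in the experiment.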

The work will also involve a review of the literature on this topic.
References
* Tyler Chen and Erin Carson. Predict-and-recompute conjugate gradient variants. SIAM Journal on Scientific Computing, vol. 42, no. 5, pp. A3084-A3108, 2020.
* Gérard Meurant. Detection and correction of silent errors in the conjugate gradient algorithm. Numerical Algorithms, pp. 1-23, 2022.
* Gérard Meurant. Multitasking the conjugate gradient method on the Cray X-MP/48. Parallel Comput., vol. 5, pp. 267-280, 1987.
* Zizhong Chen. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In ACM SIGPLAN Notices, vol. 48, no. 8, pp. 167-176. ACM, 2013.
* Mark Frederick Hoemmen, Michael Allen Heroux, Kurt Brian Ferreira, and Patrick G. Bridges. Fault-tolerant iterative methods via selective reliability. No. SAND2011-8603C. Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2011.
* Daniel L. Boley, Richard P. Brent, Gene H. Golub, and Franklin T. Luk. Algorithmic fault tolerance using the Lanczos method. SIAM Journal on Matrix Analysis and Applications 13, no. 1 (1992): 312-332.
Preliminary scope of work
Error handling and fault tolerance are hot topics in numerical mathematics because, on large, exascale-sized machines, the "Mean Time to Failure" is expected to be close to 0. We therefore want to investigate how a randomly injected error (say, a single bit flip) affects the execution of iterative methods for solving linear systems, and how this can be remedied inexpensively within the algorithms themselves.
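
How much damage a single bit flip does depends strongly on which bit is hit. The short Python sketch below shows this for the value 1.0, assuming the IEEE 754 binary64 format that Python floats use on common platforms; the helper name `flip_bit` is an illustrative choice.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a float64 (bits 0-51 mantissa, 52-62 exponent, 63 sign)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

for bit in (0, 51, 52, 62, 63):
    print(bit, flip_bit(1.0, bit))
# 0 1.0000000000000002   (lowest mantissa bit: change at rounding level)
# 51 1.5                 (highest mantissa bit)
# 52 0.5                 (lowest exponent bit)
# 62 inf                 (highest exponent bit)
# 63 -1.0                (sign bit)
```

A flip in a low mantissa bit is indistinguishable from ordinary rounding error, while a flip in a high exponent bit can turn a well-scaled entry into infinity, so any experiment with injected errors has to state which bits are targeted.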
 
Charles University | Charles University Information System