Precise and Efficient Incremental Update of Data Lineage Graph
Thesis title in Czech: | Přesné efektivní inkrementální modifikace grafu datových toků |
---|---|
Title in English: | Precise and Efficient Incremental Update of Data Lineage Graph |
Keywords in Czech: | {datové toky}|{inkrementální update}|{statická analýza}|{graf datových toků}|{Manta} |
Keywords in English: | {data lineage}|{incremental updates}|{static analysis}|{data flow graph}|{Manta} |
Academic year of topic announcement: | 2020/2021 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | Mgr. Josef Kumstýř - assigned and confirmed by the Study Department |
Date of registration: | 26.02.2021 |
Date of assignment: | 26.02.2021 |
Date of confirmation by the Study Department: | 08.03.2021 |
Date and time of defence: | 07.06.2022 10:00 |
Date of electronic submission: | 05.05.2022 |
Date of submission of printed version: | 16.05.2022 |
Date of completed defence: | 07.06.2022 |
Opponents: | RNDr. Filip Zavoral, Ph.D. |
Guidelines |
Every organization uses data to stay relevant and competitive while undergoing a constant process of digital transformation. For many organizations, the amount of data is now too large to inspect manually. MANTA Flow is a platform that generates and automatically updates data lineage information, which shows the origin of data and its journey through all the data-processing systems. The platform eliminates human error and provides accurate lineage information based on hard facts rather than guesses and assumptions.
MANTA Flow generates data lineage graphs by analyzing extracted source files that users provide as input. However, in the current version of MANTA Flow, if a user wants to update the data lineage graph after a small change in the source files, all of the input source files are reanalyzed, which can take many hours. Most of this time is spent analyzing unchanged files that yield the same data lineage graph as the previous analysis run. The goal of this thesis project is to speed up the data lineage analysis by means of incremental update. The main idea of incremental update in the context of MANTA Flow is to reanalyze only a fraction of the input files, one that is sufficient to obtain the same data lineage graph as a full analysis would produce.
The following tasks should be done within this project:
* Analyze the current version of the MANTA Flow platform and identify all the technical challenges related to the design and implementation of incremental update, including the effects on the current data lineage analysis.
* Design an efficient and precise algorithm for incremental update of data lineage graphs.
* Implement a working prototype of the incremental analysis for the Oracle database technology.
* Perform extensive testing, validation, and performance evaluation of the prototype. |
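The core idea of incremental update described above, reanalyzing only the files that can affect the result, could be sketched as follows. This is an illustrative sketch only and not MANTA Flow's actual algorithm or API: the function names, the use of SHA-256 content hashes to detect changes, and the `dependents` map (which file's analysis depends on which other file) are all assumptions made for the example. How such dependencies are actually determined is one of the challenges the thesis is meant to address.

```python
from __future__ import annotations

import hashlib


def fingerprint(content: str) -> str:
    """Return a content hash used to detect whether a file changed."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_files(previous: dict[str, str], current: dict[str, str]) -> set[str]:
    """Names of files added, removed, or modified between two runs.

    Both arguments map a file name to its content hash.
    """
    names = previous.keys() | current.keys()
    return {name for name in names if previous.get(name) != current.get(name)}


def files_to_reanalyze(
    changed: set[str],
    dependents: dict[str, set[str]],  # hypothetical: file -> files depending on it
    existing: set[str],
) -> set[str]:
    """Changed files plus everything that transitively depends on them.

    Removed files are dropped from the result, since only files that
    still exist can be reanalyzed.
    """
    affected = set(changed)
    pending = list(changed)
    while pending:
        name = pending.pop()
        for dependent in dependents.get(name, ()):
            if dependent not in affected:
                affected.add(dependent)
                pending.append(dependent)
    return affected & existing


# Hypothetical example: three input scripts, one of which was edited.
previous = {
    "tables.sql": fingerprint("CREATE TABLE t (a INT);"),
    "view.sql": fingerprint("CREATE VIEW v AS SELECT a FROM t;"),
    "report.sql": fingerprint("SELECT 1 FROM dual;"),
}
current = dict(previous)
current["tables.sql"] = fingerprint("CREATE TABLE t (a INT, b INT);")

dependents = {"tables.sql": {"view.sql"}}  # the view reads from the table

changed = changed_files(previous, current)
todo = files_to_reanalyze(changed, dependents, set(current))
print(sorted(todo))  # ['tables.sql', 'view.sql']
```

In this sketch `report.sql` is skipped because it neither changed nor depends on a changed file. The result is only as precise as the dependency map: a missing edge would produce a graph that differs from the full analysis, while superfluous edges cost analysis time.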
References |
1. MANTA Flow Platform. https://getmanta.com/
2. Jan Sýkora. Incremental update of data lineage storage in a graph database. Master's thesis, Czech Technical University in Prague, 2018.
3. Titan: distributed graph database. https://titan.thinkaurelius.com/
4. Neo4j Graph Platform. https://neo4j.com/ |