Thesis (Selection of subject) (version: 368)
Thesis details
Precise and Efficient Incremental Update of Data Lineage Graph
Thesis title in Czech: Přesné efektivní inkrementální modifikace grafu datových toků
Thesis title in English: Precise and Efficient Incremental Update of Data Lineage Graph
Key words: {datové toky}|{inkrementální update}|{statická analýza}|{graf datových toků}|{Manta}
English key words: {data lineage}|{incremental updates}|{static analysis}|{data flow graph}|{Manta}
Academic year of topic announcement: 2020/2021
Thesis type: diploma thesis
Thesis language: English
Department: Department of Distributed and Dependable Systems (32-KDSS)
Supervisor: doc. RNDr. Pavel Parízek, Ph.D.
Author: Mgr. Josef Kumstýř - assigned and confirmed by the Study Dept.
Date of registration: 26.02.2021
Date of assignment: 26.02.2021
Confirmed by Study dept. on: 08.03.2021
Date and time of defence: 07.06.2022 10:00
Date of electronic submission: 05.05.2022
Date of submission of printed version: 16.05.2022
Date of proceeded defence: 07.06.2022
Opponents: RNDr. Filip Zavoral, Ph.D.
 
 
 
Guidelines
Every organization uses data to stay relevant and competitive while undergoing a constant process of digital transformation. For many organizations today, the volume of data is too large to inspect manually. MANTA Flow is a platform that generates and automatically updates data lineage information, which shows the origin of data and its journey through all the data-processing systems. The platform eliminates human error and provides accurate lineage information based on hard facts rather than guesses and assumptions.

MANTA Flow generates data lineage graphs by analyzing extracted source files that users provide as input. However, in the current version of MANTA Flow, when a user wants to update the data lineage graph after a small change in the source files, all of the input source files are reanalyzed, which can take many hours. Most of this time is spent analyzing unchanged files that produce the same data lineage graph as in the previous analysis run.

The goal of this thesis project is to speed up the data lineage analysis by means of incremental update. The main idea of incremental update in the context of MANTA Flow is to reanalyze only the fraction of the input files that is sufficient to obtain the same data lineage graph as a full analysis would produce.
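
To make the idea of reanalyzing only a sufficient fraction of the input concrete, the following is a minimal sketch in Python. It is not MANTA Flow's algorithm or API. It assumes that each input file can be fingerprinted by a content hash, and that a map of cross-file references (the hypothetical `depends_on` argument) is available from the previous run. Under those assumptions, a conservative choice is to reanalyze every added or changed file plus every file that transitively references a changed or deleted one. All function names, parameter names, and file names below are illustrative.

```python
import hashlib
from collections import deque


def content_hash(text: str) -> str:
    """Fingerprint a source file's content (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_to_reanalyze(previous_hashes, current_files, depends_on):
    """Return the set of files whose lineage may differ from the previous run.

    previous_hashes: {file name: hash recorded by the previous run}
    current_files:   {file name: current content}
    depends_on:      {file name: set of file names it references}
                     (assumed to be known; not something the source specifies)
    """
    # Directly affected: new files, or files whose content changed.
    changed = {
        name for name, text in current_files.items()
        if previous_hashes.get(name) != content_hash(text)
    }
    # Deleted files also invalidate whatever referenced them.
    deleted = set(previous_hashes) - set(current_files)

    # Invert the dependency map: file -> files that reference it.
    dependents = {}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(name)

    # Breadth-first walk over dependents to collect the transitive impact.
    affected = set(changed)
    queue = deque(changed | deleted)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent in current_files and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Illustrative input: only a.sql changed, and b.sql references a.sql.
old = {
    "a.sql": content_hash("create table a (x int)"),
    "b.sql": content_hash("select x from a"),
    "c.sql": content_hash("select 1 from dual"),
}
new = {
    "a.sql": "create table a (x int, y int)",
    "b.sql": "select x from a",
    "c.sql": "select 1 from dual",
}
deps = {"b.sql": {"a.sql"}}

print(sorted(files_to_reanalyze(old, new, deps)))  # ['a.sql', 'b.sql']
```

In this sketch the unchanged and unrelated `c.sql` is skipped. Whether the resulting graph really equals the one from a full analysis depends on the dependency map being complete, which is the kind of precision question the tasks below address.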

The following tasks should be done within this project:
* Analyze the current version of the MANTA Flow platform and identify all the technical challenges related to design and implementation of incremental update, including the effects on the current data lineage analysis.
* Design an efficient and precise algorithm for incremental update of data lineage graphs.
* Implement a working prototype of the incremental analysis for the Oracle database technology.
* Test, validate, and evaluate the performance of the prototype extensively.
References
1. MANTA Flow Platform. https://getmanta.com/
2. Jan Sýkora. Incremental update of data lineage storage in a graph database. Master's thesis, Czech Technical University in Prague, 2018.
3. Titan: distributed graph database. https://titan.thinkaurelius.com/
4. Neo4j Graph Platform. https://neo4j.com/
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html