Thesis (Selection of subject)
Thesis details
Data Lineage Analysis Service for Embedded Code
Thesis title in Czech: Služba pro analýzu datových toků vestavěného kódu
Thesis title in English: Data Lineage Analysis Service for Embedded Code
Key words: datová linie|datové toky|vložený kód|python|AWS Glue
English key words: data lineage|data flow|embedded code|python|AWS Glue
Academic year of topic announcement: 2021/2022
Thesis type: diploma thesis
Thesis language: English
Department: Department of Distributed and Dependable Systems (32-KDSS)
Supervisor: doc. RNDr. Pavel Parízek, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 12.01.2022
Date of assignment: 12.01.2022
Confirmed by Study dept. on: 24.01.2022
Date and time of defence: 06.09.2023 09:00
Date of electronic submission: 19.07.2023
Date of submission of printed version: 24.07.2023
Date of proceeded defence: 06.09.2023
Opponents: RNDr. David Bednárek, Ph.D.
 
 
 
Guidelines
Data management and analytics tools often use embedded code for data manipulation tasks. Popular examples of such tools include the AWS Glue data integration service, the Databricks platform, the Snowflake data cloud, and SQL Server Integration Services (SSIS). Embedded code is typically written in languages such as Python, Java, C#, or JavaScript. Manta Flow is an automated data lineage platform that can analyze database models and definitions created in these tools, and it can analyze code written in these programming languages, but it cannot do both at the same time.

The goal of this thesis project is to design and implement a data lineage analysis service for embedded code. The service will enable integration of the data lineage graph obtained from the data analytics tools with the data lineage graph derived from the embedded code.

One of the main tasks is to create a solid design of the service that can be easily extended with support for new tools and their embedded code in the future. The benefits and usefulness of this design will then be demonstrated on a prototype implementation of the service for AWS Glue and embedded code written in Python. Other specific tasks include a proof-of-concept implementation of a metadata extractor for AWS Glue and modifications to the existing Python scanner. High performance is a very important aspect of the service, because it will be called many times during a run of the Manta Flow analysis platform, specifically every time the analysis processes a statement that executes a piece of embedded code.
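
The two design requirements above, extensibility and performance under repeated calls, can be illustrated with a minimal sketch. It is not the Manta Flow API or the design produced by the thesis. The class `EmbeddedCodeLineageService`, the function `toy_python_analyzer`, and the `(source, target)` edge representation are hypothetical names chosen for illustration. The sketch assumes that analyzers are registered per language and that results are cached by a hash of the embedded code, so a repeated statement does not trigger a second analysis.

```python
import ast
import hashlib
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: one lineage edge is (source_name, target_name).
Edge = Tuple[str, str]
Analyzer = Callable[[str], List[Edge]]


class EmbeddedCodeLineageService:
    """Illustrative service: per-language analyzers plus a result cache."""

    def __init__(self) -> None:
        self._analyzers: Dict[str, Analyzer] = {}
        self._cache: Dict[Tuple[str, str], List[Edge]] = {}
        self.analyzer_calls = 0  # counts real analyses, i.e. cache misses

    def register(self, language: str, analyzer: Analyzer) -> None:
        # New languages are supported by registering another analyzer.
        self._analyzers[language.lower()] = analyzer

    def analyze(self, language: str, code: str) -> List[Edge]:
        lang = language.lower()
        if lang not in self._analyzers:
            raise KeyError(f"no analyzer registered for {language!r}")
        # Key the cache by language and a digest of the code text.
        key = (lang, hashlib.sha256(code.encode("utf-8")).hexdigest())
        if key not in self._cache:
            self.analyzer_calls += 1
            self._cache[key] = self._analyzers[lang](code)
        return list(self._cache[key])  # copy so callers cannot alter the cache


def toy_python_analyzer(code: str) -> List[Edge]:
    """Toy stand-in for a real scanner: every name read on the right-hand
    side of a simple assignment flows into the assigned name."""
    edges: List[Edge] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Assign):
            sources = sorted(
                {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            )
            for target in node.targets:
                if isinstance(target, ast.Name):
                    edges.extend((src, target.id) for src in sources)
    return edges


if __name__ == "__main__":
    service = EmbeddedCodeLineageService()
    service.register("python", toy_python_analyzer)
    snippet = "b = a\nc = a + b"
    print(service.analyze("python", snippet))
    print(service.analyze("python", snippet))  # served from the cache
    print(service.analyzer_calls)
```

In this sketch a new tool or language only needs a call to `register`, and the cache keeps repeated invocations cheap. A real service would also have to map the edges onto the lineage graph of the calling tool, which the sketch leaves out.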
References
1. MANTA Flow Platform. https://getmanta.com/
2. AWS Glue documentation. https://docs.aws.amazon.com/glue/index.html
3. Python. https://docs.python.org/3/
4. Databricks. https://databricks.com/
5. Snowflake. https://www.snowflake.com/
6. MANTA Python scanner. Team software project, MFF UK, 2021
 