Data Lineage Analysis Service for Embedded Code
Thesis title in Czech: | Služba pro analýzu datových toků vestavěného kódu |
---|---|
Thesis title in English: | Data Lineage Analysis Service for Embedded Code |
Key words: | dataová linie|datové toky|vložený kód|python|AWS Glue |
English key words: | data lineage|data flow|embedded code|python|AWS Glue |
Academic year of topic announcement: | 2021/2022 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 12.01.2022 |
Date of assignment: | 12.01.2022 |
Confirmed by Study dept. on: | 24.01.2022 |
Date and time of defence: | 06.09.2023 09:00 |
Date of electronic submission: | 19.07.2023 |
Date of submission of printed version: | 24.07.2023 |
Date of completed defence: | 06.09.2023 |
Opponents: | RNDr. David Bednárek, Ph.D. |
Guidelines |
Data management and analytics tools often use embedded code for data manipulation tasks. Popular examples of such tools include the AWS Glue data integration service, the Databricks platform, the Snowflake data cloud, and SQL Server Integration Services (SSIS). Embedded code is typically written in languages such as Python, Java, C#, or JavaScript. Manta Flow is an automated data lineage platform that can analyze both the database models and definitions created in these tools and code written in these programming languages, but not at the same time, that is, not as part of a single combined analysis.
The goal of this thesis project is to design and implement a data lineage analysis service for embedded code that will enable integration of the data lineage graph from the data analytics tools with the data lineage graph derived from the embedded code. One of the main tasks is to create a solid design of the service that can be easily extended in the future with support for new tools and their embedded code. The benefits and usefulness of this design will then be demonstrated on a prototype implementation of the service for AWS Glue and embedded code written in Python. Other specific tasks include a proof-of-concept implementation of a metadata extractor for AWS Glue and modifications to the existing Python scanner. High performance is a very important aspect of the service, because it will be called many times during a run of the Manta Flow analysis platform, specifically every time the analysis processes a statement that executes a piece of embedded code. |
References |
1. MANTA Flow Platform. https://getmanta.com/
2. AWS Glue documentation. https://docs.aws.amazon.com/glue/index.html
3. Python. https://docs.python.org/3/
4. Databricks. https://databricks.com/
5. Snowflake. https://www.snowflake.com/
6. MANTA Python scanner. Team software project, MFF UK, 2021 |