Data Lineage Analysis for Databricks platform
Thesis title in Czech: | Analýza datových toků pro platformu Databricks |
---|---|
Thesis title in English: | Data Lineage Analysis for Databricks platform |
Key words: | databricks|data lineage|data flow|symbolicka analyza |
English key words: | databricks|data lineage|data flow|symbolic analysis |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 20.10.2022 |
Date of assignment: | 21.10.2022 |
Confirmed by Study dept. on: | 07.11.2022 |
Date and time of defence: | 06.09.2023 09:00 |
Date of electronic submission: | 17.07.2023 |
Date of submission of printed version: | 17.07.2023 |
Date of proceeded defence: | 06.09.2023 |
Opponents: | Mgr. Petr Škoda, Ph.D. |
Guidelines |
Databricks is a cloud-based platform used for data science, engineering and analytics. It combines the advantages of data warehouses and data lakes into a so-called lakehouse architecture. Users can write and run notebooks, which are scripts that access various data sources and process the loaded data with code snippets in languages such as Python or R.
The main goal of this project is to extend the MANTA Flow platform for automated data lineage analysis with support for Databricks. Work on this project should have two phases: (1) Design and implementation of generic basic support for data lineage scanners of notebook platforms, such as Databricks and Jupyter Notebook. The main characteristic of notebooks is the combination of various languages and SQL dialects. (2) Creating one instance of the generic support (framework) that will be tailored for the Databricks platform and its API.
Specific tasks and expected outputs also include the following:
- Implementing support for (1) invoking existing scanners for various languages and technologies (such as Python or SQL), and (2) merging outputs from those scanners.
- Extending the MANTA Python scanner such that it properly handles library procedures and features typically used in Databricks notebooks.
- Maintaining the shared context (analysis state) between the invocations of different individual scanners (e.g., for Python and SQL).
- Implementing extraction (loading) of metadata from the respective storage (e.g., Hive metastore or Unity Catalog).
An important part of this project will be a thorough analysis of the relevant technologies. |
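The tasks above (invoking per-language scanners, merging their outputs, and keeping a shared analysis state between invocations) can be illustrated with a minimal Python sketch. It is not MANTA Flow code and does not describe the thesis implementation: `SharedContext`, `scan_sql`, `scan_python`, `split_magic`, and `analyze_notebook` are hypothetical names, the regex-based "scanners" are toy stand-ins for real language analyzers, and cells are assumed to be plain strings whose language is selected by a leading magic command such as `%sql`.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """Analysis state carried across cells (hypothetical structure)."""
    # Maps a Python variable name to the table it was loaded from.
    variables: dict = field(default_factory=dict)
    # Lineage edges collected so far, as (source, target) pairs.
    edges: set = field(default_factory=set)


def scan_sql(source, ctx):
    """Toy SQL scanner: handles only 'INSERT INTO t SELECT ... FROM s'."""
    m = re.search(r'insert\s+into\s+(\w+).*?\bfrom\s+(\w+)', source, re.I | re.S)
    if m:
        ctx.edges.add((m.group(2), m.group(1)))


def scan_python(source, ctx):
    """Toy Python scanner: tracks spark.table reads and saveAsTable writes."""
    for var, table in re.findall(r'(\w+)\s*=\s*spark\.table\("(\w+)"\)', source):
        ctx.variables[var] = table
    for var, table in re.findall(r'(\w+)\.write\.saveAsTable\("(\w+)"\)', source):
        if var in ctx.variables:
            ctx.edges.add((ctx.variables[var], table))


# Registry of available scanners, keyed by cell language.
SCANNERS = {"python": scan_python, "sql": scan_sql}


def split_magic(cell, default_language="python"):
    """Return (language, source), honoring a leading magic such as %sql."""
    m = re.match(r'\s*%(\w+)\s*(.*)', cell, re.S)
    if m:
        return m.group(1).lower(), m.group(2)
    return default_language, cell


def analyze_notebook(cells, default_language="python"):
    """Run the matching scanner on each cell and merge all lineage edges."""
    ctx = SharedContext()
    for cell in cells:
        language, source = split_magic(cell, default_language)
        scanner = SCANNERS.get(language)
        if scanner is not None:  # cells in unsupported languages are skipped
            scanner(source, ctx)
    return sorted(ctx.edges)


cells = [
    'df = spark.table("raw_orders")',
    '%sql\nINSERT INTO clean_orders SELECT * FROM raw_orders',
    'df.write.saveAsTable("orders_copy")',
]
print(analyze_notebook(cells))
```

For the three example cells, the sketch prints `[('raw_orders', 'clean_orders'), ('raw_orders', 'orders_copy')]`. The second edge is found only because the variable `df` from the first Python cell is still in the shared context when the third cell is scanned, even though an SQL cell was processed in between.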
References |
1. Databricks documentation. https://docs.databricks.com/
2. MANTA Flow Platform. https://getmanta.com/
3. Hive metastore. https://docs.databricks.com/data/metastores/index.html
4. Unity Catalog. https://www.databricks.com/product/unity-catalog |