Thesis (Selection of subject)

Data Lineage Analysis for Databricks platform

Thesis title in Czech:	Analýza datových toků pro platformu Databricks
Thesis title in English:	Data Lineage Analysis for Databricks platform
Key words:	databricks | data lineage | data flow | symbolicka analyza
English key words:	databricks | data lineage | data flow | symbolic analysis
Academic year of topic announcement:	2022/2023
Thesis type:	diploma thesis
Thesis language:	English
Department:	Department of Distributed and Dependable Systems (32-KDSS)
Supervisor:	doc. RNDr. Pavel Parízek, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	20.10.2022
Date of assignment:	21.10.2022
Confirmed by Study dept. on:	07.11.2022
Date and time of defence:	06.09.2023 09:00
Date of electronic submission:	17.07.2023
Date of submission of printed version:	17.07.2023
Date of completed defence:	06.09.2023
Opponents:	Mgr. Petr Škoda, Ph.D.

Guidelines

Databricks is a cloud-based platform used for data science, engineering and analytics. It combines the advantages of data warehouses and data lakes in a so-called lakehouse architecture. Users can write and run notebooks, which are scripts that access various data sources and process the loaded data with code snippets in languages such as Python or R.

The main goal of this project is to extend the MANTA Flow platform for automated data lineage analysis with support for Databricks.
Work on this project should have two phases:
(1) Design and implementation of generic baseline support for data lineage scanners of notebook platforms, such as Databricks and Jupyter Notebook. The main characteristic of notebooks is the combination of various languages and SQL dialects within a single document.
(2) Creating one instance of the generic support (framework) that will be tailored for the Databricks platform and its API.
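
To illustrate the mix of languages mentioned in phase (1), the following minimal sketch splits a notebook into per-language cells. It assumes the notebook is available in the Databricks source export format, where cells are separated by a `# COMMAND ----------` line and non-default-language cells are prefixed with `# MAGIC`. The names `Cell` and `split_notebook` are hypothetical and are not part of MANTA Flow or any Databricks API.

```python
from dataclasses import dataclass

# Assumed markers of the Databricks source export format.
NOTEBOOK_HEADER = "# Databricks notebook source"
CELL_SEPARATOR = "# COMMAND ----------"
MAGIC_PREFIX = "# MAGIC"


@dataclass
class Cell:
    """One notebook cell together with the language it is written in."""
    language: str
    source: str


def split_notebook(text: str, default_language: str = "python") -> list[Cell]:
    """Split exported notebook text into cells and detect each cell's language."""
    cells = []
    for chunk in text.split(CELL_SEPARATOR):
        lines = [ln for ln in chunk.strip().splitlines()
                 if ln.strip() != NOTEBOOK_HEADER]
        if not lines:
            continue
        if all(ln.startswith(MAGIC_PREFIX) for ln in lines):
            # Strip the "# MAGIC" prefix and one following space from every line.
            body = [ln[len(MAGIC_PREFIX):].removeprefix(" ") for ln in lines]
            first = body[0].strip()
            if first.startswith("%"):
                # A leading "%sql", "%r", ... selects the cell language.
                language, _, rest = first[1:].partition(" ")
                body = ([rest] if rest else []) + body[1:]
            else:
                language = default_language
            cells.append(Cell(language, "\n".join(body).strip()))
        else:
            cells.append(Cell(default_language, "\n".join(lines).strip()))
    return cells
```

Each resulting cell can then be handed to the scanner that matches its language.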

Specific tasks and expected outputs also include the following:
- Implementing support for (1) invoking existing scanners for various languages and technologies (such as Python or SQL), and (2) merging outputs from those scanners.
- Extending the MANTA Python scanner such that it properly handles library procedures and features typically used in Databricks notebooks.
- Maintaining the shared context (analysis state) between the invocations of different individual scanners (e.g., for Python and SQL).
- Implementing extraction (loading) of metadata from the respective storage (e.g., Hive metastore or Unity Catalog).
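
The first and third tasks above (invoking per-language scanners, merging their outputs, and keeping a shared analysis state) can be sketched as follows. This is only an illustration under stated assumptions: `SharedContext`, `LineageGraph`, `analyze_notebook` and the two toy scanners are hypothetical names invented for this example, and they do not reflect the actual MANTA Flow scanner interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Edge = tuple[str, str]  # (source object, target object)


@dataclass
class SharedContext:
    """Analysis state shared between scanner invocations (hypothetical)."""
    symbols: dict[str, str] = field(default_factory=dict)  # view name -> table


@dataclass
class LineageGraph:
    """Set of data flow edges produced by one or more scanners."""
    edges: set[Edge] = field(default_factory=set)

    def merge(self, other: "LineageGraph") -> None:
        self.edges |= other.edges


Scanner = Callable[[str, SharedContext], LineageGraph]


def analyze_notebook(cells: list[tuple[str, str]],
                     scanners: dict[str, Scanner],
                     context: Optional[SharedContext] = None) -> LineageGraph:
    """Run the matching scanner on each (language, source) cell, in order."""
    context = context if context is not None else SharedContext()
    result = LineageGraph()
    for language, source in cells:
        scanner = scanners.get(language)
        if scanner is None:
            continue  # no scanner registered for this language
        result.merge(scanner(source, context))
    return result


def toy_python_scanner(source: str, ctx: SharedContext) -> LineageGraph:
    # Stand-in: pretends the cell registered view 'orders_view' over 'raw.orders'.
    ctx.symbols["orders_view"] = "raw.orders"
    return LineageGraph({("raw.orders", "orders_view")})


def toy_sql_scanner(source: str, ctx: SharedContext) -> LineageGraph:
    # Stand-in: emits a flow from the view only if an earlier cell defined it.
    if "orders_view" in ctx.symbols:
        return LineageGraph({("orders_view", "report.daily")})
    return LineageGraph()


graph = analyze_notebook(
    [("python", "df.createOrReplaceTempView('orders_view')"),
     ("sql", "INSERT INTO report.daily SELECT * FROM orders_view")],
    {"python": toy_python_scanner, "sql": toy_sql_scanner},
)
```

In this sketch the SQL scanner can only connect its output to `raw.orders` because the Python scanner recorded the view in the shared context first, which is why the state has to survive across invocations.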

An important part of this project will be a thorough analysis of the relevant technologies.

References

1. Databricks documentation. https://docs.databricks.com/
2. MANTA Flow Platform. https://getmanta.com/
3. Hive metastore. https://docs.databricks.com/data/metastores/index.html
4. Unity Catalog. https://www.databricks.com/product/unity-catalog