Thesis topics (Thesis selection)

Data Lineage Analysis for Databricks platform

Thesis title in Czech:	Analýza datových toků pro platformu Databricks
Thesis title in English:	Data Lineage Analysis for Databricks platform
Keywords (Czech):	databricks|data lineage|data flow|symbolicka analyza
Keywords (English):	databricks|data lineage|data flow|symbolic analysis
Academic year of topic announcement:	2022/2023
Thesis type:	master's thesis
Thesis language:	English
Department:	Department of Distributed and Dependable Systems (32-KDSS)
Supervisor:	doc. RNDr. Pavel Parízek, Ph.D.
Author:	hidden - assigned and confirmed by the Student Affairs Office
Date of registration:	20.10.2022
Date of assignment:	21.10.2022
Date of confirmation by the Student Affairs Office:	07.11.2022
Date and time of defence:	06.09.2023 09:00
Date of electronic submission:	17.07.2023
Date of printed submission:	17.07.2023
Date of completed defence:	06.09.2023
Opponents:	Mgr. Petr Škoda, Ph.D.

Guidelines

Databricks is a cloud-based platform used for data science, engineering and analytics. It combines the advantages of data warehouses and data lakes into the so-called lakehouse architecture. Users write and run notebooks: scripts that access various data sources and process the loaded data with code snippets in languages such as Python or R.

The main goal of this project is to extend the MANTA Flow platform for automated data lineage analysis with support for Databricks.
The work should proceed in two phases:
(1) Design and implementation of basic generic support for data lineage scanners of notebook platforms, such as Databricks and Jupyter Notebook. The main characteristic of notebooks is the combination of various languages and SQL dialects.
(2) Creation of one instance of the generic support (framework), tailored to the Databricks platform and its API.
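
To illustrate what "combination of various languages" means in practice, the following is a minimal sketch of splitting a Databricks notebook, exported as a Python source file, into per-language cells. It assumes the exported format in which cells are separated by `# COMMAND ----------` lines and cells in a non-default language carry a `# MAGIC` prefix with a leading `%sql`-style magic command. The names `Cell` and `split_notebook` are hypothetical and are not part of MANTA Flow or of any Databricks API.

```python
from dataclasses import dataclass

HEADER = "# Databricks notebook source"
CELL_SEPARATOR = "# COMMAND ----------"
MAGIC_PREFIX = "# MAGIC"


@dataclass
class Cell:
    language: str  # e.g. "python" or "sql"
    source: str    # cell body with any "# MAGIC" prefixes removed


def split_notebook(text: str, default_language: str = "python") -> list[Cell]:
    """Split an exported notebook source file into per-language cells."""
    cells = []
    for chunk in text.split(CELL_SEPARATOR):
        lines = [ln for ln in chunk.strip().splitlines() if ln.strip() != HEADER]
        if not any(ln.strip() for ln in lines):
            continue  # skip empty chunks
        language = default_language
        if all(ln.startswith(MAGIC_PREFIX) for ln in lines):
            # Every line carries the prefix, so this is a "magic" cell.
            lines = [ln[len(MAGIC_PREFIX):].removeprefix(" ") for ln in lines]
            if lines[0].startswith("%"):
                # "%sql" selects the language; code may follow on the same line.
                magic, _, rest = lines[0][1:].partition(" ")
                language = magic
                lines = ([rest] if rest else []) + lines[1:]
        cells.append(Cell(language, "\n".join(lines).strip()))
    return cells


notebook = (
    "# Databricks notebook source\n"
    "df = spark.table('sales')\n"
    "\n"
    "# COMMAND ----------\n"
    "\n"
    "# MAGIC %sql\n"
    "# MAGIC SELECT * FROM sales\n"
)
cells = split_notebook(notebook)
```

Here `cells` holds one Python cell and one SQL cell, which is the granularity at which language-specific scanners could be invoked.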

Specific tasks and expected outputs also include the following:
- Implementing support for (1) invoking existing scanners for various languages and technologies (such as Python or SQL), and (2) merging outputs from those scanners.
- Extending the MANTA Python scanner so that it properly handles library procedures and features typically used in Databricks notebooks.
- Maintaining the shared context (analysis state) between invocations of the individual scanners (e.g., for Python and SQL).
- Implementing extraction (loading) of metadata from the respective storage (e.g., Hive metastore or Unity Catalog).
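
The first and third tasks above (invoking scanners, merging their outputs, and sharing analysis state) can be sketched roughly as follows. This is an illustration under stated assumptions, not the MANTA Flow design: the `Context` class, the `analyze` function and both regex-based "toy" scanners are hypothetical stand-ins for real scanners. The shared state modelled here is the current database, which a SQL cell may change with `USE` and which then affects how unqualified table names in later Python cells are resolved.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Context:
    """Analysis state shared by all scanner invocations (hypothetical)."""
    current_database: str = "default"
    variables: dict[str, str] = field(default_factory=dict)  # variable -> table

    def qualify(self, table: str) -> str:
        return table if "." in table else f"{self.current_database}.{table}"


Edge = tuple[str, str]  # (source table, target table)
Scanner = Callable[[str, Context], set[Edge]]


def toy_sql_scanner(code: str, ctx: Context) -> set[Edge]:
    # Toy stand-in: understands only "USE db" and "INSERT INTO t SELECT ... FROM s".
    edges = set()
    for statement in code.split(";"):
        statement = statement.strip()
        if m := re.fullmatch(r"USE\s+(\w+)", statement, re.I):
            ctx.current_database = m.group(1)
        elif m := re.fullmatch(
            r"INSERT\s+INTO\s+([\w.]+)\s+SELECT\s+.+?\s+FROM\s+([\w.]+)",
            statement, re.I | re.S,
        ):
            edges.add((ctx.qualify(m.group(2)), ctx.qualify(m.group(1))))
    return edges


def toy_python_scanner(code: str, ctx: Context) -> set[Edge]:
    # Toy stand-in: tracks `x = spark.table("t")` and `x.write.saveAsTable("t")`.
    edges = set()
    for line in code.splitlines():
        if m := re.search(r'(\w+)\s*=\s*spark\.table\("([\w.]+)"\)', line):
            ctx.variables[m.group(1)] = ctx.qualify(m.group(2))
        elif m := re.search(r'(\w+)\.write\.saveAsTable\("([\w.]+)"\)', line):
            source = ctx.variables.get(m.group(1))
            if source is not None:
                edges.add((source, ctx.qualify(m.group(2))))
    return edges


def analyze(cells: list[tuple[str, str]], scanners: dict[str, Scanner]) -> set[Edge]:
    """Run the matching scanner on each cell in order and merge the edges."""
    ctx = Context()
    lineage: set[Edge] = set()
    for language, source in cells:
        scanner = scanners.get(language)
        if scanner is None:
            continue  # no scanner registered for this language
        lineage |= scanner(source, ctx)  # set union removes duplicate edges
    return lineage


cells = [
    ("sql", "USE sales"),
    ("python", 'df = spark.table("orders")'),
    ("sql", "INSERT INTO report SELECT * FROM orders"),
    ("python", 'df.write.saveAsTable("archive.orders")'),
]
lineage = analyze(cells, {"sql": toy_sql_scanner, "python": toy_python_scanner})
```

In this example the `USE sales` statement in the first SQL cell determines that `orders` in the following Python cell resolves to `sales.orders`, and the variable `df` defined in the second cell is still known when the fourth cell writes it out. Merging is a plain set union here; a real implementation would presumably merge richer graph structures.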

An important part of this project will be a thorough analysis of the relevant technologies.

References

1. Databricks documentation. https://docs.databricks.com/
2. MANTA Flow Platform. https://getmanta.com/
3. Hive metastore. https://docs.databricks.com/data/metastores/index.html
4. Unity Catalog. https://www.databricks.com/product/unity-catalog