Data Lineage Analysis for Databricks platform
Thesis title in Czech: | Analýza datových toků pro platformu Databricks |
---|---|
Thesis title in English: | Data Lineage Analysis for Databricks platform |
Keywords (Czech): | databricks, data lineage, data flow, symbolicka analyza |
Keywords (English): | databricks, data lineage, data flow, symbolic analysis |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Department |
Date of registration: | 20.10.2022 |
Date of assignment: | 21.10.2022 |
Date of confirmation by the Study Department: | 07.11.2022 |
Date and time of defence: | 06.09.2023 09:00 |
Date of electronic submission: | 17.07.2023 |
Date of printed submission: | 17.07.2023 |
Date of completed defence: | 06.09.2023 |
Opponents: | Mgr. Petr Škoda, Ph.D. |
Guidelines |
Databricks is a cloud-based platform used for data science, engineering and analytics. It combines the advantages of data warehouses and data lakes in a so-called lakehouse architecture. Users can write and run notebooks, i.e., scripts that access various data sources and process the loaded data with code snippets in languages such as Python or R.
The main goal of this project is to extend the MANTA Flow platform for automated data lineage analysis with support for Databricks. Work on this project should have two phases:
(1) Design and implementation of generic basic support for data lineage scanners of notebook platforms, such as Databricks and Jupyter Notebook. The main characteristic of notebooks is the combination of various languages and SQL dialects.
(2) Creating one instance of the generic support (framework) that will be tailored to the Databricks platform and its API.

Specific tasks and expected outputs also include the following:
- Implementing support for (1) invoking existing scanners for various languages and technologies (such as Python or SQL), and (2) merging outputs from those scanners.
- Extending the MANTA Python scanner so that it properly handles library procedures and features typically used in Databricks notebooks.
- Maintaining the shared context (analysis state) between the invocations of different individual scanners (e.g., for Python and SQL).
- Implementing extraction (loading) of metadata from the respective storage (e.g., Hive metastore or Unity Catalog).

An important part of this project will be a thorough analysis of the relevant technologies. |
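The tasks above (invoking per-language scanners, merging their outputs, and keeping a shared context between invocations) can be illustrated with a minimal sketch. Everything in it is hypothetical: `SharedContext`, `Cell`, `split_cells`, `scan_python`, `scan_sql` and `analyze_notebook` are illustrative names, not part of MANTA Flow or any Databricks API, and the two regex-based scanners are toy stand-ins for real language scanners. The sketch also assumes the notebook is available in a source export where cells are separated by `# COMMAND ----------` and non-default-language cells are prefixed with `# MAGIC`.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source, target) lineage edge


@dataclass
class SharedContext:
    """Analysis state shared by all scanner invocations (hypothetical)."""
    # Maps a temporary view name to the tables it was derived from.
    views: Dict[str, Set[str]] = field(default_factory=dict)


@dataclass
class Cell:
    language: str
    source: str


Scanner = Callable[[Cell, SharedContext], List[Edge]]


def split_cells(notebook_source: str, default_language: str = "python") -> List[Cell]:
    """Split an exported notebook into cells (assumed export format)."""
    cells = []
    for chunk in notebook_source.split("# COMMAND ----------"):
        lines = [ln for ln in chunk.splitlines() if ln.strip()]
        if lines and lines[0].startswith("# Databricks notebook source"):
            lines = lines[1:]
        if not lines:
            continue
        if lines[0].startswith("# MAGIC %"):
            # Drop the "# MAGIC" prefix; the first token names the language.
            body = [ln[len("# MAGIC"):].strip() if ln.startswith("# MAGIC") else ln
                    for ln in lines]
            language, _, rest = body[0][1:].partition(" ")
            body = ([rest] if rest else []) + body[1:]
            cells.append(Cell(language, "\n".join(body)))
        else:
            cells.append(Cell(default_language, "\n".join(lines)))
    return cells


def scan_python(cell: Cell, ctx: SharedContext) -> List[Edge]:
    """Toy scanner: table read directly into a temporary view."""
    pattern = re.compile(
        r'spark\.read\.table\("([\w.]+)"\)\.createOrReplaceTempView\("(\w+)"\)')
    edges = []
    for table, view in pattern.findall(cell.source):
        ctx.views.setdefault(view, set()).add(table)  # publish for later cells
        edges.append((table, view))
    return edges


def scan_sql(cell: Cell, ctx: SharedContext) -> List[Edge]:
    """Toy scanner: INSERT INTO target SELECT ... FROM source."""
    pattern = re.compile(
        r"INSERT\s+INTO\s+([\w.]+)\s+SELECT\s+.*?\s+FROM\s+([\w.]+)",
        re.IGNORECASE | re.DOTALL)
    edges = []
    for target, source in pattern.findall(cell.source):
        # Resolve views registered by earlier cells through the shared context.
        origins = ctx.views.get(source, {source})
        edges.extend((origin, target) for origin in sorted(origins))
    return edges


def analyze_notebook(source: str, scanners: Dict[str, Scanner]) -> List[Edge]:
    """Run the matching scanner on each cell and merge the edges."""
    ctx = SharedContext()
    merged: List[Edge] = []
    for cell in split_cells(source):
        scanner = scanners.get(cell.language)
        if scanner is None:
            continue  # e.g. markdown cells carry no lineage
        for edge in scanner(cell, ctx):
            if edge not in merged:
                merged.append(edge)
    return merged


SCANNERS: Dict[str, Scanner] = {"python": scan_python, "sql": scan_sql}

NOTEBOOK = '''# Databricks notebook source
spark.read.table("raw.events").createOrReplaceTempView("events_v")

# COMMAND ----------

# MAGIC %sql
# MAGIC INSERT INTO curated.daily
# MAGIC SELECT day, count(*) FROM events_v GROUP BY day
'''

if __name__ == "__main__":
    for src, dst in analyze_notebook(NOTEBOOK, SCANNERS):
        print(f"{src} -> {dst}")
```

In this sketch the SQL cell can trace `curated.daily` back to `raw.events` only because the Python cell recorded the temporary view in the shared context, which is the kind of cross-language state the assignment asks to maintain.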
References |
1. Databricks documentation. https://docs.databricks.com/
2. MANTA Flow Platform. https://getmanta.com/
3. Hive metastore. https://docs.databricks.com/data/metastores/index.html
4. Unity Catalog. https://www.databricks.com/product/unity-catalog |