Data Lineage Analysis for PySpark and Python ORM Libraries
| Thesis title in Czech: | Analýza datových toků pro PySpark a ORM knihovny jazyka Python |
|---|---|
| Thesis title in English: | Data Lineage Analysis for PySpark and Python ORM Libraries |
| Key words: | data lineage, python, symbolická analýza, dátové toky |
| English key words: | data lineage, data flow, python, symbolic analysis |
| Academic year of topic announcement: | 2021/2022 |
| Thesis type: | diploma thesis |
| Thesis language: | English |
| Department: | Department of Distributed and Dependable Systems (32-KDSS) |
| Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
| Author: | hidden - assigned and confirmed by the Study Dept. |
| Date of registration: | 09.05.2022 |
| Date of assignment: | 09.05.2022 |
| Confirmed by Study dept. on: | 13.05.2022 |
| Date and time of defence: | 06.06.2023 09:00 |
| Date of electronic submission: | 03.05.2023 |
| Date of submission of printed version: | 09.05.2023 |
| Date of proceeded defence: | 06.06.2023 |
| Opponents: | Mgr. Petr Škoda, Ph.D. |
Guidelines
In the world of ETL tools and data processing, Python is one of the main languages used in practice. Python scripts that define data manipulations typically rely on PySpark, the Python API for the Spark framework, alongside database libraries and their ORM features. These ORM features work in a similar way across most of the relevant libraries.
Recently, MANTA Flow, a highly automated data lineage analysis tool, was extended with a Python language scanner, which is now being extended further to support more commonly used frameworks. The main goals of this project are the following: (1) extend MANTA's Python scanner with support for analysis of the PySpark framework, (2) design an easily extensible base for the analysis of ORM features of Python libraries, and (3) implement ORM support for one database library, for example SQLAlchemy or SQLObject.

The main challenge of this project lies in the fact that Python is a dynamically typed language: many functions and variables are deduced (or created) during execution rather than declared up front, which causes potential problems for the current Python scanner, as it relies on analyzing static, unchanging code. For example, when a PySpark DataFrame object is created, the object initializes new attributes named after the loaded table's columns, without any explicit declaration in the code. Also, the way ORM models are defined is not universal and each database library has its own specifics; for this reason, the ORM-processing base must be flexible and provide a reasonable level of abstraction. In particular, the ORM core must be able to process the model definition and map it to the database.
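The dynamic-attribute problem described above can be illustrated with a small plain-Python sketch (a simplified analogy, not actual PySpark or MANTA code; the class and column names are hypothetical):

```python
# Simplified analogy of PySpark's DataFrame behaviour: column accessors
# appear at run time, derived from the loaded table's schema, with no
# explicit declaration anywhere in the source code.

class DataFrame:
    def __init__(self, schema):
        # Attributes are created only when the object is constructed,
        # so a purely static scan of the source never sees them declared.
        for column in schema:
            setattr(self, column, f"Column<{column}>")

# The schema is known only at run time (e.g. read from a database table).
df = DataFrame(["customer_id", "order_total"])

print(df.customer_id)  # accessible, although never declared statically
```

A static scanner that only tracks explicitly declared names would report `df.customer_id` as unknown; supporting PySpark therefore requires the analysis to model such runtime attribute creation symbolically.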
References
1. PySpark documentation. https://spark.apache.org/docs/latest/api/python/
2. SQLAlchemy ORM documentation. https://docs.sqlalchemy.org/en/14/orm/
3. SQLObject ORM documentation. http://www.sqlobject.org/SQLObject.html
4. Oskar Hýbl. Data Lineage Analysis of Frameworks with Complex Interaction Patterns. Master thesis. Charles University, Prague, 2020.
5. Dalibor Zeman. Extending Data Lineage Analysis Towards .NET Frameworks. Master thesis. Charles University, Prague, 2021.