Data Lineage Analysis for PySpark and Python ORM Libraries
| Thesis title in Czech: | Analýza datových toků pro PySpark a ORM knihovny jazyka Python |
|---|---|
| Thesis title in English: | Data Lineage Analysis for PySpark and Python ORM Libraries |
| Key words: | data lineage, python, symbolická analýza, dátové toky |
| English key words: | data lineage, data flow, python, symbolic analysis |
| Academic year of topic announcement: | 2021/2022 |
| Thesis type: | diploma thesis |
| Thesis language: | English |
| Department: | Department of Distributed and Dependable Systems (32-KDSS) |
| Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
| Author: | hidden - assigned and confirmed by the Study Dept. |
| Date of registration: | 09.05.2022 |
| Date of assignment: | 09.05.2022 |
| Confirmed by Study dept. on: | 13.05.2022 |
| Date and time of defence: | 06.06.2023 09:00 |
| Date of electronic submission: | 03.05.2023 |
| Date of submission of printed version: | 09.05.2023 |
| Date of proceeded defence: | 06.06.2023 |
| Opponents: | Mgr. Petr Škoda, Ph.D. |
Guidelines
In the world of ETL tools and data processing, Python is one of the main languages used in practice. Python scripts that define data manipulations typically rely on PySpark, the Python API for the Spark framework, alongside database libraries and their ORM features. These ORM features work in a similar way across most of the relevant libraries.
Recently, MANTA Flow, a highly automated data lineage analysis tool, was extended with a Python language scanner, which is now being extended further to support more commonly used frameworks. The main goals of this project are the following: (1) extend MANTA's Python scanner with support for analysis of the PySpark framework, (2) design an easily extensible base for the analysis of ORM features of Python libraries, and (3) implement ORM support for one database library, for example SQLAlchemy or SQLObject.

The main challenge of this project lies in the fact that Python is a dynamically typed language: many functions and variables are deduced (or created) during execution rather than declared up front, which causes potential problems for the current Python scanner, as it relies on analyzing static, unchanging code. For example, when a PySpark DataFrame object is created, the object initializes new attributes named after the loaded table's columns, without any explicit declaration in the code. Also, the way ORM models are defined is not universal and each database library has its own specifics; for this reason, the ORM-processing base must be flexible and provide a reasonable level of abstraction. In particular, the ORM core must be able to process the model definition and map it to the database.
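The dynamic-attribute problem described above can be illustrated with a small plain-Python sketch (a simplified analogy, not actual PySpark or MANTA code; the class and column names are hypothetical):

```python
# Simplified analogy of PySpark's DataFrame behaviour: column accessors
# appear at run time, derived from the loaded table's schema, with no
# explicit declaration anywhere in the source code.

class DataFrame:
    def __init__(self, schema):
        # Attributes are created only when the object is constructed,
        # so a purely static scan of the source never sees them declared.
        for column in schema:
            setattr(self, column, f"Column<{column}>")

# The schema is known only at run time (e.g. read from a database table).
df = DataFrame(["customer_id", "order_total"])

print(df.customer_id)  # accessible, although never declared statically
```

A static scanner that only tracks explicitly declared names would report `df.customer_id` as unknown; supporting PySpark therefore requires the analysis to model such runtime attribute creation symbolically.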
References
1. PySpark documentation. https://spark.apache.org/docs/latest/api/python/
2. SQLAlchemy ORM documentation. https://docs.sqlalchemy.org/en/14/orm/
3. SQLObject ORM documentation. http://www.sqlobject.org/SQLObject.html
4. Oskar Hýbl. Data Lineage Analysis of Frameworks with Complex Interaction Patterns. Master thesis. Charles University, Prague, 2020.
5. Dalibor Zeman. Extending Data Lineage Analysis Towards .NET Frameworks. Master thesis. Charles University, Prague, 2021.