Extending Data Lineage Analysis for Python with Runtime Types
Thesis title in Czech: | Rozšíření analýzy datových toků pro jazyk Python o podporu běhových typů |
---|---|
Thesis title in English: | Extending Data Lineage Analysis for Python with Runtime Types |
Key words: | Python|datové toky|typová inference|Manta |
English key words: | Python|data flow|data lineage|type inference|Manta |
Academic year of topic announcement: | 2022/2023 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | Mgr. Václav Luňák - assigned and confirmed by the Study Dept. |
Date of registration: | 08.06.2023 |
Date of assignment: | 14.06.2023 |
Confirmed by Study dept. on: | 28.06.2023 |
Date and time of defence: | 14.02.2024 09:00 |
Date of electronic submission: | 10.01.2024 |
Date of submission of printed version: | 10.01.2024 |
Date of completed defence: | 14.02.2024 |
Opponents: | Mgr. Tomáš Petříček, Ph.D. |
Guidelines |
An important component of the data lineage analysis platform Manta Flow is the scanner for Python scripts, mainly because of the high popularity and wide use of Python in data management and data analytics. However, the current version of the Python scanner computes very approximate analysis results because of (i) the dynamic nature of Python and (ii) missing support for inferring precise information about the runtime types of program variables. One particular source of this approximation is the limited ability to precisely determine the set of possible targets of a given function invocation.
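The imprecision in call-target resolution described above can be illustrated with a small sketch. It is a toy model under stated assumptions, not the Manta scanner or its API: the classes `CsvSource`, `GzipCsvSource`, `DbSource`, and `AuditLog` and the helpers `targets_by_name` and `targets_by_type` are hypothetical names introduced only for this example. The sketch contrasts resolving a call such as `source.read()` by method name alone with resolving it from an inferred set of receiver types and the class hierarchy.

```python
# Hypothetical toy model of call-target resolution; not Manta's scanner API.

class CsvSource:
    def read(self):
        return "rows from a CSV file"

class GzipCsvSource(CsvSource):
    # Inherits read() from CsvSource; defines no read() of its own.
    pass

class DbSource:
    def read(self):
        return "rows from a database table"

class AuditLog:
    def read(self):
        return "audit entries"

ALL_CLASSES = [CsvSource, GzipCsvSource, DbSource, AuditLog]

def targets_by_name(method_name, classes):
    """Name-based resolution: every class that defines the method is a candidate."""
    return sorted(
        f"{cls.__name__}.{method_name}"
        for cls in classes
        if method_name in vars(cls)
    )

def targets_by_type(method_name, receiver_types):
    """Type-based resolution: walk each inferred receiver type's MRO and
    keep only the first class that actually defines the method."""
    targets = set()
    for receiver in receiver_types:
        for cls in receiver.__mro__:
            if method_name in vars(cls):
                targets.add(f"{cls.__name__}.{method_name}")
                break
    return sorted(targets)

# Without type information, source.read() may call any of three methods.
print(targets_by_name("read", ALL_CLASSES))
# If the receiver is known to be a GzipCsvSource, one target remains.
print(targets_by_type("read", [GzipCsvSource]))
```

In this sketch the set of inferred receiver types is supplied by hand; computing it for real code is the type inference problem that the tasks below address.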
The main goal of this project is to extend the Python scanner with support for computing information about runtime types of program variables (expressions) and using it within the data lineage analysis. We expect that successful completion of this project will involve the following specific tasks:
- Design and implementation of the module for processing the class hierarchy of Python applications.
- Development of an efficient algorithm for inference and tracking of runtime types of expressions used in analyzed Python code, which should enable more precise identification of possible target functions at each call.
- Adding support for analysis of callbacks and function pointers.
All the information provided by these extensions will be used to improve both precision and performance of the scanner. The actual impact of these changes on the overall precision and performance will be very thoroughly tested and empirically evaluated. |
References |
1. Python language, https://www.python.org/
2. Python Type Checking (Guide), https://realpython.com/python-type-checking/
3. PEP 484 - Type Hints, https://peps.python.org/pep-0484/
4. Yun Peng, Cuiyun Gao, Zongjie Li, Bowei Gao, David Lo, Qirun Zhang, and Michael R. Lyu. Static Inference Meets Deep Learning: A Hybrid Type Inference Approach for Python. ICSE 2022.
5. Types in Python, https://pyre-check.org/docs/types-in-python/ |