Thesis (Selection of subject) (version: 368)
Thesis details
Data Lineage Analysis of Frameworks with Complex Interaction Patterns
Thesis title in Czech: Analýza datových toků pro knihovny se složitými vzory interakcí
Thesis title in English: Data Lineage Analysis of Frameworks with Complex Interaction Patterns
Key words: datové toky, statická analýza programu, Java frameworky, Apache Spark
English key words: data lineage, static program analysis, Java frameworks, Apache Spark
Academic year of topic announcement: 2019/2020
Thesis type: diploma thesis
Thesis language: English
Department: Department of Distributed and Dependable Systems (32-KDSS)
Supervisor: doc. RNDr. Pavel Parízek, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 08.01.2020
Date of assignment: 08.01.2020
Confirmed by Study dept. on: 13.01.2020
Date and time of defence: 16.09.2020 09:00
Date of electronic submission: 30.07.2020
Date of submission of printed version: 30.07.2020
Date of proceeded defence: 16.09.2020
Opponents: doc. RNDr. Petr Hnětynka, Ph.D.
Guidelines
Business information systems are often built from various interconnected databases and programs that process and manipulate data. In many software systems of this kind, data lineage analysis needs to be performed occasionally, for example to identify the data sources for values presented in summary reports and for the purpose of auditing. While data lineage analysis can be done manually, the task is often very complex and tedious. Manta Flow is an automated data lineage platform that supports many industrial SQL databases, basic Java programs that use core libraries for interaction with databases and standard I/O operations, and also rather simple data processing frameworks such as MyBatis and Spring JDBC Template. The module that performs data lineage analysis of Java programs is called the Java scanner. It is based on symbolic analysis of the possible behaviors of Java programs and can be extended with plugins that handle data processing frameworks.

The main goal of this project is to extend the capabilities of the Java scanner with support for these additional frameworks:
- Apache Spark, a general-purpose cluster-computing framework.
- Hibernate, one of the most popular ORM frameworks.
Usage of these frameworks in Java applications involves much more complex interaction patterns than usage of MyBatis and Spring JDBC Template.

In the scope of this project, the student will perform the following specific tasks:
- Analyze the frameworks to identify the interfaces responsible for data access, file I/O and, in the case of Spark, processing of complex datasets.
- Extend the current plugin system (i.e., the interface between plugins and the core symbolic data lineage analysis) to support the more complicated interaction patterns between client Java programs and frameworks that are characteristic of Spark and Hibernate.
- Implement plugins for the Java scanner that enable data lineage analysis of programs that use Apache Spark and Hibernate, respectively.
- Compare the general approach to data lineage analysis used by Manta Flow with other tools that aim to solve the same problem using different means. An example of such a tool is Spline, which tracks queries during execution of Apache Spark jobs.
An important part of the Hibernate plugin will be generic support for the Java Persistence API (JPA), which could make it much easier to implement support for other JPA-based frameworks in the future.
References
1. R. Ikeda and J. Widom. Data Lineage: A Survey. Technical Report, Stanford InfoLab, 2009.
2. Apache Spark, https://spark.apache.org/
3. Spline, https://absaoss.github.io/spline/
4. Hibernate, http://hibernate.org/
5. Richard Eliáš. Analyzing Data Lineage in Database Frameworks. Master's thesis, Charles University, Prague, 2019.