Thesis (Selection of subject) (version: 368)
Thesis details
Analyzing Data Lineage in Database Frameworks
Thesis title in Czech: Analýza datových toků ve databázových systémech
Thesis title in English: Analyzing Data Lineage in Database Frameworks
Key words: dátové toky, statická analýza programu, Java frameworky
English key words: data lineage, data flow visualization, static program analysis, Java frameworks
Academic year of topic announcement: 2018/2019
Thesis type: diploma thesis
Thesis language: English
Department: Department of Distributed and Dependable Systems (32-KDSS)
Supervisor: doc. RNDr. Pavel Parízek, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 11.10.2018
Date of assignment: 25.10.2018
Confirmed by Study dept. on: 20.11.2018
Date and time of defence: 16.09.2019 09:00
Date of electronic submission: 17.07.2019
Date of submission of printed version: 17.07.2019
Date of proceeded defence: 16.09.2019
Opponents: doc. RNDr. Petr Hnětynka, Ph.D.
 
 
 
Guidelines
Large information systems are typically implemented using frameworks and libraries that provide basic infrastructure, database queries (e.g., in SQL), and code in programming languages such as Java and C#.
An important property of such systems is data lineage - the flow of data loaded from one database, through the program code, and back into another database.
Data flow must be tracked for the purpose of security and auditing.

MANTA is a tool that currently performs data lineage analysis of simple Java programs and classic SQL databases, like Oracle.
The goal of this project is to implement support for:
- application frameworks, such as MyBatis and Hibernate, and
- big data frameworks, like Apache Kafka and Apache Spark.
Only a few representative frameworks in each category will actually be considered.

Necessary steps include the following:
- inspect the relevant frameworks to find the source code locations where they read data from or write data to a database, or perform file I/O,
- propose an approach for modeling data flow between database endpoints within the frameworks,
- define the semantics of relevant actions performed by the frameworks, in particular whether they correspond to reading or writing of data.
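
The steps above can be illustrated with a minimal sketch. It is not the method of the thesis or of MANTA. It assumes that framework call sites have already been located and labelled as reads or writes, and that a separate static analysis supplies variable-to-variable flows. All names (Access, Action, reachable, lineage_edges) and the sample locations and endpoints are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class Action:
    """One framework call site that touches a database endpoint (hypothetical model)."""
    location: str   # source code location of the call
    endpoint: str   # database endpoint, for example a table name
    access: Access  # whether the call reads or writes data
    variable: str   # program variable that carries the data


def reachable(start, flows):
    """Return every variable that data in `start` may flow to, including `start`."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for src, dst in flows:
            if src == current and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def lineage_edges(actions, flows):
    """Return (source endpoint, target endpoint) pairs connected by data flow.

    `flows` is a set of (from_variable, to_variable) pairs, assumed to come
    from a separate static data-flow analysis of the program.
    """
    reads = [a for a in actions if a.access is Access.READ]
    writes = [a for a in actions if a.access is Access.WRITE]
    edges = set()
    for r in reads:
        targets = reachable(r.variable, flows)
        for w in writes:
            if w.variable in targets:
                edges.add((r.endpoint, w.endpoint))
    return edges


# Illustrative input: data read from one database reaches a write to another.
actions = [
    Action("OrderDao.java:42", "db1.orders", Access.READ, "order"),
    Action("ReportDao.java:17", "db2.reports", Access.WRITE, "report"),
]
flows = {("order", "total"), ("total", "report")}
edges = lineage_edges(actions, flows)
```

In this sketch an edge is reported only when a written variable is reachable from a read variable, so removing the intermediate flows removes the edge as well.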

The student should also create a prototype implementation that will have two parts: a generic interface to static analysis of data lineage and a module for each framework.
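
One possible shape for such a two-part prototype is sketched below: a generic analysis that dispatches to per-framework modules. This is an assumption for illustration only, not the design of the thesis. The class names (FrameworkModule, MethodTableModule, LineageAnalysis) are hypothetical. The method names are taken from MyBatis SqlSession and Hibernate Session, but their classification as reads or writes here is a simplification.

```python
from abc import ABC, abstractmethod


class FrameworkModule(ABC):
    """Hypothetical per-framework plugin used by the generic analysis."""

    name: str

    @abstractmethod
    def classify(self, method_name):
        """Return "read", "write", or None if the call is not a database access."""


class MethodTableModule(FrameworkModule):
    """Simplest possible module: a fixed table of read and write method names."""

    def __init__(self, name, reads, writes):
        self.name = name
        self._reads = frozenset(reads)
        self._writes = frozenset(writes)

    def classify(self, method_name):
        if method_name in self._reads:
            return "read"
        if method_name in self._writes:
            return "write"
        return None


class LineageAnalysis:
    """Generic part: forwards each call site to the module for its framework."""

    def __init__(self):
        self._modules = {}

    def register(self, module):
        self._modules[module.name] = module

    def classify_call(self, framework, method_name):
        module = self._modules.get(framework)
        return module.classify(method_name) if module else None


# Illustrative registration; the read/write tables are assumptions.
analysis = LineageAnalysis()
analysis.register(MethodTableModule(
    "mybatis",
    reads={"selectOne", "selectList"},
    writes={"insert", "update", "delete"},
))
analysis.register(MethodTableModule(
    "hibernate",
    reads={"get"},
    writes={"persist"},
))
```

Keeping the framework knowledge behind a small interface means that support for another framework is added by registering one more module, without changing the generic part.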
References
[1] R. Ikeda and J. Widom. Data Lineage: A Survey. Technical Report, Stanford InfoLab, 2009
[2] MyBatis, http://www.mybatis.org/mybatis-3/
[3] Hibernate, http://hibernate.org/
[4] Spring framework, https://spring.io/
[5] Apache Kafka, https://kafka.apache.org/
[6] Apache Spark, https://spark.apache.org/
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html