Analyzing Data Lineage in Database Frameworks
Thesis title in Czech: | Analýza datových toků ve databázových systémech |
---|---|
Thesis title in English: | Analyzing Data Lineage in Database Frameworks |
Key words: | dátové toky, statická analýza programu, Java frameworky |
English key words: | data lineage, data flow visualization, static program analysis, Java frameworks |
Academic year of topic announcement: | 2018/2019 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 11.10.2018 |
Date of assignment: | 25.10.2018 |
Confirmed by Study dept. on: | 20.11.2018 |
Date and time of defence: | 16.09.2019 09:00 |
Date of electronic submission: | 17.07.2019 |
Date of submission of printed version: | 17.07.2019 |
Date of proceeded defence: | 16.09.2019 |
Opponents: | doc. RNDr. Petr Hnětynka, Ph.D. |
Guidelines |
Large information systems are typically implemented using frameworks and libraries that provide basic infrastructure, database queries (e.g., in SQL), and code in programming languages such as Java and C#.
An important property of such systems is data lineage: the flow of data loaded from one database, through the program code, and back into another database. Data flow must be tracked for security and auditing purposes. MANTA is a tool that currently performs data lineage analysis of simple Java programs and classic SQL databases, such as Oracle.
The goal of this project is to implement support for:
- application frameworks, such as MyBatis and Hibernate, and
- big data frameworks, such as Apache Kafka and Apache Spark.
Only a few representative frameworks in each category will actually be considered. Necessary steps include the following:
- inspect the relevant frameworks to find the source code locations where they read data from or write data to a database, or perform file I/O,
- propose an approach for modeling data flow between database endpoints within the frameworks,
- define the semantics of relevant actions performed by the frameworks, in particular whether they correspond to reading or writing of data.
The student should also create a prototype implementation consisting of two parts: a generic interface to static analysis of data lineage and a module for each framework. |
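The two-part prototype described above (a generic interface plus one module per framework) could be organized roughly as in the following sketch. It is only an illustration under stated assumptions, not MANTA's actual design: the names `FrameworkModule`, `TableDrivenModule`, `LineageAnalyzer`, `LineageEvent`, and `FlowDirection` are hypothetical, and call sites are represented as plain method-name strings rather than the output of a real static analysis. The read/write classification of the listed Hibernate and Kafka methods is a simplified assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Whether a framework action reads data from or writes data to an endpoint. */
enum FlowDirection { READ, WRITE }

/** One classified framework call found in the analyzed program (hypothetical model). */
final class LineageEvent {
    final String framework;
    final String call;
    final FlowDirection direction;

    LineageEvent(String framework, String call, FlowDirection direction) {
        this.framework = framework;
        this.call = call;
        this.direction = direction;
    }

    @Override
    public String toString() {
        return framework + ": " + call + " -> " + direction;
    }
}

/** Hypothetical generic interface that every framework-specific module implements. */
interface FrameworkModule {
    String name();

    /** Returns the semantics of the call, or empty if this module does not know it. */
    Optional<FlowDirection> classify(String qualifiedMethod);
}

/** Simplest possible module: a lookup table from method name to read/write semantics. */
final class TableDrivenModule implements FrameworkModule {
    private final String name;
    private final Map<String, FlowDirection> semantics;

    TableDrivenModule(String name, Map<String, FlowDirection> semantics) {
        this.name = name;
        this.semantics = new HashMap<>(semantics);
    }

    public String name() { return name; }

    public Optional<FlowDirection> classify(String qualifiedMethod) {
        return Optional.ofNullable(semantics.get(qualifiedMethod));
    }
}

/** Framework-independent part: asks each registered module about each call site. */
final class LineageAnalyzer {
    private final List<FrameworkModule> modules = new ArrayList<>();

    void register(FrameworkModule module) { modules.add(module); }

    List<LineageEvent> analyze(List<String> calls) {
        List<LineageEvent> events = new ArrayList<>();
        for (String call : calls) {
            for (FrameworkModule module : modules) {
                Optional<FlowDirection> direction = module.classify(call);
                if (direction.isPresent()) {
                    events.add(new LineageEvent(module.name(), call, direction.get()));
                    break; // the first module that recognizes the call wins
                }
            }
        }
        return events;
    }
}

class LineageSketch {
    /** Builds an analyzer with two illustrative modules (assumed semantics). */
    static LineageAnalyzer defaultAnalyzer() {
        Map<String, FlowDirection> hibernate = new HashMap<>();
        hibernate.put("org.hibernate.Session.get", FlowDirection.READ);
        hibernate.put("org.hibernate.Session.persist", FlowDirection.WRITE);

        Map<String, FlowDirection> kafka = new HashMap<>();
        kafka.put("org.apache.kafka.clients.consumer.KafkaConsumer.poll", FlowDirection.READ);
        kafka.put("org.apache.kafka.clients.producer.KafkaProducer.send", FlowDirection.WRITE);

        LineageAnalyzer analyzer = new LineageAnalyzer();
        analyzer.register(new TableDrivenModule("Hibernate", hibernate));
        analyzer.register(new TableDrivenModule("Apache Kafka", kafka));
        return analyzer;
    }

    public static void main(String[] args) {
        List<String> calls = Arrays.asList(
            "org.hibernate.Session.get",
            "java.lang.String.length", // not a framework I/O call, so it is skipped
            "org.apache.kafka.clients.producer.KafkaProducer.send");
        for (LineageEvent event : defaultAnalyzer().analyze(calls)) {
            System.out.println(event);
        }
    }
}
```

Keeping the analyzer unaware of any particular framework means that supporting another framework only requires registering one more module; a real module would inspect call arguments and configuration (for example mapped tables or topic names) rather than rely on a fixed lookup table.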
References |
[1] R. Ikeda and J. Widom. Data Lineage: A Survey. Technical Report, Stanford InfoLab, 2009
[2] MyBatis, http://www.mybatis.org/mybatis-3/
[3] Hibernate, http://hibernate.org/
[4] Spring framework, https://spring.io/
[5] Apache Kafka, https://kafka.apache.org/
[6] Apache Spark, https://spark.apache.org/ |