Thesis topics (Thesis selection)
Thesis details
Data Lineage Analysis of Frameworks with Complex Interaction Patterns
Thesis title in Czech: Analýza datových toků pro knihovny se složitými vzory interakcí
Thesis title in English: Data Lineage Analysis of Frameworks with Complex Interaction Patterns
Keywords (Czech): datové toky, statická analýza programu, Java frameworky, Apache Spark
Keywords (English): data lineage, static program analysis, Java frameworks, Apache Spark
Academic year of topic announcement: 2019/2020
Thesis type: diploma thesis
Thesis language: English
Department: Department of Distributed and Dependable Systems (32-KDSS)
Supervisor: doc. RNDr. Pavel Parízek, Ph.D.
Author: hidden - assigned and confirmed by the Study Department
Date of registration: 08.01.2020
Date of assignment: 08.01.2020
Date of confirmation by the Study Department: 13.01.2020
Date and time of defence: 16.09.2020 09:00
Date of electronic submission: 30.07.2020
Date of submission of the printed version: 30.07.2020
Date of completed defence: 16.09.2020
Opponents: doc. RNDr. Petr Hnětynka, Ph.D.
 
 
 
Guidelines
Business information systems are often built from various interconnected databases and programs that process and manipulate the data. In many such systems, data lineage analysis needs to be performed occasionally, for example to identify the data sources of values presented in summary reports and for auditing purposes. Although data lineage analysis can be done manually, the task is often very complex and tedious. Manta Flow is an automated data lineage platform that supports many industrial SQL databases, basic Java programs that use core libraries for interaction with databases and standard I/O operations, and also rather simple data processing frameworks such as MyBatis and Spring JDBC Template. The module that performs data lineage analysis of Java programs is called the Java scanner. It is based on symbolic analysis of the possible behaviors of Java programs and can be extended by plugins that handle data processing frameworks.
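
To make the notion concrete, the following minimal sketch in plain Java illustrates the question that data lineage analysis answers: given a value shown in a report, which data sources may it originate from? The class `LineageGraph`, its methods, and all element names are hypothetical and serve only as an illustration; they are not part of Manta Flow.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model: nodes are named data elements, edges are direct data flows.
public class LineageGraph {
    // Maps each target element to the set of elements that flow directly into it.
    private final Map<String, Set<String>> directSources = new HashMap<>();

    // Record that data flows from `source` into `target`.
    public void addFlow(String source, String target) {
        directSources.computeIfAbsent(target, k -> new HashSet<>()).add(source);
    }

    // Return all origin elements (elements with no incoming flow) that reach `target`.
    public Set<String> originsOf(String target) {
        Set<String> origins = new TreeSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(target);
        while (!work.isEmpty()) {
            String node = work.pop();
            if (!visited.add(node)) {
                continue; // already processed, also guards against cycles
            }
            Set<String> sources = directSources.get(node);
            if (sources == null || sources.isEmpty()) {
                if (!node.equals(target)) {
                    origins.add(node); // nothing flows into this node: it is an origin
                }
            } else {
                work.addAll(sources); // keep walking backwards
            }
        }
        return origins;
    }

    public static void main(String[] args) {
        LineageGraph g = new LineageGraph();
        g.addFlow("db.orders.amount", "java.totalVar");
        g.addFlow("db.customers.discount", "java.totalVar");
        g.addFlow("java.totalVar", "report.summary.total");
        // prints [db.customers.discount, db.orders.amount]
        System.out.println(g.originsOf("report.summary.total"));
    }
}
```

An automated tool has to derive the flow edges themselves from SQL scripts and program code; the backward traversal shown here is the easy part.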

The main goal of this project is to extend the capabilities of the Java scanner with support for these additional frameworks:
- Apache Spark, a general-purpose cluster-computing framework.
- Hibernate, one of the most popular ORM frameworks.
Using these frameworks in Java applications involves much more complex interaction patterns than using MyBatis and Spring JDBC Template.

In the scope of this project, the student will perform the following specific tasks:
- Analyze the frameworks to identify the interfaces responsible for data access, file I/O and, in the case of Spark, processing of complex datasets.
- Extend the current plugin system (i.e., the interface between plugins and the core symbolic data lineage analysis) to support the more complicated interaction patterns between client Java programs and frameworks that are characteristic of Spark and Hibernate.
- Implement plugins for the Java scanner that enable data lineage analysis of programs that also use Apache Spark and Hibernate, respectively.
- Compare the general approach to data lineage analysis used by Manta Flow with other tools that aim to solve the same problem using different means. An example of such a tool is Spline, which tracks queries during execution of Apache Spark jobs.
An important part of the Hibernate plugin will be generic support for the Java Persistence API (JPA), which could make it much easier to add support for other JPA-based frameworks in the future.
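
The kind of plugin extension described in the tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `FrameworkPlugin` interface, the `SymbolicValue` class, and the simplified method names (`read`, `filter`, `select`, `write`) are hypothetical and do not reflect the actual Java scanner API or the exact Spark API. The sketch shows why chained framework calls need more than a per-call mapping: the lineage has to be carried through intermediate objects until a sink is reached.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PluginSketch {
    // A symbolic value: the set of data sources that may have contributed to it.
    static final class SymbolicValue {
        final Set<String> sources = new TreeSet<>();

        static SymbolicValue of(String... labels) {
            SymbolicValue v = new SymbolicValue();
            Collections.addAll(v.sources, labels);
            return v;
        }
    }

    // Hypothetical plugin contract: model a single framework call symbolically.
    interface FrameworkPlugin {
        // Returns the symbolic result of the call and may append discovered flows to `flows`.
        SymbolicValue onCall(String method, SymbolicValue receiver, String literalArg, List<String> flows);
    }

    // Toy model of a Spark-like read/transform/write chain (method names simplified).
    static final class ToySparkPlugin implements FrameworkPlugin {
        @Override
        public SymbolicValue onCall(String method, SymbolicValue receiver, String literalArg, List<String> flows) {
            switch (method) {
                case "read": // a new dataset that originates from the given path
                    return SymbolicValue.of("file:" + literalArg);
                case "filter": // transformations preserve the lineage of the receiver
                case "select":
                    return receiver;
                case "write": // sink: emit one flow per contributing source
                    for (String src : receiver.sources) {
                        flows.add(src + " -> file:" + literalArg);
                    }
                    return receiver;
                default:
                    throw new IllegalArgumentException("unmodelled method: " + method);
            }
        }
    }

    // Simulates the analysis of a chained call sequence: read, filter, select, write.
    public static List<String> analyzeExample() {
        FrameworkPlugin plugin = new ToySparkPlugin();
        List<String> flows = new ArrayList<>();
        SymbolicValue ds = plugin.onCall("read", null, "in.csv", flows);
        ds = plugin.onCall("filter", ds, "amount > 0", flows);
        ds = plugin.onCall("select", ds, "amount", flows);
        plugin.onCall("write", ds, "out.parquet", flows);
        return flows;
    }

    public static void main(String[] args) {
        // prints [file:in.csv -> file:out.parquet]
        System.out.println(analyzeExample());
    }
}
```

In this toy model the flow is only discovered at the `write` call, several calls after the source was introduced. A static analysis has to derive the same information from program code without running it, whereas a runtime tool such as Spline observes the executed job directly.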
References
1. R. Ikeda and J. Widom. Data Lineage: A Survey. Technical Report, Stanford InfoLab, 2009.
2. Apache Spark, https://spark.apache.org/
3. Spline, https://absaoss.github.io/spline/
4. Hibernate, http://hibernate.org/
5. Richard Eliáš. Analyzing Data Lineage in Database Frameworks. Master's thesis, Charles University, Prague, 2019.
 
Charles University | Charles University Information System