Thesis (Selection of subject)

Monitoring Tool for Distributed Java Applications

Thesis title in Czech:	Monitorovací nástroj pro distribuované aplikace v jazyce Java
Thesis title in English:	Monitoring Tool for Distributed Java Applications
Key words:	monitorování, cluster, instrumentace, distribuované systémy, výkonnost
English key words:	monitoring, cluster, instrumentation, distributed systems, performance
Academic year of topic announcement:	2015/2016
Thesis type:	diploma thesis
Thesis language:	English
Department:	Department of Distributed and Dependable Systems (32-KDSS)
Supervisor:	doc. RNDr. Pavel Parízek, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	04.02.2016
Date of assignment:	14.03.2016
Confirmed by Study dept. on:	17.05.2016
Date and time of defence:	12.06.2017 09:00
Date of electronic submission:	09.05.2017
Date of submission of printed version:	12.05.2017
Date of proceeded defence:	12.06.2017
Opponents:	doc. RNDr. Petr Hnětynka, Ph.D.

Guidelines

Distributed applications are inherently complex. A combination of parallel processing with distributed computation makes applications hard to monitor and debug.
The main problem is the absence of a "global view", which would enable developers to trace and identify procedure calls, inspect their cost and associate related calls invoked on different machines.
Constructing such a global view has already been identified as a hard problem in distributed systems. However, even an approximation of the global view would be very beneficial.

The goal of this thesis is to create a tool for fine-grained monitoring of distributed Java-based applications that can identify problematic behavior early as well as perform post-mortem crash analysis.
The tool should be able to collect execution traces from machines in the cluster and present them in the form of a so-called "distributed trace", which enables tracing of calls across machines and computation of the cost of cross-machine calls (e.g., time and amount of transferred data).
Such information is useful for debugging distributed systems and for detecting weak points in the cluster.
Furthermore, frequent inspection of the current status and behavior of individual machines could enable runtime analysis and utilization of machine learning methods for identification of unusual behavior of the whole application (i.e., anomaly detection).
Other desired features of the tool include application-level transparency and good support for data visualization.
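
To make the notion of a "distributed trace" more concrete, the following is a minimal sketch of one possible data model, in the spirit of the span trees described in the Dapper paper. It is not the design prescribed by this assignment: the class `DistributedTraceSketch`, the `Span` fields, and the helper `crossMachineBytes` are hypothetical names chosen for illustration. Each span records one call on one machine, and a call counts as cross-machine when its parent span ran on a different node. Durations are computed only from timestamps taken on the same machine, since clocks on different machines cannot be assumed to be synchronized.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DistributedTraceSketch {

    /** One timed call on one machine; parentId is null for the root call of a trace. */
    static final class Span {
        final long traceId;          // shared by all spans of one distributed trace
        final long spanId;           // unique within the trace
        final Long parentId;         // span of the caller, possibly on another node
        final String node;           // hypothetical machine identifier
        final String method;
        final long startNanos;       // both timestamps come from the same local clock
        final long endNanos;
        final long bytesTransferred; // payload received with the call, 0 if local

        Span(long traceId, long spanId, Long parentId, String node, String method,
             long startNanos, long endNanos, long bytesTransferred) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentId = parentId;
            this.node = node;
            this.method = method;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
            this.bytesTransferred = bytesTransferred;
        }

        long durationNanos() {
            return endNanos - startNanos;
        }
    }

    /** A span is cross-machine when its parent span ran on a different node. */
    static boolean isCrossMachine(Span span, Map<Long, Span> byId) {
        if (span.parentId == null) {
            return false;
        }
        Span parent = byId.get(span.parentId);
        return parent != null && !parent.node.equals(span.node);
    }

    /** Total data transferred by cross-machine calls in one trace. */
    static long crossMachineBytes(List<Span> spans) {
        Map<Long, Span> byId = new HashMap<>();
        for (Span s : spans) {
            byId.put(s.spanId, s);
        }
        long total = 0;
        for (Span s : spans) {
            if (isCrossMachine(s, byId)) {
                total += s.bytesTransferred;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Span> trace = Arrays.asList(
            new Span(1, 10, null, "node-a", "submitJob", 0, 900, 0),
            new Span(1, 11, 10L, "node-b", "computeChunk", 100, 600, 2048),
            new Span(1, 12, 10L, "node-a", "mergeResults", 650, 850, 0));
        System.out.println("cross-machine bytes: " + crossMachineBytes(trace));
    }
}
```

In this sketch only `computeChunk` crosses a machine boundary, so it alone contributes to the transferred-data cost. A real tool would also have to propagate the trace and span identifiers along with each remote call.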

From a technical point of view, the candidate should strike a good trade-off between computational overhead, universality, and transparency for application developers when designing and implementing the tool.
Therefore, a recommended approach is to combine instrumentation with adaptive sampling in order to minimize the overhead.
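
As an illustration of how adaptive sampling can bound the overhead, the sketch below shows one simple policy: instrumented code asks a sampler whether to record a call, and the sampling rate is recomputed at the end of each time window so that roughly a fixed number of calls is recorded regardless of load. This is only one possible policy under stated assumptions, not the approach mandated by the assignment. The class name `AdaptiveSamplerSketch`, its parameters, and the window-based adjustment rule are all hypothetical.

```java
import java.util.Random;

public class AdaptiveSamplerSketch {
    private final double targetSamplesPerWindow; // desired recorded calls per window
    private final double minRate;                // lower bound so rare paths stay visible
    private final Random random;
    private double rate = 1.0;                   // start by recording every call
    private long callsInWindow = 0;

    AdaptiveSamplerSketch(double targetSamplesPerWindow, double minRate, long seed) {
        this.targetSamplesPerWindow = targetSamplesPerWindow;
        this.minRate = minRate;
        this.random = new Random(seed);
    }

    /** Called by instrumented code at each traced call; true means "record this call". */
    synchronized boolean shouldSample() {
        callsInWindow++;
        return random.nextDouble() < rate;
    }

    /** Called periodically (e.g., once per second) to adapt the rate to the observed load. */
    synchronized void endWindow() {
        if (callsInWindow > 0) {
            double ideal = targetSamplesPerWindow / callsInWindow;
            rate = Math.max(minRate, Math.min(1.0, ideal));
        }
        callsInWindow = 0;
    }

    synchronized double rate() {
        return rate;
    }

    public static void main(String[] args) {
        AdaptiveSamplerSketch sampler = new AdaptiveSamplerSketch(100, 0.001, 42L);
        for (int i = 0; i < 10_000; i++) {
            sampler.shouldSample();
        }
        sampler.endWindow();
        System.out.println("rate after a busy window: " + sampler.rate());
    }
}
```

Under heavy load the rate drops, so the monitoring cost stays roughly constant, while under light load every call is recorded. The `minRate` floor is a design choice that trades a little overhead for continued visibility into very hot code paths.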

The prototype solution will be applied to distributed computation engines such as H2O.

References

1. B.H. Sigelman, L.A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google, 2010. http://research.google.com/archive/papers/dapper-2010-1.pdf
2. I. Laguna, D.H. Ahn, B.R. de Supinski, T. Gamblin, G.L. Lee, M. Schulz, S. Bagchi, M. Kulkarni, B. Zhou, Z. Chen, and F. Qin. Debugging High-Performance Computing Applications at Massive Scales. Communications of the ACM, 58(9), 2015.
3. K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun. Making Sense of Performance in Data Analytics Frameworks. NSDI 2015, USENIX.
4. A.S. Tanenbaum and M. Van Steen. Distributed Systems: Principles and Paradigms. Pearson, 2nd edition, 2013.
5. BTrace: dynamic tracing tool for Java. https://kenai.com/projects/btrace
6. H2O: Platform for Fast Scalable Machine Learning. http://www.h2o.ai/