Monitoring Tool for Distributed Java Applications
Thesis title in Czech: | Monitorovací nástroj pro distribuované aplikace v jazyce Java |
---|---|
Thesis title in English: | Monitoring Tool for Distributed Java Applications |
Key words: | monitorování, cluster, instrumentace, distribuované systémy, výkonnost |
English key words: | monitoring, cluster, instrumentation, distributed systems, performance |
Academic year of topic announcement: | 2015/2016 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 04.02.2016 |
Date of assignment: | 14.03.2016 |
Confirmed by Study dept. on: | 17.05.2016 |
Date and time of defence: | 12.06.2017 09:00 |
Date of electronic submission: | 09.05.2017 |
Date of submission of printed version: | 12.05.2017 |
Date of proceeded defence: | 12.06.2017 |
Opponents: | doc. RNDr. Petr Hnětynka, Ph.D. |
Guidelines |
Distributed applications are inherently complex. A combination of parallel processing with distributed computation makes applications hard to monitor and debug.
The main problem is the absence of a "global view" that would enable developers to trace and identify procedure calls, inspect their cost, and associate related calls invoked on different machines. Constructing such a global view has already been identified as a hard problem in distributed systems. However, any approximation of the global view would be very beneficial.
The goal of this thesis is to create a tool for fine-grained monitoring of distributed Java-based applications that can identify problematic behavior early as well as perform post-mortem crash analysis. The tool should be able to collect execution traces from machines in the cluster and present them in the form of a so-called "distributed trace", which enables tracing of calls across machines and computation of the cost of cross-machine calls (e.g., time and amount of transferred data). Such information is useful for debugging distributed systems and for detecting weak points in the cluster. Furthermore, frequent inspection of the current status and behavior of individual machines could enable runtime analysis and the use of machine learning methods to identify unusual behavior of the whole application (i.e., anomaly detection). Other desired features of the tool include application-level transparency and good support for data visualization.
From the technical point of view, when designing and implementing the tool, the candidate should make a good trade-off between computation overhead, universality, and transparency for application developers. Therefore, a recommended approach is to combine instrumentation with adaptive sampling in order to minimize the overhead. The prototype solution will be applied to distributed computation engines such as H2O. |
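To make the notion of a "distributed trace" more concrete, the following is a minimal sketch of how calls recorded on individual machines might be linked and how the cost of cross-machine calls might be derived. It is not the design of the thesis tool: the class `DistributedTraceSketch`, the `Span` fields, and the method `crossMachineCalls` are hypothetical names chosen for illustration. The sketch assumes that every recorded call carries a trace identifier and the identifier of its caller, in the spirit of the Dapper paper from the references. Because clocks on different machines are not synchronized, the duration is taken from the callee's own clock only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a distributed trace; all names are illustrative assumptions. */
final class DistributedTraceSketch {

    /** One monitored call (a "span") recorded on a single machine. */
    static final class Span {
        final String traceId;        // shared by all calls of one distributed trace
        final String spanId;         // unique identifier of this call
        final String parentId;       // identifier of the calling span, null for the root
        final String machine;        // machine that executed the call
        final long startNanos;       // start time on the executing machine's clock
        final long endNanos;         // end time on the executing machine's clock
        final long bytesTransferred; // data sent to invoke this call

        Span(String traceId, String spanId, String parentId, String machine,
             long startNanos, long endNanos, long bytesTransferred) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentId = parentId;
            this.machine = machine;
            this.startNanos = startNanos;
            this.endNanos = endNanos;
            this.bytesTransferred = bytesTransferred;
        }

        long durationNanos() {
            return endNanos - startNanos;
        }
    }

    /** Cost of one call whose caller and callee ran on different machines. */
    static final class CrossMachineCall {
        final String fromMachine;
        final String toMachine;
        final long durationNanos;
        final long bytesTransferred;

        CrossMachineCall(String fromMachine, String toMachine,
                         long durationNanos, long bytesTransferred) {
            this.fromMachine = fromMachine;
            this.toMachine = toMachine;
            this.durationNanos = durationNanos;
            this.bytesTransferred = bytesTransferred;
        }
    }

    /** Links each span to its parent and reports the calls that crossed machines. */
    static List<CrossMachineCall> crossMachineCalls(List<Span> spans) {
        Map<String, Span> byId = new HashMap<>();
        for (Span s : spans) {
            byId.put(s.spanId, s);
        }
        List<CrossMachineCall> result = new ArrayList<>();
        for (Span s : spans) {
            Span parent = (s.parentId == null) ? null : byId.get(s.parentId);
            boolean crossesMachines = parent != null
                    && parent.traceId.equals(s.traceId)
                    && !parent.machine.equals(s.machine);
            if (crossesMachines) {
                // Duration uses only the callee's clock, so clock skew does not distort it.
                result.add(new CrossMachineCall(parent.machine, s.machine,
                        s.durationNanos(), s.bytesTransferred));
            }
        }
        return result;
    }
}
```

Given spans collected from all machines of the cluster, `crossMachineCalls` returns one entry per call that left its caller's machine, together with its duration and the amount of transferred data. A real tool would additionally have to propagate the trace and parent identifiers across the network, which this sketch takes for granted.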
References |
1. B.H. Sigelman, L.A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Google, 2010. http://research.google.com/archive/papers/dapper-2010-1.pdf
2. I. Laguna, D.H. Ahn, B.R. de Supinski, T. Gamblin, G.L. Lee, M. Schulz, S. Bagchi, M. Kulkarni, B. Zhou, Z. Chen, and F. Qin. Debugging High-Performance Computing Applications at Massive Scales. Communications of the ACM, 58(9), 2015.
3. K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun. Making Sense of Performance in Data Analytics Frameworks. NSDI 2015, USENIX.
4. A.S. Tanenbaum and M. Van Steen. Distributed Systems: Principles and Paradigms. Pearson, 2nd edition, 2013.
5. BTrace: dynamic tracing tool for Java. https://kenai.com/projects/btrace
6. H2O: Platform for Fast Scalable Machine Learning. http://www.h2o.ai/ |