Thesis (Selection of subject)

Your browser does not support JavaScript, or its support is disabled. Some features may not be available.

Balancing Keyword-Based Data and Queries in Distributed Storage Systems

Thesis title in Czech:	Vyvažování dat a dotazů založených na klíčových slovech v distribuovaných úložných systémech
Thesis title in English:	Balancing Keyword-Based Data and Queries in Distributed Storage Systems
Key words:	distribuovaný systém, horizontální dělení dat, vyvažování zátěže
English key words:	distributed system, sharding, load balancing
Academic year of topic announcement:	2019/2020
Thesis type:	diploma thesis
Thesis language:	angličtina
Department:	Department of Distributed and Dependable Systems (32-KDSS)
Supervisor:	doc. RNDr. Pavel Parízek, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	30.10.2019
Date of assignment:	02.11.2019
Confirmed by Study dept. on:	19.11.2019
Date and time of defence:	16.09.2020 09:00
Date of electronic submission:	27.07.2020
Date of submission of printed version:	30.07.2020
Date of proceeded defence:	16.09.2020
Opponents:	RNDr. Filip Zavoral, Ph.D.

Guidelines

Practical performance of search over distributed storage systems greatly depends on the distribution of data to individual nodes and load balancing of queries. Many storage systems used in practice are characterized by very frequent queries and comparatively rare data updates. An illustrative example are online advertising systems, in which the search process is based on matching user queries against keywords (topics). One of the challenges related to day-to-day operation of such systems is to achieve balanced processing of queries over all nodes, when both data and queries are non-uniformly distributed with respect to the keywords. In particular, when all the data (texts of advertisements) related to a particular keyword are stored on a small set of nodes, these few nodes may quickly become overloaded.
The goal of this project is to improve balancing of data and queries over storage nodes within the context of a real advertising system. This involves the following subtasks to be performed:
(1) thorough inspection of the existing advertising system and identification of specific problems that negatively impact balancing,
(2) analysis of several possible approaches to addressing these problems, based on different data storage technologies and distribution algorithms,
(3) selection of a practical solution with desired features, and
(4) creating of a prototype implementation as a proof of concept and for the purpose of experimental evaluation.
Important criteria include extensible architecture, scalability, performance, and easy maintenance.

References

1. J. Shute et al. F1: The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business. SIGMOD 2012, ACM.
2. A Gupta and J. Shute. High-Availability at Massive Scale: Building Google's Data Infrastructure for Ads. BIRTE 2015, LNBIP 337.
3. N. Murphy et al. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016.
4. RocksDB key-value store. https://rocksdb.org/, https://github.com/facebook/rocksdb/.