Balancing Keyword-Based Data and Queries in Distributed Storage Systems
Thesis title in Czech: | Vyvažování dat a dotazů založených na klíčových slovech v distribuovaných úložných systémech |
---|---|
Thesis title in English: | Balancing Keyword-Based Data and Queries in Distributed Storage Systems |
Key words: | distribuovaný systém, horizontální dělení dat, vyvažování zátěže |
English key words: | distributed system, sharding, load balancing |
Academic year of topic announcement: | 2019/2020 |
Thesis type: | diploma thesis |
Thesis language: | angličtina |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 30.10.2019 |
Date of assignment: | 02.11.2019 |
Confirmed by Study dept. on: | 19.11.2019 |
Date and time of defence: | 16.09.2020 09:00 |
Date of electronic submission: | 27.07.2020 |
Date of submission of printed version: | 30.07.2020 |
Date of proceeded defence: | 16.09.2020 |
Opponents: | RNDr. Filip Zavoral, Ph.D. |
Guidelines |
Practical performance of search over distributed storage systems greatly depends on the distribution of data to individual nodes and load balancing of queries. Many storage systems used in practice are characterized by very frequent queries and comparatively rare data updates. An illustrative example are online advertising systems, in which the search process is based on matching user queries against keywords (topics). One of the challenges related to day-to-day operation of such systems is to achieve balanced processing of queries over all nodes, when both data and queries are non-uniformly distributed with respect to the keywords. In particular, when all the data (texts of advertisements) related to a particular keyword are stored on a small set of nodes, these few nodes may quickly become overloaded.
The goal of this project is to improve balancing of data and queries over storage nodes within the context of a real advertising system. This involves the following subtasks to be performed: (1) thorough inspection of the existing advertising system and identification of specific problems that negatively impact balancing, (2) analysis of several possible approaches to addressing these problems, based on different data storage technologies and distribution algorithms, (3) selection of a practical solution with desired features, and (4) creating of a prototype implementation as a proof of concept and for the purpose of experimental evaluation. Important criteria include extensible architecture, scalability, performance, and easy maintenance. |
References |
1. J. Shute et al. F1: The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business. SIGMOD 2012, ACM.
2. A Gupta and J. Shute. High-Availability at Massive Scale: Building Google's Data Infrastructure for Ads. BIRTE 2015, LNBIP 337. 3. N. Murphy et al. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. 4. RocksDB key-value store. https://rocksdb.org/, https://github.com/facebook/rocksdb/. |