Balancing Keyword-Based Data and Queries in Distributed Storage Systems
Thesis title in Czech: | Vyvažování dat a dotazů založených na klíčových slovech v distribuovaných úložných systémech |
---|---|
Thesis title in English: | Balancing Keyword-Based Data and Queries in Distributed Storage Systems |
Keywords in Czech: | distribuovaný systém, horizontální dělení dat, vyvažování zátěže |
Keywords in English: | distributed system, sharding, load balancing |
Academic year of topic announcement: | 2019/2020 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Department of Distributed and Dependable Systems (32-KDSS) |
Supervisor: | doc. RNDr. Pavel Parízek, Ph.D. |
Author: | hidden |
Date of registration: | 30.10.2019 |
Date of assignment: | 02.11.2019 |
Date of confirmation by the Student Affairs Office: | 19.11.2019 |
Date and time of defence: | 16.09.2020 09:00 |
Date of electronic submission: | 27.07.2020 |
Date of printed submission: | 30.07.2020 |
Date of completed defence: | 16.09.2020 |
Opponents: | RNDr. Filip Zavoral, Ph.D. |
Guidelines |
The practical performance of search over distributed storage systems depends greatly on how data is distributed to individual nodes and on how queries are load-balanced. Many storage systems used in practice are characterized by very frequent queries and comparatively rare data updates. Online advertising systems are an illustrative example: their search process matches user queries against keywords (topics). One of the challenges in the day-to-day operation of such systems is to balance query processing across all nodes when both data and queries are non-uniformly distributed with respect to the keywords. In particular, when all the data (texts of advertisements) related to a particular keyword is stored on a small set of nodes, these few nodes may quickly become overloaded.
The goal of this project is to improve the balancing of data and queries over storage nodes in the context of a real advertising system. This involves the following subtasks: (1) a thorough inspection of the existing advertising system and identification of specific problems that negatively impact balancing, (2) an analysis of several possible approaches to addressing these problems, based on different data storage technologies and distribution algorithms, (3) selection of a practical solution with the desired features, and (4) creation of a prototype implementation as a proof of concept and for the purpose of experimental evaluation. Important criteria include an extensible architecture, scalability, performance, and easy maintenance. |
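The overload scenario described in the guidelines can be illustrated with a small simulation. The sketch below is not the method of the thesis or of any real advertising system; it only assumes a Zipf-like query distribution (the keyword of rank r receives a share proportional to 1/r) and compares plain hashing by keyword with one possible mitigation, spreading each keyword over several sub-shards. All names (`node_for`, `route_by_keyword`, `route_with_replicas`, `simulate`) and all parameter values are hypothetical.

```python
import hashlib
from collections import Counter


def node_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index with a stable hash.

    hashlib is used because Python's built-in hash() is salted per process.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def route_by_keyword(keyword: str, num_nodes: int) -> int:
    """Plain placement: every query for a keyword goes to the same node."""
    return node_for(keyword, num_nodes)


def route_with_replicas(keyword: str, query_id: int, num_nodes: int, replicas: int) -> int:
    """Assumed mitigation: spread a keyword over `replicas` sub-shards.

    Each sub-shard is hashed separately, so the queries for one keyword
    can land on up to `replicas` different nodes.
    """
    sub_shard = query_id % replicas
    return node_for(f"{keyword}#{sub_shard}", num_nodes)


def simulate(num_nodes: int = 8, num_keywords: int = 50,
             num_queries: int = 10_000, replicas: int = 4):
    """Count queries per node under both routing schemes."""
    # Assumed Zipf-like skew: the keyword of rank r has weight 1/r.
    weights = [1.0 / rank for rank in range(1, num_keywords + 1)]
    total = sum(weights)

    plain, spread = Counter(), Counter()
    query_id = 0
    for rank, weight in enumerate(weights, start=1):
        keyword = f"kw{rank}"
        for _ in range(round(num_queries * weight / total)):
            plain[route_by_keyword(keyword, num_nodes)] += 1
            spread[route_with_replicas(keyword, query_id, num_nodes, replicas)] += 1
            query_id += 1
    return plain, spread


if __name__ == "__main__":
    plain, spread = simulate()
    print("busiest node, hashing by keyword:", max(plain.values()))
    print("busiest node, with sub-shards:   ", max(spread.values()))
```

With plain hashing, the node that owns the most popular keyword receives at least that keyword's entire query share. Sub-sharding trades this for extra copies of the keyword's data on every node that hosts one of its sub-shards, which is more affordable when updates are comparatively rare, as the guidelines assume. Whether it actually helps in a given deployment depends on the real data and query distributions.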
References |
1. J. Shute et al. F1: The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business. SIGMOD 2012, ACM.
2. A. Gupta and J. Shute. High-Availability at Massive Scale: Building Google's Data Infrastructure for Ads. BIRTE 2015, LNBIP 337.
3. N. Murphy et al. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016.
4. RocksDB key-value store. https://rocksdb.org/, https://github.com/facebook/rocksdb/. |