Low-resource Text Classification
Title in Czech: | Klasifikace textu s omezeným množstvím dat |
---|---|
Title in English: | Low-resource Text Classification |
Keywords (Czech): | klasifikace textu, omezené množství dat, BERT |
Keywords (English): | text classification, low-resource, BERT |
Academic year of announcement: | 2019/2020 |
Thesis type: | master's thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | RNDr. Milan Straka, Ph.D. |
Author: | hidden |
Date of registration: | 03.07.2020 |
Date of assignment: | 22.04.2021 |
Confirmed by the Study Department on: | 18.05.2021 |
Date and time of defence: | 02.09.2021 09:00 |
Date of electronic submission: | 22.07.2021 |
Date of submission of printed version: | 22.07.2021 |
Date of defence: | 02.09.2021 |
Opponents: | Mgr. Martin Popel, Ph.D. |
Guidelines
In natural language processing, unsupervised techniques for pretraining "language understanding" from raw text are common. Recently, the BERT model (Devlin et al., 2018) and its variants (Liu et al., 2019; Lan et al., 2019) have demonstrated substantial improvements on many downstream NLP tasks by pretraining so-called contextualized embeddings from raw text. One such substantially improved task is text classification.
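To illustrate the setup, the following is a minimal sketch of how a pretrained BERT encoder is reused for classification: the encoder supplies contextualized embeddings, and a small, randomly initialized classification head on top is then fine-tuned on the target task. It assumes the Hugging Face `transformers` library and the public `bert-base-multilingual-cased` checkpoint; neither choice is prescribed by the assignment.

```python
# Minimal sketch (assumption: Hugging Face `transformers`, multilingual BERT).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # any BERT-like checkpoint could be used

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The pretrained encoder is loaded; the classification head on top is
# randomly initialized and would be fine-tuned on the target task.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Contextualized embeddings of a Czech sentence are pooled and passed through
# the classification head; before fine-tuning the prediction is arbitrary.
inputs = tokenizer("Tento film se mi moc líbil.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())
```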
Apart from the accuracy improvements in high-resource text classification, the extensive pretraining allows text classification to be used in low-resource settings, and even in cross-lingual scenarios (Lewis et al., 2019). The goal of the thesis is to evaluate text classification in low-resource settings, analyzing how system accuracy depends on the size and quality of the training data. The focus will be on Czech text classification tasks, with evaluation performed on a diverse set of data, for example the Czech sentiment data (Habernal et al., 2013), the Czech Text Document Corpus 2.0 (Král et al., 2018), or similar; a sketch of such a data-size study follows.
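The sketch below shows one way the planned analysis could be run: the same pretrained model is fine-tuned on progressively larger random subsets of the training data and test accuracy is recorded for each subset size. It assumes the Hugging Face `transformers` and `datasets` libraries and the `bert-base-multilingual-cased` checkpoint; the toy Czech sentences are placeholders for the corpora named above, and none of these choices are prescribed by the assignment.

```python
import random
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Placeholder data; in the thesis these would come from the Czech sentiment
# corpus (Habernal et al., 2013) or the Czech Text Document Corpus 2.0.
train_texts = ["skvělý film", "hrozná nuda", "celkem průměr"] * 300
train_labels = [2, 0, 1] * 300
test_texts = ["výborné zpracování", "nedá se na to dívat"]
test_labels = [2, 0]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

def accuracy(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((predictions == eval_pred.label_ids).mean())}

test_ds = Dataset.from_dict({"text": test_texts, "label": test_labels}).map(
    tokenize, batched=True)

results = {}
for size in (50, 100, 500, len(train_texts)):  # training-set sizes to compare
    # Sample a random low-resource subset of the training data.
    idx = random.sample(range(len(train_texts)), k=size)
    train_ds = Dataset.from_dict(
        {"text": [train_texts[i] for i in idx],
         "label": [train_labels[i] for i in idx]}).map(tokenize, batched=True)
    # Start from the same pretrained checkpoint for every subset size.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(set(train_labels)))
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"out_{size}", num_train_epochs=3,
                               per_device_train_batch_size=16, report_to=[]),
        train_dataset=train_ds,
        eval_dataset=test_ds,
        compute_metrics=accuracy,
    )
    trainer.train()
    results[size] = trainer.evaluate()["eval_accuracy"]

print(results)  # test accuracy as a function of training-set size
```

Plotting the recorded accuracies against the subset sizes yields the accuracy-versus-data-size curve that the assignment asks to analyze.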
References
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov: RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
- Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov: Unsupervised Cross-lingual Representation Learning at Scale. https://arxiv.org/abs/1911.02116
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. https://arxiv.org/abs/1909.11942
- Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, Holger Schwenk: MLQA: Evaluating Cross-lingual Extractive Question Answering. https://arxiv.org/abs/1910.07475
- P. Král, L. Lenc: Czech Text Document Corpus v 2.0. In 11th Edition of the Language Resources and Evaluation Conference (LREC 2018), Miyazaki, Japan, 7–12 May 2018, pp. 4345–4348, European Language Resources Association (ELRA). ISBN 979-10-95546-00-9.
- Ivan Habernal, Tomáš Ptáček, Josef Steinberger: Sentiment Analysis in Czech Social Media Using Supervised Machine Learning. http://www.aclweb.org/anthology/W13-1609