Thesis (Selection of subject) (version: 368)
Thesis details
Low-resource Text Classification
Thesis title in Czech: Klasifikace textu s omezeným množstvím dat
Thesis title in English: Low-resource Text Classification
Key words: klasifikace textu|omezené množství dat|BERT
English key words: text classification|low-resource|BERT
Academic year of topic announcement: 2019/2020
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: RNDr. Milan Straka, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 03.07.2020
Date of assignment: 22.04.2021
Confirmed by Study dept. on: 18.05.2021
Date and time of defence: 02.09.2021 09:00
Date of electronic submission: 22.07.2021
Date of submission of printed version: 22.07.2021
Date of proceeded defence: 02.09.2021
Opponents: Mgr. Martin Popel, Ph.D.
Guidelines
In natural language processing, unsupervised techniques for pretraining "language understanding" from raw text are common. Recently, the BERT model (Devlin et al., 2018) and its variants (Liu et al., 2019; Lan et al., 2019) have demonstrated substantial improvements on many downstream NLP tasks by pretraining so-called contextualized embeddings from raw text. One task improved substantially in this way is text classification.
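For illustration, the following is a minimal sketch of such fine-tuning for classification, assuming PyTorch and the Hugging Face transformers library; the multilingual BERT checkpoint, the toy sentences, and the hyperparameters are illustrative assumptions, not part of the assignment.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Toy sentiment examples; the actual experiments use the datasets below.
    texts = ["skvělý film, doporučuji", "nuda a zklamání"]
    labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for _ in range(3):  # a few gradient steps stand in for real fine-tuning
        loss = model(**batch, labels=labels).loss  # cross-entropy on the [CLS] head
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()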

Apart from the accuracy improvements in high-resource text classification, the extensive pretraining allows text classification to be used in low-resource settings, and even in cross-lingual scenarios (Lewis et al., 2019).
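A hedged sketch of the cross-lingual direction, again assuming the transformers library; the XLM-R checkpoint (Conneau et al., 2019) and the example sentence are assumptions used only for illustration.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "xlm-roberta-base"  # multilingual encoder, an illustrative choice
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    model.eval()

    # After fine-tuning the classifier on high-resource (e.g. English) data,
    # which is not shown here, the shared multilingual representation allows
    # direct zero-shot prediction on Czech input:
    batch = tokenizer(["ten film byl skvělý"], return_tensors="pt")
    with torch.no_grad():
        prediction = model(**batch).logits.argmax(dim=-1)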

The goal of the thesis is to evaluate text classification in low-resource settings, analyzing how system accuracy depends on the size and quality of the training data. The focus will be on Czech text classification tasks, with evaluation on a diverse set of data, using for example the Czech sentiment datasets (Habernal et al., 2013), the Czech Text Document Corpus 2.0 (Král et al., 2018), or similar.
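The planned analysis can be outlined as a learning curve over random training subsamples. The sketch below is illustrative; train_and_evaluate is a hypothetical callable wrapping one fine-tuning run, such as the loop sketched above, and the subsample sizes are placeholders.

    import random

    def learning_curve(train_data, test_data, train_and_evaluate,
                       sizes=(100, 500, 1000, 5000), seed=42):
        # Fine-tune on random subsamples of increasing size and record the
        # resulting test accuracy; train_and_evaluate maps a (train subset,
        # test data) pair to an accuracy.
        rng = random.Random(seed)
        results = {}
        for n in sizes:
            subset = rng.sample(train_data, min(n, len(train_data)))
            results[n] = train_and_evaluate(subset, test_data)
        return results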
References
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805

- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov: RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692

- Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov: Unsupervised Cross-lingual Representation Learning at Scale. https://arxiv.org/abs/1911.02116

- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. https://arxiv.org/abs/1909.11942

- Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, Holger Schwenk: MLQA: Evaluating Cross-lingual Extractive Question Answering. https://arxiv.org/abs/1910.07475

- Pavel Král, Ladislav Lenc: Czech Text Document Corpus v 2.0. Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), Miyazaki, Japan, pp. 4345-4348, European Language Resources Association (ELRA), ISBN: 979-10-95546-00-9.

- Ivan Habernal, Tomáš Ptáček, Josef Steinberger: Sentiment Analysis in Czech Social Media Using Supervised Machine Learning. http://www.aclweb.org/anthology/W13-1609