Thesis (Selection of subject) (version: 368)
Thesis details
Comparison of approaches to text classification
Thesis title in Czech: Porovnání přístupů ke klasifikaci textu
Thesis title in English: Comparison of approaches to text classification
Key words: NLP, klasifikace textu, strojové učení, klasifikace recenzí
English key words: NLP, text classification, machine learning, review classification
Academic year of topic announcement: 2018/2019
Thesis type: Bachelor's thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: RNDr. Jiří Hana, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 01.11.2018
Date of assignment: 01.11.2018
Confirmed by Study dept. on: 27.03.2019
Date and time of defence: 05.09.2019 09:00
Date of electronic submission: 18.07.2019
Date of submission of printed version: 19.07.2019
Date of completed defence: 05.09.2019
Opponents: doc. Mgr. Barbora Vidová Hladká, Ph.D.
 
 
 
Guidelines
Compare approaches to text classification based on machine learning. Special attention should be paid to an evaluation of the usefulness of various features, ranging from simple (length of text, bag-of-words) to more complicated ones derived from syntax, detected entities, etc.
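
As an illustration of the simplest features mentioned above (length of text and bag-of-words), the following is a minimal sketch using only the Python standard library. The function names `tokenize` and `extract_features`, the feature-name prefixes, and the sample review are assumptions made for this example; they do not come from the thesis or from the Yelp dataset.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(text):
    """Return a dict with two length features plus bag-of-words counts."""
    tokens = tokenize(text)
    features = {
        "len_chars": len(text),     # length of the raw text in characters
        "len_tokens": len(tokens),  # length of the text in word tokens
    }
    # Bag-of-words: one feature per distinct word, valued by its count.
    for word, count in Counter(tokens).items():
        features["bow=" + word] = count
    return features


# Made-up review text, used only to show the shape of the output.
review = "Great food, great service. Would come back!"
print(extract_features(review))
```

More complicated features (syntax, detected entities) would need an external NLP toolkit and are deliberately left out of this sketch.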

For training and testing, use the current Yelp challenge dataset of reviews. The data contain several candidate target variables (usefulness of review, rating); select one or more of them.

The comparison should include:
- Comparison of basic algorithms (their results, speed, ...)
- Evaluation of impact of training data size
- Evaluation of various text features
- Comparison of text features with metadata features
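
The first two points of the comparison (results and speed of an algorithm, impact of training data size) can be sketched as a small learning-curve experiment. The example below is an assumption-laden illustration, not the method used in the thesis: it picks multinomial Naive Bayes with add-one smoothing as one possible basic algorithm, and the toy reviews, labels, and helper names (`train_nb`, `predict_nb`, `learning_curve`) are hypothetical.

```python
import math
import time
from collections import Counter, defaultdict


def train_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the most probable label, using add-one (Laplace) smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Fraction of (tokens, label) pairs that the model labels correctly."""
    return sum(predict_nb(model, t) == y for t, y in docs) / len(docs)


def learning_curve(train, test, sizes):
    """Train on growing prefixes of `train`; return (size, accuracy, seconds) rows."""
    rows = []
    for n in sizes:
        start = time.perf_counter()
        model = train_nb(train[:n])
        elapsed = time.perf_counter() - start
        rows.append((n, accuracy(model, test), elapsed))
    return rows


# Toy data invented for this sketch; real experiments would load Yelp reviews.
train = [
    ("great food friendly staff".split(), "pos"),
    ("terrible service cold food".split(), "neg"),
    ("loved the great atmosphere".split(), "pos"),
    ("awful rude staff terrible".split(), "neg"),
]
test = [("great staff".split(), "pos"), ("terrible food".split(), "neg")]

for n, acc, secs in learning_curve(train, test, sizes=[2, 4]):
    print(f"train size={n}  accuracy={acc:.2f}  train time={secs:.6f}s")
```

The same loop could be repeated with other classifiers or with different feature sets (text features versus metadata features) to cover the remaining points of the comparison.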
References
Jurafsky, Daniel and Martin, James H. 2015. Speech and Language Processing.
Raschka, Sebastian and Mirjalili, Vahid. 2017. Python Machine Learning.
Mai, Jens-Erik. 2011. The modernity of classification. Journal of Documentation 67(4): 710-730.
Sebastiani, Fabrizio. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1): 1-47.
Preliminary scope of work
Compare approaches to text classification based on machine learning. Special attention should be paid to an evaluation of the usefulness of various features, ranging from simple (length of text, bag-of-words) to more complicated ones derived from syntax, detected entities, etc.
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html