Thesis (Selection of subject) (version: 368)
Thesis details
Genres classification by means of machine learning
Thesis title in Czech: Klasifikace žánrů pomocí strojového učení
Thesis title in English: Genres classification by means of machine learning
Key words: Strojové učení, zpracování přirozeného jazyka, klasifikace žánrů, vnoření slov, paragraph vector
English key words: Machine learning, natural language processing, genre classification, word embeddings, paragraph vector
Academic year of topic announcement: 2017/2018
Thesis type: diploma thesis
Thesis language: English
Department: Department of Theoretical Computer Science and Mathematical Logic (32-KTIML)
Supervisor: Mgr. Roman Neruda, CSc.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 10.05.2018
Date of assignment: 04.06.2018
Confirmed by Study dept. on: 18.07.2018
Date and time of defence: 13.09.2018 00:00
Date of electronic submission: 20.07.2018
Date of submission of printed version: 20.07.2018
Date the defence took place: 13.09.2018
Opponents: Mgr. Marta Vomlelová, Ph.D.
 
 
 
Guidelines
The goal of the thesis is to compare several approaches to text processing and classification and to apply them to the task of literary genre classification. The student will propose and design a machine learning model that predicts the genre of an English text from a short excerpt. A corpus of selected texts from Project Gutenberg will be used for training and testing the model. As part of the thesis, the dataset will be explored, and interesting text and language properties, as well as structures typical of individual genres, will be identified. A practical implementation of the proposed algorithms in a suitable environment (such as Python, scikit-learn, and TensorFlow) is expected.
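To make the task concrete, the following is a minimal sketch of one possible baseline, not the method prescribed by the assignment: a bag-of-words multinomial naive Bayes classifier written in plain Python. The class name NaiveBayesGenreClassifier, the tokenizer, the two genre labels, and the training excerpts are all assumptions made up for illustration; a real experiment would train on Project Gutenberg texts and would more likely rely on scikit-learn or TensorFlow.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesGenreClassifier:
    """Hypothetical baseline: multinomial naive Bayes with add-one smoothing."""

    def fit(self, excerpts, genres):
        self.word_counts = defaultdict(Counter)  # genre -> word -> count
        self.doc_counts = Counter(genres)        # genre -> number of excerpts
        self.vocabulary = set()
        for text, genre in zip(excerpts, genres):
            tokens = tokenize(text)
            self.word_counts[genre].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior of each genre for one excerpt."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for genre, n_docs in self.doc_counts.items():
            counts = self.word_counts[genre]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                # Add-one (Laplace) smoothed log likelihood of the token.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[genre] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy excerpts invented for illustration; they are NOT taken from Project Gutenberg.
TRAIN_EXCERPTS = [
    "The detective examined the body and questioned the suspect about the murder.",
    "The inspector found a clue near the window and suspected the butler.",
    "The starship left orbit and the crew watched the alien planet fade.",
    "The robot repaired the engine while the ship drifted between the stars.",
]
TRAIN_GENRES = ["mystery", "mystery", "science fiction", "science fiction"]

model = NaiveBayesGenreClassifier().fit(TRAIN_EXCERPTS, TRAIN_GENRES)
print(model.predict("The inspector questioned the suspect about a missing clue."))  # mystery
print(model.predict("The crew of the ship landed on a distant planet."))  # science fiction
```

A word-count baseline of this kind gives a reference point against which embedding-based approaches, such as the paragraph vector named in the key words, could be compared.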
References
Ian Goodfellow, Yoshua Bengio, Aaron Courville: Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org

Peter Flach: Machine Learning. Cambridge University Press, 2012.

Quoc Le, Tomáš Mikolov: Distributed Representations of Sentences and Documents. CoRR journal, 2014. http://arxiv.org/abs/1405.4053v2

Tomáš Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean: Efficient Estimation of Word Representations in Vector Space. CoRR journal, 2013. http://arxiv.org/abs/1301.3781v3

Yoon Kim: Convolutional Neural Networks for Sentence Classification. CoRR journal, 2014. http://arxiv.org/abs/1408.5882v2
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html