Thesis (Selection of subject)

Mining texts at the discourse level

Thesis title in Czech:	Dolování textu na úrovni diskursu
Thesis title in English:	Mining texts at the discourse level
Key words:	dobývání informací z textu, výstavba diskurzu, formální konceptuální analýza
English key words:	text mining, discourse structure, formal concept analysis
Academic year of topic announcement:	2013/2014
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Pavel Pecina, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	13.02.2014
Date of assignment:	13.02.2014
Confirmed by Study dept. on:	26.02.2014
Date and time of defence:	08.09.2014 00:00
Date of electronic submission:	30.07.2014
Date of submission of printed version:	31.07.2014
Date of proceeded defence:	08.09.2014
Opponents:	Mgr. Michal Novák, Ph.D.

Guidelines

The goal of this thesis is to lay the foundations of a new approach to text mining for extracting knowledge in a given domain. The approach combines two formal methods: discourse modelling, which comes from Natural Language Processing (NLP), and Formal Concept Analysis, a classification method used in Data Mining (DM). It aims to show that there are alternatives to the numerical methods currently in wide use in Text Mining, Information Retrieval and Knowledge Extraction from Texts, which rely on semantically shallow representations of texts (such as bag of words). It should favour “deep” semantic methods so as to be able to synthesise the content of a set of texts. The result of the process could thus be considered a summary of a collection of texts. The experimental domain could be the study of rare diseases.

The thesis aims at mining a collection of textual documents in a given domain to discover recurrent parts of documents that could be used to complete and enrich domain knowledge. Texts, or parts of texts, should be represented by sets of discourse representations. Texts should be classified using pattern structures in formal concept analysis, where the similarity between two texts is defined in accordance with an algebra on discourse relations.
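The classification step described above can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the assignment: each text is described by a set of hypothetical (relation, argument, argument) triples, and the similarity of two descriptions is plain set intersection, which stands in for the algebra on discourse relations. The names `texts`, `meet`, `extent` and `pattern_concepts` are illustrative only.

```python
from itertools import combinations

# Hypothetical toy data: each text is described by a set of discourse
# relation triples (relation, first argument, second argument).
EXPL = ("Explanation", "symptom", "mutation")
NARR = ("Narration", "diagnosis", "treatment")
CONT = ("Contrast", "drug_a", "drug_b")

texts = {
    "t1": frozenset({EXPL, NARR}),
    "t2": frozenset({EXPL, CONT}),
    "t3": frozenset({EXPL, NARR, CONT}),
}


def meet(d1, d2):
    """Similarity of two descriptions. Assumption: set intersection is
    used here in place of an algebra on discourse relations."""
    return d1 & d2


def extent(pattern, descriptions):
    """All texts whose description contains the given pattern."""
    return frozenset(
        name for name, desc in descriptions.items() if pattern <= desc
    )


def pattern_concepts(descriptions):
    """Close the descriptions under meet and pair each resulting pattern
    with its extent. The concept with an empty extent is left out."""
    patterns = set(descriptions.values())
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(patterns), 2):
            shared = meet(a, b)
            if shared not in patterns:
                patterns.add(shared)
                changed = True
    return {extent(p, descriptions): p for p in patterns}


concepts = pattern_concepts(texts)
for ext, pattern in sorted(concepts.items(), key=lambda kv: -len(kv[0])):
    relations = sorted(rel for rel, _, _ in pattern)
    print(sorted(ext), relations)
```

On this toy input the only pattern shared by all three texts is the single Explanation relation, while more specific patterns cover smaller groups of texts. A real similarity operation would compare discourse structures rather than flat sets, so this shows only the shape of the computation.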

References

Amblard, M., Pogodalla, S. Modeling the Dynamic Effects of Discourse: Principles and Frameworks. In Rebuschi, M., Batt, M., Heinzmann, G., Lihoreau, F., Musiol, M., Trognon, A. (Eds.), Interdisciplinary Works in Logic, Epistemology, Psychology and Linguistics: Dialogue, Rationality, and Formalism. Logic, Argumentation & Reasoning, Vol. 3. Dordrecht: Springer, 2014.

Roze, C. Towards a Discourse Relation Algebra for Comparing Discourse Structures. In Constraints in Discourse (CID 2011), Agay, France, 2011.

Kaytoue-Uberall, M., Kuznetsov, S. O., Napoli, A., Duplessis, S. Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis. Information Sciences, 2010.

Coulet, A., Domenach, F., Kaytoue, M., Napoli, A. Using Pattern Structures for Analyzing Ontology-based Annotations. In ICFCA 2013.