Mining texts at the discourse level
Thesis title in Czech: | Dolování textu na úrovni diskursu |
---|---|
Thesis title in English: | Mining texts at the discourse level |
Key words: | dobývání informací z textu, výstavba diskurzu, formální konceptuální analýza |
English key words: | text mining, discourse structure, formal concept analysis |
Academic year of topic announcement: | 2013/2014 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Pavel Pecina, Ph.D. |
Author: | hidden - assigned and confirmed by the Study Dept. |
Date of registration: | 13.02.2014 |
Date of assignment: | 13.02.2014 |
Confirmed by Study dept. on: | 26.02.2014 |
Date and time of defence: | 08.09.2014 00:00 |
Date of electronic submission: | 30.07.2014 |
Date of submission of printed version: | 31.07.2014 |
Date of proceeded defence: | 08.09.2014 |
Opponents: | Mgr. Michal Novák, Ph.D. |
Guidelines |
The goal of this thesis is to lay the foundations of a new approach to text mining for extracting knowledge in a given domain. The approach combines two formal methods: discourse modelling, which comes from Natural Language Processing (NLP), and Formal Concept Analysis, a classification method used in Data Mining (DM). It aims to show that there are alternatives to the current numerical methods, which rely on semantically shallow representations of texts (such as bag of words) and are widely used in Text Mining, Information Retrieval, and Knowledge Extraction from Texts. It should favour “deep” semantic methods so as to be able to synthesise the content of a set of texts. The result of the process could thus be considered a summary of a collection of texts. The experimental domain could be the study of rare diseases.
The thesis aims to mine a collection of textual documents in a given domain in order to discover recurrent parts of documents that could be used to complete and enrich domain knowledge. Texts or parts of texts should be represented by sets of discourse representations. Texts should be classified using pattern structures in Formal Concept Analysis, where the similarity between two texts is defined in accordance with an algebra on discourse relations. |
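The classification step named in the guidelines can be illustrated with a minimal sketch. It is not the method of the thesis: it assumes that each text is described by a set of discourse-relation triples and that the similarity (meet) of two descriptions is plain set intersection, which stands in for the algebra on discourse relations. The corpus `TEXTS`, the triple format, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy corpus: each text is described by a set of
# (relation, left segment, right segment) triples. Illustrative only.
TEXTS = {
    "t1": frozenset({("cause", "mutation", "symptom"),
                     ("elaboration", "symptom", "onset")}),
    "t2": frozenset({("cause", "mutation", "symptom"),
                     ("contrast", "treatment", "placebo")}),
    "t3": frozenset({("cause", "mutation", "symptom"),
                     ("elaboration", "symptom", "onset"),
                     ("contrast", "treatment", "placebo")}),
}


def meet(d1, d2):
    """Similarity of two descriptions. ASSUMPTION: plain set intersection,
    used here in place of an algebra on discourse relations."""
    return d1 & d2


def common_description(objects, texts):
    """Meet of the descriptions of all given texts (objects must be non-empty)."""
    objects = list(objects)
    result = texts[objects[0]]
    for g in objects[1:]:
        result = meet(result, texts[g])
    return result


def extent(description, texts):
    """All texts whose description subsumes the given description,
    i.e. meet(description, d) == description."""
    return frozenset(g for g, d in texts.items()
                     if meet(description, d) == description)


def pattern_concepts(texts):
    """Enumerate pattern concepts (extent, description) by brute force.
    Exponential in the number of texts, so suitable for toy data only.
    The concept with an empty extent is not generated."""
    concepts = set()
    names = sorted(texts)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            description = common_description(subset, texts)
            concepts.add((extent(description, texts), description))
    return concepts


if __name__ == "__main__":
    for ext, desc in sorted(pattern_concepts(TEXTS), key=lambda c: len(c[0])):
        print(sorted(ext), sorted(desc))
```

Each resulting pair groups a maximal set of texts with the discourse relations they all share, which is the sense in which recurrent parts of documents could be surfaced. A real implementation would replace `meet` with a similarity operation that respects the structure of discourse relations, and the brute-force enumeration with a dedicated concept-lattice algorithm.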
References |
Amblard, M., Pogodalla, S. Modeling the Dynamic Effects of Discourse: Principles and Frameworks. In Rebuschi, M., Batt, M., Heinzmann, G., Lihoreau, F., Musiol, M., Trognon, A. (Eds.), Interdisciplinary Works in Logic, Epistemology, Psychology and Linguistics: Dialogue, Rationality, and Formalism. Logic, Argumentation & Reasoning, Vol. 3. Dordrecht: Springer, 2014.
Roze, C. Towards a Discourse Relation Algebra for Comparing Discourse Structures. In Constraints in Discourse (CID 2011), Agay, France, 2011.
Kaytoue-Uberall, M., Kuznetsov, S.O., Napoli, A., Duplessis, S. Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis. Information Sciences, 2010.
Coulet, A., Domenach, F., Kaytoue, M., Napoli, A. Using Pattern Structures for Analyzing Ontology-based Annotations. In ICFCA 2013. |