Thesis topics (Thesis selection)
Thesis details
Classifying columns in CSV files using Linked Data knowledge bases
Thesis title in Czech: Klasifikace sloupců v CSV souborech využívající Linked Data znalostní báze
Thesis title in English: Classifying columns in CSV files using Linked Data knowledge bases
Keywords (Czech): linked data, znalostní báze, datová kvalita, otevřená data
Keywords (English): linked data, knowledge base, data quality, open data
Academic year of topic announcement: 2016/2017
Thesis type: diploma thesis
Thesis language: English
Department: Department of Software Engineering (32-KSI)
Supervisor: doc. Mgr. Martin Nečaský, Ph.D.
Author: hidden - assigned and confirmed by the Study Department
Date of registration: 26.02.2017
Date of assignment: 26.02.2017
Date of confirmation by the Study Department: 09.03.2017
Guidelines
Nowadays, governmental organizations publish their data as open data, most typically as CSV files. To fully exploit the potential of such data, the publication process should be improved so that data are published not merely as open data but as Linked Open Data [2] in the RDF data format [1]. To lift CSV files to Linked Data, it is necessary to 1) classify CSV columns, based on their content and context, against existing knowledge bases, 2) assign globally unique HTTP URI identifiers to the particular cell values according to Linked Data principles (such identifiers should be reused from one of the existing knowledge bases), and 3) discover relations between columns based on the evidence for these relations in the existing knowledge bases.

For example, if a published CSV file contains names of movies in the first column and names of the directors of these movies in the second column, lifting the CSV file to Linked Data should automatically 1) classify the first and second columns as containing instances of the classes 'Movie' and 'Director', 2) convert string cell values in the movie and director columns to HTTP URLs, e.g., instead of the string 'Matrix' as the name of the movie, there should be the URL 'https://www.wikidata.org/wiki/Q83495' pointing to the Wikidata knowledge graph, which provides further information and links, and 3) discover relations between columns, such as the relation 'isDirectedBy' between the first and second columns.
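
The three steps in the movie example can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the knowledge base is a small in-memory dictionary, matching is by exact label, and all names (ENTITIES, FACTS, the example.org URIs, 'Director A', 'Example Film') are hypothetical. Only the 'Matrix' URL is taken from the text. It is not how Odalic or TableMiner+ works internally.

```python
from collections import Counter

# Toy stand-in for a knowledge base: cell label -> (URI, class).
# Only the 'Matrix' URL comes from the text; all other entries are hypothetical.
ENTITIES = {
    "Matrix": ("https://www.wikidata.org/wiki/Q83495", "Movie"),
    "Example Film": ("http://example.org/entity/ExampleFilm", "Movie"),
    "Director A": ("http://example.org/entity/DirectorA", "Director"),
    "Director B": ("http://example.org/entity/DirectorB", "Director"),
}

# Hypothetical facts: (subject URI, object URI) -> relation name.
FACTS = {
    (ENTITIES["Matrix"][0], ENTITIES["Director A"][0]): "isDirectedBy",
    (ENTITIES["Example Film"][0], ENTITIES["Director B"][0]): "isDirectedBy",
}


def disambiguate(cell):
    """Step 2: map a cell string to a URI, or None if the label is unknown."""
    entry = ENTITIES.get(cell.strip())
    return entry[0] if entry else None


def classify_column(cells):
    """Step 1: pick the class most often seen among the recognised cells."""
    votes = Counter(
        ENTITIES[c.strip()][1] for c in cells if c.strip() in ENTITIES
    )
    return votes.most_common(1)[0][0] if votes else None


def discover_relation(subject_cells, object_cells):
    """Step 3: pick the relation most often found between row-aligned cells."""
    votes = Counter()
    for s, o in zip(subject_cells, object_cells):
        relation = FACTS.get((disambiguate(s), disambiguate(o)))
        if relation:
            votes[relation] += 1
    return votes.most_common(1)[0][0] if votes else None


rows = [("Matrix", "Director A"), ("Example Film", "Director B")]
movies = [r[0] for r in rows]
directors = [r[1] for r in rows]
print(classify_column(movies), classify_column(directors))
print(disambiguate("Matrix"))
print(discover_relation(movies, directors))
```

A real pipeline would replace the dictionaries with queries against a knowledge base and would need fuzzy label matching and candidate ranking, which is where the issues discussed in the guidelines arise.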

Current approaches, such as [3,10], typically rely on evidence for the cell values within the existing Linked Data knowledge bases / Linked Open Data Cloud [5]. One of the best tools for lifting CSV files to Linked Data is Odalic [10], which internally uses an improved version of the TableMiner+ algorithm [3]. Nevertheless, even this tool/algorithm has a couple of known issues with respect to the classification/disambiguation phase, including: (A) performance/precision/recall issues of the classification/disambiguation steps, (B) issues with handling user feedback on the classification/disambiguation results of the algorithm, and (C) too many false positives in case of lower evidence for the disambiguated cells/classified columns.

To address (A), special heuristics focused on classification and disambiguation should be proposed for the Odalic algorithm. For example, when disambiguating a cell, the algorithm takes into account the context of that cell, such as the row and column of the table the cell is part of. Nevertheless, this approach causes a performance bottleneck in the whole process and should be adjusted in a way that does not decrease precision/recall much.
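
One conceivable heuristic of this kind, shown here only as a sketch and not as something the Odalic algorithm is known to do, is to resolve each distinct cell value once and reuse the result for repeated values instead of re-evaluating every cell with its full row context. The names TOY_CANDIDATES, lookup_candidates and disambiguate_column, as well as the example.org URIs, are assumptions made for illustration. The trade-off is that ignoring row context may cost some precision for ambiguous values.

```python
from functools import lru_cache

# Hypothetical candidate lists; a real system would query a knowledge base.
TOY_CANDIDATES = {
    "Wien": ("http://example.org/place/Wien",),
    "Graz": ("http://example.org/place/Graz",),
}


@lru_cache(maxsize=None)
def lookup_candidates(value):
    """Expensive step (simulated): fetch candidate URIs for one distinct value."""
    return TOY_CANDIDATES.get(value, ())


def disambiguate_column(cells):
    """Resolve each cell, reusing cached candidates for repeated values."""
    result = []
    for cell in cells:
        candidates = lookup_candidates(cell.strip())
        # Take the first candidate; a real ranking step would go here.
        result.append(candidates[0] if candidates else None)
    return result
```

After processing a column, lookup_candidates.cache_info() reports how many lookups were actually performed (misses) versus served from the cache (hits), which gives a rough measure of the work saved.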

To address (B), the way user feedback (on classification and disambiguation) is handled by the Odalic algorithm in subsequent executions of the algorithm should be improved, e.g., the way a manual change to the disambiguation of one cell value is propagated to all other similar values.
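
As an illustration of such propagation, the sketch below applies a user's correction of one cell to every cell in the same column whose value is equal after a simple normalization. The notion of "similar" used here (case-insensitive, whitespace-collapsed equality) and the function names are assumptions, and the example.org URIs are hypothetical. The actual feedback handling in Odalic may differ.

```python
def normalize(value):
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def propagate_feedback(column, annotations, row, corrected_uri):
    """Return new annotations with the correction applied to similar cells.

    column        -- list of cell strings
    annotations   -- list of URIs (or None), one per cell
    row           -- index of the cell the user corrected
    corrected_uri -- URI chosen by the user for that cell
    """
    key = normalize(column[row])
    return [
        corrected_uri if normalize(value) == key else old
        for value, old in zip(column, annotations)
    ]


column = ["Graz", "graz ", "Wien", "GRAZ"]
annotations = ["http://example.org/wrong", None, "http://example.org/Wien", None]
fixed = propagate_feedback(column, annotations, 0, "http://example.org/Graz")
print(fixed)
```

The function returns a new list rather than modifying the input, so the automatic annotations and the user-corrected ones can be kept side by side between executions.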

To address (C), the Odalic algorithm should be extended to work reasonably in case of lower evidence for the disambiguated cells/classified columns. For example, when there are too few records in the CSV file, or too few cells were disambiguated, there is insufficient evidence for the proper classification or disambiguation.
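
One simple way to reduce false positives in such cases is to abstain when the evidence is too thin. The sketch below classifies a column only if enough cells were disambiguated and the winning class has a sufficient share of them. The thresholds min_support and min_share and their default values are hypothetical and would need to be tuned on real data; this is not a description of the existing Odalic behaviour.

```python
from collections import Counter


def classify_with_threshold(cell_classes, min_support=3, min_share=0.6):
    """Return the majority class, or None when the evidence is insufficient.

    cell_classes -- one class label per cell, None for cells that could
                    not be disambiguated
    min_support  -- minimum number of disambiguated cells required
    min_share    -- minimum fraction of disambiguated cells that must
                    agree on the winning class
    """
    known = [c for c in cell_classes if c is not None]
    if len(known) < min_support:
        return None  # too few disambiguated cells
    label, count = Counter(known).most_common(1)[0]
    if count / len(known) < min_share:
        return None  # no class is dominant enough
    return label
```

Returning None instead of a weak guess lets a user interface flag the column for manual review rather than presenting a likely false positive.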

The goal of this thesis is to evaluate the Odalic algorithm on CSV files obtained from the two Austrian open data catalogs [4,6]. In particular, the goal is to evaluate two parts of the Odalic algorithm: (1) the classification of CSV columns and (2) the disambiguation of cell values within the CSV files. In both cases, the classification and disambiguation should be performed against Linked Data concepts/resources available in the knowledge base used in the ADEQUATe project [11]. Based on the results of the evaluation, problems should be clearly listed (taking into account issues A-C above) and discussed. Improvements addressing (a subset of) these problems should be suggested and implemented as an extension to the Odalic algorithm [10].
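
An evaluation of this kind is commonly reported as precision, recall and F1 against a manually created gold standard. The sketch below assumes annotations are stored as dictionaries keyed by (row, column) position, which is an illustrative choice and not a format prescribed by Odalic or the ADEQUATe project.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted annotations with a gold standard.

    predicted -- dict mapping (row, column) to the URI chosen by the algorithm;
                 cells the algorithm left unannotated are simply absent
    gold      -- dict mapping (row, column) to the correct URI
    """
    true_positives = sum(
        1 for key, uri in predicted.items() if gold.get(key) == uri
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(0, 0): "a", (1, 0): "b", (2, 0): "c", (3, 0): "d"}
predicted = {(0, 0): "a", (1, 0): "b", (2, 0): "x"}
p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))
```

The same function can be used for column classification by keying the dictionaries on column index and using class URIs as values.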


References
[1] Frank Manola and Eric Miller. RDF Primer. W3C Recommendation 10 February 2004. [http://www.w3.org/TR/2004/REC-rdf-primer-20040210/]
[2] Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst. 5(3): 1-22 (2009) [http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf]
[3] Ziqi Zhang. Effective and Efficient Semantic Table Interpretation using TableMiner+. Submitted to Semantic Web Journal. [http://www.semantic-web-journal.net/system/files/swj1111.pdf]
[4] Official Austrian open data catalog [https://www.data.gv.at/]
[5] Richard Cyganiak et al. Linked Open Data Cloud. Online: http://lod-cloud.net/
[6] Open data catalog, https://www.opendataportal.at/
[7] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, Vol. 6 No. 2, pp 167–195, 2015.
[8] Wikidata.org Online: https://www.wikidata.org/
[9] Linked Open Vocabularies. Online. http://lov.okfn.org/dataset/lov/
[10] Project Odalic. Online. https://github.com/odalic
[11] ADEQUATe project. Online. http://www.adequate.at/
 
Charles University | Charles University Information System