Thesis (Selection of subject) (version: 368)
Thesis details
Using knowledge bases to improve the quality of published open data
Thesis title in Czech: Využití znalostních bází ke zlepšení kvality zveřejněných otevřených dat
Thesis title in English: Využití znalostních bází ke zlepšení kvality zveřejněných otevřených dat
Key words: znalostní báze, datová kvalita, otevřená data, linked data
English key words: knowledge base, data quality, open data, linked data
Academic year of topic announcement: 2015/2016
Thesis type: diploma thesis
Thesis language: English
Department: Department of Software Engineering (32-KSI)
Supervisor: RNDr. Tomáš Knap, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 13.05.2016
Date of assignment: 27.05.2016
Confirmed by Study dept. on: 08.06.2016
Guidelines
Nowadays, governmental organizations publish their data as open data, most typically as CSV files. To fully exploit the potential of such data, the publication process should be improved so that the data are published not merely as open data but as Linked Open Data [2] in the RDF data format [1]. To leverage (convert) CSV files to Linked Data, it is necessary to 1) classify CSV columns, based on their content and context, against existing knowledge bases, 2) assign globally unique HTTP URI identifiers to the particular cell values according to Linked Data principles (such identifiers should be reused from one of the existing knowledge bases), and 3) discover relations between columns based on the evidence for those relations in the existing knowledge bases.

For example, if a published CSV file contains the names of movies in the first column and the names of the directors of these movies in the second column, leveraging the CSV file to Linked Data should automatically 1) classify the first and second columns as containing instances of the classes 'Movie' and 'Director', 2) convert the string cell values in the movie and director columns to HTTP URLs, e.g., instead of the string 'Matrix' as the name of the movie, there should be the URL 'https://www.wikidata.org/wiki/Q83495', which points to the WikiData knowledge graph with further information and links, and 3) discover relations between columns, such as the relation 'isDirectedBy' between the first and second columns.
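
The three steps in the movie example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how TableMiner+ or any other tool works: the knowledge base is a hard-coded dictionary, column classification is a simple majority vote, and every `example.org` URI, the director row, and all function names are invented for the sketch. Only the 'Matrix' URL is taken from the text above.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a knowledge base: label -> (entity URI, class).
# Only the 'Matrix' URL comes from the guidelines; the rest is invented.
KNOWLEDGE_BASE = {
    "Matrix": ("https://www.wikidata.org/wiki/Q83495", "Movie"),
    "The Wachowskis": ("http://example.org/entity/the-wachowskis", "Director"),
}

# Hypothetical relation evidence: (subject class, object class) -> property URI.
RELATIONS = {
    ("Movie", "Director"): "http://example.org/property/isDirectedBy",
}


def classify_column(values):
    """Step 1: pick the class most cell values map to, or None if none match."""
    classes = [KNOWLEDGE_BASE[v][1] for v in values if v in KNOWLEDGE_BASE]
    return Counter(classes).most_common(1)[0][0] if classes else None


def link_cell(value):
    """Step 2: return the URI for a cell string, or None if it is unknown."""
    entry = KNOWLEDGE_BASE.get(value)
    return entry[0] if entry else None


def csv_to_triples(text):
    """Classify the columns, link the cells, and emit N-Triples-style lines."""
    rows = list(csv.reader(io.StringIO(text)))
    body = rows[1:]  # skip the header row
    col_classes = [classify_column(col) for col in zip(*body)]
    if len(col_classes) < 2:
        return col_classes, []
    # Step 3: look for a known relation between the first two columns.
    prop = RELATIONS.get((col_classes[0], col_classes[1]))
    triples = []
    for row in body:
        subj, obj = link_cell(row[0]), link_cell(row[1])
        if prop and subj and obj:
            triples.append(f"<{subj}> <{prop}> <{obj}> .")
    return col_classes, triples


SAMPLE = "movie,director\nMatrix,The Wachowskis\n"
classes, triples = csv_to_triples(SAMPLE)
print(classes)
for triple in triples:
    print(triple)
```

In a real setting the dictionary lookups would be replaced by queries against the chosen knowledge bases, and cells without a match would need a disambiguation strategy rather than being skipped.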

We can distinguish two types of knowledge bases that can be used to realize points 1) - 3) above, i.e., to classify, disambiguate, or discover evidence for relations in the data: general-purpose knowledge bases (e.g., DBpedia.org [7], WikiData [8], and further datasets in the Linked Open Data Cloud [5]) and focused knowledge bases (for example, a dataset of all schools in a given country or all streets in a given city). The choice of focused knowledge bases depends heavily on the processed data. In this thesis, the data on which both types of knowledge bases will be evaluated are tabular data (CSV files) obtained from the Austrian open data catalogs [4, 6].
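
One possible way to use a general and a focused knowledge base side by side is to collect candidate entities from each and record where every candidate came from. The sketch below assumes each knowledge base can be reduced to a label-to-URI dictionary; the contents, the `example.org` URIs, and the function name `find_candidates` are all hypothetical and serve only as illustration.

```python
# Hypothetical in-memory stand-ins for knowledge bases: label -> entity URI.
GENERAL_KB = {"Wien": "http://example.org/general/Wien"}
FOCUSED_SCHOOLS_KB = {
    "Volksschule Beispielgasse": "http://example.org/schools/1",
}


def find_candidates(label, knowledge_bases):
    """Return (kb_name, uri) pairs for every knowledge base that knows the label.

    Matching ignores case and surrounding whitespace; a real system would
    need fuzzier matching and a ranking of the candidates.
    """
    key = label.strip().casefold()
    hits = []
    for name, kb in knowledge_bases.items():
        for known_label, uri in kb.items():
            if known_label.casefold() == key:
                hits.append((name, uri))
    return hits


kbs = {"general": GENERAL_KB, "schools": FOCUSED_SCHOOLS_KB}
print(find_candidates(" wien ", kbs))
print(find_candidates("Volksschule Beispielgasse", kbs))
```

Keeping the source name with each candidate makes it possible to report later which knowledge base contributed which annotation.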

The first goal of this thesis is to analyse the available knowledge bases that may be used to realize points 1) - 3) above, i.e., to classify, disambiguate, or discover evidence for relations in the data. The author should analyse, e.g., how these knowledge bases may be used (which APIs they provide) and to what extent they are hierarchically organised. Once this analysis is done, the second goal of this thesis is to add support for at least two general knowledge bases and at least 3-5 focused knowledge bases (relevant for the data obtained from the two Austrian open data catalogs [4, 6]) to one of the existing tools realizing (semi)automatic leveraging of tabular open data to Linked Open Data, such as TableMiner+ [3]. General knowledge bases are typically already available as Linked Open Data. Focused knowledge bases may need to be prepared first; e.g., a list of schools may need to be extracted from certain datasets in [4, 6] and manually converted to Linked Data. Finally, after extending one of the existing tools, e.g., TableMiner+, with support for the above-mentioned knowledge bases, the efficiency (precision/recall) of the algorithm for classifying, disambiguating, and discovering relations should be evaluated on the tabular data obtained from [4, 6] with respect to various sets of knowledge bases: the author should measure the efficiency of general knowledge base A, compare it with the efficiency of general knowledge base B, compare it with the efficiency of general knowledge base B combined with focused knowledge bases, etc. The author should also suggest extensions to the algorithms for classifying, disambiguating, and discovering relations that make better use of the knowledge within the knowledge bases.
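
The precision/recall comparison across knowledge-base configurations could be computed along the following lines. This is a minimal sketch assuming annotations are represented as sets of (cell position, entity) pairs and that a manually created gold standard exists. Precision is taken as the fraction of predicted annotations that are correct, and recall as the fraction of gold annotations that were found. The configuration names and all annotation values are made up for illustration and are not results.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted annotations against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard: ((row, column), entity) pairs.
gold = {((1, 0), "ex:A"), ((2, 0), "ex:B"), ((3, 0), "ex:C"), ((4, 0), "ex:D")}

# Hypothetical tool output under two knowledge-base configurations.
runs = {
    "general KB A": {((1, 0), "ex:A"), ((2, 0), "ex:X")},
    "general KB B + focused": {
        ((1, 0), "ex:A"),
        ((2, 0), "ex:B"),
        ((3, 0), "ex:C"),
    },
}

for name, predicted in runs.items():
    p, r = precision_recall(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The same function can be applied separately to column classifications, cell disambiguations, and discovered relations, so that each of the three tasks gets its own scores per configuration.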
References
[1] Frank Manola and Eric Miller. RDF Primer. W3C Recommendation 10 February 2004. [http://www.w3.org/TR/2004/REC-rdf-primer-20040210/]
[2] Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst. 5(3): 1-22 (2009) [http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf]
[3] Ziqi Zhang: Effective and Efficient Semantic Table Interpretation using TableMiner+. Submitted to Semantic Web Journal. [http://www.semantic-web-journal.net/system/files/swj1111.pdf]
[4] Official national Austrian open data catalog [https://www.data.gv.at/]
[5] Richard Cyganiak et al. Linked Open Data Cloud. [http://lod-cloud.net/]
[6] Open data catalog [https://www.opendataportal.at/]
[7] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, Vol. 6 No. 2, pp 167–195, 2015.
[8] Wikidata.org [https://www.wikidata.org/]
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html