Thesis details
Discovering and Creating Relations among CSV Columns Using Linked Data Knowledge Bases
Thesis title in Czech: Hledání a vytváření relací mezi sloupci v CSV souborech s využitím Linked Dat
Thesis title in English: Discovering and Creating Relations among CSV Columns Using Linked Data Knowledge Bases
Keywords (Czech): CSV, linked data, otevřená data, relace, sémantická interpretace tabulek
Keywords (English): CSV, linked data, open data, relations, semantic table interpretation
Academic year of topic announcement: 2015/2016
Thesis type: diploma thesis
Thesis language: English
Department: Department of Software Engineering (32-KSI)
Supervisor: doc. Mgr. Martin Nečaský, Ph.D.
Author: hidden - assigned and confirmed by the Study Department
Date of registration: 13.04.2016
Date of assignment: 27.05.2016
Date of confirmation by the Study Department: 08.06.2016
Date and time of defence: 04.02.2019 09:00
Date of electronic submission: 03.01.2019
Date of submission of the printed version: 04.01.2019
Date of defence held: 04.02.2019
Opponents: RNDr. Martin Svoboda, Ph.D.
Guidelines
Nowadays, governmental organizations publish their data as open data, most typically as CSV files. To fully exploit the potential of such data, the publication process should be improved so that the data are published not merely as open data but as Linked Open Data [2] in the RDF data format [1]. To leverage (convert) CSV files to Linked Data, it is necessary to 1) classify CSV columns based on their content and context against existing knowledge bases, 2) assign globally unique HTTP URI identifiers to the particular cell values according to Linked Data principles - such identifiers should be reused from one of the existing knowledge bases, and 3) discover relations between columns based on the evidence for these relations in the existing knowledge bases.

For example, if a published CSV file contains names of movies in the first column and names of the directors of these movies in the second column, leveraging the CSV file to Linked Data should automatically 1) classify the first and second columns as containing instances of the classes 'Movie' and 'Director', 2) convert the string cell values in the movie and director columns to HTTP URLs, e.g., instead of having 'Matrix' as the name of the movie, there should rather be the URL 'https://www.wikidata.org/wiki/Q83495' pointing to the Wikidata knowledge graph with further information and links, and 3) discover relations between columns, such as the relation 'isDirectedBy' between the first and second columns.
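The three steps of the movie example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table `ENTITY_URIS` stands in for a real knowledge-base query, the `http://example.org/` URIs and the function name `csv_to_triples` are hypothetical placeholders, and only the 'Matrix' URL is taken from the text above.

```python
import csv
import io

# Hypothetical lookup table standing in for a real knowledge-base query.
# Only the 'Matrix' URL comes from the text; the example.org URI is a placeholder.
ENTITY_URIS = {
    "Matrix": "https://www.wikidata.org/wiki/Q83495",
    "The Wachowskis": "http://example.org/person/The_Wachowskis",
}
EX = "http://example.org/ontology/"  # placeholder namespace (assumption)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def csv_to_triples(csv_text, subject_class, object_class, relation):
    """Turn a two-column CSV into (subject, predicate, object) triples."""
    triples = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row, if any
    for row in reader:
        if len(row) < 2:
            continue  # ignore malformed rows
        subject_uri = ENTITY_URIS.get(row[0])
        object_uri = ENTITY_URIS.get(row[1])
        if subject_uri is None or object_uri is None:
            continue  # cell value could not be disambiguated
        # Step 1: classify both columns.
        triples.append((subject_uri, RDF_TYPE, EX + subject_class))
        triples.append((object_uri, RDF_TYPE, EX + object_class))
        # Steps 2 and 3: cells are replaced by URIs and linked by the relation.
        triples.append((subject_uri, EX + relation, object_uri))
    return triples


if __name__ == "__main__":
    data = "movie,director\nMatrix,The Wachowskis\n"
    for triple in csv_to_triples(data, "Movie", "Director", "isDirectedBy"):
        print(triple)
```

In a real pipeline the class names and the relation would be the output of the classification and relation discovery steps rather than arguments supplied by the caller.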

We can distinguish two types of knowledge bases used to realize points 1) - 3) above, i.e., to classify, disambiguate, or discover evidence for relations in the data: general-purpose knowledge bases (e.g., DBpedia.org [7], Wikidata [8], further datasets in the Linked Open Data Cloud [5]) and focused knowledge bases (e.g., a dataset with all schools in a given country or all streets in a given city). The choice of focused knowledge bases depends heavily on the processed data. In this thesis, both types of knowledge bases will be evaluated on tabular data (CSV files) obtained from two Austrian open data catalogs [4, 6].

The current approaches, such as [3, 10], typically rely on the evidence for relations within the existing Linked Data knowledge bases / the Linked Open Data Cloud [5]. One of the best tools for leveraging CSV files to Linked Data is TableMiner+ [3]. Nevertheless, even this tool has drawbacks with respect to discovering relations among data from [4, 6], including: (A) no usage of focused knowledge bases, (B) inappropriate comparison of potentially relevant relations in the knowledge base, and (C) not taking into account user feedback and recommendations for relations based on the structural similarity between processed files.

To address (A), focused knowledge bases must be used in the thesis. Focused knowledge bases need to be prepared first - e.g., a list of schools and their properties may need to be extracted from certain datasets in [4, 6] and manually converted to Linked Data. As part of this thesis, at least 3-5 focused knowledge bases should be prepared for the evaluation of the relation discovery; this also includes preparing the relations between the knowledge bases.

To address (B), the comparison of potentially relevant relations in the knowledge bases has to be improved. Currently, when searching for relation evidence in the knowledge base, the values within potentially related CSV columns are compared with RDF triples that have the value from the subject column of the CSV table as the subject of the triple. Nevertheless, the object of the triple in the knowledge base may be a resource (URI), not just a plain literal, and in that case a different comparison is needed than just measuring the similarity of the cell value and the URI of the resource. The algorithm should also take into account that the subject column in the CSV file may appear as the object of the triple, not just as its subject. Furthermore, TableMiner+ selects the best matching relation not just by comparing the cell value with the object of the triple in the knowledge bases, but also by comparing the CSV column title with the name/URI of the candidate predicate in the given knowledge base. Nevertheless, when two knowledge bases give evidence for the given relation, the predicate should not be selected just by comparing the similarity of the predicate's name and the column title (which may be misleading), but rather by consulting the Linked Open Data Cloud [5] and selecting the more widely used predicate.
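The improvements requested under (B) can be illustrated with a small, self-contained sketch. It is not the TableMiner+ implementation: the in-memory triple list, the `labels` dictionary, the similarity threshold, and the function names `relation_evidence` and `pick_predicate` are all assumptions made for the example. It shows comparing a cell with the label of a resource instead of its URI, searching triples in both directions, and preferring the more widely used predicate.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Case-insensitive string similarity; the threshold is an arbitrary assumption."""
    if not a or not b:
        return False
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def node_text(node, labels):
    """Text used for comparison: the label for a resource, the value for a literal.

    Simplification: anything starting with http:// or https:// is treated as a URI.
    """
    if node.startswith(("http://", "https://")):
        return labels.get(node, "")
    return node


def relation_evidence(rows, triples, labels):
    """Count evidence per (predicate, direction).

    rows: (subject_uri, cell_text) pairs, i.e. the already disambiguated
    subject-column entity and the raw value of the other column.
    """
    votes = Counter()
    for subject_uri, cell in rows:
        for s, p, o in triples:
            if s == subject_uri and similar(cell, node_text(o, labels)):
                votes[(p, "forward")] += 1   # subject column is the triple subject
            elif o == subject_uri and similar(cell, node_text(s, labels)):
                votes[(p, "inverse")] += 1   # subject column is the triple object
    return votes


def pick_predicate(candidates, usage_counts):
    """Prefer the predicate with the highest usage count (hypothetical statistics)."""
    return max(candidates, key=lambda p: usage_counts.get(p, 0))
```

The usage counts passed to `pick_predicate` would have to come from statistics gathered over the Linked Open Data Cloud [5]; how such statistics are obtained is left open here.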

To address (C), the algorithm for relation discovery should take into account the feedback provided by the user when inspecting the results of the relation discovery algorithm. Furthermore, the algorithm should take into account relations discovered when processing files with a similar structure. For example, if a file A contains relations X, Y, Z and is structurally similar to file B, which contains relations X, Y, it is probable that file B also contains relation Z, and this relation should be suggested.
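The suggestion of relations from structurally similar files can be sketched as follows. The sketch assumes, purely for illustration, that structural similarity is measured as the Jaccard similarity of the column header sets and that a threshold of 0.5 is used; the text above does not prescribe either choice, and the function names are hypothetical. User feedback is not modelled here.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_relations(headers, relations, processed_files, min_similarity=0.5):
    """Suggest relations found in similar files but missing in the current one.

    headers: column titles of the current file
    relations: relations already discovered in the current file
    processed_files: iterable of (headers, relations) pairs of earlier files
    Returns (relation, similarity) pairs, best match first.
    """
    suggestions = {}
    for other_headers, other_relations in processed_files:
        score = jaccard(headers, other_headers)
        if score < min_similarity:
            continue  # the files are not similar enough
        for relation in set(other_relations) - set(relations):
            suggestions[relation] = max(suggestions.get(relation, 0.0), score)
    return sorted(suggestions.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    file_a = (["school", "street", "city", "district"], {"X", "Y", "Z"})
    file_b_headers = ["school", "street", "city"]
    # File B shares three of four columns with file A, so Z is suggested.
    print(suggest_relations(file_b_headers, {"X", "Y"}, [file_a]))
```

A suggestion produced this way is only a candidate; confirming or rejecting it is where the user feedback described above would come in.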

The particular goal of this thesis is to evaluate how well TableMiner+ discovers relations (e.g., the relation 'isDirectedBy' described above) between columns of the CSV files obtained from the two Austrian open data catalogs [4, 6]. Based on the results of the evaluation, the problems of the relation discovery part of the algorithm (including the issues A-C above) should be clearly listed and discussed. Improvements addressing these problems should be suggested and realized as an extension of TableMiner+ [3]. The author should also discuss the applicability of the improvements to other tools, such as TAIPAN [10].

It is not the goal of this thesis to extensively evaluate knowledge bases, so the author may use a fixed set of knowledge bases (a mixture of at least one global knowledge base and focused knowledge bases) and instead vary other parameters of the relation discovery part of the algorithm. It is nevertheless expected that the author will prepare at least 3-5 focused knowledge bases for the experiments.

References
[1] Frank Manola and Eric Miller. RDF Primer. W3C Recommendation 10 February 2004. [http://www.w3.org/TR/2004/REC-rdf-primer-20040210/]
[2] Christian Bizer, Tom Heath, Tim Berners-Lee: Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst. 5(3): 1-22 (2009) [http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf]
[3] Ziqi Zhang. Effective and Efficient Semantic Table Interpretation using TableMiner+. Submitted to Semantic Web Journal. [http://www.semantic-web-journal.net/system/files/swj1111.pdf]
[4] Official Austrian open data catalog [https://www.data.gv.at/]
[5] Richard Cyganiak et al. Linked Open Data Cloud. Online: http://lod-cloud.net/
[6] Open data catalog, https://www.opendataportal.at/
[7] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, Christian Bizer: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, Vol. 6 No. 2, pp 167–195, 2015.
[8] Wikidata.org Online: https://www.wikidata.org/
[9] Linked Open Vocabularies. Online. http://lov.okfn.org/dataset/lov/
[10] Ivan Ermilov. Web Tables Automatic Property Mapping - TAIPAN. Online: https://github.com/AKSW/TAIPAN