Thesis (Selection of subject) (version: 393)
Thesis details
Automatizované generování metadat pro strukturované datasety
Thesis title in Czech: Automatizované generování metadat pro strukturované datasety
Thesis title in English: Automatic metadata generation for structured datasets
Key words: strukturovaná data|metadata|generování metadat|CSV|JSON|XML|RDF|Schema.org|DCAT-AP-CZ|VoID
English key words: structured data|metadata|metadata generation|CSV|JSON|XML|RDF|Schema.org|DCAT-AP-CZ|VoID
Academic year of topic announcement: 2024/2025
Thesis type: diploma thesis
Thesis language:
Department: Department of Software Engineering (32-KSI)
Supervisor: doc. RNDr. Jakub Klímek, Ph.D.
Author: hidden - assigned and confirmed by the Study Dept.
Date of registration: 03.09.2024
Date of assignment: 03.09.2024
Confirmed by Study dept. on: 08.09.2024
Guidelines
In recent years, there has been a steady increase in focus on data quality and its value.
These efforts show that metadata is often incorrect, missing, or inconsistent.
One of the reasons is the lack of automated tools and the resulting need to create metadata manually, which is both time-consuming and error-prone.
However, today's technologies, such as schema extraction and LLMs [5], offer ways to automate this process and make it easier, faster, and more robust.
The goal of this thesis is to analyze existing methods, and to propose and experimentally verify an approach to automating the extraction and generation of metadata from datasets.
This approach will take into account datasets in open formats, namely CSV, JSON, XML, and RDF [1].
The implementation will be able to generate metadata from a provided dataset and export it in one of the supported metadata standards: DCAT-AP-CZ [2], Schema.org [3], and VoID [4] at minimum.
The student will:
- analyze existing solutions and approaches for extracting or generating metadata
- propose, implement, test and evaluate a solution for generating metadata from structured data in open formats
- document the analysis, solution and its implementation
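As an illustration of the kind of processing described above, the following minimal sketch derives column-level metadata from a CSV dataset and exports it as a Schema.org Dataset description in JSON-LD. It is not the approach the thesis is expected to deliver. The function names (infer_type, csv_to_schema_org), the coarse type labels, and the choice to record the inferred type in the description of each variableMeasured entry are assumptions made for this example only.

```python
import csv
import io
import json


def infer_type(values):
    """Guess a coarse datatype label for a column from its string values.

    The labels ("Integer", "Number", "Text") are hypothetical, chosen for
    this sketch; they are not prescribed by any metadata standard.
    """
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "Text"
    for label, cast in (("Integer", int), ("Number", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "Text"


def csv_to_schema_org(csv_text, name):
    """Build a Schema.org Dataset description (as a dict) from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    variables = []
    for col in columns:
        # Missing cells in short rows come back as None; treat them as empty.
        values = [(row.get(col) or "") for row in rows]
        variables.append({
            "@type": "PropertyValue",
            "name": col,
            # Assumption: the inferred type is stored as free text.
            "description": "inferred type: " + infer_type(values),
        })
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "encodingFormat": "text/csv",
        "variableMeasured": variables,
    }


if __name__ == "__main__":
    sample = "id,city,population\n1,Praha,1300000\n2,Brno,380000.5\n"
    print(json.dumps(csv_to_schema_org(sample, "cities"), indent=2))
```

A real solution would also need to cover JSON, XML, and RDF inputs, and to map the extracted information to DCAT-AP-CZ [2] and VoID [4] in addition to Schema.org [3].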
References
[1] RDF, W3C, https://www.w3.org/TR/rdf11-concepts/
[2] DCAT-AP-CZ, DIA, https://ofn.gov.cz/rozhraní-katalogů-otevřených-dat/2021-01-11/
[3] Schema.org, https://schema.org/
[4] VoID, https://www.w3.org/TR/void/
[5] Zhao et al. - A Survey of Large Language Models, https://arxiv.org/abs/2303.18223
[6] DataChartRenderer, https://dspace.cuni.cz/handle/20.500.11956/148352
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html