Introduction to Text Processing and Analysis - AMLV00067

English title:	Introduction to Text Processing and Analysis
Guaranteed by:	Institute of the Czech National Corpus (21-UCNK)
Faculty:	Faculty of Arts
Valid:	from 2019
Semester:	winter
Points:	0
E-Credits:	4
Examination process:	winter s.:
Hours per week, examination:	winter s.: 1/1, Z [HT]
Capacity:	not specified / not specified (not specified)
Minimum number of students:	unlimited
4EU+:	no
Virtual mobility / capacity for virtual mobility:	no
Competences:
State of the course:	not taught
Language of instruction:	English
Teaching method:	in person
Level:
Note:	the course can be enrolled in outside the study plan; enabled for web enrollment

Guarantor:	Mgr. Pavel Vondřička, Ph.D.; Mgr. Lucie Lukešová, Ph.D.

Annotation - English

Last update: Mgr. Lucie Lukešová, Ph.D. (12.07.2018)

The main objective of the course is to give beginner-level digital linguistics students a grounding in text
processing and analysis. Starting with basic topics, such as the characteristics of the plain text format
and the difference between data and metadata, the course goes on to explain the specifics of XML and the different
types of text annotation, to introduce tokenization, segmentation and morphological analysis, to
describe the possibilities and limits of syntactic and semantic tagging and, finally, to summarize the principles of
the Corpus Query Language (CQL) and corpus querying, including the use of regular expressions and the querying of
parallel corpora.

Aim of the course - English

Last update: Mgr. Lucie Lukešová, Ph.D. (12.07.2018)

To understand how computers work with textual data;
To distinguish between different data formats and extract textual content from them (e.g. using OCR);
To understand the specifics of plain text and XML formats;
To understand the principles and issues of text annotation, including morphological analysis and syntactic and semantic tagging;
To learn about available resources for text processing and analysis, including taggers and concordancers;
To be able to analyse existing corpora as well as their own in a variety of available corpus tools;
To build complex CQL queries, including regular expressions and logical operators.

Syllabus - English

Last update: Mgr. Lucie Lukešová, Ph.D. (12.07.2018)

File formats related to textual data
Plain text: Encoding, data and metadata
Extensible Markup Language (XML)
Regular expressions
Tokenization and corpus-data formats
Morphological analysis: principles and tools
Syntactic and semantic annotation
Corpus exploration and analysis
Querying corpus data with CQL
Text alignment and parallel corpora