Language Data Resources - NPFX070

Title:	Zdroje jazykových dat
Guaranteed by:	Student Affairs Department (32-STUD)
Faculty:	Faculty of Mathematics and Physics
Valid:	from 2024
Semester:	winter
E-Credits:	5
Hours per week, examination:	winter s.:1/2, MC [HT]
Capacity:	unlimited
Min. number of students:	unlimited
4EU+:	no
Virtual mobility / capacity:	no
State of the course:	cancelled
Language:	Czech
Teaching methods:	full-time
Is provided by:	NPFL070

Guarantor:	prof. Ing. Zdeněk Žabokrtský, Ph.D.; Mgr. Martin Popel, Ph.D.
Class:	Informatics Mgr. - Mathematical Linguistics
Classification:	Informatics > Computer and Formal Linguistics
Pre-requisite:	{NXXX011, NXXX012, NXXX013, NXXX070, NXXX071}
Incompatibility:	NPFL070
Interchangeability:	NPFL070

Annotation

The goal of the course is to give students a survey of the field of language data resources. Selected types of linguistic annotation will be described, with emphasis on annotating corpus data and lexical data. Students will gain practice in using software tools for processing such data, especially in the Python programming language. Leading projects for English, Czech, and some other languages will be used for illustration.

Last update: Vidová Hladká Barbora, doc. Mgr., Ph.D. (25.01.2019)

Course completion requirements

To pass the course, you need to get at least 50% of the total points from the written test and to submit all homework assignments.

Your grade is based on the average of your test and homework percentages, weighted 1:1. The final grade is assigned according to the following table:

1: ≥ 90%

2: ≥ 70%

3: ≥ 50%

4: < 50%

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
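The grading rule above can be sketched in Python, the language used in the course. This is only an illustrative sketch, not the official grading script: the function names `percentage` and `final_grade` are hypothetical, and returning grade 4 when a pass condition is not met (test below 50% or a missing homework assignment) is an assumption, since the text above does not say which grade is recorded in that case.

```python
def percentage(points, max_points):
    """Convert a point score to a percentage (0-100)."""
    return 100 * points / max_points


def final_grade(homework_pct, test_pct, all_homework_submitted=True):
    """Return a grade from 1 (best) to 4 (fail) under the 1:1 weighting."""
    # Pass conditions: at least 50% on the written test and all homework
    # submitted. ASSUMPTION: failing either condition is reported as grade 4.
    if test_pct < 50 or not all_homework_submitted:
        return 4

    # Test and homework are weighted 1:1, i.e. a plain average.
    total = (homework_pct + test_pct) / 2

    # Thresholds taken from the grading table.
    if total >= 90:
        return 1
    if total >= 70:
        return 2
    if total >= 50:
        return 3
    return 4


# The worked example from the text: 600/1000 homework points, 36/40 test points.
homework = percentage(600, 1000)  # 60.0
test = percentage(36, 40)         # 90.0
print(final_grade(homework, test))  # prints 2
```

The official rules at the URL below take precedence over this sketch.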

For details, see https://ufal.mff.cuni.cz/courses/npfl070#grading

Last update: Popel Martin, Mgr., Ph.D. (12.06.2019)

Literature

Selected papers from related conferences (e.g. LREC, ACL) and journals (e.g. LRE)

Last update: Vidová Hladká Barbora, doc. Mgr., Ph.D. (25.01.2019)

Syllabus

1. Introduction

motivation for building language data resources

typology of language data, usage

principles of annotation

using annotated data for evaluation in Natural Language Processing tasks

2. Corpora

corpus typology, tag sets

example corpora, Czech National Corpus

parallel corpora

searching in corpora

3. Treebanks

constituency and dependency syntactic structures, convertibility

deep syntactic trees

treebank examples

4. Computer lexicography

types of lexical information

examples of lexical data (inflectional and derivational lexicons, wordnets, valency lexicons, translation lexicons etc.)

5. Other types of language data resources

named entity corpora, sentiment corpora, dialog corpora, etc.

6. Authors’ rights perspective on building language data resources; licenses

Last update: Vidová Hladká Barbora, doc. Mgr., Ph.D. (25.01.2019)