Language Data Resources - NPFL070

Title:	Zdroje jazykových dat
Guaranteed by:	Institute of Formal and Applied Linguistics (32-UFAL)
Faculty:	Faculty of Mathematics and Physics
Valid:	from 2020
Semester:	winter
E-Credits:	4
Hours per week, examination:	winter s.:1/2, MC [HT]
Capacity:	unlimited
Min. number of students:	unlimited
4EU+:	no
Virtual mobility / capacity:	no
State of the course:	taught
Language:	Czech, English
Teaching methods:	full-time
Additional information:	https://ufal.mff.cuni.cz/courses/npfl070

Guarantor:	doc. Ing. Zdeněk Žabokrtský, Ph.D.; Mgr. Martin Popel, Ph.D.
Class:	Informatics (Master's) - Mathematical Linguistics
Classification:	Informatics > Computer and Formal Linguistics
Is co-requisite for:	NPFL076
Is incompatible with:	NPFX070
Is interchangeable with:	NPFX070

Annotation

Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)

The goal of the course is to give students a survey of the field of language data resources. Selected types of linguistic annotation are described, with emphasis on annotated corpus data and lexical data. Students gain practice in using software tools for processing such data, especially in the Python programming language. Leading projects for English, Czech, and some other languages are used for illustration.

Course completion requirements

Last update: Mgr. Martin Popel, Ph.D. (12.06.2019)

To pass the course, you need to get at least 50% of the total points from the written test and submit all homework assignments.

Your grade is based on the average of your performance; the test and the homework assignments are weighted 1:1. The final grade is assigned according to the following table:

1: ≥ 90%

2: ≥ 70%

3: ≥ 50%

4: < 50%

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
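
The grading rule above can be sketched in Python as follows. This is only an illustration, not the course's official script: the function name `final_grade` and its parameters are hypothetical, the passing condition is read as "at least 50% of the test points", and treating an unmet passing condition as grade 4 is an assumption.

```python
def final_grade(homework_points, homework_max, test_points, test_max,
                all_homework_submitted=True):
    """Return (total_percent, grade) for the 1:1 weighting described above.

    Hypothetical helper: names and the handling of unmet passing
    conditions are assumptions, not taken from the course materials.
    """
    homework_pct = 100 * homework_points / homework_max
    test_pct = 100 * test_points / test_max

    # Test and homework are weighted 1:1, so the total is a plain average.
    total_pct = (homework_pct + test_pct) / 2

    # Assumption: failing either passing condition results in grade 4.
    if test_pct < 50 or not all_homework_submitted:
        return total_pct, 4

    # Thresholds from the grading table: 1 from 90%, 2 from 70%, 3 from 50%.
    if total_pct >= 90:
        grade = 1
    elif total_pct >= 70:
        grade = 2
    elif total_pct >= 50:
        grade = 3
    else:
        grade = 4
    return total_pct, grade


# The worked example from the text: 600/1000 for homework, 36/40 for the test.
print(final_grade(600, 1000, 36, 40))  # (75.0, 2)
```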

For details, see https://ufal.mff.cuni.cz/courses/npfl070#grading

Literature

Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)

Selected papers from related conferences (e.g. LREC, ACL) and journals (e.g. LRE)

Syllabus

Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)

1. Introduction

motivation for building language data resources

typology of language data, usage

principles of annotation

using annotated data for evaluation in Natural Language Processing tasks

2. Corpora

corpus typology, tag sets

example corpora, Czech National Corpus

parallel corpora

searching in corpora

3. Treebanks

constituency and dependency syntactic structures, convertibility

deep syntactic trees

treebank examples

4. Computer lexicography

types of lexical information

examples of lexical data (inflectional and derivational lexicons, wordnets, valency lexicons, translation lexicons etc.)

5. Other types of language data resources

named entity corpora, sentiment corpora, dialog corpora, etc.

6. Copyright aspects of building language data resources; licenses