Course, academic year 2023/2024
Language Data Resources - NPFX070
Title: Zdroje jazykových dat
Guaranteed by: Student Affairs Department (32-STUD)
Faculty: Faculty of Mathematics and Physics
Actual: from 2022
Semester: winter
E-Credits: 5
Hours per week, examination: winter s.:1/2, MC [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: taught
Language: Czech
Teaching methods: full-time
Is provided by: NPFL070
Guarantor: doc. Ing. Zdeněk Žabokrtský, Ph.D.
Mgr. Martin Popel, Ph.D.
Class: Informatika Mgr. - Matematická lingvistika
Classification: Informatics > Computer and Formal Linguistics
Pre-requisite : {NXXX011, NXXX012, NXXX013, NXXX070, NXXX071}
Incompatibility : NPFL070
Interchangeability : NPFL070
Annotation -
Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)
The goal of the course is to give students a survey of the field of language data resources. Selected types of linguistic annotation will be described, with emphasis on annotating corpus data and lexical data. Students will gain practice in using software tools for processing such data, especially in the Python programming language. Leading projects for English, Czech, and some other languages will be used for illustration.
Course completion requirements -
Last update: Mgr. Martin Popel, Ph.D. (12.06.2019)

To pass the course, you need to get at least 50% of the total points from the written test and submit all homework assignments.

Your grade is based on your average performance on the test and the homework assignments, weighted 1:1. The final grade is assigned according to the following table:

1: ≥ 90%

2: ≥ 70%

3: ≥ 50%

4: < 50%

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
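
The calculation in this example can be sketched in Python as follows. This is only an illustration of the 1:1 averaging and the grade table above; the function names `total_performance` and `final_grade` are hypothetical, and the sketch does not check the separate pass conditions (the test threshold and the submission of all homework assignments).

```python
def total_performance(hw_points, hw_max, test_points, test_max):
    """Return the average of the homework and test percentages (weighted 1:1)."""
    hw_pct = 100 * hw_points / hw_max
    test_pct = 100 * test_points / test_max
    return (hw_pct + test_pct) / 2


def final_grade(performance):
    """Map a total performance in percent to a grade using the table above."""
    if performance >= 90:
        return 1
    if performance >= 70:
        return 2
    if performance >= 50:
        return 3
    return 4


# The example from the text: 600/1000 homework points, 36/40 test points.
performance = total_performance(600, 1000, 36, 40)
print(performance, final_grade(performance))  # 75.0 2
```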

For details, see

Literature -
Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)
  • Selected papers from related conferences (e.g. LREC, ACL) and journals (e.g. LRE)

Syllabus -
Last update: doc. Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)

1. Introduction

  • motivation for building language data resources
  • typology of language data, usage
  • principles of annotation
  • using annotated data for evaluation in Natural Language Processing tasks

2. Corpora

  • corpus typology, tag sets
  • example corpora, Czech National Corpus
  • parallel corpora
  • searching in corpora

3. Treebanks

  • constituency and dependency syntactic structures, convertibility
  • deep syntactic trees
  • treebank examples

4. Computer lexicography

  • types of lexical information
  • examples of lexical data (inflectional and derivational lexicons, wordnets, valency lexicons, translation lexicons etc.)

5. Other types of language data resources

  • named entity corpora, sentiment corpora, dialog corpora, etc.

6. Authors’ rights perspective on building language data resources; licenses
