Course, academic year 2019/2020
Language Data Resources - NPFL070
Title in Czech: Zdroje jazykových dat
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2019 to 2019
Semester: winter
E-Credits: 5
Hours per week, examination: winter s.: 1/2, MC [hours/week]
Capacity: unlimited
Min. number of students: unlimited
State of the course: taught
Language: Czech, English
Teaching methods: full-time
Guarantor: doc. Ing. Zdeněk Žabokrtský, Ph.D.
Mgr. Martin Popel, Ph.D.
Class: Informatics (Mgr.) - Mathematical Linguistics
Classification: Informatics > Computer and Formal Linguistics
Annotation -
Last update: Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)
The goal of the course is to give students a survey of the field of language data resources. Selected types of linguistic annotation will be described, with emphasis on the annotation of corpus data and lexical data. Students will gain practice in using software tools for processing such data, especially in the Python programming language. Leading projects for English, Czech, and some other languages will be used for illustration.
Course completion requirements -
Last update: Mgr. Martin Popel, Ph.D. (12.06.2019)

To pass the course, you need to get at least 50% of the total points from the written test and submit all homework assignments.

Your grade is based on your total performance, which is the average of your test percentage and your homework percentage, weighted 1:1. The final grade is assigned according to the following table:

1: ≥ 90%

2: ≥ 70%

3: ≥ 50%

4: < 50%

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
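
The calculation above can be sketched in Python as follows. This is only an illustration of the stated rules, not the official grading script: the function names are hypothetical, and mapping an unmet pass requirement (test below 50% or a missing homework assignment) to grade 4 is an assumption.

```python
def percent(points: float, max_points: float) -> float:
    """Convert a point score to a percentage in the range 0-100."""
    return 100 * points / max_points


def final_grade(homework_pct: float, test_pct: float,
                all_homework_submitted: bool = True) -> int:
    """Return the grade (1-4) for homework and test percentages."""
    # Pass requirement: at least 50% on the test and all homework submitted.
    # Assumption: failing this requirement results in grade 4.
    if test_pct < 50 or not all_homework_submitted:
        return 4
    # Test and homework are weighted 1:1, so the total is a plain average.
    total = (homework_pct + test_pct) / 2
    if total >= 90:
        return 1
    if total >= 70:
        return 2
    if total >= 50:
        return 3
    return 4


# The example from the text: 600/1000 homework points, 36/40 test points.
print(final_grade(percent(600, 1000), percent(36, 40)))  # prints 2
```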

For details, see https://ufal.mff.cuni.cz/courses/npfl070#grading

Literature -
Last update: Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)
  • Selected papers from related conferences (e.g. LREC, ACL) and journals (e.g. LRE)

Syllabus -
Last update: Mgr. Barbora Vidová Hladká, Ph.D. (25.01.2019)

1. Introduction

  • motivation for building language data resources
  • typology of language data, usage
  • principles of annotation
  • using annotated data for evaluation in Natural Language Processing tasks

2. Corpora

  • corpus typology, tag sets
  • example corpora, Czech National Corpus
  • parallel corpora
  • searching in corpora

3. Treebanks

  • constituency and dependency syntactic structures, convertibility
  • deep syntactic trees
  • treebank examples

4. Computer lexicography

  • types of lexical information
  • examples of lexical data (inflectional and derivational lexicons, wordnets, valency lexicons, translation lexicons etc.)

5. Other types of language data resources

  • named entity corpora, sentiment corpora, dialog corpora, etc.

6. Copyright aspects of building language data resources; licenses

 