Course, academic year 2018/2019
Automatic Text Data Processing - NPFL098
Title in Czech: Automatické zpracování textových dat
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2018
Semester: summer
E-Credits: 6
Hours per week, examination: summer s.: 2/2 C+Ex [hours/week]
Capacity: unlimited
Min. number of students: unlimited
State of the course: taught
Language: Czech, English
Teaching methods: full-time
Additional information:
Note: course is intended for doctoral students only
Guarantor: Mgr. Pavel Straňák, Ph.D.
Class: Informatics Mgr. - elective
Classification: Informatics > Computer and Formal Linguistics
Annotation -
Last update: Mgr. Pavel Straňák, Ph.D. (10.05.2013)
An introductory course in automatic text processing using the most common and efficient tools and methods. The skills acquired during the course will benefit any scientific work that involves large texts, and they are also required for serious study of computational linguistics.
Course completion requirements -
Last update: Mgr. Pavel Straňák, Ph.D. (10.06.2019)

Oral exam.

Obtaining the course credit is a precondition for taking the exam.

The course credit requires attendance and active participation in class, submission of all homework assignments, and a score of more than 50% of the points for the homework.

Literature -
Last update: Mgr. Pavel Straňák, Ph.D. (10.06.2019)

Learning Perl, 7th Edition (or at least 5th)

Learning the bash Shell

Linux Pocket Guide

Requirements to the exam -
Last update: Mgr. Pavel Straňák, Ph.D. (10.06.2019)

Exams test knowledge of the content explained in the lectures.

Syllabus -
Last update: Mgr. Pavel Straňák, Ph.D. (10.05.2013)

We will use large texts from the students' field of study to demonstrate the most important methods of text processing required to acquire non-trivial information or verify hypotheses.

  • The impact of large text data: properties of big data
  • Unix shell and basic commands
  • more commands for text processing
  • text editors
  • searching via regular expressions
  • using regular expressions for text manipulation
  • formulation and verification of hypotheses, application to data, precision, recall
  • example applications: stripping diacritics, sentence segmentation, tokenisation
  • rule-based part of speech tagging
  • corpus acquisition
  • NLP workflow engines: GATE, OpenNLP, Treex
  • automatic complex analysis of a corpus
  • visualisation of the analysis and results
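
As a rough illustration of three syllabus items (stripping diacritics, tokenisation, and precision and recall), here is a minimal sketch. It is written in Python purely for illustration; the course literature points to Perl and the bash shell, and nothing below is taken from the course materials. The function names are hypothetical, the tokeniser is a deliberately naive regular expression, and the precision and recall computation assumes that predictions and the gold standard are given as sets of items.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Naive tokeniser: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def precision_recall(predicted, gold):
    """Precision = TP / |predicted|, recall = TP / |gold| (both sets)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(strip_diacritics("Automatické zpracování textových dat"))
    print(tokenize("Hello, world!"))
    print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))
```

A real tokeniser or sentence segmenter would need language-specific rules (abbreviations, numbers, clitics); the regular expression here only shows the general idea.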