Course, academic year 2016/2017
Automatic Text Data Processing - NPFL098
Title: Automatické zpracování textových dat
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2013 to 2017
Semester: summer
E-Credits: 6
Hours per week, examination: summer s.:2/2, C+Ex [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: taught
Language: Czech, English
Teaching methods: full-time
Note: course is intended for doctoral students only
Guarantor: Mgr. Pavel Straňák, Ph.D.
Class: Informatics (Master's) - elective
Classification: Informatics > Computer and Formal Linguistics
Annotation -
Last update: Mgr. Pavel Straňák, Ph.D. (10.05.2013)
An introductory course in automatic text processing using the most common and efficient tools and methods. The skills acquired during the course will benefit any scientific work that involves large texts, and they are also required for serious study of computational linguistics.
Literature -
Last update: Mgr. Pavel Straňák, Ph.D. (10.06.2019)

http://ufal.mff.cuni.cz/~stranak/2012/index.html

Learning Perl, Fifth Edition

Learning the bash Shell

Linux Pocket Guide

Syllabus -
Last update: Mgr. Pavel Straňák, Ph.D. (10.05.2013)

We will use large texts from the students' field of study to demonstrate the most important methods of text processing required to acquire non-trivial information or verify hypotheses.

  • the impact of large text data: properties of big data
  • unix shell and basic commands
  • more commands for text processing
  • text editors
  • searching via regular expressions
  • using regular expressions for text manipulation
  • formulation and verification of hypotheses, application to data, precision, recall
  • example applications: stripping diacritics, sentence segmentation, tokenisation
  • rule-based part of speech tagging
  • corpus acquisition
  • NLP workflow engines: GATE, OpenNLP, Treex
  • automatic complex analysis of a corpus
  • visualisation of the analysis and results
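
A minimal sketch of a few of the topics listed above (stripping diacritics, tokenisation, sentence segmentation, precision and recall). It is written in Python purely for illustration; the course literature points to the shell and Perl instead. The function names and the deliberately naive rules are assumptions made for this example, not course material.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(text):
    """Naive rule: split at whitespace that follows '.', '!' or '?'.

    This breaks on abbreviations such as 'Ph.D. student'; a real
    segmenter needs more rules or a trained model.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


def precision_recall(predicted, gold):
    """Compare a set of predicted items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, strip_diacritics("Automatické zpracování textových dat") gives "Automaticke zpracovani textovych dat", and precision_recall can be used to check a hypothesis such as "every token ending in -ly is an adverb" against a hand-annotated sample.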
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html