Course, academic year 2023/2024
Automatic Text Data Processing - NPFL098
Title: Automatické zpracování textových dat
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2022
Semester: summer
E-Credits: 6
Hours per week, examination: summer s.:2/2, C+Ex [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: cancelled
Language: Czech, English
Teaching methods: full-time
Additional information: http://ufal.mff.cuni.cz/courses/npfl098
Note: course is intended for doctoral students only
Guarantor: Mgr. Pavel Straňák, Ph.D.
Class: Informatics, Master's (Mgr.) - elective
Classification: Informatics > Computer and Formal Linguistics
Incompatibility: NPFL131
Interchangeability: NPFL131
Is incompatible with: NPFL131
Is interchangeable with: NPFL131
Annotation -
Last update: Mgr. Pavel Straňák, Ph.D. (10.05.2013)
An introductory course on automatic text processing using the most common and efficient tools and methods. The skills acquired in the course benefit any scientific work that involves large texts, and they are required for serious study of computational linguistics.
Course completion requirements -
Last update: Mgr. Pavel Straňák, Ph.D. (10.06.2019)

Oral exam.

Obtaining the course credit is a precondition for taking the exam.

The course credit requires attendance and activity in class, submitting all homework assignments, and earning more than 50% of the homework points.

Literature -
Last update: Mgr. Pavel Straňák, Ph.D. (10.06.2019)

http://ufal.mff.cuni.cz/courses/npfl098

Learning Perl, 7th Edition (or at least 5th)

Learning the bash Shell

Linux Pocket Guide

Requirements to the exam -
Last update: Mgr. Pavel Straňák, Ph.D. (10.06.2019)

The exam tests knowledge of the content covered in the lectures.

Syllabus -
Last update: Mgr. Pavel Straňák, Ph.D. (10.05.2013)

We will use large texts from the students' field of study to demonstrate the most important methods of text processing required to acquire non-trivial information or verify hypotheses.

  • the impact of large text data: properties of big data
  • Unix shell and basic commands
  • more commands for text processing
  • text editors
  • searching via regular expressions
  • using regular expressions for text manipulation
  • formulation and verification of hypotheses, application to data, precision, recall
  • example applications: stripping diacritics, sentence segmentation, tokenisation
  • rule-based part of speech tagging
  • corpus acquisition
  • NLP workflow engines: GATE, OpenNLP, Treex
  • automatic complex analysis of a corpus
  • visualisation of the analysis and results
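
To give a flavour of the example applications and evaluation measures listed above, here is a minimal sketch in Python. It is an illustration only and not course material: the course literature points to Perl and bash, and the function names (strip_diacritics, tokenize, precision_recall) are hypothetical. Diacritics are stripped by Unicode decomposition, tokenisation uses a deliberately naive regular expression, and precision and recall are computed over sets of predicted and gold items.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list:
    """Naive tokeniser: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def precision_recall(predicted: set, gold: set) -> tuple:
    """Precision = TP / |predicted|, recall = TP / |gold| (0.0 if undefined)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(strip_diacritics("Automatické zpracování textových dat"))
    print(tokenize("Hello, world!"))
    print(precision_recall({1, 2, 3}, {2, 3, 4, 5}))
```

The regular expression treats every punctuation character as its own token, which is too coarse for abbreviations or numbers; a real tokeniser would need language-specific rules.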
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html