Course, academic year 2023/2024
Introduction to Text Processing and Analysis - AMLV00067
Title: Introduction to Text Processing and Analysis
Guaranteed by: Institute of the Czech National Corpus (21-UCNK)
Faculty: Faculty of Arts
Actual: from 2019
Semester: winter
Points: 0
E-Credits: 4
Hours per week, examination: winter s.: 1/1, C [HT]
Capacity: unknown / unknown (unknown)
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: not taught
Language: English
Teaching methods: full-time
Note: course can be enrolled in outside the study plan
enabled for web enrollment
Guarantor: Mgr. Pavel Vondřička, Ph.D.
Mgr. Lucie Lukešová, Ph.D.
Last update: Mgr. Lucie Lukešová, Ph.D. (12.07.2018)
The main objective of the course is to give beginner-level students of digital linguistics the foundations of
text processing and analysis. Starting with basic topics, such as the characteristics of the plain text format
and the difference between data and metadata, the course goes on to explain the specifics of XML and the different
types of text annotation; to introduce tokenization, segmentation and morphological analysis; to describe the
possibilities and limits of syntactic and semantic tagging; and, finally, to summarize the principles of CQL
(Corpus Query Language) and corpus querying, including the use of regular expressions and the querying of parallel corpora.
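
As a rough illustration of the tokenization step mentioned above, the following Python sketch splits raw text into tokens with a regular expression and prints them one per line, in the style of a simple "vertical" corpus format. The pattern, the function names and the sample sentence are assumptions made for this example only; they are not taken from the course materials, and real tokenizers handle many more cases (abbreviations, URLs, language-specific clitics).

```python
import re

# Hypothetical pattern: a word (optionally with internal hyphens or
# apostrophes), or any single character that is neither a word
# character nor whitespace (i.e. a punctuation mark).
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


def to_vertical(tokens):
    """Join tokens one per line, as in a simple vertical format."""
    return "\n".join(tokens)


if __name__ == "__main__":
    sample = "Corpora aren't small: this one has 3 sentences."
    print(to_vertical(tokenize(sample)))
```

Keeping punctuation marks as separate tokens is a design choice of this sketch; it makes later annotation steps, such as morphological tagging, easier to attach to each position.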
Aim of the course
Last update: Mgr. Lucie Lukešová, Ph.D. (12.07.2018)
  1. To understand how computers work with textual data;

  2. To distinguish between different data formats and extract textual content from them (e.g. using OCR);

  3. To understand the specifics of plain text and XML formats;

  4. To understand the principles and issues of text annotation, including morphological analysis and syntactic and semantic tagging;

  5. To learn about available resources for text processing and analysis, including taggers and concordancers;

  6. To be able to analyse existing corpora as well as one's own in a variety of available corpus-based tools;

  7. To build complex CQL queries, including regular expressions and logical operators.
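
To give a sense of what the last aim involves, a CQL query such as `[lemma="go" & tag="V.*"]` asks for every token whose lemma is "go" and whose tag matches the regular expression `V.*`. The Python sketch below imitates that behaviour on a toy annotated corpus. It is only an approximation under stated assumptions: the corpus, the Penn-style tags and the helper names `matches` and `query` are invented for illustration, and actual CQL implementations support far more (sequences of tokens, structures, negation, and so on).

```python
import re

# Toy annotated corpus (hypothetical data): each token carries
# positional attributes, as in a tagged and lemmatized corpus.
corpus = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "went", "lemma": "go", "tag": "VBD"},
    {"word": "home", "lemma": "home", "tag": "NN"},
    {"word": "going", "lemma": "go", "tag": "VBG"},
]


def matches(token, conditions):
    """True if every attribute matches its pattern (logical AND).

    Patterns are matched against the whole attribute value, which is
    assumed here to mirror how CQL treats regular expressions.
    """
    return all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in conditions.items()
    )


def query(tokens, conditions):
    """Return the positions of all tokens satisfying the conditions."""
    return [i for i, tok in enumerate(tokens) if matches(tok, conditions)]


if __name__ == "__main__":
    # Rough analogue of the CQL query [lemma="go" & tag="V.*"]
    for position in query(corpus, {"lemma": "go", "tag": "V.*"}):
        print(position, corpus[position]["word"])
```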

Syllabus
Last update: Mgr. Lucie Lukešová, Ph.D. (12.07.2018)
  1. File formats related to textual data

  2. Plain text: Encoding, data and metadata

  3. Extensible Markup Language (XML)

  4. Regular expressions

  5. Tokenization and corpus-data formats

  6. Morphological analysis: principles and tools

  7. Syntactic and semantic annotation

  8. Corpus exploration and analysis

  9. Querying corpus data with CQL

  10. Text alignment and parallel corpora
