
Language Technologies for Research in Humanities - NPFL152

Title:	Jazykové technologie pro výzkum v humanitních oborech
Guaranteed by:	Institute of Formal and Applied Linguistics (32-UFAL)
Faculty:	Faculty of Mathematics and Physics
Current:	from 2025
Semester:	summer
E-Credits:	3
Hours per week, examination:	summer s.:1/2, C [HT]
Capacity:	unlimited
Min. number of students:	unlimited
4EU+:	no
Virtual mobility / capacity:	no
State of the course:	taught
Language:	Czech
Teaching methods:	full-time
Additional information:	http://ufal.mff.cuni.cz/courses/npfl152

Guarantor:	Mgr. Bc. Pavel Straňák, Ph.D.; Mgr. Jan Štěpánek, Ph.D.
Teacher(s):	Mgr. Bc. Pavel Straňák, Ph.D.; Mgr. Jan Štěpánek, Ph.D.
Class:	Informatika Mgr. - Matematická lingvistika
Classification:	Informatics > Computer and Formal Linguistics
Incompatibility:	NPFL098, NPFL131
Interchangeability:	NPFL098, NPFL131


Annotation

You will learn to use tools and procedures for the automatic processing of large text collections in different languages efficiently. The skills acquired will support independent scientific work with language data in any area of the humanities.

Last update: Vidová Hladká Barbora, doc. Mgr., Ph.D. (05.02.2026)

Course completion requirements

Credit is awarded for active participation in class, submitting all homework assignments, and earning more than 70% of the points available for those assignments.


Literature

Presentations from previous years: http://ufal.mff.cuni.cz/courses/NPFL131

Learning Perl, 8th Edition (the 5th Edition or later is also usable)

Pro Git

Learning the bash Shell

Linux Pocket Guide


Syllabus

Working with large texts, we will learn the basic text-processing methods needed to extract non-trivial information. For Czech we will use works by Karel Čapek, for Classical Chinese selected texts from https://github.com/kanripo, and for other languages works chosen according to the students' focus.

importance and statistical properties of Big Data

Unix shell: the most basic commands

more Unix commands and basic Perl for manipulating texts

text editors

quantitative analysis of text

comparing texts and visualizing differences

search using regular expressions

using regular expressions to batch edit text

diacritic removal, sentence segmentation, tokenization

getting information on Chinese characters from the Unihan database

rule-based automatic part of speech identification

creating your own corpus

"NLP workflow engines" - GATE, OpenNLP, Treex

calling REST APIs

using UDPipe and selecting the appropriate model when more than one is available for a language

visualization of analysis and results
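
As a rough illustration of a few of the topics above (diacritic removal, sentence segmentation, tokenization, and a simple quantitative analysis), here is a minimal sketch. It is written in Python purely for illustration; the course itself works with the Unix shell and Perl, and this is not taken from the course materials. The function names are hypothetical, and the segmentation and tokenization rules are deliberately naive assumptions: for instance, an abbreviation such as "např." would be wrongly treated as the end of a sentence.

```python
import re
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_sentences(text: str) -> list[str]:
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text: str) -> list[str]:
    """Naive rule: a token is a lowercased run of word characters."""
    # Compose to NFC first so accented letters are single code points
    # and are not split apart by the \w+ pattern.
    composed = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", composed.lower())


def word_frequencies(text: str) -> Counter:
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    sample = "Kůň běží. Kůň spí! Kde je kůň?"
    print(strip_diacritics(sample))
    print(split_sentences(sample))
    print(word_frequencies(sample).most_common(3))
```

For real work, a trained tool such as UDPipe would normally replace the naive segmentation and tokenization rules used here.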
