Data Science in R for Students of Humanities - NPFL112

Title:	Analýza dat v R pro studenty humanitních oborů
Guaranteed by:	Institute of Formal and Applied Linguistics (32-UFAL)
Faculty:	Faculty of Mathematics and Physics
Valid:	from 2025
Semester:	both
E-Credits:	3
Hours per week, examination:	2/0, Ex [HT]
Capacity:	unlimited
Min. number of students:	unlimited
4EU+:	no
Virtual mobility / capacity:	no
Key competences:	critical thinking, data literacy
State of the course:	taught
Language:	Czech, English
Teaching methods:	full-time
Additional information:	https://ufal.mff.cuni.cz/courses/r-for-humanities/english
Note:	you can enroll in the course in either the winter or the summer semester

Guarantor:	Mgr. Silvie Cinková, Ph.D.
Teacher(s):	Mgr. Silvie Cinková, Ph.D.
Class:	Informatika Mgr. - Matematická lingvistika
Classification:	Informatics > Computer and Formal Linguistics

Annotation

The humanities have seen an irreversible paradigm shift towards Digital Humanities, based on automatic quantitative analysis of (big) data. We will teach you:
- to clean and structure data into neat tables;
- to discover trends, recurring patterns, and outliers;
- the basics of modern data visualization.

We use the open-source programming language R along with its advanced RStudio IDE and tidyverse, the globally popular collection of professional data-scientific tools.

Last update: Kuboň Vladislav, doc. RNDr., Ph.D. (05.06.2018)

Course completion requirements

The course is completed with an examination, but there is no final test. Instead, your grade is based on the coursework you complete during the semester, as follows:

Grade C: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), one home assignment submitted on time and approved by the teacher.

Grade B: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), two home assignments submitted on time and approved by the teacher.

Grade A: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), three home assignments submitted on time and approved by the teacher.

Only DataCamp XP that you earn in your current term, in the DataCamp courses listed for home assignments, counts towards your limit. If you have already completed those courses in the past, you must agree on an alternative list of DataCamp courses with the teacher in advance.

Your free DataCamp license is valid for six months from the start of the course and cannot be extended. You must complete your assignments within that period. No alternative assignments can be negotiated.
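
The grading rule above is simple arithmetic, so it may help to see it spelled out. The following is only an illustrative sketch in Python, not an official calculator: the function and constant names are hypothetical, and it assumes that every absence is compensated purely by extra XP, that only eligible XP (current term, listed courses) is passed in, and that more than three approved assignments still yield grade A.

```python
from typing import Optional

BASE_XP = 30_000        # XP required for any passing grade (from the rules above)
XP_PER_ABSENCE = 1_000  # extra XP required for each missed class


def required_xp(absences: int) -> int:
    """Return the XP limit after adding the surcharge for absences."""
    return BASE_XP + XP_PER_ABSENCE * absences


def grade(eligible_xp: int, absences: int, approved_assignments: int) -> Optional[str]:
    """Return 'A', 'B' or 'C', or None if the passing conditions are not met.

    Assumption: eligible_xp already excludes XP from other terms or
    from DataCamp courses that are not on the assigned list.
    """
    if eligible_xp < required_xp(absences) or approved_assignments < 1:
        return None
    if approved_assignments == 1:
        return "C"
    if approved_assignments == 2:
        return "B"
    return "A"  # three (or, by assumption, more) approved assignments
```

For example, under these assumptions a student with two absences needs 32,000 XP, so `grade(32_000, 2, 3)` gives `"A"`, while `grade(31_000, 2, 3)` gives `None`.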

Last update: Cinková Silvie, Mgr., Ph.D. (23.05.2025)

Literature

Hadley Wickham and Garrett Grolemund. 2017. R for Data Science. O'Reilly. Currently free online: http://r4ds.had.co.nz/

Garrett Grolemund. 2014. Hands-On Programming with R. O'Reilly.

Nina Zumel and John Mount. 2014. Practical Data Science with R. Manning.

Julia Silge and David Robinson. 2017. Text Mining with R. A tidy approach. O'Reilly.

Stefan Th. Gries. 2013. Statistics for Linguistics with R. A practical introduction. De Gruyter.

Stefan Th. Gries. 2009. Quantitative Corpus Linguistics with R. Routledge.

Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer.

Natalia Levshina. 2015. How to do Linguistics with R. Data exploration and statistical analysis. John Benjamins.

Simon Munzert, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.

Last update: Kuboň Vladislav, doc. RNDr., Ph.D. (05.06.2018)

Requirements for the exam

The course is completed with an examination. The examination does not include a final test; instead, it consists of an assessment of the student's work over the whole semester according to the following criteria:

Good: 30,000 XP on DataCamp, active attendance in class (or the equivalent in DataCamp XP: each missed class = 1,000 extra XP), 1 independent home assignment submitted on time.

Very good: 30,000 XP on DataCamp, active attendance in class (or the equivalent in DataCamp XP: each missed class = 1,000 extra XP), 2 independent home assignments submitted on time.

Excellent: 30,000 XP on DataCamp, active attendance in class (or the equivalent in DataCamp XP: each missed class = 1,000 extra XP), 3 independent home assignments submitted on time.

Only points earned in the current semester and in the prescribed courses count towards the DataCamp XP limit (a student who has already completed these courses in the past must arrange an individual alternative assignment with the teacher).

The deadline for completing the coursework assigned on the DataCamp platform is limited by the validity of the license (exactly 6 months from the first scheduled class of the semester). Substitute coursework outside DataCamp is not possible.

Last update: Cinková Silvie, Mgr., Ph.D. (23.05.2025)

Syllabus

1. Basic concepts of R, advantages of R in data analysis as a subdiscipline of programming

2. Tables, vectors, loading a table file, vector as a table column, variable types as vector classes, selection (subsetting) of elements, rows and columns in base R

3. ggplot2 graphics library, mapping variables to aesthetic scales, types of graphs and scales (geom_, scale_ functions)

4. Data wrangling - dplyr library: selection and manipulation of rows (filter, slice, arrange) and columns (select, rename, mutate, if_else, case_when)

5. Data wrangling - groups (group_by, across, rowwise), aggregation (count, summarize)

6. Table joins (SQL-like)

7. "tidy data" concept, conversion between "wider" and "longer" table format for use with dplyr and ggplot2, tidyr (pivot_longer, pivot_wider, unite and separate)

8. Operations on strings, regular expressions incl. "look-around"

9. The concept of iteration in R: vectorization, loop, apply family functions and map family functions from the purrr library in common user situations

10. Text mining with the help of automatic syntactic annotation, interaction with the API of the UDPipe syntactic parser

Favorite datasets: gapminder (https://www.gapminder.org/data/), built-in datasets iris, diamonds, corpora

Last update: Cinková Silvie, Mgr., Ph.D. (22.05.2023)

Entry requirements

English, basic computer literacy, frustration tolerance, and the discipline to do regular homework. No programming skills required.

Last update: Cinková Silvie, Mgr., Ph.D. (23.05.2025)