Course, academic year 2025/2026
Data Mining - NDBI023
Title: Dobývání znalostí
Guaranteed by: Department of Theoretical Computer Science and Mathematical Logic (32-KTIML)
Faculty: Faculty of Mathematics and Physics
Actual: from 2020
Semester: summer
E-Credits: 5
Hours per week, examination: summer s.:2/2, C+Ex [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: taught
Language: English, Czech
Teaching methods: full-time
Guarantor: doc. RNDr. Iveta Mrázová, CSc.
RNDr. František Mráz, CSc.
Teacher(s): RNDr. František Mráz, CSc.
doc. RNDr. Iveta Mrázová, CSc.
Class: Computer Science (Master's) - Theoretical Computer Science
Computer Science (Master's) - Software Systems
Classification: Informatics > Database Systems, Theoretical Computer Science
Is incompatible with: NDBX023
Is interchangeable with: NDBX023
Annotation -
The rapid development of data mining is motivated by the need to "translate" huge amounts of processed and stored data into meaningful information that is easy to use in practice. This lecture focuses on the principal concepts and techniques applicable to data mining. Students will apply these principles to novel solutions of practical tasks in a student project that is part of the course. Possible application areas include mainly business and Web applications, among others. Knowledge of mathematical principles and programming at the level of a BSc. in computer science is assumed.
Last update: Hric Jan, RNDr. (28.05.2020)
Aim of the course -

Understand the main principles of data mining methods and learn how to apply these methods in practice.

Last update: Mrázová Iveta, doc. RNDr., CSc. (27.05.2020)
Course completion requirements -

A) The labs

A homework assignment and quizzes will be published step by step in the accompanying Moodle course.

The student shall submit the solution to the homework assignment as a Jupyter notebook. The submission time is the moment you click the "Submit assignment" button in the Moodle system. After clicking this button, you can no longer edit your submission, but you can ask your teacher (via email) to return the assignment to the draft state. The teacher will grade the submitted assignment on a scale of 0 to 10 points.

Warning: If N≥2 participants of the course submit very similar or identical solutions, all these solutions will be treated as a single solution. This solution will be awarded B points according to its quality, and each student who submitted it will receive only the integer part of B/N points.
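The rule above is simple arithmetic. A minimal sketch in Python follows; the function name `shared_points` is hypothetical and only illustrates the stated rule, it is not part of any official grading tool.

```python
def shared_points(quality_points: float, num_students: int) -> int:
    """Points each student receives when `num_students` (N) participants
    submit very similar or identical solutions graded `quality_points` (B).

    Implements "the integer part of B/N" from the warning above.
    """
    if num_students < 1:
        raise ValueError("num_students must be at least 1")
    if quality_points < 0:
        raise ValueError("quality_points must be non-negative")
    # int() truncates toward zero, i.e. it keeps the integer part.
    return int(quality_points / num_students)


# Example: a solution worth 9 points shared by 2 students.
print(shared_points(9, 2))
```

For instance, a shared solution worth 10 points submitted by three students would give each of them 3 points under this rule.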

Quizzes:

Besides the assignment, you will complete 5 online quizzes worth at most 25 points in total. Each quiz will also have a deadline. In contrast to the assignment, a quiz cannot be completed after its deadline.

To obtain credits for the seminar, it is necessary:

  1. To solve the homework assignment. WARNING: A late submission incurs a 1-point penalty for each week of delay after the deadline.
  2. To prepare a term project and present it in the lab, either in the last week of the term or on a date during the following exam period that will be set in the lab in the last week of the semester. The subject of the project will be discussed in the lab in the middle of the term. Each project will be awarded up to 15 points according to its quality.

During labs, it is possible to earn additional points, for example, 1 point for demonstrating a solution to a problem assigned during the lab.

Excluding the additional points, it is possible to obtain up to 50 points (10 for the homework assignment, 25 for the quizzes, and 15 for the project). All points obtained during labs will account for up to 40% of the final exam score. Even when a student obtains more than 50 points within labs (including additional points), these points will still account for only 40% of the final exam score.
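How lab points might translate into a share of the final score can be sketched as follows. This is an illustration under two assumptions that the text does not state: the conversion is linear (50 points correspond to 40%), and the homework score cannot drop below 0 after the late penalty. All names are hypothetical.

```python
MAX_LAB_POINTS = 50    # 10 (homework) + 25 (quizzes) + 15 (project)
MAX_LAB_SHARE = 40.0   # labs account for up to 40% of the final score


def homework_points(graded: int, weeks_late: int = 0) -> int:
    """Homework score after a 1-point penalty per week of delay.

    Assumption: the result is never negative.
    """
    return max(0, graded - weeks_late)


def lab_share(homework: int, quizzes: int, project: int, extra: int = 0) -> float:
    """Percentage of the final score earned in the labs.

    Assumption: linear scaling, capped at MAX_LAB_SHARE once the total
    (including additional points) reaches MAX_LAB_POINTS.
    """
    total = homework + quizzes + project + extra
    return min(total, MAX_LAB_POINTS) / MAX_LAB_POINTS * MAX_LAB_SHARE


# Example: homework graded 8 but submitted 2 weeks late,
# 20 quiz points, 12 project points, no additional points.
hw = homework_points(8, weeks_late=2)
print(round(lab_share(hw, quizzes=20, project=12), 1))
```

The cap reflects the statement that points above 50 still count for only 40% of the final score.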

Continuous work throughout the whole term is required to obtain the credits. Therefore, there will be no additional possibilities to acquire them later.

B) The lecture

The lecture will be given once a week according to the schedule. As mentioned above, points acquired within the labs will account for up to 40% of the final exam score. Furthermore, during the first lecture, a date will be scheduled for an online test, which will be administered during a lab session. The date for the online test will be published in the accompanying Moodle course.

This test will contribute 0-15% toward the final score. The exam at the end of this term will make up the remaining 45% of the final score. The following table gives the final grade according to the achieved score:

Grade      Achieved score
grade 1    100%–86%
grade 2    85%–71%
grade 3    70%–56%
failure    less than 56%
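The mapping from the three score components to a final grade can be sketched in Python as shown below. The function name `final_grade` is hypothetical, and the treatment of fractional scores between the listed bands (for example 85.5%) is an assumption: the lower bound of each band is used as the threshold.

```python
def final_grade(lab_pct: float, test_pct: float, exam_pct: float) -> str:
    """Final grade from the lab share (0-40%), the online test (0-15%),
    and the exam (0-45%), following the grade table above.

    Assumption: a score belongs to a band if it reaches that band's
    lower bound (86, 71, or 56).
    """
    if not (0 <= lab_pct <= 40 and 0 <= test_pct <= 15 and 0 <= exam_pct <= 45):
        raise ValueError("component outside its allowed range")
    score = lab_pct + test_pct + exam_pct
    if score >= 86:
        return "grade 1"
    if score >= 71:
        return "grade 2"
    if score >= 56:
        return "grade 3"
    return "failure"


# Example: 30.4% from labs, 10% from the test, 32% from the exam.
print(final_grade(30.4, 10, 32))
```

The sketch covers only the score table; it does not model the rule that failing the written part of the exam fails the entire exam.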

Last update: Mráz František, RNDr., CSc. (14.01.2026)
Literature -

  1. Aggarwal C. C.: Data Mining: The Textbook, Springer, 2015
  2. Berka P.: Dobývání znalostí z databází, Academia, 2003
  3. Liu B.: Web Data Mining, Springer, 2007
  4. Murphy K. P.: Machine Learning: A Probabilistic Perspective, The MIT Press, 2012

Last update: Mrázová Iveta, doc. RNDr., CSc. (27.05.2020)
Requirements to the exam -

The exam consists of a written part and an oral part. The written part precedes the oral part. Failing the written part results in failing the entire exam; i.e., the exam will be classified as grade 4 (failed), and the oral part will not be conducted. If a student fails the oral part, the next (retake) attempt will again consist of both the written and oral parts. The final grade of the exam is determined by the points awarded for the written and oral parts of the exam, as well as the points obtained for the student's work throughout the semester (see Course completion requirements).

The written part of the exam consists of two questions related to the lecture syllabus and/or material covered during the lab classes.

The requirements for the exam correspond to the syllabus of the lecture, to the extent presented in class. To take the exam, it is necessary to have obtained the credits for the labs.

Last update: Mráz František, RNDr., CSc. (14.01.2026)
Syllabus -

  1. Introduction to the area of data mining

    • Motivation for data mining and its importance for practice, an overview of frequent data mining tasks, main data mining methodologies.
    • Main principles of machine learning – supervised training, self-organization, semi-supervised learning, training set, test set and validation set, generalization and overfitting, Occam's razor.

  2. Fundamental paradigms of the data mining process

    • Data gathering, preparation and preprocessing – sampling, variability and confidence, discretization of numeric attributes and handling nonnumerical variables, replacement of missing and empty values, series variables.
    • Transformation, reduction and cleaning of the data – relationships among the attributes (similarity measures, hypothesis testing, correlation, regression, discriminant and cluster analysis), dimensionality reduction.
    • Validation of the obtained results – cross-validation, overall accuracy, confusion matrix, learning curve, lift curve, ROC curve, combination of models (bagging, boosting).

  3. Techniques for association rule mining

    • Market basket analysis – frequent itemsets, association rules, their formulation and main characteristics.
    • Generation of frequent item combinations – the Apriori algorithm, frequent-pattern-growth techniques (FP-Growth and TD-FP-Growth), combinational data analysis.
    • Constraint-based search for interesting rules (specification of time, items, etc.).

  4. Methods for cluster analysis

    • The k-means algorithm, the choice of a suitable metric, evaluation of the obtained results (cluster validity), representation and visualization of the found clusters.
    • Clustering based on the fuzzy set approach (FCM-clustering), neural approach and hierarchical clustering.
    • Advanced concepts – scalable techniques (CLARANS, BIRCH, CURE), outlier analysis.

  5. Approaches to data classification and prediction

    • Decision trees and their induction – algorithms ID3, C4.5, CART and CHAID.
    • Probabilistic classifiers – Bayesian models and techniques for their training and inference.
    • Nature-inspired models – artificial neural networks of the perceptron type, support vector machines (SVM), ELM networks, genetic algorithms.

Last update: Mrázová Iveta, doc. RNDr., CSc. (27.05.2020)
 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html