Course, academic year 2018/2019
Application of Big Data Technologies in Data Science - NDBI047
Title in Czech: Aplikace Big Data technologií v Data Science
Guaranteed by: Department of Software Engineering (32-KSI)
Faculty: Faculty of Mathematics and Physics
Actual: from 2018
Semester: summer
E-Credits: 4
Hours per week, examination: summer s.:1/2 C+Ex [hours/week]
Capacity: unlimited
Min. number of students: unlimited
State of the course: taught
Language: Czech
Teaching methods: full-time
Guarantor: doc. RNDr. Irena Holubová, Ph.D.
Class: Informatics Mgr. - elective
Classification: Informatics > Database Systems
Annotation
Last update: RNDr. Michal Kopecký, Ph.D. (12.05.2018)
A practically oriented course that follows the introductory lecture on Big Data technologies (NDBI040). The aim is to teach students how to use Big Data technologies from the Hadoop and Spark families to analyze and process Big Data. The course is taught by professionals from the company Profinit and draws on their experience with real-world Data Science projects in banking, telecommunications, and IoT.
Course completion requirements
Last update: RNDr. Michal Kopecký, Ph.D. (12.05.2018)

During the semester, students get access to the Metacentrum Hadoop cluster and learn how to create large computational Map/Reduce tasks. The final grade is based on a combination of a test and a homework assignment built around a non-trivial analysis of a larger data set.

Literature
Last update: RNDr. Michal Kopecký, Ph.D. (12.05.2018)
  • Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, Tom White, 4th edition, O'Reilly 2015
  • Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst, Dean Abbott, Wiley 2014
  • Big Data a NoSQL databáze, Irena Holubová, Jiří Kosek, Karel Minařík, David Novák, Grada 2015

Syllabus
Last update: RNDr. Michal Kopecký, Ph.D. (12.05.2018)
  • Lecture 1: Introduction to Hadoop, benefits of Big Data technologies in Data Science tasks

Practicals 1 + 2: First steps on a cluster, basic tools

  • Lecture 2: Storage, distributed HDFS data storage, Hive technology

Practicals 3 + 4: HDFS, Hive, HQL

  • Lecture 3: Apache Spark, in-memory Map/Reduce programs

Practicals 5 + 6: Spark RDD and Spark DataFrame paradigms

  • Lecture 4: Stream data processing, algorithms and technologies

Practicals 7 + 8: Spark Streaming, Kafka

  • Lecture 5: Data Science, modeling of features in the context of Big Data

Practicals 9 + 10: Feature modeling, Spark ML, GraphX

  • Lecture 6: Methodology of preparation of the credit test

Practicals 11 + 12: Work with PCs, credit test

Charles University | Information system of Charles University |