Course, academic year 2023/2024
Information Retrieval Systems - NDBI010
Title: Dokumentografické informační systémy
Guaranteed by: Department of Software Engineering (32-KSI)
Faculty: Faculty of Mathematics and Physics
Valid: from 2020
Semester: summer
E-Credits: 3
Hours per week, examination: summer s.:2/0, Ex [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: cancelled
Language: Czech
Teaching methods: full-time
Additional information: http://www.ms.mff.cuni.cz/~kopecky/vyuka/dis/
Guarantor: RNDr. Michal Kopecký, Ph.D.
Class: Informatika Mgr. - Softwarové systémy
Classification: Informatics > Database Systems
Incompatibility : NDBI043
Interchangeability : NDBI043
Is incompatible with: NDBI043
Is pre-requisite for: NSWI107
Is interchangeable with: NDBI043
Annotation -
Last update: Ing. Ladislav Kopecký (16.04.2005)
String matching algorithms. Searching and retrieving data from text databases. Architecture of text retrieval systems. Text compression. Correction of natural-language text.
Literature - Czech
Last update: Ing. Ladislav Kopecký (10.04.2005)

Pokorný J., Snášel V., Húsek D.: Dokumentografické informační systémy. Skripta UK, 1999

Melichar B.: Textové informační systémy. Skripta ČVUT, 1994

Syllabus -
Last update: Ing. Ladislav Kopecký (03.05.2005)

Introduction

  • History and evolution of text retrieval systems
  • Differences between factographical and text retrieval systems

Pattern matching algorithms

  • Brute-force algorithm
  • Left-to-right algorithms
  • Knuth-Morris-Pratt algorithm
  • Aho-Corasick algorithm
  • Regular expressions and finite state automata
  • Right-to-left algorithms
  • Boyer-Moore algorithm
  • Commentz-Walter algorithm
  • Butzilowsky's two-way finite state jump automata
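
As an illustration of the left-to-right family listed above, the following is a minimal Python sketch of the Knuth-Morris-Pratt algorithm. It is not taken from the course materials; the function names build_failure and kmp_search are chosen here for illustration only.

```python
def build_failure(pattern: str) -> list[int]:
    """fail[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return start positions of all (possibly overlapping) occurrences."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # shift the pattern without re-reading text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits


if __name__ == "__main__":
    print(kmp_search("abababa", "aba"))
```

The text is scanned once from left to right and never re-read; on a mismatch only the pattern position falls back, using the precomputed failure table.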

Architecture of text retrieval systems

  • Boolean text retrieval systems
  • Vector-based text retrieval systems
  • Signature-based text retrieval methods
  • Inductive methods, spreading algorithms
  • Systems based on fuzzy logic

Document indexing

  • Automatic document indexing
  • Selection of appropriate terms
  • Term importance assignment
  • Implementation of text retrieval systems
  • Clustering algorithms for vector-based systems
  • Concepts in vector-based systems
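
Term importance assignment in a vector-based system can be sketched as follows. The syllabus does not say which weighting scheme the course uses; this example assumes the common tf-idf weighting with cosine similarity, and the names tf_idf and cosine are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by (relative term frequency) * log(N / document frequency).
    This particular formula is an assumption, one of several common variants."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    docs = [["text", "retrieval"], ["text", "compression"]]
    vectors = tf_idf(docs)
    print(vectors[0])
    print(cosine(vectors[0], vectors[1]))
```

A term that occurs in every document receives weight zero, which reflects the idea that such a term does not help to tell documents apart.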

Algorithms for approximate pattern matching

  • Hamming and Levenshtein metrics
  • Construction of finite state automata for approximate pattern matching
  • Correction of natural-language text
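
The two metrics named above can be sketched in a few lines of Python. This is an illustrative example using the standard dynamic-programming formulation of the Levenshtein distance, not the automaton-based construction covered by the course; the function names are chosen here.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn a into b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
```

The Hamming metric allows substitutions only, whereas the Levenshtein metric also allows insertions and deletions, so it is defined for strings of different lengths.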

Textual data compression

  • Entropy and redundancy
  • Compression of integer numbers
  • Static versus adaptive algorithms
  • Huffman encoding, word-based compression
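
Entropy and static Huffman encoding can be illustrated with a short Python sketch. It works on single characters for brevity, whereas the syllabus also mentions word-based compression; the function names and the tie-breaking rule are assumptions of this example.

```python
import heapq
import math
from collections import Counter


def entropy(text: str) -> float:
    """Shannon entropy in bits per symbol, from observed symbol frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def huffman_codes(text: str) -> dict[str, str]:
    """Build a static Huffman code table for the symbols of text."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}  # a lone symbol still needs one bit
    # Heap items: (weight, unique tie-breaker, partial code table of the subtree).
    heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


if __name__ == "__main__":
    sample = "aaaabbc"
    codes = huffman_codes(sample)
    average = sum(len(codes[ch]) for ch in sample) / len(sample)
    print(codes)
    print(entropy(sample), average)
```

The average code length produced by a Huffman code is never below the entropy of the symbol distribution; the difference between the two is the redundancy that the code leaves in place.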
 