Course, academic year 2023/2024
Corpus Linguistics - Introduction - NPFL065
Title: Korpusová lingvistika - úvod
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Actual: from 2012
Semester: winter
E-Credits: 3
Hours per week, examination: winter s.:0/2, C [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: cancelled
Language: Czech
Teaching methods: full-time
Guarantor: prof. PhDr. František Čermák, DrSc.
Class: Informatics, Master's - elective
Classification: Informatics > Computer and Formal Linguistics
Is co-requisite for: NPFL066
Annotation
An introduction to the modern branch of computational linguistics concerned with corpora of natural languages. The course covers the following theoretical topics: the concept of a corpus; the language corpus as a source of knowledge about language; modern computer technologies; corpus typology from various perspectives; representativeness of a text corpus (statistical methods of corpus processing, the perspective of text reception and production); administrative markup of texts included in the corpus; structural markup and linguistic annotation of texts (tagging, lemmatization).
Last update: T_UFAL (19.05.2004)
Literature - Czech

Aijmer K., Altenberg B. (eds.) (1991): English Corpus Linguistics. Studies in Honour of Jan Svartvik. Longman, London.

Allan K. (1986): Linguistic Meaning 1-2. Routledge, London.

Atkins S., Clear J., Ostler N. (1992): Corpus Design Criteria. Literary and Linguistic Computing, Vol. 7, No. 1, s. 1-16.

Atkins B.T.S., Zampolli A. (eds.) (1994): Computational Approaches to the Lexicon. Oxford (= 5. Pisa International Summer School on Computational Lexicology and Lexicography).

Barnbrook G. (1996): Language and Computers. Edinburgh University Press, Edinburgh.

*Biber D. (1993): Representativeness in Corpus Design. Literary and Linguistic Computing Vol. 8, No. 4, s. 243-258.

Biber D., Conrad S., Reppen R. (1998): Corpus Linguistics. Investigating Language Structure and Use. Cambridge University Press, Cambridge.

Boguraev B., Briscoe T. (1989): Computational Lexicography for Natural Language Processing. Longman, London - New York.

Burnard L. (1993): A Gentle Introduction to SGML. TEI P2.

Burnard L. (1993): A Gentle Introduction to XML. http://www.tei-c.org/Guidelines2/gentleintro.html

Church K.W., Hanks P. (1990): Word Association Norms, Mutual Information and Lexicography. Computational Linguistics 16/1, s. 22-29.

Čermák F. (1995): Komputační lexikografie. In: F. Čermák, R. Blatná (eds.): Manuál lexikografie. H+H, Praha.

Čermák F. (1995): Jazykový korpus: Prostředek a zdroj poznání. Slovo a slovesnost 56: s. 119-140 (+ bibliografie tamtéž).

Čermák F., Blatná R. (eds.) (1995): Manuál lexikografie. H+H, Praha.

Čermák F., Klímová J., Petkevič V. (eds.) (2000): Studie z korpusové lingvistiky. Nakladatelství Karolinum, Univerzita Karlova, Praha. (Translated anthology of selected studies)

Fillmore C.J., Atkins B.T.S. (1994): Starting where the dictionaries stop: the challenge of corpus lexicography. In: Atkins B.T.S., Zampolli A. (eds.), Computational Approaches to the Lexicon.

Garside R., Leech G., McEnery A. (1997): Corpus Annotation. Linguistic Information from Computer Text Corpora. Longman, London - New York.

Hajič J., Hladká B. (1997): Morfologické značkování korpusu českých textů stochastickou metodou. Slovo a slovesnost 4/1997, s. 288-304.

Hajič J., Hajičová E., Panevová J., Sgall P. (1998): Syntax v Českém národním korpusu. Slovo a slovesnost 3/1998, s. 168-177.

Hajič J., Hladká B. (1998): Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. Proceedings from COLING-ACL'98. Montreal, s. 483-490.

Halliday M.A.K. (1991): Language as system and language as instance: The corpus as a theoretical construct. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, s. 17-32.

Halliday M.A.K. (1991): Corpus studies and probabilistic grammar. In: Aijmer et al., s. 30-43.

Kennedy G. (1998): An Introduction to Corpus Linguistics. Longman, London.

Leech G. (1991): The State of the Art in Corpus Linguistics. In: Aijmer et al., s. 8-29.

Leech G. (1993): Corpus Annotation Schemes. Literary and Linguistic Computing 8:4, s. 275-281.

McEnery A., Wilson A. (1996): Corpus Linguistics. Edinburgh University Press, Edinburgh.

Nelson W. Francis (1991): Language Corpora B.C. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, s. 17-32.

Oakes M.P. (1998): Statistics for Corpus Linguistics. Edinburgh University Press, Edinburgh.

Oliva K., Hnátková M., Květoň P., Petkevič V. (2000): The Linguistic Basis of a Rule-Based Tagger of Czech. In: Sojka P., I. Kopeček, K. Pala (eds.): Proceedings of the Text, Speech and Dialogue conference TSD 2000 held in Brno 2000. LNAI 1902, Springer-Verlag Berlin Heidelberg, s. 3-8.

Šulc M. (1999): Korpusová lingvistika. První vstup. Nakladatelství Karolinum, Univerzita Karlova, Praha.

Publications set in bold exist in Czech translation and are included in the collection listed above: Čermák F., Klímová J., Petkevič V. (2000): Studie z korpusové lingvistiky.

Last update: T_UFAL (19.05.2004)
Syllabus

An introduction to the modern branch of computational linguistics which concerns itself with corpora of natural languages. The seminar in this term focuses on theoretical issues, primarily on the following topics:

  • the concept of a corpus
  • language corpus as a source of knowledge of language (a large base of language data that can be retrieved and evaluated quickly and efficiently, in contrast to outdated card files of paper excerpts)
  • modern computer technologies for building and exploiting a corpus
  • corpus typology from various perspectives (synchronic vs. diachronic; multilingual parallel vs. multilingual comparable vs. monolingual; written vs. spoken language; dialect corpora, etc.; Czech language corpora [e.g. the Czech National Corpus project, Prague Dependency Treebank] and corpora of other languages)
  • representativeness of a text corpus (reception and production of texts and their genre representativeness in a corpus; a corpus balanced with respect to linguistic phenomena, etc.; statistical methods of corpus processing)
  • administrative markup of texts included in the corpus (database of texts with their identification, origin, and type)
  • structural markup and linguistic annotation of texts (paragraph and sentence segmentation, tokenization; morphological analysis including lemmatization and subsequent disambiguation; construction of syntactic/semantic structures, e.g. trees)
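
To give a rough idea of what the structural markup step in the last topic involves, the following is a minimal sketch in Python of paragraph segmentation, sentence segmentation, and tokenization. It is an illustration only, not the method used in the course or in any particular corpus project: the function names, the regular-expression rules, and the sample text are all assumptions made for this example, and real corpus tools handle abbreviations, numbers, and other cases that these naive rules get wrong.

```python
import re

def split_paragraphs(text):
    """Split raw text into paragraphs at blank lines (assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]

def tokenize(sentence):
    """Tokens are runs of word characters or single punctuation marks.
    In Python 3, \\w also matches accented letters such as 'ů' or 'ý'."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def segment(text):
    """Return a nested list: paragraphs -> sentences -> tokens."""
    return [[tokenize(s) for s in split_sentences(p)]
            for p in split_paragraphs(text)]

# Hypothetical sample text: two paragraphs, three sentences in total.
sample = "Korpus je soubor textů. Je velký!\n\nDruhý odstavec."
structure = segment(sample)

for p_index, paragraph in enumerate(structure, start=1):
    for s_index, tokens in enumerate(paragraph, start=1):
        print(p_index, s_index, tokens)
```

The nested list keeps the paragraph and sentence boundaries explicit, which is the information that structural markup records before any linguistic annotation (morphological analysis, lemmatization, disambiguation) is added on top of the tokens.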

Last update: T_UFAL (19.05.2004)
 