Course, academic year 2020/2021
  
Corpus Linguistics - Introduction - NPFL065
Title: Korpusová lingvistika - úvod
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2012
Semester: winter
E-Credits: 3
Hours per week, examination: winter s.:0/2 C [hours/week]
Capacity: unlimited
Min. number of students: unlimited
State of the course: cancelled
Language: Czech
Teaching methods: full-time
Guarantor: prof. PhDr. František Čermák, DrSc.
Class: Informatics, Master's - elective
Classification: Informatics > Computer and Formal Linguistics
Annotation -
Last update: T_UFAL (19.05.2004)
An introduction to the modern branch of computational linguistics which concerns itself with corpora of natural languages. In theory, the following topics are studied: the concept of a corpus; language corpus as a source of knowledge of language; modern computer technologies; corpus typology from various perspectives; representativeness of a text corpus (statistical methods of corpus processing, the text reception and production perspective); administrative markup of texts included in the corpus; structural markup and linguistic annotation of texts (tagging, lemmatization).
Literature - Czech
Last update: T_UFAL (19.05.2004)

Aijmer K., Altenberg B. (eds.) (1991): English Corpus Linguistics. Studies in Honour of Jan Svartvik. Longman, London.

Allan K. (1986): Linguistic Meaning 1-2. Routledge, London.

Atkins S., Clear J., Ostler N. (1992): Corpus Design Criteria. Literary and Linguistic Computing, Vol. 7, No. 1, pp. 1-16.

Atkins B.T.S., Zampolli A. (eds.) (1994): Computational Approaches to the Lexicon. Oxford (= 5. Pisa International Summer School on Computational Lexicology and Lexicography).

Barnbrook G. (1996): Language and Computers. Edinburgh University Press, Edinburgh.

*Biber D. (1993): Representativeness in Corpus Design. Literary and Linguistic Computing, Vol. 8, No. 4, pp. 243-258.

Biber D., Conrad S., Reppen R. (1998): Corpus Linguistics. Investigating Language Structure and Use. Cambridge University Press, Cambridge.

Boguraev B., Briscoe T. (1989): Computational Lexicography for Natural Language Processing. Longman, London - New York.

Burnard L. (1993): A Gentle Introduction to SGML. TEI P2.

Burnard L. (1993): A Gentle Introduction to XML. http://www.tei-c.org/Guidelines2/gentleintro.html

Church K.W., Hanks P. (1990): Word Association Norms, Mutual Information and Lexicography. Computational Linguistics 16/1, pp. 22-29.

Čermák F. (1995): Komputační lexikografie. In: Čermák F., Blatná R. (eds.): Manuál lexikografie. H+H, Praha.

Čermák F. (1995): Jazykový korpus: Prostředek a zdroj poznání. Slovo a slovesnost 56, pp. 119-140 (with a bibliography in the same article).

Čermák F., Blatná R. (eds.) (1995): Manuál lexikografie. H+H, Praha.

Čermák F., Klímová J., Petkevič V. (eds.) (2000): Studie z korpusové lingvistiky. Nakladatelství Karolinum, Univerzita Karlova, Praha. (A collection of selected studies in Czech translation.)

Fillmore C.J., Atkins B.T.S. (1994): Starting where the dictionaries stop: the challenge of corpus lexicography. In: Atkins B.T.S., Zampolli A. (eds.), Computational Approaches to the Lexicon.

Garside R., Leech G., McEnery A. (1997): Corpus Annotation. Linguistic Information from Computer Text Corpora. Longman, London - New York.

Hajič J., Hladká B. (1997): Morfologické značkování korpusu českých textů stochastickou metodou. Slovo a slovesnost 4/1997, pp. 288-304.

Hajič J., Hajičová E., Panevová J., Sgall P. (1998): Syntax v Českém národním korpusu. Slovo a slovesnost 3/1998, pp. 168-177.

Hajič J., Hladká B. (1998): Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. Proceedings of COLING-ACL'98. Montreal, pp. 483-490.

Halliday M.A.K. (1991): Language as system and language as instance: The corpus as a theoretical construct. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, pp. 17-32.

Halliday M.A.K. (1991): Corpus studies and probabilistic grammar. In: Aijmer et al., pp. 30-43.

Kennedy G. (1998): An Introduction to Corpus Linguistics. Longman, London.

Leech G. (1991): The State of the Art in Corpus Linguistics. In: Aijmer et al., pp. 8-29.

Leech G. (1993): Corpus Annotation Schemes. Literary and Linguistic Computing 8:4, pp. 275-281.

McEnery A., Wilson A. (1996): Corpus Linguistics. Edinburgh University Press, Edinburgh.

Nelson W. Francis (1991): Language Corpora B.C. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, pp. 17-32.

Oakes M.P. (1998): Statistics for Corpus Linguistics. Edinburgh University Press, Edinburgh.

Oliva K., Hnátková M., Květoň P., Petkevič V. (2000): The Linguistic Basis of a Rule-Based Tagger of Czech. In: Sojka P., Kopeček I., Pala K. (eds.): Proceedings of the Text, Speech and Dialogue conference TSD 2000 held in Brno 2000. LNAI 1902, Springer-Verlag Berlin Heidelberg, pp. 3-8.

Šulc M. (1999): Korpusová lingvistika. První vstup. Nakladatelství Karolinum, Univerzita Karlova, Praha.

Publications set in bold exist in Czech translation and are included in the collection listed above: Čermák F., Klímová J., Petkevič V. (2000): Studie z korpusové lingvistiky.

Syllabus -
Last update: T_UFAL (19.05.2004)

An introduction to the modern branch of computational linguistics which concerns itself with corpora of natural languages. The seminar in this term focuses on theoretical issues, primarily on the following topics:

  • the concept of a corpus
  • language corpus as a source of knowledge of language (a large base of language data that can be retrieved and evaluated quickly and efficiently, e.g. compared with outdated card files of paper excerpts)
  • modern computer technologies for building and exploiting corpora
  • corpus typology from various perspectives (synchronic vs. diachronic; multilingual parallel vs. multilingual comparable vs. monolingual; written vs. spoken language; dialect corpora, etc.; Czech language corpora [e.g. the Czech National Corpus project, the Prague Dependency Treebank] and corpora of other languages)
  • representativeness of a text corpus (reception and production of texts and their genre representativeness in a corpus, balanced corpus from the viewpoint of linguistic phenomena etc., statistical methods of corpus processing)
  • administrative markup of texts included in the corpus (database of texts and their identification, origin, type)
  • structural markup and linguistic annotation of texts (paragraph and sentence segmentation, tokenization; morphological analysis including lemmatization, subsequent disambiguation, construction of syntactic/semantic structures, e.g. trees)
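
As a rough illustration of the first structural markup steps named in the last topic (paragraph and sentence segmentation, tokenization), a minimal Python sketch might look as follows. The function names, the regular-expression rules and the sample text are assumptions made for illustration only; they are not taken from the course materials, and real corpus projects use far more careful tools (for example, the naive sentence rule below would wrongly split after abbreviations).

```python
import re
from collections import Counter


def split_paragraphs(text):
    """Split raw text into paragraphs at blank lines (illustrative rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive sentence segmentation: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def segment(text):
    """Return the text as paragraphs -> sentences -> tokens (nested lists)."""
    return [
        [tokenize(sentence) for sentence in split_sentences(paragraph)]
        for paragraph in split_paragraphs(text)
    ]


def frequency_list(segmented):
    """Count lower-cased word tokens, ignoring punctuation tokens."""
    return Counter(
        token.lower()
        for paragraph in segmented
        for sentence in paragraph
        for token in sentence
        if token.isalnum()
    )


# Hypothetical sample text, used only to exercise the functions above.
SAMPLE = (
    "A corpus is a collection of texts. It can be searched quickly!\n\n"
    "The second paragraph has one sentence."
)

if __name__ == "__main__":
    corpus = segment(SAMPLE)
    for number, paragraph in enumerate(corpus, start=1):
        for sentence in paragraph:
            print(number, sentence)
    print(frequency_list(corpus).most_common(3))
```

The nested list keeps the paragraph and sentence boundaries explicit, which is the information that structural markup records; later annotation steps such as lemmatization and disambiguation would then attach further information to the individual tokens.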

 