
Last update: T_UFAL (19.05.2004)

An introduction to a modern branch of mathematical/computational linguistics that deals with computer corpora of natural languages. On the theoretical level, the course covers the following topics: the concept of a corpus; the language corpus as a source of knowledge about language; modern computer technologies; corpus typology from various perspectives; representativeness (balance) of a language corpus (statistical methods of corpus processing, the perspective of text reception and production); administrative markup of texts included in the corpus; structural markup and linguistic annotation of texts (tagging, lemmatization).

Aijmer K., Altenberg B. (eds.) (1991): English Corpus Linguistics. Studies in Honour of Jan Svartvik. Longman, London.

Allan K. (1986): Linguistic Meaning 1-2. Routledge, London.

Atkins S., Clear J., Ostler N. (1992): Corpus Design Criteria. Literary and Linguistic Computing, Vol. 7, No. 1, pp. 1-16.

Atkins B.T.S., Zampolli A. (eds.) (1994): Computational Approaches to the Lexicon. Oxford (= 5. Pisa International Summer School on Computational Lexicology and Lexicography).

Barnbrook G. (1996): Language and Computers. Edinburgh University Press, Edinburgh.

*Biber D. (1993): Representativeness in Corpus Design. Literary and Linguistic Computing, Vol. 8, No. 4, pp. 243-258.

Biber D., Conrad S., Reppen R. (1998): Corpus Linguistics. Investigating Language Structure and Use. Cambridge University Press, Cambridge.

Boguraev B., Briscoe T. (1989): Computational Lexicography for Natural Language Processing. Longman, London - New York.

Burnard L. (1993): A Gentle Introduction to SGML. TEI P2.

Burnard L. (1993): A Gentle Introduction to XML. http://www.tei-c.org/Guidelines2/gentleintro.html

Church K.W., Hanks P. (1990): Word Association Norms, Mutual Information and Lexicography. Computational Linguistics 16/1, pp. 22-29.

Čermák F. (1995): Komputační lexikografie. In: Čermák F., Blatná R. (eds.): Manuál lexikografie. H+H, Praha.

Čermák F. (1995): Jazykový korpus: Prostředek a zdroj poznání. Slovo a slovesnost 56, pp. 119-140 (+ bibliography ibid.).

Čermák F., Blatná R. (eds.) (1995): Manuál lexikografie. H+H, Praha.

Čermák F., Klímová J., Petkevič V. (eds.) (2000): Studie z korpusové lingvistiky. Nakladatelství Karolinum, Univerzita Karlova, Praha. (A collection of selected studies in Czech translation.)

Fillmore C.J., Atkins B.T.S. (1994): Starting where the dictionaries stop: the challenge of corpus lexicography. In: Atkins B.T.S., Zampolli A. (eds.), Computational Approaches to the Lexicon.

Garside R., Leech G., McEnery A. (1997): Corpus Annotation. Linguistic Information from Computer Text Corpora. Longman, London - New York.

Hajič J., Hladká B. (1997): Morfologické značkování korpusu českých textů stochastickou metodou. Slovo a slovesnost 4/1997, pp. 288-304.

Hajič J., Hajičová E., Panevová J., Sgall P. (1998): Syntax v Českém národním korpusu. Slovo a slovesnost 3/1998, pp. 168-177.

Hajič J., Hladká B. (1998): Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. Proceedings of COLING-ACL'98. Montreal, pp. 483-490.

Halliday M.A.K. (1991): Language as system and language as instance: The corpus as a theoretical construct. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, pp. 17-32.

Halliday M.A.K. (1991): Corpus studies and probabilistic grammar. In: Aijmer et al., pp. 30-43.

Kennedy G. (1998): An Introduction to Corpus Linguistics. Longman, London.

Leech G. (1991): The State of the Art in Corpus Linguistics. In: Aijmer et al., pp. 8-29.

Leech G. (1993): Corpus Annotation Schemes. Literary and Linguistic Computing 8:4, pp. 275-281.

McEnery A., Wilson A. (1996): Corpus Linguistics. Edinburgh University Press, Edinburgh.

Francis W.N. (1991): Language Corpora B.C. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, pp. 17-32.

Oakes M.P. (1998): Statistics for Corpus Linguistics. Edinburgh University Press, Edinburgh.

Oliva K., Hnátková M., Květoň P., Petkevič V. (2000): The Linguistic Basis of a Rule-Based Tagger of Czech. In: Sojka P., Kopeček I., Pala K. (eds.): Proceedings of the Text, Speech and Dialogue conference TSD 2000 held in Brno 2000. LNAI 1902, Springer-Verlag, Berlin - Heidelberg, pp. 3-8.

Šulc M. (1999): Korpusová lingvistika. První vstup. Nakladatelství Karolinum, Univerzita Karlova, Praha.

Publications set in bold exist in Czech translation and are included in the collection listed above: Čermák F., Klímová J., Petkevič V. (2000): Studie z korpusové lingvistiky.

An introduction to a modern branch of mathematical/computational linguistics that deals with computer corpora of natural languages. This semester the seminar focuses on theoretical issues, primarily on the following topics:

the concept of a corpus

the language corpus as a source of knowledge about natural language (a huge base of language data that can be retrieved and evaluated quickly and efficiently, e.g. compared with the earlier card files of paper excerpts)

modern computer technologies for building a corpus and exploiting it

corpus typology from various perspectives (synchronic vs. diachronic; multilingual parallel vs. multilingual comparable vs. monolingual; written vs. spoken language; dialect corpora, etc.; corpora of Czech [e.g. the Czech National Corpus project, the Prague Dependency Treebank] and of other languages)

representativeness (balance) of a language corpus (the perspective of text reception and production and the representation of genres in the corpus, balance with respect to linguistic phenomena, etc.; statistical methods of corpus processing)

administrative markup of texts included in the corpus (a database of texts and their identification, origin and type)

structural markup and linguistic annotation of texts (segmentation into paragraphs, sentences and tokens, i.e. corpus positions; morphological analysis including lemmatization, subsequent disambiguation, and the building of syntactic/semantic structures, e.g. trees)
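
To make the structural markup topic concrete, the following is a minimal Python sketch of segmenting raw text into paragraphs, sentences and tokens (corpus positions). It is an illustration only, not the method used by any of the corpora named above: the function name `segment`, the regular expressions and the sample text are all assumptions chosen for brevity. The naive sentence rule will mis-split abbreviations and ordinal numbers.

```python
import re


def segment(text):
    """Split raw text into paragraphs -> sentences -> tokens (hypothetical helper)."""
    corpus = []
    # Paragraphs: blocks separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        sentences = []
        # Sentences: naive split at whitespace that follows '.', '!' or '?'.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            # Tokens: runs of word characters, or single punctuation marks.
            tokens = re.findall(r"\w+|[^\w\s]", sent)
            if tokens:
                sentences.append(tokens)
        if sentences:
            corpus.append(sentences)
    return corpus


# Made-up sample: two paragraphs, the first with two sentences.
sample = "Korpus je zdroj dat. Je velký!\n\nDruhý odstavec."
result = segment(sample)
for p_no, paragraph in enumerate(result, start=1):
    for s_no, sentence in enumerate(paragraph, start=1):
        print(p_no, s_no, sentence)
```

Each token in the nested result corresponds to one corpus position. Later annotation layers such as lemmas and morphological tags would be attached to these positions, and disambiguation would then choose among the candidate analyses for each one.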