Course, academic year 2023/2024
Corpus Linguistics - Applications - NPFL066
Title: Korpusová lingvistika - aplikace
Guaranteed by: Institute of Formal and Applied Linguistics (32-UFAL)
Faculty: Faculty of Mathematics and Physics
Valid: from 2012
Semester: summer
E-Credits: 3
Hours per week, examination: summer s.:0/2, C [HT]
Capacity: unlimited
Min. number of students: unlimited
4EU+: no
Virtual mobility / capacity: no
State of the course: cancelled
Language: Czech
Teaching methods: full-time
Guarantor: prof. PhDr. František Čermák, DrSc.
Class: Informatics (Mgr.) - elective
Classification: Informatics > Computer and Formal Linguistics
Co-requisite: NPFL065
Annotation -
Last update: T_UFAL (11.05.2005)
The seminar focuses on practical issues in corpus linguistics and follows up on the Corpus Linguistics - Introduction seminar. The following topics are discussed via essays and seminar papers: corpus design and build-up (methods of language material acquisition, conversion of language data to the unified SGML and XML format); annotation of texts included in the corpus; linguistic (morphological, syntactic, semantic) tagging of corpus texts and lemmatization; linguistic exploitation of corpus material; practical exploitation of the corpus and techniques for language data retrieval from the corpus.
Literature - Czech
Last update: T_UFAL (19.05.2004)

Aijmer K., Altenberg B. (eds.) (1991): English Corpus Linguistics. Studies in Honour of Jan Svartvik. Longman, London.

Allan K. (1986): Linguistic Meaning 1-2. Routledge, London.

Atkins S., Clear J., Ostler N. (1992): Corpus Design Criteria. Literary and Linguistic Computing, Vol. 7, No. 1, pp. 1-16.

Atkins B.T.S., Zampolli A. (eds.) (1994): Computational Approaches to the Lexicon. Oxford (= 5. Pisa International Summer School on Computational Lexicology and Lexicography).

Barnbrook G. (1996): Language and Computers. Edinburgh University Press, Edinburgh.

*Biber D. (1993): Representativeness in Corpus Design. Literary and Linguistic Computing Vol. 8, No. 4, pp. 243-258.

Biber D., Conrad S., Reppen R. (1998): Corpus Linguistics. Investigating Language Structure and Use. Cambridge University Press, Cambridge.

Boguraev B., Briscoe T. (1989): Computational Lexicography for Natural Language Processing. Longman, London - New York.

Burnard L. (1993): A Gentle Introduction to SGML. TEI P2.

Burnard L. (1993): A Gentle Introduction to XML. http://www.tei-c.org/Guidelines2/gentleintro.html

Church K.W., Hanks P. (1990): Word Association Norms, Mutual Information and Lexicography. Computational Linguistics 16/1, pp. 22-29.

Čermák F. (1995): Komputační lexikografie. In: F. Čermák, R. Blatná (eds.): Manuál lexikografie. H+H, Praha.

Čermák F. (1995): Jazykový korpus: Prostředek a zdroj poznání. Slovo a slovesnost 56, pp. 119-140 (+ bibliography therein).

Čermák F., Blatná R. (eds.) (1995): Manuál lexikografie. H+H, Praha.

Čermák F., Klímová J., Petkevič V. (eds.) (2000): Studie z korpusové lingvistiky. Nakladatelství Karolinum, Univerzita Karlova, Praha. (A collection of selected studies in Czech translation)

Fillmore C.J., Atkins B.T.S. (1994): Starting where the dictionaries stop: the challenge of corpus lexicography. In: Atkins B.T.S., Zampolli A. (eds.), Computational Approaches to the Lexicon.

Garside R., Leech G., McEnery A. (1997): Corpus Annotation. Linguistic Information from Computer Text Corpora. Longman, London - New York.

Hajič J., Hladká B. (1997): Morfologické značkování korpusu českých textů stochastickou metodou. Slovo a slovesnost 4/1997, pp. 288-304.

Hajič J., Hajičová E., Panevová J., Sgall P. (1998): Syntax v Českém národním korpusu. Slovo a slovesnost 3/1998, pp. 168-177.

Hajič J., Hladká B. (1998): Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. Proceedings of COLING-ACL'98. Montreal, pp. 483-490.

Halliday M.A.K. (1991): Language as system and language as instance: The corpus as a theoretical construct. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, pp. 17-32.

Halliday M.A.K. (1991): Corpus studies and probabilistic grammar. In: Aijmer et al., pp. 30-43.

Kennedy G. (1998): An Introduction to Corpus Linguistics. Longman, London.

Leech G. (1991): The State of the Art in Corpus Linguistics. In: Aijmer et al., pp. 8-29.

Leech G. (1993): Corpus Annotation Schemes. Literary and Linguistic Computing 8:4, pp. 275-281.

McEnery A., Wilson A. (1996): Corpus Linguistics. Edinburgh University Press, Edinburgh.

Nelson W. Francis (1991): Language Corpora B.C. In: Svartvik J. (ed.): Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82. Stockholm 4.-6. August 1991. Mouton, Berlin, pp. 17-32.

Oakes M.P. (1998): Statistics for Corpus Linguistics. Edinburgh University Press, Edinburgh.

Oliva K., Hnátková M., Květoň P., Petkevič V. (2000): The Linguistic Basis of a Rule-Based Tagger of Czech. In: Sojka P., Kopeček I., Pala K. (eds.): Proceedings of the Text, Speech and Dialogue conference TSD 2000 held in Brno 2000. LNAI 1902, Springer-Verlag Berlin Heidelberg, pp. 3-8.

Šulc M. (1999): Korpusová lingvistika. První vstup. Nakladatelství Karolinum, Univerzita Karlova, Praha.

Publications set in bold exist in Czech translation and are included in the collection listed above: Čermák F., Klímová J., Petkevič V. (2000): Studie z korpusové lingvistiky.

Syllabus -
Last update: T_UFAL (11.05.2005)

The seminar focuses on practical issues in corpus linguistics and follows up on the Corpus Linguistics - Introduction seminar. The following topics are discussed via essays and seminar papers:

  • corpus design and build-up: methods of language material acquisition (electronic form, scanning, manual transcription), copyright issues, conversion of language data to the unified SGML [= Standard Generalized Markup Language] and XML [= Extensible Markup Language] formats, and cleanup of input data (non-text and foreign-language parts are excluded)
  • annotation of texts included in the corpus
  • linguistic annotation of corpus texts (morphological, syntactic, semantic), mainly the issues of morphological analysis, lemmatization and disambiguation (various methods: statistical, rule-based)
  • linguistic exploitation of corpus material (practical use of statistical and frequency data: collocations and the use of various statistical measures such as the mutual information score (MI-score), the t-score, etc.)
  • practical exploitation of the corpus, techniques for language data retrieval from the corpus, formulation of lexically, morphologically and syntactically oriented queries
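
As an illustration of the collocation measures named in the syllabus, the following minimal Python sketch computes an MI-score and a t-score for a word pair from raw corpus frequencies. It assumes the formulations commonly used in corpus linguistics (MI as the base-2 logarithm of observed over expected co-occurrence frequency, t-score as observed minus expected frequency divided by the square root of the observed frequency). The function names and the frequency counts are hypothetical and are not taken from the course materials.

```python
import math


def expected_frequency(f_x: int, f_y: int, n: int) -> float:
    """Co-occurrence frequency expected if the two words were independent."""
    return f_x * f_y / n


def mi_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """MI-score: log2 of observed over expected co-occurrence frequency."""
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    return math.log2(f_xy / expected_frequency(f_x, f_y, n))


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """t-score: (observed - expected) / sqrt(observed)."""
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    return (f_xy - expected_frequency(f_x, f_y, n)) / math.sqrt(f_xy)


# Hypothetical counts: corpus size and frequencies invented for illustration.
N = 1_000_000      # tokens in the corpus
F_X = 1_000        # frequency of the first word
F_Y = 500          # frequency of the second word
F_XY = 50          # frequency of the pair co-occurring

print(f"MI-score: {mi_score(F_XY, F_X, F_Y, N):.2f}")
print(f"t-score:  {t_score(F_XY, F_X, F_Y, N):.2f}")
```

Under these formulations the MI-score tends to favour rare, strongly associated pairs, while the t-score tends to favour frequent pairs, which is one reason the two measures are often consulted together.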

 
Charles University | Information system of Charles University | http://www.cuni.cz/UKEN-329.html