Thesis details
Speech-Informed Inverse Text Normalization
Thesis title in Czech: Normalizace automatických transkriptů s ohledem na zdrojový zvukový signál
Thesis title in English: Speech-Informed Inverse Text Normalization
Keywords (Czech): normalizace automatických transkriptů|multimodalita|automatické rozpoznávání řeči|zpracování přirozeného jazyka|hluboké učení
Keywords (English): inverse text normalization|multimodality|automatic speech recognition|natural language processing|deep learning
Academic year of topic announcement: 2021/2022
Thesis type: diploma thesis
Thesis language: English
Department: Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor: doc. RNDr. Ondřej Bojar, Ph.D.
Author: hidden - assigned and confirmed by the Study Department
Date of registration: 19.01.2022
Date of assignment: 26.01.2022
Date of confirmation by the Study Department: 01.02.2022
Date and time of defence: 10.06.2024 09:00
Date of electronic submission: 02.05.2024
Date of printed submission: 02.05.2024
Date of completed defence: 10.06.2024
Opponents: Mgr. Ondřej Plátek

Guidelines
When the output of automatic speech recognition (ASR) systems is to be processed in downstream tasks, we often face a serious format mismatch: ASR emits individual recognized words, while the downstream components are typically trained on well-formed sentences including casing and punctuation. Further text notation conventions also apply, e.g. for numbers or abbreviations.

The goal of the thesis is to design, implement and evaluate a deep neural network model that processes speech recordings together with their text transcriptions, produced automatically by a first-stage ASR system, and normalizes the transcriptions to a particular format. At minimum, word capitalization and punctuation should be predicted. Additional aspects of text normalization, e.g. presenting numbers or other units in their common notation, may also be covered.
Optionally, the system may attempt to correct the text for errors in the automatic transcription or even errors in the speech itself, i.e. to carry out some tasks from the area of speech reconstruction.
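
One common way to frame the minimum task (capitalization and punctuation) is as per-word tagging: each lowercase ASR word receives a case label and a punctuation label. The sketch below only illustrates this target format; it is not the model required by the assignment. The label names (`LOWER`, `CAP`, `ALL_CAPS`), the punctuation set and both function names are assumptions made for the example.

```python
# Hypothetical label inventory; a real system would choose its own.
PUNCT = {".", ",", "?", "!"}

def to_labels(formatted):
    """Turn a formatted sentence into (word, case_label, punct_label) triples."""
    triples = []
    for token in formatted.split():
        punct = ""
        # Peel off a single trailing punctuation mark, if present.
        if token[-1] in PUNCT:
            token, punct = token[:-1], token[-1]
        if not token:
            continue  # a stand-alone punctuation mark carries no word
        if token.isupper() and len(token) > 1:
            case = "ALL_CAPS"
        elif token[0].isupper():
            case = "CAP"
        else:
            case = "LOWER"
        triples.append((token.lower(), case, punct))
    return triples

def apply_labels(triples):
    """Rebuild formatted text from lowercase words and their labels."""
    words = []
    for word, case, punct in triples:
        if case == "ALL_CAPS":
            word = word.upper()
        elif case == "CAP":
            word = word[0].upper() + word[1:]
        words.append(word + punct)
    return " ".join(words)

triples = to_labels("Hello, world. NASA is here?")
restored = apply_labels(triples)
```

In this framing, the lowercase words play the role of the first-stage ASR output and the labels are what a model would predict. The scheme is lossy for mixed-case words such as "iPhone", and it cannot express number formatting, which would need a generation-style output instead.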

An inherent part of the thesis is a careful evaluation of the proposed system in several small variations and against a baseline that processes only the ASR output, with no access to the speech signal. The evaluation can rely on automatic metrics (e.g. a custom definition of word error rate that considers punctuation) or on the performance in a downstream task (e.g. machine translation), but a small manual qualitative assessment is also required.
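
As an illustration of what a custom word error rate considering punctuation could look like, the sketch below treats every punctuation mark as its own token and compares tokens case-sensitively, so casing and punctuation mistakes count as errors. This is one possible definition, assumed for the example; the function names `tokenize`, `edit_distance` and `formatted_wer` are hypothetical.

```python
import re

def tokenize(text):
    # Words (letters, digits, apostrophes) and single punctuation marks
    # become separate tokens; letter case is preserved.
    return re.findall(r"[\w']+|[^\w\s]", text)

def edit_distance(ref, hyp):
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # delete a reference token
                           cur[j - 1] + 1,           # insert a hypothesis token
                           prev[j - 1] + (r != h)))  # match or substitute
        prev = cur
    return prev[-1]

def formatted_wer(reference, hypothesis):
    """Edit distance over case-sensitive word and punctuation tokens,
    divided by the number of reference tokens."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)

score = formatted_wer("Hello, world.", "hello world")
```

Here the reference has four tokens and the unformatted hypothesis needs one substitution (casing) and two deletions (punctuation), so `score` is 0.75. A plain lowercase, punctuation-free WER would report 0 for the same pair, which is why a format-aware variant is useful for this task.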

The system will be limited to independent processing of reasonably short spans of sound (e.g. 8 seconds). A very useful possible extension of the topic is to adapt the system to the streaming use case, i.e. the situation where the input arrives as an unbounded stream of sound and the system is expected to produce a continuous stream of text.
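
The restriction to short, independently processed spans can be pictured as cutting a recording into fixed-length pieces. The sketch below is a minimal illustration under assumptions: the function name `split_into_spans` is hypothetical, the audio is a plain sequence of samples, and cuts fall at fixed positions regardless of word boundaries.

```python
def split_into_spans(samples, sample_rate, span_seconds=8.0):
    """Cut a sample sequence into consecutive spans of at most span_seconds."""
    span_len = int(sample_rate * span_seconds)
    if span_len <= 0:
        raise ValueError("span length must be at least one sample")
    # The last span may be shorter than the others.
    return [samples[i:i + span_len] for i in range(0, len(samples), span_len)]

# Toy input: 20 "samples" at 1 sample per second, cut into 8-second spans.
spans = split_into_spans(list(range(20)), sample_rate=1)
span_lengths = [len(s) for s in spans]
```

Hard cuts like these can split a word or a sentence between two spans. A practical system would more likely cut at pauses or use overlapping spans, and the streaming extension would need to handle such boundaries continuously.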
References
Courtland, M., Faulkner, A., & McElvain, G. (2020). Efficient Automatic Punctuation Restoration Using Bidirectional Transformers with Robust Inference. IWSLT.

Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460.

Ihori, M., Makishima, N., Tanaka, T., Takashima, A., Orihashi, S., & Masumura, R. (2021). MAPGN: Masked Pointer-Generator Network for Sequence-to-Sequence Pre-Training. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7563–7567.

Öktem, A., Farrús, M., & Wanner, L. (2017). Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech. SLSP.

Straka, M., Náplava, J., Straková, J., & Samuel, D. (2021). RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model. TSD.

Sunkara, M., Ronanki, S., Bekal, D., Bodapati, S., & Kirchhoff, K. (2020). Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech. INTERSPEECH.

Fitzgerald, E., Jelinek, F., & Frank, R. (2009). What lies beneath: Semantic and syntactic analysis of manually reconstructed spontaneous speech. ACL/IJCNLP, 746–754.
 
Charles University | Charles University Information System