Speech-Informed Inverse Text Normalization
Thesis title in Czech: | Normalizace automatických transkriptů s ohledem na zdrojový zvukový signál |
---|---|
Thesis title in English: | Speech-Informed Inverse Text Normalization |
Keywords (Czech): | normalizace automatických transkriptů|multimodalita|automatické rozpoznávání řeči|zpracování přirozeného jazyka|hluboké učení |
Keywords (English): | inverse text normalization|multimodality|automatic speech recognition|natural language processing|deep learning |
Academic year of topic announcement: | 2021/2022 |
Thesis type: | diploma thesis |
Thesis language: | English |
Department: | Institute of Formal and Applied Linguistics (32-UFAL) |
Supervisor: | doc. RNDr. Ondřej Bojar, Ph.D. |
Author: | hidden |
Date of registration: | 19.01.2022 |
Date of assignment: | 26.01.2022 |
Date of confirmation by the Student Affairs Office: | 01.02.2022 |
Date and time of defence: | 10.06.2024 09:00 |
Date of electronic submission: | 02.05.2024 |
Date of printed submission: | 02.05.2024 |
Date of completed defence: | 10.06.2024 |
Opponents: | Mgr. Ondřej Plátek |
Guidelines |
When the output of automatic speech recognition (ASR) systems is to be processed in downstream tasks, we often face a serious format mismatch: ASR emits individual recognized words, while the downstream components are typically trained on well-formed sentences including casing and punctuation. Further text notation conventions also apply, e.g. for numbers or abbreviations.
The goal of the thesis is to design, implement and evaluate a deep neural network model that processes speech recordings together with their text transcription, created automatically by a first-stage ASR system, and normalizes the transcription to a particular format. At minimum, word capitalization and punctuation should be predicted. Additional aspects of text normalization, e.g. presenting numbers or other units in their common notation, are also possible. Optionally, the system may attempt to correct the text for errors in the automatic transcription or even errors in the speech itself, i.e. to carry out some tasks from the area of speech reconstruction. An inherent part of the thesis is a careful evaluation of the proposed system in several small variations and also against a baseline that processes only the ASR output, with no access to the speech signal. The evaluation can rely on automatic metrics (e.g. a custom definition of word error rate considering punctuation) or the performance in a downstream task (e.g. machine translation), but a small manual qualitative assessment is also required. The system will be limited to independent processing of reasonably short spans of sound (e.g. 8 seconds). A very useful possible extension of the topic is to adapt the system for the streaming use case, i.e. for the situation where the input comes as an unlimited stream of sound and the system is expected to produce a continuous stream of text. |
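The guidelines mention "a custom definition of word error rate considering punctuation" without fixing its form. One possible reading, given here only as a minimal sketch and not as the metric actually used in the thesis, is a token-level edit distance in which punctuation marks count as separate tokens and casing is preserved, so that a missing comma or a lowercased sentence start is penalized like a wrong word. The function names `tokenize`, `edit_distance` and `punctuation_aware_wer` are hypothetical and chosen for illustration.

```python
import re


def tokenize(text):
    # Assumption: words (with internal apostrophes) and individual punctuation
    # marks are separate tokens; casing is kept so capitalization errors count.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def edit_distance(ref, hyp):
    # Standard Levenshtein distance over token lists, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def punctuation_aware_wer(reference, hypothesis):
    # Edits needed to turn the hypothesis into the reference,
    # divided by the number of reference tokens.
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Raw ASR-style output compared against a normalized reference.
    print(punctuation_aware_wer("Hello, world.", "hello world"))
```

In this example the reference has four tokens (`Hello`, `,`, `world`, `.`), and the hypothesis needs one substitution for the casing and two insertions for the punctuation, giving a score of 0.75. A plain WER on lowercased, punctuation-stripped text would report 0 for the same pair, which is why a punctuation-aware variant is useful for evaluating normalization.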
References |
Courtland, M., Faulkner, A., & McElvain, G. (2020). Efficient Automatic Punctuation Restoration Using Bidirectional Transformers with Robust Inference. IWSLT.
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460.
Ihori, M., Makishima, N., Tanaka, T., Takashima, A., Orihashi, S., & Masumura, R. (2021). MAPGN: Masked Pointer-Generator Network for Sequence-to-Sequence Pre-Training. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7563–7567.
Öktem, A., Farrús, M., & Wanner, L. (2017). Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech. SLSP.
Straka, M., Náplava, J., Straková, J., & Samuel, D. (2021). RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model. TSD.
Sunkara, M., Ronanki, S., Bekal, D., Bodapati, S., & Kirchhoff, K. (2020). Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech. INTERSPEECH.
Fitzgerald, E., Jelinek, F., & Frank, R. (2009). What lies beneath: Semantic and syntactic analysis of manually reconstructed spontaneous speech. ACL/IJCNLP, 746–754. |