Thesis (Selection of subject)

Speech-Informed Inverse Text Normalization

Thesis title in Czech:	Normalizace automatických transkriptů s ohledem na zdrojový zvukový signál
Thesis title in English:	Speech-Informed Inverse Text Normalization
Key words:	normalizace automatických transkriptů|multimodalita|automatické rozpoznávání řeči|zpracování přirozeného jazyka|hluboké učení
English key words:	inverse text normalization|multimodality|automatic speech recognition|natural language processing|deep learning
Academic year of topic announcement:	2021/2022
Thesis type:	diploma thesis
Thesis language:	English
Department:	Institute of Formal and Applied Linguistics (32-UFAL)
Supervisor:	doc. RNDr. Ondřej Bojar, Ph.D.
Author:	hidden - assigned and confirmed by the Study Dept.
Date of registration:	19.01.2022
Date of assignment:	26.01.2022
Confirmed by Study dept. on:	01.02.2022
Date and time of defence:	10.06.2024 09:00
Date of electronic submission:	02.05.2024
Date of submission of printed version:	02.05.2024
Date of proceeded defence:	10.06.2024
Opponents:	Mgr. Ondřej Plátek

Guidelines

When the output of automatic speech recognition (ASR) systems is to be processed in downstream tasks, we often face a serious format mismatch: ASR emits individual recognized words, while the downstream components are typically trained on well-formed sentences including casing and punctuation. Further text notation conventions also apply, e.g. for numbers or abbreviations.

The goal of the thesis is to design, implement and evaluate a deep neural network model that processes speech recordings together with their text transcriptions, produced automatically by a first-stage ASR system, and normalizes the transcriptions to a particular format. At minimum, word capitalization and punctuation should be predicted. Additional aspects of text normalization, e.g. presenting numbers or other units in their common notation, are also possible.
Optionally, the system may attempt to correct the text for errors in the automatic transcription or even errors in the speech itself, i.e. to carry out some tasks from the area of speech reconstruction.
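
To make the minimum requirement concrete, the following sketch shows one common way to frame capitalization and punctuation prediction: as per-word labels attached to lowercase, unpunctuated ASR-style words. It is only an illustration under stated assumptions, not the formulation prescribed by this topic. The function names `encode` and `decode`, the label names `LOWER`, `CAP` and `UPPER`, and the punctuation set are all hypothetical choices made for the example.

```python
# Illustrative label scheme (an assumption of this sketch, not part of the topic):
# each word gets a casing label and an optional trailing punctuation mark.
PUNCT = ".,?!"


def encode(sentence):
    """Turn a well-formed sentence into ASR-style words plus per-word labels."""
    words, labels = [], []
    for token in sentence.split():
        punct = token[-1] if token[-1] in PUNCT else ""
        core = token[:-1] if punct else token
        if not core:
            continue  # a bare punctuation token; ignored in this sketch
        if core.isupper() and len(core) > 1:
            case = "UPPER"  # e.g. acronyms
        elif core[0].isupper():
            case = "CAP"    # first letter capitalized
        else:
            case = "LOWER"
        words.append(core.lower())
        labels.append((case, punct))
    return words, labels


def decode(words, labels):
    """Apply predicted labels to ASR-style words to restore the formatted text."""
    out = []
    for word, (case, punct) in zip(words, labels):
        if case == "UPPER":
            word = word.upper()
        elif case == "CAP":
            word = word[0].upper() + word[1:]
        out.append(word + punct)
    return " ".join(out)


words, labels = encode("Hello, we met in Prague. I work at NASA.")
print(words)
print(decode(words, labels))
```

A model following this framing would predict `labels` from `words` (and, in the speech-informed setting, from the audio as well). Note that the scheme is lossy: mixed-case words such as "HuBERT" and number formatting cannot be expressed with these labels and would need a richer output space.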

An inherent part of the thesis is a careful evaluation of the proposed system in several small variations and against a baseline that processes only the ASR output, with no access to the speech signal. The evaluation can rely on automatic metrics (e.g. a custom definition of word error rate that takes punctuation into account) or on performance in a downstream task (e.g. machine translation), but a small manual qualitative assessment is also required.
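
As an illustration of what a punctuation-aware word error rate could look like, the sketch below treats every punctuation mark as a separate token, keeps casing significant, and divides the token-level edit distance by the reference length. This is one possible definition, assumed here for the example; the names `tokenize`, `edit_distance` and `punct_aware_wer` are hypothetical, and the thesis is free to define the metric differently.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens, keeping case."""
    return re.findall(r"\w+|[^\w\s]", text)


def edit_distance(ref, hyp):
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def punct_aware_wer(reference, hypothesis):
    """Edit distance over case-sensitive word and punctuation tokens,
    normalized by the number of reference tokens."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


print(punct_aware_wer("Hello, world.", "hello world"))  # 0.75
```

In the usage line, the reference has four tokens (`Hello`, `,`, `world`, `.`) and the raw ASR-style hypothesis needs one substitution and two deletions to match it, giving 3/4. The simple tokenizer splits contractions such as "don't" into three tokens, so a real evaluation would likely need a more careful tokenization.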

The system will be limited to independent processing of reasonably short spans of sound (e.g. 8 seconds). A very useful possible extension of the topic is to adapt the system for the streaming use case, i.e. for the situation where the input arrives as an unbounded stream of sound and the system is expected to produce a continuous stream of text.
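
The span-based processing described above can be sketched as a small generator that cuts incoming audio into fixed-length spans. This is a minimal illustration, assuming the audio arrives as blocks of raw samples; the function name `stream_spans` and its parameters are hypothetical. A real streaming system would probably cut at pauses or keep overlapping context between spans rather than cutting blindly at fixed offsets.

```python
def stream_spans(blocks, sample_rate, span_seconds=8.0):
    """Yield fixed-length spans of samples from an iterable of sample blocks.

    blocks       -- iterable of lists of samples, arriving incrementally
    sample_rate  -- samples per second
    span_seconds -- length of each span; the last span may be shorter
    """
    span_len = int(round(span_seconds * sample_rate))
    if span_len <= 0:
        raise ValueError("span length must be at least one sample")
    buffer = []
    for block in blocks:
        buffer.extend(block)
        # Emit as many complete spans as the buffer currently holds.
        while len(buffer) >= span_len:
            yield buffer[:span_len]
            buffer = buffer[span_len:]
    if buffer:
        yield buffer  # trailing partial span at the end of the input


# Toy input: 20 "samples" at 1 sample per second, delivered in uneven blocks.
blocks = [list(range(5)), list(range(5, 12)), list(range(12, 20))]
for span in stream_spans(blocks, sample_rate=1, span_seconds=8.0):
    print(len(span), span[0], span[-1])
```

Each yielded span could then be passed independently to the normalization model, which matches the basic setting; the streaming extension would additionally have to decide how to stitch the per-span text outputs into one continuous stream.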

References

Courtland, M., Faulkner, A., & McElvain, G. (2020). Efficient Automatic Punctuation Restoration Using Bidirectional Transformers with Robust Inference. IWSLT.

Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460.

Ihori, M., Makishima, N., Tanaka, T., Takashima, A., Orihashi, S., & Masumura, R. (2021). MAPGN: Masked Pointer-Generator Network for Sequence-to-Sequence Pre-Training. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7563–7567.

Öktem, A., Farrús, M., & Wanner, L. (2017). Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech. SLSP.

Straka, M., Náplava, J., Straková, J., & Samuel, D. (2021). RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model. TSD.

Sunkara, M., Ronanki, S., Bekal, D., Bodapati, S., & Kirchhoff, K. (2020). Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech. INTERSPEECH.

Fitzgerald, E., Jelinek, F., & Frank, R. (2009). What lies beneath: Semantic and syntactic analysis of manually reconstructed spontaneous speech. ACL/IJCNLP, 746–754.