De novo generation of small molecules from mass spectra via discrete diffusion model
| Thesis title in Czech: | De novo generování malých molekul z hmotnostních spekter pomocí modelu diskrétní difúze |
|---|---|
| Thesis title in English: | De novo generation of small molecules from mass spectra via discrete diffusion model |
| Czech key words: | Difuzní modely; Generování molekulových grafů; Tandemová hmotnostní spektra; Podmíněné generování |
| English key words: | Diffusion models; Molecular graph generation; Tandem mass spectra; Conditional generation |
| Academic year of topic announcement: | 2024/2025 |
| Thesis type: | Bachelor's thesis |
| Thesis language: | English |
| Department: | Department of Software and Computer Science Education (32-KSVI) |
| Supervisor: | Josef Šivic |
| Author: | hidden - assigned and confirmed by the Study Dept. |
| Date of registration: | 27.03.2025 |
| Date of assignment: | 10.04.2025 |
| Confirmed by Study dept. on: | 10.04.2025 |
| Date and time of defence: | 20.06.2025 09:00 |
| Date of electronic submission: | 07.05.2025 |
| Date of submission of printed version: | 07.05.2025 |
| Date of proceeded defence: | 20.06.2025 |
| Opponents: | doc. RNDr. David Hoksza, Ph.D. |
| Guidelines |
| The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, predicting molecular structures directly from mass spectra, rather than selecting from a reference database, remains a significant bottleneck in mass spectrometry. This thesis focuses on adapting the DiGress discrete diffusion model [1] for small molecule generation from mass spectra, incorporating mass spectrometry data using DreaMS [2] and MassSpecGym [3]. The objectives of the thesis are:
1. Review the related work on de novo generation of small molecules from mass spectra.
2. Train the DiGress model on a dataset of 4 million biologically relevant molecules from [3] to obtain a pre-trained model capable of generating valid molecular structures given their molecular formulae.
3. Evaluate the model trained in step 2 on the MassSpecGym "de novo bonus chemical formulae" challenge, in which molecules are generated solely from their chemical formulae, regardless of their mass spectra. This establishes a baseline before any mass spectra information is integrated.
4. Extend the DiGress model to accept additional inputs in the form of DreaMS embeddings of mass spectra [2]. Then fine-tune the pre-trained model with the DreaMS embeddings as conditioning and re-evaluate its performance on the MassSpecGym benchmark. |
| References |
| [1] Vignac et al., 2022, "DiGress: Discrete Denoising diffusion for graph generation", https://doi.org/10.48550/arXiv.2209.14734, ICLR 2023.
[2] Bushuiev et al., 2025, "Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra", https://doi.org/10.26434/chemrxiv-2023-kss3r-v2, Nature Biotechnology.
[3] Bushuiev et al., 2024, "MassSpecGym: A benchmark for the discovery and identification of molecules", https://doi.org/10.48550/arXiv.2410.23326, NeurIPS 2024. |
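Objectives 2 and 3 in the guidelines rely on generating molecules from a chemical formula alone. As a minimal illustration of what that conditioning signal contains, the sketch below parses a flat formula into element counts, derives the multiset of heavy atoms that a graph generator could use as a fixed node set, and checks whether a candidate molecule is consistent with the formula. The guidelines do not say how the formula is fed to DiGress, so treating the heavy atoms as fixed nodes with implicit hydrogens is an assumption, and the helper names `parse_formula`, `heavy_atom_nodes`, and `matches_formula` are hypothetical and belong to no cited library.

```python
import re
from collections import Counter

# An element symbol (capital letter plus optional lowercase letter)
# followed by an optional integer count, e.g. "C", "Cl", "H12".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Parse a flat formula such as 'C6H12O6' into element counts.

    Brackets, charges, and isotope labels are not supported in this sketch.
    """
    counts = Counter()
    pos = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != pos:
            # finditer silently skips unmatched characters; reject them.
            raise ValueError(f"unexpected character at index {pos} in {formula!r}")
        element, number = match.groups()
        counts[element] += int(number) if number else 1
        pos = match.end()
    if not formula or pos != len(formula):
        raise ValueError(f"could not parse formula {formula!r}")
    return counts


def heavy_atom_nodes(counts: Counter) -> list:
    """List one node label per non-hydrogen atom, in alphabetical order.

    Assumption: hydrogens are implicit, so only heavy atoms become nodes.
    """
    nodes = []
    for element in sorted(counts):
        if element != "H":
            nodes.extend([element] * counts[element])
    return nodes


def matches_formula(candidate_atoms: list, formula: str) -> bool:
    """Check that a candidate's heavy atoms agree with the target formula."""
    target = Counter(heavy_atom_nodes(parse_formula(formula)))
    return Counter(candidate_atoms) == target
```

Under these assumptions, `heavy_atom_nodes(parse_formula("C2H5Cl"))` gives two carbon nodes and one chlorine node, leaving only the bonds between them to be generated. A check like `matches_formula` compares heavy atoms only; a full consistency check would also need the hydrogen count implied by the generated bonds.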