De novo generation of small molecules from mass spectra via discrete diffusion model
| Thesis title in Czech: | De novo generování malých molekul z hmotnostních spekter pomocí modelu diskrétní difúze |
|---|---|
| Thesis title in English: | De novo generation of small molecules from mass spectra via discrete diffusion model |
| Keywords in Czech: | Difuzní modely, Generování molekulových grafů, Tandemová hmotnostní spektra, Podmíněné generování |
| Keywords in English: | Diffusion models, Molecular graph generation, Tandem mass spectra, Conditional generation |
| Academic year of topic announcement: | 2024/2025 |
| Thesis type: | bachelor's thesis |
| Thesis language: | English |
| Department: | Department of Software and Computer Science Education (32-KSVI) |
| Supervisor: | Josef Šivic |
| Author: | hidden - assigned and confirmed by the Study Department |
| Date of registration: | 27.03.2025 |
| Date of assignment: | 10.04.2025 |
| Date of confirmation by the Study Department: | 10.04.2025 |
| Date and time of defence: | 20.06.2025 09:00 |
| Date of electronic submission: | 07.05.2025 |
| Date of printed submission: | 07.05.2025 |
| Date of defence held: | 20.06.2025 |
| Opponents: | doc. RNDr. David Hoksza, Ph.D. |
| Guidelines |
| The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, predicting molecular structures directly from mass spectra, rather than selecting from a reference database, remains a significant bottleneck in mass spectrometry. This thesis focuses on adapting the DiGress discrete diffusion model [1] for small molecule generation from mass spectra, incorporating mass spectrometry data using DreaMS [2] and MassSpecGym [3]. The objectives of the thesis are:
1. Review the related work on de novo generation of small molecules from mass spectra.
2. Train the DiGress model on a dataset of 4 million biologically relevant molecules from [3] to obtain a pre-trained model capable of generating valid molecular structures given their molecular formulae.
3. Evaluate the model trained in step 2 on the MassSpecGym "de novo bonus chemical formulae" challenge, in which molecules are generated solely from their chemical formulae, regardless of their mass spectra. This establishes a baseline before any mass spectra information is integrated.
4. Extend the DiGress model to accept additional inputs in the form of DreaMS embeddings of mass spectra [2]. Then fine-tune the pre-trained model with the DreaMS embeddings as conditioning and re-evaluate its performance on the MassSpecGym benchmark. |
| References |
| [1] Vignac et al., 2022, "DiGress: Discrete Denoising diffusion for graph generation", https://doi.org/10.48550/arXiv.2209.14734, ICLR 2023.
[2] Bushuiev et al., 2025, "Emergence of molecular structures from repository-scale self-supervised learning on tandem mass spectra", https://doi.org/10.26434/chemrxiv-2023-kss3r-v2, Nature Biotechnology.
[3] Bushuiev et al., 2024, "MassSpecGym: A benchmark for the discovery and identification of molecules", https://doi.org/10.48550/arXiv.2410.23326, NeurIPS 2024. |
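The formula-only baseline in the guidelines presupposes some way to turn a chemical formula into a constraint on the generated molecular graph. The following is a minimal sketch of one possible reading, not the method prescribed by the assignment: it assumes flat formulae (no parentheses, charges, or isotopes) and assumes that the formula fixes the multiset of heavy atoms used as graph nodes. The helper names `parse_formula`, `heavy_atoms`, and `matches_formula` are hypothetical and are not part of DiGress, DreaMS, or MassSpecGym.

```python
import re
from collections import Counter

# One element symbol (uppercase letter, optional lowercase letter)
# followed by an optional integer count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into a Counter of element counts.

    Assumption: no parentheses, charges, or isotope labels.
    """
    counts = Counter()
    pos = 0
    while pos < len(formula):
        match = _TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"cannot parse {formula!r} at position {pos}")
        symbol, digits = match.groups()
        counts[symbol] += int(digits) if digits else 1
        pos = match.end()
    return counts


def heavy_atoms(formula):
    """Return the sorted list of non-hydrogen atom symbols implied by the formula.

    Assumption: hydrogens are implicit and heavy atoms become graph nodes.
    """
    counts = parse_formula(formula)
    return sorted(sym for sym, n in counts.items() if sym != "H" for _ in range(n))


def matches_formula(node_symbols, num_hydrogens, formula):
    """Check that generated heavy atoms plus a hydrogen count reproduce the formula."""
    counts = Counter(node_symbols)
    if num_hydrogens:
        counts["H"] += num_hydrogens
    return counts == parse_formula(formula)
```

For example, `heavy_atoms("C2H6O")` yields the three node labels for ethanol or dimethyl ether, and `matches_formula` can serve as a sanity check that a generated structure respects the formula it was conditioned on.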