Date of Award
Fall 2023
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computational Biology and Bioinformatics
First Advisor
Krishnaswamy, Smita
Abstract
Recently, the study of biomolecules has undergone a shift toward high-throughput experimentation, leveraging advancements across various fields to produce datasets of unprecedented scale. Within these datasets, there are undiscovered truths that carry major implications for our understanding of biology and our ability to treat disease. However, innovative approaches are needed to distill signals from noise and enable the next generation of discoveries. Deep learning methods have excelled in their ability to learn from complex and large datasets in many domains, but their potential for biomolecular data has yet to be fully realized. Realizing that potential requires integrating powerful biological priors into model architectures and developing regularizations that reflect the needs of biomolecular tasks. In this thesis, we aim to provide insight into the way deep learning architectures can be designed to generate a greater understanding of biomolecular structure and function. This work is divided into three parts. In Part 1, we explore ways of encoding biomolecular structure by combining the strong theoretical foundation of graph signal processing with the expressiveness of neural networks. Part 2 introduces an approach to learning meaningful and well-organized embedding spaces from the internal representations of large models trained on biomolecular sequences. Highly efficient exploration of, and generation from, these latent embedding spaces opens a new route to designing biomolecules and interpreting their function. Next, in Part 3, we build on the findings of Part 2 to propose a multi-modal architecture for effectively integrating data from two sources and generating interpretable visualizations of how these modes interact. Overall, this work proposes a path to gaining novel insights into biomolecular data by more fully leveraging the learned representations of powerful, regularized models.
Recommended Citation
Castro, Egbert, "Learning Data-Driven Representations for Biomolecular Analysis, Optimization, and Design" (2023). Yale Graduate School of Arts and Sciences Dissertations. 1232.
https://elischolar.library.yale.edu/gsas_dissertations/1232