"Learning Data-Driven Representations for Biomolecular Analysis, Optimi" by Egbert Castro

Date of Award

Fall 2023

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Biology and Bioinformatics

First Advisor

Krishnaswamy, Smita

Abstract

Recently, the study of biomolecules has undergone a shift toward high-throughput experimentation, leveraging advancements across various fields to produce datasets of unprecedented scale. Within these datasets, there are undiscovered truths that carry major implications for our understanding of biology and our ability to treat disease. However, innovative approaches are needed to distill signals from noise and enable the next generation of discoveries. Deep learning methods have excelled in their ability to learn from complex and large datasets in many domains, but their potential for biomolecular data has yet to be fully realized. Realizing it requires integrating powerful biological priors into model architectures and developing regularizations that reflect the needs of biomolecular tasks. In this thesis, we aim to provide insight into the way deep learning architectures can be designed to generate a greater understanding of biomolecular structure and function. This work is divided into three parts. In Part 1, we explore ways of encoding biomolecular structure by combining the strong theoretical foundation of graph signal processing with the expressiveness of neural networks. Part 2 introduces an approach to learning meaningful and well-organized embedding spaces from the internal representations of large models trained on biomolecular sequences. Highly efficient exploration of, and generation from, these latent embedding spaces opens a new route to designing biomolecules and interpreting their function. Next, in Part 3, we build on the findings of Part 2 to propose a multi-modal architecture for effectively integrating data from two sources and generating interpretable visualizations of how these modes interact. Overall, this work proposes a path to gaining novel insights into biomolecular data by more fully leveraging the learned representations of powerful, regularized models.
