Interpretable and Data-Driven Machine Learning Models for Analyzing High-Dimensional Biological Data
Date of Award
Spring 1-1-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computational Biology and Bioinformatics
First Advisor
Kluger, Yuval
Abstract
High-dimensional biological data, including single-cell omics, are prevalent in modern biology for characterizing a wide range of biological processes and phenomena. Although our capacity to generate such data has expanded at an unprecedented pace, significant challenges remain in extracting the informative features and the underlying biological signals. In particular, these datasets are not only high-dimensional but also suffer from inherent noise, low signal-to-noise ratios, and intrinsic heterogeneity. While machine learning models hold great promise for analyzing such data, their complexity and sensitivity to data characteristics often impede biological interpretability. To address these interconnected challenges, this thesis develops three rigorous frameworks that prioritize interpretable, data-driven modeling across diverse biological scenarios. In chapter 1, we introduce the Locally Sparse Interpretable Network (LSPIN), a novel neural network model that directly addresses the challenge of heterogeneity in supervised learning tasks. As a dual-network model, LSPIN identifies the most predictive features for each sample via a gating network while predicting the outcome via its prediction network. We show that LSPIN achieves state-of-the-art performance on various real-world datasets while maintaining interpretability. This architecture proves particularly valuable for survival analysis and marker gene identification, where understanding feature importance at the sample level provides crucial biological insights. In chapter 2, we present Biwhitened Principal Component Analysis (BiPCA), a mathematically grounded model for processing high-dimensional omics count data. BiPCA is specifically designed to handle the complex noise structure in the data. It reveals the underlying rank of the data through a rigorous noise standardization procedure termed biwhitening, then optimally denoises the data to recover the underlying signals.
In addition, BiPCA is highly adaptive, accommodating a wide range of distributions from different modalities, and can further assess how well it fits the data. We demonstrate its application on more than 100 datasets and highlight the benefits of its accurate rank estimation for data analysis. We also show its superior performance on single-cell downstream tasks such as enhancing marker gene expression, preserving cell neighborhoods, and mitigating batch effects. In chapter 3, we address the growing need for multi-modal analysis by developing mmDUFS (Multi-modal Differentiable Unsupervised Feature Selection). mmDUFS bridges the gap between single-modality approaches by identifying both shared biological processes and modality-specific signals through novel graph operators. In addition, mmDUFS identifies the informative biological features associated with these processes to provide further interpretability. The method significantly outperforms existing approaches in synthetic benchmarks and reveals novel biological insights in single-cell multi-omics applications. Together, these three frameworks (LSPIN, BiPCA, and mmDUFS) illustrate how interpretable, data-driven modeling can tackle the core challenges of high-dimensional biological data. LSPIN employs sample-specific gating to identify the most predictive features, BiPCA accurately selects the signals and optimally denoises data to reveal the underlying structures, and mmDUFS extends these capabilities to multi-modal data, capturing both shared and modality-specific signals. By prioritizing versatile and rigorous methodology, each framework advances our ability to extract meaningful biological information from complex datasets, thereby opening new avenues for discovery in modern, data-intensive biomedical research.
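The dual-network idea described for LSPIN (a gating network that selects features per sample, feeding a prediction network) can be illustrated with a minimal forward-pass sketch. This is not the dissertation's implementation: the single linear layer per network, the clipped "mu + 0.5" gate, the function names, and the hand-set toy weights are all assumptions chosen for illustration, and training (including any sparsity regularization) is omitted.

```python
def linear(x, weights, bias):
    """Affine layer; `weights` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def hard_gate(mu):
    """Clip mu + 0.5 into [0, 1]; a gate of exactly 0 switches a feature off."""
    return [min(1.0, max(0.0, m + 0.5)) for m in mu]


def forward(x, gate_w, gate_b, pred_w, pred_b):
    """Return (per-sample gates, prediction) for one sample x."""
    gates = hard_gate(linear(x, gate_w, gate_b))    # gating network
    masked = [g * xi for g, xi in zip(gates, x)]    # keep only gated features
    prediction = linear(masked, pred_w, pred_b)[0]  # prediction network
    return gates, prediction


# Hypothetical toy weights: the sign of feature 0 decides whether
# feature 0 or feature 2 is used; feature 1 is always gated off.
gate_w = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
gate_b = [0.0, -1.0, 0.0]
pred_w = [[1.0, 1.0, 1.0]]
pred_b = [0.0]

gates_a, pred_a = forward([1.0, 2.0, 3.0], gate_w, gate_b, pred_w, pred_b)
gates_b, pred_b_out = forward([-1.0, 2.0, 3.0], gate_w, gate_b, pred_w, pred_b)
```

In this toy setup the two samples receive different gate vectors, so the gates themselves serve as a sample-level readout of which features drove each prediction.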
Recommended Citation
Yang, Junchen, "Interpretable and Data-Driven Machine Learning Models for Analyzing High-Dimensional Biological Data" (2025). Yale Graduate School of Arts and Sciences Dissertations. 1580.
https://elischolar.library.yale.edu/gsas_dissertations/1580