Yale Graduate School of Arts and Sciences Dissertations

Interpretable and Data-Driven Machine Learning Models for Analyzing High-Dimensional Biological Data

Junchen Yang, Yale University Graduate School of Arts and Sciences

Date of Award

Spring 1-1-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Biology and Bioinformatics

First Advisor

Kluger, Yuval

Abstract

High-dimensional biological data, including single-cell omics, are prevalent in modern biology for characterizing a wide range of biological processes and phenomena. Although our capacity to generate such data has expanded at an unprecedented pace, significant challenges remain in extracting the informative features and the underlying biological signals. In particular, these datasets not only exhibit high dimensionality but also suffer from issues including inherent noise, low signal-to-noise ratio, and intrinsic heterogeneity. While machine learning models hold great promise for analyzing such data, their complexity and sensitivity to data characteristics often impede biological interpretability. To address these interconnected challenges, this thesis develops three rigorous frameworks that prioritize interpretable, data-driven modeling across diverse biological scenarios. In chapter 1, we introduce Locally Sparse Interpretable Network (LSPIN), a novel neural network model that directly addresses the challenge of heterogeneity in supervised learning tasks. As a dual-network model, LSPIN identifies the most predictive features for each sample via a gating network while predicting the outcome via its prediction network. We show that LSPIN achieves state-of-the-art performance on various real-world datasets while maintaining interpretability. This architecture proves particularly valuable for survival analysis and marker gene identification, where understanding feature importance at the sample level provides crucial biological insights. In chapter 2, we present Biwhitened Principal Component Analysis (BiPCA), a mathematically grounded model for processing high-dimensional omics count data. BiPCA is specifically designed to handle the complex noise structure in the data. It reveals the underlying rank of the data through a rigorous noise standardization procedure termed biwhitening, then optimally denoises the data to recover the underlying signals. 
In addition, BiPCA is highly adaptive, capable of accommodating a wide range of distributions from different modalities, and can further assess how well it fits the data. We demonstrate its application on more than 100 datasets and highlight the benefits of its accurate rank estimation to data analysis. We also show its superior performance on single-cell downstream tasks such as enhancing marker gene expression, preserving cell neighborhoods, and mitigating batch effects. In chapter 3, we address the growing need for multi-modal analysis by developing mmDUFS (Multi-modal Differentiable Unsupervised Feature Selection). mmDUFS bridges the gap between single-modality approaches by identifying both shared biological processes and modality-specific signals through novel graph operators. In addition, mmDUFS identifies the informative biological features associated with these processes to provide further interpretability. The method significantly outperforms existing approaches in synthetic benchmarks and reveals novel biological insights in single-cell multi-omics applications. Together, these three frameworks (LSPIN, BiPCA, and mmDUFS) illustrate how interpretable, data-driven modeling can tackle the core challenges of high-dimensional biological data. LSPIN employs sample-specific gating to identify the most predictive features, BiPCA accurately selects the signals and optimally denoises data to reveal the underlying structures, and mmDUFS extends the capabilities to multi-modal data, capturing both shared and modality-specific signals. By prioritizing versatile and rigorous methodology, each framework advances our ability to extract meaningful biological information from complex datasets, thereby opening new avenues for discovery in modern, data-intensive biomedical research.

Recommended Citation

Yang, Junchen, "Interpretable and Data-Driven Machine Learning Models for Analyzing High-Dimensional Biological Data" (2025). Yale Graduate School of Arts and Sciences Dissertations. 1580.
https://elischolar.library.yale.edu/gsas_dissertations/1580
