Date of Award

Spring 2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational Biology and Bioinformatics

First Advisor

Krishnaswamy, Smita

Abstract

In recent years, modern technologies have enabled the collection of exponentially larger quantities of data in the biomedical domain and elsewhere. In particular, the advent of single-cell genomics has allowed for the collection of datasets containing hundreds of thousands of cells measured in tens of thousands of dimensions. This rapid expansion of common datasets beyond the possibility of manual annotation brings forth the need for large-scale exploratory data analysis. In this thesis, we explore the problem of dimensionality reduction for visualization of high-dimensional datasets. Visualization of high-dimensional data is an essential task in exploratory data analysis, as the low-dimensional visualization of the data is used to understand, interrogate and present the results of many other analyses applied to the data. However, the existing algorithms used for this task suffer from various flaws that lead to sub-optimal visualizations, including the trade-off between representing local and global structure; the inherent sacrifices that must be made to reduce a dataset of intrinsic dimension greater than three to a form that can be interpreted by the human eye; and the computational cost of these algorithms as datasets increase in scale. Here, we use the framework provided by diffusion maps to present a new dimensionality reduction algorithm called PHATE, which seeks to address all three of these issues. In order to make the PHATE algorithm scalable, we present an approximation of the diffusion map through discrete partitions of the data called Compression-based Fast Diffusion Maps. Further, we use the insights gained from visualizing single-cell genomics data to present a manifold alignment algorithm called Harmonic Alignment, which allows for the correction of systemic differences between experiments, or the fusion of datasets collected from the same biological system using different assays. Finally, we present an extension of PHATE to longitudinal data, and demonstrate its utility for machine learning interpretability by visualizing the hidden units of a neural network during training. While many open problems remain, the methods presented herein chart a path towards a more systematic understanding of how we visualize high-dimensional data for exploratory data analysis.
