Latent Space Construction for Analyzing Large Genomic Data Sets

Date of Award

Fall 10-1-2021

Document Type


Degree Name

Doctor of Philosophy (PhD)


Public Health

First Advisor

Kane, Michael


This dissertation presents a new framework for extracting signal from high-dimensional data using a latent-space construction, with applications to genomic data sets. The construction follows the standard approach of estimating the rank of a matrix that contains useful information (signal) plus additive noise. Rank estimation usually depends on ill-posed methods that may retain too much noise or discard too much signal. In contrast, our approach provides accurate and robust estimates of the detectable signal, including distributional bounds on the magnitude a signal must have to be detectable in the presence of noise. Along with the proposed framework, we explore methodological extensions and applications to genomic data sets. The framework is used to identify a shared low-dimensional signal subspace when integrating multiple data sets and removing batch effects. This dimension-estimation procedure is the basis for a second innovation: a model that performs unsupervised clustering and subspace identification simultaneously. In conjunction with this model, we propose a sample-wise prototype score that measures the “representativeness” or “prototypicalness” of each sample with respect to its assigned cluster. When applied to genomic data sets, the proposed framework improves current procedures in quality control, dimension reduction, and cell type identification.
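The signal-plus-noise setting described in the abstract can be illustrated with a generic sketch. This is not the dissertation's procedure: it is a common textbook-style heuristic that counts singular values above the approximate edge of the noise spectrum. It assumes i.i.d. Gaussian noise with a known standard deviation `sigma`; the function name `estimate_rank`, the `margin` parameter, and the simulated data are all hypothetical.

```python
import numpy as np


def estimate_rank(X, sigma, margin=1.1):
    """Count singular values of X above an approximate noise edge.

    Assumption (not from the source): the noise is i.i.d. N(0, sigma^2)
    with known sigma. For such an n x p noise matrix the largest singular
    value concentrates near sigma * (sqrt(n) + sqrt(p)); `margin` is a
    hypothetical safety factor that absorbs random fluctuation.
    """
    n, p = X.shape
    edge = margin * sigma * (np.sqrt(n) + np.sqrt(p))
    singular_values = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(singular_values > edge))


# Simulated example: a planted rank-3 signal plus Gaussian noise.
rng = np.random.default_rng(0)
n, p, true_rank, sigma = 200, 100, 3, 1.0

# Orthonormal left and right factors for the planted signal.
U, _ = np.linalg.qr(rng.standard_normal((n, true_rank)))
V, _ = np.linalg.qr(rng.standard_normal((p, true_rank)))
strengths = np.array([120.0, 90.0, 60.0])  # signal singular values

signal = (U * strengths) @ V.T
noise = sigma * rng.standard_normal((n, p))
X = signal + noise

estimated_rank = estimate_rank(X, sigma)
noise_only_rank = estimate_rank(noise, sigma)
print("estimated rank:", estimated_rank)
print("rank of pure noise:", noise_only_rank)
```

In this simulation the planted singular values sit well above the noise edge, so the heuristic should recover the planted rank. Signals whose strength is close to the edge are the hard case, and that is where bounds on detectable signal magnitude of the kind the abstract mentions become relevant.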

This document is currently not available here.