Deconvolution: from Transcriptomics to Immunomics

Date of Award

Fall 10-1-2021

Document Type


Degree Name

Doctor of Philosophy (PhD)


Public Health

First Advisor

Zhao, Hongyu


Advances in sequencing technologies over the past two decades have enabled exciting findings and the rapid accumulation of many kinds of -omics data. Each type of -omics data is designed to measure a specific characteristic of a biological system, e.g., DNA sequencing for mutations and RNA sequencing for gene expression. With proper statistical and computational methods, these data can be integrated to answer new questions. This data repurposing is possible because the target signals are already encoded in the data and only require proper tools to decode them. For this reason, such methods are often called "deconvolution" methods. In this dissertation, I will present three computational deconvolution methods along this line. The first two methods aim to infer cell type proportions and cell type-specific gene expression from bulk gene expression data. The third project extends this idea to the field of immunomics and presents a novel computational framework to decode TCR (T cell receptor)-MHC (Major Histocompatibility Complex) association from TCR repertoire sequencing data.

Chapter 1 introduces NITUMID, a Non-negative Matrix Factorization (NMF)-based deconvolution method for estimating tumor and immune cell proportions from bulk tumor gene expression data. NITUMID addresses two main challenges in existing cell type proportion deconvolution methods: estimating tumor and immune cell proportions simultaneously, and accounting for variable mRNA levels across cell types.

In Chapter 2, I asked a question that goes beyond the scope of conventional deconvolution: how can changes due to cell type-specific gene expression profiles be distinguished from changes due to cell type proportions? To this end, I developed SCADIE, an iterative algorithm that simultaneously estimates cell type-specific gene expression profiles and cell type proportions, and performs cell type-specific differential expression analysis at the group level. Through comprehensive simulations and real data analyses, I demonstrated that it is possible to attribute the changes in bulk gene expression levels across a set of conditions to both changes in cell type proportions and changes in cell type-specific gene expression profiles.

In Chapter 3, I took the deconvolution concept from transcriptomic data analysis and applied it to decode TCR-MHC associations in the field of immunomics. The rationale behind this application is that TCR-MHC association shapes the change in TCR frequency during thymic selection, and TCR repertoire data can provide insights into both pre-selection and post-selection frequencies. Through mathematical modeling and comprehensive parameter estimation, I proposed an estimation framework that produces inferences highly consistent with the literature. I believe that, with further accumulation of high-quality TCR repertoire sequencing data, this framework can generate much richer insights into the important problem of TCR-MHC association.
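
As a rough illustration of the general idea behind NMF-based deconvolution mentioned above, the sketch below factors a toy bulk expression matrix Y (genes x samples) into a non-negative signature matrix W (genes x cell types) and a non-negative mixing matrix H (cell types x samples), then rescales each column of H into proportions. It is not NITUMID or SCADIE: it uses the generic Lee-Seung multiplicative updates, and all function names, the toy data, and the parameter values are assumptions made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(y, w, h):
    """Frobenius norm of the residual Y - WH."""
    wh = matmul(w, h)
    return sum((yv - av) ** 2 for yr, ar in zip(y, wh) for yv, av in zip(yr, ar)) ** 0.5


def nmf_deconvolve(y, k, n_iter=500, seed=0, eps=1e-12):
    """Factor Y ~ WH with W, H >= 0 using Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n_genes, n_samples = len(y), len(y[0])
    # Strictly positive random initialization (hypothetical choice).
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_genes)]
    h = [[rng.random() + 0.1 for _ in range(n_samples)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T Y) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, y), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n_samples)]
             for a in range(k)]
        # W <- W * (Y H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(y, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n_genes)]
    return w, h


def to_proportions(h):
    """Rescale each sample (column) of H so that its entries sum to 1."""
    col_sums = [sum(col) for col in zip(*h)]
    return [[v / s for v, s in zip(row, col_sums)] for row in h]


# Toy data (made up): 6 genes, 2 cell types, 4 bulk samples.
true_w = [[10, 1], [8, 2], [9, 1], [1, 12], [2, 9], [1, 10]]
true_h = [[0.8, 0.6, 0.3, 0.1], [0.2, 0.4, 0.7, 0.9]]
y = matmul(true_w, true_h)

w, h = nmf_deconvolve(y, k=2)
props = to_proportions(h)
print("reconstruction error:", frobenius_error(y, w, h))
for row in props:
    print([round(v, 3) for v in row])
```

Plain NMF of this kind recovers factors only up to permutation and rescaling of the components, so the estimated proportions need not match the true mixing matrix. Practical deconvolution methods typically add constraints or prior information, such as marker genes, to make the components interpretable as cell types.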

This document is currently not available here.