Date of Award
Spring 1-1-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Public Health
First Advisor
Zhao, Hongyu
Abstract
DNA methylation is one of the most widely studied epigenetic modifications; it captures the cumulative effects of environmental exposures and heritable factors and regulates gene expression. DNA methylation occurs at CpG sites, where cytosine and guanine bases are connected by a phosphate group. Over the past decades, DNA methylation data have been a useful resource for understanding the etiology and mechanisms of complex diseases. Multiple studies have linked DNA methylation with diseases to identify differentially methylated CpGs that are associated with complex traits. Another research direction focuses on the single nucleotide polymorphisms (SNPs) associated with DNA methylation levels, which are known as methylation quantitative trait loci (meQTLs). Studies on meQTLs have shed light on the complex interplay between the genome and the methylome. However, most studies to date have used bulk samples composed of distinct cell types. Bulk DNA methylation samples reflect the aggregated methylation profile across all cell types, which provides no insight into individual cell types. The high cost and technical limitations of both cell sorting and single-cell DNA methylation approaches hinder the collection of large-scale, cell-type-specific (CTS) methylation profiles and limit our ability to move current studies from the “bulk level” to the “cell type level.” Given the difficulty of generating large-scale CTS methylome data and the broad availability of bulk methylation datasets, one alternative is to develop statistical methods that infer CTS signals from bulk data. In chapter 1, we propose a hierarchical Bayesian interaction model (HBI) to infer CTS meQTLs from bulk methylation data, with an optional step to incorporate priors derived from CTS data in a small group of samples. We show through simulations that HBI improves the estimation of CTS genetic effects.
Applying our method to real data, we systematically characterize the genome-wide SNP-CpG associations in multiple cell types of peripheral blood mononuclear cells (PBMCs). Through colocalization and enrichment analyses, we demonstrate the utility of HBI for improving the annotation of functional genetic variants and enhancing the understanding of the cellular specificity of complex traits. Although we demonstrate the advantages of HBI over other state-of-the-art methods, all current methods require knowledge of the cell type proportions of the individuals in the bulk data. The biological ground truth for these sample-level cell type proportions is rarely available, so the proportions must be estimated using computational algorithms. Those computationally estimated proportions are always noisy and may introduce additional uncertainty in the deconvolution step. Thus, in the second chapter we introduce Uncertainty-aware Bayesian Deconvolution (UBD), a method that incorporates uncertainty quantification for cell type proportions when deconvoluting bulk profiles into cell-type-specific profiles. Through simulations and real data applications, we systematically evaluate the performance of UBD in identifying CTS signals. Compared with the original framework, which does not incorporate uncertainty, UBD infers sample-level CTS profiles with higher accuracy and reveals more CTS signals in differential analysis and QTL studies. The methods proposed in the first two chapters both rely on directly measured DNA methylation data. In the third chapter, we aim to build prediction models for DNA methylation from genotypes, which could be applied to a broad range of datasets without direct measurements of methylation. To date, the published methods for genotype-predicted methylation have mostly used datasets from populations of European ancestry to train the prediction models, which limits their generalizability to non-European populations.
To address this gap, we propose a novel model called Local-Ancestry Methylation Predictor with Preselection step (LAMPP). We incorporate local ancestry (LA) information to improve the prediction accuracy of DNA methylation in admixed populations (e.g., African Americans). We demonstrate that our method achieves higher prediction accuracy than the conventional model (without LA) and other LA models. The application of our model to an admixed cohort identifies genetically regulated CpGs that are associated with seven complex traits and reveals an important region for white blood cell counts in African Americans. Overall, the dissertation first introduces two novel methodologies to deconvolute CTS signals from bulk methylation data. HBI can provide accurate estimates of CTS meQTLs (chapter 1), and UBD goes a step further by incorporating uncertainty in the cell type proportions when deconvoluting the CTS profiles (chapter 2). The dissertation then proposes LAMPP, a prediction model for admixed populations that can be applied to impute DNA methylation in these understudied populations when direct measurements of methylation are unavailable (chapter 3). We believe the statistical methods developed in this dissertation offer powerful tools for gaining comprehensive insights into the cell-type specificity and genetic regulation of DNA methylation.
Recommended Citation
Cheng, Youshu, "Statistical Methods for Investigating Cell-Type Specificity and Genetic Regulation of DNA Methylation in Complex Traits" (2025). Yale Graduate School of Arts and Sciences Dissertations. 1532.
https://elischolar.library.yale.edu/gsas_dissertations/1532