Doctor of Philosophy (PhD)
Great efforts have been made to understand the mechanisms of complex diseases. Besides studying environmental factors and lifestyles, it is imperative to find disease-causal genes. With the development of sequencing technology and the rapid accumulation of diverse types of high-throughput biological data, data integration has become a promising direction for identifying disease genes. Genes affect diseases through different biological activities, so integrating different biological data can improve both the power of gene discovery and the understanding of pathogenic mechanisms. In recent years, a great number of omics databases have become available. Methods for multi-omics data integration have improved the statistical power of gene identification and the mapping of genetic risk factors to specific cell types or epigenomic functions. Many traits have been found to share common genetic factors, and researchers have uncovered complex relationships between multiple traits and multiple genes. With the growing number of large biobank studies and genome-wide association studies (GWAS), many multi-trait modeling methods have been proposed to test for the existence of shared genetic factors between diseases, to quantify them, and to improve the statistical power of GWAS.
In this dissertation, I propose new statistical models and methods for gene identification through data integration. I aim to combine different biological data and to integrate information shared between traits. Specifically, in the first chapter, I propose a comprehensive and powerful pipeline for integrative data analysis to identify genes associated with idiopathic pulmonary fibrosis (IPF). By integrating GWAS with transcriptome data and leveraging shared genetic factors between traits, the pipeline identified 24 novel genes for IPF susceptibility, expanding the understanding of the complex genetic architecture of IPF.
In the second chapter, building on the success of multi-trait analysis in the first chapter, I propose a novel statistical model called MAGAL: Multi-trait analysis of GWAS summary statistics using local genetic correlation. The goal is to leverage local pleiotropic effects to increase the statistical power of identifying trait-gene associations and to dissect disease heterogeneity using genetic factors shared across traits. MAGAL identified 144 candidate genes associated with chronic obstructive pulmonary disease (COPD) and showed improved power compared to previous methods. Integrative analysis of lung eQTL, bulk, and single-cell expression data prioritized 22 genes and suggested novel disease-related pathways. Genetic risk scores constructed from genetic factors shared between COPD and eosinophil percentage identified subgroups with heterogeneous phenotypic characteristics and indicated new COPD subtypes.
In the third chapter, building on the success of combining GWAS and transcriptomic data in the first chapter, I propose a new statistical framework called INSECT: Integrative analysis of exomics and single-cell transcriptomics for gene prioritization. The aim is to prioritize disease-causal genes using exome-wide association study and single-cell expression data. Thirty-one variants in coding regions and one gene were found to be significantly associated with COPD. INSECT identified five significantly associated cell types and prioritized the 1047 genes with the highest probability of being disease-causal. These results highlight the importance of mitochondrial dysfunction in COPD and the mechanisms shared between COPD and cancers.
Chen, Ming, "Statistical Methods for Identifying Genetic Risk Factors of Lung Diseases" (2021). Yale Graduate School of Arts and Sciences Dissertations. 313.