Improving Risk Factor Identification of Human Complex Traits in Omics Data

Weimiao Wu, Yale University Graduate School of Arts and Sciences


With recent advances in high-throughput technologies, the rise of omics data offers the promise of personalized health care, with the potential to expand both the depth and the breadth of the identification of risk factors associated with human complex traits. In genomics, the introduction of repeated measures and increased sequencing depth provide an opportunity for deeper investigation of disease dynamics in patients. In transcriptomics, high-throughput single-cell assays provide cell-level gene expression that captures cell-to-cell heterogeneity. This cell-level resolution brings opportunities to advance our understanding of cell function, disease pathogenesis, and treatment response, and thereby to support more precise therapeutic development. These advances come with challenges posed by increasingly complicated data sets. In genomics, because repeated measures of phenotypes are crucial for understanding the onset of disease and its temporal pattern, longitudinal designs for omics data and phenotypes are increasingly common. However, current statistical tests for longitudinal outcomes, especially binary outcomes, depend heavily on correct specification of the phenotype model. Because many diseases are rare, efficient designs are commonly applied in epidemiological studies to recruit more cases. Although such designs enhance the efficiency of the study sample, the resulting non-random ascertainment can be a major source of model misspecification, which may lead to inflated type I error and/or power loss in association analysis. In transcriptomics, the analysis of single-cell RNA-seq data faces its own challenges due to low library size, high noise levels, and prevalent dropout events. The purpose of this dissertation is to provide a methodological foundation for tackling these challenges.
We first propose a set of retrospective association tests for identifying genetic loci associated with longitudinal binary traits. These tests are robust to different types of phenotype model misspecification and to ascertainment sampling designs, which are common in longitudinal cohorts. We then extend these retrospective tests to variant-set tests for rare genetic variants, which have low detection power, by incorporating the variance component test and the burden test into the retrospective test framework. Finally, we present a novel gene-graph-based imputation method that imputes dropout events in single-cell transcriptomic data and recovers true gene expression levels by borrowing information from adjacent genes in the gene graph.