Incorporating Ancestry Information into Association Studies Provides New Insights into the Genetic and Epigenetic Architecture of Smoking Trajectory

Date of Award

Spring 2022

Document Type


Degree Name

Doctor of Philosophy (PhD)


Statistics and Data Science

First Advisor

Zhao, Hongyu


Tobacco smoking is a leading cause of morbidity and mortality and the most preventable cause of a wide range of medical conditions and diseases. Studies have shown that smoking is associated with both genetic predisposition and DNA methylation signatures. To date, association studies have identified thousands of genetic variants and epigenetic alterations associated with smoking behavior. However, a large number of existing studies have focused on a single ancestry group, mostly samples of European descent. Furthermore, most of the commonly used phenotypes, including contrasts of smoking status (e.g., smoking initiation, smoking cessation) and quantitative phenotypes (e.g., pack-years, number of cigarettes per day), are derived from self-reported responses to questionnaires. Phenotypes based on a single measure of self-reported health behavior are subject to underreporting due to social desirability bias and overlook the time-varying nature of tobacco use. Electronic health records (EHRs) provide an opportunity to develop longitudinal phenotypes that account for variation in smoking behavior over time. In chapter 1, we report a large genome-wide association study (GWAS) of smoking trajectory in a multi-ancestry cohort. We identified 18 loci for smoking trajectory, contrasting mostly current versus mostly never smokers, in European Americans, one locus in African Americans, and one in Hispanic Americans. Functional co-localization integrating multi-omics data prioritized several dozen genes for the identified significant loci, adding biological insights into the genetic vulnerability for smoking behavior over time. We further focused on the admixed African American (AA) samples and integrated DNA methylation data to better understand the epigenetic mechanism of smoking trajectory. As in GWAS, population structure is a widely recognized confounding factor in epigenome-wide association studies (EWAS).
However, existing studies have handled genetic admixture either through EWAS stratified by ancestry group or by adjusting for it as a covariate in the association model, overlooking the within-population heterogeneity across admixed individuals. In chapter 2, we present a thorough comparison of the utility of three ancestry variables, i.e., self-reported race, global ancestry, and local ancestry (LA), in accounting for interindividual variability in DNA methylation. We demonstrate the merits of local ancestry in better capturing the heterogeneity in genetic admixture and identifying ancestry-associated DNA methylation in AA samples. Furthermore, we incorporated LA into methylation quantitative trait locus (meQTL) mapping and identified meQTL effects that differ significantly between African and European ancestry backgrounds. Inspired by the findings in chapter 2, we integrated LA to account for within-population heterogeneity in the EWAS and meQTL identification of smoking trajectory. In chapter 3, we report a smoking trajectory EWAS with LA adjustment that identified 64 CpG sites significantly associated with smoking trajectory, including 15 novel methylation sites and two novel smoking-associated genes, ZNF782 and SNAP23. We show that the smoking trajectory EWAS with LA adjustment identified more significant CpG sites at a comparable inflation level and achieved a higher replication rate than the EWAS without LA adjustment. We then introduce a multi-marker Bayesian model with a sparse prior to identify ancestry-associated meQTLs. Our meQTL model allows the genetic effect of the risk allele to differ by ancestry background. The identified ancestry-associated meQTLs suggest that genetic admixture plays an important role in influencing smoking-associated DNA methylation in admixed samples.
Taken together, our multi-ancestry study sheds light on the unique and shared genetic and epigenetic architecture underlying smoking trajectory and other smoking phenotypes derived from a single measure of self-reported health behavior. We demonstrate the utility of local ancestry in capturing genetic admixture and accounting for within-population heterogeneity, and we provide a framework for applying local ancestry estimates in association studies. Our findings have important implications for the conduct of genetic and epigenetic research in admixed populations.
