Making the Most of Polygenic Risk Scores: Risk Decomposition, Cross-Population Prediction, and Clinical Application

Date of Award

Spring 2022

Document Type


Degree Name

Doctor of Philosophy (PhD)


Computational Biology and Bioinformatics

First Advisor

Zhao, Hongyu


An essential component of human genetics research and precision medicine is the development of robust and accurate disease risk prediction models from genetic data together with other risk factors. Accurate prediction models will have a great impact on disease prevention and on early and effective treatment. With the remarkable success achieved by genome-wide association studies (GWAS) in the past 17 years, numerous single-nucleotide polymorphisms (SNPs) associated with complex human traits and diseases have been identified, and various methods, broadly called polygenic risk scores (PRS), have been proposed to utilize GWAS data to predict genetic risk. Compared to risk prediction models that require individual-level data to train, summary statistics-based PRS methods are widely adopted in practice because they are computationally efficient and do not require access to individual-level training data. To date, based on GWAS summary statistics derived from large population studies, a suite of PRSs has been developed for a number of diseases, with substantial stratification capacity and great potential for clinical use. Despite this progress, many aspects of PRS need to be further developed. First, a PRS that uses all the markers in the genome reflects the joint effect of different pathways without distinguishing among them; that is, pathway-level exploration of genetic burden, which may offer insights into interpretation and intervention, is lacking. Second, most PRS methods are designed for and applied to populations of European descent, where most large-scale GWAS have been performed. PRS are considerably less predictive in under-represented populations, owing both to the lack of large-scale GWAS in these populations and to the difficulty of transferring genetic results from European samples, given differences in linkage disequilibrium, allele frequencies, and genetic architectures.
Lastly, in addition to genetic factors, it is also important to consider demographics, lifestyle behaviors, and disease histories when assessing an individual’s disease risk. In this dissertation, we developed and applied PRS methods to address these three issues. First, we used a region-based strategy to decompose a global PRS for coronary artery disease (CAD), CAD-PRS, into pathway-specific PRS (PS-PRS) to investigate pathway-level genetic burdens for CAD. Based on samples from UK Biobank, we show that PS-PRS stratified individuals with high CAD-PRS into 9 subgroups with distinct genetic risk compositions and phenotypic features. Notably, the relative changes of lipoprotein-A and LDL-cholesterol were 45.2% and 17.1%, respectively, in the lipids-pathway subgroup compared to all others. A significant interaction was observed between statin treatment and lipids-PS-PRS among individuals with high overall CAD genetic burden. Results from phenome-wide association studies and interaction analyses together suggest that there is substantial heterogeneity among people with high CAD-PRS, and that our PS-PRS can provide individualized quantification of the relative importance of the underlying pathways. Second, we proposed two novel statistical models, xPred and xPred-anno, to predict disease risk in different populations by leveraging GWAS summary statistics across populations. In particular, xPred-anno also incorporates functional annotations to further improve genetic risk prediction. Through comprehensive simulations and real data analyses of multiple complex traits and diseases, we demonstrate that both xPred and xPred-anno can substantially increase prediction accuracy compared to other PRS methods, especially in under-represented populations with fewer GWAS samples. Third, we investigated the interactions of PRS with lifestyle behaviors and the potential of PRS in predicting recurrent events for cardiovascular diseases (CVD).
Through comprehensive interaction analyses, we found that individuals with high genetic risk may derive a similar relative but greater absolute benefit from lifestyle adherence. By building a meta-PRS_CVD from candidate PRSs for three CVD subtypes, we found that the predictive capacity of PRS was not limited to disease onset but extended to recurrent events. This finding highlights the potential role of genomic screening in secondary prevention of CVD, especially among early-onset CAD patients.
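
As an illustration of the decomposition idea described in the abstract, the following minimal Python sketch treats a PRS as a weighted sum of allele dosages and splits it into pathway-specific partial sums. It is not the dissertation's method: the abstract does not specify how the region-based strategy assigns SNPs to pathways, so the one-pathway-per-SNP mapping, the function names, the SNP identifiers, and all weights below are hypothetical.

```python
from collections import defaultdict


def polygenic_score(dosages, weights):
    """Global score: weighted sum of allele dosages (0, 1, or 2) over all SNPs."""
    return sum(w * dosages.get(snp, 0) for snp, w in weights.items())


def pathway_scores(dosages, weights, snp_to_pathway):
    """Split the global score into per-pathway partial sums.

    Assumption for this sketch: each SNP belongs to at most one pathway;
    unassigned SNPs are collected under "other".
    """
    scores = defaultdict(float)
    for snp, w in weights.items():
        pathway = snp_to_pathway.get(snp, "other")
        scores[pathway] += w * dosages.get(snp, 0)
    return dict(scores)


# Hypothetical toy data: per-SNP effect weights, pathway labels, one genotype.
weights = {"rs1": 0.12, "rs2": 0.05, "rs3": -0.03, "rs4": 0.08}
snp_to_pathway = {"rs1": "lipids", "rs2": "lipids", "rs3": "inflammation"}
dosages = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

global_score = polygenic_score(dosages, weights)
by_pathway = pathway_scores(dosages, weights, snp_to_pathway)
```

Under this one-pathway-per-SNP assumption, the pathway-specific scores add up to the global score, so two individuals with the same global score can still differ in how that burden is distributed across pathways.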

This document is currently not available here.