Next Generation Phenotyping in Clinical Trial based on Real-World Data

Date of Award

Fall 10-1-2021

Document Type


Degree Name

Doctor of Philosophy (PhD)


Computational Biology and Bioinformatics

First Advisor

Krumholz, Harlan


Real-world data (RWD), or information generated outside of a traditional clinical trial, including electronic health records (EHRs), insurance claims and billing activity, product and disease registries, medical devices used in the home, and applications (apps) on mobile devices, provide a key resource that can be used to advance biomedical investigation and drive discovery. While RWD can have significant value, analysis of these data has limitations and often requires new analytic approaches. One of the key areas in utilizing RWD is identifying patients with certain characteristics of interest through a process known as digital phenotyping. Despite its great benefits and potential, identifying digital phenotypes using EHRs remains a significant informatics challenge because of the heterogeneity, incompleteness, and dynamic nature of EHR data. A primary research area that next generation phenotyping can improve is patient identification for clinical trials. The number of clinical trials available globally increased from 3,294 in 2004 to 381,600 in 2021, giving patients more options for treating their conditions. This is especially true for patients diagnosed with cancer: about 141,506 clinical trials (around 40% of all available trials) fall in the Cancers and Other Neoplasms category. However, it has been repeatedly estimated and reported that fewer than 5% of adult cancer patients enroll in cancer trials. Patient recruitment is a key determinant in a clinical trial’s success. The current gold standard, obtaining and organizing these data by manual abstraction, is time-consuming and error-prone, which greatly limits the accuracy, scope, and timeliness of information extraction (IE). Automating these manual tasks would be a breakthrough, as it would increase the value of otherwise unavailable data for clinical trial patient screening.
Current healthcare information technology systems are typically limited in their ability to support the development and implementation of automated IE methods. This is particularly true in cancer care, where specific clinical vocabularies are needed to capture the semantic meaning and temporal context of clinical findings. The rapid evolution of artificial intelligence (AI) techniques, including machine learning (ML) and natural language processing (NLP), has shown significant promise for overcoming these obstacles. This dissertation aimed to develop a novel, real-time, scalable, AI-driven next generation phenotyping approach to support cohort identification. Automating this process has the potential to improve the efficiency and accuracy of patient matching for clinical trials and other clinical research. To better understand the existing studies in the field and to inform our approach, we first conducted a literature review of current clinical trial patient matching tools and applications. The included studies were analyzed and compared along several dimensions: automation, data source, efficacy, specialty coverage, efficiency, accuracy, rule-based vs. statistical approach, trial-centric vs. patient-centric approach, adoption of NLP techniques, and scalability. Second, we developed a novel, scalable, HIPAA- and 21 CFR-compliant, trial-centric pipeline in a real-world setting to identify patients in real time for one colorectal clinical trial conducted at a large cancer care center, based on structured and unstructured EHR data that are automatically ingested and processed in a common data model. To validate the performance (accuracy, efficiency, and speed) of the system, we first conducted a retrospective study screening patients for the clinical trial.
Patient-trial matching was evaluated on a weekly basis, the results were sent to the clinical team, and the benchmarking metrics were calculated by comparison with the gold standard, manual chart review conducted by clinical professionals. Third, to ensure high-quality analysis and to better characterize and understand the real-time EHR data and system in use, we designed a benchmarking study focused on the completeness of real-time EHR data, carried out by comparing multiple snapshots of the real-time EHR data within a defined timeframe. We validated that automated benchmarking analysis of real-time clinical data can help researchers generate insights from those data and make decisions accordingly. Fourth, to maximize the impact of such automated solutions, we implemented the pipeline, integrated it into the real-time clinical workflow, and further validated its performance in a prospective setting. We also designed a comprehensive prospective-retrospective approach to assess the integrity of the data used in the clinical trial patient matching pipeline and to validate our hypothesis that the automated clinical trial patient identification system can maintain in a real-time setting the performance it showed in retrospective studies. The developed clinical trial patient matching pipeline achieved 1) efficiency improvement (reduced manual work in patient screening), 2) high accuracy (high specificity and precision maintained relative to the gold standard manual review), 3) real-time decision support (patients screened ahead of their visits to provide timely clinical decision support), 4) integrity assessment (data quality of real-time EHR data assessed to avoid bias), 5) scalability (classification algorithms built on a common data model), and 6) automation (patient data automatically ingested from the EHR data repository).
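The comparison against manual chart review described above reduces to a confusion-matrix calculation. The following is a minimal sketch of that calculation, not the dissertation's implementation: the function name `screening_metrics`, the patient identifiers, and the example labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def screening_metrics(
    predicted: Dict[str, bool], gold: Dict[str, bool]
) -> Dict[str, Optional[float]]:
    """Compare automated eligibility flags with manual chart review.

    Both arguments map a (hypothetical) patient ID to True when the patient
    was judged eligible. Only patients present in both mappings are scored.
    """
    tp = fp = tn = fn = 0
    for patient_id in predicted.keys() & gold.keys():
        pred, truth = predicted[patient_id], gold[patient_id]
        if pred and truth:
            tp += 1          # flagged by the pipeline and confirmed by review
        elif pred and not truth:
            fp += 1          # flagged by the pipeline, rejected by review
        elif not pred and truth:
            fn += 1          # missed by the pipeline, found by review
        else:
            tn += 1          # excluded by both

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        # Return None instead of dividing by zero when a metric is undefined.
        return numerator / denominator if denominator else None

    return {
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up labels for five patients, for illustration only.
pipeline_flags = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": False}
chart_review = {"p1": True, "p2": False, "p3": False, "p4": False, "p5": True}
metrics = screening_metrics(pipeline_flags, chart_review)
```

Returning `None` for an undefined ratio keeps a week with no flagged patients from being reported as a precision of zero, which matters when evaluation is repeated on small weekly batches.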
With the developed pipeline, we learned how an automated clinical trial patient identification system performs compared to manual chart review, and how the data integrity of real-time EHRs affects digital phenotyping and other clinical studies. Moreover, with the innovative prospective-retrospective approach, we could quickly create a feedback loop in a real-world setting to improve both the clinical trial patient identification system and data quality. This study provides innovative tools to improve the efficiency and accrual rate of clinical trial patient recruitment and thereby to advance medicine.
