Date of Award

January 2014

Document Type

Open Access Thesis

Degree Name

Medical Doctor (MD)



First Advisor

Michael Krauthammer

Second Advisor

Jose Costa



Despite pressure from the federal government for US hospitals to adopt electronic medical record (EMR) systems, the benefits of adopting such systems have not been fully realized. One proposed advantage of EMRs is secondary use, in which personal health information is used for purposes other than direct health care delivery, particularly quality improvement. We sought to determine whether information recorded in the EMR could improve the diagnostic pathways used to identify respiratory viruses in children, the most common etiology of diagnoses in the pediatric population. Tests for these viruses potentially represent a source of unnecessary testing.

We performed a retrospective observational study of pediatric inpatients who received respiratory virus testing at Yale-New Haven Children's Hospital between March 2010 and March 2012. Billing data (age, gender, season), laboratory data (sample adequacy, results), and clinical documents were gathered. We used MetaMap, a program distributed by the National Library of Medicine, to identify phrases denoting symptoms and diseases in patients' admission notes. The identified concepts were added as variables to be modeled. Weka, another freely available software package that allows for easy incorporation of machine learning algorithms, was used to derive models based on the C4.5 decision tree algorithm that aim to predict whether or not patients should be tested.

Orders for pediatric patients accounted for 26.3% of all respiratory virus test orders placed during this time. Negative test results accounted for 69.5% of all tests ordered during the study period. The lengths of stay for all viral diagnoses were not statistically different. Models based on age, gender, and season alone were predictive for influenza (AUC 0.743, SE = 0.126), parainfluenza (AUC 0.686, SE = 0.078), RSV (AUC 0.658, SE = 0.048), and hMPV (AUC 0.713, SE = 0.143). Using MetaMap terms alone, only the model for RSV showed discriminatory ability (AUC 0.661, SE = 0.048). When basic variables were used in conjunction with MetaMap concepts, only the model for RSV showed improved performance (AUC 0.722, SE = 0.051) in comparison to both the basic and MetaMap models.

Respiratory virus tests for general admission pediatric inpatients are ordered year-round and are mostly negative. Using models based on decision tree learning, our results showed that test volume could be reduced by about 20-50% for certain tests, as measured by model specificity. Furthermore, clinical concepts obtained via text mining, used in conjunction with basic variables, improved prediction of RSV test results. The false negative rates that must be accepted to achieve any substantive specificity may be mitigated by our finding that hospital stays were nearly identical regardless of the diagnostic outcome. These results support the use of EMR data for auditing and improving laboratory utilization. In addition, the improvement in predictive modeling for RSV with a simple implementation of text mining supports the idea that clinical notes are amenable to secondary use.
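
The abstract summarizes each model by its AUC with a standard error and treats specificity as a proxy for the share of tests that could be avoided. As a rough illustration of how such figures can be computed, the sketch below calculates an AUC from predicted scores, an approximate standard error, and sensitivity and specificity from confusion-matrix counts. The thesis does not state how its standard errors were obtained, so the Hanley-McNeil approximation used here is an assumption. All function names, scores, and counts are hypothetical and do not come from the study.

```python
from math import sqrt


def auc_rank(scores_pos, scores_neg):
    """AUC as the probability that a virus-positive case receives a higher
    score than a virus-negative case (ties count as one half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil approximation to the standard error of an AUC.
    Assumption: the thesis may have used a different estimator."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc * auc)
        + (n_neg - 1) * (q2 - auc * auc)
    ) / (n_pos * n_neg)
    return sqrt(variance)


def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Toy model scores (hypothetical, not study data).
pos_scores = [0.9, 0.8, 0.4]        # patients whose test was positive
neg_scores = [0.7, 0.3, 0.2, 0.1]   # patients whose test was negative

auc = auc_rank(pos_scores, neg_scores)
se = hanley_mcneil_se(auc, len(pos_scores), len(neg_scores))

# Toy confusion-matrix counts (hypothetical, not study data).
sens, spec = sensitivity_specificity(tp=8, fp=30, tn=20, fn=2)

print(f"AUC = {auc:.3f}, SE = {se:.3f}")
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this toy example a specificity of 0.40 means the model would have flagged 40% of the truly negative tests as unnecessary, at the cost of missing the 20% of positive cases implied by a sensitivity of 0.80. This is the same kind of tradeoff the abstract describes.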


This is an Open Access Thesis.
