Welcome to the Day of Data 2014 Posters page.

Poster Session

A Study of the N-D-K Scalability Problem in Large-Scale Image Classification

Carlos E. del-Castillo-Negrete, Yale University
Sreenivas R. Sukumar, Oak Ridge National Lab

Image classification is an extensively studied problem that lies at the heart of computer vision. However, the challenge remains to develop a system that can identify and classify thousands of objects like the human visual system. The accumulation of massive image data sets has permitted the study of this problem at a big-data scale. However, current algorithms have been shown to fall short of being practical and accurate at scale. To further understand how these algorithms scale, we developed a library of functions to explore the scalability of the linear support vector machine (SVM) classification algorithm when applied to problems of image classification. Our study provides valuable insights not only into how the SVM algorithm scales up and where it falls short, but also into how to create smarter and more efficient image classifiers that are fine-tuned for the large-scale image classification challenge.

Applying novel tree-based frameworks to big data for classification of heart failure patients and prediction of clinical responses

Yan Zhang, Yale University
Nicholas Downing, Yale University
Emily Bucholz, Yale University
Suganthi Balasubramanian, Yale University
Shu-Xia Li, Yale University
Tara Liptak, Yale University
Harlan Krumholz, Yale University
Mark Gerstein, Yale University

Over 5 million Americans suffer from heart failure, a condition with a 5-year survival that eclipses all cancers apart from that of lung cancer. Conventional understanding of heart failure is simplistic: it is viewed as a single syndrome, despite real heterogeneity. In addition, models predicting outcomes focus on dichotomous results, like 30-day readmission. A novel approach to classification of heart failure may improve our ability to target interventions, improve patient experiences, and predict outcomes.

The Healthcare Cost and Utilization Project is a family of administrative claims databases that describes patient demographics, comorbidities, procedures, acute care utilization and outcomes, such as mortality and readmission. Using the California datasets, which allow linkage of hospital admissions to emergency department visits, we sought to (1) develop a new classification tool for heart failure, (2) predict patient response based on previous visits, and (3) predict survival time.

In this pilot study, we propose novel tree-based frameworks for the classification of heart failure patients that can also be used to predict clinical response, health care utilization and mortality. The pilot sample contains 822 patients with heart failure who were randomly selected from a total sample of 211,284 patients. The median number of encounters per patient was 3 (IQR: 5); each encounter is associated with up to 168 variables. By applying random forest approaches to this pilot sample, we have performed classification of patients with heart failure and identified important predictors of outcomes. Going forward, we will refine the model and apply it to the entire data set to produce broadly applicable insights.

Beyond Original Intent – The Use of a Corporation’s Administrative Databases for Academic Research

Martin D. Slade, Yale University
Linda Cantley, Yale University
Baylah Tessier-Sherman, Yale University
Deron Galusha, Yale University
Michael McTague, Yale University

Large corporations maintain a variety of administrative databases as part of their normal operations. These databases, created for distinct functions by separate organizational entities, are generally independent. For instance, a company’s Human Resources organization typically maintains a database containing information such as demographics, job and salary history, and employee status for all employees. The environmental, health and safety department maintains information regarding workplace exposures and exposure levels for various agents within each job as well as injury and illness surveillance records. The medical department maintains occupational health information including audiometric and pulmonary function test results. As many large corporations are self-insured, they also have medical claims data available by employee that includes diagnosis codes, procedure codes, and prescription drug codes. Additional data maintained by corporations may include production output and quality information, employee contributions to retirement plans and health savings accounts, as well as workers' compensation information.

A synergistic partnership between industry and academia allows company-maintained databases to be linked. This linkage enables research into associations between demographic, occupational and social factors that are not otherwise available to researchers, and makes it possible to define and test interventions to promote health and safety in the workplace. An almost 20-year relationship between Alcoa, Inc. and Yale University School of Medicine continues to facilitate investigation of root causes of disease and injury risk in a large manufacturing cohort. To date, over 50 peer-reviewed research papers have resulted from this joint venture.

Breadth of Emotion Vocabulary in Middle Schoolers

Marina Ebert, Yale University
Zorana Ivcevic, Yale University
Sherri S. Widen, Yale University
Lance Linke, Yale University
Marc Brackett, Yale University

How many different emotion words can middle schoolers think of to describe major categories of emotional experiences? While most existing ability tests of emotion understanding and vocabulary are based on word recognition, the goal of this study was to assess prompted emotion word generation. Students in 5th-8th grades (N=236) were asked to list all the feeling words they could think of to describe five major emotion groups (happiness, calm, sadness, anger and nervousness). They also completed an ability measure of emotion understanding, the Mayer, Salovey, Caruso Emotional Intelligence Test – Youth Version (MSCEIT-YV). When asked to generate emotion descriptors, students produced a range of responses, from specific target emotion words (e.g., joy and pleasure describing the ‘happy’ category), to descriptors of closely associated emotions (e.g., love and pride describing the ‘happy’ category), to non-emotion descriptors (e.g., laughing or dancing describing the ‘happy’ category). Students produced 1472 unique responses (M=27.3, SD=10.9), with target emotion responses accounting for 22.4% of responses (M=12.23, SD=4.8). Most target emotion responses were generated for the happiness-related feelings (54 different terms), and the fewest for calm-related feelings (25 terms). Older students and girls performed better on both measures of emotion understanding. Positive correlations were found between scores on the MSCEIT-YV and the overall number of target emotion responses, r=.25, p<.01, as well as the overall number of associated emotion responses, r=.19, p<.01. This study offers an important approach to learning about emotion vocabulary by providing insight into emotion word generation among early adolescents.

Digitally Mapping the Growth of the Railroads in the United States

Michael Weaver, Yale University

As part of my dissertation, I am creating digital maps of the extent of the railways in the United States during the late 19th century (1880 to 1910) on a yearly basis. While other researchers have created digital maps of the railways at approximately 10-year intervals, these miss the rapid change in the railways in the interim. These previous digitization attempts have relied on detailed maps of the railways created at a given time. But accurate maps were not made on a yearly basis and only exist for roughly every 10 years. However, during the 19th century, people using the railways to travel or ship freight needed accurate guides to the railways. The Official Guides to the Railways served this role. These Guides were updated and published multiple times a year. They contain route maps and time tables for all of the railways. Most importantly, however, each Guide includes an Index of Railway Stations. I have obtained 30 volumes of this resource from Stanford, worked with the Scan and Deliver services at Yale to scan the relevant portions, and started to digitize the indices of railway stations. While I have had one volume transcribed by hand, I am working to use OCR to digitize the text of the remaining volumes. Once the text is transcribed, I will use several geocoding tools to identify the locations of railway stations in each year between 1880 (possibly earlier) and 1910. An example of what this looks like for one year (1910) can be seen here. This presentation will focus on using various techniques to digitize and make innovative use of historical textual resources to create useful data.

Early life environment, fertility and age of menarche: A test of life history predictions using a longitudinal assessment of adversity perception and economic status

Dorsa Amir, Yale University
Matthew R. Jordan, Yale University
Richard G. Bribiescas, Yale University

Perceptions of early life environmental adversity can affect the timing of life history transitions and investment in reproductive effort. These effects are well documented in non-human organisms, but have been challenging to test in humans. Here we present evidence of the effects of variables associated with extrinsic mortality and morbidity on reproductive effort in a contemporary American population. Using a longitudinal database that sampled participants (N ≥ 1,579) at four points during adolescence and early adulthood, variables reflective of perceptions of adversity and risk were significantly associated with age of menarche and early adult fertility. While other factors related to energetics and somatic condition could not be assessed, the results of this study support the hypothesis that perceptions of adversity early in life are associated with differences in reproductive effort. How these factors may covary with energetics and somatic condition remains to be assessed.

Environmental Performance Index

Andrew Moffat, Yale University

The Environmental Performance Index (EPI) ranks how well countries perform on high-priority environmental issues in two broad policy areas: protection of human health from environmental harm and protection of ecosystems.

Exurban Residents’ Perceptions of Naturally Returning Predators: Connecticut Case Study

Margaret E. Sackrider, Yale University
Susan G. Clark, Yale University
Isaac M. Ortega, University of Connecticut - Storrs

As a result of reforestation, growth of exurban areas and wildlife adaptation, it is believed that the public is currently encountering more human-wildlife conflicts than ever before. The key to balancing wildlife conservation and human development is understanding the dynamic relationship between humans and carnivores. Specifically, gaining insight into the complexity of this relationship will aid in the creation of more effective conservation policy and outreach.

Reforestation throughout Connecticut has supported a tremendous population growth of prey species and subsequently the growth of predator populations including coyotes, Canis latrans, and black bears, Ursus americanus. According to some biologists, the state will likely see an increase in mountain lions, Felis concolor, in the future. Prior to this study there was little knowledge regarding residents' perceptions of these predators, making it impossible to implement management strategies reflective of residents’ beliefs and opinions.

The goals of this project were to explain the social context regarding large predators in the state of Connecticut, as well as to work toward bridging the gap between ecological and social science disciplines. An explanatory sequential grounded theory approach was taken to achieve these goals. A spatial representation of perceived threat was created using black bear sighting reports made to the Connecticut Department of Energy and Environmental Protection. From these data, reporters were randomly selected to participate in interviews. Qualitative data collected through interviews and participant observation were used to create a case-specific survey. Quantitative data were also collected through surveys of randomly identified Connecticut residents.

International Occupational Health Research on an "Invisible" Workforce

Martin D. Slade, Yale University
Rafael Lefkowitz, Yale University

There are many professions in which employees work in remote locations. International maritime workers make up one such occupation. They are a vulnerable, underserved and neglected population of approximately 1.2 million people with high rates of disease and injury. During their typical nine-month deployments, they live in relative isolation with no health care professional on board. To understand the root causes of disease and injury among this remote workforce, strategies to collect information, analyze data, and report results and recommendations have been developed. These strategies, which include gathering data through an alliance of companies involved in seafaring, have yielded initial results on the predictors of serious illness and injury on board vessels that require the repatriation of the employee. These same methods should be applicable to other isolated international workforces.

Managing the Data: The Tell Ziyadeh Archaeological Project

Yukiko Tonoike, Yale University, Dept. of Anthropology
Dawn Brown, Yale University, Dept. of Anthropology, Associate
Frank Hole, Yale University, Dept. of Anthropology

Archaeological research depends on several types of data: material, contextual and analytical. Material data refers to the actual artifacts, features, and sites themselves. Contextual data are location, local geography, chronology, cross-correlations among data sets, and historical and ethnographic information. Analyses may be geochemical (petrographic, isotopic, pXRF), stylistic, or comparative archaeological. For an effective understanding of archaeological sites, a research project must be based on a research design suited for effective data recovery, analysis, interpretation and synthesis. With the development of digital technology, the amount of data that can be incorporated into each archaeological project has grown exponentially, and making these data accessible for other researchers and creating digital archives is critically important, especially as archaeology is a highly destructive science. In this poster, we use the example of our Tell Ziyadeh project to show how archaeologists deal with such issues.

Partitioning Bipartite Graphs: A Modified Louvain

Emily Diana


How do we find communities in a graph? How does this change if the graph is bipartite? The Louvain method maximizes links within communities and minimizes those between in order to determine an optimal grouping. Yet, because it may fail when bipartite restrictions are introduced, we have adjusted the null model so as to improve performance in these conditions.
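
As a rough illustration of what changing the null model means, the sketch below scores one fixed partition under two null models: the standard Newman-Girvan modularity that Louvain optimizes, and Barber's bipartite modularity, which only expects edges between the two vertex sets. Barber's formulation is an assumption chosen for illustration; the abstract does not say which adjusted null model was used. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict

def standard_modularity(edges, community):
    """Newman-Girvan modularity: Q = sum_c [ e_c/m - (d_c / 2m)^2 ]."""
    m = len(edges)
    internal = defaultdict(int)   # edges with both endpoints in community c
    total_deg = defaultdict(int)  # summed degree of all nodes in community c
    for u, v in edges:
        total_deg[community[u]] += 1
        total_deg[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (total_deg[c] / (2 * m)) ** 2 for c in total_deg)

def bipartite_modularity(edges, community):
    """Barber's bipartite modularity: Q = sum_c [ e_c/m - L_c * R_c / m^2 ].

    Every edge must be given as a (left_node, right_node) pair.
    """
    m = len(edges)
    internal = defaultdict(int)
    left_deg = defaultdict(int)   # summed degree of left nodes in community c
    right_deg = defaultdict(int)  # summed degree of right nodes in community c
    for u, v in edges:
        left_deg[community[u]] += 1
        right_deg[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    labels = set(left_deg) | set(right_deg)
    return sum(internal[c] / m - left_deg[c] * right_deg[c] / m ** 2 for c in labels)

# Hypothetical toy graph: two complete bipartite blocks joined by one extra edge.
toy_edges = [(a, b) for a in ("a1", "a2") for b in ("b1", "b2")]
toy_edges += [(a, b) for a in ("a3", "a4") for b in ("b3", "b4")]
toy_edges.append(("a1", "b3"))
toy_partition = {"a1": 0, "a2": 0, "b1": 0, "b2": 0,
                 "a3": 1, "a4": 1, "b3": 1, "b4": 1}

print("standard :", standard_modularity(toy_edges, toy_partition))
print("bipartite:", bipartite_modularity(toy_edges, toy_partition))
```

The two scores differ for the same partition because the bipartite null model never expects edges inside a vertex set, so a Louvain-style optimizer driven by one score can prefer different groupings than one driven by the other.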


Our Bipartite Louvain is more robust with respect to permutations of vertices than the standard Louvain. For our synthetic examples, Bipartite Louvain typically yields a higher modularity and uncovers the ground truth communities with a higher probability. In the future, we will examine real world data sets with our modified algorithm.

Rotating optical microcavities with broken chiral symmetry

Raktim Sarma, Yale University
Li Ge, The Graduate Center, CUNY
Jan Wiersig, Universitat Magdeburg
Hui Cao, Yale University

We develop a finite-difference time-domain simulation algorithm to simulate photonic structures in a rotating frame. Using the algorithm, we numerically demonstrate that, in open microcavities with broken chiral symmetry, quasi-degenerate pairs of co-propagating modes in a non-rotating cavity evolve into counter-propagating modes with rotation. The emission patterns change dramatically with rotation, due to the distinct output directions of clockwise (CW) and counterclockwise (CCW) waves. By tuning the degree of spatial chirality, we maximize the sensitivity of microcavity emission to rotation. The rotation-induced change of emission is orders of magnitude larger than the Sagnac effect, pointing to a promising direction for ultrasmall optical gyroscopes.

ShelfScan: Streamlining library shelving, expanding quality control

Lauren F. Brown, Yale University
Jason Zentz, Yale University
Osman Din, Yale University

ShelfScan, a web-based application developed in-house at Sterling Memorial Library (SML), has streamlined the shelving process at SML and Bass and expanded quality control at multiple libraries by verifying materials scanned with a Bluetooth scanner against the library database.

Prior to ShelfScan, when a book was shelved in the library stacks, it was first opened in order to insert a paper “recently shelved” flag; later it was revisited and reopened to check call number order. This manual accuracy checking did not reveal other anomalies such as incorrect collection, incorrect availability status, or catalog discrepancies. With ShelfScan, books are shelved in a more efficient fashion, and accuracy checking is provided by scanning sections of stacks. Scanning is scheduled at more convenient times because it is now dissociated from the shelving process.

Scanned barcodes are transmitted to a text file, which is uploaded to ShelfScan along with user-input parameters, including the location/collection of the material scanned. ShelfScan builds two virtual files of barcodes: a File Order Table, which holds the records in the order they were scanned, and a Sorted Records Table, which sorts by call number. By comparing these tables and mining data from the library catalog database, the application produces an exception report that identifies incorrectly shelved items and many other types of errors. All history is maintained in an SQL Server database, which allows for the creation of dynamic statistical reports with SQL Server Reporting Services (SSRS). These reports in turn inform decisions about which areas need ongoing scanning.
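
The comparison of the two tables described above can be sketched in a few lines. This is a minimal illustration and not ShelfScan's implementation: the in-memory `catalog` dictionary stands in for the library database, `call_number_key` is a deliberately simplified sort key (real call-number ordering has more rules), and all names and sample data are hypothetical.

```python
import re

def call_number_key(call_number):
    """Simplified sort key: class letters, class number, then the remainder."""
    match = re.match(r"([A-Z]+)\s*(\d+(?:\.\d+)?)\s*(.*)", call_number.strip().upper())
    if match is None:
        return ("", 0.0, call_number)  # unparseable call numbers sort first
    letters, number, rest = match.groups()
    return (letters, float(number), rest)

def exception_report(scanned, catalog):
    """scanned: barcodes in scan order; catalog: barcode -> call number."""
    not_in_catalog = [b for b in scanned if b not in catalog]
    # "File Order Table": records in the order they were scanned.
    file_order = [b for b in scanned if b in catalog]
    # "Sorted Records Table": the same records sorted by call number.
    sorted_records = sorted(file_order, key=lambda b: call_number_key(catalog[b]))
    # Report every shelf position where the two tables disagree.
    out_of_order = [
        (position, found, expected)
        for position, (found, expected) in enumerate(zip(file_order, sorted_records))
        if found != expected
    ]
    return {"not_in_catalog": not_in_catalog, "out_of_order": out_of_order}

# Hypothetical sample data.
catalog = {
    "001": "QA76 .A1",
    "002": "QA76.73 .P98",
    "003": "QA100 .B2",
    "004": "QB1 .C3",
}
scanned = ["001", "003", "002", "004", "999"]
report = exception_report(scanned, catalog)
print(report)
```

A position-by-position comparison is the simplest possible check. One badly misplaced book shifts every position after it, so a production tool would likely need a more forgiving comparison.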

Stratified Meta-Analysis to Examine Data Biases in Lung Cancer Studies of Refinery Workers

Sherman Selix, Yale University

Petroleum refineries employ a variety of workers who historically experienced different potentials for asbestos exposure depending on job tasks. Associations between petroleum refinery work and lung cancer related to occupational asbestos exposure have been quantified among various locations, corporations, and time periods. To combine the data from several individual refinery studies and examine an overall effect, a systematic review and stratified meta-analysis were employed. Using set search terms among four databases, 112 potential publications were identified, of which 29 qualified for meta-analysis. Risk estimates and confidence intervals were extracted from these publications to construct four separate datasets. Inverse variance weighting assuming random effects was used to combine Standardized Mortality and Incidence data separately for all male and female refinery workers, as well as both standardized and relative risk measurements for the subset of male maintenance workers, who may have been exposed to higher levels of asbestos. Males in cohorts consisting of all refinery workers, which included both blue and white collar workers, had a meta-Risk Ratio (mRR) of 0.80 (95% CI: 0.75-0.85) when compared to population controls, whereas all female refinery workers had an mRR of 1.27 (95% CI: 0.86-1.87) when compared to population controls, a statistically significant difference. Male maintenance workers exhibited an mRR of 0.88 (95% CI: 0.74-1.04) with population controls, and an mRR of 1.62 (95% CI: 1.30-2.02) when internally compared to other refinery workers. This large differential in risk estimates for the same population could be related to sampling biases in opposite directions: population controls are subject to the “healthy worker effect”, while internal comparisons may differ from maintenance workers in both socio-economics and smoking rates.
Due to these potentially confounded and conflicting results, no conclusion could be drawn regarding lung cancer risk for refinery workers. Accurate quantification of lung cancer risk for refinery workers will depend on addressing these issues.
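
For readers unfamiliar with the pooling step, the sketch below shows one common way to combine risk estimates by inverse variance weighting under random effects. It assumes the DerSimonian-Laird estimator of between-study variance, which the abstract does not name, and it recovers each standard error from the width of the 95% confidence interval on the log scale. The function name and the input values are hypothetical and are not data from this study.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval

def pool_random_effects(estimates):
    """Pool (risk_ratio, ci_low, ci_high) tuples on the log scale.

    Returns (pooled_rr, ci_low, ci_high, tau2), where tau2 is the
    DerSimonian-Laird estimate of between-study variance.
    """
    log_rr = [math.log(rr) for rr, _, _ in estimates]
    # Standard error recovered from the CI width on the log scale.
    se = [(math.log(hi) - math.log(lo)) / (2 * Z_95) for _, lo, hi in estimates]
    w = [1 / s ** 2 for s in se]  # fixed-effect inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, log_rr)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_rr))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(estimates) - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau2 to each study's variance.
    w_star = [1 / (s ** 2 + tau2) for s in se]
    pooled = sum(wi * yi for wi, yi in zip(w_star, log_rr)) / sum(w_star)
    pooled_se = math.sqrt(1 / sum(w_star))
    return (
        math.exp(pooled),
        math.exp(pooled - Z_95 * pooled_se),
        math.exp(pooled + Z_95 * pooled_se),
        tau2,
    )

# Hypothetical study estimates, for illustration only.
example_studies = [(0.85, 0.70, 1.03), (0.78, 0.66, 0.92), (0.95, 0.72, 1.25)]
pooled_rr, pooled_low, pooled_high, tau2 = pool_random_effects(example_studies)
print(f"pooled RR {pooled_rr:.2f} (95% CI: {pooled_low:.2f}-{pooled_high:.2f}), tau2={tau2:.4f}")
```

Running the same function separately on each stratum (for example, all male workers versus male maintenance workers) gives the kind of stratified comparison the abstract describes.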

Keywords: Biostatistics, Meta-analysis, Risk, Epidemiology, Review, Database, Cancer, Asbestos, Bias, Confounding

The YODA Project: Developing Methods for Sharing Clinical Trial Data

Jessica Ritchie, Yale University
Harlan Krumholz, Yale University
Joseph Ross, Yale University
Cary Gross, Yale University
Beth Hodshon, Yale University


Data sharing and data transparency are becoming the new standard in clinical research to ensure that patients and clinicians possess all necessary information about a drug or device when making treatment decisions. The Yale University Open Data Access (YODA) Project developed a model to facilitate access to participant-level clinical research data to promote independent analysis by external investigators. The YODA Project is currently collaborating with Medtronic, Inc. and Janssen, the pharmaceutical companies of Johnson & Johnson, to facilitate access to their clinical trial program data by external investigators in a manner that is aligned with the following principles: advance science and public health, conduct responsible research, ensure good stewardship of data, and promote transparency. As interest in the release of participant-level data grows, it is critical to carefully consider the approach to releasing data to ensure that the release is executed in a manner that preserves patient confidentiality, facilitates scientific productivity, and respects the values and rights of the participating parties. The YODA Project has identified several key decisions that should be addressed throughout the planning and implementation of the data release process, such as patient confidentiality, data storage, and data use. As varying methods emerge for sharing data, the YODA Project strives to be an innovative leader and to set standards for the field.


Keywords: Data sharing, data release, clinical trial data, participant-level, data storage, data transparency

Using graph visualization to look at the trajectories of events that lead to readmission

Abbas Shojaee, Yale University
Isuru Ranasinghe, Yale University
Sudhakar Nuti, Yale University
Shu-Xia Li, Yale University
Harlan Krumholz, Yale University

Information on the specific sequence of healthcare utilization events in heart failure (HF) patients may be useful for identifying distinct subpopulations of patients with HF. Knowledge of patient trajectories may help to improve prediction of future readmission, which can be used to tailor management to the individual needs of the patient.

This research introduces a new approach to mining administrative and clinical datasets by incorporating graph networks to identify and visualize the trajectories of sequences of events.
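
One simple way to turn event sequences into a graph is to treat each event type as a node and count how often one event directly follows another. The sketch below does that and also tallies the short paths that immediately precede a readmission. It is only an illustration under assumed inputs: the abstract does not describe the actual graph construction, and the event labels, patient identifiers and function names are hypothetical.

```python
from collections import Counter

def transition_graph(trajectories):
    """trajectories: patient id -> chronologically ordered event labels.

    Returns a Counter mapping (from_event, to_event) to the number of
    times that transition occurs; each key is a weighted directed edge.
    """
    edges = Counter()
    for events in trajectories.values():
        for current, following in zip(events, events[1:]):
            edges[(current, following)] += 1
    return edges

def paths_leading_to(trajectories, target, length=2):
    """Count the event sequences of the given length that directly precede target."""
    paths = Counter()
    for events in trajectories.values():
        for i, event in enumerate(events):
            if event == target and i >= length:
                paths[tuple(events[i - length:i])] += 1
    return paths

# Hypothetical trajectories, for illustration only.
trajectories = {
    "p1": ["admission", "discharge", "ed_visit", "readmission"],
    "p2": ["admission", "discharge", "readmission"],
    "p3": ["admission", "discharge", "ed_visit", "readmission"],
}
edges = transition_graph(trajectories)
for (src, dst), weight in edges.most_common():
    print(f"{src} -> {dst}: {weight}")
print(paths_leading_to(trajectories, "readmission").most_common(1))
```

The weighted edge list can be passed to any graph drawing tool, with edge thickness proportional to the transition count.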

Using Graphs to Characterize Nationwide Physician Referral Networks

Ding Tong, Yale University
Shu-Xia Li, Yale University
Isuru Ranasinghe, Yale University
Sudhakar Nuti, Yale University
Hongyu Zhao, Yale University
Harlan Krumholz, Yale University


Evaluating physician referral network characteristics can help to understand how physicians and hospitals interact to provide patient services within the US healthcare system and ultimately how this may influence patient outcomes.


We used the 2012-2013 national Physician Referral data from the Centers for Medicare & Medicaid Services (CMS), which consists of 73,071,804 pairs of referrals from one health provider to another in calendar year 2012 and the first two quarters of year 2013 within 30 days of care. These referrals are from 642,144 physicians nationwide and 4,811 hospitals. We obtained information for each provider, physician or hospital, from CMS.

We then generated a nationwide referral network. We described the network with graphs and potentially important network characteristics using graph theory and social network theory. Further, we described the sub-network using exponential random graph models (ERGMs). The coefficients from such models can reflect the properties of the network nodes and help illustrate how the network outcomes are influenced.
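
To make the idea of network characteristics concrete, the sketch below builds a directed graph from referral pairs and computes a few standard descriptive measures: in-degree, out-degree, density and reciprocity. These measures are an assumption chosen for illustration, since the abstract does not list which characteristics were computed, and the sketch does not fit an ERGM. The provider identifiers and the function name are hypothetical.

```python
from collections import defaultdict

def network_summary(referrals):
    """referrals: iterable of (from_provider, to_provider) pairs."""
    # Collapse repeated referrals into one directed edge and drop self-referrals.
    edges = {(a, b) for a, b in referrals if a != b}
    nodes = {provider for edge in edges for provider in edge}
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for a, b in edges:
        out_degree[a] += 1
        in_degree[b] += 1
    n = len(nodes)
    # Density: share of all possible directed edges that are present.
    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
    # Reciprocity: share of edges whose reverse edge also exists.
    mutual = sum(1 for a, b in edges if (b, a) in edges)
    reciprocity = mutual / len(edges) if edges else 0.0
    return {
        "nodes": n,
        "edges": len(edges),
        "density": density,
        "reciprocity": reciprocity,
        "in_degree": dict(in_degree),
        "out_degree": dict(out_degree),
    }

# Hypothetical referral pairs, for illustration only.
referrals = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "H1"), ("B", "H1")]
summary = network_summary(referrals)
print(summary)
```

Computing the same summary for each geographic area would allow the kind of area-by-area comparison reported in the results.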


Our results show that 1) the graphs and characteristics vary substantially across geographic areas and 2) the graphs and characteristics depicting the same area are strongly associated. The ERGM shows that physicians in cardiology, diagnostic radiology and geriatric medicine are more likely to send and receive referrals than physicians in family care and internal medicine in certain hospitals.


We demonstrate the use of graph-based approaches to describe and evaluate nationwide physician referral networks. Further work will study how these network characteristics are associated with hospital outcomes.