Presented posters from the 2017 Day of Data are available here. Please note that some posters are only available as abstracts.

Thank you to all of the researchers and volunteers who made the poster session a success!


A sooting tendency database for accelerating the introduction of biomass-derived fuels

Charles McEnally, Yale University
Lance K. Tan, Yale University
Dhrubajyoti D. Das, Yale University
Lisa D. Pfefferle, Yale University

One of the many potential benefits of biomass-derived fuels is lower emissions of particulate matter. To enable the selection of biofuels that maximize this benefit, we have been building a database of a property, the yield sooting index, that characterizes the tendency to produce particulates, by measuring it for hundreds of pure hydrocarbons. We have encountered many data management issues while trying to maximize the usefulness of this database. Our research field does not have a general data repository, so we are currently posting the data to the Harvard Dataverse. We have direct collaborators at the National Renewable Energy Laboratory who are using machine learning techniques to develop empirical models that fit the data and can make predictions for new compounds. We have additional collaborators at Penn State University who are modeling the data from first principles.

ASD Biomarker Detection on fMRI Images: Feature learning with Data Corruptions by Analyzing Deep Neural Network Classifier Outcomes

Xiaoxiao Li, Yale University

Autism spectrum disorder (ASD) is a complex neurological and developmental disorder. It emerges early in life and is generally associated with lifelong disability. Finding the biomarkers associated with ASD is extremely helpful for understanding the underlying roots of the disorder and finding more targeted treatments. Previous studies suggested that brain activations are abnormal in ASD; hence, functional magnetic resonance imaging (fMRI) has been used to identify ASD. In this work we addressed the problem of identifying reliable, interpretable biomarkers when classifying ASD vs. control, and we proposed a 2-step pipeline: 1) classifying ASD and control fMRI images with a deep neural network, and 2) finding which brain regions are important for distinguishing ASD from control. Specifically, in step 2, we used the trained classifier to estimate feature importance by measuring how the prediction distribution changes when a region of the input image is corrupted. However, there is no certain way to corrupt the data without adding side effects. Thus, we aggregated two "opposite" corruption methods: a) blackout and b) adding Gaussian noise. Biomarkers found by the 2-step pipeline were verified by Neurosynth brain function decoding. Several key innovations in our research include: i) we created an innovative pipeline for learning image data features by analyzing the classifier outcomes under corruption; ii) we proposed a deep learning strategy for classifying 4D data; iii) we aggregated different corruption methods for feature importance analysis; and iv) our neurological interpretation of the final results showed evidence of meaningful fMRI biomarkers for ASD.
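
The corruption-based importance estimate in step 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scalar prediction, the use of an absolute difference as the measure of prediction change, and the plain average used to aggregate the two corruptions are all assumptions made for illustration, and a toy function stands in for the trained deep neural network.

```python
import numpy as np

def blackout(image, mask):
    """Corruption (a): set every voxel inside the region to zero."""
    corrupted = image.copy()
    corrupted[mask] = 0.0
    return corrupted

def add_gaussian_noise(image, mask, sigma, rng):
    """Corruption (b): add zero-mean Gaussian noise inside the region."""
    corrupted = image.copy()
    corrupted[mask] += rng.normal(0.0, sigma, size=int(mask.sum()))
    return corrupted

def region_importance(predict, image, region_masks, sigma=1.0, seed=0):
    """Score each region by how much corrupting it shifts the prediction.

    `predict` maps an image to a scalar probability (hypothetical interface).
    The two shifts are averaged; this aggregation rule is an assumption.
    """
    rng = np.random.default_rng(seed)
    baseline = predict(image)
    scores = {}
    for name, mask in region_masks.items():
        shift_blackout = abs(predict(blackout(image, mask)) - baseline)
        noisy = add_gaussian_noise(image, mask, sigma, rng)
        shift_noise = abs(predict(noisy) - baseline)
        scores[name] = 0.5 * (shift_blackout + shift_noise)
    return scores

def toy_predict(image):
    """Stand-in classifier that only looks at the first two rows."""
    return 1.0 / (1.0 + np.exp(-image[:2].mean()))

# Toy 2D "image" with two candidate regions (boolean masks).
image = np.ones((4, 4))
upper = np.zeros((4, 4), dtype=bool)
upper[:2] = True
lower = ~upper

scores = region_importance(toy_predict, image, {"upper": upper, "lower": lower})
print(scores)
```

In this toy setup the stand-in classifier ignores the lower region, so corrupting it leaves the prediction unchanged and its score is zero, while the upper region receives a positive score. The same masking logic applies to 4D arrays, since boolean masks of matching shape index any number of dimensions.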

Coffee drinking and leukocyte telomere length: A meta-analysis

Bella Kotlyar, Yale University

Telomeres are long tandem nucleotide repeats responsible for maintaining chromosomal integrity. They shorten with each cell division, serving as markers for cellular aging and replicative ability, and shorter telomere length has been associated with greater risk of various chronic diseases of aging. There is increasing interest in the relationship between telomere length and lifestyle factors, such as components of the diet, that are associated with age-related chronic diseases, like cancer and diabetes.

There is mounting evidence that coffee, one of the most commonly consumed beverages in the world, has potential protective effects against chronic disease and mortality. Few studies have evaluated telomere length and coffee consumption specifically. A 2016 study in the Nurses’ Health Study (1) found a significant association between coffee intake and longer telomeres. A study conducted within the NHANES cohort replicated these results by also showing a statistically significant positive association between coffee consumption and telomere length (2). This study builds on that work by exploring the association in another population. We consider the cross-sectional association between coffee consumption and leukocyte telomere length among controls from 6 case-control studies nested within a large population-based cohort of U.S. adults with detailed data on dietary intake and other lifestyle factors.

  1. Liu JJ, Crous-Bou M, Giovannucci E, De Vivo I. Coffee Consumption is Positively Associated with Longer Leukocyte Telomere Length in the Nurses’ Health Study. J Nutr 2016.
  2. Tucker LA. Caffeine consumption and telomere length in men and women of the National Health and Nutrition Examination Survey (NHANES). Nutrition & Metabolism. 2017; 14:10.

Efficient dynamic centrality metrics for election advertising - a case study

Soheil Eshghi, Yale Institute for Network Science
Leandros Tassiulas, Yale Institute for Network Science

In prior work [1], we have shown how advertising channels should be chosen by a budget-constrained electoral campaign. In this poster, we apply the resulting algorithm to the MIT Social Evolution [2] dataset (N=84), which captured political discussions, inclinations, and voting behaviors around the 2008 US Presidential Election within a student dorm. We compare the centrality metrics derived from our algorithm (which map directly to optimal channel choice decisions) against more traditional static centralities, and show how employing them leads to more votes.

[1] Eshghi, S., Preciado, V.M., Sarkar, S., Venkatesh, S.S., Zhao, Q., D'Souza, R. and Swami, A., 2017. Spread, then Target, and Advertise in Waves: Optimal Capital Allocation Across Advertising Channels. arXiv preprint arXiv:1702.03432.

[2] A. Madan, M. Cebrian, S. Moturu, K. Farrahi, A. Pentland, Sensing the 'Health State' of a Community, Pervasive Computing, Vol. 11, No. 4, pp. 36-45, Oct. 2012.

Open access to data at Yale University

Harlan Krumholz, Yale University
Limor Peer, Yale University / ISPS
Jessica Ritchie, Yale University
Joseph Ross, Yale University

Open access to research data increases knowledge, advances science, and benefits society. Many researchers are now required to share data. Two research centers at Yale have launched projects that support this mission. Both centers have developed technology, policies, and workflows to facilitate open access to data in their respective fields. The Yale University Open Data Access (YODA) Project at the Center for Outcomes Research and Evaluation advocates for the responsible sharing of clinical research data. The Project, which began in 2014, is committed to open science and data transparency, and supports research attempting to produce concrete benefits to patients, the medical community, and society as a whole. Early experience sharing data, made available by Johnson & Johnson (J&J) through the YODA Project, has demonstrated a demand for shared clinical research data as a resource for investigators. To date, the YODA Project has facilitated the sharing of data for over 65 research projects. The Institution for Social and Policy Studies (ISPS) Data Archive is a digital repository that shares and preserves the research produced by scholars affiliated with ISPS. Since its launch in 2011, the Archive has come to hold data and code underlying almost 90 studies. The Archive is committed to the ideals of scientific reproducibility and transparency: it provides free and public access to research materials and accepts content for distribution under a Creative Commons license. The Archive has pioneered a workflow, “curating for reproducibility,” that ensures long-term usability and data quality.

Rescue of neocortical circuit deficits with modified bone marrow-derived mesenchymal stem cells, SB623, in a rat model of photothrombotic stroke

Alexander Urry, Yale University

The following poster characterizes the effects of a novel stem cell line on treating the neural circuit deficits resulting from stroke.

Support at Yale for data throughout the research lifecycle

Linda Coleman, Yale University
Themba Flowers, Yale University
Louis King, Yale University
Jill Parchuck, Yale University
Limor Peer, Yale University / ISPS
Alice Tangredi-Hannon, Yale University

Yale’s research data policy sets expectations of Yale and its researchers and articulates shared principles and rules that guide the use, development, and protection of research data. Yet managing research data is a complex challenge for researchers. It requires investment of time and resources from the entry of data into the research cycle through the dissemination and archiving essential to reliable verification of results and reuse of data in new research. The diversity of disciplinary practices further complicates the activity. Additionally, researchers seeking institutional support must navigate the myriad of support services, uncertain as to whether they will find a resource to meet their needs. The Research Data Support Initiatives Group (RDSIG) – a multi-departmental steering group whose mission is to develop scalable, sustainable, and domain-appropriate data services and policies in support of research at Yale – has been working on developing a coherent program around support for research data management to optimize efforts by various support units and best serve Yale researchers.

Transcriptomics to Develop Biochemical Network Models in Cyanobacteria

Bridget E. Hegarty, Yale University
Jordan Peccia, Yale University
Ratanachat Racharaks, Yale University

Through targeted genetic manipulations guided by network modeling, we will create a flexible, cyanobacteria-based platform for the production of biofuel precursors and valuable chemical products. To build gene-metabolite predictive models, we have characterized the gene expression and metabolite production of Synechococcus elongatus sp. UTEX 2973 (henceforth, UTEX 2973) under a number of environmental conditions.