The Day of Data is an annual, university-wide event that features speakers from a number of disciplines discussing how they use data in their work. This year, the December 1 Day of Data will address Data + Society.
Yale Day of Data 2017
CSSSI 24/7 Space, 4:00 – 6:00 PM
November 30, 2017
1. Experimental Design as Market Design: Billions of Dollars’ Worth of Treatment Assignments
Randomized controlled trials (RCTs) enroll hundreds of millions of people and involve many human lives. In this paper, I propose an RCT design for high-stakes treatments. Unlike a conventional RCT, my design respects subject welfare; it optimally and randomly assigns each treatment to subjects predicted to experience better treatment effects, or to subjects with stronger preferences for the treatment. For preference elicitation, my design is also almost incentive compatible. Finally, the design yields unbiased estimates of any causal effect that a standard RCT can estimate. To quantify these properties, I apply my proposal to a water cleaning experiment in Kenya (Kremer et al., 2011). Compared with a standard RCT, my design substantially improves subjects’ well-being while reaching similar treatment effect estimates with similar precision.
2. Disability Prior to Death Among the Oldest-Old in China: A Large Population-Based Study
Zuyun Liu, Ling Han, Xiaofeng Wang, Qiushi Feng, Thomas Gill
Yale University, Fudan University, National University of Singapore
The oldest-old (≥80 years) are the fastest-growing segment of the population worldwide, including in China. A particularly important question for this aging population is whether and to what extent they can maintain independence in essential activities of daily living (ADL). To date, several studies in developed countries have demonstrated that most older adults face a period of disability in late life, lasting until death. However, it is unknown whether these findings apply to low-income or middle-income countries like China, a developing country with a rapidly growing aging population. Therefore, we carried out a study to evaluate the prevalence of disability prior to death among the oldest-old in China using unique big data from the Chinese Longitudinal Healthy Longevity Survey (CLHLS), one of the largest samples of oldest-old in the world.
3. A Sooting Tendency Database for Accelerating the Introduction of Biomass-Derived Fuels
Charles McEnally, Lance K. Tan, Dhrubajyoti D. Das, Lisa D. Pfefferle
One of the many potential benefits of biomass-derived fuels is lower emissions of particulate matter. To enable the selection of biofuels that maximize this benefit, we have been building a database of a property, the yield sooting index, that characterizes the tendency to produce particulates, by measuring it for hundreds of pure hydrocarbons. We have encountered many data management issues while trying to maximize the usefulness of this database. Our research field does not have a general data repository, so we are currently posting the data to the Harvard Dataverse. We have direct collaborators at the National Renewable Energy Laboratory who are using machine learning techniques to develop empirical models that fit the data and can make predictions for new compounds. Additional collaborators at Penn State University are modeling the data from first principles.
4. Rescue of Neocortical Circuit Deficits with Modified Bone Marrow-Derived Mesenchymal Stem Cells, SB623, in a Rat Model of Photothrombotic Stroke
Alexander Urry, Zuha Warraich, Anna Sato, Elli Moradi, Damien Bates, Yaisa Andrews-Zwilling, Jeanne T. Paz
Yale University, The Gladstone Institutes, and SanBio, Inc.
Cortical injuries, such as stroke, are a major cause of long-term disability. Currently there are no pharmacological therapies on the market that are able to treat chronic damage from stroke. This chronic damage can leave individuals with paraplegia that persists for years after the initial injury. Recently, various types of stem cells have emerged as a potential treatment for chronic damage following stroke in human patients. SanBio, Inc. recently completed a Phase 1/2a clinical trial with intracranial implantation of modified bone marrow-derived mesenchymal stem cells, SB623, which appeared to be safe and were associated with improvements in clinical outcome end points at 12 months. However, the mechanisms by which these cells reduce neurological deficits remain unknown. In this preclinical study, we used SB623 to determine the effects of the cells on neural circuit dynamics in the chronic phase after stroke. We investigated the efficacy of SB623 cells transplanted intracranially into rats 28 days after induction of a photothrombotic (PT) stroke in the right somatosensory S1 neocortex.
5. Transcriptomics to Develop Biochemical Network Models in Cyanobacteria
Bridget E. Hegarty, Jordan Peccia, Ratanachat Racharaks
Through targeted genetic manipulations guided by network modeling, we will create a flexible, cyanobacteria-based platform for the production of biofuel precursors and valuable chemical products. To build gene-metabolite predictive models, we have characterized gene expression and metabolite production in Synechococcus elongatus sp. UTEX 2973 (henceforth, UTEX 2973) under a number of environmental conditions.
6. Support at Yale for Data Throughout the Research Lifecycle
Linda Coleman, Themba Flowers, Louis King, Jill Parchuck, Limor Peer, Alice Tangredi-Hannon
Yale’s research data policy sets expectations of Yale and its researchers and articulates shared principles and rules that guide the use, development, and protection of research data. Yet managing research data is a complex challenge for researchers. It requires investment of time and resources from the entry of data into the research cycle through the dissemination and archiving essential to reliable verification of results and reuse of data in new research. The diversity of disciplinary practices further complicates the activity. Additionally, researchers seeking institutional support must navigate the myriad of support services, uncertain as to whether they will find a resource to meet their needs. The Research Data Support Initiatives Group (RDSIG) – a multi-departmental steering group whose mission is to develop scalable, sustainable, and domain-appropriate data services and policies in support of research at Yale – has been working on developing a coherent program around support for research data management to optimize efforts by various support units and best serve Yale researchers.
7. Open Access to Data at Yale University
Harlan Krumholz, Limor Peer, Jessica Ritchie, Joseph Ross
Open access to research data increases knowledge, advances science, and benefits society. Many researchers are now required to share data. Two research centers at Yale have launched projects that support this mission. Both centers have developed technology, policies, and workflows to facilitate open access to data in their respective fields. The Yale University Open Data Access (YODA) Project at the Center for Outcomes Research and Evaluation advocates for the responsible sharing of clinical research data. The Project, which began in 2014, is committed to open science and data transparency, and supports research attempting to produce concrete benefits to patients, the medical community, and society as a whole. Early experience sharing data, made available by Johnson & Johnson (J&J) through the YODA Project, has demonstrated a demand for shared clinical research data as a resource for investigators. To date, the YODA Project has facilitated the sharing of data for over 65 research projects. The Institution for Social and Policy Studies (ISPS) Data Archive is a digital repository that shares and preserves the research produced by scholars affiliated with ISPS. Launched in 2011, the Archive now holds data and code underlying almost 90 studies. The Archive is committed to the ideals of scientific reproducibility and transparency: it provides free and public access to research materials and accepts content for distribution under a Creative Commons license. The Archive has pioneered a workflow, “curating for reproducibility,” that ensures long-term usability and data quality.
8. ASD Biomarker Detection on fMRI Images: Feature learning with Data Corruptions by Analyzing Deep Neural Network Classifier Outcomes
Xiaoxiao Li, Nicha C. Dvornek, Pamela Ventola, James Duncan
Autism spectrum disorder (ASD) is a complex neurological and developmental disorder. It emerges early in life and is generally associated with lifelong disability. Finding the biomarkers associated with ASD is extremely helpful for understanding the underlying roots of the disorder and finding more targeted treatments. Previous studies have suggested that brain activation is abnormal in individuals with ASD; hence, functional magnetic resonance imaging (fMRI) has been used to identify ASD. In this work we addressed the problem of interpreting reliable biomarkers for classifying ASD vs. control with a 2-step pipeline: 1) classifying ASD and control fMRI images with a deep neural network, and 2) finding which brain regions are important for distinguishing ASD from control. Specifically, in step 2, we used the trained classifier to estimate feature importance by measuring the change in the prediction distribution when a region of the input image is corrupted. However, there is no definitive way to corrupt the data without introducing side effects. Thus, we aggregated two "opposite" corruption methods: a) blacking out the region and b) adding Gaussian noise. Biomarkers found by the 2-step pipeline were verified by Neurosynth brain function decoding. Key innovations in our research include: i) an innovative pipeline for learning image data features by analyzing classifier outcomes under corruption; ii) a deep learning strategy for classifying 4D data; iii) the aggregation of different corruption methods for feature importance analysis; and iv) a neurological interpretation of the final results, which showed evidence of meaningful fMRI biomarkers for ASD.
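The corruption-based importance estimate in step 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the toy classifier, the 2D image, the region definitions, the use of absolute change in predicted probability as the "prediction change," and the simple averaging of the two corruption scores are all assumptions made for illustration.

```python
import numpy as np

def toy_classifier(image):
    """Hypothetical stand-in for a trained network: returns P(ASD) for one image.
    By construction, only the patch image[2:4, 2:4] influences the output."""
    score = image[2:4, 2:4].mean()
    return 1.0 / (1.0 + np.exp(-score))

def corrupt(image, region, mode, rng, sigma=1.0):
    """Return a copy of `image` with `region` blacked out or perturbed by Gaussian noise."""
    out = image.copy()
    if mode == "blackout":
        out[region] = 0.0
    elif mode == "noise":
        out[region] = out[region] + rng.normal(0.0, sigma, size=out[region].shape)
    else:
        raise ValueError(f"unknown corruption mode: {mode}")
    return out

def region_importance(image, regions, predict, n_noise=20, seed=0):
    """Score each region by how much corrupting it changes the prediction.
    The blackout and noise scores are averaged (an assumed aggregation rule)."""
    rng = np.random.default_rng(seed)
    base = predict(image)
    scores = {}
    for name, region in regions.items():
        d_black = abs(predict(corrupt(image, region, "blackout", rng)) - base)
        d_noise = np.mean([
            abs(predict(corrupt(image, region, "noise", rng)) - base)
            for _ in range(n_noise)
        ])
        scores[name] = 0.5 * (d_black + d_noise)
    return scores

# Toy usage: one region the classifier depends on, one it ignores.
image = np.full((6, 6), 2.0)
regions = {
    "informative": (slice(2, 4), slice(2, 4)),
    "irrelevant": (slice(0, 2), slice(0, 2)),
}
scores = region_importance(image, regions, toy_classifier)
print(scores)
```

In this toy setting the region the classifier ignores receives a score of zero, while the region it depends on receives a positive score. Real fMRI inputs are 4D and the trained network would replace `toy_classifier`.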
9. Efficient Dynamic Centrality Metrics for Election Advertising: A Case Study
In prior work, we showed how the optimal choice of advertising channels for a budget-constrained electoral campaign can be structured. In this poster, we apply the resulting algorithm to the MIT Social Evolution dataset (N=84), which captured political discussions, inclinations, and voting behaviors around the 2008 US Presidential Election within a student dorm. We compare the centrality metrics produced by our algorithm (which map directly to optimal channel choice decisions) against more traditional static centralities.
10. Selecting Different Susceptibility Phenotypes Associated with Aedes aegypti Vector Competence to Dengue Virus
Leticia T. E. Oda, Andrea Gloria-Soria, Jayme A. Souza-Neto, Jeffrey R. Powell
Yale University and UNESP
Dengue is the most relevant arboviral disease transmitted by Aedes aegypti in the world. In the present study we established a laboratory colony from Aedes aegypti eggs collected in Botucatu city, Brazil (Aaeg_BTU strain) and investigated several aspects associated with its vector competence for DENV. We report differential vector competence for DENV2 and DENV4 and successful selection steps toward obtaining R (resistant) and D (dissemination) isogenic mosquitoes.
11. Coffee Drinking and Leukocyte Telomere Length: A Meta-Analysis
Telomeres are long tandem nucleotide repeats responsible for maintaining chromosomal integrity. They shorten with each cell division, serving as markers for cellular aging and replicative ability, and shorter telomere length has been associated with greater risk of various chronic diseases of aging. There is increasing interest in the relationship between telomere length and lifestyle factors, such as components of the diet, that are associated with age-related chronic diseases, like cancer and diabetes.
There is mounting evidence that coffee, one of the most commonly consumed beverages in the world, has potential protective effects against chronic disease and mortality. Few studies have evaluated telomere length and coffee consumption specifically. A 2016 study in the Nurses’ Health Study [1] found a significant association between coffee intake and longer telomeres. A study conducted within the NHANES cohort [2] replicated these results, also showing a statistically significant positive association between coffee consumption and telomere length. This study builds on that work by exploring the association in another population. We consider the cross-sectional association between coffee consumption and leukocyte telomere length among controls from 6 case-control studies nested within a large population-based cohort of U.S. adults with detailed data on dietary intake and other lifestyle factors.
1. Liu JJ, Crous-Bou M, Giovannucci E, De Vivo I. Coffee Consumption is Positively Associated with Longer Leukocyte Telomere Length in the Nurses’ Health Study. J Nutr. 2016.
2. Tucker LA. Caffeine consumption and telomere length in men and women of the National Health and Nutrition Examination Survey (NHANES). Nutrition & Metabolism. 2017; 14:10.