Date of Award

Spring 1-1-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Statistics and Data Science

First Advisor

Sekhon, Jasjeet

Abstract

Data is the main ingredient of any machine learning model, and synthetic data generation has become essential for addressing data scarcity, privacy concerns, and generalization problems when applying machine learning across domains. If synthetic data is too similar to real data, the risk of privacy breaches rises significantly; if it is too different, its practical applicability is undermined. This dissertation proposes two approaches to enhance synthetic data generation. Chapter 2 develops SC-GOAT, a framework that integrates a supervised component tailored to the specific downstream task and employs a meta-learning approach to learn the optimal mixture distribution of existing synthetic distributions, making synthetic data more useful for specific real-world applications. Chapter 3 introduces Private Synthetic Hypercube Augmentation (PriSHA), a framework designed to enhance existing clinical models. It uses generative models to produce synthetic data as a means to augment these models while adhering to strict privacy standards, an approach with the potential to improve model performance without compromising patient confidentiality. To our knowledge, it is the first synthetic data augmentation framework that merges privacy-preserving tabular data and real data from multiple sources.

Causal inference is central to distinguishing causation from correlation and thus to facilitating informed decision-making in many fields, from economics to epidemiology and artificial intelligence. This dissertation makes two contributions to the literature on causal inference. Chapter 4 introduces CLOUD-CG, a clustering method for longitudinal data that uses temporal directed acyclic graphs (T-DAGs) to identify clusters with similar causal structures. While preserving individual-level heterogeneity, CLOUD-CG provides interpretable insights into time-dependent causal representations for evaluating financial stability in emerging economies. Chapter 5 introduces a causal machine learning model to sports analytics through its application to age-curve modeling. It presents the Age-Conditioned Treatment Effect (ACTE) to investigate the causal impact of interventions such as rest days on athletes' performance at different stages of their careers. Using ACTE in a meta-learning framework, this work provides a load management strategy based on granular game-level data in professional sports.

This dissertation advances the data science pipeline by developing synthetic data generation methods that improve data quality and availability and causal inference frameworks that learn causal relationships. Addressing the shortcomings in both domains reinforces the reliability of decision-making in diverse fields.

Recommended Citation

Nakamura Sakai, Shinpei, "Advances in Synthetic Data Generation and Causal Inference" (2025). Yale Graduate School of Arts and Sciences Dissertations. 1740.
https://elischolar.library.yale.edu/gsas_dissertations/1740
