Date of Award

Spring 1-1-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Statistics and Data Science

First Advisor

Sekhon, Jasjeet

Abstract

Data is the main ingredient of any machine learning model, and synthetic data generation has become essential for addressing data scarcity, privacy concerns, and generalization problems when applying machine learning in various domains. If synthetic data is too similar to real data, the risk of privacy breaches rises; if it is too different, its practical applicability is undermined. This dissertation proposes two approaches to enhance synthetic data generation. Chapter 2 develops SC-GOAT, a framework that integrates a supervised component tailored to the specific downstream task and employs a meta-learning approach to learn the optimal mixture distribution of existing synthetic distributions. Synthetic data are thereby made more useful for specific real-world applications. Chapter 3 introduces Private Synthetic Hypercube Augmentation (PriSHA), a framework designed to enhance existing clinical models. We use generative models to produce synthetic data that augment these models while adhering to strict privacy standards. This approach has the potential to improve model performance without compromising patient confidentiality. To our knowledge, it is the first synthetic data augmentation framework that merges privacy-preserving tabular data and real data from multiple sources. Causal inference is central to distinguishing causation from correlation, and thus to informed decision-making in many fields, from economics to epidemiology and artificial intelligence. This dissertation makes two contributions to the literature on causal inference. Chapter 4 introduces CLOUD-CG, a clustering method for longitudinal data that uses temporal directed acyclic graphs (T-DAGs) to identify clusters with similar causal structures. While preserving individual-level heterogeneity, CLOUD-CG provides interpretable insights into time-dependent causal representations for evaluating financial stability in emerging economies.
Chapter 5 introduces causal machine learning to sports analytics through an application to age-curve modeling. The Age-Conditioned Treatment Effect (ACTE) is presented to investigate the causal impact of interventions such as rest days on the performance of athletes at different stages of their careers. Using ACTE in a meta-learning framework, this work provides a load management strategy based on granular game-level data in professional sports. This dissertation advances the data science pipeline by developing synthetic data generation methods that improve data quality and availability, and causal inference frameworks that learn causal relationships. Addressing the shortcomings in both domains strengthens the reliability of decision-making in diverse fields.
