Date of Award
Spring 1-1-2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computational Biology and Bioinformatics
First Advisor
Gerstein, Mark
Abstract
Biomedical research increasingly relies on vast datasets, such as genomic sequences, electronic health records, and histology images, to advance scientific knowledge and clinical applications. However, the sensitive nature of biomedical data poses significant privacy risks, particularly with the emerging capabilities of artificial intelligence and machine learning technologies. This dissertation addresses two central challenges in biomedical data privacy and security: quantifying privacy risks associated with histology images and developing innovative blockchain-based methods to ensure secure data storage and sharing.
The first part of this work investigates the vulnerability of histology images, commonly perceived as safe for public sharing, to privacy breaches. Utilizing advanced computational methodologies, specifically convolutional neural networks (CNNs) and conditional variational autoencoders (CVAEs), we demonstrate the potential to predict gene expression levels and infer individual genotypes from digital pathology images. Through a rigorous pipeline leveraging expression quantitative trait loci (eQTLs), our analysis reveals that histology images indeed carry sufficient biological information to facilitate moderate success rates (approximately 42%) in re-identification linkage attacks. These findings challenge prevailing assumptions of privacy safety in publicly available biomedical images, underscoring the necessity for reevaluating data sharing policies and privacy protections in biomedical informatics.
In response to these vulnerabilities, the second part of the dissertation explores blockchain technology as a potential solution for enhancing biomedical data privacy and security. Specifically, we present an Ethereum-based smart contract framework optimized for storing, querying, and retrieving biomedical training certificates entirely on-chain. By employing assembly-level code optimizations within the Solidity programming environment, our implementation significantly reduces gas costs, storage overhead, and query execution time compared to conventional blockchain approaches. The framework achieves efficient handling of large data files (up to gigabytes in size), demonstrating feasibility and scalability of fully on-chain biomedical data management.
Building upon this foundation, we propose enhancements to SAMchain, a blockchain framework specifically tailored for genomic data management, which initially encountered scalability challenges when implemented on MultiChain. Transitioning SAMchain to Ethereum and applying our previously demonstrated assembly-level optimizations markedly improves data insertion rates, query speeds, and overall storage efficiency. Additionally, we suggest a hybrid on-chain/off-chain data storage strategy leveraging previously developed privacy-preserving computational frameworks to further enhance scalability without compromising security or data integrity.
Together, these contributions offer a comprehensive examination of privacy vulnerabilities in biomedical data sharing, particularly histology images, and present practical, robust blockchain-based solutions to mitigate identified risks. Our findings highlight the critical importance of integrating advanced computational methods and innovative blockchain technology to protect sensitive biomedical information. This work not only advances scientific understanding of biomedical privacy challenges but also provides actionable tools and strategies for secure, privacy-conscious biomedical data management in research and clinical settings.
Recommended Citation
Ni, Eric, "Biomedical Privacy and Security: Histology Image Vulnerabilities and Blockchain Solutions" (2025). Yale Graduate School of Arts and Sciences Dissertations. 1610.
https://elischolar.library.yale.edu/gsas_dissertations/1610