"Computational Methods for Single-Cell Data: From Decomposition, Integr" by Yuge Wang

Date of Award

Fall 2023

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Public Health

First Advisor

Zhao, Hongyu

Abstract

Over the past decade, significant advances in high-throughput single-cell sequencing technologies have revolutionized our ability to investigate cellular heterogeneity across multiple molecular levels, including gene expression measured by scRNA-seq and chromatin accessibility measured by scATAC-seq. Unlike traditional bulk sequencing approaches, single-cell sequencing techniques enable the characterization of diverse cell types or states within a sample, shedding light on complex biological processes. However, the analysis of single-cell data presents several challenges, including high dimensionality, large dataset sizes, high sparsity, and batch effects arising both within and across modalities. Failure to address these challenges can significantly impair downstream applications, such as cell annotation and the identification of gene expression programs. Consequently, computational tools that tackle these challenges and facilitate the extraction of meaningful insights from single-cell data have become increasingly vital. This dissertation advances the analysis of single-cell omics data by addressing challenges and proposing future directions in three areas: discovering gene expression programs, integrating scRNA-seq datasets, and annotating cell types in scATAC-seq data. The first two chapters center on scRNA-seq data, introducing two novel computational tools for data decomposition and data integration, respectively. The third chapter presents a comprehensive benchmarking study of label transfer from scRNA-seq data to scATAC-seq data.

The first chapter introduces scAAnet, an autoencoder-based method for single-cell non-linear archetypal analysis that identifies gene expression programs (GEPs) and their relative activity across cell types. In simulations and analyses of publicly available datasets, scAAnet demonstrates improved performance in extracting biologically meaningful GEPs compared to existing methods. The second chapter addresses the challenge that batch effects pose for integrating multiple scRNA-seq datasets. ResPAN, a light-structured deep learning framework, is proposed to reduce differences among batches and enable effective integration. Extensive benchmarking studies on simulated and real datasets show that ResPAN outperforms seven other methods, correcting batch effects while preserving biological information. The third chapter focuses on cell type annotation in scATAC-seq data by label transfer from scRNA-seq datasets. A comprehensive benchmarking study evaluates 27 computational tools across diverse human and mouse tissues. The performance evaluation identifies the top performers under different scenarios and highlights the impact of data quality, algorithmic efficiency, and methodological considerations. Suggestions for future methodology development are also provided. Collectively, this dissertation expands the methodological landscape of single-cell omics data analysis and contributes to the field's ongoing progress.
