"The Elucidation of Chromatin Structural Properties Through Feature Det" by Luka Maisuradze

The Elucidation of Chromatin Structural Properties Through Feature Detection in Hi-C Maps

Date of Award

Spring 2024

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Molecular Biophysics and Biochemistry

First Advisor

O'Hern, Corey

Abstract

Chromatin is a complex consisting of DNA and associated proteins that must undergo significant compaction (thousands-fold) to fit into the cell nucleus. As a result of this required compression, chromatin forms a hierarchy of dynamic structures across different length scales, including nucleosomes, loops, topologically associating domains, A/B compartments, and finally chromosomes. While the primary structure of chromatin, the 10-nm "beads on a string" fiber consisting of nucleosomes, is relatively well understood, higher-order chromatin structures, and the mechanisms by which they form, are still poorly understood. Recently developed high-throughput chromatin conformation capture techniques, in particular Hi-C, capture genome-wide spatial interactions for chromatin. Hi-C experiments generate contact maps, represented as a symmetric matrix A, where A_ij represents the average contact probability or number of contacts between chromatin loci i and j. Hi-C maps can provide insight into the 3D structure of chromatin across several resolutions. In this dissertation we seek to make progress on two outstanding computational problems that arise from Hi-C map analysis: identifying topologically associating domains in Hi-C maps, and classifying a set of single-cell Hi-C maps by their cell type. In chapter 1, we present a novel algorithm, denoted KerTAD, to identify topologically associating domains (TADs) in Hi-C maps. TADs are genomic regions that preferentially interact within themselves, forming square contiguous regions on the main diagonal of A. Although many algorithms have been developed to identify TADs in Hi-C maps, most are unable to identify overlapping or nested TADs, and there is still significant variability in the location and number of TADs identified by different methods for any given Hi-C map.
KerTAD uses a set of kernel-based techniques to accurately identify nested and overlapping TADs. We benchmark KerTAD against several state-of-the-art TAD identification algorithms on synthetic and experimental data sets. We find that KerTAD outperforms all other tested methods, consistently achieving higher true positive rates (TPR) and lower false discovery rates (FDR) for both synthetic and manually annotated experimental Hi-C maps. We find that KerTAD is more robust to increasing noise and sparsity than the other methods. Finally, we find that KerTAD is more consistent than other methods in the number and sizes of TADs identified across replicate experimental Hi-C maps from several different organisms. In chapter 2, we present a novel algorithm to classify single-cell Hi-C maps based on their cell type, inspired by the graph-theoretic version of quantum relative entropy. Unlike bulk Hi-C experiments, where one experiment processes and averages the spatial interactions of millions of cells, single-cell Hi-C experiments generate contact maps at the individual cell level, capturing structural features that would otherwise be lost across a heterogeneous cell population. Single-cell Hi-C techniques are often used to quantify cell-to-cell variation in a cell population. Unfortunately, single-cell Hi-C maps are highly sparse, have inconsistent coverage, and have poor resolution, making their analysis challenging. Several algorithms have been developed that use feature extraction techniques to partition a set of single-cell Hi-C maps based on the underlying cell type of each individual map. We develop a novel algorithm to classify single-cell Hi-C maps based on cell type and benchmark the algorithm across four commonly used single-cell Hi-C datasets.
We find that on all metrics tested (adjusted Rand index, normalized mutual information, and Fowlkes-Mallows index), our algorithm outperforms the state of the art across all datasets. We also find that our algorithm is able to perform accurate classification using only a fraction of the data, reliably classifying single-cell datasets with only one intrachromosomal matrix for each cell. Finally, we find that our algorithm is robust to the choice of dimensionality reduction scheme and even performs well without any dimensionality reduction at all.
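
For readers unfamiliar with the clustering metrics named in the abstract, the following is a minimal sketch of how the adjusted Rand index and the Fowlkes-Mallows index can be computed from pair counts. It is not the dissertation's code: the function names and the toy labels are assumptions made purely for illustration, and in practice a library implementation would normally be used instead.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(true_labels, pred_labels):
    """Count sample pairs grouped together by both, by truth, and by prediction."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    joint = Counter(zip(true_labels, pred_labels))
    same_both = sum(comb(c, 2) for c in joint.values())
    same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    total = comb(len(true_labels), 2)
    return same_both, same_true, same_pred, total


def adjusted_rand_index(true_labels, pred_labels):
    """Rand index corrected for chance agreement; 1.0 means identical partitions."""
    same_both, same_true, same_pred, total = pair_counts(true_labels, pred_labels)
    if total == 0:
        return 1.0  # fewer than two samples: partitions trivially agree
    expected = same_true * same_pred / total
    max_index = 0.5 * (same_true + same_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every sample in one cluster in both
    return (same_both - expected) / (max_index - expected)


def fowlkes_mallows_index(true_labels, pred_labels):
    """Geometric mean of pairwise precision and pairwise recall."""
    same_both, same_true, same_pred, _ = pair_counts(true_labels, pred_labels)
    if same_true == 0 or same_pred == 0:
        return 0.0
    return same_both / sqrt(same_true * same_pred)


# Toy example (hypothetical labels, not data from the dissertation):
# the predicted clusters match the cell types up to a renaming of cluster ids.
cell_types = ["typeA", "typeA", "typeB", "typeB"]
clusters = [1, 1, 0, 0]
ari = adjusted_rand_index(cell_types, clusters)
fmi = fowlkes_mallows_index(cell_types, clusters)
```

Both metrics depend only on which pairs of cells are grouped together, so they are unaffected by how the cluster ids are named.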

This document is currently not available here.
