Date of Award

January 2023

Document Type

Thesis

Degree Name

Medical Doctor (MD)

Department

Medicine

First Advisor

Brian P. Hafler

Abstract

Gene expression matrices commonly used in single-cell transcriptomics cannot be directly analyzed with tools developed for natural languages. Restructuring these matrices as abundance-ordered sequences of genes allows the generation of cell sentences: rank-normalized, positionally encoded, sequence-structured expression data. The rank-normalization procedure also minimizes batch effects from differential sequencing depth, and comparison against other tools for batch integration shows that cell sentences achieve comparable performance in batch effect removal and biological effect preservation. After transformation, cell sentences can be analyzed using any existing tool from natural language processing that takes text as input, enabling a host of new ways to process and understand single-cell transcriptomics data. As an example, a machine translation approach is applied to cells from neural retina, unifying cell and gene representations across species. Finally, this approach can also be used to transform a number of other data modalities, including imaging data, to sequential formats. Testing of neural network architectures pretrained on language tasks against those specialized for vision tasks shows that in low-data scenarios, language models can outperform vision models in image classification tasks. These findings suggest homology in the underlying structure of natural language and natural images, which may be of particular interest in machine learning for medical imaging, where small datasets are the norm.
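
As a rough illustration of the abundance-ordering step described in the abstract, the following is a minimal sketch, not the thesis implementation. The function name `cell_sentences`, the toy count matrix, the choice to drop unexpressed genes, and the tie-breaking rule are all assumptions made for this example.

```python
import numpy as np

def cell_sentences(counts, gene_names):
    """Turn a cells x genes count matrix into abundance-ordered gene lists.

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    sentences = []
    for row in counts:
        # Stable sort on negated counts: most abundant gene first,
        # ties keep the original gene order (an assumed convention).
        order = np.argsort(-row, kind="stable")
        # Dropping genes with zero counts is an assumption of this sketch.
        sentences.append([gene_names[i] for i in order if row[i] > 0])
    return sentences

# Toy data: the second cell has exactly twice the counts of the first,
# mimicking the same cell profiled at a greater sequencing depth.
genes = ["RHO", "GNAT1", "OPN1SW", "GAPDH"]
counts = np.array([[50, 20, 0, 5],
                   [100, 40, 0, 10]])
sentences = cell_sentences(counts, genes)
```

In this toy case both cells yield the same gene sequence, because only the rank of each gene is retained and uniform scaling of counts does not change ranks. Each resulting list can be joined with spaces to form text input for a language-processing tool.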

Comments

This thesis is restricted to Yale network users only. This thesis is permanently embargoed from public release.
