Date of Award
Medical Doctor (MD)
Brian P. Hafler
Gene expression matrices commonly used in single-cell transcriptomics cannot be directly analyzed with tools developed for natural languages. Restructuring these matrices as abundance-ordered sequences of genes allows the generation of cell sentences: rank-normalized, positionally encoded expression data in sequence form. The rank-normalization procedure also minimizes batch effects from differential sequencing depth, and comparison against other tools for batch integration shows that cell sentences achieve comparable performance in batch effect removal and biological effect preservation. After transformation, cell sentences can be analyzed using any existing natural language processing tool that takes text as input, enabling a host of new ways to process and understand single-cell transcriptomics data. As an example, a machine translation approach is applied to cells from neural retina, unifying cell and gene representations across species. Finally, this approach can also be used to transform other data modalities, including imaging data, to sequential formats. Testing of neural network architectures pretrained on language tasks against those specialized for vision tasks shows that in low-data scenarios, language models can outperform vision models in image classification tasks. These findings suggest homology in the underlying structure of natural language and natural images, which may be of particular interest in machine learning for medical imaging, where small datasets are the norm.
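The abundance-ordered restructuring described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `cell_sentences`, the placeholder gene names, the toy counts, the choice to drop unexpressed genes, and the tie-breaking rule are all assumptions made for illustration.

```python
import numpy as np

def cell_sentences(counts, gene_names):
    """Turn a cells x genes count matrix into one 'sentence' per cell.

    Each sentence lists gene names from most to least abundant.
    Assumptions (not taken from the thesis): genes with zero counts are
    dropped, and ties keep the original gene order (stable sort).
    """
    counts = np.asarray(counts)
    sentences = []
    for row in counts:
        # Sort gene indices by descending abundance; stable sort keeps tie order.
        order = np.argsort(-row, kind="stable")
        expressed = [gene_names[i] for i in order if row[i] > 0]
        sentences.append(" ".join(expressed))
    return sentences

# Toy data: placeholder gene names and made-up counts.
genes = ["geneA", "geneB", "geneC", "geneD"]
counts = np.array([
    [5, 0, 2, 9],     # cell 1
    [50, 0, 20, 90],  # cell 2: same profile as cell 1 at 10x sequencing depth
    [0, 3, 1, 0],     # cell 3
])

sentences = cell_sentences(counts, genes)
for s in sentences:
    print(s)
```

In this toy input, cells 1 and 2 differ only by a constant depth factor and therefore yield the same sentence, which is the sense in which a rank-based encoding is insensitive to sequencing depth. The resulting strings can be passed to any tool that accepts text.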
Dhodapkar, Rahul M., "Representing Cells As Sentences Enables Natural Language Processing For Single Cell Transcriptomics" (2023). Yale Medicine Thesis Digital Library. 4175.