Date of Award

January 2023

Document Type

Thesis

Degree Name

Medical Doctor (MD)

Department

Medicine

First Advisor

Brian P. Hafler

Abstract

Gene expression matrices commonly used in single-cell transcriptomics cannot be directly analyzed with tools developed for natural languages. Restructuring these matrices as abundance-ordered sequences of genes allows the generation of cell sentences: rank-normalized, positionally encoded sequence-structured expression data. The rank-normalization procedure also minimizes batch effects from differential sequencing depth, and comparison of cell sentences against other tools for batch integration shows that cell sentences achieve comparable performance in batch effect removal and biological effect preservation. After transformation, cell sentences can be analyzed using any existing tools from natural language processing that take text as input, enabling a host of new ways to process and understand single-cell transcriptomics data. As an example, a machine translation approach is applied to cells from neural retina, unifying cell and gene representations across species. Finally, this approach can also be used to transform a number of other data modalities to sequential formats, including imaging data. Testing of neural network architectures pretrained on language tasks against those specialized for vision tasks shows that in low-data scenarios, language models can outperform vision models in image classification tasks. These findings suggest homology in the underlying structure of natural language and natural images which may be of particular interest in machine learning for medical imaging, where small datasets are the norm.
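As an illustration of the transformation the abstract describes, the following is a minimal sketch of turning a row of a gene expression matrix into an abundance-ordered cell sentence. It is not the thesis implementation: the function name `cell_sentence`, the `max_len` parameter, the choice to drop unexpressed genes, the tie-breaking rule, and the toy counts are all assumptions made for this example.

```python
import numpy as np

def cell_sentence(expression, gene_names, max_len=None):
    """Return gene names ordered from most to least abundant (hypothetical helper).

    Assumptions for this sketch: genes with zero counts are omitted, and ties
    are broken by the original column order of the matrix.
    """
    expression = np.asarray(expression, dtype=float)
    # Stable sort on negated values gives descending abundance with
    # ties kept in their original gene order.
    order = np.argsort(-expression, kind="stable")
    # Keep only genes that were actually detected in this cell.
    order = order[expression[order] > 0]
    if max_len is not None:
        order = order[:max_len]
    return " ".join(gene_names[i] for i in order)

# Toy data: two cells with the same relative expression, the second
# "sequenced" at twice the depth of the first.
genes = ["RHO", "GNAT1", "OPN1SW", "ARR3"]
counts = np.array([
    [120, 40, 0, 3],
    [240, 80, 0, 6],
])

sentences = [cell_sentence(row, genes) for row in counts]
for s in sentences:
    print(s)
```

Because only the rank order of genes is kept, both toy cells produce the same sentence even though their raw counts differ by a factor of two, which is the intuition behind the abstract's claim that rank normalization reduces sequencing-depth effects. The resulting strings can be passed to any tool that takes text as input.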

Comments

This thesis is restricted to Yale network users only. This thesis is permanently embargoed from public release.

Recommended Citation

Dhodapkar, Rahul M., "Representing Cells As Sentences Enables Natural Language Processing For Single Cell Transcriptomics" (2023). Yale Medicine Thesis Digital Library. 4175.
https://elischolar.library.yale.edu/ymtdl/4175
