Date of Award
5-19-2025
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Arman Cohan
Abstract
Contemporary foundation models are predominantly evaluated on isolated document- or image-understanding tasks, thereby overlooking the inherent multimodal, multi-document reasoning that characterizes scientific inquiry. To bridge this gap, M3SciQA is introduced, a Multi-Modal, Multi-document Scientific Question Answering benchmark crafted to test foundation models in practical scientific research settings. A comprehensive evaluation of 18 leading foundation models shows a substantial performance gap between models and human experts. Detailed error analysis reveals persistent deficiencies in both scientific visual reasoning and long-range retrieval. Addressing the former, SPACECUE offers a concise yet effective visual prompting strategy that overlays grid coordinates and Semantic-SAM masks on scientific images to direct model attention to query-relevant regions. This strategy yields accuracy improvements for GPT-4V and Claude 3 Opus on the Physics and Biology subsets of the MMMU benchmark.
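To make the visual prompting idea concrete, below is a minimal Python sketch of an overlay of the kind the abstract describes: a labeled coordinate grid plus a semi-transparent region highlight drawn onto a figure image. This is an illustration, not the thesis implementation: the function name overlay_visual_prompt, the grid size, and the dummy boolean mask are all assumptions, and the mask simply stands in for output that would come from Semantic-SAM.

# Illustrative sketch (assumed names): draw a labeled coordinate grid and
# tint a masked region so a vision-language model can ground its answer
# in query-relevant parts of a scientific figure. The mask is a placeholder
# numpy array standing in for a Semantic-SAM segmentation mask.
import numpy as np
from PIL import Image, ImageDraw

def overlay_visual_prompt(image: Image.Image, mask: np.ndarray,
                          grid: int = 4) -> Image.Image:
    """Overlay a grid with cell coordinates and highlight masked pixels."""
    img = image.convert("RGBA")
    w, h = img.size

    # Tint the masked (query-relevant) region red at partial opacity.
    tint = np.zeros((h, w, 4), dtype=np.uint8)
    tint[mask.astype(bool)] = (255, 0, 0, 100)
    img = Image.alpha_composite(img, Image.fromarray(tint))

    # Draw grid lines and per-cell coordinate labels ("0,0" .. "3,3").
    draw = ImageDraw.Draw(img)
    for i in range(1, grid):
        draw.line([(w * i // grid, 0), (w * i // grid, h)], fill="yellow")
        draw.line([(0, h * i // grid), (w, h * i // grid)], fill="yellow")
    for r in range(grid):
        for c in range(grid):
            draw.text((w * c // grid + 4, h * r // grid + 2),
                      f"{r},{c}", fill="yellow")
    return img.convert("RGB")

# Example usage with a dummy white image and a hand-placed mask.
demo = Image.new("RGB", (400, 300), "white")
demo_mask = np.zeros((300, 400), dtype=bool)
demo_mask[80:160, 120:260] = True
overlay_visual_prompt(demo, demo_mask).save("prompted.png")

The prompted image would then be sent to the model alongside the question, letting the model refer to grid cells when reasoning about the highlighted region.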
Recommended Citation
Li, Chuhan, "Towards Multi-Modal Multi-Document Understanding Capabilities in Foundation Models" (2025). Computer Science Theses. 1.
https://elischolar.library.yale.edu/computer_science_theses/1