Date of Award

5-19-2025

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Arman Cohan

Abstract

Contemporary foundation models are predominantly evaluated on isolated document- or image-understanding tasks, thereby overlooking the inherent multimodal, multi-document reasoning that characterizes scientific inquiry. To bridge this gap, M3SCIQA is introduced, a Multi-Modal, Multi-document Scientific Question Answering benchmark crafted to test foundation models in practical scientific research settings. A comprehensive evaluation of 18 leading foundation models shows a substantial performance gap between models and human experts. Detailed error analysis reveals persistent deficiencies in both scientific visual reasoning and long-range retrieval. Addressing the former, SPACECUE offers a concise yet effective visual prompting strategy that overlays grid coordinates and Semantic-SAM masks on scientific images to direct model attention to query-relevant regions. This strategy yields accuracy improvements for GPT-4V and Claude 3 Opus on the Physics and Biology subsets of the MMMU benchmark.
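The grid-coordinate overlay idea behind SPACECUE can be sketched as follows. This is a minimal illustration under assumptions, not the thesis implementation: the image is modeled as a 2-D grayscale pixel array, the `overlay_grid` function and its parameters are hypothetical, and the Semantic-SAM mask component is omitted.

```python
# Illustrative sketch of a grid-coordinate visual prompt (hypothetical
# helper; the thesis method additionally renders Semantic-SAM masks
# and works on real figure images, not toy pixel arrays).

def overlay_grid(image, cell=4, line_value=255):
    """Return a copy of a 2-D grayscale image (list of rows) with grid
    lines drawn every `cell` pixels, plus the (row, col) coordinates of
    each grid cell so a prompt can reference regions by coordinate."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(0, h, cell):        # horizontal grid lines
        for x in range(w):
            out[y][x] = line_value
    for x in range(0, w, cell):        # vertical grid lines
        for y in range(h):
            out[y][x] = line_value
    # cell coordinates a text prompt could cite, e.g. "look at cell (0, 1)"
    coords = [(r, c) for r in range(h // cell) for c in range(w // cell)]
    return out, coords

img = [[0] * 8 for _ in range(8)]      # blank 8x8 "scientific image"
grid, coords = overlay_grid(img, cell=4)
```

The overlaid image (plus the coordinate list embedded in the text prompt) gives the model an explicit spatial vocabulary for pointing at query-relevant regions.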
