Date of Award

5-19-2025

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Tesca Fitzgerald

Abstract

Multimodal foundation models (MFMs) have demonstrated impressive capabilities in static vision-language tasks such as image captioning, video summarization, and cross-modal retrieval. However, their ability to reason over time, especially in gesture-rich video inputs, remains limited. This thesis investigates the temporal reasoning capabilities of MFMs in the context of gesture understanding, a critical component for enabling more expressive human-robot interaction. Through a preliminary study, we show that prompting-based strategies offer only marginal improvements in temporal reasoning, despite producing accurate frame-by-frame descriptions.

To more rigorously evaluate these limitations, we introduce TOMATO, a benchmark designed to assess visual temporal reasoning through three diagnostic principles: Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity. Our analysis reveals that existing benchmarks often overestimate model performance by allowing shortcuts that do not require true temporal integration. Building on these insights, we examine the roles of both the vision encoder and the language backbone in gesture interpretation, finding that the latter presents a major bottleneck in temporal abstraction. This motivates future work on improving model architectures and reasoning strategies, with the goal of enabling MFMs to move beyond gesture classification toward more general and intent-aware gesture understanding for real-world human-robot interaction.