This paper introduces MM-Escape, a benchmark built on the customizable 3D environment EscapeCraft to evaluate multimodal reasoning in MLLMs through room-escape tasks. It shows that while models like GPT-4o achieve high success in simple scenarios, performance drops sharply as difficulty increases, exposing distinct limitations in reasoning and spatial awareness.
Multimodal Systems, Reasoning, Human-AI Interaction, Large Language Model, Vision Foundation Model
Ziyue Wang, Yurui Dong, Fuwen Luo, Minyuan Ruan, Zhili Cheng, Chi Chen, Peng Li, Yang Liu
Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University; Fudan University
Generated by grok-3
Background Problem
The rapid advancement of Multimodal Large Language Models (MLLMs) has highlighted the need for evaluating complex multimodal reasoning in real-world and virtual environments, which requires integrating abilities like visual perception, spatial awareness, and target deduction. Existing evaluations often focus on isolated tasks (e.g., visual grounding) or final task completion in open-world settings, neglecting the intermediate reasoning process. This gap limits a comprehensive understanding of model behaviors and reasoning mechanisms, prompting the development of MM-Escape, a benchmark inspired by real-world escape games, to assess both the process and outcome of multimodal reasoning in an open, interactive environment.
Method
The core idea of MM-Escape is to evaluate complex multimodal reasoning of MLLMs through a customizable 3D room escape environment called EscapeCraft, which supports free-form exploration and assesses intermediate behaviors alongside final task completion. EscapeCraft extends existing frameworks (ProcTHOR and Legent) to automate large-scale scene generation with interactable objects, diverse room styles, and configurable difficulty levels based on prop chains (sequences of required interactions). It defines an action space including moving, view adjustment, and interaction (e.g., grabbing, using props), supported by an inventory system for prop management. The benchmark, MM-Escape, introduces tasks like room escaping (mandatory) and post-game debriefing (optional), with difficulty levels ranging from one-hop (no props needed) to multi-hop reasoning paths (requiring keys and passwords), and multi-room settings for added complexity. Metrics evaluate both task completion (escape rate) and process (prop gain, steps, grab success rate), aiming to capture autonomous coordination of multimodal abilities.
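To make the setup more concrete, the following is a minimal Python sketch of how an action space and a prop-chain-based difficulty configuration along the lines described above could be represented. All names here (`ActionType`, `Prop`, `SceneConfig`) are hypothetical illustrations, not EscapeCraft's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ActionType(Enum):
    """Action space described for EscapeCraft: movement, view adjustment, interaction."""
    MOVE_FORWARD = auto()
    ROTATE_VIEW = auto()
    GRAB = auto()
    USE_PROP = auto()        # e.g., unlock a door with a key from the inventory
    ENTER_PASSWORD = auto()


@dataclass
class Prop:
    name: str                # e.g., "key", "password_note"
    unlocks: str             # the target object this prop is required for


@dataclass
class SceneConfig:
    """Difficulty is controlled by the length of the prop chain the agent must resolve."""
    room_style: str
    prop_chain: list[Prop] = field(default_factory=list)

    @property
    def difficulty(self) -> int:
        # Difficulty-1 ~ one-hop (empty chain, the door opens directly);
        # higher levels add required props such as keys and password notes.
        return 1 + len(self.prop_chain)


# Example: a Difficulty-3 living room where a password note opens a locked box
# that holds the key to the exit door.
scene = SceneConfig(
    room_style="living_room",
    prop_chain=[
        Prop(name="password_note", unlocks="locked_box"),
        Prop(name="key", unlocks="exit_door"),
    ],
)
print(scene.difficulty)  # -> 3
```

The key design point this sketch captures is that difficulty scales with the length of the prop chain, matching the one-hop to multi-hop progression described above.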
Experiment
Experiments were conducted with MM-Escape on both proprietary MLLMs (e.g., GPT-4o, Gemini-1.5-Pro) and open-source ones (e.g., Llama-3.2-11b-vision), across single-room and multi-room settings at difficulty levels 1 to 3, using 63 generated scenes (living rooms, kitchens, etc.) with logically arranged objects. The setup capped the maximum number of steps per difficulty level (50, 75, and 100) to enable quantitative comparison, and the temperature was set to 0 to remove decoding randomness. Results showed that while models like GPT-4o achieved high escape rates (81.36% on average) with human-like strategies in simpler tasks, performance dropped sharply as difficulty increased (e.g., GPT-4o's escape rate fell to 71.36% at Difficulty-3), remaining far below human performance (100% escape rate). Distinct failure modes were observed, such as repetitive trajectories (GPT-4o) and poor spatial awareness (Gemini); metrics like grab success rate correlated with escape success but revealed inefficiencies (e.g., high grab ratios with low precision). In the multi-room setting, performance improved slightly when paths were provided for reflection, but overall the results fell short of expectations for robust multi-hop reasoning, indicating significant gaps in current MLLM capabilities. The experimental design was comprehensive in covering difficulty variations and metrics, though the reliance on specific models and the synthetic nature of EscapeCraft may limit real-world applicability.
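As an illustration of the process-oriented evaluation, here is a minimal sketch of aggregating outcome and process metrics such as escape rate, average steps, prop gain, and grab success rate from per-episode logs. The field and function names (`EpisodeLog`, `summarize`) are assumptions for illustration; the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class EpisodeLog:
    """Per-episode record of an agent's run (assumed field names)."""
    escaped: bool
    steps_taken: int
    grab_attempts: int
    successful_grabs: int
    props_required: int
    props_obtained: int


def summarize(episodes: list[EpisodeLog]) -> dict[str, float]:
    """Aggregate outcome and process metrics in the spirit of MM-Escape:
    escape rate (task completion), average steps, prop gain, grab success rate."""
    n = len(episodes)
    total_grabs = sum(e.grab_attempts for e in episodes)
    prop_episodes = [e for e in episodes if e.props_required]
    return {
        "escape_rate": sum(e.escaped for e in episodes) / n,
        "avg_steps": sum(e.steps_taken for e in episodes) / n,
        "prop_gain": (
            sum(e.props_obtained / e.props_required for e in prop_episodes)
            / max(1, len(prop_episodes))
        ),
        "grab_success_rate": (
            sum(e.successful_grabs for e in episodes) / total_grabs if total_grabs else 0.0
        ),
    }


# Example: two Difficulty-2 episodes capped at 75 steps.
logs = [
    EpisodeLog(escaped=True, steps_taken=42, grab_attempts=9,
               successful_grabs=3, props_required=1, props_obtained=1),
    EpisodeLog(escaped=False, steps_taken=75, grab_attempts=14,
               successful_grabs=2, props_required=1, props_obtained=0),
]
print(summarize(logs))
```

Separating the grab success rate from the escape rate is what lets the benchmark expose inefficiencies such as many grab attempts with low precision even when the agent eventually escapes.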
Further Thoughts
The MM-Escape benchmark raises important questions about the scalability of multimodal reasoning evaluations beyond synthetic environments like EscapeCraft to real-world scenarios, such as robotic navigation or augmented reality assistance, where unstructured data and unpredictable interactions dominate. The observed failure modes, such as repetitive trajectories, might relate to broader issues in reinforcement learning or planning algorithms, suggesting a need to integrate adaptive exploration strategies or memory-augmented architectures in MLLMs. Additionally, the sharp performance drop with task complexity echoes findings in other domains like natural language processing, where multi-hop reasoning remains a challenge (e.g., in question-answering tasks over knowledge graphs). Future work could explore hybrid approaches combining MLLMs with specialized spatial reasoning modules or investigate whether training on diverse, procedurally generated environments could mitigate overfitting to specific benchmark designs. This paper also prompts reflection on ethical implications—how do we ensure that multimodal systems, if deployed in real-world assistive roles, do not fail in critical scenarios due to similar reasoning limitations?