SAMR: A Spatial-Augmented Mixed Reality Method for Enhancing Vision-Language Models in 3D Scene Understanding
Abstract
Understanding 3D scenes in mixed reality (MR) is crucial for advancing human-computer interaction, especially in MR applications that demand spatial awareness and contextual reasoning. While Vision-Language Models (VLMs) perform well at 2D image interpretation, they struggle to incorporate spatial context from 3D environments, which limits their effectiveness in MR scenarios. To address this gap, we introduce SAMR, a Spatial-Augmented Mixed Reality method designed to enhance VLMs for 3D scene understanding. Our system consists of three key modules. The first, a spatial-segmented fusion module, uses FastSAM-based segmentation to create object-level meshes from head-mounted display (HMD) images; it maps extracted feature points to 3D coordinates by ray casting onto the HMD-captured scene mesh and applies triangular facet fitting. The second, a multimodal interaction module, combines gesture, gaze, and voice input so that users can intuitively interact with 3D meshes and annotate prompts. The third, a VLM integration module, merges the annotated 2D images with user queries into standardized prompts for the VLM, whose responses are then linked back to the user-specified object meshes. By augmenting VLMs with spatial context and multimodal interaction, SAMR substantially improves 3D scene interpretation. We demonstrate SAMR's effectiveness across key application scenarios, including object identification, relationship analysis, distance estimation, targeted object questioning, and cognitive assistance. This approach provides a robust framework for MR applications with AI agents.
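The spatial-segmented fusion step described above maps 2D feature points onto the captured scene mesh by ray casting against its triangular facets. The following is a minimal sketch of that idea, not the authors' implementation: the camera intrinsics, mesh layout, and all function names are illustrative assumptions, using the standard Moller-Trumbore ray-triangle test.

```python
# Minimal sketch (assumptions, not the SAMR implementation): back-project an
# HMD image pixel into a camera-space ray and intersect it with the
# triangular facets of a captured scene mesh to obtain a 3D coordinate.
import numpy as np

def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) into a unit ray direction in camera space."""
    d = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return d / np.linalg.norm(d)

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
    """Moller-Trumbore test; returns the hit distance t, or None if no hit."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                 # ray is parallel to the facet plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return t if t > eps else None

def cast_onto_mesh(origin, direction, triangles):
    """Return the closest 3D hit point over a list of (v0, v1, v2) facets."""
    best_t = np.inf
    for v0, v1, v2 in triangles:
        t = ray_triangle_intersect(origin, direction, v0, v1, v2)
        if t is not None and t < best_t:
            best_t = t
    return origin + best_t * direction if np.isfinite(best_t) else None

if __name__ == "__main__":
    # Toy mesh: a single facet 2 m in front of the camera.
    tri = (np.array([-1.0, -1.0, 2.0]),
           np.array([ 1.0, -1.0, 2.0]),
           np.array([ 0.0,  1.0, 2.0]))
    ray_dir = pixel_to_ray(320, 240, fx=500, fy=500, cx=320, cy=240)
    print(cast_onto_mesh(np.zeros(3), ray_dir, [tri]))  # -> [0. 0. 2.]
```

In a real pipeline the mesh would come from the HMD's spatial-mapping API and an acceleration structure (e.g., a BVH) would replace the linear scan over facets; the sketch only illustrates the geometric mapping from a segmented 2D point to a 3D facet hit.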