UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence

ICCV 2025

Abstract

Urban research involves a wide range of scenarios and tasks that require an understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we also extend an existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms both open-source and proprietary MLLMs on single-modal tasks and complex cross-modal tasks, and shows robust generalization across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
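
The multi-stage training framework described above, which decouples spatial reasoning enhancement from domain knowledge learning, can be pictured as sequential fine-tuning passes over different instruction subsets. The sketch below is a minimal illustration under that reading only; the stage ordering, stage names, dataset identifiers, and the `fine_tune` placeholder are assumptions for illustration and do not reflect the paper's actual pipeline or code.

```python
# Minimal sketch of a decoupled multi-stage instruction-tuning schedule.
# Stage names, dataset labels, and fine_tune() are hypothetical placeholders,
# not UrbanLLaVA's real API; only the staged, decoupled structure is the point.

from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    datasets: List[str]  # instruction subsets used in this stage


def fine_tune(model_id: str, stage: Stage) -> str:
    # Placeholder for an actual supervised fine-tuning run of an MLLM
    # checkpoint on the stage's instruction data.
    print(f"[{stage.name}] tuning {model_id} on {', '.join(stage.datasets)}")
    return f"{model_id}+{stage.name}"


def run_schedule(base_model: str, stages: List[Stage]) -> str:
    # Apply the stages sequentially, so each objective is learned separately
    # instead of being mixed into a single training pass.
    model = base_model
    for stage in stages:
        model = fine_tune(model, stage)
    return model


if __name__ == "__main__":
    # Hypothetical decomposition: domain knowledge in one stage,
    # spatial reasoning enhancement in another.
    schedule = [
        Stage("domain_knowledge", ["single_modal_urban_qa", "cross_modal_urban_qa"]),
        Stage("spatial_reasoning", ["location_to_global_spatial_tasks"]),
    ]
    final_model = run_schedule("base-mllm-checkpoint", schedule)
    print("final checkpoint:", final_model)
```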
