GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

24 citations · #260 of 5858 papers in NeurIPS 2025

Abstract

While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between ECG time series and ECG images, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images, and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction data generation for producing high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN $7.4\%$ $\uparrow$), explainability ($22.7\%$ $\uparrow$), and grounding ($25.3\%$ $\uparrow$), making it a promising approach for real-world clinical applications. Code, model, and data are available at https://github.com/lanxiang1017/GEM.
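To make the dual-encoder idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of how 12-lead time-series features and ECG-image features could be projected into a shared token space and concatenated as a multimodal prefix for the LLM. All module names, encoder choices, and dimensions here are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of a dual-encoder front end for an ECG MLLM.
# A 1D encoder handles the raw 12-lead signal; image patch tokens
# (e.g., from a ViT over the rendered 12-lead plot) are assumed given.
# Both streams are linearly projected into the LLM's embedding space.
import torch
import torch.nn as nn

class DualECGEncoder(nn.Module):
    def __init__(self, ts_dim=256, img_dim=768, llm_dim=4096):
        super().__init__()
        # Stand-in time-series encoder: conv over 12 leads, pooled to 32 tokens.
        self.ts_encoder = nn.Sequential(
            nn.Conv1d(12, ts_dim, kernel_size=15, stride=4),
            nn.GELU(),
            nn.AdaptiveAvgPool1d(32),
        )
        self.ts_proj = nn.Linear(ts_dim, llm_dim)    # align signal tokens
        self.img_proj = nn.Linear(img_dim, llm_dim)  # align image tokens

    def forward(self, ecg_signal, img_tokens):
        # ecg_signal: (B, 12, T) raw waveform; img_tokens: (B, N, img_dim).
        ts = self.ts_encoder(ecg_signal).transpose(1, 2)  # (B, 32, ts_dim)
        fused = torch.cat(
            [self.ts_proj(ts), self.img_proj(img_tokens)], dim=1
        )
        return fused  # (B, 32 + N, llm_dim) multimodal prefix for the LLM

x = torch.randn(2, 12, 5000)         # 10 s of 12-lead ECG at 500 Hz
v = torch.randn(2, 196, 768)         # 196 ViT patch tokens per ECG image
print(DualECGEncoder()(x, v).shape)  # torch.Size([2, 228, 4096])
```

In this sketch, cross-modal alignment is reduced to learned linear projections into a shared space; the paper's actual alignment mechanism may differ.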

Citation History

Jan 25, 2026: 17
Feb 13, 2026: 24 (+7)