Separating the Wheat from the Chaff: Spatio-Temporal Transformer with View-interweaved Attention for Photon-Efficient Depth Sensing
Abstract
Time-resolved imaging is an emerging sensing modality that has been shown to enable advanced applications, including remote sensing, fluorescence lifetime imaging, and even non-line-of-sight sensing. Single-photon avalanche diodes (SPADs) outperform competing time-resolved imaging technologies thanks to their excellent photon sensitivity and superior temporal resolution on the order of tens of picoseconds. Because SPADs can exceed the sensing limits of conventional cameras, they have also drawn growing attention to photon-efficient imaging. However, photon-efficient imaging under degraded conditions, i.e., low photon counts and low signal-to-background ratio (SBR), remains a significant challenge. In this paper, we propose a spatio-temporal transformer network for photon-efficient imaging in low-flux scenarios. In particular, we introduce a view-interweaved attention mechanism (VIAM) that extracts both spatial-view and temporal-view self-attention in each transformer block. We also design an adaptive-weighting scheme that dynamically adjusts the weights between the two views of self-attention in VIAM for different SBR levels. We extensively validate the effectiveness of our approach on the simulated Middlebury dataset and on a self-collected dataset with real-world SPAD measurements and well-annotated ground-truth depth maps.
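To make the VIAM idea concrete, the sketch below shows one plausible realization of a block that fuses temporal-view and spatial-view self-attention with adaptively predicted weights. This is a minimal illustration, not the authors' implementation: the class name `ViewInterweavedAttention`, the use of standard multi-head attention for each view, and the pooling-based gating MLP (a cheap stand-in for SBR-conditioned weighting) are all assumptions.

```python
import torch
import torch.nn as nn

class ViewInterweavedAttention(nn.Module):
    """Illustrative VIAM-style block (names and structure are assumptions).

    Input:  x of shape (B, C, T, H, W) -- features of a SPAD photon histogram
            with T time bins over an H x W pixel grid.
    Output: same shape, after fusing temporal-view and spatial-view
            self-attention with adaptively predicted per-view weights.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gating MLP: predicts two fusion weights from globally pooled
        # features, standing in for the paper's adaptive-weighting scheme.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Temporal view: attend over the T time bins at each pixel.
        xt = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        xt = xt.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        # Spatial view: attend over the H*W positions at each time bin.
        xs = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        xs = xs.reshape(b, t, h, w, c).permute(0, 4, 1, 2, 3)
        # Adaptive weighting: pooled statistics serve as a crude proxy
        # for the signal-to-background level of the measurement.
        weights = self.gate(x.mean(dim=(2, 3, 4)))  # (B, 2)
        wt = weights[:, 0].view(b, 1, 1, 1, 1)
        ws = weights[:, 1].view(b, 1, 1, 1, 1)
        return x + wt * xt + ws * xs

# Example usage on a toy histogram feature tensor.
block = ViewInterweavedAttention(dim=32, heads=4)
y = block(torch.randn(2, 32, 64, 16, 16))  # -> (2, 32, 64, 16, 16)
```

Factoring the attention into two views keeps the cost manageable: full 3D attention over T*H*W tokens would be quadratic in that product, whereas the per-view attention above scales with T and H*W separately.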