Haotian Zhang

2,848 total citations

papers (19)

KD-MVS: Knowledge Distillation Based Self-Supervised Learning for Multi-View Stereo

ECCV 2022 (arXiv)

Offline and Online Optical Flow Enhancement for Deep Video Compression

AAAI 2024 (arXiv)

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

ICLR 2025

GENMO: A GENeralist Model for Human MOtion

ICCV 2025 (arXiv)

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

ICLR 2025 (arXiv)

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

NeurIPS 2025 (arXiv)

Learned Image Compression with Hierarchical Progressive Context Modeling

ICCV 2025 (arXiv)

Sobolev Training for Implicit Neural Representations with Approximated Image Derivatives

ECCV 2022 (arXiv)

Few-Shot Domain Adaptation for Learned Image Compression

AAAI 2025 (arXiv)

SIGMA: Refining Large Language Model Reasoning via Sibling-Guided Monte Carlo Augmentation

NeurIPS 2025 (arXiv)

Spotting Temporally Precise, Fine-Grained Events in Video

ECCV 2022

GenAL: Generative Agent for Adaptive Learning

AAAI 2025

Grounded Language-Image Pre-Training

Ferret: Refer and Ground Anything Anywhere at Any Granularity

GLIPv2: Unifying Localization and Vision-Language Understanding

TransMVSNet: Global Context-Aware Multi-View Stereo Network With Transformers

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

ELSD: Efficient Line Segment Detector and Descriptor

