Yu Su

Affiliations

Microsoft, The Ohio State University

Total citations: 5,301

papers (23)

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

CVPR 2024, arXiv

1,715 citations

Mind2Web: Towards a Generalist Agent for the Web

NeurIPS 2023, arXiv

829 citations

LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

ICCV 2023, arXiv

631 citations

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

ICLR 2024, arXiv

261 citations

BioCLIP: A Vision Foundation Model for the Tree of Life

CVPR 2024, arXiv

176 citations

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

CVPR 2025, arXiv

MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

ICML 2024, arXiv

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

ICLR 2025, arXiv

An Illusion of Progress? Assessing the Current State of Web Agents

COLM 2025, arXiv

One Step at a Time: Long-Horizon Vision-and-Language Navigation With Milestones

CVPR 2022, arXiv

A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

ICLR 2024, arXiv

Dual-View Visual Contextualization for Web Navigation

CVPR 2024, arXiv

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

NeurIPS 2025, arXiv

CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework

AAAI 2024

Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation

CVPR 2025, arXiv

Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data

NeurIPS 2023, arXiv

Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

CVPR 2025, arXiv

Distribution-Driven Dense Retrieval: Modeling Many-to-One Query-Document Relationship

AAAI 2025

VERSE: Verification-based Self-Play for Code Instructions

AAAI 2025

ScholarGEC: Enhancing Controllability of Large Language Model for Chinese Academic Grammatical Error Correction

AAAI 2025

papers (23)

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Mind2Web: Towards a Generalist Agent for the Web

LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

GPT-4V(ision) is a Generalist Web Agent, if Grounded

TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

BioCLIP: A Vision Foundation Model for the Tree of Life

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

An Illusion of Progress? Assessing the Current State of Web Agents

One Step at a Time: Long-Horizon Vision-and-Language Navigation With Milestones

A Simple Interpretable Transformer for Fine-Grained Image Classification and Analysis

Dual-View Visual Contextualization for Web Navigation

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

CONSIDER: Commonalities and Specialties Driven Multilingual Code Retrieval Framework

Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation

Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data

Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

Distribution-Driven Dense Retrieval: Modeling Many-to-One Query-Document Relationship

VERSE: Verification-based Self-Play for Code Instructions

ScholarGEC: Enhancing Controllability of Large Language Model for Chinese Academic Grammatical Error Correction
