"multi-domain evaluation" Papers
4 papers found
ClinBench: A Standardized Multi-Domain Framework for Evaluating Large Language Models in Clinical Information Extraction
Ismael Villanueva Miranda, Zifan Gu, Donghan Yang et al.
NeurIPS 2025
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
Zaid Khan, Elias Stengel-Eskin, Jaemin Cho et al.
ICLR 2025
arXiv:2410.06215
9 citations
MetaMetrics: Calibrating Metrics for Generation Tasks Using Human Preferences
Genta Winata, David Anugraha, Lucky Susanto et al.
ICLR 2025
arXiv:2410.02381
17 citations
T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning
NeurIPS 2025
arXiv:2505.16986
3 citations