Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

arXiv:2502.00561

Abstract

The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI systems. This framework has two important implications: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating validity.
