"dataset curation" Papers

12 papers found

DataRater: Meta-Learned Dataset Curation

Dan Andrei Calian, Greg Farquhar, Iurii Kemaev et al.

NEURIPS 2025arXiv:2505.17895
7
citations

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Cathy Jiao, Yijun Pan, Emily Xiao et al.

NEURIPS 2025arXiv:2507.09424

Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Mikey Shechter, Yair Carmon

NEURIPS 2025arXiv:2503.08805
2
citations

Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Aladin Djuhera, Swanand Kadhe, Syed Zawad et al.

NEURIPS 2025spotlightarXiv:2506.06522

HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

Kun Liu, Qi Liu, Xinchen Liu et al.

CVPR 2025arXiv:2503.23715
14
citations

MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Bohan Zhou, Yi Zhan, Zhongbin Zhang et al.

NEURIPS 2025oralarXiv:2505.16602
3
citations

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Navve Wasserman, Noam Rotstein, Roy Ganz et al.

CVPR 2025arXiv:2404.18212
29
citations

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Nikhil Kandpal, Brian Lester, Colin Raffel et al.

NEURIPS 2025arXiv:2506.05209
11
citations

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Wenhao Wang, Yi Yang

NEURIPS 2025arXiv:2503.01739
11
citations

Dataset Growth

Ziheng Qin, zhaopan xu, YuKun Zhou et al.

ECCV 2024arXiv:2405.18347
4
citations

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.

CVPR 2024arXiv:2402.19479
351
citations

Position: Measure Dataset Diversity, Don't Just Claim It

Dora Zhao, Jerone Andrews, Orestis Papakyriakopoulos et al.

ICML 2024arXiv:2407.08188
32
citations