by Vinko Sabolčec Papers
4 papers found
Conference
Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs
Dongyang Fan, Vinko Sabolčec, Matin Ansaripour et al.
COLM 2025paper
4
citations
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Bettina Messmer, Vinko Sabolčec, Martin Jaggi
NEURIPS 2025arXiv:2502.10361
12
citations
FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language
Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec et al.
COLM 2025paperarXiv:2506.20920
51
citations
URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
Dongyang Fan, Vinko Sabolčec, Martin Jaggi
NEURIPS 2025arXiv:2505.16570