Parameterized Synthetic Text Generation with SimpleStories
1 citation · ranked #2497 of 5,858 papers in NeurIPS 2025 · 8 authors
Abstract
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained tiny-model suite then show improved sample efficiency and model interpretability compared with the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we move the frontier with regard to the fewest-parameter language model that outputs grammatical English.
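The core mechanism the abstract describes is prompt parameterization: each generation prompt is assembled by independently sampling a value at several levels of abstraction (e.g., topic, style, linguistic feature), so the Cartesian product of choices yields syntactic and semantic diversity at scale. Below is a minimal sketch of that idea; the parameter pools, template, and function names here are hypothetical placeholders, not the actual SimpleStories parameters, which are defined in the paper and its open-sourced pipeline.

```python
import random

# Hypothetical parameter pools, one per level of abstraction.
# The real SimpleStories parameters (themes, styles, grammar
# features, etc.) are specified in the paper and repo.
PARAMS = {
    "topic": ["a lost kitten", "a rainy day", "a magic garden"],
    "style": ["simple", "playful", "calm"],
    "feature": ["dialogue", "a question", "a repeated phrase"],
}

# Hypothetical prompt template with one slot per parameter.
TEMPLATE = (
    "Write a short story in simple language about {topic}. "
    "Use a {style} tone and include {feature}."
)

def sample_prompt(rng: random.Random) -> str:
    """Draw one value per abstraction level and fill the template."""
    choice = {name: rng.choice(pool) for name, pool in PARAMS.items()}
    return TEMPLATE.format(**choice)

if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducible sampling
    for _ in range(3):
        print(sample_prompt(rng))
```

Even with these toy pools, three levels of three values each already span 27 distinct prompts; scaling the pools and levels is what lets a dataset of millions of samples stay varied.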
Citation History
Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 28, 2026: 0
Feb 13, 2026: 1