Small-to-Large Generalization: Training Data Influences Models Consistently Across Scale

Alaa Khaddaj, Logan Engstrom, Aleksander Madry

International Conference on Learning Representations (ICLR 2025)

The choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how the choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.