Small-to-Large Generalization: Training Data Influences Models Consistently Across Scale

Part of the International Conference on Learning Representations 2025 (ICLR 2025)


Authors

Alaa Khaddaj, Logan Engstrom, Aleksander Madry

Abstract

The choice of training data distribution greatly influences model behavior. Yet, in large-scale settings, precisely characterizing how changes in training data affect predictions is often difficult due to model training costs. Current practice is instead to extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how the choice of data affects large-scale models raises the question: how does training data distribution influence model behavior across compute scale? We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data. Equipped with these findings, we characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
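
The following is a minimal sketch (not the paper's exact protocol) of the kind of across-scale correlation measurement the abstract describes: given some per-candidate-dataset evaluation metric for a cheap proxy model and for a large model, compute their rank correlation across training data choices. The arrays proxy_losses and large_losses are hypothetical stand-ins for real measurements.

# Sketch: does a small proxy model's behavior track a large model's
# across choices of training data? Hypothetical loss arrays stand in
# for real per-dataset evaluations.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical: eval loss of each model when trained on each of K
# candidate training data distributions.
K = 20
proxy_losses = rng.normal(loc=3.0, scale=0.2, size=K)               # small proxy model
large_losses = 0.5 * proxy_losses + rng.normal(scale=0.05, size=K)  # large model

# Rank correlation across data choices: a high value suggests the
# proxy's ranking of training datasets transfers to the large model.
rho, pval = spearmanr(proxy_losses, large_losses)
print(f"Spearman rho across {K} data choices: {rho:.3f} (p={pval:.2g})")

# Dataset selection via the proxy: pick the candidate distribution the
# cheap proxy ranks best, then check its rank under the large model.
best_for_proxy = int(np.argmin(proxy_losses))
large_rank = int(np.argsort(np.argsort(large_losses))[best_for_proxy])
print(f"Proxy's best dataset ranks {large_rank + 1}/{K} for the large model")

A rank correlation is used here (rather than, say, raw loss agreement) because downstream applications like dataset selection only need the proxy to order candidate data distributions the same way the large model would.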