Scalable Extraction of Training Data from Aligned, Production Language Models

Part of the International Conference on Learning Representations 2025 (ICLR 2025)


Authors

Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher Choquette-Choo, Florian Tramer, Katherine Lee

Abstract

Large language models are prone to memorizing some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that model alignment---a standard training process that tunes a model to harmlessly follow user instructions---would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model's alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI's ChatGPT. Our work highlights the limitations of existing safeguards in preventing training data leakage from production language models.