Part of the International Conference on Learning Representations 2025 (ICLR 2025)
Stanislav Fort
We explore a class of adversarial attacks that target the activations of language models in order to derive upper-bound scaling laws on their attack susceptibility. By manipulating the activations of a relatively small number of tokens, $a$, we demonstrate the ability to control the exact predictions of a significant number of subsequent tokens $t$ (in some cases up to 1000). We empirically verify a scaling law in which the maximum number of target tokens predicted, $t_\mathrm{max}$, depends linearly on the number of tokens $a$ whose activations the attacker controls, $t_\mathrm{max} = \kappa a$. We find that the number of bits of input the attacker must control to exert a single bit of control over the output (a property we call \textit{attack resistance} $\chi$) is remarkably stable, between $\approx 16$ and $\approx 25$, across orders of magnitude of model size and across model families. Compared to attacks directly on input tokens, attacks on activations are predictably much stronger; however, we identify a surprising regularity: one bit of input, whether steered via activations or via tokens, exerts a roughly similar amount of control over the model's predictions. This supports the hypothesis that adversarial attacks are a consequence of a dimensionality mismatch between the input and output spaces. The ease of attacking language model activations rather than tokens has practical implications for multi-modal models and for selected retrieval models. By using language models as a controllable test-bed for studying adversarial attacks, we explore input-output dimension regimes that are inaccessible in computer vision and greatly extend the empirical support for the dimensionality theory of adversarial attacks.
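As a minimal illustration of how an attack resistance of this kind can be tallied, the sketch below counts the bits an attacker controls at the input (the activations of $a$ token positions, each a $d$-dimensional vector at some fixed precision) against the bits of control gained over the output (the $t$ forced tokens, each carrying roughly $\log_2 V$ bits for a vocabulary of size $V$), and reports their ratio $\chi$. The specific model dimension, precision, and vocabulary size used here are illustrative assumptions for the example, not values taken from the paper.

```python
import math


def attack_resistance(a_tokens: int, t_tokens: int,
                      d_model: int = 1024,
                      bits_per_component: int = 16,
                      vocab_size: int = 50_257) -> float:
    """Bits of input control spent per bit of output control (chi).

    Assumes the attacker fully controls the activation vectors of
    `a_tokens` positions (d_model components at `bits_per_component`
    bits each) and thereby fixes `t_tokens` target tokens, each worth
    log2(vocab_size) bits of output information.
    """
    input_bits = a_tokens * d_model * bits_per_component
    output_bits = t_tokens * math.log2(vocab_size)
    return input_bits / output_bits


# Illustrative numbers only: one attacked token position forcing
# 100 subsequent target tokens in a model with d_model = 1024 and
# a ~50k-token vocabulary.
print(f"chi ~= {attack_resistance(a_tokens=1, t_tokens=100):.1f}")
```

Under these assumed numbers, the same bookkeeping also makes the linear scaling law transparent: doubling $a$ doubles the available input bits, and at a fixed $\chi$ the number of controllable output tokens $t_\mathrm{max}$ doubles accordingly.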