Monitoring Latent World States in Language Models with Propositional Probes

Feng, Jiahai; Russell, Stuart; Steinhardt, Jacob

Monitoring Latent World States in Language Models with Propositional Probes

Jiahai Feng, Stuart Russell, Jacob Steinhardt

International Conference on Learning Representations 2025 (ICLR 2025) Conference

Abstract

Language models (LMs) are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of LMs could help monitor and correct unfaithful behavior. We hypothesize that LMs faithfully represent their input contexts in a latent world model, and we seek to extract these latent world states as logical propositions. For example, given the input context ``Greg is a nurse. Laura is a physicist.'', we aim to decode the propositions WorksAs(Greg, nurse) and WorksAs(Laura, physicist) from the model's internal activations. To do so we introduce _propositional probes_, which compositionally extract lexical concepts from token activations and bind them into propositions. Key to this is identifying a _binding subspace_ in which bound tokens have high similarity (Greg $\leftrightarrow$ nurse) but unbound ones do not (Greg $\not\leftrightarrow$ physicist). Despite only being trained on linguistically simple English templates, we find that propositional probes generalize to inputs written as short stories and translated to Spanish. Moreover, in three settings where LMs respond unfaithfully to the input context---prompt injections, backdoor attacks, and gender bias--- the decoded propositions remain faithful. This suggests that LMs often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.

Abstract

Name Change Policy