$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Part of International Conference on Representation Learning 2025 (ICLR 2025) Conference

Bibtex Paper Supplemental

Authors

Mintong Kang, Bo Li

Abstract

As large language models (LLMs) become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate input/output content of LLMs and ensure compliance with safety policies. Existing guardrail models, such as OpenAI Mod and LlamaGuard, treat various safety categories (e.g., self-harm, self-harm/instructions) independently and fail to explicitly capture the intercorrelations among them. This has led to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories.To address these limitations, we propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, $R^2$-Guard comprises two parts: data-driven guardrail models and reasoning components. The data-driven guardrail models provide unsafety probabilities of moderated content on different safety categories.We then encode safety knowledge among different categories as first-order logical rules and embed them into a probabilistic graphic model (PGM) based reasoning component. The unsafety probabilities of different categories from data-driven guardrail models are sent to the reasoning component for final inference. We employ two types of PGMs: Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve precision-efficiency balance via improved graph structure. We also propose different methods to optimize the weights of knowledge. To further perform stress tests for guardrail models, we employ a pairwise construction method to construct a new safety benchmark TwinSafety, which features principled categories and presents new challenges for moderation. We show that $R^2$-Guard is effective even given unrepresentative categories or challenging jailbreaking prompts. We demonstrate the effectiveness of $R^2$-Guard by comparisons with eight strong guardrail models on six standard moderation datasets, and demonstrate the robustness of $R^2$-Guard against four SOTA jailbreaking attacks. $R^2$-Guard significantly surpasses SOTA method LlamaGuard by 12.6% on standard moderation datasets and by 59.9% against jailbreaking attacks.We further reveal that $R^2$-Guard can effectively adapt to safety category updates by simply editing the PGM reasoning graph.