Part of International Conference on Learning Representations 2025 (ICLR 2025)
Kaijing Ma, Xeron Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, Ge Zhang
In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept that minimizes reliance on domain-specific knowledge to enable a more accurate evaluation of models' reasoning abilities in out-of-distribution settings. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), which encompasses five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes how effectively models apply new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench at differentiating models' reasoning abilities. We perform detailed analyses: using Stepwise Prompting, we identify bottlenecks in the Cipher task, where two rounds of Self-Correction yield the best results. We also evaluate performance across three integrated tasks, explore the impact of Tricks on the Puzzle task, and visualize rule-focused attention. Additionally, we conduct ablation studies on dataset size, benchmark correlations, and zero-shot and three-shot "only questions" experiments. KOR-Bench aims to enhance reasoning evaluation and support further research in this area.