GraphArena: Evaluating and Exploring Large Language Models on Graph Computation

Tang, Jianheng; Zhang, Qifan; Li, Yuhan; Chen, Nuo; Li, Jia

GraphArena: Evaluating and Exploring Large Language Models on Graph Computation

Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, Jia Li

International Conference on Learning Representations 2025 (ICLR 2025) Conference

Abstract

The ``arms race'' of Large Language Models (LLMs) demands new benchmarks to examine their progresses. In this paper, we introduce GraphArena, a benchmarking tool designed to evaluate LLMs on real-world graph computational problems. It offers a suite of four polynomial-time tasks (e.g., Shortest Distance) and six NP-complete challenges (e.g., Traveling Salesman Problem). GraphArena features a rigorous evaluation framework that classifies LLM outputs as correct, suboptimal (feasible but not optimal), hallucinatory (properly formatted but infeasible), or missing. Evaluation of over 10 LLMs reveals that even top-performing LLMs struggle with larger, more complex graph problems and exhibit hallucination issues. We further explore four potential solutions to address this issue and improve LLMs on graph computation, including chain-of-thought prompting, instruction tuning, code writing, and scaling test-time compute, each demonstrating unique strengths and limitations. GraphArena complements the existing LLM benchmarks and is open-sourced at https://github.com/squareRoot3/GraphArena.

Abstract

Name Change Policy