LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Jain, Naman; Gu, Alex; Li, Wen-Ding; Yan, Fanjia; Zhang, Tianjun; Wang, Sida; Solar-Lezama, Armando; Sen, Koushik; Stoica, Ion

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Part of International Conference on Representation Learning 2025 (ICLR 2025) Conference

Authors

Naman Jain, Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

Abstract

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEvla, MBPP) are no longer sufficient for assessing their capabilities suffering from data contamination, overfitting, saturation, and focus on merely code generation. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which collects new problems over time from contests across three competition platforms, Leetcode, Atcoder, and Codeforces. Notably, our benchmark also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts over six hundred coding problems that were published between May 2023 and Aug 2024. We evaluate over 50 LLMs on LiveCodeBench (LCB for brevity) presenting the largest evaluation study of code LLMs on competition problems. Based on the study, we present novel empirical findings on contamination, overfitting, and holistic evaluations. We demonstrate that time-segmented evaluations serve as a robust approach to evade contamination; they are successful at detecting contamination across a wide range of open and closed models including GPT-4O, Claude, Deepseek, and Codestral. Next, we highlight overfitting and saturation of traditional coding benchmarks like HumanEvla and demonstrate LCB allows more reliable evaluations. Finally, our holistic evaluation scenarios allow for measuring the different capabilities of programming agents in isolation.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Authors

Abstract

Name Change Policy