Part of International Conference on Learning Representations 2025 (ICLR 2025)
Weixian Lei, Difei Gao, Mike Zheng Shou
Recent advancements in Multimodal Large Language Models (MLLMs) have accelerated the development of Graphical User Interface (GUI) agents capable of automating complex tasks across digital platforms. However, precise grounding of GUI elements remains a key challenge for accurate interaction and generalization. In this work, we present an effective GUI grounding framework comprising an automated data collection engine that gathers extensive GUI screenshots and annotations to ensure broad generalization. We also propose a lightweight, flexible GUI grounding module that efficiently localizes UI elements after pre-training on the collected data, and we introduce a novel method for integrating this module with MLLMs to execute GUI tasks effectively. Our approach demonstrates superior task accuracy and adaptability, as validated on benchmarks including ScreenSpot, MiniWob, AITW, and Mind2Web.
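The abstract describes a two-part design: an MLLM plans actions in terms of natural-language element descriptions, while a separate pre-trained grounding module resolves those descriptions to screen coordinates. The sketch below illustrates this decoupling only; all names (`GroundingModule`, `MLLMPlanner`, `Action`, `Element`) are hypothetical stand-ins, not the paper's actual API, and the matching logic is a toy placeholder for the learned model.

```python
"""Minimal sketch of coupling a GUI grounding module with an MLLM planner.

Assumption: the planner emits symbolic actions referencing UI elements by
description, and the grounding module maps descriptions to coordinates.
All class and function names here are illustrative, not from the paper.
"""
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    box: tuple  # (x0, y0, x1, y1) in pixels


@dataclass
class Action:
    kind: str          # e.g. "click" or "type"
    target_query: str  # natural-language description of the UI element
    text: str = ""


class GroundingModule:
    """Stand-in for the pre-trained grounding model. Here we fake
    localization with substring matching over known elements; the real
    module would score visual features against the query."""

    def locate(self, elements, query):
        for el in elements:
            if query.lower() in el.name.lower():
                x0, y0, x1, y1 = el.box
                return ((x0 + x1) // 2, (y0 + y1) // 2)  # box center
        return None


class MLLMPlanner:
    """Stand-in for the MLLM: produces the next symbolic action. A real
    planner would condition on the screenshot, task, and history."""

    def next_action(self, task, history):
        return Action(kind="click", target_query="search button")


def step(task, elements, planner, grounder, history):
    """One agent step: plan an action, then ground its target element."""
    action = planner.next_action(task, history)
    point = grounder.locate(elements, action.target_query)
    if point is None:
        raise ValueError(f"could not ground: {action.target_query!r}")
    history.append((action, point))
    return action, point


if __name__ == "__main__":
    screen = [Element("Search button", (400, 20, 460, 60)),
              Element("Address bar", (80, 20, 380, 60))]
    action, (x, y) = step("find cheap flights", screen,
                          MLLMPlanner(), GroundingModule(), [])
    print(f"{action.kind} at ({x}, {y})")  # -> click at (430, 40)
```

One appeal of this separation, as suggested by the abstract, is that the grounding module can be pre-trained independently on large-scale screenshot annotations and swapped in under different MLLM backbones without retraining the planner.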