SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

Part of the International Conference on Learning Representations 2025 (ICLR 2025)


Authors

Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan Rossi, Changyou Chen, Tong Sun

Abstract

Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually rich documents. Traditional methods that use document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs leads to inefficiencies, especially with lengthy documents. In this work, we present a novel framework named Self-Visual Retrieval-Augmented Generation (SV-RAG), which broadens the horizons of any MLLM to support long-document understanding. We demonstrate that MLLMs themselves can serve as effective multimodal retrievers, fetching relevant pages and then answering user questions based on those pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.
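To make the two-stage design concrete, below is a minimal Python sketch of SV-RAG-style inference: score each page of a document against the question using the MLLM equipped with a retrieval adapter, keep the top-k pages, and answer with the MLLM equipped with a question-answering adapter. The helpers `embed_query`, `embed_page`, and `answer_fn` are hypothetical stand-ins for the adapted MLLM, and the cosine-similarity/top-k retrieval step is an assumption for illustration, not the paper's exact scoring mechanism.

```python
import torch
import torch.nn.functional as F
from typing import Callable, List
from PIL import Image

def retrieve_then_answer(
    question: str,
    page_images: List[Image.Image],
    embed_query: Callable[[str], torch.Tensor],          # MLLM + retrieval adapter (hypothetical)
    embed_page: Callable[[Image.Image], torch.Tensor],   # MLLM + retrieval adapter (hypothetical)
    answer_fn: Callable[[str, List[Image.Image]], str],  # MLLM + QA adapter (hypothetical)
    top_k: int = 3,
) -> str:
    """Sketch of two-stage inference: retrieve evidence pages, then answer."""
    # Stage 1: embed the question and every page, score pages by cosine similarity.
    q = F.normalize(embed_query(question), dim=-1)
    page_embs = torch.stack([F.normalize(embed_page(p), dim=-1) for p in page_images])
    scores = page_embs @ q
    top_idx = scores.topk(min(top_k, len(page_images))).indices.tolist()

    # Stage 2: answer the question conditioned only on the retrieved evidence pages.
    evidence = [page_images[i] for i in top_idx]
    return answer_fn(question, evidence)
```

The key point this sketch illustrates is that both stages reuse the same MLLM backbone, switching only the lightweight adapter, so long documents never need to be fed to the model in full.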