Part of the International Conference on Learning Representations 2025 (ICLR 2025)
Shaofeng Zhang, Qiang Zhou, Sitong Wu, Haoru Tan, Zhibin Wang, Jinfa Huang, Junchi Yan
Dense visual representation learning (DRL) shows promise for learning localized information in dense prediction tasks, but struggles to establish pixel/patch correspondence across different views (cross-contrasting). Existing methods therefore primarily rely on self-contrasting the same view under different augmentations, which limits input variance and hinders downstream performance. This paper delves into the mechanisms of self-contrasting and cross-contrasting and identifies the crux of the issue: transforming discrete positional embeddings into continuous representations. To address the correspondence problem, we propose a Continuous Relative Rotary Positional Query ({\mname}), which enables patch-level representation learning. Extensive experiments on standard datasets demonstrate state-of-the-art (SOTA) results. Compared to the previous SOTA method (PQCL), our approach achieves significant improvements on COCO: with 300 epochs of pretraining, {\mname} obtains \textbf{3.4\%} mAP$^{bb}$ and \textbf{2.1\%} mAP$^{mk}$ improvements on detection and segmentation, respectively. Moreover, {\mname} converges faster, achieving \textbf{10.4\%} mAP$^{bb}$ and \textbf{7.9\%} mAP$^{mk}$ improvements over SOTA with only 40 epochs of pretraining.
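The abstract's core idea is evaluating rotary position embeddings at continuous (relative) patch coordinates rather than discrete grid indices. The snippet below is only a minimal sketch of that building block under simplifying assumptions (1-D coordinates, a single feature vector, the `rotary_embed` helper and `base` frequency are illustrative choices), not the paper's actual query mechanism, which presumably operates on 2-D patch positions inside an attention module.

```python
import numpy as np

def rotary_embed(x, pos, base=10000.0):
    """Rotary position embedding at a continuous (real-valued) position.

    x   : (d,) feature vector, d even
    pos : float, continuous coordinate (e.g. a fractional patch position
          induced by a random crop/resize of one view)
    Channel pairs (x[2i], x[2i+1]) are rotated by pos * theta_i, so the dot
    product of two embedded vectors depends only on their relative offset.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    ang = pos * theta                           # continuous rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative property: the score depends only on the positional difference,
# which is what lets patches from two differently cropped views be matched.
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rotary_embed(q, 3.25) @ rotary_embed(k, 1.25)   # offset = 2.0
s2 = rotary_embed(q, 7.80) @ rotary_embed(k, 5.80)   # offset = 2.0
print(np.allclose(s1, s2))  # True
```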