Part of International Conference on Representation Learning 2025 (ICLR 2025) Conference
Qiming Huang, Han Hu, Jianbo Jiao
In Open Vocabulary Semantic Segmentation (OVS), we observe a consistent drop in model performance as the query vocabulary set expands, especially when it includes semantically similar and ambiguous vocabularies, such as ‘sofa’ and ‘couch’. The previous OVS evaluation protocol, however, does not account for such ambiguity, as any mismatch between model-predicted and human-annotated pairs is simply treated as incorrect on a pixel-wise basis. This contradicts the open nature of OVS, where ambiguous categories may both be correct from an open-world perspective. To address this, we study the open nature of OVS and propose a mask-wise evaluation protocol based on matched and mismatched mask pairs between prediction and annotation. Extensive experimental evaluations show that, from an open-world perspective, the proposed mask-wise protocol provides a more effective and reliable evaluation framework for OVS models than the previous pixel-wise approach. Moreover, analysis of mismatched mask pairs reveals that a large number of ambiguous categories exist in commonly used OVS datasets. Interestingly, we find that reducing these ambiguities during both training and inference enhances the capabilities of OVS models. These findings and the new evaluation protocol encourage further exploration of the open nature of OVS, as well as broader open-world challenges. Project page: https://qiming-huang.github.io/RevisitOVS/.
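To make the idea of matched and mismatched mask pairs concrete, the following is a minimal illustrative sketch and not the protocol proposed in the paper. It assumes masks are given as sets of pixel coordinates, pairs each annotated mask with its best-overlapping predicted mask, and uses a hypothetical IoU threshold of 0.5. The helper names (`iou`, `pair_masks`, `rect`) and the toy data are assumptions introduced only for this example.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pair_masks(predictions, annotations, iou_threshold=0.5):
    """Pair each annotated mask with its best-overlapping predicted mask.

    predictions / annotations: lists of (label, pixel_set) tuples.
    Returns (matched, mismatched), each a list of
    (annotated_label, predicted_label, iou) tuples.
    The threshold is an assumption of this sketch, not a value from the paper.
    """
    matched, mismatched = [], []
    for gt_label, gt_mask in annotations:
        best_label, best_iou = None, 0.0
        for pred_label, pred_mask in predictions:
            score = iou(gt_mask, pred_mask)
            if score > best_iou:
                best_label, best_iou = pred_label, score
        if best_iou < iou_threshold:
            continue  # no predicted mask overlaps enough to form a pair
        pair = (gt_label, best_label, best_iou)
        # Same label: matched pair. Different label on the same region:
        # mismatched pair, a candidate for an ambiguous category pair.
        (matched if best_label == gt_label else mismatched).append(pair)
    return matched, mismatched


def rect(r0, r1, c0, c1):
    """Hypothetical helper: a rectangular mask as a set of pixels."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy example: the model segments the sofa region perfectly but names it 'couch'.
annotations = [("sofa", rect(0, 4, 0, 4)), ("lamp", rect(0, 2, 6, 8))]
predictions = [("couch", rect(0, 4, 0, 4)), ("lamp", rect(0, 2, 6, 8))]
matched, mismatched = pair_masks(predictions, annotations)
print("matched:", matched)
print("mismatched:", mismatched)
```

In this toy case a pixel-wise metric would count every ‘sofa’ pixel as wrong, whereas the mask-level view separates the well-localized but differently named region into a mismatched pair that can then be inspected for ambiguity.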