Provably Safeguarding a Classifier from OOD and Adversarial Samples

Part of the International Conference on Learning Representations 2025 (ICLR 2025)


Authors

Nicolas Atienza, Johanne Cohen, Christophe Labreuche, Michèle Sebag

Abstract

This paper aims to transform a trained classifier into an abstaining classifier, such that the latter is provably protected from out-of-distribution and adversarial samples. The proposed Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE) approach relies on a Generalized Extreme Value (GEV) model of the training distribution in the latent space of the classifier. Under mild assumptions, this GEV model allows for formally characterizing out-of-distribution and adversarial samples and rejecting them. Empirical validation of the approach is conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and considers medium and large-sized datasets (CIFAR-10, CIFAR-100, and ImageNet). The results show the stability and frugality of the GEV model and demonstrate SPADE's efficiency compared to state-of-the-art methods.
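
The core idea sketched in the abstract is to fit a GEV distribution to extreme-value statistics of training samples in the classifier's latent space, then abstain on any input whose score lies too far in the fitted tail. Below is a minimal, hypothetical illustration of that idea, not the authors' SPADE implementation: the choice of latent score, the rejection threshold `alpha`, and the helper names `fit_gev` and `abstaining_predict` are all illustrative assumptions; only SciPy's `genextreme` is a real API.

```python
# Hypothetical sketch of GEV-based abstention (not the authors' SPADE code).
import numpy as np
from scipy.stats import genextreme

def fit_gev(scores):
    """Fit a Generalized Extreme Value distribution to latent-space scores
    collected on the training set. Returns (shape, loc, scale)."""
    return genextreme.fit(scores)

def abstaining_predict(logits, latent_score, gev_params, alpha=0.01):
    """Return the predicted class, or None (abstain) when the latent score
    is too unlikely under the fitted GEV model of the training tail."""
    c, loc, scale = gev_params
    # Survival probability of the observed score under the GEV fit.
    p = genextreme.sf(latent_score, c, loc=loc, scale=scale)
    if p < alpha:      # score lies beyond the modeled training tail
        return None    # abstain: plausibly OOD or adversarial
    return int(np.argmax(logits))

# Usage with synthetic data standing in for real latent scores.
rng = np.random.default_rng(0)
train_scores = rng.gumbel(loc=5.0, scale=1.0, size=2000)  # proxy maxima
params = fit_gev(train_scores)
logits = np.array([0.1, 2.3, -0.5])
print(abstaining_predict(logits, latent_score=4.8, gev_params=params))   # class 1
print(abstaining_predict(logits, latent_score=30.0, gev_params=params))  # None
```

Run as-is, the snippet prints a class index for the in-tail score and None (abstention) for the far-tail one; the paper's formal guarantees and its exact latent score are beyond this sketch.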