Information Theoretic Text-to-Image Alignment

Part of International Conference on Representation Learning 2025 (ICLR 2025) Conference

Bibtex Paper

Authors

Chao Wang, Giulio Franzese, alessandro finamore, Massimo Gallo, Pietro Michiardi

Abstract

Diffusion models for Text-to-Image (T2I) conditional generation have recently achievedtremendous success. Yet, aligning these models with user’s intentions still involves alaborious trial-and-error process, and this challenging alignment problem has attractedconsiderable attention from the research community. In this work, instead of relying onfine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-languagemodels, we use Mutual Information (MI) to guide model alignment. In brief, our methoduses self-supervised fine-tuning and relies on a point-wise MI estimation between promptsand images to create a synthetic fine-tuning set for improving model alignment. Ouranalysis indicates that our method is superior to the state-of-the-art, yet it only requiresthe pre-trained denoising network of the T2I model itself to estimate MI, and a simplefine-tuning strategy that improves alignment while maintaining image quality. Code available at https://github.com/Chao0511/mitune.