Part of the International Conference on Learning Representations 2025 (ICLR 2025) Conference
Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo
Sparse autoencoders (SAEs) have gained considerable attention as a promising tool for improving the interpretability of large language models (LLMs) by mapping the complex superposition of *polysemantic* neurons onto *monosemantic* features and composing a sparse dictionary of words. However, traditional performance metrics such as Mean Squared Error (MSE) and $\mathrm{L}_{0}$ sparsity ignore the semantic representational power of SAEs: whether they can acquire interpretable monosemantic features while preserving the semantic relationships among words. For instance, it is not obvious whether a learned sparse feature can distinguish the different meanings of a single word. In this paper, we propose a suite of evaluations for SAEs that analyzes the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-$\mathrm{L}_0$ Pareto frontier do not necessarily enhance the extraction of monosemantic features, so progress on this frontier should not be conflated with progress in interpretability. Analyzing SAEs with polysemous words also sheds light on the internal mechanisms of LLMs: deeper layers and the Attention module contribute to distinguishing the different senses of a word. Our semantics-focused evaluation offers new insights into polysemy and the existing SAE objective, and contributes to the development of more practical SAEs.
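For context, the sketch below illustrates the standard SAE setup the abstract refers to: a sparse autoencoder over LLM activations trained with an MSE reconstruction loss plus an $\mathrm{L}_1$ penalty (the usual differentiable proxy for $\mathrm{L}_0$), together with the two conventional metrics mentioned above. This is a minimal, assumed reference implementation, not the paper's; names such as `SparseAutoencoder`, `d_dict`, and `l1_coeff`, and the dimensions used, are illustrative.

```python
# Minimal illustrative sketch (assumed, not the paper's code): a standard sparse
# autoencoder over LLM activations, trained with MSE + L1, and the two metrics
# the abstract mentions (reconstruction MSE and L0 sparsity).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activations -> dictionary features
        self.decoder = nn.Linear(d_dict, d_model)  # features -> reconstructed activations

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(z)
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    mse = ((x - x_hat) ** 2).mean()        # reconstruction error (MSE)
    l1 = z.abs().sum(dim=-1).mean()        # L1 penalty, a differentiable proxy for L0
    return mse + l1_coeff * l1


@torch.no_grad()
def l0_sparsity(z, eps: float = 1e-8):
    """Average number of active features per token (the L0 metric)."""
    return (z.abs() > eps).float().sum(dim=-1).mean()


# Usage: in practice, x would be activations collected from an LLM layer;
# here a random batch stands in as a placeholder.
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)
x = torch.randn(32, 768)
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
print(float(loss), float(l0_sparsity(z)))
```

The paper's point can be read against this sketch: driving the MSE and $\mathrm{L}_0$ terms down improves the Pareto frontier these metrics define, but neither term checks whether individual dictionary features separate the distinct senses of a polysemous word.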