arXiv:2505.05111

Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders


This paper uses Sparse Autoencoders to identify and manipulate language-specific features in Large Language Models. It introduces a monolinguality metric, demonstrates the context dependency of these features via code-switching, enhances steering vectors for finer control over multilingual generation, and reveals significant language-specific impacts through ablation studies.

Large Language Model, Multilinguality, Representation Learning, Pre-training, Interpretability

Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng

Tongyi Lab, Alibaba Group Inc.; Institute of Dataspace, Hefei, Anhui, China

Generated by grok-3

Background Problem

The growing demand for multilingual capabilities in Large Language Models (LLMs) necessitates a deeper understanding of the mechanisms behind these abilities. Previous neuron-based and internal-activation-based methods to analyze multilingual capabilities suffer from issues like superposition (multiple concepts in a single neuron) and layer-wise activation variance, leading to unreliable interpretations. This paper aims to address these limitations by using Sparse Autoencoders (SAEs) to decompose LLM activations into interpretable, language-specific features, thereby uncovering how LLMs handle multiple languages and enabling targeted control over language generation.

Method

The core method involves using Sparse Autoencoders (SAEs) to decompose LLM activations into sparse linear combinations of features at each layer, focusing on the residual stream for interpretability. A novel monolinguality metric, defined as the difference between a feature's mean activation on a specific language and its mean activation on other languages ($\nu_{s}^{L} = \mu_{s}^{L} - \gamma_{s}^{L}$), is proposed to identify language-specific features. Experiments include analyzing activations on monolingual texts from the Flores-10 dataset, exploring context dependency via a code-switching dataset, and performing directional ablation to 'zero out' specific features and observe language-specific impacts. Additionally, language-specific features are used as gating signals to enhance steering vectors, selectively applying them to control language output in LLMs, avoiding unnecessary interference with non-target language tokens.
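
To make the metric and the ablation step concrete, here is a minimal PyTorch sketch. The per-language activation dictionary, tensor shapes, and function names are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of the monolinguality metric and directional ablation.
# Shapes, names, and the activation dictionary are illustrative assumptions.
import torch


def monolinguality(acts: dict[str, torch.Tensor], s: int, lang: str) -> float:
    """Compute nu_s^L = mu_s^L - gamma_s^L for SAE feature index s.

    acts maps a language code to SAE feature activations of shape
    (num_tokens, num_features), collected on monolingual text.
    """
    mu = acts[lang][:, s].mean()  # mean activation on language L
    others = torch.cat([a[:, s] for l, a in acts.items() if l != lang])
    gamma = others.mean()         # mean activation on all other languages
    return (mu - gamma).item()


def directional_ablation(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """'Zero out' a feature by projecting it out of the residual stream.

    resid: (..., d_model) residual-stream activations.
    direction: (d_model,) decoder direction of the SAE feature.
    """
    d = direction / direction.norm()        # unit feature direction
    proj = (resid @ d).unsqueeze(-1) * d    # component of resid along d
    return resid - proj                     # residual stream with the feature removed
```

Ranking a layer's features by this score and keeping the highest-scoring ones yields candidates for the language-specific features the paper analyzes; ablating several features together corresponds to projecting out each direction in turn.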

Experiment

The experiments were conducted on multiple LLMs (Gemma 2 2B, Gemma 2 9B, Llama-3.1-8B) using the Flores-10 dataset (a subset of Flores-200 with 10 languages) for monolingual analysis and a custom code-switching dataset generated by GPT-4o (5 sentences per language with noun substitutions). The setup included calculating the monolinguality metric to rank language-specific features, showing high activation differences for top features in specific languages. Code-switching results indicated that language-specific features are context-dependent, with prefixes altering feature activations (e.g., Spanish prefixes increasing Spanish feature activation for non-Spanish nouns). Ablation studies demonstrated that removing language-specific features significantly increased cross-entropy (CE) loss for the target language while minimally affecting others, with synergistic effects observed when ablating multiple features together. Steering vector enhancements using SAE features outperformed baseline methods in tasks like Adversarial Language Identification and Cross-Lingual Continuation, achieving higher success rates (up to 97% in some cases) and lower CE loss on non-target languages. However, the experimental setup has limitations: the code-switching dataset is small, potentially lacking representativeness, and results vary across languages and models, suggesting inconsistent performance. While the method shows promise, the results partially match expectations, with gaps in robustness for certain languages and tasks.
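
The gated steering evaluated above can be pictured with the following sketch, which applies a steering vector only on tokens where a language-specific SAE feature fires. The gating condition, threshold, and scaling factor `alpha` are assumptions for illustration and may differ from the paper's exact formulation.

```python
import torch


def gated_steering(resid: torch.Tensor,
                   steer_vec: torch.Tensor,
                   gate_direction: torch.Tensor,
                   threshold: float = 0.0,
                   alpha: float = 4.0) -> torch.Tensor:
    """Apply `steer_vec` only on tokens where the gate feature activates.

    resid:          (batch, seq, d_model) residual-stream activations.
    steer_vec:      (d_model,) direction pushing output toward the target language.
    gate_direction: (d_model,) SAE direction of the language-specific feature.
    """
    gate = resid @ gate_direction            # (batch, seq) feature activation
    mask = (gate > threshold).unsqueeze(-1)  # (batch, seq, 1) boolean gate
    # Steer only gated tokens, leaving the remaining tokens untouched; this is
    # the selective application that avoids interference with non-target tokens.
    return torch.where(mask, resid + alpha * steer_vec, resid)
```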

Further Thoughts

The identification of language-specific features via SAEs offers a promising direction for improving multilingual model training, particularly in low-resource languages where such features could guide data augmentation or targeted fine-tuning. However, the inconsistent performance of enhanced steering vectors across languages raises questions about underlying linguistic biases in LLMs—could these features reflect training data imbalances rather than true linguistic distinctions? This connects to broader research in AI ethics and fairness, where biased representations can perpetuate inequities in language processing. Additionally, the synergistic effects of feature ablation suggest potential overlaps in how LLMs encode related languages (e.g., Romance languages like Spanish and French), which could be explored further using phylogenetic linguistic frameworks or cross-lingual transfer learning studies. Future work might also investigate whether training SAEs on curated multilingual datasets, as suggested by the authors, could uncover more nuanced features, potentially linking to emergent abilities in foundation models under scaling laws. This paper’s approach could inspire novel alignment techniques, such as using language-specific features to enforce safety constraints in multilingual contexts, ensuring culturally appropriate responses.


