Tag: Alignment
All the articles with the tag "Alignment".
-
Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?
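This paper identifies and validates the "shallow preference signal" phenomenon: reward models and DPO models trained on truncated preference datasets (keeping only the first 40%-50% of tokens) perform on par with, or even better than, those trained on the full datasets. It also reveals a limitation of current alignment methods, namely their excessive focus on early tokens.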
-
Cross-Lingual Optimization for Language Transfer in Large Language Models
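This paper proposes Cross-Lingual Optimization (CLO), which uses translated data and a modified DPO strategy to effectively transfer English-centric large language models to a target language. It significantly improves target-language performance while preserving English ability, and in low-resource languages in particular it outperforms conventional SFT with less data.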
-
Activation Space Interventions Can Be Transferred Between Large Language Models
This paper demonstrates that activation space interventions for AI safety, such as backdoor removal and refusal behavior, can be transferred between large language models using autoencoder mappings, enabling smaller models to align larger ones, though challenges remain in cross-architecture transfers and complex tasks like corrupted capabilities.
-
Improving Multilingual Language Models by Aligning Representations through Steering
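This paper proposes a representation steering method that adjusts the layer-wise representations of large language models to improve performance on multilingual tasks. Experiments show that it outperforms basic prompting across a variety of tasks and approaches the translation baseline, but it degrades English-task performance and yields limited gains for low-resource languages.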
-
Latent Principle Discovery for Language Model Self-Improvement
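This paper proposes the STaPLe algorithm, which uses Monte Carlo EM to automate the discovery and learning of latent principles for language model self-improvement. It significantly improves small-model performance on multiple instruction-following benchmarks, and it produces human-interpretable constitutions through clustering.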