Tag: Alignment
All the articles with the tag "Alignment".
-   Reverse Preference Optimization for Complex Instruction Following. This paper proposes Reverse Preference Optimization (RPO), which eliminates noise in preference pairs by dynamically reversing the unsatisfied constraints in instructions; it significantly outperforms DPO baselines on multi-turn complex instruction-following tasks and surpasses GPT-4o at the 70B scale.
-   From Distributional to Overton Pluralism: Investigating Large Language Model Alignment. By analyzing how LLM output distributions change before and after alignment, this paper shows that alignment reduces distributional pluralism but achieves Overton pluralism through longer responses, and that base models can effectively mimic aligned-model behavior via in-context learning, supporting the superficial alignment hypothesis.
-   Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models. This paper proposes the Residual Alignment Model (RAM), which uses importance sampling to detach the alignment module, enabling efficient sequence-level training and token-level decoding; it significantly improves performance on multiple alignment tasks while reducing resource costs.
-   HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models. This paper proposes Head-Specific Intervention (HSI), which intervenes on the activations of targeted attention heads to induce Llama 2 to bypass its safety alignment on AI coordination behaviors, outperforming supervised fine-tuning and other intervention strategies.
-   Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes. This paper introduces Latent Preference Coding (LPC), a framework that uses discrete latent codes to model multifaceted human preferences, consistently improving the performance of offline alignment algorithms like DPO, SimPO, and IPO across multiple LLMs and benchmarks.