Tag: Reinforcement Learning

All the articles with the tag "Reinforcement Learning".

InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models

Published: 22 May, 2025 at 11:19 AM

85.18 🤔

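InfiFPO proposes a preference optimization method that performs implicit model fusion during the preference alignment stage. Using sequence-level probability fusion and optimization strategies, it integrates knowledge from multiple source models into a pivot model, raising Phi-4's average performance across 11 benchmarks from 79.95 to 83.33.

Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation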

Published: 17 May, 2025 at 11:01 AM

85.16 🤔

This paper introduces Adaptive Difficulty Curriculum Learning (ADCL) and Expert-Guided Self-Reformulation (EGSR) to enhance LLM reasoning by dynamically adjusting training curricula and guiding models to reformulate expert solutions, achieving significant performance improvements over standard RL baselines on mathematical reasoning benchmarks.

ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

Published: 1 Jun, 2025 at 11:53 AM

85.15 🤔

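ReMA uses multi-agent reinforcement learning to separate meta-thinking from reasoning, improving large language model performance on mathematical reasoning and LLM-as-a-Judge tasks, with particularly strong out-of-distribution generalization. However, it is sensitive to hyperparameters and faces stability challenges in multi-turn settings.

Concise Reasoning via Reinforcement Learning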

Published: 17 May, 2025 at 11:21 PM

85.10 🤔

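This paper proposes a two-stage reinforcement learning training strategy that optimizes reasoning ability and conciseness in separate stages on a very small dataset. It substantially reduces the response length of large language models (by up to 54%) while maintaining or even improving accuracy, and it improves robustness at low sampling intensity.

Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective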

Published: 5 Jun, 2025 at 11:23 AM

85.08 🤔

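This paper proposes the 'Trajectory Policy Gradient Theorem', which proves theoretically that in online reinforcement learning for LLMs, response-level rewards alone suffice for an unbiased estimate of the policy gradient under token-level rewards. Building on this, it designs the TRePO algorithm, which simplifies PPO's design while retaining token-level modeling capability.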
