Tag: Fine-tuning
All the articles with the tag "Fine-tuning".
-
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
RADLADS introduces a cost-effective three-step distillation protocol to convert softmax attention transformers into linear attention models using only 350-700M tokens, achieving near-teacher performance on benchmarks and setting a new state-of-the-art for pure RNNs with models up to 72B parameters.
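As a rough illustration of the conversion idea (not the paper's actual code), the sketch below shows a kernelized linear-attention block and a first-stage alignment loss that trains it to mimic a teacher's softmax-attention output; function names, the elu+1 feature map, and the non-causal formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized, softmax-free attention (non-causal, for brevity).
    q, k: (batch, seq, d_k); v: (batch, seq, d_v)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                       # non-negative feature map
    kv = torch.einsum("bnd,bne->bde", k, v)                 # sum_n of k_n v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def attention_alignment_loss(q, k, v, teacher_out):
    # Train the student's linear-attention block to reproduce the teacher's
    # softmax-attention output on the same inputs; RADLADS' full protocol
    # has further distillation stages beyond this alignment step.
    return F.mse_loss(linear_attention(q, k, v), teacher_out)
```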
-
Merge to Mix: Mixing Datasets via Model Merging
This paper proposes *Merge to Mix*, which uses model merging as a surrogate to efficiently select dataset mixtures for fine-tuning large models; it significantly outperforms conventional selection methods on image classification and language tasks, approaching and in some cases exceeding oracle performance.
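A minimal sketch of the surrogate idea, assuming uniform weight averaging over per-dataset fine-tuned checkpoints (the helper names and the selection loop are hypothetical):

```python
import torch

def merge_models(state_dicts):
    """Uniform weight averaging: the merged model acts as a cheap proxy
    for a model fine-tuned on the combined datasets."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Hypothetical selection loop: fine-tune once per candidate dataset, then
# score merged models for each mixture instead of re-fine-tuning per mixture.
# best_mix = max(candidate_mixtures,
#                key=lambda mix: validate(merge_models([ckpt[d] for d in mix])))
```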
-
MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning
This paper proposes MELoRA, which stacks multiple mini LoRA modules in parallel to achieve a higher equivalent rank, significantly outperforming LoRA on natural language understanding and instruction-following tasks with fewer trainable parameters.
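A minimal sketch of the parallel-stacking idea, assuming a block-diagonal arrangement of n mini LoRA pairs (class and parameter names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class MiniEnsembleLoRA(nn.Module):
    """n small LoRA pairs arranged block-diagonally. Equivalent rank is n*r,
    while parameters scale as r*(d_in + d_out) instead of the
    (n*r)*(d_in + d_out) needed by a single rank-(n*r) LoRA."""
    def __init__(self, d_in, d_out, n=4, r=2, alpha=1.0):
        super().__init__()
        assert d_in % n == 0 and d_out % n == 0
        self.n, self.scale = n, alpha / r
        self.A = nn.Parameter(torch.randn(n, d_in // n, r) * 0.01)  # down-projections
        self.B = nn.Parameter(torch.zeros(n, r, d_out // n))        # up-projections, zero init

    def forward(self, x):
        # Split features into n groups; each group gets its own mini LoRA.
        batch_shape = x.shape[:-1]
        xs = x.reshape(*batch_shape, self.n, -1)                 # (..., n, d_in/n)
        ys = torch.einsum("...ni,nir,nro->...no", xs, self.A, self.B)
        return ys.reshape(*batch_shape, -1) * self.scale         # (..., d_out)
```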
-
Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition
Through linear probing and neuron-activation analysis, this paper replicates and extends research on the roles of pre-training and fine-tuning in knowledge acquisition for dense retrieval models. It finds that pre-trained knowledge dominates retrieval effectiveness in DPR models and that fine-tuning disperses that knowledge, but this conclusion does not hold across other architectures (e.g., Contriever, RepLlama) and representation strategies.
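A minimal sketch of the linear-probing methodology, assuming frozen retriever embeddings and labeled probing examples (data layout and function name are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(embeddings, labels):
    """embeddings: (n_examples, hidden_dim) features from a frozen checkpoint;
    labels: the knowledge category each probing example tests."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Comparing probe accuracy on pre-trained vs. fine-tuned embeddings indicates
# how much fine-tuning reshaped the linearly decodable knowledge.
```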
-
Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning
This paper proposes RLKD, a reinforcement-learning-based knowledge distillation framework that uses a generative structure reward model (GSRM) to transfer the implicit multi-branch structure of a teacher model's reasoning to a student model; experiments show it significantly outperforms SFT and standard RL methods on math and question-answering tasks.
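As a very rough sketch of how a structure reward could drive distillation, the snippet below uses a plain REINFORCE update; the reward values are assumed to come from something like the paper's GSRM, and all names here are placeholders rather than the authors' implementation:

```python
import torch

def rlkd_step(student_logprobs, structure_rewards, optimizer):
    """student_logprobs: (batch,) summed log-probs of sampled reasoning traces,
    with grad. structure_rewards: (batch,) detached scores of how well each
    trace matches the teacher's multi-branch reasoning structure."""
    baseline = structure_rewards.mean()                      # simple variance reduction
    loss = -((structure_rewards - baseline) * student_logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```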