Tag: Fine-tuning
All the articles with the tag "Fine-tuning".
-
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
RADLADS introduces a cost-effective three-step distillation protocol to convert softmax attention transformers into linear attention models using only 350-700M tokens, achieving near-teacher performance on benchmarks and setting a new state-of-the-art for pure RNNs with models up to 72B parameters.
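As a rough illustration of the conversion idea (not the paper's actual code), the sketch below shows a kernelized linear-attention block and a first-stage alignment loss that trains it to mimic a teacher's softmax-attention output; function names, the elu+1 feature map, and the non-causal formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized, softmax-free attention (non-causal, for brevity).
    q, k: (batch, seq, d_k); v: (batch, seq, d_v)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                       # non-negative feature map
    kv = torch.einsum("bnd,bne->bde", k, v)                 # sum_n of k_n v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def attention_alignment_loss(q, k, v, teacher_out):
    # Train the student's linear-attention block to reproduce the teacher's
    # softmax-attention output on the same inputs; RADLADS' full protocol
    # has further distillation stages beyond this alignment step.
    return F.mse_loss(linear_attention(q, k, v), teacher_out)
```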
-
Merge to Mix: Mixing Datasets via Model Merging
This paper proposes *Merge to Mix*, which uses model merging as a surrogate to efficiently select dataset mixtures for fine-tuning large models; it significantly outperforms conventional selection methods on image classification and language tasks, approaching and in some cases exceeding oracle performance.
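A minimal sketch of the surrogate idea, assuming uniform weight averaging over per-dataset fine-tuned checkpoints (the helper names and the selection loop are hypothetical):

```python
import torch

def merge_models(state_dicts):
    """Uniform weight averaging: the merged model acts as a cheap proxy
    for a model fine-tuned on the combined datasets."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Hypothetical selection loop: fine-tune once per candidate dataset, then
# score merged models for each mixture instead of re-fine-tuning per mixture.
# best_mix = max(candidate_mixtures,
#                key=lambda mix: validate(merge_models([ckpt[d] for d in mix])))
```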
-
MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning
This paper proposes MELoRA, which stacks multiple mini LoRA modules in parallel to achieve a higher equivalent rank, significantly outperforming LoRA on natural language understanding and instruction-following tasks with fewer trainable parameters.
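A minimal sketch of the parallel-stacking idea, assuming a block-diagonal arrangement of n mini LoRA pairs (class and parameter names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class MiniEnsembleLoRA(nn.Module):
    """n small LoRA pairs arranged block-diagonally. Equivalent rank is n*r,
    while parameters scale as r*(d_in + d_out) instead of the
    (n*r)*(d_in + d_out) needed by a single rank-(n*r) LoRA."""
    def __init__(self, d_in, d_out, n=4, r=2, alpha=1.0):
        super().__init__()
        assert d_in % n == 0 and d_out % n == 0
        self.n, self.scale = n, alpha / r
        self.A = nn.Parameter(torch.randn(n, d_in // n, r) * 0.01)  # down-projections
        self.B = nn.Parameter(torch.zeros(n, r, d_out // n))        # up-projections, zero init

    def forward(self, x):
        # Split features into n groups; each group gets its own mini LoRA.
        batch_shape = x.shape[:-1]
        xs = x.reshape(*batch_shape, self.n, -1)                 # (..., n, d_in/n)
        ys = torch.einsum("...ni,nir,nro->...no", xs, self.A, self.B)
        return ys.reshape(*batch_shape, -1) * self.scale         # (..., d_out)
```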
-
Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition
Through linear probing and neuron-activation analysis, this paper replicates and extends research on the roles of pre-training and fine-tuning in knowledge acquisition for dense retrieval models. It finds that pre-trained knowledge dominates retrieval effectiveness in DPR models and that fine-tuning disperses that knowledge, but this conclusion does not hold across other architectures (e.g., Contriever, RepLlama) and representation strategies.
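A minimal sketch of the linear-probing methodology, assuming frozen retriever embeddings and labeled probing examples (data layout and function name are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(embeddings, labels):
    """embeddings: (n_examples, hidden_dim) features from a frozen checkpoint;
    labels: the knowledge category each probing example tests."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Comparing probe accuracy on pre-trained vs. fine-tuned embeddings indicates
# how much fine-tuning reshaped the linearly decodable knowledge.
```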
-
Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning
This paper proposes RLKD, a reinforcement-learning-based knowledge distillation framework that uses a generative structure reward model (GSRM) to transfer the implicit multi-branch structure of a teacher model's reasoning to a student model; experiments show it significantly outperforms SFT and standard RL methods on math and question-answering tasks.
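As a very rough sketch of how a structure reward could drive distillation, the snippet below uses a plain REINFORCE update; the reward values are assumed to come from something like the paper's GSRM, and all names here are placeholders rather than the authors' implementation:

```python
import torch

def rlkd_step(student_logprobs, structure_rewards, optimizer):
    """student_logprobs: (batch,) summed log-probs of sampled reasoning traces,
    with grad. structure_rewards: (batch,) detached scores of how well each
    trace matches the teacher's multi-branch reasoning structure."""
    baseline = structure_rewards.mean()                      # simple variance reduction
    loss = -((structure_rewards - baseline) * student_logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```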