Tag: Transformer
All the articles with the tag "Transformer".
-
You Do Not Fully Utilize Transformer's Representation Capacity
This paper proposes Layer-Integrated Memory (LIMe), which integrates the Key-Value representations of all preceding layers through a learned cross-layer routing mechanism, substantially mitigating representation collapse in Transformers and achieving faster convergence and higher accuracy on language modeling, reasoning tasks, and deeper networks.
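As a rough illustration of the cross-layer routing idea (a sketch under my own assumptions, not the paper's exact formulation): each head learns softmax routing weights over the Key or Value tensors cached from earlier layers and attends over the resulting mixture. The class name and tensor shapes below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerKVRouter(nn.Module):
    """Hypothetical LIMe-style router: per-head mixing of K (or V) cached from earlier layers."""
    def __init__(self, num_prev_layers: int, num_heads: int):
        super().__init__()
        # One routing logit per (head, source layer); softmax gives the mixing weights.
        self.route_logits = nn.Parameter(torch.zeros(num_heads, num_prev_layers))

    def forward(self, kv_cache: list) -> torch.Tensor:
        # kv_cache: list of [batch, heads, seq, head_dim] tensors, one per earlier layer.
        stacked = torch.stack(kv_cache, dim=0)            # [L, B, H, S, D]
        weights = F.softmax(self.route_logits, dim=-1)    # [H, L]
        # Weighted sum over the layer axis, computed per head.
        return torch.einsum("hl,lbhsd->bhsd", weights, stacked)

# Usage: route keys from three earlier layers into an 8-head attention block.
router = CrossLayerKVRouter(num_prev_layers=3, num_heads=8)
cached_keys = [torch.randn(2, 8, 16, 64) for _ in range(3)]
mixed_keys = router(cached_keys)                          # [2, 8, 16, 64]
```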
-
LoKI: Low-damage Knowledge Implanting of Large Language Models
This paper proposes LoKI, a parameter-efficient fine-tuning framework that analyzes how knowledge is stored in Transformer FFN layers and applies a layer-balanced parameter-selection strategy, striking a competitive balance between downstream-task adaptation and preservation of pretrained knowledge.
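The general idea of updating only a selected, per-layer-balanced subset of FFN weights can be sketched as a gradient mask; note that the top-magnitude criterion and the "mlp" name filter below are placeholders of my own, not LoKI's actual selection rule.

```python
import torch
import torch.nn as nn

def make_ffn_grad_masks(model: nn.Module, per_layer_budget: int) -> dict:
    """Pick an equal budget of trainable entries per FFN weight matrix (illustrative criterion)."""
    masks = {}
    for name, param in model.named_parameters():
        if "mlp" in name and param.dim() == 2:            # crude FFN filter, assumption
            flat = param.detach().abs().flatten()
            k = min(per_layer_budget, flat.numel())
            idx = flat.topk(k).indices                    # placeholder: largest-magnitude entries
            mask = torch.zeros_like(flat)
            mask[idx] = 1.0
            masks[name] = mask.view_as(param)
    return masks

def apply_grad_masks(model: nn.Module, masks: dict) -> None:
    # Call after loss.backward() and before optimizer.step(): zero gradients outside the budget.
    for name, param in model.named_parameters():
        if name in masks and param.grad is not None:
            param.grad.mul_(masks[name])
```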
-
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
RADLADS introduces a cost-effective three-step distillation protocol to convert softmax attention transformers into linear attention models using only 350-700M tokens, achieving near-teacher performance on benchmarks and setting a new state-of-the-art for pure RNNs with models up to 72B parameters.
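For context on what the student side of such a conversion computes, here is a generic causal linear-attention forward pass; the ELU+1 feature map is a common textbook choice, not RADLADS' specific recipe. It shows how softmax(QK^T)V is replaced by prefix sums, which admit constant-size recurrent state per token during decoding.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: [batch, heads, seq, dim]; phi(x) = elu(x) + 1 keeps features positive.
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bhsd,bhse->bhsde", k, v).cumsum(dim=2)  # prefix sums of k_s v_s^T
    z = k.cumsum(dim=2)                                        # prefix sums of phi(k_s)
    num = torch.einsum("bhsd,bhsde->bhse", q, kv)
    den = torch.einsum("bhsd,bhsd->bhs", q, z).unsqueeze(-1) + eps
    return num / den

out = causal_linear_attention(torch.randn(1, 4, 8, 32),
                              torch.randn(1, 4, 8, 32),
                              torch.randn(1, 4, 8, 32))        # [1, 4, 8, 32]
```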
-
Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
This paper proposes the Rodimus and Rodimus+ models, which use data-dependent temperature selection (DDTS) and sliding-window shared-key attention (SW-SKA) to substantially reduce the computational and memory complexity of large language models while preserving performance, challenging the accuracy-efficiency trade-off.
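A minimal sketch of only the sliding-window component: a causal mask in which each position attends to at most `window` recent tokens. Rodimus+'s actual SW-SKA (with shared keys) and DDTS are more involved; this just illustrates the windowing, and the function name is hypothetical.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True means "may attend": position i sees positions j with i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=6, window=3)
# Row 5 is True only at columns 3, 4, 5: the last three tokens.
```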
-
An Analysis for Reasoning Bias of Language Models with Small Initialization
Through theoretical analysis and experimental validation, this paper reveals how a small parameter-initialization scale shapes the embedding space and training dynamics, biasing large language models toward reasoning tasks rather than memorization.
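A tiny sketch of the experimental knob discussed above: shrinking the standard deviation used to initialize embedding weights. The helper name and default value are illustrative, not taken from the paper.

```python
import torch.nn as nn

def init_embeddings_small(model: nn.Module, init_scale: float = 0.01) -> None:
    """Re-initialize all embedding tables with a small standard deviation."""
    # A smaller init_scale starts embeddings closer together, which the paper
    # argues steers training dynamics toward reasoning-like rather than
    # memorization-like solutions.
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=init_scale)
```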