Tag: Transformer
All the articles with the tag "Transformer".
-
You Do Not Fully Utilize Transformer's Representation Capacity
This paper proposes Layer-Integrated Memory (LIMe), which integrates the Key-Value representations of all preceding layers through a learned cross-layer routing mechanism, substantially mitigating representation collapse in Transformers and achieving faster convergence and higher accuracy on language modeling, reasoning tasks, and deeper networks.
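As a rough illustration of the cross-layer routing idea (a sketch under my own assumptions, not the paper's exact formulation): each head learns softmax routing weights over the Key or Value tensors cached from earlier layers and attends over the resulting mixture. The class name and tensor shapes below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerKVRouter(nn.Module):
    """Hypothetical LIMe-style router: per-head mixing of K (or V) cached from earlier layers."""
    def __init__(self, num_prev_layers: int, num_heads: int):
        super().__init__()
        # One routing logit per (head, source layer); softmax gives the mixing weights.
        self.route_logits = nn.Parameter(torch.zeros(num_heads, num_prev_layers))

    def forward(self, kv_cache: list) -> torch.Tensor:
        # kv_cache: list of [batch, heads, seq, head_dim] tensors, one per earlier layer.
        stacked = torch.stack(kv_cache, dim=0)            # [L, B, H, S, D]
        weights = F.softmax(self.route_logits, dim=-1)    # [H, L]
        # Weighted sum over the layer axis, computed per head.
        return torch.einsum("hl,lbhsd->bhsd", weights, stacked)

# Usage: route keys from three earlier layers into an 8-head attention block.
router = CrossLayerKVRouter(num_prev_layers=3, num_heads=8)
cached_keys = [torch.randn(2, 8, 16, 64) for _ in range(3)]
mixed_keys = router(cached_keys)                          # [2, 8, 16, 64]
```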
-
LoKI: Low-damage Knowledge Implanting of Large Language Models
This paper proposes LoKI, a parameter-efficient fine-tuning framework that analyzes how knowledge is stored in Transformer FFN layers and applies a layer-balanced parameter-selection strategy, striking a competitive balance between downstream-task adaptation and preservation of pretrained knowledge.
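The general idea of updating only a selected, per-layer-balanced subset of FFN weights can be sketched as a gradient mask; note that the top-magnitude criterion and the "mlp" name filter below are placeholders of my own, not LoKI's actual selection rule.

```python
import torch
import torch.nn as nn

def make_ffn_grad_masks(model: nn.Module, per_layer_budget: int) -> dict:
    """Pick an equal budget of trainable entries per FFN weight matrix (illustrative criterion)."""
    masks = {}
    for name, param in model.named_parameters():
        if "mlp" in name and param.dim() == 2:            # crude FFN filter, assumption
            flat = param.detach().abs().flatten()
            k = min(per_layer_budget, flat.numel())
            idx = flat.topk(k).indices                    # placeholder: largest-magnitude entries
            mask = torch.zeros_like(flat)
            mask[idx] = 1.0
            masks[name] = mask.view_as(param)
    return masks

def apply_grad_masks(model: nn.Module, masks: dict) -> None:
    # Call after loss.backward() and before optimizer.step(): zero gradients outside the budget.
    for name, param in model.named_parameters():
        if name in masks and param.grad is not None:
            param.grad.mul_(masks[name])
```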
-
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
RADLADS introduces a cost-effective three-step distillation protocol to convert softmax attention transformers into linear attention models using only 350-700M tokens, achieving near-teacher performance on benchmarks and setting a new state-of-the-art for pure RNNs with models up to 72B parameters.
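For context on what the student side of such a conversion computes, here is a generic causal linear-attention forward pass; the ELU+1 feature map is a common textbook choice, not RADLADS' specific recipe. It shows how softmax(QK^T)V is replaced by prefix sums, which admit constant-size recurrent state per token during decoding.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: [batch, heads, seq, dim]; phi(x) = elu(x) + 1 keeps features positive.
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bhsd,bhse->bhsde", k, v).cumsum(dim=2)  # prefix sums of k_s v_s^T
    z = k.cumsum(dim=2)                                        # prefix sums of phi(k_s)
    num = torch.einsum("bhsd,bhsde->bhse", q, kv)
    den = torch.einsum("bhsd,bhsd->bhs", q, z).unsqueeze(-1) + eps
    return num / den

out = causal_linear_attention(torch.randn(1, 4, 8, 32),
                              torch.randn(1, 4, 8, 32),
                              torch.randn(1, 4, 8, 32))        # [1, 4, 8, 32]
```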
-
Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions
This paper proposes the Rodimus and Rodimus+ models, which use data-dependent temperature selection (DDTS) and sliding-window shared-key attention (SW-SKA) to substantially reduce the computational and memory complexity of large language models while preserving performance, challenging the accuracy-efficiency trade-off.
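A minimal sketch of only the sliding-window component: a causal mask in which each position attends to at most `window` recent tokens. Rodimus+'s actual SW-SKA (with shared keys) and DDTS are more involved; this just illustrates the windowing, and the function name is hypothetical.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True means "may attend": position i sees positions j with i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=6, window=3)
# Row 5 is True only at columns 3, 4, 5: the last three tokens.
```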
-
An Analysis for Reasoning Bias of Language Models with Small Initialization
Through theoretical analysis and experimental validation, this paper reveals how a small parameter-initialization scale shapes the embedding space and training dynamics, biasing large language models toward reasoning tasks rather than memorization.
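A tiny sketch of the experimental knob discussed above: shrinking the standard deviation used to initialize embedding weights. The helper name and default value are illustrative, not taken from the paper.

```python
import torch.nn as nn

def init_embeddings_small(model: nn.Module, init_scale: float = 0.01) -> None:
    """Re-initialize all embedding tables with a small standard deviation."""
    # A smaller init_scale starts embeddings closer together, which the paper
    # argues steers training dynamics toward reasoning-like rather than
    # memorization-like solutions.
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=init_scale)
```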