Tag: Transformer
All the articles with the tag "Transformer".
-
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
本文通过在softmax注意力机制的SDPA输出后引入头特定sigmoid门控机制,显著提升了15B MoE和1.7B密集模型的性能、训练稳定性和长上下文泛化能力,同时消除了注意力沉积现象。
-
RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning
本文提出 RaaS 算法,通过识别推理任务中的里程碑令牌并采用 LRU 缓存策略管理 KV 向量,在保持高准确性的同时实现了 O(L) 的时间和内存复杂度,显著优于现有方法如 Quest 的内存效率。
-
Foundation Models For Seismic Data Processing: An Extensive Review
This paper conducts an extensive review of natural image foundation models for seismic data processing, demonstrating that hierarchical models like Swin and ConvNeXt, especially with self-supervised pre-training, outperform non-hierarchical ones in demultiple, interpolation, and denoising tasks, while highlighting the benefits and limitations of natural image pre-training for seismic applications.
-
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
本文通过提出Gather-and-Aggregate (G&A)机制,揭示了Transformer和SSM模型在上下文检索能力上的性能差距主要源于少数关键头部的实现差异,并通过混合模型实验验证了注意力机制在改进SSM检索能力上的潜力。
-
Born a Transformer -- Always a Transformer?
本文通过检索和复制任务研究Transformer的长度泛化限制,发现预训练选择性增强了归纳能力(向右/向前任务),但无法克服架构固有局限,微调可平衡不对称性但仍受理论约束。