Tag: Transformer
All the articles with the tag "Transformer".
-
QKV Projections Require a Fraction of Their Memory
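This paper proposes PAMM, a method that approximates the input tensor by randomly selecting representative tokens, substantially reducing the memory footprint of the Q, K, and V projections in attention (by up to 512×) while largely preserving model performance in both pretraining and fine-tuning.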
-
Always Skip Attention
This paper theoretically demonstrates the ill-conditioning of Self-Attention Blocks in Vision Transformers without skip connections, highlights their role as regularizers, and proposes Token Graying (SVD and DCT) to improve input token conditioning, achieving modest performance gains in supervised and self-supervised tasks.
-
Do Language Models Use Their Depth Efficiently?
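Through residual-stream analysis and intervention experiments on Llama 3.1 and Qwen 3 models, this paper finds that large language models do not use their depth efficiently: layers in the second half mainly refine the probability distribution rather than perform new computation, and processing depth is independent of input complexity, suggesting that current architectures and training objectives need improvement.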
-
From Compression to Expansion: A Layerwise Analysis of In-Context Learning
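Using statistical geometric analysis, this paper reveals a layerwise compression-expansion phenomenon in large language models during in-context learning: early layers compress task information, and later layers expand it to generate predictions. It also examines how model size, number of demonstrations, and noise affect performance.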
-
Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It
This paper introduces geodesic sharpness, a novel measure using Riemannian geometry to account for transformer symmetries on a quotient manifold, demonstrating stronger correlations with generalization across diagonal networks, vision transformers, and language models compared to traditional adaptive sharpness.