Tag: Transformer

All the articles with the tag "Transformer".

Don't be lazy: CompleteP enables compute-efficient deep transformers

Published: 11 May, 2025 at 11:16 AM

81.10 🤔

This paper introduces CompleteP, a parameterization for transformers with α = 1, which ensures depth-wise hyperparameter transfer and complete feature learning, achieving 12-34% compute efficiency improvements and enabling a wider range of compute-optimal width-to-depth ratios.

Beyond Next Token Prediction: Patch-Level Training for Large Language Models

Published: 19 May, 2025 at 11:18 AM

79.83 🤔

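This paper proposes a patch-level training method that aggregates multiple tokens into information-dense patches and trains large language models in stages, maintaining or even slightly improving model performance while halving training cost.

Does Self-Attention Need Separate Weights in Transformers?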

Published: 11 May, 2025 at 11:12 AM

79.57 🤔

This paper introduces a shared-weight self-attention mechanism for transformers, using a single weight matrix with diagonal scaling to reduce parameters by 66.53% in attention blocks, achieving competitive performance on GLUE and improved noise robustness while slightly underperforming on SQuAD tasks compared to standard BERT.

Large Language Model Compression with Global Rank and Sparsity Optimization

Published: 11 May, 2025 at 11:14 AM

77.26 🤔

This paper introduces a two-stage LLM compression method using RPCA for low-rank and sparse decomposition and probabilistic pruning via policy gradient, outperforming state-of-the-art techniques at a 50% compression ratio while automatically adapting to layer-wise redundancy without manual thresholds or extensive fine-tuning.

LLM-e Guess: Can LLMs Capabilities Advance Without Hardware Progress?

Published: 12 May, 2025 at 11:20 AM

76.16 🤔

This paper introduces a framework to classify algorithmic innovations in LLMs as compute-dependent or compute-independent, demonstrating through small-scale GPT-2 experiments that compute-independent advancements like FlashAttention can yield up to 3.5× compute-equivalent gains even under hardware constraints, challenging the efficacy of hardware-focused AI regulation.
