Tag: Efficiency
All the articles with the tag "Efficiency".
-
Large Language Model Compression with Global Rank and Sparsity Optimization
This paper introduces a two-stage LLM compression method that uses Robust PCA (RPCA) for low-rank and sparse decomposition and policy-gradient-based probabilistic pruning, outperforming state-of-the-art techniques at a 50% compression ratio while automatically adapting to layer-wise redundancy without manual thresholds or extensive fine-tuning.
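As a rough illustration of the decomposition step, here is a generic Robust PCA sketch (the standard ADMM / inexact ALM iteration, not the paper's actual compression pipeline) that splits a matrix into low-rank and sparse parts; the example matrix, parameters, and thresholds are all illustrative.

```python
import numpy as np

def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=300):
    """Generic Robust PCA via ADMM:
    minimize ||L||_* + lam * ||S||_1  subject to  L + S = M.
    A sketch for illustration, not the paper's method."""
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or m * n / (4.0 * np.abs(M).sum())
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    shrink = lambda X, tau: np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)  # soft threshold

    for _ in range(max_iter):
        # Singular-value thresholding -> low-rank component L
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        Lr = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt
        # Elementwise soft threshold -> sparse component S
        S = shrink(M - Lr + Y / mu, lam / mu)
        # Dual update and convergence check on the residual
        R = M - Lr - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return Lr, S

# Example: decompose a synthetic low-rank matrix with a few large outliers.
W = np.random.randn(128, 32) @ np.random.randn(32, 128)   # rank-32 part
W[np.random.rand(128, 128) < 0.01] += 5.0                  # sparse outliers
L_hat, S_hat = rpca(W)
print(np.linalg.matrix_rank(L_hat, tol=1e-3), (np.abs(S_hat) > 1e-3).mean())
```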
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
This paper presents the first systematic survey of progress on efficient reasoning for large language models, categorizing model-based, output-based, and prompt-based methods and examining strategies for reducing the "overthinking" phenomenon to improve computational efficiency while preserving reasoning capability.
-
LLM-e Guess: Can LLMs Capabilities Advance Without Hardware Progress?
This paper introduces a framework to classify algorithmic innovations in LLMs as compute-dependent or compute-independent, demonstrating through small-scale GPT-2 experiments that compute-independent advancements like FlashAttention can yield up to 3.5× compute-equivalent gains even under hardware constraints, challenging the efficacy of hardware-focused AI regulation.
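To make "compute-equivalent gain" concrete, the sketch below interpolates a hypothetical baseline loss-vs-compute curve to find how much extra compute the baseline would need to match an improved run's loss; the numbers and the log-space interpolation are assumptions for illustration, not values from the paper.

```python
import numpy as np

# Hypothetical loss-vs-compute measurements for a baseline training recipe.
baseline_flops = np.array([1e15, 2e15, 4e15, 8e15])  # compute budgets (FLOPs)
baseline_loss  = np.array([3.2, 3.0, 2.8, 2.6])      # loss reached at each budget
improved_flops = 2e15                                 # compute used by the improved run
improved_loss  = 2.7                                  # loss it reaches

# Interpolate (in log-compute) the baseline budget that reaches the same loss.
matched_log_c = np.interp(improved_loss, baseline_loss[::-1], np.log(baseline_flops)[::-1])
ceg = np.exp(matched_log_c) / improved_flops
print(f"compute-equivalent gain ~ {ceg:.1f}x")
```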
-
COSMOS: Predictable and Cost-Effective Adaptation of LLMs
COSMOS introduces a cost-effective framework for predicting the performance and cost of LLM adaptation strategies such as QLoRA fine-tuning and retrieval-augmented in-context learning (ICL), achieving high prediction accuracy (1.09% MAE) while reducing the computational cost of strategy evaluation by 92.72% across eight diverse benchmarks.
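For reference, the MAE figure can be read as the average gap between predicted and observed benchmark scores; the toy computation below assumes the error is measured in accuracy percentage points and uses made-up values, not COSMOS outputs.

```python
import numpy as np

# Hypothetical predicted vs. observed accuracies (%) on four benchmarks.
predicted = np.array([71.2, 64.5, 80.1, 55.0])
observed  = np.array([70.0, 66.3, 78.2, 56.4])
mae = np.mean(np.abs(predicted - observed))
print(f"MAE = {mae:.2f} percentage points")
```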
-
From Attention to Atoms: Spectral Dictionary Learning for Fast, Interpretable Language Models
This paper proposes the Spectral Dictionary Generative Model (SDGM), which replaces the self-attention mechanism with a learned global Fourier dictionary and per-token mixing coefficients, enabling efficient language modeling at O(KL) complexity while achieving competitive perplexity and substantial resource savings on benchmark datasets.
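A minimal PyTorch sketch of the general idea: a token-mixing layer built from K global sinusoidal atoms, costing O(K·L·d) rather than attention's O(L²·d). The layer, its parameterization, and the class name SpectralDictionaryMixer are assumptions for illustration, not the paper's exact SDGM.

```python
import math
import torch
import torch.nn as nn

class SpectralDictionaryMixer(nn.Module):
    """Token mixing with K global spectral atoms instead of self-attention.
    A hedged sketch of the idea in the summary above, not the paper's SDGM."""

    def __init__(self, n_atoms: int, d_model: int):
        super().__init__()
        self.freq  = nn.Parameter(torch.rand(n_atoms))           # learned atom frequencies
        self.phase = nn.Parameter(torch.zeros(n_atoms))          # learned atom phases
        self.gain  = nn.Parameter(torch.ones(n_atoms, d_model))  # per-channel mixing gains

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model)
        L = x.size(1)
        t = torch.arange(L, device=x.device, dtype=x.dtype)      # positions 0..L-1
        # Dictionary of K sinusoidal atoms evaluated at the L positions: (K, L)
        atoms = torch.cos(2 * math.pi * self.freq[:, None] * t[None, :] / L
                          + self.phase[:, None])
        coeffs = torch.einsum("kl,bld->bkd", atoms, x)                   # project: O(K*L*d)
        mixed  = torch.einsum("kl,bkd->bld", atoms, coeffs * self.gain)  # reconstruct
        return mixed

# Usage: mix a batch of 2 sequences of length 128 with 16 atoms.
layer = SpectralDictionaryMixer(n_atoms=16, d_model=64)
y = layer(torch.randn(2, 128, 64))
print(y.shape)  # torch.Size([2, 128, 64])
```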