Tag: Efficiency

All the articles with the tag "Efficiency".

Efficient Reasoning for LLMs through Speculative Chain-of-Thought

Published: 6 May, 2025 at 01:19 AM

79.97 🤔

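This paper proposes the Speculative Chain-of-Thought (SCoT) framework, in which a lightweight draft model generates multiple chain-of-thought drafts in parallel and a fine-tuned target model selects the best draft or decides to rethink, significantly reducing the reasoning latency of large language models while keeping accuracy close to that of the large model.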
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation

Published: 4 May, 2025 at 04:29 PM

79.93 🤔

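This paper proposes the StreamRL framework, which optimizes RL training through a disaggregated stream generation architecture, addressing pipeline and skewness bubbles and improving the throughput and cost efficiency of RL training for LLMs.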
Radio: Rate-Distortion Optimization for Large Language Model Compression

Published: 9 May, 2025 at 11:09 AM

79.84 🤔

This paper introduces Radio, a rate-distortion optimization framework for LLM compression that iteratively optimizes bit depths and applies companding quantization after training, outperforming existing quantization methods in perplexity and downstream task accuracy, particularly at lower bit depths.
Beyond Next Token Prediction: Patch-Level Training for Large Language Models

Published: 19 May, 2025 at 11:18 AM

79.83 🤔

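This paper proposes patch-level training, which aggregates multiple tokens into high-information-density patches and trains large language models in stages, maintaining or even slightly improving model performance while halving training cost.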
Does Self-Attention Need Separate Weights in Transformers?

Published: 11 May, 2025 at 11:12 AM

79.57 🤔

This paper introduces a shared-weight self-attention mechanism for transformers that uses a single weight matrix with diagonal scaling, reducing attention-block parameters by 66.53% while achieving competitive performance on GLUE and improved noise robustness, though it slightly underperforms standard BERT on SQuAD tasks.
