Tag: Efficiency

All the articles with the tag "Efficiency".

Splitwiser: Efficient LM inference with constrained resources

Published: 11 May, 2025 at 11:14 AM

60.85 🤔

Splitwiser introduces a method to split LLM inference phases on a single GPU using multiprocessing and NVIDIA MPS, achieving modest latency reductions (up to 18.2%) and throughput improvements (up to 1.42x) on Hugging Face and vLLM pipelines, though it is constrained by overheads and scalability issues.
Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving

Published: 4 May, 2025 at 04:30 PM

60.43 🤔

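This paper proposes a cognitive load-aware adaptive streaming framework for optimizing LLM serving. By dynamically adjusting output speed, it reduces compute resource consumption by up to 16.8% while maintaining user satisfaction.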
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Published: 4 May, 2025 at 04:31 PM

59.95 🤔

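This paper proposes Token-Shuffle, a method that exploits dimensional redundancy in the visual vocabulary to dynamically merge and restore image tokens, enabling efficient high-resolution text-to-image generation while maintaining strong performance within a unified autoregressive framework.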
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference

Published: 4 May, 2025 at 04:28 PM

59.39 🤔

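This work proposes SpargeAttn, a general-purpose sparse attention mechanism that accelerates inference for a wide range of models through a two-stage online filter and quantization, with no loss in end-to-end performance.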
W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models

Published: 4 May, 2025 at 04:30 PM

53.85 🤔

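This paper proposes W-PCA, which combines parameter count with principal component analysis to provide an efficient zero-shot NAS proxy for searching lightweight language models, significantly improving search efficiency and model performance.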
