Posts
All the articles I've posted.
-
Why do LLMs attend to the first token?
This paper argues that attention sinks in LLMs, particularly at the first token, act as a useful mechanism to prevent over-mixing of information in deep Transformers. The claim is supported by theoretical analysis and by empirical evidence from Gemma 7B, LLaMa 3.1 models, and pre-training experiments, which show stronger sinks in larger models and with longer contexts.
-
M+: Extending MemoryLLM with Scalable Long-Term Memory
M+ extends MemoryLLM with a long-term memory mechanism and a co-trained retriever, significantly expanding knowledge retention to beyond 160k tokens. It outperforms baselines on long-context tasks while keeping GPU memory consumption low.
-
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
This paper investigates inter-layer communication in Transformer LMs by identifying low-rank communication channels via SVD, demonstrating their causal role in prompt sensitivity through interventions that significantly improve performance on context retrieval tasks like the Laundry List task.
-
Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking
This paper explores parameter- and memory-efficient methods for LLM pretraining through a survey, benchmarking, and two proposed techniques, weight refactorization and momentum reset, which significantly improve the performance of low-rank methods and reduce memory consumption, though they still do not fully match full-rank training.
-
Learning Composable Chains-of-Thought
This paper proposes Composable Chain-of-Thought, which adapts the CoT format of atomic tasks via data augmentation and combines multitask learning or model merging to enable zero-shot compositional reasoning; rejection-sampling fine-tuning further boosts performance, outperforming standard CoT baselines on string-manipulation and natural-language tasks.