This paper introduces Temperature Scaling (TS) and Trace Length Control for Dynamic Reasoning (TLDR) to enhance token efficiency in small language models, achieving up to 50% reduction in response length with minimal accuracy loss across multiple reasoning benchmarks.
Reinforcement Learning, Supervised Learning, Reasoning, Efficiency, Large Language Model
Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak
University of Michigan
Generated by grok-3
Background Problem
The paper addresses the inefficiency of small language models (SLMs) in reasoning tasks, particularly when trained with supervised fine-tuning (SFT), which often leads to verbose and repetitive outputs due to poor control over the stopping point of the reasoning process. This redundancy increases computational costs and hinders practical deployment. The key problem solved is achieving token-efficient reasoning in SLMs by optimizing the trade-off between response length and accuracy, ensuring effective performance with reduced computational overhead.
Method
The paper proposes two main methods to enhance token efficiency in SLMs:
- Temperature Scaling (TS): A training-free, inference-time intervention that rescales the logit of the end-of-sequence (EOS) token to increase its selection probability, thereby encouraging earlier termination of generation. The adjustment touches only the EOS logit and leaves the model’s internal reasoning process unchanged.
- Trace Length Control for Dynamic Reasoning (TLDR): A reinforcement learning (RL) approach based on Group Relative Policy Optimization (GRPO) that incorporates a length penalty into the reward, defined as $r = \hat{r} - p(L)$, where $\hat{r}$ is the original reward (e.g., accuracy) and $p(L)$ is a penalty that grows with the response length $L$. TLDR supports multi-level length control (short, moderate, long) via user prompts, allowing dynamic adjustment of response verbosity by training the model to associate each length prompt with its corresponding penalty. Both methods control the stopping point of reasoning traces, with TS offering a lightweight solution and TLDR providing a more robust, trainable framework for efficiency-accuracy trade-offs; minimal code sketches of both appear after this list.
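As a concrete illustration of the TS intervention, here is a minimal sketch written as a Hugging Face `LogitsProcessor`. The specific update rule (shrinking the gap between the EOS logit and the current maximum logit by a factor `gamma` > 1) and the parameter names are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
from transformers import LogitsProcessor

class EOSTemperatureScaling(LogitsProcessor):
    """Sketch of an EOS-only logit adjustment: shrink the gap between the EOS
    logit and the current max logit by gamma > 1, which monotonically raises
    the probability of emitting EOS and thus ends generation earlier."""

    def __init__(self, eos_token_id: int, gamma: float = 2.0):
        self.eos_token_id = eos_token_id
        self.gamma = gamma  # gamma > 1 => stronger push toward stopping

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        max_logit = scores.max(dim=-1).values      # best logit in each batch row
        eos_logit = scores[:, self.eos_token_id]
        # Temperature-like rescaling applied only to the EOS coordinate.
        scores[:, self.eos_token_id] = max_logit - (max_logit - eos_logit) / self.gamma
        return scores
```

In practice the processor would be passed to generation via `model.generate(..., logits_processor=LogitsProcessorList([EOSTemperatureScaling(tokenizer.eos_token_id)]))`, leaving all other logits, and hence the reasoning trace itself, untouched.

The TLDR reward can be sketched in the same spirit: an accuracy reward minus a length penalty whose strength depends on the requested verbosity level. The penalty form, the per-level coefficients, and the normalization constant below are hypothetical placeholders, not the paper's values.

```python
def tldr_reward(is_correct: bool, length: int, level: str = "moderate",
                max_len: int = 4096) -> float:
    """Length-penalized reward in the spirit of TLDR: r = r_acc - p(L).
    Coefficients and max_len are illustrative assumptions."""
    coeff = {"short": 1.0, "moderate": 0.5, "long": 0.1}[level]  # stronger penalty => shorter traces
    r_acc = 1.0 if is_correct else 0.0
    return r_acc - coeff * (length / max_len)
```

A GRPO trainer would compute this reward for each sampled trace in a group and use the group-normalized advantages as usual; the only change relative to a plain accuracy reward is the subtracted, prompt-conditioned length term.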
Experiment
The experiments were conducted on four reasoning benchmarks: MATH500, AMC, AIME24, and OlympiadBench, using SLMs of varying sizes (1.5B to 7B parameters) such as Qwen2.5 and DeepSeek-R1-Distill models. The setup compared baseline SFT models, test-time interventions such as Budget Forcing (BF), and the proposed TS and TLDR methods under different token budgets and length-control levels (short, moderate, long). TS reduced response length by up to 50% while maintaining accuracy, achieving a better efficiency-accuracy trade-off than BF. TLDR further improved token efficiency by approximately 50% over SFT baselines with minimal to no accuracy loss, and its multi-level control allowed flexible adjustment of response length. The experimental design was comprehensive in covering multiple datasets and model sizes, though its heavy focus on mathematical reasoning may limit generalizability to other domains. The results matched the expectation of improved efficiency, but the lack of a detailed computational-overhead analysis for TLDR training raises questions about practical scalability. Finally, comparisons with state-of-the-art methods such as L1 showed TLDR performing on par under maximum trace-length constraints, though a deeper analysis of edge cases and failure modes is missing.
Further Thoughts
The approaches proposed in this paper, particularly TLDR, open up intriguing possibilities for broader applications beyond mathematical reasoning, such as in conversational AI or code generation, where response length control could significantly impact user experience and computational cost. However, the reliance on user prompts for multi-level length control in TLDR might introduce inconsistencies in real-world settings where user inputs vary widely; exploring automated or context-aware length adjustment mechanisms could be a valuable next step. Additionally, the paper’s focus on SLMs prompts a connection to federated learning scenarios, where efficient reasoning could be critical for on-device deployment under resource constraints—combining TLDR with privacy-preserving techniques might yield novel solutions for edge AI. The lack of discussion on training overhead for TLDR also suggests a need for comparative studies with lightweight fine-tuning methods like Parameter-Efficient Fine-Tuning (PEFT) to assess true practicality. Finally, the observed redundancy in SFT models aligns with broader challenges in distillation processes, hinting at potential synergies with research on knowledge distillation under constrained environments, where efficiency and performance must be balanced meticulously.