This paper introduces a recursive summarization method to enhance long-term dialogue memory in LLMs, achieving marginal quantitative improvements and notable qualitative gains in consistency and coherence across multiple models and datasets.
Large Language Model, Long Context, In-Context Learning, Multimodal Systems, Human-AI Interaction
Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, Liang Ding
Hong Kong University of Science and Technology, Hong Kong, China, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China, National University of Defense Technology, Changsha, China, University of Sydney, Sydney, Australia
Generated by grok-3
Background Problem
Large Language Models (LLMs) like ChatGPT and GPT-4 excel at conversational tasks but struggle to maintain consistency in long-term dialogues because they cannot effectively recall and integrate past interactions. This limitation is critical in applications such as personal AI companions and health assistants, where maintaining dialogue history is essential for rapport and accurate responses. Existing solutions, such as retrieval-based methods and memory modules, often fail to capture complete semantics or to update memory dynamically, so outdated or irrelevant information degrades response quality. The paper addresses this by proposing a method that enhances LLMs' long-term dialogue memory through recursive summarization, ensuring consistent and contextually relevant responses over extended conversations.
Method
The proposed method, termed 'LLM-Rsum,' leverages recursive summarization to build and update dialogue memory in LLMs for long-term conversations. The core idea is to have the LLM self-generate and iteratively update memory by summarizing dialogue contexts over multiple sessions. The process involves two main stages: (1) Memory Iteration, where the LLM generates an initial summary from a short dialogue context and recursively updates it by integrating the previous memory with the new session (formulated as $M_t = \mathrm{LLM}(M_{t-1}, S_t)$, where $M_{t-1}$ is the previous memory and $S_t$ the $t$-th session), guided by a structured prompt to ensure coherence and relevance; (2) Memory-based Response Generation, where the LLM uses the latest memory alongside the current dialogue context to produce consistent responses (formulated as $r = \mathrm{LLM}(M_t, C)$, where $C$ is the current dialogue context). Prompts are carefully designed with step-by-step instructions so the LLM captures key personality traits and maintains conversational flow. The method is plug-and-play, requiring no model retraining, and models long-term dependencies efficiently: because each session is folded into a bounded summary rather than appended to the input, it can in principle handle extremely long contexts.
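To make the two-stage loop concrete, here is a minimal Python sketch. The `call_llm` helper and the prompt wording are hypothetical placeholders, not the paper's actual prompts:

```python
from typing import List

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat LLM call (ChatGPT, Llama2-7B, ...);
    # replace with a real API client in practice.
    return "<llm output>"

def update_memory(prev_memory: str, session: List[str]) -> str:
    # Stage 1 (Memory Iteration): fold the new session into the running summary.
    prompt = (
        "Previous memory:\n" + prev_memory + "\n\n"
        "New session:\n" + "\n".join(session) + "\n\n"
        "Update the memory: keep key personality traits and events, "
        "drop redundant details, and resolve contradictions."
    )
    return call_llm(prompt)

def respond(memory: str, context: List[str]) -> str:
    # Stage 2 (Memory-based Response Generation): condition on the latest
    # memory plus only the current dialogue context.
    prompt = (
        "Memory:\n" + memory + "\n\n"
        "Current dialogue:\n" + "\n".join(context) + "\n\n"
        "Reply consistently with the memory and the dialogue so far."
    )
    return call_llm(prompt)

# The memory is rebuilt once per session, so the prompt length stays
# bounded no matter how many sessions have passed.
sessions = [["A: I adopted a dog.", "B: Congrats!"],
            ["A: My dog Rex loves hiking.", "B: Nice!"]]
memory = ""
for session in sessions:
    memory = update_memory(memory, session)
print(respond(memory, ["A: What should Rex and I do this weekend?"]))
```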
Experiment
The experiments were conducted on two long-term dialogue datasets, Multi-Session Chat (MSC) and Carecall, focusing on sessions 4 and 5 to evaluate long-term modeling. Various LLMs (e.g., Llama2-7B, ChatGLM2-6B, ChatGPT) were tested in zero-shot settings against context-only, retrieval-based (BM25, DPR), and memory-based (MemoryBank, MemoChat) baselines. Evaluation combined automatic metrics (F1, BLEU-1/2, BertScore), human judgments (engagingness, coherence, consistency), and LLM-as-judge scoring with GPT-4. Automatic metrics improved only marginally (e.g., +0.2% F1 for ChatGPT-Rsum over baselines on MSC), a modest but defensible gain given the datasets' complexity. Human and LLM evaluations showed stronger qualitative gains, with ChatGPT-Rsum scoring higher in coherence (1.60 vs. 1.57 for MemoryBank on MSC) and consistency (1.70 vs. 1.68 on Carecall). The setup was comprehensive, covering multiple LLMs and complementary methods (retrieval and long-context models), demonstrating robustness and universality; still, the small automatic-metric gains and the reliance on subjective evaluations leave room for over-optimism. Ablation studies confirmed that memory is necessary, though, surprisingly, ground-truth memory underperformed model-generated memory, suggesting the generated summaries are more coherent inputs for response generation. Error analysis found minor factual inaccuracies in the memory (less than 10%), acceptable but indicating room for improvement.
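For reference, the F1 reported above is typically the word-overlap F1 between a generated response and the gold response. A common formulation is sketched below; the paper's exact tokenization and normalization may differ:

```python
from collections import Counter

def unigram_f1(hypothesis: str, reference: str) -> float:
    # Word-overlap F1 between a generated response and the gold response.
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(unigram_f1("i love hiking with my dog",
                       "i really love hiking with rex"), 3))
```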
Further Thoughts
The recursive summarization approach opens up intriguing possibilities for enhancing LLMs beyond dialogue tasks, such as in narrative generation or legal document analysis, where maintaining long-term context is crucial. However, the minor factual errors in memory generation (e.g., fabricated facts, missing details) could be more problematic in high-stakes domains like healthcare or law, where precision is paramount. This connects to broader research on hallucination mitigation in LLMs, suggesting a potential hybrid approach combining recursive summarization with fact-checking mechanisms or retrieval-augmented generation (RAG) to validate memory content. Additionally, the computational cost of repeatedly calling LLMs for memory updates is a significant barrier to scalability, aligning with ongoing discussions on efficient inference in AI systems. Future work could explore distilling the summarization process into a smaller, fine-tuned model to reduce costs, akin to efforts in parameter-efficient fine-tuning. Lastly, the positive correlation between memory quality and response quality hints at a feedback loop mechanism—could LLMs iteratively refine their own memory based on response feedback? This could tie into reinforcement learning paradigms like RLHF, offering a novel direction for self-improving dialogue systems.
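As a thought experiment, such a hybrid could filter each draft summary through a grounding check before it propagates to later sessions. The sketch below is hypothetical, not the paper's method, and reuses `call_llm` and `update_memory` from the Method sketch above:

```python
from typing import List

def verify_line(memory_line: str, session_text: str) -> bool:
    # Hypothetical grounding check: ask the LLM (or an NLI model) whether
    # the session actually supports this memory statement.
    prompt = ("Dialogue:\n" + session_text +
              "\n\nStatement: " + memory_line +
              "\nIs the statement supported by the dialogue? Answer yes or no.")
    return call_llm(prompt).strip().lower().startswith("yes")

def checked_update(prev_memory: str, session: List[str]) -> str:
    # Draft the recursive summary as usual, then keep only lines grounded
    # in the session, so fabricated facts do not propagate into every
    # later summary.
    draft = update_memory(prev_memory, session)
    session_text = "\n".join(session)
    kept = [line for line in draft.splitlines()
            if line.strip() and verify_line(line, session_text)]
    return "\n".join(kept)
```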