Tag: Interpretability
All the articles with the tag "Interpretability".
-
EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning
This paper proposes EMORL, an ensemble framework that trains single-objective models separately, aggregates them at the hidden-state level, and tunes the aggregation weights with a hierarchical grid search; on a counseling reflection generation task it matches the performance of conventional multi-objective fine-tuning while substantially improving training efficiency, scalability, and interpretability.
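
A minimal sketch of the two ideas in the summary, hidden-state aggregation and hierarchical grid search, under assumed shapes and a two-objective setup; `score_fn` stands in for whatever multi-objective evaluation is used, and none of this is the paper's code:

```python
import torch

def aggregate_hidden_states(hidden_states, weights):
    """Linearly combine hidden states from separately trained
    single-objective models (one tensor per model, each of shape
    (batch, seq_len, d_model))."""
    stacked = torch.stack(hidden_states)      # (n_models, batch, seq, d)
    w = weights.view(-1, 1, 1, 1)             # broadcast over remaining dims
    return (w * stacked).sum(dim=0)

def hierarchical_grid_search(score_fn, coarse=5, fine=5, span=0.2):
    """Coarse grid over a weight in [0, 1], then a finer grid
    around the best coarse candidate."""
    best_w, best_s = 0.0, float("-inf")
    for w in torch.linspace(0.0, 1.0, coarse).tolist():
        s = score_fn(torch.tensor([w, 1.0 - w]))
        if s > best_s:
            best_w, best_s = w, s
    lo, hi = max(0.0, best_w - span), min(1.0, best_w + span)
    for w in torch.linspace(lo, hi, fine).tolist():
        s = score_fn(torch.tensor([w, 1.0 - w]))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```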
-
Boltzmann Classifier: A Thermodynamic-Inspired Approach to Supervised Learning
The Boltzmann Classifier is a thermodynamically inspired supervised learning method: it assigns each class an energy and converts the energies into class probabilities via the Boltzmann distribution, achieving competitive accuracy on benchmark datasets while remaining interpretable and computationally efficient.
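
At its core this is a softmax over negative class energies. A minimal sketch, with a squared-distance-to-class-mean energy assumed purely for illustration:

```python
import numpy as np

def boltzmann_probabilities(energies, T=1.0):
    """Boltzmann distribution over classes: p_k ∝ exp(-E_k / T).
    Lower energy means higher probability; T controls sharpness."""
    z = np.exp(-(energies - energies.min()) / T)  # shift for stability
    return z / z.sum()

def class_energies(x, class_means):
    # Illustrative energy: squared distance to each class mean.
    return np.array([np.sum((x - mu) ** 2) for mu in class_means])

means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
x = np.array([0.4, 0.3])
print(boltzmann_probabilities(class_energies(x, means)))  # class 0 dominates
```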
-
Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
By fine-tuning GPT-4o and GPT-4o-mini, this paper shows that large language models can quantitatively report the internal processes that drive their decisions (e.g., attribute weights), that introspection training markedly improves the accuracy of these reports, and that the ability generalizes to native preferences, opening a new path for AI interpretability and safety.
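
One way to quantify report accuracy, sketched here on synthetic stand-in data (not the paper's code): recover the weights that actually drive the model's binary choices with a logistic regression, then compare them to the weights the model reports about itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
true_w = np.array([0.7, -0.2, 0.5])        # weights actually driving choices
X = rng.normal(size=(500, 3))              # attribute differences per trial
y = (X @ true_w + rng.normal(scale=0.3, size=500) > 0).astype(int)

fit = LogisticRegression().fit(X, y)       # behavioral estimate of the weights
behavioral_w = fit.coef_[0] / np.linalg.norm(fit.coef_[0])

reported_w = np.array([0.65, -0.25, 0.45]) # stand-in for the model's self-report
reported_w = reported_w / np.linalg.norm(reported_w)

# Score report accuracy as cosine similarity between the two weight vectors.
print(float(behavioral_w @ reported_w))
```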
-
Large Language Models are Locally Linear Mappings
This paper presents a method that uses a detached Jacobian to turn a large language model, at a given input point, into a nearly exact locally linear system, exposing low-rank semantic structure inside the model; it also offers a preliminary exploration of output steering, though generality and practical utility remain limited.
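
The core trick can be shown on a single gated MLP block rather than a full LLM: detach the gate's dependence on the input so the remaining map is exactly linear at that point, then read the map off as a Jacobian. A toy sketch, not the paper's code:

```python
import torch

torch.manual_seed(0)
d = 8
W_gate, W_up, W_down = (torch.randn(d, d) for _ in range(3))
x = torch.randn(d)

def gated_mlp(v, gate_input):
    # SwiGLU-style block: silu(W_gate @ gate_input) gates W_up @ v.
    gate = torch.nn.functional.silu(W_gate @ gate_input)
    return W_down @ (gate * (W_up @ v))

# Freeze (detach) the gate at the input point: the map v -> output
# is then exactly linear, and its matrix is the Jacobian.
frozen = lambda v: gated_mlp(v, x.detach())
J = torch.autograd.functional.jacobian(frozen, x)

# The detached-Jacobian linear system reproduces the output at x.
print(torch.allclose(J @ x, gated_mlp(x, x), atol=1e-5))  # True
```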
-
When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
Using model-specific datasets and belief-manipulation experiments, this paper shows that retraction behavior in large language models (LLMs) is causally influenced by their internal beliefs, and that supervised fine-tuning substantially improves retraction performance.
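
A toy illustration of measuring "belief", using GPT-2 as a stand-in model; the verification-prompt proxy below is an assumption for illustration, not the paper's protocol:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def belief(statement):
    """P(' True') vs P(' False') after a verification prompt,
    as a rough proxy for the model's belief in the statement."""
    prompt = f"Statement: {statement}\nTrue or False? Answer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    pair = torch.stack([logits[tok(" True").input_ids[0]],
                        logits[tok(" False").input_ids[0]]])
    return torch.softmax(pair, dim=0)[0].item()

# A belief-vs-retraction analysis would correlate (and manipulate)
# this quantity against whether the model withdraws its answer.
print(belief("The capital of Australia is Canberra."))
print(belief("The capital of Australia is Sydney."))
```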