DistIL: Reinforcement Learning from Rich Feedback with Distributional DAgger

Abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, in which the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert–student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen–Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys regret guarantees. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and hard mathematical problem solving.

DistIL improves over RLVR and RL with self-distillation baselines across scientific reasoning, mathematical reasoning, and coding.

Method

DistIL minimizes a forward cross-entropy between a teacher policy and the student policy, summed over all time steps of a rollout sampled from the student. The teacher is the same model additionally conditioned on feedback \(f\), wrapped in a stop-gradient so that no gradient flows through it. This replaces scalar reward signals with token-level distributional supervision while remaining compatible with any blackbox feedback source.
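As a short worked identity (standard, not specific to this work), the forward cross-entropy relates to the forward KL divergence as follows. Here \(p\) stands for the teacher distribution at one step, \(q_\theta\) for the student distribution at that step, and \(\mathcal{H}(p)\) for the Shannon entropy of \(p\); the symbol \(\mathcal{H}\) is notation introduced only for this note.

$$H^{\times}(p, q_\theta) \;=\; -\mathbb{E}_{a \sim p}\big[\log q_\theta(a)\big] \;=\; \underbrace{-\mathbb{E}_{a \sim p}\big[\log p(a)\big]}_{\mathcal{H}(p)} \;+\; \underbrace{\mathbb{E}_{a \sim p}\!\left[\log \frac{p(a)}{q_\theta(a)}\right]}_{\mathrm{KL}(p \,\|\, q_\theta)}$$

If \(p\) were a fixed distribution, \(\mathcal{H}(p)\) would be a constant and minimizing the forward cross-entropy would coincide with minimizing the forward KL. In the sequence-level setting the per-step cross-entropy also acts as a cost on sampled rollouts, so the entropy term still varies with the states the student visits.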

DistIL Objective

$$\mathcal{L}_{\mathrm{DistIL}}(\theta) \;:=\; \mathbb{E}_{x \sim \rho,\; y \sim \pi_\theta(\cdot \mid x)} \!\left[ \sum_{t=1}^{H} H^{\times}\!\Big( \mathsf{sg}\!\left(\pi_{\theta}(\cdot \mid x, y_{1:t-1}, f)\right),\; \pi_\theta(\cdot \mid x, y_{1:t-1}) \Big) \right]$$

Here \(H^{\times}(p,q) = -\mathbb{E}_{a \sim p}[\log q(a)]\) is the cross-entropy, \(\mathsf{sg}(\cdot)\) denotes stop-gradient, \(f\) is the rich feedback signal (e.g. execution trace, tool output), and \(H\) is the rollout horizon. The teacher \(\mathsf{sg}(\pi_\theta(\cdot \mid x, y_{1:t-1}, f))\) may be accessed as a blackbox.
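To make the objective concrete, the following is a minimal sketch of the loss value for a single rollout, assuming the teacher and student next-token logits at each step are already available as plain lists. The function names (`softmax`, `cross_entropy`, `distil_rollout_loss`) and the toy logits are hypothetical and chosen for illustration; this is not the authors' implementation, and it computes only the scalar value, not the gradient.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q, eps=1e-12):
    """Forward cross-entropy H^x(p, q) = -E_{a~p}[log q(a)]."""
    return -sum(pa * math.log(max(qa, eps)) for pa, qa in zip(p, q) if pa > 0.0)

def distil_rollout_loss(teacher_logits, student_logits):
    """Sum of per-step cross-entropies over one rollout.

    teacher_logits[t]: logits of the feedback-conditioned model at step t
                       (treated as constants, mirroring the stop-gradient).
    student_logits[t]: logits of the unconditioned model at step t.
    """
    return sum(
        cross_entropy(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    )

# Toy rollout: horizon 2, vocabulary of 3 tokens (values are made up).
teacher = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
student = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(distil_rollout_loss(teacher, student))
```

By Gibbs' inequality the per-step term is smallest when the student distribution matches the teacher's, in which case it equals the teacher's entropy at that step.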

Rich Credit Assignment via Future-Credit Propagation

The sequence-level gradient of \(\mathcal{L}_{\mathrm{DistIL}}\) decomposes into a local term (dense token-wise supervision, as in prior self-distillation methods) and a future-credit term that propagates downstream teacher–student disagreement back to earlier decisions. The second term helps the model learn that a bad early token choice leads to worse futures.

Policy Gradient Decomposition

$$\nabla_\theta \mathcal{L}_{\mathrm{DistIL}} \;=\; \underbrace{ \mathbb{E}_{y \sim \pi_\theta}\!\left[ \sum_{t=1}^{H} \nabla_\theta H^{\times}\!\Big( \mathsf{sg}\!\big(\pi_{\theta}(\cdot \mid s_t, f)\big),\; \pi_\theta(\cdot \mid s_t) \Big) \right] }_{\text{local credit assignment}} \;+\; \underbrace{ \mathbb{E}_{y \sim \pi_\theta}\!\left[ \sum_{t=1}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \!\left( \sum_{i > t} H^{\times}\!\Big( \mathsf{sg}\!\big(\pi_{\theta}(\cdot \mid s_i, f)\big),\; \pi_\theta(\cdot \mid s_i) \Big) \right) \right] }_{\text{future-credit assignment}}$$

where \(s_t = (x, y_{1:t-1})\) and \(a_t = y_t\) is the token sampled at step \(t\). The local term mirrors the gradient of SDPO/OPSD and provides dense per-token supervision. The future-credit term is unique to DistIL: it weights the score function \(\nabla_\theta \log\pi_\theta(a_t\mid s_t)\) by the cumulative teacher–student cross-entropy at all later steps \(i > t\), propagating the cost of disagreement back through time. Only later steps appear because the cross-entropy at step \(i\) depends on the state \(s_i\), which is affected by \(a_t\) only when \(i > t\). Prior methods based on reverse KL or Jensen–Shannon lack this term and cannot guarantee monotonic policy improvement.
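The two terms can be illustrated with a single-sample sketch. It makes a simplifying assumption that is not from the source: each step's student logits are treated as free parameters, so the gradient of the cross-entropy with respect to those logits is \(q - p\) and the gradient of \(\log \pi(a_t \mid s_t)\) is the one-hot vector of \(a_t\) minus \(q\). The function names are hypothetical, and a real implementation would presumably rely on automatic differentiation through the model.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Forward cross-entropy H^x(p, q) = -E_{a~p}[log q(a)]."""
    return -sum(pa * math.log(max(qa, eps)) for pa, qa in zip(p, q) if pa > 0.0)

def future_credit_weights(step_ce):
    """w_t = sum of per-step cross-entropies at strictly later steps i > t."""
    weights = [0.0] * len(step_ce)
    running = 0.0
    for t in range(len(step_ce) - 1, -1, -1):
        weights[t] = running      # excludes the current step
        running += step_ce[t]
    return weights

def logit_gradients(teacher_probs, student_probs, actions):
    """Single-rollout gradient estimate w.r.t. each step's student logits.

    teacher_probs[t], student_probs[t]: distributions at step t.
    actions[t]: index of the token the student actually sampled at step t.
    """
    step_ce = [cross_entropy(p, q) for p, q in zip(teacher_probs, student_probs)]
    weights = future_credit_weights(step_ce)
    grads = []
    for p, q, a, w in zip(teacher_probs, student_probs, actions, weights):
        # Local term: d/dz H^x(p, softmax(z)) = q - p.
        local = [qa - pa for pa, qa in zip(p, q)]
        # Future-credit term: w_t * d/dz log softmax(z)[a] = w_t * (onehot(a) - q).
        future = [w * ((1.0 if j == a else 0.0) - qa) for j, qa in enumerate(q)]
        grads.append([l + f for l, f in zip(local, future)])
    return grads

# Toy rollout with horizon 2 and a vocabulary of 2 tokens (values are made up).
teacher = [[0.9, 0.1], [0.2, 0.8]]
student = [[0.5, 0.5], [0.5, 0.5]]
print(logit_gradients(teacher, student, actions=[1, 0]))
```

Dropping the `future` term leaves only dense token-wise supervision. With it, a gradient-descent step lowers the logit of a sampled token in proportion to the disagreement accumulated after it; in practice a baseline would likely be subtracted from the weight to reduce variance, which this sketch omits.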

Learning under Sparse Feedback: Science

We evaluate on SciKnowEval L3 across four scientific domains (Chemistry, Physics, Biology, Materials), reporting best avg@16 within 1h and 5h of wall-clock training on 4× NVIDIA H200 GPUs. Bold = best; underline = second best per column.

Method	Chemistry		Physics		Biology		Materials		Average
Method	1h	5h	1h	5h	1h	5h	1h	5h	1h	5h
Qwen3-8B	41.2		59.2		30.8		58.9		47.5
+ GRPO (off-policy)	65.9	74.5	63.8	72.7	35.1	59.9	74.3	77.1	59.8	71.1
+ GRPO (on-policy)	63.3	63.4	63.6	63.6	49.8	49.8	73.9	74.1	62.7	62.7
+ SDPO	73.0	80.2	68.0	72.4	52.9	63.6	72.2	75.9	66.5	73.0
+ DistIL (Ours)	75.8	80.8	72.7	80.8	53.3	66.6	74.9	76.2	69.2	76.1
Olmo3-7B-Instruct	22.8		37.7		16.2		36.7		28.4
+ GRPO (off-policy)	39.7	56.7	55.3	63.3	35.6	55.8	70.9	75.0	50.4	62.7
+ GRPO (on-policy)	51.4	57.5	62.7	62.7	49.8	49.8	73.3	73.5	59.3	60.9
+ SDPO	70.2	79.2	59.8	64.9	49.5	52.9	71.8	78.1	62.8	68.8
+ DistIL (Ours)	72.1	81.0	67.4	74.5	47.8	55.3	73.5	76.9	65.2	71.9

Table 1. Comparison on scientific reasoning benchmarks (SciKnowEval L3). We report best avg@16 within 1h and 5h of wall-clock training on 4× NVIDIA H200 GPUs. Average is computed over all four domains.

[Figure 1 panels: Best@16 (top row) and Maj@16 (bottom row); columns Biology, Chemistry, Materials, Physics]

Figure 1. Validation Best@16 (top) and Maj@16 (bottom) over training for RL with self-distillation algorithm SDPO and DistIL (ours) on Qwen3-8B across four scientific reasoning domains: biology, chemistry, materials, and physics. DistIL generally achieves higher validation performance than SDPO across domains and metrics, with gains often appearing early and sustained during training. SDPO exhibits greater variability with longer training, including a pronounced decline in biology Best@16 after roughly 100 steps and larger oscillations in chemistry and physics; DistIL is comparatively more stable.

Learning with Rich Environment Feedback: Coding

The coding setting provides rich feedback in the form of execution traces and test outputs. We evaluate the step-80 checkpoint (following SDPO) on LiveCodeBench v6 (LCBv6) at temperature \(\tau{=}0.2\), reporting Score and Accuracy at Best@\(k\) and Maj@\(k\) for \(k \in \{2,4,8,16\}\).

[Figure 2 panels: Score · Maj@k, Score · Best@k, Accuracy · Maj@k, Accuracy · Best@k, all at τ = 0.2]

Figure 2. LCBv6 evaluation at \(\tau{=}0.2\) for the step-80 checkpoint, reporting Score and Accuracy at Best@\(k\) and Maj@\(k\) for \(k \in \{2,4,8,16\}\).

Learning to Solve Very Hard Mathematical Reasoning Problems

We benchmark on AIME24, AIME25, HMMT25, AMC23, and Minerva, reporting average (Avg) and pass (Pass) accuracy. We follow OPSD and report the best checkpoint up to step 100. Bold = best; underline = second best per column.

Model	Method	AIME24		AIME25		HMMT25		AMC23		Minerva
Model	Method	Avg	Pass	Avg	Pass	Avg	Pass	Avg	Pass	Avg	Pass
Qwen3-4B	Base	61.4	82.0	50.3	69.5	30.3	48.8	93.8	98.8	43.2	48.6
	+ SFT	55.8	80.9	43.1	67.4	29.7	46.0	91.7	97.0	41.9	50.4
	+ GRPO	61.4	82.0	50.3	69.5	30.3	48.8	93.8	98.8	43.2	48.6
	+ SDPO	60.9	85.4	49.6	74.1	32.8	50.9	93.9	99.8	43.4	52.0
	+ OPSD	63.2	85.4	51.5	76.1	33.0	54.2	94.5	99.6	43.4	51.8
	+ DistIL (Ours)	65.3	87.5	55.3	77.6	33.0	56.6	94.8	100.0	44.2	52.9
Qwen3-8B	Base	75.9	86.8	66.9	79.6	44.7	65.6	95.6	100.0	48.9	56.3
	+ SFT	74.2	85.3	65.3	80.7	42.2	64.4	96.3	100.0	48.2	56.2
	+ GRPO	75.9	86.8	66.9	79.6	44.7	65.6	95.6	100.0	48.9	56.3
	+ SDPO	76.9	89.5	68.3	84.0	45.6	71.2	96.5	100.0	49.2	57.7
	+ OPSD	76.5	91.3	69.7	83.3	45.5	68.2	96.2	100.0	49.2	57.8
	+ DistIL (Ours)	76.4	90.7	71.1	85.0	46.4	71.4	96.6	100.0	49.5	58.4

Table 2. Performance on mathematical reasoning benchmarks for Qwen3 models. We report the best checkpoint up to step 100 following OPSD.

Ablation Study

We isolate the contribution of DistIL's future-credit term. The CE baseline shares the forward cross-entropy form but performs only local token-wise credit assignment, similar to SDPO/OPSD. We also ablate the number of top-\(k\) teacher tokens used during training.
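One plausible reading of the top-\(k\) ablation is sketched below: the teacher distribution is restricted to its \(k\) most probable tokens before the cross-entropy is computed. Whether the retained mass is renormalized is not stated in the text; the renormalization here is an assumption, and the function names are hypothetical.

```python
import math

def top_k_truncate(probs, k):
    """Keep the k most probable tokens and renormalize (renormalization is assumed)."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    truncated = [0.0] * len(probs)
    for i in keep:
        truncated[i] = probs[i] / mass
    return truncated

def top_k_cross_entropy(teacher_probs, student_probs, k, eps=1e-12):
    """Cross-entropy against the truncated teacher; only k terms are non-zero."""
    p = top_k_truncate(teacher_probs, k)
    return -sum(
        pa * math.log(max(qa, eps)) for pa, qa in zip(p, student_probs) if pa > 0.0
    )

# Toy distributions over a 3-token vocabulary (values are made up).
teacher = [0.5, 0.3, 0.2]
student = [0.4, 0.4, 0.2]
print(top_k_truncate(teacher, 2))
print(top_k_cross_entropy(teacher, student, k=2))
```

A smaller \(k\) reduces the number of teacher probabilities that must be stored or transmitted per token, at the cost of discarding the tail of the teacher distribution.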

Figure 3. DistIL vs. CE on Materials benchmark. CE does local token-wise credit assignment only; DistIL adds full future-credit propagation.

Figure 4. Avg@16 on Physics benchmark as a function of the number of top-\(k\) tokens used during DistIL training on the Materials benchmark.
