Reinforcement Learning from Rich Feedback
with Distributional DAgger

University of Southern California

Abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient performs rich credit assignment by propagating future expert–student disagreement back to earlier decisions. We show that prior RL-with-self-distillation objectives based on reverse KL or Jensen–Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys regret guarantees. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL-with-self-distillation baselines across a variety of domains: scientific reasoning, coding, and hard mathematical problem solving.

DistIL improves over RLVR and RL with self-distillation baselines across scientific reasoning, mathematical reasoning, and coding.

Method

DistIL minimizes a forward cross-entropy between a teacher policy and the student policy, summed over all time steps of a sampled rollout. The teacher is the current policy additionally conditioned on feedback \(f\), with a stop-gradient applied so that no gradient flows through it. This replaces scalar reward signals with token-level distributional supervision while remaining compatible with any blackbox feedback source.

DistIL Objective
$$\mathcal{L}_{\mathrm{DistIL}}(\theta) \;:=\; \mathbb{E}_{x \sim \rho,\; y \sim \pi_\theta(\cdot \mid x)} \!\left[ \sum_{t=1}^{H} H^{\times}\!\Big( \mathsf{sg}\!\left(\pi_{\theta}(\cdot \mid x, y_{1:t-1}, f)\right),\; \pi_\theta(\cdot \mid x, y_{1:t-1}) \Big) \right]$$

Here \(H^{\times}(p,q) = -\mathbb{E}_{a \sim p}[\log q(a)]\) is the cross-entropy, \(\mathsf{sg}(\cdot)\) denotes stop-gradient, \(f\) is the rich feedback signal (e.g. execution trace, tool output), and \(H\) is the rollout horizon. The teacher \(\mathsf{sg}(\pi_\theta(\cdot \mid x, y_{1:t-1}, f))\) may be accessed as a blackbox.
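As a minimal sketch of the quantity inside the expectation, the snippet below evaluates the summed forward cross-entropy for a single rollout. It assumes the per-step teacher and student logits are already available as arrays of shape (horizon, vocabulary); the function names and the random toy inputs are hypothetical and not taken from the paper. Note that it only evaluates the inner sum: the full objective also averages over rollouts sampled from the student, and that sampling dependence is what gives rise to the sequence-level gradient.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def forward_cross_entropy(teacher_logits, student_logits):
    """Per-step H^x(p, q) = -sum_a p(a) log q(a); inputs have shape (H, V)."""
    # The teacher is treated as a constant, which plays the role of stop-gradient.
    teacher_probs = np.exp(log_softmax(teacher_logits))
    student_logp = log_softmax(student_logits)
    return -(teacher_probs * student_logp).sum(axis=-1)  # shape (H,)

def distil_rollout_loss(teacher_logits, student_logits):
    """Sum of the per-step forward cross-entropies along one sampled rollout."""
    return forward_cross_entropy(teacher_logits, student_logits).sum()

# Hypothetical toy rollout: horizon H = 3, vocabulary size V = 4.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(3, 4))  # stands in for pi_theta(. | x, y_{1:t-1})
teacher_logits = rng.normal(size=(3, 4))  # stands in for pi_theta(. | x, y_{1:t-1}, f)

per_step = forward_cross_entropy(teacher_logits, student_logits)
loss = distil_rollout_loss(teacher_logits, student_logits)
```

Each per-step value is bounded below by the teacher's own entropy at that step, with equality only when the student matches the teacher.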

Rich Credit Assignment via Future-Credit Propagation

The sequence-level gradient of \(\mathcal{L}_{\mathrm{DistIL}}\) decomposes into a local term (dense token-wise supervision, as in prior self-distillation methods) and a future-credit term that propagates downstream teacher–student disagreement back to earlier decisions. The second term helps the model learn that a bad early token choice leads to worse futures.

Policy Gradient Decomposition
$$\nabla_\theta \mathcal{L}_{\mathrm{DistIL}} \;=\; \underbrace{ \mathbb{E}_{y \sim \pi_\theta}\!\left[ \sum_{t=1}^{H} \nabla_\theta H^{\times}\!\Big( \mathsf{sg}\!\big(\pi_{\theta}(\cdot \mid s_t, f)\big),\; \pi_\theta(\cdot \mid s_t) \Big) \right] }_{\text{local credit assignment}} \;+\; \underbrace{ \mathbb{E}_{y \sim \pi_\theta}\!\left[ \sum_{t=1}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \!\left( \sum_{i > t} H^{\times}\!\Big( \mathsf{sg}\!\big(\pi_{\theta}(\cdot \mid s_i, f)\big),\; \pi_\theta(\cdot \mid s_i) \Big) \right) \right] }_{\text{future-credit assignment}}$$

where \(s_t = (x, y_{1:t-1})\) and \(a_t = y_t\). The local term mirrors the gradient of SDPO/OPSD and provides dense per-token supervision. The future-credit term is unique to DistIL: it weights the score \(\nabla_\theta \log\pi_\theta(a_t\mid s_t)\) by the cumulative teacher–student cross-entropy at all later steps \(i > t\), propagating the cost of disagreement back through time. Prior methods based on reverse KL or Jensen–Shannon divergence lack this term and cannot guarantee monotonic policy improvement.
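The decomposition can be checked numerically on a tiny tabular problem. The sketch below assumes a two-step rollout over a two-token vocabulary, a softmax policy with one logit vector per state, and a fixed teacher table standing in for the stop-gradient teacher; all of these choices and names are illustrative assumptions rather than the paper's setup. It enumerates every rollout exactly, builds the local and future-credit terms, and compares their sum to a finite-difference gradient of the objective.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def step_costs(theta, teacher):
    """c(s) = H^x(teacher(.|s), pi_theta(.|s)) for every state s; shape (S,)."""
    return -(teacher * np.log(softmax(theta))).sum(axis=-1)

def objective(theta, teacher):
    """Exact E_{y ~ pi_theta}[c(s_1) + c(s_2)] for the two-step toy problem."""
    pi, c = softmax(theta), step_costs(theta, teacher)
    # State 0 is the root; state 1 + a1 is reached after emitting token a1.
    return c[0] + pi[0] @ c[1:]

def decomposed_gradient(theta, teacher):
    """Local and future-credit terms, enumerating all first tokens exactly."""
    pi, c = softmax(theta), step_costs(theta, teacher)
    vocab = theta.shape[1]
    local, future = np.zeros_like(theta), np.zeros_like(theta)
    for a1 in range(vocab):
        prob, s2 = pi[0, a1], 1 + a1
        # Local term: the cross-entropy gradient w.r.t. the logits is pi - teacher.
        local[0] += prob * (pi[0] - teacher[0])
        local[s2] += prob * (pi[s2] - teacher[s2])
        # Future-credit term: score of the first token times the later cost c(s_2).
        score = np.eye(vocab)[a1] - pi[0]
        future[0] += prob * score * c[s2]
    return local, future

def finite_difference_gradient(theta, teacher, eps=1e-6):
    """Central differences of the exact objective, one logit at a time."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        bump = np.zeros_like(theta)
        bump[idx] = eps
        grad[idx] = (objective(theta + bump, teacher)
                     - objective(theta - bump, teacher)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
V = 2
theta = rng.normal(size=(1 + V, V))             # logits for root + V second-step states
teacher = softmax(rng.normal(size=(1 + V, V)))  # fixed teacher distribution per state

local, future = decomposed_gradient(theta, teacher)
numeric = finite_difference_gradient(theta, teacher)
gap = np.abs(local + future - numeric).max()
```

In this toy setting the future-credit term touches only the root logits, since the last step has no later cost to propagate, and dropping it leaves a gradient that no longer matches the sequence-level objective.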

Learning under Sparse Feedback: Science

We evaluate on SciKnowEval L3 across four scientific domains (Chemistry, Physics, Biology, Materials), reporting best avg@16 within 1h and 5h of wall-clock training on 4× NVIDIA H200 GPUs. Bold = best; underline = second best per column.

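| Method | Chemistry (1h / 5h) | Physics (1h / 5h) | Biology (1h / 5h) | Materials (1h / 5h) | Average (1h / 5h) |
|---|---|---|---|---|---|
| Qwen3-8B | 41.2 | 59.2 | 30.8 | 58.9 | 47.5 |
| + GRPO (off-policy) | 65.9 / 74.5 | 63.8 / 72.7 | 35.1 / 59.9 | 74.3 / 77.1 | 59.8 / 71.1 |
| + GRPO (on-policy) | 63.3 / 63.4 | 63.6 / 63.6 | 49.8 / 49.8 | 73.9 / 74.1 | 62.7 / 62.7 |
| + SDPO | 73.0 / 80.2 | 68.0 / 72.4 | 52.9 / 63.6 | 72.2 / 75.9 | 66.5 / 73.0 |
| + DistIL (Ours) | 75.8 / 80.8 | 72.7 / 80.8 | 53.3 / 66.6 | 74.9 / 76.2 | 69.2 / 76.1 |
| Olmo3-7B-Instruct | 22.8 | 37.7 | 16.2 | 36.7 | 28.4 |
| + GRPO (off-policy) | 39.7 / 56.7 | 55.3 / 63.3 | 35.6 / 55.8 | 70.9 / 75.0 | 50.4 / 62.7 |
| + GRPO (on-policy) | 51.4 / 57.5 | 62.7 / 62.7 | 49.8 / 49.8 | 73.3 / 73.5 | 59.3 / 60.9 |
| + SDPO | 70.2 / 79.2 | 59.8 / 64.9 | 49.5 / 52.9 | 71.8 / 78.1 | 62.8 / 68.8 |
| + DistIL (Ours) | 72.1 / 81.0 | 67.4 / 74.5 | 47.8 / 55.3 | 73.5 / 76.9 | 65.2 / 71.9 |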

Table 1. Comparison on scientific reasoning benchmarks (SciKnowEval L3). We report best avg@16 within 1h and 5h of wall-clock training on 4× NVIDIA H200 GPUs. Average is computed over all four domains.
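As a quick arithmetic check of how the Average column is formed, the snippet below recomputes it for the Qwen3-8B + DistIL row of Table 1 as the unweighted mean of the four domain scores. The variable names are illustrative; the numbers are copied from the table.

```python
# Qwen3-8B + DistIL scores from Table 1, per domain, at 1h and 5h.
distil_1h = {"Chemistry": 75.8, "Physics": 72.7, "Biology": 53.3, "Materials": 74.9}
distil_5h = {"Chemistry": 80.8, "Physics": 80.8, "Biology": 66.6, "Materials": 76.2}

# Unweighted mean over the four domains.
average_1h = sum(distil_1h.values()) / len(distil_1h)
average_5h = sum(distil_5h.values()) / len(distil_5h)
```

Both values agree with the reported 69.2 and 76.1 to the table's one-decimal precision.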

[Figure 1: eight training-curve panels. Top row: Best@16 for Biology, Chemistry, Materials, and Physics. Bottom row: Maj@16 for Biology, Chemistry, Materials, and Physics.]

Figure 1. Validation Best@16 (top) and Maj@16 (bottom) over training for RL with self-distillation algorithm SDPO and DistIL (ours) on Qwen3-8B across four scientific reasoning domains: biology, chemistry, materials, and physics. DistIL generally achieves higher validation performance than SDPO across domains and metrics, with gains often appearing early and sustained during training. SDPO exhibits greater variability with longer training, including a pronounced decline in biology Best@16 after roughly 100 steps and larger oscillations in chemistry and physics; DistIL is comparatively more stable.
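The page reports avg@16, Best@16, and Maj@16 without spelling out their definitions. The sketch below uses common conventions, which are assumptions here: avg@k is the mean correctness over k samples, Best@k is 1 if any of the k samples is correct, and Maj@k is 1 if the most frequent final answer matches the reference. The function names and the toy answers are hypothetical.

```python
from collections import Counter

def avg_at_k(correct):
    """Mean correctness over the k sampled responses."""
    return sum(correct) / len(correct)

def best_at_k(correct):
    """1.0 if at least one of the k samples is correct, else 0.0."""
    return float(any(correct))

def maj_at_k(answers, reference):
    """1.0 if the most frequent final answer among the k samples equals the reference."""
    # Counter.most_common breaks ties by first occurrence.
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return float(most_common_answer == reference)

# Hypothetical final answers from k = 8 samples on one multiple-choice question.
answers = ["B", "A", "B", "C", "B", "A", "B", "D"]
reference = "B"
correct = [answer == reference for answer in answers]

avg_score = avg_at_k(correct)
best_score = best_at_k(correct)
maj_score = maj_at_k(answers, reference)
```

Benchmark-level numbers would then be averages of these per-question values over the evaluation set.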

Learning with Rich Environment Feedback: Coding

The coding setting provides rich feedback in the form of execution traces and test outputs. We evaluate on LiveCodeBench v6 (LCBv6), reporting Score and Accuracy at Best@\(k\) and Maj@\(k\) for \(k \in \{2,4,8,16\}\) at temperature \(\tau{=}0.2\), using the checkpoint at step 80 (following SDPO).

[Figure 2: four panels at τ = 0.2. Score Maj@k, Score Best@k, Accuracy Maj@k, and Accuracy Best@k.]

Figure 2. LCBv6 evaluation at \(\tau{=}0.2\) for checkpoint at step-80, reporting Score and Accuracy at Best@\(k\) and Maj@\(k\) for \(k \in \{2,4,8,16\}\).

Learning to Solve Very Hard Mathematical Reasoning Problems

We benchmark on AIME24, AIME25, HMMT25, AMC23, and Minerva, reporting average (Avg) and pass (Pass) accuracy. We follow OPSD and report the best checkpoint up to step 100. Bold = best; underline = second best per column.

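| Model | Method | AIME24 (Avg / Pass) | AIME25 (Avg / Pass) | HMMT25 (Avg / Pass) | AMC23 (Avg / Pass) | Minerva (Avg / Pass) |
|---|---|---|---|---|---|---|
| Qwen3-4B | Base | 61.4 / 82.0 | 50.3 / 69.5 | 30.3 / 48.8 | 93.8 / 98.8 | 43.2 / 48.6 |
| Qwen3-4B | + SFT | 55.8 / 80.9 | 43.1 / 67.4 | 29.7 / 46.0 | 91.7 / 97.0 | 41.9 / 50.4 |
| Qwen3-4B | + GRPO | 61.4 / 82.0 | 50.3 / 69.5 | 30.3 / 48.8 | 93.8 / 98.8 | 43.2 / 48.6 |
| Qwen3-4B | + SDPO | 60.9 / 85.4 | 49.6 / 74.1 | 32.8 / 50.9 | 93.9 / 99.8 | 43.4 / 52.0 |
| Qwen3-4B | + OPSD | 63.2 / 85.4 | 51.5 / 76.1 | 33.0 / 54.2 | 94.5 / 99.6 | 43.4 / 51.8 |
| Qwen3-4B | + DistIL (Ours) | 65.3 / 87.5 | 55.3 / 77.6 | 33.0 / 56.6 | 94.8 / 100.0 | 44.2 / 52.9 |
| Qwen3-8B | Base | 75.9 / 86.8 | 66.9 / 79.6 | 44.7 / 65.6 | 95.6 / 100.0 | 48.9 / 56.3 |
| Qwen3-8B | + SFT | 74.2 / 85.3 | 65.3 / 80.7 | 42.2 / 64.4 | 96.3 / 100.0 | 48.2 / 56.2 |
| Qwen3-8B | + GRPO | 75.9 / 86.8 | 66.9 / 79.6 | 44.7 / 65.6 | 95.6 / 100.0 | 48.9 / 56.3 |
| Qwen3-8B | + SDPO | 76.9 / 89.5 | 68.3 / 84.0 | 45.6 / 71.2 | 96.5 / 100.0 | 49.2 / 57.7 |
| Qwen3-8B | + OPSD | 76.5 / 91.3 | 69.7 / 83.3 | 45.5 / 68.2 | 96.2 / 100.0 | 49.2 / 57.8 |
| Qwen3-8B | + DistIL (Ours) | 76.4 / 90.7 | 71.1 / 85.0 | 46.4 / 71.4 | 96.6 / 100.0 | 49.5 / 58.4 |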

Table 2. Performance on mathematical reasoning benchmarks for Qwen3 models. We report the best checkpoint up to step 100 following OPSD.

Ablation Study

We isolate the contribution of DistIL's future-credit term. The CE baseline shares the forward cross-entropy form but performs only local token-wise credit assignment, similar to SDPO/OPSD. We also ablate the number of top-\(k\) teacher tokens used during training.
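One plausible way to restrict supervision to the top-\(k\) teacher tokens is to zero out all other teacher probabilities and renormalize before taking the cross-entropy. The page does not say how the truncation is implemented, so the renormalization step and the function names below are assumptions made for illustration.

```python
import numpy as np

def top_k_teacher(teacher_probs, k):
    """Keep the k largest teacher probabilities per step and renormalize (assumed)."""
    top_idx = np.argsort(teacher_probs, axis=-1)[..., -k:]  # indices of the top-k tokens
    mask = np.zeros_like(teacher_probs, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    truncated = np.where(mask, teacher_probs, 0.0)
    return truncated / truncated.sum(axis=-1, keepdims=True)

def top_k_cross_entropy(teacher_probs, student_logp, k):
    """Forward cross-entropy against the truncated teacher; one value per step."""
    truncated = top_k_teacher(teacher_probs, k)
    return -(truncated * student_logp).sum(axis=-1)

# Hypothetical single step over a 4-token vocabulary with a uniform student.
teacher_probs = np.array([[0.5, 0.3, 0.15, 0.05]])
student_logp = np.log(np.full((1, 4), 0.25))

truncated = top_k_teacher(teacher_probs, k=2)
loss = top_k_cross_entropy(teacher_probs, student_logp, k=2)
```

With \(k\) equal to the vocabulary size this reduces to the untruncated forward cross-entropy, while small \(k\) only requires the teacher's top-\(k\) probabilities at each step.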

[Figure 3: DistIL vs. CE credit assignment.]

Figure 3. DistIL vs. CE on the Materials benchmark. CE performs only local token-wise credit assignment; DistIL adds full future-credit propagation.

[Figure 4: top-k ablation, Physics.]

Figure 4. Avg@16 on Physics benchmark as a function of the number of top-\(k\) tokens used during DistIL training on the Materials benchmark.

BibTeX

If you find this work useful, please cite our paper:

@misc{TBA,
  title         = {Reinforcement Learning from Rich Feedback with Distributional DAgger},
  author        = {Rishabh Agrawal and Jacob Fein-Ashley and Paria Rashidinejad},
  year          = {2026},
  eprint        = {2606.05152},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2606.05152},
}