Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert–student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen–Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
DistIL minimizes a forward cross-entropy between a teacher policy and the student policy, summed over all time steps of a sampled rollout. The teacher is the model's own next-token distribution, additionally conditioned on feedback \(f\) and held fixed by a stop-gradient. This replaces scalar reward signals with token-level distributional supervision while remaining compatible with any blackbox feedback source.
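The display equation for the objective did not survive on this page. A reconstruction from the surrounding definitions, offered as a sketch rather than a quotation of the paper, is given below. It assumes the rollout \(y\) is sampled from the current student policy, and it omits the outer expectation over prompts \(x\) and any length normalization the paper may use:

\[
\mathcal{L}_{\mathrm{DistIL}}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{H} H^{\times}\!\Big( \mathsf{sg}\big(\pi_\theta(\cdot \mid x, y_{1:t-1}, f)\big),\; \pi_\theta(\cdot \mid x, y_{1:t-1}) \Big) \right]
\]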
Here \(H^{\times}(p,q) = -\mathbb{E}_{a \sim p}[\log q(a)]\) is the cross-entropy, \(\mathsf{sg}(\cdot)\) denotes stop-gradient, \(f\) is the rich feedback signal (e.g. execution trace, tool output), and \(H\) is the rollout horizon. The teacher \(\mathsf{sg}(\pi_\theta(\cdot \mid x, y_{1:t-1}, f))\) may be accessed as a blackbox.
The sequence-level gradient of \(\mathcal{L}_{\mathrm{DistIL}}\) decomposes into a local term (dense token-wise supervision, as in prior self-distillation methods) and a future-credit term that propagates downstream teacher–student disagreement back to earlier decisions. The second term helps the model learn that a bad early token choice leads to worse futures.
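The gradient equation itself is also missing from this page. Applying the score-function identity to the on-policy expectation gives a decomposition of the following shape, if the teacher distribution is treated as a constant (stop-gradient, with \(f\) taken as given). This is a reconstruction from the description in this section, not the paper's exact statement. Here \(a_t = y_t\) is the sampled token, \(s_t\) is the prefix state at step \(t\), and \(\pi^{f}_{t}\) is shorthand introduced here for \(\mathsf{sg}(\pi_\theta(\cdot \mid s_t, f))\):

\[
\nabla_\theta \mathcal{L}_{\mathrm{DistIL}}(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta} \Bigg[ \underbrace{\sum_{t=1}^{H} \nabla_\theta H^{\times}\!\big(\pi^{f}_{t},\, \pi_\theta(\cdot \mid s_t)\big)}_{\text{local term}} \;+\; \underbrace{\sum_{t=1}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{i=t+1}^{H} H^{\times}\!\big(\pi^{f}_{i},\, \pi_\theta(\cdot \mid s_i)\big)}_{\text{future-credit term}} \Bigg]
\]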
where \(s_t = (x, y_{1:t-1})\). The local term mirrors the gradient of SDPO/OPSD and provides dense per-token supervision. The future-credit term is unique to DistIL: it weights \(\log\pi_\theta(a_t\mid s_t)\) by cumulative teacher–student cross-entropy at all later steps \(i > t\), propagating the cost of disagreement back through time. Prior methods based on reverse KL or Jensen–Shannon lack this term and cannot guarantee monotonic policy improvement.
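The two terms described above can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the toy shapes, and the random logits are all hypothetical. It computes the per-step forward cross-entropy and the future-credit weight for each step, which is the sum of cross-entropies at strictly later steps.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_step_cross_entropy(teacher_logits, student_logits):
    """H^x(teacher, student) at every step; both inputs have shape (H, V)."""
    teacher_probs = np.exp(log_softmax(teacher_logits))
    return -(teacher_probs * log_softmax(student_logits)).sum(axis=-1)

def future_credit_weights(ce):
    """w[t] = sum of ce[i] over later steps i > t (zero at the last step)."""
    suffix_sums = np.cumsum(ce[::-1])[::-1]  # sum over i >= t
    return suffix_sums - ce                   # drop the i == t term

def surrogate_value(ce, weights, action_log_probs):
    """Local term plus future-credit term as a scalar surrogate.

    In an autodiff framework, `weights` (and the teacher) would be wrapped
    in a stop-gradient so that only the student terms are differentiated.
    """
    return float(ce.sum() + (weights * action_log_probs).sum())

# Toy rollout: H = 5 steps over a hypothetical vocabulary of V = 7 tokens.
rng = np.random.default_rng(0)
H, V = 5, 7
teacher_logits = rng.normal(size=(H, V))  # stands in for pi_theta(. | s_t, f)
student_logits = rng.normal(size=(H, V))  # stands in for pi_theta(. | s_t)
actions = rng.integers(0, V, size=H)      # stands in for sampled tokens a_t

ce = per_step_cross_entropy(teacher_logits, student_logits)
weights = future_credit_weights(ce)
action_log_probs = log_softmax(student_logits)[np.arange(H), actions]
value = surrogate_value(ce, weights, action_log_probs)
```

Because every cross-entropy is positive, the weights shrink toward the end of the rollout: early tokens are held responsible for more downstream disagreement than late ones, and the final token receives no future credit at all.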
We evaluate on SciKnowEval L3 across four scientific domains (Chemistry, Physics, Biology, Materials), reporting best avg@16 within 1h and 5h of wall-clock training on 4× NVIDIA H200 GPUs. Bold = best; underline = second best per column.
| Method | Chemistry 1h | Chemistry 5h | Physics 1h | Physics 5h | Biology 1h | Biology 5h | Materials 1h | Materials 5h | Average 1h | Average 5h |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | 41.2 | 41.2 | 59.2 | 59.2 | 30.8 | 30.8 | 58.9 | 58.9 | 47.5 | 47.5 |
| + GRPO (off-policy) | 65.9 | 74.5 | 63.8 | 72.7 | 35.1 | 59.9 | 74.3 | 77.1 | 59.8 | 71.1 |
| + GRPO (on-policy) | 63.3 | 63.4 | 63.6 | 63.6 | 49.8 | 49.8 | 73.9 | 74.1 | 62.7 | 62.7 |
| + SDPO | 73.0 | 80.2 | 68.0 | 72.4 | 52.9 | 63.6 | 72.2 | 75.9 | 66.5 | 73.0 |
| + DistIL (Ours) | 75.8 | 80.8 | 72.7 | 80.8 | 53.3 | 66.6 | 74.9 | 76.2 | 69.2 | 76.1 |
| Olmo3-7B-Instruct | 22.8 | 22.8 | 37.7 | 37.7 | 16.2 | 16.2 | 36.7 | 36.7 | 28.4 | 28.4 |
| + GRPO (off-policy) | 39.7 | 56.7 | 55.3 | 63.3 | 35.6 | 55.8 | 70.9 | 75.0 | 50.4 | 62.7 |
| + GRPO (on-policy) | 51.4 | 57.5 | 62.7 | 62.7 | 49.8 | 49.8 | 73.3 | 73.5 | 59.3 | 60.9 |
| + SDPO | 70.2 | 79.2 | 59.8 | 64.9 | 49.5 | 52.9 | 71.8 | 78.1 | 62.8 | 68.8 |
| + DistIL (Ours) | 72.1 | 81.0 | 67.4 | 74.5 | 47.8 | 55.3 | 73.5 | 76.9 | 65.2 | 71.9 |
Table 1. Comparison on scientific reasoning benchmarks (SciKnowEval L3). We report best avg@16 within 1h and 5h of wall-clock training on 4× NVIDIA H200 GPUs. Average is computed over all four domains.
Figure 1. Validation Best@16 (top) and Maj@16 (bottom) over training for RL with self-distillation algorithm SDPO and DistIL (ours) on Qwen3-8B across four scientific reasoning domains: biology, chemistry, materials, and physics. DistIL generally achieves higher validation performance than SDPO across domains and metrics, with gains often appearing early and sustained during training. SDPO exhibits greater variability with longer training, including a pronounced decline in biology Best@16 after roughly 100 steps and larger oscillations in chemistry and physics; DistIL is comparatively more stable.
The coding setting provides rich feedback in the form of execution traces and test outputs. We evaluate on LiveCodeBench v6 (LCBv6), reporting Score and Accuracy at Best@\(k\) and Maj@\(k\) for \(k \in \{2,4,8,16\}\) at temperature \(\tau{=}0.2\), using the step-80 checkpoint (following SDPO).
[Figure 2 panels, all at τ = 0.2: Score · Maj@k; Score · Best@k; Accuracy · Maj@k; Accuracy · Best@k]
Figure 2. LCBv6 evaluation at \(\tau{=}0.2\) for checkpoint at step-80, reporting Score and Accuracy at Best@\(k\) and Maj@\(k\) for \(k \in \{2,4,8,16\}\).
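For readers unfamiliar with the sample-based metrics named above, here is a small sketch of how avg@k, Best@k, and Maj@k are commonly computed from \(k\) sampled responses to one problem. The definitions are the usual ones and are an assumption here; the source does not spell out its own definitions, and the helper names are hypothetical.

```python
from collections import Counter

def avg_at_k(scores):
    """Mean score over all sampled responses."""
    return sum(scores) / len(scores)

def best_at_k(scores, k):
    """Best score among the first k samples (an oracle picks the best one)."""
    return max(scores[:k])

def maj_at_k(answers, scores, k):
    """Score of the most frequent answer among the first k samples."""
    top_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return scores[answers[:k].index(top_answer)]

# Hypothetical samples for one problem: final answers and their scores.
answers = ["a", "b", "b", "c"]
scores = [1.0, 0.0, 0.0, 0.0]

avg4 = avg_at_k(scores)
best4 = best_at_k(scores, 4)
maj4 = maj_at_k(answers, scores, 4)
```

In this toy case one sample is correct, so Best@4 is 1.0, but the majority answer is wrong, so Maj@4 is 0.0. That gap is why the two metrics are reported separately.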
We benchmark on AIME24, AIME25, HMMT25, AMC23, and Minerva, reporting average (Avg) and pass (Pass) accuracy. We follow OPSD and report the best checkpoint up to step 100. Bold = best; underline = second best per column.
| Model | Method | AIME24 Avg | AIME24 Pass | AIME25 Avg | AIME25 Pass | HMMT25 Avg | HMMT25 Pass | AMC23 Avg | AMC23 Pass | Minerva Avg | Minerva Pass |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | Base | 61.4 | 82.0 | 50.3 | 69.5 | 30.3 | 48.8 | 93.8 | 98.8 | 43.2 | 48.6 |
| | + SFT | 55.8 | 80.9 | 43.1 | 67.4 | 29.7 | 46.0 | 91.7 | 97.0 | 41.9 | 50.4 |
| | + GRPO | 61.4 | 82.0 | 50.3 | 69.5 | 30.3 | 48.8 | 93.8 | 98.8 | 43.2 | 48.6 |
| | + SDPO | 60.9 | 85.4 | 49.6 | 74.1 | 32.8 | 50.9 | 93.9 | 99.8 | 43.4 | 52.0 |
| | + OPSD | 63.2 | 85.4 | 51.5 | 76.1 | 33.0 | 54.2 | 94.5 | 99.6 | 43.4 | 51.8 |
| | + DistIL (Ours) | 65.3 | 87.5 | 55.3 | 77.6 | 33.0 | 56.6 | 94.8 | 100.0 | 44.2 | 52.9 |
| Qwen3-8B | Base | 75.9 | 86.8 | 66.9 | 79.6 | 44.7 | 65.6 | 95.6 | 100.0 | 48.9 | 56.3 |
| | + SFT | 74.2 | 85.3 | 65.3 | 80.7 | 42.2 | 64.4 | 96.3 | 100.0 | 48.2 | 56.2 |
| | + GRPO | 75.9 | 86.8 | 66.9 | 79.6 | 44.7 | 65.6 | 95.6 | 100.0 | 48.9 | 56.3 |
| | + SDPO | 76.9 | 89.5 | 68.3 | 84.0 | 45.6 | 71.2 | 96.5 | 100.0 | 49.2 | 57.7 |
| | + OPSD | 76.5 | 91.3 | 69.7 | 83.3 | 45.5 | 68.2 | 96.2 | 100.0 | 49.2 | 57.8 |
| | + DistIL (Ours) | 76.4 | 90.7 | 71.1 | 85.0 | 46.4 | 71.4 | 96.6 | 100.0 | 49.5 | 58.4 |
Table 2. Performance on mathematical reasoning benchmarks for Qwen3 models. We report the best checkpoint up to step 100 following OPSD.
We isolate the contribution of DistIL's future-credit term. The CE baseline shares the forward cross-entropy form but performs only local token-wise credit assignment, similar to SDPO/OPSD. We also ablate the number of top-\(k\) teacher tokens used during training.
Figure 3. DistIL vs. CE on Materials benchmark. CE does local token-wise credit assignment only; DistIL adds full future-credit propagation.
Figure 4. Avg@16 on Physics benchmark as a function of the number of top-\(k\) tokens used during DistIL training on the Materials benchmark.
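The top-\(k\) ablation above restricts the teacher distribution to a limited number of tokens. One plausible reading, sketched below, keeps the \(k\) most likely teacher tokens, renormalizes them, and takes the cross-entropy against the student on that support. Whether the paper renormalizes, and how it handles the discarded mass, is not stated in this section, so both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def topk_cross_entropy(teacher_probs, student_log_probs, k):
    """Cross-entropy against the teacher restricted to its k most likely tokens."""
    top_idx = np.argsort(teacher_probs)[-k:]   # indices of the k largest teacher probs
    truncated = teacher_probs[top_idx]
    truncated = truncated / truncated.sum()    # renormalize (an assumption)
    return float(-(truncated * student_log_probs[top_idx]).sum())

# Hypothetical next-token distributions over a 4-token vocabulary.
teacher_probs = np.array([0.5, 0.3, 0.15, 0.05])
student_log_probs = np.log(np.array([0.4, 0.4, 0.1, 0.1]))

ce_top1 = topk_cross_entropy(teacher_probs, student_log_probs, k=1)
ce_full = topk_cross_entropy(teacher_probs, student_log_probs, k=4)
```

With \(k = 1\) the target collapses to the teacher's single most likely token, and with \(k\) equal to the vocabulary size the full forward cross-entropy is recovered. Smaller \(k\) reduces the amount of teacher output that must be stored or transferred per step.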
If you find this work useful, please cite our paper: