
Speculative Decoding for Free.
92% of the speedup, zero drafter retraining.

z-lab's DFlash drafter was trained against stock Gemma 4 31B. We dropped it on top of our QLoRA fine-tune and it captured 92% of the published speedup with no drafter retraining. Here is the math, the vLLM patch we had to upstream to make it run, and the prod-cutover numbers (~15× faster, ~4× cheaper).

Building in Public — Inference Diaries, Part I


TL;DR

We serve a fine-tuned 31B-parameter Gemma 4 on a Modal H100 — Direct Free-text Optimization (DFO), our internal SFT/DPO mix on the AskTheDoctor medical Q&A corpus. The question we were trying to answer:

Can z-lab’s recently-released gemma-4-31B-it-DFlash — a 2B block-diffusion drafter trained against the stock Gemma 4 31B Instruct — give us a meaningful inference speedup without retraining the drafter against our DFO weights?

Three numbers tell the story:

$$ \text{speedup}_{\text{DFO}} \;=\; 1.18\times \;\text{(avg)} \qquad \text{speedup}_{\text{DFO}}^{\text{math-peak}} \;=\; 4.0\times \qquad \text{retention} \;=\; \frac{1.18}{1.28} \;=\; 92\%. $$

The retention number is the load-bearing one. The drafter served a target it never saw during training — and it kept 92% of the throughput it earned on the target it was trained for. We expected somewhere between 50% and 80%. We got 92%.

We also had to patch a structural blocker in vLLM that prevented Gemma 4 + DFlash from working at all, and contributed the patch upstream:

  • vLLM issue #42068 — Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters
  • vLLM PR #42069 — one-line backend=None override letting the drafter autoselect a non-causal-capable backend

Total experiment GPU spend: ~$27 across 14 attempts on Modal.


The drafter that isn’t a model

Speculative decoding lets a small fast “drafter” propose tokens that a large slow “target” verifies in parallel. Standard implementations draft K tokens autoregressively — K serial forward passes through the drafter — then verify them all in one parallel pass through the target.

DFlash is a different shape. It’s a block-diffusion drafter:

  • 5 trained transformer layers (Qwen3 derivatives have 8)
  • Shares the target’s embedding + LM head, frozen
  • Conditioned on hidden states from 5 uniformly-sampled layers of the target — those states are concatenated, projected, and injected into the drafter’s KV cache as persistent context
  • Drafts a whole block of K = 16 tokens in one parallel forward pass, then the target verifies the entire block in one parallel pass
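
In pseudocode, one draft/verify step looks roughly like this (a minimal sketch with hypothetical helper names, not z-lab's implementation):

```python
# Hypothetical sketch of one DFlash draft/verify step. Every helper name here
# is illustrative, not z-lab's or vLLM's actual API.
def dflash_step(target, drafter, ctx, K=16):
    # Condition the drafter on hidden states from the 5 sampled target layers,
    # injected into its KV cache as persistent context.
    h = target.hidden_states(ctx, layers=drafter.condition_layers)
    # Draft the whole K-token block in one non-causal forward pass.
    block = drafter.draft_block(ctx, conditioning=h, num_tokens=K)
    # Verify the whole block in one parallel target forward pass.
    logits = target.forward(ctx + block)
    # Accept the longest prefix the target agrees with (verifier-lossless),
    # then the target itself supplies the token after the accepted prefix.
    accepted = longest_verified_prefix(block, logits)
    next_token = greedy_pick(logits[len(accepted)])
    return ctx + accepted + [next_token]
```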

Throughput grows roughly linearly with acceptance length $\ell$ — the number of drafted tokens the target accepts before rejecting one and resuming autoregressive generation:

$$ \text{tok/s}_{\text{spec}} \;\approx\; \frac{\mathbb{E}[\ell] + 1}{T_{\text{drafter}} + T_{\text{verifier}}} \quad\text{vs}\quad \text{tok/s}_{\text{base}} \;=\; \frac{1}{T_{\text{verifier}}}. $$

When $\mathbb{E}[\ell]$ is high (sharp next-token distributions — arithmetic, code, step-by-step reasoning), spec-decode wins big. When it's low (open-ended creative text with high-entropy next-token distributions), the drafter overhead can erase the gain.
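
To make the crossover concrete, here is an illustrative plug-in with round numbers of our choosing (not measurements): assume the drafter costs 20% of a verifier pass per round. Then

$$ T_{\text{drafter}} = 0.2\,T_{\text{verifier}}: \qquad \mathbb{E}[\ell] = 3 \;\Rightarrow\; \frac{3+1}{1.2} \approx 3.3\times, \qquad \mathbb{E}[\ell] = 0.2 \;\Rightarrow\; \frac{0.2+1}{1.2} = 1\times \;\text{(break-even)}. $$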

The catch with DFlash specifically: because the drafter is conditioned on the target’s hidden-state distribution, it’s tuned to a specific target. Z-lab’s published drafter was trained against google/gemma-4-31B-it — stock Instruct, no fine-tune. Our DFO checkpoint drifts from that base by however much our SFT + DPO + Direct Free-text passes shifted the model.

No one had published a base-vs-fine-tune ablation. We’re the experiment.


Why we expected the answer to be “mostly works”

Two reasons to believe a stock-trained drafter degrades gracefully on a fine-tuned target rather than collapsing:

  1. Verifier-side losslessness is unconditional. The target sees every drafted block, accepts the longest verifiable prefix, and generates the next token autoregressively. There is no quality-loss path. If the drafter is bad, the system gets slower, not worse.

  2. DFO is a relatively small distributional shift. We’re not training a different model — we’re fine-tuning on a domain corpus with DPO from a strong base. The hidden-state distribution at the 5 layers DFlash conditions on shouldn’t be wildly off-manifold.

Where the drafter could collapse: if our DFO training shifted early-layer representations a lot (the drafter is conditioned on shallow → deep layers), or if DFO output puts mass on tokens stock Gemma rarely picks. Either is plausible. Phase 2 had to tell us.
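
One diagnostic we didn't run but that would quantify that risk: compare hidden states of the stock and DFO checkpoints at the conditioning layers. A minimal sketch, assuming Hugging Face checkpoints for both (layer indices illustrative; at 31B this needs device_map="auto" and patience):

```python
# Illustrative drift probe, not part of the experiment: mean token-wise
# cosine similarity of hidden states between base and fine-tune, per layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def layer_drift(base_id, ft_id, text, layers):
    tok = AutoTokenizer.from_pretrained(base_id)
    ids = tok(text, return_tensors="pt").input_ids
    states = {}
    for key, mid in [("base", base_id), ("ft", ft_id)]:
        model = AutoModelForCausalLM.from_pretrained(
            mid, torch_dtype=torch.bfloat16, device_map="auto")
        with torch.no_grad():
            out = model(ids.to(model.device), output_hidden_states=True)
        states[key] = [out.hidden_states[l].float().cpu() for l in layers]
        del model
        torch.cuda.empty_cache()
    # Values near 1.0 mean the fine-tune left the drafter's conditioning
    # signal on-manifold at that layer.
    return {l: torch.nn.functional.cosine_similarity(b, f, dim=-1).mean().item()
            for l, b, f in zip(layers, states["base"], states["ft"])}
```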


What it took to even run the experiment

We expected this to be a couple of model-launch invocations. It wasn’t — and the blocker turned out to be an architectural decoupling problem worth describing because it generalizes beyond DFlash.

The blocker, in one paragraph. vLLM’s Gemma 4 config force-locks the attention backend to TRITON_ATTN when the model has heterogeneous head dimensions (Gemma 4 has head_dim=256 for sliding-window attention layers and global_head_dim=512 for full-attention layers). That lock is correct for the target’s own forward pass — preventing mixed-backend numerical drift between sliding and global layers. But when spec-decode is wired in, the same lock propagates to the drafter as well. DFlash’s drafter uses non-causal (bidirectional) attention to draft a full 16-token block in one pass. TRITON_ATTN doesn’t support non-causal attention and rejects the drafter at engine init:

ValueError: Selected backend AttentionBackendEnum.TRITON_ATTN is not valid
for this configuration. Reason: ['non-causal attention not supported']

Result: Gemma 4 + DFlash speculative decoding is structurally impossible upstream today.

The general lesson: spec-decode’s MTP (multi-token prediction) variant needs backend propagation, because those drafters share KV cache with the target. DFlash drafters have their own KV cache and are algorithmically independent — they’re a different shape of speculative-decode entirely. A backend lock that’s correct for one shape is wrong for the other. The fix is one line — make backend propagation conditional on whether the drafter is independent — but the diagnosis is the load-bearing work, because nothing in the error message points you at MTP-vs-DFlash as the relevant distinction.
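
The shape of the fix, as a hedged sketch (function and argument names are illustrative; the real one-liner is in PR #42069):

```python
# Illustrative sketch of the decoupling logic, not vLLM's actual code.
# MTP-style drafters share the target's KV cache, so they must inherit the
# target's (TRITON_ATTN-locked) backend. Independent drafters like DFlash
# own their KV cache and need a non-causal-capable backend instead.
def drafter_backend(target_backend, drafter_shares_kv_cache: bool):
    if drafter_shares_kv_cache:
        return target_backend  # MTP: backend propagation is correct here
    return None                # DFlash: let the engine autoselect
```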

The fork lives at vLLM PR #42069; the upstream issue with the full diagnosis is at #42068. 12 attempts and ~$25 of H100 time before we had a clean Phase 1 run, almost all of it spent isolating this single decoupling issue.


Phase 1 — stock target, the harness check

Stock google/gemma-4-31B-it + DFlash drafter, 10 prompts (5 math, 5 conversational), temperature=0.0, max_new_tokens=256, on Modal H100-80GB:

| Prompt category | with DFlash | without | speedup |
| --- | --- | --- | --- |
| Math reasoning peak (prompt 4) | 169–176 tok/s | ~40 tok/s | 4.4× |
| Mixed average (10 prompts) | 50.6 tok/s | 39.4 tok/s | 1.28× |

Math-heavy prompts dominate the speedup — exactly as the paper predicts. Acceptance length is highest when the next-token distribution is sharp, which is the case for arithmetic and step-by-step reasoning. The cold-start prompt drags the average down (17.5 tok/s on prompt 1 due to torch.compile + CUDA graph capture for the spec pipeline).

Output was bit-identical between the two runs, as the verifier-lossless guarantee promises.

This was enough to confirm: our patched vLLM works, the drafter loads, the spec pipeline runs end-to-end. Time to swap in our target.


Phase 2 — DFO target, the actual question

Our QLoRA fine-tune ships as a 4-bit adapter (adapter_model.safetensors + adapter_config.json) trained with unsloth. To feed it to vLLM we needed a merged bf16 checkpoint. After a Gemma4ClippableLinear rejection in peft 0.13 ate ~$0.20 of a failed CPU-merge attempt, we split the work:

  1. merge_dfo_to_volume on A100-40GB — unsloth FastLanguageModel.from_pretrained(..., load_in_4bit=True) then save_pretrained_merged(save_method="merged_16bit"). NF4 load 102s, bf16 dequant + write 357s. Total ~7.6 min, ~$0.20. Persisted to arena-models:/gemma4-31b-qlora-v2-atd-merged/.
  2. phase2_dfo_target on H100 — loads the merged path directly (no merge cost on the expensive GPU), runs the same A/B as Phase 1.
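
The merge step in code, roughly (a sketch of merge_dfo_to_volume; paths stand in for our Modal volume layout):

```python
# Sketch of the A100 merge step. The unsloth calls are the ones named above;
# the paths are placeholders for our Modal volume mounts.
from unsloth import FastLanguageModel

# NF4 load of base + QLoRA adapter (~102 s in our run).
model, tokenizer = FastLanguageModel.from_pretrained(
    "/vol/gemma4-31b-qlora-v2-atd",  # adapter_model.safetensors lives here
    load_in_4bit=True,
)

# Dequantize to bf16 and write a merged checkpoint vLLM can load directly
# (~357 s in our run).
model.save_pretrained_merged(
    "/vol/gemma4-31b-qlora-v2-atd-merged",
    tokenizer,
    save_method="merged_16bit",
)
```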

Result:

| Phase | Target | Avg speedup | Math peak | vs Phase 1 |
| --- | --- | --- | --- | --- |
| 1 | google/gemma-4-31B-it (stock) | 1.28× | 4.4× | — |
| 2 | merged DFO QLoRA target | 1.18× | 4.0× | 92% |

DFO captures 92% of the stock-target speedup. We expected somewhere between 50% and 80%. Got 92%.

The math-peak retention is similarly strong (4.0× / 4.4× = 91%). And critically, the verifier-lossless guarantee held: prompt 3 emitted exactly 1 token in both runs (a behavior shift in the DFO model where it terminates early on a particular medical-reasoning prompt) — confirming the spec-decode pipeline really is preserving the target’s distribution.

[Figure: Bar chart of DFlash speedups, stock Gemma 4 31B vs our DFO QLoRA fine-tune — mixed-prompt average 1.28× vs 1.18× (92% retention); math-reasoning peak 4.4× vs 4.0× (91% retention). Footer notes that spec-decode is verifier-lossless, so the speedup is a strict throughput improvement with zero quality cost.]
The drafter served a target it never saw during training. It kept 92% of the throughput it earned on the target it was trained for. Verifier-lossless means the per-token quality cost is exactly zero — the speedup is a strict win.

What this means for anyone fine-tuning Gemma 4

The implication of Phase 2 is the genuinely useful one:

You can take z-lab’s stock-trained DFlash drafter, drop it on top of your QLoRA-merged Gemma 4, and capture ~90% of the published speedup. No drafter retraining. ~$0 on top of whatever you spend serving today.

z-lab’s training recipe isn’t public yet (“coming soon”), and a custom drafter pass is ~$5–15K of 8×H100 time. If you can get 92% of the speedup for free, the math says wait on the custom drafter.

We’d love to see independent confirmation on other Gemma 4 fine-tunes — and on Llama 3.1 / Qwen3 fine-tunes paired with their respective stock drafters. The acceptance-length retention is probably similar (transformers fine-tuned on domain corpora generally preserve the layer-wise hidden-state distribution well), but 92% is one datapoint, not a curve.


Two ways to read “1.18×”

The headline 1.18× hides two separate comparisons that point in different directions.

Comparison 1 — same target, same H100, with-DFlash vs without. The patch’s direct impact. 1.18× / 4.0× on our DFO target. Verifier-lossless. The spec-decode mechanism literally adds tokens-per-second to a fixed checkpoint on a fixed GPU.

Comparison 2 — stock target vs DFO target, both with DFlash. The 92% retention. Confirms our fine-tune composes with the stock-trained drafter, which is the load-bearing finding for the entire “drop-in DFlash for fine-tunes” hypothesis.

The first comparison says spec-decode works. The second says it transfers across the supervised + DPO distributional shift. Neither follows from the other; both are necessary for the thesis.


Concurrency: where the architecture stops mattering

Single-stream throughput numbers are easy to over-interpret. The interesting throughput regime for any inference path is what happens under concurrent load — and here the architectural choice (continuous batching vs serialized model.generate) dominates the kernel-level speedup. We measured the DFlash endpoint at concurrency 1 / 5 / 10 / 25 / 50:

[Figure: Three-panel comparison of the prior serialized inference path vs the DFlash continuous-batching path on the same suite. Pass rate at concurrency=2: 2 of 10 (8 timeouts on the serialized path) vs 10 of 10 on DFlash. Median per-test latency: 37 s vs 2.5 s (~15× faster). Cost per request: $0.0113 vs $0.0027 (~4× cheaper). Quality on the 2 prompts both paths served was identical, so the failures are infrastructure (queue + timeout), not output drift.]
The 15× per-test gap is mostly the architectural difference (continuous batching vs serialized generate), not just the spec-decode kernel speedup. Spec-decode is the per-token win; continuous batching is the queue-depth win. They compose multiplicatively under realistic concurrent load.

Throughput plateaus at concurrency ≈ 10 (~1.3 rps, ~86 tok/s); beyond that the engine just queues and inflates p99 latency without raising completion throughput. The single-stream → 10-way batched gain on DFlash specifically is ~2.3× (38 → 86 tok/s). Less dramatic than what you see on long-prompt scenarios — our test prompts were short medical Q&A — but consistent with what continuous-batching architectures show on any LLM. For the long-form chat regime that real users actually generate, the multiplier grows with average response length.

Quality on every prompt that both paths could serve was identical, as the verifier-losslessness guarantee predicts. The 8 failures on the serialized path were timeout failures (queue exhaustion at 240s), not output-drift failures.
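
For reference, the sweep harness is nothing exotic. A minimal sketch against a vLLM OpenAI-compatible server (endpoint, model name, and prompt set are placeholders for our harness):

```python
# Minimal concurrency-sweep sketch. Fires len(prompts) requests with at most
# `concurrency` in flight and reports completion throughput.
import asyncio, time
import httpx

async def one(client, sem, prompt):
    async with sem:
        r = await client.post(
            "http://localhost:8000/v1/completions",
            json={"model": "dfo-merged", "prompt": prompt, "max_tokens": 256},
            timeout=240.0,
        )
        return r.json()["usage"]["completion_tokens"]

async def sweep(prompts, concurrency):
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:
        t0 = time.perf_counter()
        toks = await asyncio.gather(*(one(client, sem, p) for p in prompts))
        dt = time.perf_counter() - t0
    print(f"c={concurrency}: {len(prompts)/dt:.2f} rps, {sum(toks)/dt:.0f} tok/s")

# for c in (1, 5, 10, 25, 50): asyncio.run(sweep(PROMPTS, c))
```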


TPU is a separate bet

Per Google’s blog, DFlash gets an additional ~2× on TPU v5p via JAX/Pallas. We’re deferring because:

  • No published Gemma-31B-on-TPU benchmark; the blog uses Llama-3.1-8B and Qwen3-4B targets.
  • On-demand TPU v5p list price ($4.20/chip-hour × 2–4 chips for 31B = $8.40–$16.80/hr) is roughly cost-neutral with Modal H100 at $3.95/hr unless we commit to 1-yr/3-yr discounts.
  • The PyTorch/torchax TPU path is WIP; production stack would mean JAX/Pallas, a much bigger porting effort.

Once we have a real H100 + DFlash $/M-tokens baseline through Fuhrman calibration, we’ll have something concrete to compare a TPU pilot against.


Reproduce it

The experiment is two phases. Each takes about 10 minutes of H100 time once the patched vLLM is in place.

Phase 1: stock target sanity check
  load google/gemma-4-31B-it + z-lab/gemma-4-31B-it-DFlash drafter
  run 10 prompts, temperature=0.0, max_new_tokens=256
  measure tok/s with and without spec-decode
  expected: 1.28× avg, 4.4× math-peak
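
In code, the Phase 1 A/B is two engine launches. A sketch assuming our patched build exposes the drafter through vLLM's speculative_config (the exact keys depend on the patched dflash.py):

```python
# Phase 1 A/B sketch. speculative_config is vLLM's standard drafter hook;
# whether DFlash accepts exactly these keys depends on the patched dflash.py.
from vllm import LLM, SamplingParams

params = SamplingParams(temperature=0.0, max_tokens=256)

baseline = LLM(model="google/gemma-4-31B-it")

spec = LLM(
    model="google/gemma-4-31B-it",
    speculative_config={
        "model": "z-lab/gemma-4-31B-it-DFlash",
        "num_speculative_tokens": 16,  # K = one drafted block
    },
)

# Run the same 10-prompt suite through both; the tok/s ratio is the speedup.
```

Phase 2 is the same pair of launches with model= pointed at your merged checkpoint.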

Phase 2: your fine-tune
  merge your QLoRA adapter to bf16
  load the merged checkpoint + the same stock drafter
  run the identical 10-prompt suite
  measure tok/s with and without spec-decode
  the ratio of (Phase 2 speedup) / (Phase 1 speedup) is your retention number

The patched dflash.py (with the one-line backend-decoupling fix) is in our public repo and overlays onto vLLM nightly without a rebuild. Once vLLM PR #42069 lands upstream, the overlay disappears and the standard pip install vllm is all you need.


Acknowledgments

z-lab for releasing the DFlash drafter and the underlying paper. vLLM maintainers for the spec-decode framework and for entertaining a fix for a corner-case backend lock. unsloth for making the Gemma 4 4-bit + merge-to-bf16 path Just Work.


Next up in the Inference Diaries: porting this same stack to TPU v5p and seeing whether the published 2× JAX/Pallas multiplier holds for a 31B medical Q&A target — and what changes when DFlash sits behind a calibrated judge instead of a strict gold-reference scorer.

Building Divinci in public. The full notebook for this experiment is at notebooks/DFLASH_EXPERIMENT.md. The Modal scaffold and patched dflash.py are at workspace/workers/eval-harness/eval_harness/scripts/.
