z-lab's DFlash drafter was trained against stock Gemma 4 31B. We dropped it on top of our QLoRA fine-tune and it captured 92% of the published speedup with no drafter retraining. Here is the math, the vLLM patch we had to upstream to make it run, and the prod-cutover numbers (~15× faster, ~4× cheaper).
DFlash · Speculative Decoding · Gemma 4 · vLLM · Inference · H100 · QLoRA · DFO
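The post promises "the math" behind the speedup. Its exact derivation isn't shown here, but the standard starting point for speculative decoding is the expected number of accepted tokens per target-model forward pass as a function of acceptance rate. A minimal sketch (the values of `alpha` and `gamma` are illustrative, not DFlash's):

```python
def expected_tokens_per_verify(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass in speculative decoding.

    alpha: per-token probability the target accepts a drafted token
    gamma: number of tokens drafted per verification step
    Geometric-series identity: E[tokens] = (1 - alpha^(gamma+1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        # Every draft accepted: gamma drafted tokens plus one bonus token.
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)
```

"Capturing 92% of the published speedup with no drafter retraining" is, in these terms, a claim that the acceptance rate against the QLoRA fine-tune stayed close to the acceptance rate against stock Gemma 4.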
Gemma 4 and Qwen3 were trained by different organizations on different data with different architectures. Their internal representations are 99.2% similar at matched depth. Neither model knew the other existed.
LarQL · Interpretability · CKA · Cross-Model · Mechanistic Interpretability · Universal Constants
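The tags name CKA, the usual tool for this kind of cross-model representation comparison. A self-contained sketch of linear CKA in the style of Kornblith et al., which is one plausible way a "99.2% similar at matched depth" number gets computed (the activation shapes here are illustrative):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    After column-centering:
        CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    Invariant to orthogonal transforms and isotropic scaling of either space,
    so it can compare layers from models with different hidden sizes.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

Feeding both models the same prompts and comparing layer k of one against layer k' at matched relative depth of the other is the standard recipe; the post's exact pairing scheme may differ.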
Two natively-trained 1-bit language models, from two different organizations, converge on the same anomaly: the four-stage circuit that organizes every fp16 transformer simply isn't there. Both models still answer correctly. The structure is gone, but the behavior survived.
LarQL · Interpretability · Quantization · BitNet · Bonsai · Mechanistic Interpretability
We ran a 200-item RAG arena on the AskTheDoctor corpus across three models and two retrieval configurations. The headline (v2-atd ≈ Llama 4 Scout, both at ~0.58) is interesting. The methodology footnote is more interesting: we then re-judged 415 of those answers with two different LLM judges and got Spearman ρ = 0.55 between them. That number is the case for human calibration.
RAG-Arena · ScoredQA · RAG Routing · EXIT · LLM-as-Judge · Spearman · Evaluation · QLoRA
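The ρ = 0.55 inter-judge figure is a rank correlation over paired per-item scores. For readers who want to reproduce that kind of number on their own judge outputs, here is a dependency-free Spearman sketch with tie-aware average ranks (the function and variable names are mine, not the post's harness):

```python
def _ranks(xs):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)
```

With two judges' scores over the same 415 answers as `a` and `b`, `spearman_rho(a, b)` is the statistic being reported; 0.55 means the judges only loosely agree on which answers are better.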
A single rank-1 weight edit suppresses one learned fact while leaving the rest of the model intact. No fine-tuning. No retraining. Just a feature subtracted from one layer's gate matrix — with a receipt.
LarQL · Interpretability · Knowledge Editing · Unlearning · Mechanistic Interpretability
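The post's exact edit isn't reproduced here, but "a feature subtracted from one layer's gate matrix" is, generically, a rank-1 direction ablation. A minimal sketch of that identity (names like `ablate_direction` are mine, and this stands in for the post's feature-finding step, which it does not describe):

```python
import numpy as np

def ablate_direction(W: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Rank-1 edit removing W's action along direction u.

    W' = W - (W u) u^T  for unit u, so W' @ u == 0 while W' agrees with W
    on everything orthogonal to u. The edit is a single outer product:
    no fine-tuning, no retraining.
    """
    u = u / np.linalg.norm(u)
    return W - np.outer(W @ u, u)
```

The "receipt" framing matches a checkable property: after the edit, the matrix provably sends the feature direction to zero and provably leaves the orthogonal complement untouched.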
I've run LarQL on 9 models from 5 organizations — from a 360M toy to OpenAI's 120B MoE. Three numbers hold within ±15% across all of them. One pattern vanishes the moment you go to 1-bit weights.
LarQL · Interpretability · Transformers · Machine Learning · Mechanistic Interpretability