Researchers have proposed Orthrus, a dual-view diffusion method designed to enable parallel token generation in large language models without modifying or retraining the base model. According to the paper summary circulating on r/MachineLearning and r/LocalLLaMA, the approach works with a frozen backbone and claims to produce output distributions provably identical to standard autoregressive generation.
On Qwen3-8B, the method reportedly generates up to 7.8 tokens per forward pass, versus one token per pass for standard autoregressive decoding, while remaining memory-efficient. The base model requires no fine-tuning and no changes to its quantization.
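To put the reported figure in perspective, here is a minimal back-of-the-envelope sketch of how many forward passes a fixed-length completion would need at a given average yield. The function name `forward_passes` and the 1,000-token completion length are illustrative assumptions, not taken from the paper. The 7.8 figure is a reported peak ("up to"), so the result is a best case, and it counts passes only: it says nothing about the cost of each pass or wall-clock latency.

```python
import math


def forward_passes(num_tokens: int, tokens_per_pass: float) -> int:
    """Forward passes needed to emit num_tokens at an average yield per pass.

    Hypothetical helper for illustration; not part of Orthrus or any library.
    """
    if num_tokens < 0 or tokens_per_pass <= 0:
        raise ValueError("num_tokens must be >= 0 and tokens_per_pass must be > 0")
    return math.ceil(num_tokens / tokens_per_pass)


completion_length = 1000  # assumed completion length, in tokens

# Standard autoregressive decoding: one token per forward pass.
baseline = forward_passes(completion_length, 1.0)

# Best case at the reported peak of 7.8 tokens per forward pass.
best_case = forward_passes(completion_length, 7.8)

print(f"baseline passes:  {baseline}")
print(f"best-case passes: {best_case}")
print(f"pass reduction:   {baseline / best_case:.2f}x")
```

Under these assumptions the pass count drops from 1,000 to 129. Real throughput gains depend on the average rather than peak yield and on how much extra compute each parallel pass consumes.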
The authors frame Orthrus as targeting inference throughput directly: the same model weights, higher token yield per compute step. Community discussion on both subreddits has been active, suggesting practitioners are evaluating it against existing speculative decoding and draft-model approaches.
For inference engineers, including those running quantized deployments, Orthrus as described offers a potential path to higher throughput without the cost of distillation, fine-tuning, or maintaining a separate draft model.