Vector Policy Optimization: Training diversity improves test-time search

Researchers demonstrated that training models to produce diverse policy outputs makes search procedures more effective at inference time. This challenges the conventional approach of optimizing for a single best-path prediction: diversity becomes a training objective in its own right, one that improves downstream search efficiency.

For LLM operators, this bears directly on the return from a decoding strategy. Inference cost scales with the amount of search performed (beam width, tree-search depth). If training-time diversity reduces the amount of search needed to find high-quality outputs, operators can either maintain quality at lower compute cost or improve quality within a fixed budget. The effect matters most for reasoning tasks, where multi-step search is already standard practice.
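The trade-off in the paragraph above can be made concrete with a back-of-the-envelope calculation. The sketch below is a toy model, not the researchers' analysis. It assumes each independent sample solves a task with probability `p`, and that only a fraction `distinct_fraction` of the drawn samples are effectively distinct attempts, with the rest duplicating earlier ones. Both parameter names and all numbers are hypothetical.

```python
import math


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def samples_needed(p: float, target: float, distinct_fraction: float = 1.0) -> int:
    """Samples to draw so that the chance of at least one success reaches `target`.

    Toy assumption: only `distinct_fraction` of the drawn samples behave as
    independent attempts; the remainder are near-duplicates that add cost
    but no new chances. Requires 0 < p < 1, 0 < target < 1,
    and 0 < distinct_fraction <= 1.
    """
    independent_attempts = math.log(1.0 - target) / math.log(1.0 - p)
    return math.ceil(independent_attempts / distinct_fraction)


if __name__ == "__main__":
    # Hypothetical numbers: 20% per-sample success rate, 90% target.
    for fraction in (1.0, 0.9, 0.5):
        k = samples_needed(p=0.2, target=0.9, distinct_fraction=fraction)
        print(f"distinct_fraction={fraction}: draw {k} samples")
```

Under these assumptions, a model whose samples are half duplicates needs roughly twice the sampling budget of a fully diverse one to reach the same success rate. That is the sense in which diversity can be traded for inference compute.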

Builders optimizing for cost should reconsider their training objectives. Rather than relying purely on supervised loss functions, they can incorporate output diversity during training, potentially through ensemble methods or explicit penalties on low output variance, as a tractable way to reduce inference-time compute. This shifts the cost-benefit calculus between longer training runs and lower per-inference overhead.
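One generic way to add a diversity term to a supervised loss is an entropy bonus on the model's output distribution. The sketch below illustrates that general idea only and is not the method from the research summarized here. The function names and `diversity_weight` are hypothetical, and the example uses plain Python lists instead of a training framework.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def diversity_regularized_loss(
    logits: List[float], target_index: int, diversity_weight: float = 0.1
) -> float:
    """Cross-entropy on the target minus a weighted entropy bonus.

    With diversity_weight = 0 this is ordinary cross-entropy. A positive
    weight lowers the loss for distributions that keep probability mass
    on alternatives, which discourages collapse onto a single output.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target_index])
    return cross_entropy - diversity_weight * entropy(probs)


if __name__ == "__main__":
    logits = [2.0, 0.0, 0.0]  # hypothetical scores for three candidate outputs
    print(diversity_regularized_loss(logits, target_index=0, diversity_weight=0.0))
    print(diversity_regularized_loss(logits, target_index=0, diversity_weight=0.1))
```

The weight sets the trade-off: too small and the term has no effect, too large and the model is pushed toward uniform outputs at the expense of accuracy. In practice it would need tuning against the downstream search budget.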