Small Model Trained on Its Own Mistakes Reaches 80% on HumanEval, Beats GPT-3.5 on Math

VOKRIX INTELLIGENCE

WHY IT MATTERS

A reported experiment on r/LocalLLaMA describes training a small language model iteratively on its own errors, achieving 80% on HumanEval and outperforming GPT-3.5 on math benchmarks. No paper or external verification link is provided. The methodology aligns with self-improvement and rejection sampling fine-tuning techniques.

A Reddit user on r/LocalLLaMA reports training a small language model iteratively on its own errors, claiming the resulting model scores 80% on HumanEval and outperforms GPT-3.5 on math benchmarks. No paper, weights, or external verification accompany the post.

The described methodology maps to established techniques, namely rejection sampling fine-tuning and self-improvement loops, in which a model generates outputs, the incorrect ones are filtered out or corrected, and the model is retrained on the resulting examples. The approach is computationally inexpensive relative to pretraining and has precedent in published research, though the specific implementation details here remain unverified.
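The generate, filter, retrain loop can be illustrated with a minimal sketch of the filtering step. This is not the Reddit author's code, which has not been published; it shows plain rejection sampling under the assumption that each problem ships with unit tests. Every name here (`passes_tests`, `collect_accepted`, `toy_sample`) is hypothetical, and `toy_sample` is a hard-coded stand-in for a real model.

```python
from typing import Callable, Dict, List, Tuple


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and its tests raise no error."""
    namespace: Dict[str, object] = {}
    try:
        # NOTE: bare exec is for illustration only; untrusted model output
        # should be executed in a sandbox.
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
    except Exception:
        return False
    return True


def collect_accepted(
    problems: List[Tuple[str, str]],
    sample: Callable[[str, int], List[str]],
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Sample up to k candidates per prompt and keep only those that pass."""
    accepted = []
    for prompt, test_src in problems:
        for candidate in sample(prompt, k):
            if passes_tests(candidate, test_src):
                accepted.append((prompt, candidate))
    return accepted


def toy_sample(prompt: str, k: int) -> List[str]:
    """Hypothetical stand-in for a model: one wrong answer, one right one."""
    return [
        "def add(a, b):\n    return a - b",
        "def add(a, b):\n    return a + b",
    ][:k]


problems = [("Write add(a, b).", "assert add(2, 3) == 5")]
accepted = collect_accepted(problems, toy_sample)
print(len(accepted), "accepted example(s)")
```

In a full loop, the accepted prompt and solution pairs would become the next round's fine-tuning data, and the cycle would repeat with the updated model.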

The 80% HumanEval figure is a meaningful marker: OpenAI's GPT-4 technical report lists GPT-4 at 67% on the standard pass@1 evaluation and GPT-3.5 at roughly 48%. If the claimed figures hold under controlled conditions, the result would place a small, locally run model above several commercial baselines on coding tasks.
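For context on what a pass@1 figure measures, HumanEval results are commonly computed with the unbiased pass@k estimator introduced alongside the benchmark: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below implements that formula; the post does not say how its 80% was computed, and the `benchmark_score` helper and the sample counts are illustrative assumptions.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: List[Tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts only: three problems, ten samples each.
results = [(10, 5), (10, 10), (10, 0)]
score = benchmark_score(results, k=1)
print(f"pass@1 = {score:.2f}")
```

For k = 1 the estimator reduces to the fraction of samples that pass, so differences in sampling temperature or number of samples can shift a reported score, which is one reason unverified figures are hard to compare against published baselines.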

No model size, base architecture, dataset composition, or training compute is disclosed in the original post. The claims are unverified and sourced from a single community report.

Builders exploring low-cost fine-tuning for coding or math tasks should monitor for replication attempts or a follow-up paper before drawing conclusions about reproducibility.

SOURCE