OpenAI says SWE-bench Verified coding benchmark is contaminated and should be retired
OpenAI has declared that the SWE-bench Verified programming benchmark has lost its value, finding that at least 59.4% of its tasks are flawed and reject correct solutions. Many tasks and solutions have also leaked into the training data for GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview, meaning scores increasingly reflect memorization rather than coding ability.
OpenAI recommends SWE-bench Pro as a replacement and is building its own non-public evaluations. The announcement also has a strategic angle: a contaminated benchmark could make open-source rivals look artificially competitive, particularly ahead of DeepSeek's anticipated V4 release.
View full digest for February 24, 2026