Weibo's VibeThinker-3B reportedly matches DeepSeek V3.2 on AIME 2026 at roughly 1/200th the size

A nine-researcher team at Sina Weibo posted a 14-page arXiv paper claiming that their 3B-parameter VibeThinker model scored 94.3 on AIME 2026, matching the 671B-parameter DeepSeek V3.2 and beating Gemini 3 Pro's 91.7. With test-time scaling via Claim-Level Reliability Assessment, the reported score climbs to 97.1. The model picked up 685 GitHub stars, 130 Hugging Face likes, and 62 paper upvotes within hours, but reactions were sharply divided. Skeptics argue that the benchmark gains may reflect training-data leakage or aggressive overfitting to math competitions rather than genuine reasoning, a dispute that has reignited the small-model-versus-large-model debate.

View full digest for June 17, 2026