AIRS-Bench: AI research agents exceed human performance on only 4 of 20 ML tasks
Meta researchers released AIRS-Bench, a suite of 20 tasks drawn from recent ML papers spanning language modeling, bioinformatics, mathematics, and time-series forecasting. The benchmark tests the full research lifecycle (idea generation, experiment analysis, and iterative refinement) without providing baseline code.
Agents exceeded the human state of the art on just 4 tasks and failed to match it on the remaining 16, with an average normalized score of 23.4%. Only 1.55% of agent-task combinations beat SOTA.
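The headline figures are normalized against human performance, though the digest does not spell out the exact formula. A common convention is to map a reference baseline to 0 and human SOTA to 1, so that 100% means matching the human result; the sketch below assumes that form (the function name and parameter values are hypothetical, not from AIRS-Bench):

```python
# Hypothetical sketch of SOTA-relative score normalization. AIRS-Bench's
# actual formula is not given in the digest; this assumes the common
# convention of mapping a task baseline to 0.0 and human SOTA to 1.0.

def normalized_score(agent: float, baseline: float, sota: float) -> float:
    """Scale a raw task score so baseline -> 0.0 and human SOTA -> 1.0."""
    if sota == baseline:
        raise ValueError("SOTA and baseline must differ")
    return (agent - baseline) / (sota - baseline)

# Illustrative values only: an agent scoring 0.62 on a task where a trivial
# baseline scores 0.40 and human SOTA is 0.80 lands at 55% of SOTA.
print(normalized_score(0.62, 0.40, 0.80))  # 0.55
```

Under a normalization like this, an average of 23.4% would mean agents typically recover under a quarter of the gap between a simple baseline and the best human result.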