Benchmarks miss the thing that actually matters for agentic use: how does behavior change over a multi-day horizon? A model that scores well on one-shot coding tasks can still make terrible decisions when it has persistent state and resource constraints. That's where you see the real gaps between models.
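
A minimal sketch of the kind of loop where those gaps show up. call_model and apply_action are stand-in stubs, not a real API; the point is only the shape of a long-horizon episode with persistent state and a budget:

    # Hypothetical long-horizon episode loop. Stubs, not a real agent API.
    import random

    def call_model(state):
        return {"kind": "work"}               # stub: a real agent decides here

    def apply_action(action, state):
        state["notes"].append(action)         # persistent state accumulates
        return random.randint(1, 5)           # stub cost in budget units

    state = {"notes": [], "budget": 1000}     # tokens, dollars, API calls...
    for step in range(10_000):                # multi-day horizon, not one shot
        action = call_model(state)
        state["budget"] -= apply_action(action, state)
        if state["budget"] <= 0:              # bad early decisions compound here
            break

One-shot benchmarks never exercise that outer loop, which is exactly where the compounding mistakes live.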



Is there a benchmark for these long tasks? That kind of seems like the only number worth measuring.

(Of course at that point it involves memory and context management and so on, so you're testing the harness as well as the model.)
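
To make that harness-coupling concrete, here's a sketch with a deliberately naive context-management policy. Everything here (keep_last_n, the model and scoring lambdas) is a made-up stand-in, not anyone's real eval code:

    # Hypothetical harness: the benchmark score measures this memory
    # policy as much as the model it wraps.
    def keep_last_n(history, n=50):
        """Naive context management: keep only the last n turns."""
        return history[-n:]

    def run_task(model, score, steps=200):
        history = []
        for _ in range(steps):
            context = keep_last_n(history)    # swap this policy, the number moves
            history.append(model(context))    # stub model interface
        return score(history)

    # Stub model and scorer, just so the sketch executes:
    result = run_task(lambda ctx: len(ctx), lambda h: sum(h) / len(h))

Change keep_last_n to a summarizer or a vector store and the "model's" score changes without touching the model, which is the measurement problem in a nutshell.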



