Arguably I would think that the last year was mainly inner harness improvement i...

SatvikBeri · 2026-02-12T17:32:13 1770917533

We can measure this by looking at the same harness applied to different models, e.g. the very plain Terminus: https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

Models have improved dramatically even with the harness held constant.
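
A rough sketch of that comparison, for illustration only: hold the harness fixed and look at how much scores vary across models. The row layout, the helper names `scores_by_harness` and `model_spread`, and the numbers are all made-up placeholders, not real leaderboard data or any tbench API.

```python
from collections import defaultdict

def scores_by_harness(rows):
    """Group (harness, model, score) rows into {harness: {model: score}}."""
    grouped = defaultdict(dict)
    for harness, model, score in rows:
        grouped[harness][model] = score
    return dict(grouped)

def model_spread(rows, harness):
    """Max minus min score across models for one fixed harness."""
    scores = scores_by_harness(rows)[harness]
    return max(scores.values()) - min(scores.values())

# Placeholder rows (percent solved). These are NOT real results.
rows = [
    ("terminus", "model-a", 30),
    ("terminus", "model-b", 50),
    ("other-harness", "model-a", 40),
]

# With the harness fixed, any spread comes from the model alone.
print(model_spread(rows, "terminus"))
```

A large spread within one harness is what would point to model improvement rather than harness improvement.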

jwpapi · 2026-02-16T16:12:43 1771258363

I mean that the way the model tackles a task at its core is generated differently, like an inner harness, set through the system prompt or something deeper. E.g., instead of answering instantly, it goes through predefined steps: decide which strategy to use, split the task, use thinking tokens, use tools, etc.
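
To make the idea concrete, here is a toy sketch of such an "inner harness" loop. Everything in it is hypothetical (the `Trace` class, the strategy rule, and the tool lookup are invented for illustration). It is not how any real model or agent is implemented, only the shape of the steps listed above.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trace:
    """Records which steps the loop went through, in order."""
    steps: list[str] = field(default_factory=list)

def choose_strategy(task: str) -> str:
    # Hypothetical rule: tasks joined by " and " get decomposed.
    return "decompose" if " and " in task else "direct"

def split_task(task: str) -> list[str]:
    return [part.strip() for part in task.split(" and ")]

def run_inner_harness(task: str, tools: dict[str, Callable[[str], str]],
                      trace: Trace) -> list[str]:
    # Step 1: pick a strategy instead of answering instantly.
    strategy = choose_strategy(task)
    trace.steps.append(f"strategy:{strategy}")
    # Step 2: split the task if the strategy calls for it.
    subtasks = split_task(task) if strategy == "decompose" else [task]
    results = []
    for sub in subtasks:
        # Step 3: stand-in for spending thinking tokens on the subtask.
        trace.steps.append(f"think:{sub}")
        # Step 4: use a tool if one matches the subtask's first word.
        tool_name = sub.split()[0]
        tool = tools.get(tool_name)
        if tool is not None:
            trace.steps.append(f"tool:{tool_name}")
            results.append(tool(sub))
        else:
            results.append(f"answer({sub})")
    return results

tools = {"list": lambda sub: "files"}
trace = Trace()
out = run_inner_harness("list files and summarize them", tools, trace)
print(out)
print(trace.steps)
```

The point is only that the strategy, split, think, and tool steps happen before any answer is produced, whether that behavior comes from the system prompt or from training.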