Hi, author here. I mean the piece of code that calls the model and executes the tool calls. My colleague Philip calls it “9 lines of code”: https://sketch.dev/blog/agent-loop
We have built two of them now, and clearly the state of the art here can be improved. But it is hard to push too much on this while the models keep improving.
the harness being "9 lines of code" is deceptive in the same way a web server is "just accept connections and serve files."
the hard part isn't the loop itself — it's everything around failure recovery.
when a browser agent misclicks, loads a page that renders differently than expected, or hits a CAPTCHA mid-flow, the 9-line loop just retries blindly. the real harness innovation is going to be in structured state checkpointing so the agent can backtrack to the last known-good state instead of restarting the whole task. that's where the gap between "works in a demo" and "works on the 50th run" lives.
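To make the checkpointing idea concrete, here is a rough Python sketch of a loop that snapshots state after every verified step and rolls back to that snapshot on failure. Everything in it is an assumption for illustration, not anything from sketch.dev: `Checkpoint`, `run_with_checkpoints`, `steps`, `verify`, and `max_retries` are hypothetical names. It also assumes the state is a plain dict that can be deep-copied, which a real browser session (URL, cookies, half-filled forms) is not; restoring that is the hard part this sketch skips.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Checkpoint:
    """Hypothetical snapshot of the last known-good point in a task."""
    step: int               # index of the next step to run after restoring
    state: dict[str, Any]   # deep copy of the agent state at that point


def run_with_checkpoints(
    steps: list[Callable[[dict[str, Any]], None]],
    verify: Callable[[dict[str, Any]], bool],
    state: dict[str, Any],
    max_retries: int = 2,
) -> dict[str, Any]:
    """Run steps in order; on failure, roll back to the last verified state.

    Assumption: each step mutates `state` in place, and `verify` returns
    True when the state still looks sane after a step.
    """
    checkpoint = Checkpoint(0, copy.deepcopy(state))
    retries = 0
    i = 0
    while i < len(steps):
        try:
            steps[i](state)
            ok = verify(state)
        except Exception:
            ok = False  # treat any exception as a failed step

        if ok:
            # Step verified: advance and record a new known-good state.
            i += 1
            checkpoint = Checkpoint(i, copy.deepcopy(state))
            retries = 0
        else:
            retries += 1
            if retries > max_retries:
                raise RuntimeError(f"step {i} failed after {max_retries} retries")
            # Backtrack: discard the damaged state, resume from the checkpoint.
            state = copy.deepcopy(checkpoint.state)
            i = checkpoint.step
    return state
```

The point of the design is that a failed step never forces a restart from step 0, and whatever the failed attempt did to the state is thrown away before the retry. A blind retry loop would keep going on top of the damaged state.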