March 30, 2026

Honest Agents

Daniel Kahneman didn’t write a book about LLMs. Thinking, Fast and Slow is about human judgment. Still, the distinction between fast, associative processing and slower, effortful judgment—what the book frames as System 1– and System 2–style reasoning—turned out to be the right mental model for how we think about and operate agent stacks.


Once we stopped treating a model as a single homunculus and started engineering for two modes of operation—fluent pattern completion on one side, inspectable delegation and provenance on the other—our own scorecard moved. This post is about that shift: what we measured, what we shipped the week it clicked, and why we’re comfortable calling it honest without pretending neuroscience wrote our config files.

What we’re claiming (and what we’re not)

We are not saying Kahneman predicted tokenizer math or that Thinking, Fast and Slow is an ML paper. We are saying: the same conceptual move the book maps for humans—when to let the fast system run, when to force the slow one—maps cleanly onto how we design hooks, harnesses, and review loops around agents.

The measurable part is operational: Delegation Duty on our internal Fear and Trust rubric. In plain language:

  • Fear and Trust is our scorecard for how much we trust an agent setup in production—not vibe, but structured criteria we track over time.
  • Delegation Duty is one axis on that card: roughly, “are we delegating to the model with explicit duty boundaries, provenance, and checks—or without them?”
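As an illustrative sketch only (the field names here are hypothetical, not our actual rubric schema), a delegation-duty axis can be as mechanical as counting which guardrails a given delegation carries:

```python
from dataclasses import dataclass, fields

@dataclass
class DelegationDutyCheck:
    # Hypothetical criteria -- the real Fear and Trust rubric is internal.
    explicit_scope: bool = False       # the handed-off task has stated boundaries
    provenance_recorded: bool = False  # the session/artifact trail is captured
    output_reviewed: bool = False      # a slow-path check runs before acceptance
    rollback_possible: bool = False    # the delegation can be rewound
    escalation_path: bool = False      # a named human owns failures
    duty_logged: bool = False          # the score itself is persisted

    def score(self) -> int:
        """Count satisfied criteria, yielding an N/6-style score."""
        return sum(getattr(self, f.name) for f in fields(self))

check = DelegationDutyCheck(explicit_scope=True, provenance_recorded=True,
                            output_reviewed=True, rollback_possible=True,
                            escalation_path=True)
print(f"{check.score()}/6")  # -> 5/6
```

The point of a structure like this is that "0/6 to 5/6" becomes an assertion about which guardrails exist, not a mood.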

We moved Delegation Duty from 0/6 to 5/6 or higher. That is not a flex about “more tokens.” It tracks a shift from guessing and pattern completion toward treating the stack as if it had something like System 1 and System 2—fast fluency plus a slow, accountable path when stakes matter.

If you only take one line: the gain is behavioral and architectural, grounded in our rubric, explained with Kahneman’s vocabulary.

The week it landed

We don’t have to guess when the engineering caught up to the idea. Our aitools history and local Claude Code hooks line up on the same calendar week in March 2026:

  • 24 March — Delegation-duty guard and related harness work ship in the repo; session and harvest hooks on disk show the same window.
  • 25 March — Provenance becomes the sixth harness component in our spec; harness-db session hooks and delegation-duty-guard land under ~/.claude/hooks (local disk times cluster on 25 March afternoon). Harvested artifacts from that day tie delegation to a Datadog region investigation—operational learning, not slideware.

So “about a week ago” was right in spirit: the spike is late March, anchored in commits and artifacts, not in a vibes-based timeline.

What we built in that window

Across hundreds of sessions, multiple machines, platforms, repos, and agents, we converged on a concrete shape:

  • A self-learning, provenance-aware harness—not a slogan, but a spec: what gets recorded, how rewinds work, how session truth stays attached to artifacts.
  • A CLI utility and the intelligence system around it: the glue between hooks, databases, and how we query what happened.
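To make "session truth stays attached to artifacts" concrete, here is a minimal sketch of a provenance record—names and fields are hypothetical, not our harness spec. Each harvested artifact carries the session that produced it plus a content hash, so a later query can rewind from artifact back to session and verify nothing drifted:

```python
import hashlib
import json
import time

def provenance_record(session_id: str, artifact: bytes, tool: str) -> dict:
    """Attach session truth to an artifact: who produced it, when,
    and a content hash for later verification on rewind."""
    return {
        "session_id": session_id,
        "tool": tool,
        "sha256": hashlib.sha256(artifact).hexdigest(),
        "recorded_at": time.time(),
    }

rec = provenance_record("sess-0042", b"datadog region notes", tool="harvest-hook")
line = json.dumps(rec)  # one append-only log line per artifact
```

An append-only log of records like this is enough to answer "did we delegate with guardrails or on a wish" after the fact.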

We already maintain a full-text knowledge index (~/.aitools/knowledge.db, built by build-knowledge-db.py) for searching plans, harvests, and transcripts—FTS5 search across the corpus. That’s separate from the harness databases that hold duty and session lifecycle. Two stores, two jobs: search the pile vs run the harness. Both matter; neither replaces the other.
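We won't reproduce build-knowledge-db.py here, but the search side is plain SQLite FTS5. A toy version—table and column names are illustrative, not the real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the real index lives at ~/.aitools/knowledge.db
con.execute("CREATE VIRTUAL TABLE docs USING fts5(path, kind, body)")
con.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        ("plans/datadog.md", "plan", "investigate datadog region latency"),
        ("harvests/2026-03-25.md", "harvest",
         "delegation tied to datadog region investigation"),
    ],
)
# A bare multi-word FTS5 query means implicit AND; bm25() is FTS5's
# built-in ranking (lower score = better match).
rows = con.execute(
    "SELECT path FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("datadog region",),
).fetchall()
```

Nothing about this table knows or cares about duty scores or session lifecycle—which is exactly the separation the next paragraph argues for.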

If you’re building something similar, expect to need both a corpus-search story and a duty-and-provenance story. Conflating the two in a single store is how teams get lost.

Why “honest”

Honest here means: we don’t claim the model “understands” duty. We claim we built surfaces—hooks, scores, provenance—so that when we delegate, we can see whether we did it with guardrails or on a wish.

Kahneman’s gift to this post is vocabulary, not validation. The validation is the scorecard and the shipping history.

The bottom line

  • We used System 1 / System 2 as a lens for agent design, not as a lab result about brains.
  • We measured a large move on Delegation Duty (0/6 → 5/6+) on our Fear and Trust rubric—the kind of shift you’d expect when you stop treating agents as opaque oracles and start engineering delegation.
  • The work is dated and shippable: late March 2026, in aitools, in hooks, in harvest artifacts—the same week provenance joined the harness definition.

If you’re still running agents like a single “it,” try splitting the story: what’s allowed to be fast, and what must leave a trail. Then measure something. We did—and the numbers moved.

Jose is the founder of nobul.tech, where we work on experiments, ideas, tools, and thoughts that don’t need a pitch deck.