What Breaks When Nodes Have Judgment

The engineering problems nobody warned you about when building multi-agent AI systems.

The title of the first post was a celebration. This one is the reckoning.

"The nodes have judgment now" is exciting until you start building with it. Then you discover that judgment — the thing that makes AI agents powerful — is also the thing that breaks every assumption you've spent your career building on.

Some of those assumptions have held for thirty years. They're gone now.

The Analogies That Still Work

The good news first. Some distributed systems patterns map cleanly to the multi-agent world, which means some of your hard-won intuition is still valid.

Load balancing becomes semantic routing. Traditional load balancing works because servers are fungible. Round-robin, least-connections, weighted routing all assume the nodes are identical. Agent task routing is harder because agents aren't fungible — a coding agent and a research agent have fundamentally different capabilities. The "load balancer" in a multi-agent system is often itself an LLM: a meta-agent that reads the task and routes to the right specialist. This is L7 application-aware routing, except the "L7" is natural language understanding.
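A minimal sketch of what that meta-agent router looks like. Here `classify` is a keyword stub standing in for the LLM call that reads the task; the specialist names and keywords are invented for illustration.

```python
# Semantic routing sketch: dispatch by task meaning, not by node load.
# The registry maps task categories to (hypothetical) specialist agents.
SPECIALISTS = {
    "code": "coding-agent",
    "research": "research-agent",
    "general": "generalist-agent",
}

def classify(task: str) -> str:
    """Stand-in for the LLM meta-agent's intent classification.
    A real router would prompt a model; this keyword stub just
    makes the example runnable."""
    lowered = task.lower()
    if any(w in lowered for w in ("implement", "refactor", "bug")):
        return "code"
    if any(w in lowered for w in ("survey", "compare", "sources")):
        return "research"
    return "general"

def route(task: str) -> str:
    """Route a task to a specialist. Unlike round-robin, the answer
    depends entirely on what the request means."""
    return SPECIALISTS[classify(task)]
```

Note that this router has no notion of "least connections" at all: the routing key is extracted from the request's semantics, which is exactly what makes the nodes non-fungible.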

Service mesh becomes agent communication protocol. Istio and Linkerd extracted cross-cutting concerns — TLS, retries, circuit breaking, observability — into a sidecar proxy so services could focus on business logic. MCP (Model Context Protocol) does something analogous for the agent-to-tool layer: standardizing how agents connect to data sources and external capabilities so every agent doesn't need custom glue code for every integration. It's less like a traffic proxy and more like what REST did for web APIs — a standard interface contract, written once, usable everywhere.

Container orchestration becomes agent orchestration. We're in the Docker Swarm / early Kubernetes era: multiple competing frameworks, no clear winner, probably 2-3 years from a dominant standard. That gets its own post.

Where the Analogy Breaks Down

Here's where the infrastructure veteran in me has to pump the brakes.

Idempotency is dead. A microservice does exactly what its code says. An AI agent interprets, exercises discretion, and may decide the request is actually asking for something other than what was specified. Two identical requests can produce different responses. You can't replay an agent interaction and expect the same result. You can't cache an agent response the way you cache a database query. Retry logic that assumes the same input produces the same output will lie to you.
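The caching failure is easy to demonstrate. In this sketch, `agent_call` is a stand-in for an agent invocation, with a counter modeling sampling nondeterminism; wrapping it in a standard input-keyed cache silently freezes one sample of a distribution and serves it forever.

```python
import functools
import itertools

# Each invocation of the (stand-in) agent may interpret the same
# prompt differently; the counter models that nondeterminism.
_counter = itertools.count()

def agent_call(prompt: str) -> str:
    """Stand-in for a real agent invocation: same prompt,
    potentially different response every time."""
    return f"{prompt}::interpretation-{next(_counter)}"

# An input-keyed cache assumes prompt -> response is a pure function.
# For agents that assumption is false: the first response observed
# becomes the only response ever returned, masking the variance.
@functools.lru_cache(maxsize=None)
def cached_agent_call(prompt: str) -> str:
    return agent_call(prompt)
```

The same trap applies to retries: a retry layer that re-sends the request after a timeout may get a semantically different answer back, and nothing in the transport layer will flag it.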


Failure is semantic, not operational. A microservice fails with a timeout or an error code. You can observe it. Alert on it. Page someone at 3am about it. An AI agent can fail by producing a confident, well-formatted, completely wrong answer. There's no error code. No timeout. No failed health check. Just bad reasoning, potentially invisible to every monitoring system you have.

Detecting semantic failure requires a different kind of health check — not "is the service responding?" but "is the reasoning sound?" This is why multi-agent critique loops, where one agent checks another's work, are emerging as a critical architectural pattern.

Peer review as infrastructure.

Communication is lossy in a new way. In traditional systems, messages arrive intact or not at all. TCP guarantees delivery. In multi-agent systems, communication is natural language — messages can be partially understood, misinterpreted, or selectively attended to. An agent might receive a 2,000-word briefing from another agent and focus on the wrong paragraph. This failure mode has no analog in service-to-service communication. It's closer to organizational dysfunction in human teams. The fix isn't a retry. It's a reframe.
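One mitigation is to stop handing agents walls of prose and constrain the handoff to a small schema, so the critical fields can't be buried in paragraph fourteen of a briefing. This is a hedged sketch; the `Handoff` type and its field names are invented for illustration, not from any framework.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """A structured inter-agent message: the objective and hard
    constraints are explicit fields, so the receiving agent cannot
    'selectively attend' its way past them."""
    objective: str                       # the one thing the receiver must do
    constraints: list[str] = field(default_factory=list)  # hard requirements
    context: str = ""                    # background prose; safe to skim

    def validate(self) -> None:
        """Reject malformed handoffs at the boundary instead of
        hoping the receiver infers the objective from prose."""
        if not self.objective.strip():
            raise ValueError("handoff missing an objective")
```

This doesn't make natural language lossless, but it moves the failure from "silently focused on the wrong paragraph" to "rejected at the boundary," which at least is observable.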

Coordination overhead is measured in dollars, not milliseconds. Fred Brooks observed in The Mythical Man-Month that communication overhead grows quadratically with team size. For agent teams, the cost is tokens and dollars, not meeting hours. This creates pressure toward what I call minimum viable agency: use the fewest agents with the least inter-agent communication that can accomplish the task.

This is the opposite instinct from microservices, where the impulse was always to decompose further. When someone pitches you a 47-agent pipeline for a task a single agent could handle in three steps, they've built the distributed monolith. They just don't know it yet — but they will, when they spend the next year debugging it.
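The back-of-envelope math makes the pressure concrete. Brooks's observation gives n(n-1)/2 communication channels among n agents; the per-exchange token count and price below are illustrative assumptions, not benchmarks.

```python
def coordination_cost(n_agents: int,
                      tokens_per_exchange: int = 2_000,
                      usd_per_million_tokens: float = 10.0) -> float:
    """Dollar cost of one round of all-pairs coordination, assuming
    every agent pair exchanges one message. Pricing is illustrative."""
    channels = n_agents * (n_agents - 1) // 2   # Brooks: n(n-1)/2
    tokens = channels * tokens_per_exchange
    return tokens / 1_000_000 * usd_per_million_tokens

# 3 agents  ->  3 channels:    $0.06 per round (under these assumptions)
# 47 agents -> 1081 channels: $21.62 per round, ~360x the cost
```

Under these assumptions, the 47-agent pipeline pays roughly 360 times more per coordination round than the 3-agent one, before it has produced any output at all.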

The optimal agent granularity is coarser than the optimal service granularity.

The Gap Nobody Is Talking About

Building reliable distributed systems took two decades of painful lessons. We're at year one of multi-agent systems. Budget accordingly.

The hardest problem isn't routing or orchestration or even semantic failure detection. It's the one that underpins all of them.

You can't fix what you can't see.

Multi-agent AI is at the "we have logs" stage of observability. Understanding why a system of five agents produced a particular output — which agent contributed what, where reasoning diverged, which step introduced the error — is the most critical unsolved infrastructure problem of 2026.

That gets its own post. Because it deserves it.

Coming soon: the observability crisis in multi-agent AI, and why whoever builds the Datadog of agents will build a generational company.