Agent Orchestration with Feedback Loops: What Builders Are Learning the Hard Way
Multi-agent systems amplify errors 17x without proper feedback. Here's what Reddit builders, Spotify Engineering, and Anthropic research reveal about making orchestrated agents actually work in production.
Jordan Reeves
Developer Experience Lead
Gartner recorded a 1,445% surge in enterprise multi-agent inquiries between Q1 2024 and Q2 2026. Almost every serious AI team is building with orchestrated agents now. And almost every serious AI team is hitting the same wall: agents that work in demos fail in production—not because the models are bad, but because the coordination layer has no feedback.
This post draws from Spotify Engineering's public post-mortem on background coding agents, Anthropic's multi-agent research system paper, Google ADK's eight production patterns, and first-hand developer accounts from r/AI_Agents and r/LocalLLaMA. The pattern that comes through in every source is the same: the feedback loop is the product, not the agent.
Why Multi-Agent Systems Fail Without Feedback
Adding more agents to a system without coordination infrastructure doesn't improve outcomes—it makes them worse. Research published in Towards Data Science quantified this: uncoordinated "bag of agents" systems amplify errors at approximately 17× the rate of individual agents. Each agent's mistakes propagate and compound rather than cancel out.
The math is brutal. A 10-step agentic process with 99% per-step success has only 90.4% end-to-end success. Production systems need 99.9%+. The gap between those numbers is the engineering problem—and it cannot be solved by better prompting alone.
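The compounding is easy to verify directly. A short sketch in plain Python, assuming independent per-step failures:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    # Independent per-step reliabilities multiply.
    return per_step ** steps

# 10 steps at 99% each: only ~90.4% end to end.
print(round(end_to_end_success(0.99, 10) * 100, 1))  # 90.4

# To clear 99.9% across 10 steps, each step needs ~99.99% reliability.
print(round(0.999 ** (1 / 10), 5))
```

That last line is the uncomfortable part: the per-step bar for a production pipeline is far higher than any prompt tweak can deliver, which is why the rest of this post is about verification layers rather than prompting.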
From the community, one developer's observation became widely cited across multiple threads:
"The framework handles the happy path. The sad path is always bespoke." — Developer on r/AI_Agents
This isn't a complaint about frameworks. It's a fundamental architectural reality: every agent orchestration system needs custom failure handling, retry logic, circuit breakers, and recovery from partial failure. No framework provides this for you.
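As a concrete illustration of what that bespoke sad path tends to look like, here is a minimal circuit-breaker sketch. The thresholds, class names, and escalation message are illustrative, not taken from any particular framework:

```python
import time

class CircuitOpen(Exception):
    """Raised when the breaker has tripped and calls are refused."""

class CircuitBreaker:
    # Sketch: trip after `max_failures` consecutive errors, then refuse
    # calls until `cooldown` seconds pass; a single probe is then allowed.
    def __init__(self, max_failures: int = 3, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpen("breaker open; escalate to a human")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the breaker fully
        return result
```

In an orchestrator, the `CircuitOpen` branch is where the run stops retrying and escalates, which is exactly the part no framework writes for you.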
The Hub-and-Spoke Pattern: What 80% of Production Systems Use
After testing every major orchestration approach, the field has converged on one pattern that reliably scales: a central orchestrator manages 2–4 specialist workers who report back rather than communicating directly with each other.
Google ADK's documentation describes eight production orchestration patterns. The Coordinator/Dispatcher is the most common, but real-world systems almost always combine it with at least one more—typically a Generator/Critic loop for quality control. Here is how Anthropic's internal research system implements this:
- A lead orchestrator decomposes queries and spawns 3–5 subagents in parallel
- Each subagent handles a specific research slice and returns structured results
- The orchestrator synthesizes findings and runs a critic pass before surfacing output
- Result: 90.2% performance improvement over single-agent on internal evals and 40% task completion time reduction
Cursor's architecture names the three roles explicitly: Planners explore the codebase and decompose tasks, Workers execute independently, and Judge agents evaluate whether each cycle produced acceptable output before the next begins.
The key insight is that the orchestrator's job is coordination and quality control—not doing the work. It keeps the feedback loop closed.
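A minimal hub-and-spoke skeleton might look like the following. The worker and critic callables are stand-ins for real agent calls, and the synthesis step is deliberately naive:

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(task, workers, critic, max_rounds=3):
    """Fan a task out to specialist workers, then gate on a critic pass.

    `workers` is a list of callables (stand-ins for subagent calls);
    `critic` returns (accepted, feedback). Illustrative sketch only.
    """
    for _ in range(max_rounds):
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda w: w(task), workers))
        draft = "\n".join(results)          # naive synthesis step
        accepted, feedback = critic(draft)  # generator/critic loop
        if accepted:
            return draft
        task = f"{task}\n[critic feedback] {feedback}"
    raise RuntimeError("escalate: critic rejected all rounds")
```

Note that workers never see each other's output; everything flows back through the orchestrator, which is what keeps the loop closed.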
The Agentic Feedback Stack
Spotify Engineering's post on background coding agents (Honk project) identified three escalating failure modes when agents run without verification:
- PR generation failures — minor, acceptable for manual intervention
- CI failures — wastes engineer time reviewing incomplete work
- Functionally incorrect PRs that pass CI — the most dangerous; erodes trust and risks production incidents
Their solution was a layered feedback stack: independent, auto-activating verifiers that trigger based on codebase contents (a Maven verifier activates on pom.xml detection), parse error outputs via regex to extract relevant messages, and gate agent progress without adding cognitive load to the agent's context window.
The key design principle from Spotify's write-up:
"The agent doesn't know what the verification does and how, it just knows that it can (and in certain cases must) call it to verify its changes." — Spotify Engineering
Their LLM Judge layer vetoes approximately 25% of agent sessions—meaning 1 in 4 agent runs would have shipped broken or out-of-scope code without the feedback layer. The veto triggers mostly on scope creep: agents refactoring code they weren't asked to touch, disabling tests to make CI pass, or making changes beyond the stated task boundary.
The four-layer stack that emerges from production experience:
- Layer 1 — Orchestrator guardrails: iteration caps, state externalisation, hard circuit breakers, handoff schemas, confidence thresholds
- Layer 2 — Agent self-verification: critic agents, TDD pass/fail, evidence bundles, tool output inspection
- Layer 3 — CI/automated verification: unit tests, integration tests, lint, typecheck, exit codes
- Layer 4 — Production monitoring: real traffic, error rates, cost per session, LLM veto rate
State Is the Hardest Problem
Routing is a solved problem. State management is where production systems collapse. From Builder.io's engineering blog:
"The hardest problem in multi-agent orchestration isn't routing—it's state."
Three failure modes appear repeatedly in developer threads:
Race conditions
When parallel agents write to shared state, one agent silently overwrites another's work with no error. Google ADK's parallel fan-out pattern addresses this by requiring each worker to write to unique state keys—never the same field. A synthesizer agent then merges the results explicitly.
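The unique-key discipline is simple to enforce in code. A sketch, with the worker names and the merge logic as illustrative stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(task, workers):
    """Run workers in parallel; each result lands under its own state key,
    so parallel writes can never silently clobber each other."""
    state = {}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, task) for name, fn in workers.items()}
    for name, fut in futures.items():
        state[f"result:{name}"] = fut.result()  # unique key per worker
    return state

def synthesize(state):
    # Explicit merge step: read every worker key and combine deliberately,
    # rather than letting the last writer win.
    return " | ".join(f"{k}={state[k]}" for k in sorted(state))
```

The discipline is structural: because no two workers can share a key, a lost update becomes impossible by construction rather than something you hope tests catch.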
Context overflow
Single agents work until context windows fill up. Multi-agent systems hit this faster because handoff messages carry accumulated context from all prior agents. Too little context and agents repeat work. Too much and token costs scale quadratically with every handoff. One developer reported a client hitting $2,000 in API costs in a single day because an agent discovered recursive self-improvement—it kept calling itself to optimise its own prompts with no circuit breaker to stop it.
The "50 First Dates" problem
Agents forget everything between sessions. Steve Yegge's "Beads" system addresses this with git-backed JSONL memory using hash-based IDs to prevent merge conflicts across parallel agents. Addy Osmani's pattern uses an AGENTS.md file—a running handbook documenting discovered patterns, gotchas, and conventions that persists across sessions. The principle: "Each improvement should make future improvements easier."
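A minimal version of the hash-keyed JSONL idea might look like this; the schema and field names are illustrative, not Beads' actual format:

```python
import hashlib
import json

def note_id(text: str) -> str:
    # Content-derived ID: two agents recording the same fact in parallel
    # produce the same ID, so there is nothing to merge-conflict over.
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def append_note(path, text, kind="gotcha"):
    """Append one memory entry; append-only JSONL diffs cleanly in git."""
    entry = {"id": note_id(text), "kind": kind, "text": text}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry["id"]

def load_notes(path):
    """Replay the whole memory file at the start of a session."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

Loading this file at session start is the mechanical version of the AGENTS.md handbook: each session begins where the last one left off instead of from scratch.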
TDD Is the Killer Feedback Signal
The best feedback signal for an agentic loop is one that is unambiguous, immediate, and requires no human in the critical path. Test-driven development provides exactly this.
Write failing tests first. The agent implements against them, runs them, and self-corrects until they pass. No interpretation required—pass or fail is the verdict. Colin Eberhardt's flexbox algorithm experiment (published on Scott Logic's blog) completed 800 lines of code and 350 tests in 3 hours using this pattern—a task that took 2 weeks manually in 2015.
His observation about why TDD works so well for agents:
"How much code can you write in your editor and be sure that it is correct without running it? Personally I'd struggle to get beyond 5 lines of code." — Colin Eberhardt, Scott Logic
The same constraint applies to agents. The difference is that running code is cheap and fast for them. The bottleneck is having a clear signal about whether the output is correct. Tests provide that signal.
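A bare-bones version of that loop, where `run_agent` is a stand-in for whatever model call you use and the iteration cap keeps a stuck agent from looping forever:

```python
import subprocess

def tdd_loop(run_agent, test_cmd=("pytest", "-q"), max_iters=5):
    """Run the suite, feed failures back to the agent, stop on green.

    The exit code is the whole feedback signal: no interpretation needed.
    """
    for attempt in range(1, max_iters + 1):
        proc = subprocess.run(test_cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return attempt  # unambiguous pass verdict
        # Hand back only the failure output; the agent self-corrects.
        run_agent(proc.stdout + proc.stderr)
    raise RuntimeError("tests still failing after cap; escalate to a human")
```

Everything the post says about TDD as a signal lives in that `returncode == 0` check: immediate, unambiguous, and no human in the critical path.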
The Competing-Hypotheses Pattern
One insight from Claude Code's Agent Teams documentation deserves wider attention. When debugging complex issues, sequential investigation is biased: once one hypothesis is explored, subsequent investigation anchors on it. The alternative is parallel adversarial investigation:
"Spawn 5 agent teammates to investigate different hypotheses. Have them talk to each other to try to disprove each other's theories, like a scientific debate." — Claude Code documentation
This produces more reliable root cause identification because no single thread of reasoning dominates. Each agent builds its own evidence. The orchestrator evaluates competing conclusions rather than extending a single chain of thought.
Recommended team size for this pattern: 3–5 agents. More agents increase coordination overhead without proportional quality gains.
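One way to sketch the orchestrator side of this pattern; the evidence scoring here is a toy stand-in for a judge agent weighing competing conclusions:

```python
from concurrent.futures import ThreadPoolExecutor

def investigate_all(hypotheses, investigate):
    """Run one investigator per hypothesis in parallel and keep the one
    whose supporting evidence best survives rebuttal.

    `investigate(h)` returns (supporting_evidence, refuting_evidence);
    counting items is a deliberately crude proxy for a judge agent.
    """
    with ThreadPoolExecutor(max_workers=5) as pool:
        reports = list(pool.map(investigate, hypotheses))
    scored = [
        (len(support) - len(refute), h)
        for h, (support, refute) in zip(hypotheses, reports)
    ]
    return max(scored)[1]  # hypothesis with the best surviving evidence
```

The structural property matters more than the scoring: each hypothesis gets its own independent evidence thread, so no single line of reasoning can anchor the whole investigation.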
Token Economics Determine Architecture
Every architectural decision in a multi-agent system is also a cost decision. Anthropic data shows multi-agent systems use approximately 15× more tokens than single-agent chats. A CrewAI crew with 5 agents costs roughly 5× more per task than a single LangChain agent.
The right question is not "can we parallelise this?" but "does the quality improvement justify the token cost?"
Developer accounts from the community are direct about the cost reality:
- One developer burned $4 in API costs from 11 uncontrolled revision cycles on a small task
- Multi-agent orchestration for complex workflows can reach $200 per session
- Semantic caching at the infrastructure layer (Redis) achieves 70% hit rates, reducing LLM costs by up to 70% in high-volume systems
Practical guidance from the framework comparisons: token usage explains 80% of performance variance. Optimise the context passed to each agent before adding more agents.
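To make the caching point concrete, here is a toy cache with hit-rate tracking. Real semantic caches (Redis and similar) match on embedding similarity; this sketch matches on normalised prompt text to stay self-contained:

```python
class PromptCache:
    """Toy prompt cache: every hit avoids a model call entirely."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, prompt, call_model):
        key = " ".join(prompt.lower().split())  # normalise case/whitespace
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        self.store[key] = call_model(prompt)    # the expensive path
        return self.store[key]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate()` alongside cost per session tells you directly how much of your token spend the cache is absorbing.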
Humans On the Loop, Not In It
Martin Fowler's framing of the three human positioning modes in agentic systems is the clearest description of where engineering effort should go:
- Humans outside the loop ("vibe coding") — humans specify outcomes, agents implement. Risk: codebase quality degrades over time.
- Humans in the loop — humans manually inspect every agent output. Problem: agents generate code faster than humans can inspect it, creating bottlenecks.
- Humans on the loop (recommended) — humans engineer the harness: specs, quality gates, evaluation criteria. Rather than reviewing artifacts, they improve the system that produces them.
The practical implication: your engineering time should go into the feedback infrastructure, not into reviewing individual agent outputs. A well-engineered harness catches bad outputs automatically. A poorly-engineered harness forces you to be the verification layer—which doesn't scale.
Framework Selection in 2026
The community has developed a useful heuristic: "If it looks like a flowchart with loops, LangGraph. If it looks like a conversation thread, AutoGen. If it looks like a job description board, CrewAI."
What the framework comparisons reveal about production readiness:
- LangGraph — best for stateful, cyclic workflows with self-correction. Steep learning curve. Requires upfront state schema definition. LangSmith shifted debugging from "print statements everywhere" to "click the node that failed."
- CrewAI — best for role-based teams with YAML-driven workflows. Logging is broken (standard print/log functions don't work inside Task). Critic-to-researcher feedback loops can feel like fighting the framework.
- AutoGen — best for conversational multi-agent problem solving. speaker_selection_method="auto" unpredictably skips agents or creates loops for no obvious reason. Hard to debug conversations at scale.
- Claude Code Agent Teams — best for parallel research, review, and cross-layer features. Experimental, but the competing-hypotheses pattern is uniquely powerful.
The Protocol Layer Is Maturing
Two emerging standards are worth tracking as the orchestration ecosystem stabilises:
- MCP (Model Context Protocol) by Anthropic — standardises how agents access tools, reducing integration surface area
- A2A (Agent-to-Agent) by Google — peer-to-peer agent collaboration protocol backed by 50+ companies including Microsoft and Salesforce
These protocols address one of the most painful parts of multi-agent development: the glue code between agents and tools. As they mature, more engineering effort can go into orchestration logic and feedback infrastructure rather than integration plumbing.
What to Build First
The practical starting point, based on what actually works in production:
- Start with a single agent and good tests. The feedback loop from TDD is more valuable than adding agents. Most tasks that "feel" like multi-agent problems are single-agent problems with insufficient verification.
- Add a critic agent before adding worker agents. A critic that verifies output quality gives you the feedback signal you need. A second worker agent gives you parallelism—which is only useful after the quality problem is solved.
- Build the feedback stack layer by layer. Add Layer 2 (agent self-verification) before Layer 3 (CI integration). Each layer catches what the layer above misses. Don't skip to production monitoring before the earlier layers are solid.
- Hard-cap iterations and externalise state. Every orchestration system needs a maximum iteration count. Agents that haven't resolved a problem in N cycles should escalate, not keep trying. State should live outside the agent's context window.
- Track cost per session from day one. Without this metric, you cannot tell whether your orchestration is working or burning tokens on repeated failures.
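The checklist above condenses into a loop skeleton like this one; `agent_step`, its cost accounting, and the budget figure are all illustrative:

```python
def run_session(task, agent_step, verify, max_iters=5, budget_usd=5.0):
    """One orchestrated session: capped iterations, externalised state,
    cost tracked from the first call.

    `agent_step(state)` returns (output, cost_usd); `verify(output)` is
    the feedback signal (tests, critic, CI). Stand-ins, not a framework.
    """
    state = {"task": task, "history": [], "cost_usd": 0.0}
    for _ in range(max_iters):
        output, cost = agent_step(state)   # state lives outside the agent
        state["history"].append(output)
        state["cost_usd"] += cost
        if state["cost_usd"] > budget_usd:
            raise RuntimeError(f"budget exceeded: ${state['cost_usd']:.2f}")
        if verify(output):
            return output, state
    raise RuntimeError("iteration cap hit; escalate to a human")
```

Both `RuntimeError` branches are escalation points: the loop stops trying and hands the problem to a human instead of burning tokens on repeated failures.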
The Bottom Line
The 40% of agentic AI projects projected to be scrapped by 2027 will not fail because the models are insufficient. They will fail because the coordination layer has no feedback—agents run, produce output, and no system exists to tell them whether that output was correct or not.
The teams that ship working multi-agent systems treat the feedback infrastructure as the primary engineering deliverable. The agents are secondary. Get the loop right, and the agents will surprise you with what they can do. Skip the loop, and adding more agents just amplifies whatever is already broken.
Build the harness first. The agents will thank you.