Could the future of reliable operational documentation be an AI cage match? It sounds like science fiction, but the implications for how we build and maintain critical systems are staggering.
When you ask a single Large Language Model to write a production runbook, it’s akin to handing one engineer the keys to the entire software development kingdom – design, code, review, and sign-off. It might get the job done, sometimes. But the real danger? The silent failure mode: confident, fluent text that casually omits a crucial rollback step, invents a non-existent configuration flag, or simply misses an entire edge case. It’s the digital equivalent of a confidently delivered wrong turn.
This very problem is what drove the creation of the AI Council, a fascinating experiment from the maker of RunDoc, a SaaS platform for crafting runbooks and standard operating procedures. The premise is deceptively simple, yet powerful: four LLMs generate a runbook draft independently. Then, in a digital gladiatorial arena, they cross-review each other’s work. Finally, a fifth model, the ‘Chairman,’ synthesizes the final, fortified document. And here’s the kicker: the cross-review phase, not the specific model or the final synthesis, turned out to be the absolute linchpin.
Why One AI Isn’t Enough
Initially, the approach was straightforward, almost naive. Send the same prompt to GPT-4o, Gemini 2.5 Flash, Claude Sonnet, and Grok 3 Mini. Collect the four distinct outputs. Then task a ‘Chairman’ model with cherry-picking the best bits and stitching them into a cohesive whole. The outcome? ‘Fine.’ Better than any single model, sure, but hardly a leap into the future. The Chairman, bless its algorithmic heart, tended to favor verbosity or unearned confidence – precisely the wrong signals for accuracy.
The fundamental flaw was that the individual LLMs were tripping over the same stones. Rollback steps? Skipped unless explicitly demanded. Command-line flags? Fabricated with plausible-sounding conviction. Narrative prose over checklists? The default, unless explicitly re-instructed. Aggregating four failures, even slightly different ones, doesn’t magically create success. It just creates a slightly more complex failure.
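For reference, that first iteration amounts to a fan-out followed by a single merge call. Here is a minimal sketch, assuming each model is wrapped in a hypothetical `ModelFn` callable that takes a prompt and returns text. The function names and the wording of the merge prompt are illustrative assumptions, not RunDoc's actual code.

```python
from typing import Callable, Dict

# Hypothetical interface: each model is wrapped as a function from prompt to text.
ModelFn = Callable[[str], str]


def generate_drafts(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the same prompt to every model and keep one draft per model name."""
    return {name: call(prompt) for name, call in models.items()}


def naive_chairman_prompt(drafts: Dict[str, str]) -> str:
    """Build the first-iteration merge prompt: drafts only, no peer reviews."""
    sections = [
        f"### Draft from {name}\n{text}" for name, text in sorted(drafts.items())
    ]
    # Assumed wording; the article does not quote the original merge prompt.
    return (
        "Combine the best parts of these runbook drafts into one runbook.\n\n"
        + "\n\n".join(sections)
    )


def naive_council(prompt: str, models: Dict[str, ModelFn], chairman: ModelFn) -> str:
    """Fan out to every model, then let a single chairman call merge the drafts."""
    drafts = generate_drafts(prompt, models)
    return chairman(naive_chairman_prompt(drafts))
```

Nothing in this shape gives the Chairman any evidence about which draft is wrong, which fits the observation that it fell back on length and confidence.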
The Crucible of Critique
This is where the second iteration ignited. The cross-review step: after each model birthed its draft, every other model was unleashed upon it. The prompt was starkly focused: “Find the errors. Specifically: missing prerequisites, hallucinated commands or flags, missing rollback steps, unsafe ordering, missing verification steps. Be specific. Cite line numbers.”
And this, my friends, is where the magic happened.
Models are much better at finding errors than avoiding them.
It’s a profound observation, isn’t it? When Claude drafted a runbook, GPT-4o spotted the subtly wrong kubectl flags Claude had confidently injected. When GPT-4o penned its own version, Gemini flagged a rollback step that blindly assumed a backup existed without verifying it. They weren’t catching their own flaws, but they were reliably pouncing on each other’s. This is the digital equivalent of a brilliant editor’s keen eye, honed by a different perspective. Critique mode forces a distinct cognitive pivot. Generating an output prioritizes fluency and completeness; reviewing demands a laser focus on what’s broken. The two goals pull in different directions, and a single agent trying to juggle both often falls prey to an overzealous optimism about its own creations.
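The cross-review fan-out can be sketched in a few lines. This is a rough illustration under stated assumptions: `ModelFn` is a hypothetical prompt-in, text-out wrapper, and numbering the draft's lines before review is my own addition so that "cite line numbers" has something to point at. The review instruction itself is the one quoted above.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

ModelFn = Callable[[str], str]  # hypothetical wrapper: prompt in, text out

# Review instruction as quoted in the article.
REVIEW_INSTRUCTION = (
    "Find the errors. Specifically: missing prerequisites, hallucinated "
    "commands or flags, missing rollback steps, unsafe ordering, missing "
    "verification steps. Be specific. Cite line numbers."
)


def number_lines(draft: str) -> str:
    """Prefix each line with its number so reviewers can cite it (assumption)."""
    lines = draft.splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def review_pairs(names: List[str]) -> List[Tuple[str, str]]:
    """Every (reviewer, author) pair in which the reviewer did not write the draft."""
    return list(permutations(names, 2))


def cross_review(
    drafts: Dict[str, str], models: Dict[str, ModelFn]
) -> Dict[Tuple[str, str], str]:
    """Have each model critique every draft except its own."""
    reviews: Dict[Tuple[str, str], str] = {}
    for reviewer, author in review_pairs(sorted(drafts)):
        prompt = f"{REVIEW_INSTRUCTION}\n\n{number_lines(drafts[author])}"
        reviews[(reviewer, author)] = models[reviewer](prompt)
    return reviews
```

With four models, `review_pairs` yields 4 × 3 = 12 pairs, and no model ever reviews its own draft.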
The Chairman’s New Mandate
With four drafts and a dozen cross-reviews (each model meticulously scrutinized by the other three), the Chairman’s role transformed. It was no longer about picking the ‘best’ draft. Now, its mission was to synthesize a draft that could withstand the gauntlet of critiques. The Chairman’s instructions morphed:
You have 4 candidate runbooks and 12 peer reviews.
For each step in the final runbook, you must:
1. Include only steps that appear in at least 2 candidates OR are uniquely justified by a critique
2. Apply every correction that has not been refuted by another reviewer
3. Default to the most conservative version when candidates disagree (e.g., add the verification step, include the rollback)
4. Flag any disagreement that you couldn't resolve
That final point is a masterstroke. The Chairman doesn’t pretend to be omniscient. It surfaces disagreements as direct warnings within the final runbook. “Two models suggested --force, two recommended against it. Verify your cluster’s policy before using.” This isn’t just documentation; it’s an operational risk assessment embedded in the instructions.
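In the Council, the Chairman is itself a model following that prompt, so none of this is hard-coded. Still, rules 1 and 4 are easier to reason about when written out deterministically. The sketch below is only an illustration under simplifying assumptions: steps are compared by normalised text, a step "justified by a critique" is passed in as a plain set, and an even split counts as an unresolved disagreement. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def normalise(step: str) -> str:
    """Crude matching (assumption): ignore case and repeated whitespace."""
    return " ".join(step.lower().split())


def select_steps(
    candidates: Dict[str, List[str]], justified: Set[str], min_votes: int = 2
) -> List[str]:
    """Rule 1: keep steps found in >= min_votes candidates or justified by a critique."""
    votes: Counter = Counter()
    first_seen: Dict[str, str] = {}
    for steps in candidates.values():
        for key in {normalise(s) for s in steps}:  # one vote per candidate
            votes[key] += 1
        for s in steps:
            first_seen.setdefault(normalise(s), s)  # remember original wording
    justified_keys = {normalise(s) for s in justified}
    # Steps come out in order of first appearance across the candidates.
    return [
        text
        for key, text in first_seen.items()
        if votes[key] >= min_votes or key in justified_keys
    ]


def disagreement_warning(option: str, stances: Dict[str, bool]) -> Optional[str]:
    """Rule 4: return a warning when models split evenly on an option, else None."""
    favour = sum(stances.values())
    against = len(stances) - favour
    if favour == 0 or favour != against:
        return None
    return (
        f"WARNING: {favour} models suggested {option}, "
        f"{against} recommended against it. Verify before using."
    )
```

A real Chairman also has to recognise that two differently worded steps mean the same thing, which is exactly the part a model handles better than string matching.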
The observable outcome, even without formal benchmarks yet, is striking. The AI Council’s runbooks are noticeably denser. They include more verification steps, more explicit prerequisites, richer rollback detail, and fewer implicit assumptions about the user’s environment. For a seasoned SRE navigating familiar territory, this might feel like unnecessary clutter. But for a junior on-call engineer facing a critical incident at 3 AM, this granular detail is the difference between a swift resolution and a cascading disaster.
What To Skip Next Time
A few experiments from the trenches are worth noting. Adding more models beyond four yielded diminishing returns – the cost and latency soared while the quality gains were marginal. The sweet spot, it seems, lies in having enough diversity to surface varied failure modes without drowning in complexity. And self-review? Largely a bust. The real power of critique emerges when it comes from an external, independent perspective – a different AI with a distinct set of learned priors.
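One likely contributor to that cost curve is simple arithmetic: the number of cross-reviews grows with the square of the model count. A back-of-the-envelope sketch, assuming every model reviews every other draft exactly once, a single Chairman call, and no retries (the function name is hypothetical, and token counts are not modelled):

```python
def council_calls(n_models: int) -> int:
    """Model calls per runbook: n drafts + n*(n-1) cross-reviews + 1 synthesis."""
    drafts = n_models
    reviews = n_models * (n_models - 1)
    synthesis = 1
    return drafts + reviews + synthesis


for n in (3, 4, 5, 6):
    print(n, council_calls(n))
# 3 10
# 4 17
# 5 26
# 6 37
```

Under those assumptions, going from four models to six more than doubles the number of calls, before accounting for the longer prompt the Chairman has to read.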
This AI Council approach isn’t just about better runbooks; it’s a blueprint for how we can imbue AI systems with greater reliability and safety. By forcing them to interact, to critique, and to collaborate (or, in this case, to argue constructively), we build systems that are more than the sum of their parts. We’re moving from AI assistants to AI collaborators, and the operational integrity of our digital world depends on it.
Is This the Future of Operational Documentation?
It certainly feels like a powerful step in that direction. When a single point of failure in documentation can have catastrophic system-wide consequences, the notion of a distributed, self-correcting AI system generating those documents becomes incredibly appealing. The AI Council approach essentially builds redundancy and a strong error-checking mechanism directly into the generative process. This isn’t just about churning out text; it’s about building trust in the instructions we rely on when the pressure is highest.
Why Does This Matter for Developers?
For developers and SREs, this is a potential game-changer for reducing operational toil and incident frequency. Imagine receiving runbooks that are less prone to hallucinated commands or missed rollback procedures. This means less time spent debugging documentation errors under duress and more time focusing on core system health. It also suggests a future where more complex operational tasks can be documented and executed with greater confidence, even by less experienced team members. The friction between development and operations can begin to erode when the handoff documents are this meticulously crafted.
Frequently Asked Questions
What is the AI Council? The AI Council is a system where four AI models independently generate runbooks, then cross-review each other’s output. A fifth ‘Chairman’ model synthesizes the final, robust version.
Why is cross-review important? AI models are better at identifying errors in others’ work than their own. Cross-review forces them to adopt a critical, error-detection mindset, catching mistakes that single models often miss.
Will this replace human runbook writers? Not necessarily. The AI Council augments human expertise by creating more robust drafts. Human oversight will likely remain critical for complex, nuanced systems and for the final sign-off, but the approach significantly raises the quality baseline.