Sakana Fugu Ships a Ready-to-Use Council of Experts System

Picture the moment a founder pastes a signed vendor contract into ChatGPT and asks for a risk review. One model, one pass, one opinion. If it misses a liability clause on page 14, nothing in the interface says so. You get confident prose either way.

July 1, 2026 9 min read

Key points

What "Orchestration" Actually Means
Does It Actually Beat One Model Alone
Running a Real Task Through Fugu's API
What Fugu Actually Costs
The Free Alternative: Hermes Agent

What You'll Learn

What a "conductor model" does differently from a router or wrapper script
The exact Fugu and Fugu Ultra pricing, down to the per-million-token rate
How to run a real code review or competitive analysis through Fugu's API
How to get a rougher but free version of the same result with Hermes Agent
A decision rule for which option fits a 3-person team versus a 30-person one

That single-model blind spot gets expensive once AI output feeds real decisions: a positioning memo the board reads, a code change shipping to production, a security review before a client audit. Different frontier models catch different mistakes. Relying on just one is a bet made without realizing it was placed.

Two products launched this year fix that by putting multiple models on the same task and forcing them to check each other's work. One costs real money and zero setup. The other costs almost nothing and an afternoon of setup. Here is exactly how each one works, with the numbers that decide which fits your team.

What "Orchestration" Actually Means

Orchestration, in plain terms, means one system decides which AI models get involved in a task, what job each one does, and how their answers combine into a single final response. Think of a hiring manager who does not do the work personally, but knows exactly which specialist to call for each piece and how to merge the results.

Sakana AI, a Tokyo-based lab, launched a product built around this idea on June 22, 2026, called Fugu, named for the Japanese pufferfish that a licensed chef checks for toxin before it reaches the plate. The pitch: one system that checks the work before you see it.

Here is what separates Fugu from a router that just picks "GPT for coding, Claude for writing." A router follows rules a human wrote in advance. Fugu is a conductor model: software trained to decide which other AI models work on a task and in what order, the way an orchestra conductor decides which section plays and when, without playing an instrument itself. It learned, through training, which models to call, what role each plays, and how to merge their answers into one response.
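For contrast, here is roughly what a rule-based router looks like. This is a hypothetical sketch, not code from Sakana or any vendor: the keyword table, the `route` function, and the model labels are all invented for illustration.

```python
# Hypothetical rule-based router: every decision is a rule a human wrote in advance.
ROUTING_RULES = [
    (("code", "bug", "function"), "gpt-for-coding"),
    (("essay", "memo", "rewrite"), "claude-for-writing"),
]
DEFAULT_MODEL = "general-model"


def route(prompt: str) -> str:
    """Return the first model label whose keywords appear in the prompt."""
    words = prompt.lower().split()
    for keywords, model in ROUTING_RULES:
        if any(keyword in words for keyword in keywords):
            return model
    return DEFAULT_MODEL


print(route("Fix the bug in this function"))  # gpt-for-coding
print(route("Summarize this contract"))       # general-model
```

Nothing here adapts to the task. A prompt that matches no keyword falls through to the default, which is the gap a trained conductor is meant to close.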

Fugu draws on a pool that includes GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, and recursive calls to itself. For each task it assigns three roles, drawn from two peer-reviewed papers, TRINITY and Conductor, that Sakana published at ICLR 2026, a leading academic AI conference where research is vetted before publication.

Thinker: breaks down the problem
Worker: drafts an answer
Verifier: checks the draft for errors before it goes out

TRINITY uses a 0.6-billion-parameter coordinator, tiny next to the hundreds-of-billions-parameter models it manages, refined with an evolutionary algorithm called CMA-ES rather than standard training. It assigns the Thinker, Worker, and Verifier roles across multiple turns, adapting the team as the task unfolds. Conductor is a 7-billion-parameter model trained with reinforcement learning, meaning it is rewarded for good outcomes instead of following fixed rules. It invents its own coordination strategies and calls itself recursively for deeper reasoning. Together, these two systems are what Fugu runs on: a trained decision-maker, not an if-then script.

Council of experts, in plain terms: a small group of different AI models working the same problem from different angles, the way a founder might loop in a lawyer, a CFO, and an outside consultant before signing a deal, instead of asking one advisor and trusting the first answer.
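The three roles can be sketched as a small pipeline. This is a minimal illustration under loud assumptions, not Sakana's implementation: the roles run in one fixed order here, whereas Fugu learns which model plays which role, and the "models" are stub functions so the sketch runs without an API key. The names `run_council` and the `stub_*` functions are hypothetical.

```python
from typing import Callable, List

# A "model" here is any function from prompt text to response text.
Model = Callable[[str], str]


def run_council(task: str, thinker: Model, worker: Model, verifier: Model) -> dict:
    """Run one fixed Thinker -> Worker -> Verifier pass over a task."""
    # Thinker: break the task into subproblems, one per line.
    subtasks: List[str] = [s for s in thinker(task).splitlines() if s.strip()]
    # Worker: draft an answer for each subproblem.
    drafts = [worker(subtask) for subtask in subtasks]
    # Verifier: check each draft and keep only the ones it approves.
    approved = [d for d in drafts if verifier(d).strip().upper() == "OK"]
    return {"subtasks": subtasks, "drafts": drafts, "approved": approved}


# Stub models (assumptions) standing in for real model calls.
def stub_thinker(task: str) -> str:
    return "session handling\npassword storage"


def stub_worker(subtask: str) -> str:
    return f"Finding for {subtask}"


def stub_verifier(draft: str) -> str:
    return "OK" if "password" in draft else "REJECT"


result = run_council("Review the auth module", stub_thinker, stub_worker, stub_verifier)
print(result["approved"])  # ['Finding for password storage']
```

In a real setup each stub would be a call to a different model, which is where the "different angles" benefit comes from.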

Does It Actually Beat One Model Alone

Sakana published its benchmark numbers openly, and they are more mixed, and more credible, than a simple "beats everything" claim. On SWE-Bench Pro, a real-world engineering test, Fugu Ultra scored 73.7 against Opus 4.8's 69.2, Gemini 3.1 Pro's 54.2, and GPT-5.5's 58.6. On GPQA-Diamond, 198 graduate-level science questions, Fugu and Fugu Ultra both scored 95.5, ahead of Opus 4.8's 92.0.

[Chart: SWE-Bench Pro score by model]

Sakana also compares its results against Fable 5 and Mythos Preview, two Anthropic models that are not publicly available and are therefore excluded from Fugu's agent pool. By Sakana's own account, Fugu Ultra trails Fable 5 by roughly 6 to 9 points on benchmarks like SWE-Bench Pro, but beats the older Mythos Preview on GPQA-Diamond. Fugu Ultra sits at or near the top of publicly available models, not above every model that exists. These numbers are self-reported and not yet independently reproduced, so treat them as a strong signal, not a certified result.

Running a Real Task Through Fugu's API

Fugu ships through an OpenAI-compatible API, meaning it works with the same code structure most tools already use to call ChatGPT. No new SDK, no new client library. Point the base URL at Sakana's endpoint, swap in your API key, and pick a model name.

    from openai import OpenAI

    # Point the standard OpenAI client at Sakana's endpoint.
    client = OpenAI(
        base_url="https://console.sakana.ai/v1",
        api_key="YOUR_SAKANA_API_KEY",
    )

    response = client.chat.completions.create(
        model="fugu-ultra-20260615",
        messages=[{
            "role": "user",
            "content": (
                "Review this codebase's authentication module for security "
                "flaws, race conditions, and missing input validation. "
                "Flag every issue with a severity rating and a fix."
            ),
        }],
    )

    # Print the final, verified answer.
    print(response.choices[0].message.content)

Send that request against a real repository and Fugu Ultra's Thinker role breaks the module into pieces (session handling, password storage, token refresh), the Worker drafts findings for each piece, and the Verifier checks them before the final answer reaches you. One engineer quoted in Sakana's launch materials reported that where other tools flagged about three issues, Fugu Ultra surfaced more than twenty. Treat that as one account, not a controlled study, but it matches the design: three roles checking the same code catch more than one pass does.

The same pattern works for a competitive analysis: swap the prompt to "Compare these three competitors' pricing pages and identify where our positioning is weakest, citing specific language," and Fugu Ultra spreads research, drafting, and fact-checking across the same three roles instead of one model doing all three jobs in one pass.
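Since only the prompt changes between the two tasks, the request can be built by a small helper. This is a sketch: `build_request` is a hypothetical name, and it only assembles the arguments, so it runs without a network call or API key.

```python
FUGU_ULTRA_MODEL = "fugu-ultra-20260615"  # model ID quoted in this article


def build_request(prompt: str, model: str = FUGU_ULTRA_MODEL) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


competitive_analysis = build_request(
    "Compare these three competitors' pricing pages and identify where our "
    "positioning is weakest, citing specific language."
)
print(competitive_analysis["model"])

# With a configured OpenAI-compatible client, the call would then be:
# response = client.chat.completions.create(**competitive_analysis)
```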

What Fugu Actually Costs

Fugu Ultra (model ID fugu-ultra-20260615) costs $5 per million input tokens and $30 per million output tokens on pay-as-you-go, with cached input at $0.50 per million. Once a request's context passes 272,000 tokens, those rates rise to $10 input (double) and $45 output (1.5 times), which matters when feeding it a large codebase in one call. For regular Fugu, a request with one active model bills at that model's standard rate; a request with several active agents still bills at one blended rate based on the highest-tier model involved, never a stacked sum.
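To see what the 272,000-token threshold does to a bill, here is a rough estimator using the rates quoted above. It rests on two assumptions not confirmed by the source: that "context" is measured on input tokens, and that crossing the threshold moves the whole request to the higher rates. It also ignores cached input. The function name is hypothetical.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # tokens

# USD per million tokens (input, output), as quoted in this article.
STANDARD_RATES = (5.00, 30.00)
LONG_CONTEXT_RATES = (10.00, 45.00)


def fugu_ultra_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the pay-as-you-go cost in USD of one uncached Fugu Ultra request."""
    # Assumption: passing the threshold reprices the entire request.
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_rate, output_rate = LONG_CONTEXT_RATES
    else:
        input_rate, output_rate = STANDARD_RATES
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(fugu_ultra_cost(200_000, 10_000))  # 1.3
print(fugu_ultra_cost(300_000, 10_000))  # 3.45
```

Under these assumptions, growing the input by half (200,000 to 300,000 tokens) raises the cost of the request from $1.30 to $3.45, which is the case for splitting a large codebase across calls.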

For a flat bill instead, three monthly tiers cover both Fugu and Fugu Ultra: Standard at $20 for light daily use, Pro at $100 for 10 times the Standard allowance, and Max at $200 for 30 times the allowance, built for long, heavy workloads. A 3-person team doing a handful of reviews a week fits Standard or Pro. A team running Fugu Ultra all day needs Max or pay-as-you-go.

The Free Alternative: Hermes Agent

Nous Research, an open-source AI lab, released Hermes Agent in February 2026 under the MIT license: free to use, modify, and run with no licensing fee. It is not built around a trained conductor the way Fugu is; no model learned to assign Thinker, Worker, and Verifier roles. Instead, it is a self-hosted framework that points at any model, or several in sequence, plus a memory system Fugu lacks entirely.

Hermes Agent runs a three-layer memory system: skill memory (procedural knowledge written to plain markdown files after solving a hard problem), conversational memory (a searchable record of past sessions), and user modeling (a profile built over time of how you work). Fugu has no persistent memory between API calls; every request starts fresh. That is one real gap Hermes Agent closes and Fugu does not.
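The skill-memory idea is simple enough to sketch: write a lesson to a markdown file, then read saved lessons back before the next task. This is an illustration of the concept only. The file layout and the `save_skill` and `load_skills` names are assumptions, not Hermes Agent's actual format or API.

```python
import tempfile
from pathlib import Path


def save_skill(skills_dir: Path, name: str, lesson: str) -> Path:
    """Write one lesson to a markdown file and return its path."""
    skills_dir.mkdir(parents=True, exist_ok=True)
    path = skills_dir / f"{name}.md"
    path.write_text(f"# {name}\n\n{lesson}\n", encoding="utf-8")
    return path


def load_skills(skills_dir: Path) -> str:
    """Concatenate all saved skills so they can be prepended to a new prompt."""
    files = sorted(skills_dir.glob("*.md"))
    return "\n".join(f.read_text(encoding="utf-8") for f in files)


# Demo in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    skills = Path(tmp) / "skills"
    save_skill(skills, "auth-review", "Check token refresh for race conditions.")
    print(load_skills(skills))
```

Because the memory is plain markdown, a person can read, edit, or delete what the agent has learned.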

Hermes Agent ships with 40-plus built-in tools (web search, browser automation, image generation, code execution), connects to over 200 model backends through OpenRouter with one API key, and installs with one command on Linux, macOS, or Windows via WSL2.

    curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
    hermes setup
    hermes model
    hermes

The installer handles Python, dependencies, and the hermes command automatically. Setup connects to a provider: OpenRouter, a direct API key from OpenAI or Anthropic, or a local model through vLLM at zero API cost but with weaker output. Once running, tell it what to check for.

Review the authentication module for security flaws, race conditions, and missing input validation. Check it against OWASP categories, flag each issue with a severity rating, and save what you learn as a skill so future reviews start from what you already know.

That last instruction, asking it to save a skill, is the piece Fugu cannot replicate. Hermes Agent writes what it learns to a markdown file, so the next review against the same repository starts from what it already knows, not from cold.

The Real Cost and Skill Tradeoff

Hermes Agent itself is free. What you pay for is the model traffic behind it, and that swings hard by model choice. Budget models on OpenRouter, like DeepSeek V4 Flash at roughly $0.14 per million input tokens and $0.28 per million output tokens, keep a small team's monthly bill in the $2 to $8 range for light-to-moderate use. Route the same setup through Claude Sonnet at $3 input and $15 output per million tokens, and the bill climbs toward Fugu subscription territory, minus the trained conductor doing role assignment.
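The swing is easy to check with the per-million rates above. The workload below (10 million input and 2 million output tokens a month) is a hypothetical figure chosen for illustration, and the dictionary keys are plain labels, not OpenRouter model identifiers.

```python
# USD per million tokens (input, output), as quoted in this article.
RATES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "claude-sonnet": (3.00, 15.00),
}


def monthly_cost(model: str, input_millions: float, output_millions: float) -> float:
    """Estimate monthly model spend in USD for a given token volume."""
    input_rate, output_rate = RATES[model]
    return round(input_millions * input_rate + output_millions * output_rate, 2)


# Hypothetical workload: 10M input and 2M output tokens per month.
for model in RATES:
    print(model, monthly_cost(model, 10, 2))
```

On that workload the budget model comes to about $1.96 a month and Claude Sonnet to $60, a gap of roughly 30 times for the same traffic.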

Stated plainly: Fugu costs $20 to $200 a month and needs no setup. Point an existing tool at the endpoint and it works in minutes. Hermes Agent costs closer to $2 to $8 a month on a budget model, but demands an afternoon of setup and hand-written multi-step prompts, since no trained conductor assigns roles automatically. The trade is dollars for setup time, not quality for a discount, because Hermes Agent can point at Claude Opus or GPT-5.5 directly if budget allows.

A 3-person team with nobody comfortable at a terminal should default to Fugu's Standard plan at $20 a month and skip the setup. A team with one technical co-founder who wants the memory system Fugu lacks gets nearly the same outcome from Hermes Agent for a fraction of the spend. Neither choice is wrong: it comes down to whether the scarce resource is cash or setup time.

Action Steps Summary

1. Pick your task first. Identify one recurring high-stakes task (code review, competitive analysis, contract review) where a single model's blind spots already cost you something.

2. Try Fugu on Standard first. Sign up at console.sakana.ai for the $20 Standard tier and point an existing OpenAI-compatible tool at the endpoint.

3. Watch the context size. Keep requests under 272,000 tokens on Fugu Ultra to stay at $5 input and $30 output per million tokens instead of the higher long-context rate of $10 input and $45 output.

4. Install Hermes Agent as a parallel test. Run the one-line curl install, connect it to a budget model through OpenRouter, and run the same task side by side with Fugu.

5. Score both outputs against one rubric. Compare issues found, accuracy, and time to result, remembering Sakana's benchmark numbers are self-reported and not yet independently verified.
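One way to keep the side-by-side comparison in step 5 honest is to write the rubric down before looking at results. The sketch below is one possible rubric, not a standard: the field names, the ranking order, and the tool labels are assumptions, and the numbers are placeholders rather than measured results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    tool: str
    issues_found: int      # issues the tool reported
    issues_confirmed: int  # issues a human reviewer confirmed as real
    minutes: float         # wall-clock time to a usable result


def precision(run: RunResult) -> float:
    """Share of reported issues that were real; 0.0 if nothing was reported."""
    return run.issues_confirmed / run.issues_found if run.issues_found else 0.0


def rank(runs: List[RunResult]) -> List[RunResult]:
    """Rank runs by confirmed issues, then precision, then speed."""
    return sorted(runs, key=lambda r: (-r.issues_confirmed, -precision(r), r.minutes))


# Placeholder numbers for illustration only, not measured results.
runs = [
    RunResult("tool_a", issues_found=12, issues_confirmed=6, minutes=5.0),
    RunResult("tool_b", issues_found=8, issues_confirmed=6, minutes=9.0),
]
for run in rank(runs):
    print(run.tool, run.issues_confirmed, round(precision(run), 2), run.minutes)
```

Counting only confirmed issues matters here: a tool that flags twenty problems is not better than one that flags three unless a reviewer agrees the extra findings are real.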

Bottom line

The useful move with Fugu is to run one narrow test this week, then keep only the workflow that saves time, improves a decision, or gives your team clearer output. Treat the announcement as raw material, not the win itself.

About the author

Pierre Bradshaw, Founder, PromptHacker.ai

Pierre has spent 25+ years building growth systems across fintech, real estate, lending, campaigns, and AI workflows, with machine-learning work dating back to 2012.
