Inside BingWow's Compound AI Stack: Which Model Owns Which Decision (and Why None of Them Own Control Flow)
The Berkeley AI Research blog called it the Compound AI System in February 2024. The thesis is one design choice: either control flow is written in traditional code that calls LLMs at specific bounded steps, or control flow is driven by an LLM that decides what to do next. Compound systems pick the first. Agents pick the second.
I wrote in The AI Journal about which choice fails more often and why. This post is the implementation detail of the working alternative: which model owns which decision inside BingWow's stack, and why none of them owns control flow.
The six models
BingWow ships AI-generated bingo cards. The product is free, ad-free, and used by classrooms and HR teams. Six models from four vendors handle different parts of the pipeline:
- Claude Sonnet 4.5 — content quality judgment, generation. Decides whether a card draft meets the publishability bar; produces new card titles when an existing one needs improvement.
- Claude Haiku 4.5 — classification: moderation, dedup, categorization. Routed to Haiku because Haiku is one-fifth the per-token cost of Sonnet and one-third that of Opus, and classification is a job a smaller model handles cleanly.
- Gemini 2.5 Flash — bingo-clue generation. Each clue is a 3-7 word phrase tied to a card topic. Flash is the throughput model for that bounded shape.
- Gemini 3 Flash Preview + Gemini 2.5 Pro fallback — themed display names. Every player who joins a multiplayer room gets a fresh themed name (a Halloween game produces Halloween names; a football game produces football names). Gemini-with-fallback is the right routing for a job that needs variety but cannot afford to fail.
- GPT-4o Vision — background image description. AI-generated backgrounds get a text caption for accessibility and search; GPT-4o is the strongest vision model for that specific job and is gated behind a single cron route.
- Replicate Flux Schnell — background image generation. Schnell is the fast tier of the Flux family; the cron generates an image, sends it back through GPT-4o Vision for the alt text, and stores both.
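The Gemini-with-fallback routing above is itself a code decision, not a model decision. A minimal sketch of that shape, assuming a hypothetical callGemini helper standing in for the real SDK call (the model IDs and names here are illustrative, not BingWow's code):

```typescript
// Try the preview model first; fall back to the stable model on any failure.
// callGemini is a hypothetical stand-in for a real SDK call. Here the preview
// model always throws so the fallback path is exercised.

type NameRequest = { theme: string; count: number };

async function callGemini(model: string, req: NameRequest): Promise<string[]> {
  if (model === "gemini-3-flash-preview") throw new Error("preview unavailable");
  return Array.from({ length: req.count }, (_, i) => `${req.theme}-player-${i + 1}`);
}

async function themedNames(req: NameRequest): Promise<string[]> {
  try {
    return await callGemini("gemini-3-flash-preview", req);
  } catch {
    // Code, not a model, decides what happens when the primary call fails.
    return await callGemini("gemini-2.5-pro", req);
  }
}
```

The try/catch is the entire "agent": the failure path is a branch in code, so it can be unit-tested like any other branch.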
None of these models decides what happens next. Code decides.
Where the code lives
Every transition between models is a TypeScript function, a SQL query, or a cron job. Here is the pipeline from the moment a visitor types a card topic to the moment that card is browsable on bingwow.com/cards:
- Visitor submits a topic. A Next.js API route validates the input, rate-limits the request via an anonymous rate key derived from the bingwow_anon_id cookie, and inserts a pending_topics row in Supabase.
- Cron picks up the topic. At 06:00 UTC, app/api/cron/process-pending-topics queries Supabase for unprocessed rows. The cron schedule is in vercel.json. No model decided when this should run.
- Gemini 2.5 Flash generates clues. The orchestrator (lib/process-one-topic.ts) calls Gemini with a structured-output schema. The output is 24-50 candidate clues in a typed JSON shape.
- SQL deduplicates. A normalized-form similarity query against existing cards in the same category rejects duplicates. Deterministic. No LLM involved.
- Claude Haiku 4.5 categorizes. The orchestrator hands Haiku the topic, the clue list, and a freshly queried list of valid category IDs. Haiku returns a single ID. The orchestrator validates the ID against the same DB read before accepting it.
- Claude Sonnet 4.5 makes the publishability call. Returns { publish: boolean }. The orchestrator routes accordingly: the publish path, or hard-delete (for AI-pipeline cards with no human owner).
- Replicate Flux Schnell generates a background. Same cron. The orchestrator runs up to four attempts with a Haiku-based text-detection gate (rejecting images that accidentally render English letters on the bingo board).
- GPT-4o Vision writes the alt text. Separate cron route. The image is now an asset; the description is metadata.
- Card status flips to published. A SQL UPDATE with a CHECK constraint that only allows the three legal statuses. The card now appears in the sitemap, in category archives, and in the IndexNow daily push.
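The pipeline above can be sketched as one plain TypeScript function. Everything in this sketch is a hypothetical stand-in (function names, the deps shape, the return values), not BingWow's actual modules; the point is the control flow: every branch is code, and every model call is a bounded, typed step.

```typescript
// Illustrative orchestrator shape: each model call is injected as a typed
// dependency, and code decides what happens between calls.

type Draft = { topic: string; clues: string[]; categoryId?: string };

async function processOneTopic(
  topic: string,
  deps: {
    generateClues: (t: string) => Promise<string[]>;               // Gemini 2.5 Flash
    isDuplicate: (clues: string[]) => Promise<boolean>;            // SQL similarity query
    categorize: (d: Draft, validIds: string[]) => Promise<string>; // Claude Haiku 4.5
    shouldPublish: (d: Draft) => Promise<{ publish: boolean }>;    // Claude Sonnet 4.5
    validCategoryIds: () => Promise<string[]>;
    publish: (d: Draft) => Promise<void>;
    hardDelete: (t: string) => Promise<void>;
  }
): Promise<"published" | "rejected"> {
  const clues = await deps.generateClues(topic);

  if (await deps.isDuplicate(clues)) {            // deterministic gate, no LLM
    await deps.hardDelete(topic);
    return "rejected";
  }

  const validIds = await deps.validCategoryIds(); // fresh DB read
  const draft: Draft = { topic, clues };
  const categoryId = await deps.categorize(draft, validIds);
  if (!validIds.includes(categoryId)) {           // validate the model's answer in code
    await deps.hardDelete(topic);
    return "rejected";
  }
  draft.categoryId = categoryId;

  const verdict = await deps.shouldPublish(draft);
  if (!verdict.publish) {
    await deps.hardDelete(topic);
    return "rejected";
  }

  await deps.publish(draft);
  return "published";
}
```

Because the models arrive as injected dependencies, the whole pipeline can be exercised in a test with stubs; no live API key is needed to pin the control flow.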
Eight steps. Each one a code decision. The models do bounded work between the decisions.
What this buys
Auditable cost. Every API call has a named caller in the codebase. When the April 2026 Anthropic bill spiked, I found the offender by grepping for claude-3-5-opus in lib/*.ts and replacing it with claude-haiku-4-5 in three files. The bill dropped from $560 a month to between $170 and $245. An agent keeps burning the same $560, because inside an agent the routing is the agent's decision, not a named caller you can grep for.
Auditable failure. When categorization started landing in the wrong subcategory in March, the bug was in a static fallback list in the moderation prompt — not in the model's judgment. The fix was to read the subcategory list from the categories table at request time and validate the AI's returned ID against the same DB read. The fix is in TypeScript, not in prompt engineering, because the failure was a code failure.
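The shape of that fix can be sketched in a few lines, assuming a hypothetical askModel callback and fetchCategoryIds DB read; the key move is that the prompt and the validation share the same read:

```typescript
// Sketch of "read the valid IDs at request time, validate the answer against
// the same read". Both names below are illustrative, not BingWow's code.

async function categorizeWithValidation(
  askModel: (validIds: string[]) => Promise<string>,
  fetchCategoryIds: () => Promise<string[]>
): Promise<string> {
  const validIds = await fetchCategoryIds(); // one read feeds prompt AND check
  const answer = (await askModel(validIds)).trim();
  if (!validIds.includes(answer)) {
    throw new Error(`model returned unknown category id: ${answer}`);
  }
  return answer;
}
```

A static fallback list in a prompt can silently rot; a shared DB read cannot drift apart from its own validation.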
Auditable evaluation. Tests exist for code. Every API route has a test fixture that calls it with a known input and asserts on the output shape. The research portal publishes monthly metrics on pipeline output (publish rate, dedup rejection rate, moderation flips). Drift on any axis triggers a code change, not a vibes-based prompt tweak.
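A minimal shape assertion in that spirit (the field name is taken from the publishability step; the function itself is hypothetical) fails loudly the moment a model response drifts from the contract:

```typescript
// Pin the output shape of the publishability call: exactly the kind of
// assertion a route-level test fixture can run against a known input.

type PublishVerdict = { publish: boolean };

function assertVerdictShape(value: unknown): asserts value is PublishVerdict {
  if (typeof value !== "object" || value === null) {
    throw new Error("verdict must be an object");
  }
  if (typeof (value as { publish?: unknown }).publish !== "boolean") {
    throw new Error("publish must be a boolean");
  }
}
```

After the assertion, TypeScript narrows the value to PublishVerdict, so the orchestrator's publish/hard-delete branch is typed end to end.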
The bingo caller is a worked example
The BingWow caller is the most-trafficked surface in the product. It supports 30-ball, 75-ball, and 90-ball bingo with voice calls, a flashboard, auto-draw, manual draw, and printable number cards. Every layer of it is a worked example of the compound pattern:
- The voice that calls each ball is one of 331 pre-recorded MP3s. The choice to ship pre-recorded audio instead of synthesizing speech at call time is a code decision — Web Speech API drifts in pacing and pronunciation; recorded audio is identical every run.
- The flashboard renders 75 cells in a deterministic layout. The bingo-detection logic is TypeScript (lib/bingo-checker.ts); no LLM is asked whether a row is complete.
- The card-validation flow (proving that a 5-character card code corresponds to a winning board) is a single SQL query plus a deterministic reconstructCard function. No model is asked to validate; the math is the contract.
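A deterministic row-completion check in the spirit of lib/bingo-checker.ts might look like this (a sketch under assumed conventions, not BingWow's actual code):

```typescript
// 75-ball card as a 5x5 grid of numbers; 0 marks the free center square
// (an assumed convention for this sketch). A row is complete when every
// non-free cell has been called.

const FREE = 0;

function rowComplete(card: number[][], row: number, called: Set<number>): boolean {
  return card[row].every((n) => n === FREE || called.has(n));
}

function hasBingoRow(card: number[][], called: Set<number>): boolean {
  return card.some((_, r) => rowComplete(card, r, called));
}
```

This is the whole point of keeping detection in code: the function is pure, runs in microseconds, and gives the same answer every run.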
If any of those layers were handed to an LLM with the framing "you are an autonomous bingo agent," the product would be slower, more expensive, and less reliable on every dimension. The product is faster, cheaper, and more reliable because each layer is bounded code calling a bounded model only where the model adds real value.
What this does not buy
A compound system does not replace the engineer who writes the orchestration code. The shape of that engineer's day changes: instead of prompt-tuning a single multi-step plan, they are writing TypeScript that calls bounded models and writing tests that pin the boundary. That work is not glamorous; it does not show up in any vendor's pitch deck. There is no margin in selling code that calls Python functions.
If your team has the resources to staff one senior engineer plus a continuous evaluation discipline, you can ship a compound system today. The model choices in this post are deliberate and replaceable — a year from now the right routing might be different — but the architecture is durable.
Receipts
Every claim in this post is grounded in BingWow's actual production stack:
- bingwow.com/research — the open-licensed engagement research that the system is built on; the SSRN abstract IDs are linked from each report.
- State of Team Building Games 2026 — the most-recent research output; the same compound stack that runs the product also ran the dataset analysis behind this report.
- bingwow.com/caller — the bingo caller surface described above; click through for the live product.
- The AI Journal — "I Built an AI Agent for $310. It Failed for the Same Reason Yours Will." — the companion editorial that tells the story of the failed agent and the data on why agents fail at the rate they do.
- Berkeley AI Research — The Shift from Models to Compound AI Systems — the blog post that named the pattern, by Zaharia, Khattab, Chen and colleagues.
Pick the architecture. Don't pick the marketing label. The compound AI system is the architecture nobody is marketing — and that is the point.