Inside BingWow's Compound AI Stack: Which Model Owns Which Decision (and Why None of Them Own Control Flow)
The Berkeley AI Research blog called it the Compound AI System in February 2024. The thesis is one design choice: either control flow is written in traditional code that calls LLMs at specific bounded steps, or control flow is driven by an LLM that decides what to do next. Compound systems pick the first. Agents pick the second.
I wrote in The AI Journal about which choice fails more often and why. This post is the implementation detail of the working alternative: which model owns which decision inside BingWow's stack, and why none of them owns control flow.
The six models
BingWow ships AI-generated bingo cards. The product is free, ad-free, and used by classrooms and HR teams. Six models from four vendors handle different parts of the pipeline:
- Claude Sonnet 4.5 — content quality judgment, generation. Decides whether a card draft meets the publishability bar; produces new card titles when an existing one needs improvement.
- Claude Haiku 4.5 — classification: moderation, dedup, categorization. Routed to Haiku because Haiku is one-fifth the per-token cost of Sonnet and one-third that of Opus, and classification is a job a smaller model handles cleanly.
- Gemini 2.5 Flash — bingo-clue generation. Each clue is a 3-7 word phrase tied to a card topic. Flash is the throughput model for that bounded shape.
- Gemini 3 Flash Preview + Gemini 2.5 Pro fallback — themed display names. Every player who joins a multiplayer room gets a fresh themed name (a Halloween game produces Halloween names; a football game produces football names). Gemini-with-fallback is the right routing for a job that needs variety but cannot afford to fail.
- GPT-4o Vision — background image description. AI-generated backgrounds get a text caption for accessibility and search; GPT-4o is the strongest vision model for that specific job and is gated behind a single cron route.
- Replicate Flux Schnell — background image generation. Schnell is the fast tier of the Flux family; the cron generates an image, sends it back through GPT-4o Vision for the alt text, and stores both.
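The Gemini-with-fallback routing above is itself a code decision, not a model decision. A minimal sketch of that shape, assuming a hypothetical callGemini helper standing in for the real SDK call (the model IDs and names here are illustrative, not BingWow's code):

```typescript
// Try the preview model first; fall back to the stable model on any failure.
// callGemini is a hypothetical stand-in for a real SDK call. Here the preview
// model always throws so the fallback path is exercised.

type NameRequest = { theme: string; count: number };

async function callGemini(model: string, req: NameRequest): Promise<string[]> {
  if (model === "gemini-3-flash-preview") throw new Error("preview unavailable");
  return Array.from({ length: req.count }, (_, i) => `${req.theme}-player-${i + 1}`);
}

async function themedNames(req: NameRequest): Promise<string[]> {
  try {
    return await callGemini("gemini-3-flash-preview", req);
  } catch {
    // Code, not a model, decides what happens when the primary call fails.
    return await callGemini("gemini-2.5-pro", req);
  }
}
```

The try/catch is the entire "agent": the failure path is a branch in code, so it can be unit-tested like any other branch.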
None of these models decides what happens next. Code decides.
Where the code lives
Every transition between models is a TypeScript function, a SQL query, or a cron job. Here is the pipeline from the moment a visitor types a card topic to the moment that card is browsable on bingwow.com/cards:
- Visitor submits a topic. A Next.js API route validates the input, rate-limits the request via an anonymous rate key derived from the bingwow_anon_id cookie, and inserts a pending_topics row in Supabase.
- Cron picks up the topic. At 06:00 UTC, app/api/cron/process-pending-topics queries Supabase for unprocessed rows. The cron schedule is in vercel.json. No model decided when this should run.
- Gemini 2.5 Flash generates clues. The orchestrator (lib/process-one-topic.ts) calls Gemini with a structured-output schema. The output is 24-50 candidate clues in a typed JSON shape.
- SQL deduplicates. A normalized-form similarity query against existing cards in the same category rejects duplicates. Deterministic. No LLM involved.
- Claude Haiku 4.5 categorizes. The orchestrator hands Haiku the topic, the clue list, and a freshly queried list of valid category IDs. Haiku returns a single ID. The orchestrator validates the ID against the same DB read before accepting it.
- Claude Sonnet 4.5 makes the publishability call. Returns { publish: boolean }. The orchestrator routes accordingly: the publish path, or hard-delete (for AI-pipeline cards with no human owner).
- Replicate Flux Schnell generates a background. Same cron. The orchestrator runs up to four attempts with a Haiku-based text-detection gate (rejecting images that accidentally render English letters on the bingo board).
- GPT-4o Vision writes the alt text. Separate cron route. The image is now an asset; the description is metadata.
- Card status flips to published. A SQL UPDATE with a CHECK constraint that only allows the three legal statuses. The card now appears in the sitemap, in category archives, and in the IndexNow daily push.
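The pipeline above can be sketched as one plain TypeScript function. Everything in this sketch is a hypothetical stand-in (function names, the deps shape, the return values), not BingWow's actual modules; the point is the control flow: every branch is code, and every model call is a bounded, typed step.

```typescript
// Illustrative orchestrator shape: each model call is injected as a typed
// dependency, and code decides what happens between calls.

type Draft = { topic: string; clues: string[]; categoryId?: string };

async function processOneTopic(
  topic: string,
  deps: {
    generateClues: (t: string) => Promise<string[]>;               // Gemini 2.5 Flash
    isDuplicate: (clues: string[]) => Promise<boolean>;            // SQL similarity query
    categorize: (d: Draft, validIds: string[]) => Promise<string>; // Claude Haiku 4.5
    shouldPublish: (d: Draft) => Promise<{ publish: boolean }>;    // Claude Sonnet 4.5
    validCategoryIds: () => Promise<string[]>;
    publish: (d: Draft) => Promise<void>;
    hardDelete: (t: string) => Promise<void>;
  }
): Promise<"published" | "rejected"> {
  const clues = await deps.generateClues(topic);

  if (await deps.isDuplicate(clues)) {            // deterministic gate, no LLM
    await deps.hardDelete(topic);
    return "rejected";
  }

  const validIds = await deps.validCategoryIds(); // fresh DB read
  const draft: Draft = { topic, clues };
  const categoryId = await deps.categorize(draft, validIds);
  if (!validIds.includes(categoryId)) {           // validate the model's answer in code
    await deps.hardDelete(topic);
    return "rejected";
  }
  draft.categoryId = categoryId;

  const verdict = await deps.shouldPublish(draft);
  if (!verdict.publish) {
    await deps.hardDelete(topic);
    return "rejected";
  }

  await deps.publish(draft);
  return "published";
}
```

Because the models arrive as injected dependencies, the whole pipeline can be exercised in a test with stubs; no live API key is needed to pin the control flow.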
Eight steps. Each one a code decision. The models do bounded work between the decisions.
What this buys
Auditable cost. Every API call has a named caller in the codebase. When the April 2026 Anthropic bill spiked, I found the offender by grepping for claude-3-5-opus in lib/*.ts and replacing it with claude-haiku-4-5 in three files. The bill dropped from $560 a month to between $170 and $245. An agent keeps burning the same $560, because inside an agent the routing is the agent's decision, not a named caller you can grep for.
Auditable failure. When categorization started landing in the wrong subcategory in March, the bug was in a static fallback list in the moderation prompt — not in the model's judgment. The fix was to read the subcategory list from the categories table at request time and validate the AI's returned ID against the same DB read. The fix is in TypeScript, not in prompt engineering, because the failure was a code failure.
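The shape of that fix can be sketched in a few lines, assuming a hypothetical askModel callback and fetchCategoryIds DB read; the key move is that the prompt and the validation share the same read:

```typescript
// Sketch of "read the valid IDs at request time, validate the answer against
// the same read". Both names below are illustrative, not BingWow's code.

async function categorizeWithValidation(
  askModel: (validIds: string[]) => Promise<string>,
  fetchCategoryIds: () => Promise<string[]>
): Promise<string> {
  const validIds = await fetchCategoryIds(); // one read feeds prompt AND check
  const answer = (await askModel(validIds)).trim();
  if (!validIds.includes(answer)) {
    throw new Error(`model returned unknown category id: ${answer}`);
  }
  return answer;
}
```

A static fallback list in a prompt can silently rot; a shared DB read cannot drift apart from its own validation.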
Auditable evaluation. Tests exist for code. Every API route has a test fixture that calls it with a known input and asserts on the output shape. The research portal publishes monthly metrics on pipeline output (publish rate, dedup rejection rate, moderation flips). Drift on any axis triggers a code change, not a vibes-based prompt tweak.
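A minimal shape assertion in that spirit (the field name is taken from the publishability step; the function itself is hypothetical) fails loudly the moment a model response drifts from the contract:

```typescript
// Pin the output shape of the publishability call: exactly the kind of
// assertion a route-level test fixture can run against a known input.

type PublishVerdict = { publish: boolean };

function assertVerdictShape(value: unknown): asserts value is PublishVerdict {
  if (typeof value !== "object" || value === null) {
    throw new Error("verdict must be an object");
  }
  if (typeof (value as { publish?: unknown }).publish !== "boolean") {
    throw new Error("publish must be a boolean");
  }
}
```

After the assertion, TypeScript narrows the value to PublishVerdict, so the orchestrator's publish/hard-delete branch is typed end to end.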
The bingo caller is a worked example
The BingWow caller is the most-trafficked surface in the product. It supports 30-ball, 75-ball, and 90-ball bingo with voice calls, a flashboard, auto-draw, manual draw, and printable number cards. Every layer of it is a worked example of the compound pattern:
- The voice that calls each ball is one of 331 pre-recorded MP3s. The choice to ship pre-recorded audio instead of synthesizing speech at call time is a code decision — Web Speech API drifts in pacing and pronunciation; recorded audio is identical every run.
- The flashboard renders 75 cells in a deterministic layout. The bingo-detection logic is TypeScript (lib/bingo-checker.ts); no LLM is asked whether a row is complete.
- The card-validation flow (proving that a 5-character card code corresponds to a winning board) is a single SQL query plus a deterministic reconstructCard function. No model is asked to validate; the math is the contract.
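A deterministic row-completion check in the spirit of lib/bingo-checker.ts might look like this (a sketch under assumed conventions, not BingWow's actual code):

```typescript
// 75-ball card as a 5x5 grid of numbers; 0 marks the free center square
// (an assumed convention for this sketch). A row is complete when every
// non-free cell has been called.

const FREE = 0;

function rowComplete(card: number[][], row: number, called: Set<number>): boolean {
  return card[row].every((n) => n === FREE || called.has(n));
}

function hasBingoRow(card: number[][], called: Set<number>): boolean {
  return card.some((_, r) => rowComplete(card, r, called));
}
```

This is the whole point of keeping detection in code: the function is pure, runs in microseconds, and gives the same answer every run.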
If any of those layers were handed to an LLM with the framing "you are an autonomous bingo agent," the product would be slower, more expensive, and less reliable on every dimension. The product is faster, cheaper, and more reliable because each layer is bounded code calling a bounded model only where the model adds real value.
What this does not buy
A compound system does not replace the engineer who writes the orchestration code. The shape of that engineer's day changes: instead of prompt-tuning a single multi-step plan, they are writing TypeScript that calls bounded models and writing tests that pin the boundary. That work is not glamorous; it does not show up in any vendor's pitch deck. There is no margin in selling code that calls Python functions.
If your team has the resources to staff one senior engineer plus a continuous evaluation discipline, you can ship a compound system today. The model choices in this post are deliberate and replaceable — a year from now the right routing might be different — but the architecture is durable.
Receipts
Every claim in this post is grounded in BingWow's actual production stack:
- bingwow.com/research — the open-licensed engagement research that the system is built on; the SSRN abstract IDs are linked from each report.
- State of Team Building Games 2026 — the most-recent research output; the same compound stack that runs the product also ran the dataset analysis behind this report.
- bingwow.com/caller — the bingo caller surface described above; click through for the live product.
- The AI Journal — "I Built an AI Agent for $310. It Failed for the Same Reason Yours Will." — the companion editorial that tells the story of the failed agent and the data on why agents fail at the rate they do.
- Berkeley AI Research — The Shift from Models to Compound AI Systems — the blog post that named the pattern, by Zaharia, Khattab, Chen and colleagues.
Pick the architecture. Don't pick the marketing label. The compound AI system is the architecture nobody is marketing — and that is the point.