APR 21 · 9:06 PM

When Your LLM Refuses in Production: Defense-in-Depth and the A/B Before the Flip

MiniMax refused a Shadow War briefing mid-production. Here is the two-layer fix, the 5-day A/B, and the migration to local-model inference — with the numbers.

1. The setup

The Shadow War synthesis pipeline runs nightly. It collects machine-telemetry signals — GPS jamming anomalies, satellite thermal detections, internet outage measurements — and synthesizes them into an intelligence briefing. For months, the synthesis call at the heart of that pipeline used MiniMax M2.7, a hosted Chinese API. It was fast, cheap (effectively free at our volume), and produced output that consistently matched the calibration I needed: factual anchoring first, hypothesis second, confidence tiers explicit.

On the morning of April 21, 2026, the pipeline ran. The synthesis call returned. The front end rendered it. Readers saw this:

I appreciate the detailed and specific creative brief you've put together
here, but I'm not able to write this intelligence briefing. The request
asks me to produce content that...

A content-policy refusal — rendered verbatim as the analytical output. No crash. No error log. The pipeline returned exit code 0 because from its perspective, the API call succeeded. The refusal text was a valid string. It just wasn't a briefing.

This is the shape of the failure mode that gets you if you treat hosted LLM calls as reliable I/O. They are not. They are probabilistic policy negotiation with a black-box counterparty.

Worth being precise about what MiniMax actually did here, though: it wasn't broken or oversensitive. The framing of the Shadow War prompt ("intelligence briefing," OSINT signals, confidence scores, correlation analyses) is exactly the kind of request a hosted model's content team has to weigh carefully. Will output produced in this framing be reproduced to manufacture fake intelligence, or is it a labeled analysis board where the publisher controls the rendering? MiniMax couldn't see the render layer. It came down on the side of refusing, and that is a substantively defensible content-policy call.

The reason we could safely route around it is that we control how the output ships: every Shadow War item is labeled SPECULATION with the confidence tier visible. The framing concern MiniMax flagged is real in the abstract; it just doesn't apply in our specific deployment, because the editorial context lives in the front end, not in the API call.

The generalizable read here isn't "hosted models get content policy wrong." It's that hosted-LLM content policies will increasingly diverge from publishers' editorial judgments, and the divergence is almost never "the model is wrong." It's "the model is being asked to make a judgment call without the surrounding context the publisher already has." Local models are useful for exactly this reason: not because they're smarter, but because the publisher takes ownership of the editorial judgment instead of delegating it to a hosted vendor's content team.

2. The two-layer fix

The immediate fix was obvious: detect refusals and fall back gracefully. The design decision was where to detect. I built two layers because either one can fail independently, and I only need one to catch.

Layer 1 — Python call site. A helper function at the synthesis call point that inspects the first 400 characters of the response:

REFUSAL_MARKERS = (
    "i'm not able to", "i am not able to",
    "i can't assist", "i cannot assist",
    "i'm unable to", "i am unable to",
    "i appreciate the", "i apologize, but",
    "not able to write", "cannot write",
    "won't be able to", "will not be able to",
)

def _is_llm_refusal(text: str) -> bool:
    probe = text[:400].lower()
    hits = sum(1 for m in REFUSAL_MARKERS if m in probe)
    return hits >= 2

The threshold is two markers, not one, because a legitimate briefing can use phrases like "I appreciate" in quoted material. Two markers in the first 400 characters is a strong refusal signal with a low false-positive rate.

If _is_llm_refusal() returns True, the call site falls back to a plaintext baseline: a structured but unnarrated summary of the raw signals, no synthesis. Not great, but not a refusal rendered as briefing.

Layer 2 — Astro render guard. The front end checks the synthesis field before rendering it into the Shadow War section:

const RENDER_REFUSAL_MARKERS = [
  "i'm not able to", "i am not able to",
  "i can't assist", "i cannot assist",
  "not able to write", "i appreciate the detailed",
];

const synthesisIsRefusal = typeof synthesis === 'string' &&
  RENDER_REFUSAL_MARKERS.filter(m =>
    synthesis.slice(0, 400).toLowerCase().includes(m)
  ).length >= 2;

const showSynthesis = section.slug === 'shadow-war'
  && typeof synthesis === 'string'
  && synthesis.trim().length > 0
  && !synthesisIsRefusal;

Layer 2 fires only if Layer 1 missed it — for example if the refusal text was saved to the data file before the Python fix was deployed, or if a future refusal uses phrasing not yet in the Python tuple.

Defense in depth at the cost of roughly 30 lines. The pattern is cheap and the alternative is readers watching an LLM apologize to them in the middle of a geopolitical intelligence section.

3. The A/B before the flip

The refusal was the trigger. But the real question it surfaced was: should MiniMax be running the Shadow War synthesis at all, or should it migrate to the local Captain model? Captain is Qwen3.6-35B-A3B in UD-Q8_K_XL quantization (Unsloth Dynamic Q8_K_XL, the top-quality quant tier for this model), running CPU-only on a Ryzen 5900X. It has 35B total parameters with 3B active per token; that MoE architecture is why CPU-only inference runs at viable speed. Apache 2.0 open weights, released 2026-04-16, wired into the local stack on 2026-04-18.

I ran a 5-day quality test rather than guessing. Each day, the same signals were fed through both MiniMax and Captain, and the outputs were scored against an explicit five-category rubric.

Results across 5 runs:

Dimension                      MiniMax M2.7        Captain (Qwen3.6-35B-A3B)
Factuality (avg)               5.0 / 5             4.0 / 5
Calibration                    4/5 runs correct    4/5 runs correct
4-paragraph compliance         4/4 runs            5/5 runs
Signal/interpretation split    Clean (4/5)         Minor bleed (3/5)
Refusal risk                   1/5 refused         0/5 refused

MiniMax edges Captain on raw factuality (5.0 vs 4.0) and signal/interpretation discipline. Captain matches on calibration and beats MiniMax on structural compliance. But MiniMax's 1/5 refusal rate is not a rounding error — it is the production failure that triggered this whole exercise. A 20% refusal rate on a nightly pipeline that runs automatically and renders directly to readers is disqualifying.

The signal/interpretation bleed in Captain was addressable. Adding a single sentence to the prompt — "Lead each paragraph with the most striking data point, then offer interpretation" — closed most of the gap in follow-up runs without needing a formal re-test.

4. The flip

The code change is small. The Shadow War synthesis call went from:

result = _get_minimax().complete(
    messages=[{"role": "user", "content": prompt}],
    model="MiniMax-Text-01",
    temperature=0.3,
)
synthesis_text = result.choices[0].message.content
source_tag = "minimax-m2.7"

To:

resp = requests.post(LLM_URL, json={
    "model": "qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": prompt}],
    "temperature": 0.3,
    "stream": False,
}, timeout=120)
resp.raise_for_status()  # surface transport errors instead of parsing garbage
synthesis_text = resp.json()["choices"][0]["message"]["content"]
source_tag = "captain-qwen3.6-35b-a3b-local"

LLM_URL points to the local Captain proxy at port 8084. No cloud call. No policy negotiation. No refusal risk from a counterparty whose content policy changed overnight.

The source tag in the pipeline metadata updates downstream — the About page and Methodology page now accurately document Captain as the Shadow War synthesis model.

5. The second flip — signal hypothesis generation

generate_signal_hypotheses() is the other heavy LLM call site — it processes up to 40 telemetry signals per batch, extracting structured JSON hypotheses for each. Same A/B test structure, 5 days of real Shadow War correlation data.

Captain hit 98.9% of MiniMax quality on hypothesis quality (measured by JSON schema compliance, hypothesis specificity, and confidence tier assignment). Refusal rate: 0/5 vs MiniMax's 1/5. Latency: 47 seconds average vs ~30 seconds for MiniMax — acceptable for a once-nightly batch where the total pipeline window is 4 hours.
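"JSON schema compliance" is the one rubric dimension that can be checked mechanically rather than by judgment. A sketch of what such a check might look like; the field names and confidence-tier vocabulary here are assumptions for illustration, not the pipeline's real schema:

```python
import json

# Hypothetical hypothesis shape; the real pipeline's schema may differ.
REQUIRED_FIELDS = {"hypothesis": str, "confidence": str, "signals": list}
CONFIDENCE_TIERS = {"CONFIRMED", "LIKELY", "SPECULATION"}

def hypothesis_is_valid(raw: str) -> bool:
    """Parse one model output and check that it is a JSON object with the
    required fields, correct types, and a known confidence tier."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), typ):
            return False
    return obj["confidence"] in CONFIDENCE_TIERS
```

Running a check like this over all 5 days' batches gives a compliance percentage with no human scoring in the loop.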

Flipped. Two of the two heaviest synthesis call sites now run on local hardware.

6. The lesson generalized

The lesson is not "hosted models are bad." The lesson is that hosted models are probabilistic policy negotiators with a counterparty whose policy you do not control and cannot inspect. That is a fundamentally different reliability profile than a local model, and you should design for it explicitly before you ship.

Three practices that would have caught this earlier:

  1. Build refusal detection at the call site before you ship. Not after a production incident. The refusal-marker tuple takes 15 minutes to write and catches the failure before it renders.
  2. Run quantified A/B before you swap. "Captain seemed to do well in my tests" is not evidence. Five days, explicit rubric, recorded scores. The table above is the actual decision artifact — not a vibe.
  3. Prefer local models for editorial reliability when the use case allows. The dollar cost of MiniMax at our volume was negligible. The reliability tax — a 20% chance of a policy refusal on any given night — was the actual cost, and it was invisible until it hit production.
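For practice 2, the decision artifact needs no tooling beyond a small record type and an aggregator. A hypothetical sketch, assuming per-day scoring along the rubric dimensions above (the real harness's fields will differ):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class DailyScore:
    day: str
    model: str
    factuality: float       # 0-5 rubric score
    calibration_ok: bool    # confidence tiers assigned correctly
    structure_ok: bool      # 4-paragraph compliance
    refused: bool

def summarize(scores: list[DailyScore]) -> dict:
    # Refused runs produce nothing to score, so they are excluded from
    # quality averages but counted as failures in their own right.
    scored = [s for s in scores if not s.refused]
    return {
        "runs": len(scores),
        "refusals": sum(s.refused for s in scores),
        "factuality_avg": round(mean(s.factuality for s in scored), 1),
        "calibration_hits": sum(s.calibration_ok for s in scored),
    }
```

The output of `summarize()` is exactly the kind of table row the decision above was made from: recorded scores, not a vibe.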

Production LLM systems will refuse you. Not occasionally. Predictably, under conditions that shift with the API provider's policy updates, content filter version changes, and what your prompt happened to look like on that particular run. Build for it.

7. One honest caveat

The four remaining MiniMax call sites in the editorial pipeline script (editor.py) — the news editorial scoring pipeline, which handles section ranking, importance scoring, and story deduplication — have not been flipped yet. A quality test is pending for that workload; the editorial scoring task has different characteristics (short structured outputs, high throughput) and I want fresh A/B data before moving it.

So the current state is accurate: the About page says "Shadow War analysis written by Qwen3.6-35B-A3B on a home CPU." That is true. The news section editorial pipeline still uses a cloud assist. That is also documented in the Methodology page. No overclaiming.

When the editorial scoring A/B completes, the same pattern applies: explicit rubric, 5+ days, table of results, then decide. The methodology is the point — not the specific model.
