The release gate that found three latent bugs

2026-05-22

We were about to publish @gopigeon/mcp@0.4.0, a small version bump. Four new MCP tools and some owned-mode branching, all behind one npm publish. The last step before the irreversible git push --tags is a release gate: one curl against a health endpoint that has to come back green.

It came back as a 404. A SvelteKit 404 page, specifically — HTML, not JSON. Three hours and three bugs later we tagged v0.4.0, and the backend underneath that “small version bump” was meaningfully better than when we started.

Publishing a small release should be boring. This one wasn’t, and the un-boring parts are the interesting parts.

The publish gate

Here’s the ritual. The whole v1.4 surface — four new tools plus owned-mode branching — ships in a single npm publish of @gopigeon/mcp. One package version covers everything. Before the irreversible step, git push --tags, there’s a gate: hit https://api.gopigeon.dev/healthz/self-test against production, slice .checks[21:27] for the six docs-drift prongs, and confirm every one reports ok: true.

It is a deliberately small ritual. The entire idea is that by the time you tag, everything that ships has been exercised against prod. One curl, six green checks, tag. That’s the design.
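
As a rough sketch of what the gate asserts, here is the check written over an already-decoded response. The data shape is an assumption: the post only says the response has a checks array whose entries carry ok, so the plist layout, the :ok and :name keys, and the function name docs-drift-gate-green-p are hypothetical. JSON decoding is deliberately left out.

```lisp
;; Hypothetical sketch: CHECKS is assumed to be the decoded .checks array,
;; represented as a list of plists such as (:name "docs-drift-1" :ok t).
(defun docs-drift-gate-green-p (checks &key (start 21) (end 27))
  "Return T when every check in the half-open range [START, END) has :ok T.
A response with fewer than END checks fails the gate instead of passing
on a short slice."
  (and (>= (length checks) end)
       (every (lambda (check) (eq (getf check :ok) t))
              (subseq checks start end))))
```

The half-open range mirrors the .checks[21:27] slice: indices 21 through 26, six checks. Treating a short response as a failure is a design choice in this sketch; a gate that passes because the slice came back empty is not much of a gate.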

The curl came back 404.

Bug 1 — the hostname (the easy one)

curl https://gopigeon.dev/healthz/self-test returned a SvelteKit 404 page.

The reason is dumb and immediate: the backend isn’t at gopigeon.dev. That hostname is the frontend — the marketing site and dashboard, a SvelteKit app on Cloudflare. The backend lives at api.gopigeon.dev. The RUNBOOK had the wrong hostname written into the gate step.

A thirty-second fix. But notice what kind of bug this is. It is only visible the first time the gate runs against prod — and we had never run it, because the gate itself was brand new. We wrote the gate, wrote the RUNBOOK step, and the first time anyone executed it for real was this release.

Lesson. Release rituals only catch what you have actually rehearsed. If a gate has never been run in anger, the gate itself probably has bugs.

Bug 2 — the pod crashed mid-selftest (the medium one)

We re-aimed the curl at api.gopigeon.dev/healthz/self-test. This time it didn’t 404; it hung, then failed. The pod logs:

amqp declared — queue=q_selftest_49c0f87c
wal-checkpoint thread exiting
monthly-reset thread exiting
stream closed: EOF

Those last three lines are a pod dying. The Kubernetes liveness probe was timing out while the selftest ran, and Kubernetes did what it does when a liveness probe keeps failing: it SIGTERM-ed the pod out from under us, mid-selftest.

Why was the selftest slow enough to trip the probe? It was blocked. amqp-basic-get was stuck in a C-level recv() inside cl-rabbit, the Lisp binding to the RabbitMQ C client. We had a timeout around it — sb-ext:with-timeout — and the timeout did nothing.

This is a classic FFI-meets-Lisp-signals gotcha, and it earns its own paragraph. SBCL’s with-timeout is built on signals: it schedules an interrupt and unwinds the stack when the interrupt fires. That works beautifully for Lisp code. It does not work for a thread parked inside a blocking C call — especially one in a library that retries on EINTR. When the signal arrives, the C library catches the EINTR, shrugs, and calls recv() again. with-timeout is a Lisp-layer construct; the block is a kernel-layer condition. They never meet. Your timeout is decoration.
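
For contrast, a minimal SBCL-only sketch of the half that does work: when the body is ordinary Lisp, the interrupt can unwind it. The helper name run-with-deadline is hypothetical, and nothing here touches cl-rabbit, so the wedged recv() case itself is not reproduced.

```lisp
;; SBCL-specific: SB-EXT:WITH-TIMEOUT signals SB-EXT:TIMEOUT when time runs out.
(defun run-with-deadline (seconds thunk)
  "Call THUNK and return its value, or :TIMED-OUT if it runs past SECONDS.
This only helps when THUNK is interruptible Lisp code."
  (handler-case (sb-ext:with-timeout seconds (funcall thunk))
    (sb-ext:timeout () :timed-out)))

;; A Lisp-level SLEEP is interruptible, so this gives up after about 0.2 s:
;; (run-with-deadline 0.2 (lambda () (sleep 5) :finished))
```

Swap the sleep for a foreign call that blocks in recv() and retries on EINTR, and the same wrapper would be expected to wait for as long as the C library does.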

The real kernel-level fix is SO_RCVTIMEO on the socket — but plumbing that through cl-rabbit is genuine work. The pragmatic fix is the one we already used everywhere else: don’t depend on the timeout, depend on the retry.

Because here is the actual root cause. We have a wrapper, %amqp-call-with-retry, that handles wedged channels — when a channel gets into a bad state, the wrapper tears it down and retries on a fresh one. amqp-publish used it. amqp-basic-get used it. But the selftest’s amqp-declare-queue-pair — the function that declares the exchange and queue the selftest needs — didn’t. One missing wrapper. A channel wedged by a prior operation made declare fail permanently, while every call site around it recovered from the same wedge state and survived.

The fix is one wrapper:

- (handler-case
-     (progn
-       (cl-rabbit:exchange-declare ...)
-       (cl-rabbit:queue-declare ...))
+ (handler-case
+     (%amqp-call-with-retry
+      "queue-declare-pair"
+      (lambda ()
+        (cl-rabbit:exchange-declare ...)
+        (cl-rabbit:queue-declare ...)))
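
The real %amqp-call-with-retry is not shown in this post. As a hypothetical sketch of the shape such a wrapper might take, using the same label-plus-thunk calling convention as the diff above: the name call-with-retry and the :max-attempts and :reset arguments are illustrative, and a real version would tear down and reopen the AMQP channel in the reset step.

```lisp
;; Hypothetical sketch, not the production wrapper.
(defun call-with-retry (label thunk &key (max-attempts 3) (reset (lambda ())))
  "Call THUNK. On an ERROR, log it, call RESET (for example: discard the
wedged channel and open a fresh one), and try again, up to MAX-ATTEMPTS
calls in total. The final failure is re-signalled to the caller."
  (loop for attempt from 1
        do (handler-case (return (funcall thunk))
             (error (e)
               (when (>= attempt max-attempts)
                 (error e))
               (format *error-output* "~&[retry] ~a: attempt ~d failed: ~a~%"
                       label attempt e)
               (funcall reset)))))
```

The point of routing every AMQP call through one function like this is that recovery from a wedged channel lives in one place. The cost is that a call site that bypasses it fails in a way none of its neighbors do.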

While we were in there: the broker had ten zombie q_selftest_* queues, accumulated from earlier crashed selftest runs. Not the cause of the hang — but a smell, so we cleaned them up and gave the selftest a teardown step.

Lesson. Defensive patterns drift. When you write a retry wrapper and apply it to three call sites, the fourth one — the one you add six months later — won't have it. The fourth call site is where the bug lives.

Bug 3 — the handler was silently 500-ing (the deep one)

Selftest’s declare fixed. The gate got further — and then we tried the thing the gate is ultimately a proxy for: publishing a real test message through the public HTTP API.

500. A generic Hunchentoot “An error has occurred” page. No detail.

kubectl logs --tail=50 showed the boot sequence, the queue-create, and then… nothing. The handler error wasn’t in the logs. It wasn’t anywhere.

The reason is a configuration decision that was, in isolation, reasonable. Hunchentoot has *log-lisp-errors-p*, *show-lisp-errors-p*, and a message-log-destination. In production, *log-lisp-errors-p* and message-log-destination were both NIL. The dev-mode setup turned logging on; the prod setup didn’t.

There’s a comment in the code explaining the intent:

In dev mode, surface request-handler errors to stdout so 500s don’t vanish silently; in prod, keep them suppressed (logged only via log-error).

The thinking is sound — you don’t want raw backtraces leaking into HTTP responses in production. But read it again. We turned off leaking backtraces to clients. We also, in the same stroke, turned off logging them for operators. We optimized for “don’t leak to clients” and accidentally also bought “don’t tell anyone.” Every 500 the backend had ever served had vanished without a trace.

One commit fixed the config — *log-lisp-errors-p* on, *show-lisp-errors-p* dev-only, message-log-destination pointed at *standard-output* — and the next 500 finally said something:
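
A sketch of what that split can look like, assuming Hunchentoot is already loaded. The two special variables and the message-log-destination accessor are Hunchentoot's own; the function name configure-error-visibility and the dev-mode flag are hypothetical, and this is not the actual commit.

```lisp
;; Assumes Hunchentoot is loaded, e.g. via (ql:quickload :hunchentoot).
(defun configure-error-visibility (acceptor &key dev-mode)
  "Log handler errors for operators in every environment; show error
details in HTTP responses only when DEV-MODE is true."
  ;; Operators: always log Lisp errors raised inside request handlers.
  (setf hunchentoot:*log-lisp-errors-p* t)
  ;; Clients: only dev responses may include the error text.
  (setf hunchentoot:*show-lisp-errors-p* (and dev-mode t))
  ;; Send the message log to stdout so container logs pick it up.
  (setf (hunchentoot:acceptor-message-log-destination acceptor)
        *standard-output*)
  acceptor)
```

Keeping the two decisions on separate lines is the point: what the client sees and what the operator sees are set independently, so tightening one cannot silently turn off the other.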

The value "string" is not of type LIST
0: (JONATHAN.UTIL:MY-PLIST-P ("event" . "string"))
1: jonathan.encode:%to-json on (("event" . "string") ("note" . "string"))
3: GOPIGEON:SCHEMA-HASH
4: GOPIGEON:BUILD-ENVELOPE
5: GOPIGEON:HANDLE-QUEUE-PUBLISH

Here is what that is. schema-hash walks a submitted payload, replaces each value with its type name — "string", "number" — and produces an alist describing the payload’s shape. Then, to turn that shape into a stable hash, it encoded the alist as JSON and hashed the resulting string.

Jonathan is our JSON library. Its encoder, handed an alist, tries to treat it as a plist. It validates each element — and an alist’s elements are dotted pairs, ("event" . "string"), not two-element lists. MY-PLIST-P says no. The encoder raises.

So every single /q/publish was returning a 500. Not some of them. All of them. Forever. The queue publish path — a headline feature of the v1.4 surface we were about to announce — had never once worked over HTTP, and we did not know, because the error that proved it had been swallowed before it reached a log.

The fix isn’t to coax Jonathan into encoding an alist. The fix is that JSON was the wrong tool. We only needed a deterministic, canonical string to hash. write-to-string produces exactly that, a printed representation of the structure that is stable as long as the printer settings are held fixed, with no encoder, no plist validation, and no dependency. One line.
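
A self-contained sketch of the idea, not the actual schema-hash. The helper names, the "other" fallback type, and the FNV-1a style hash are all stand-ins, since the post does not say which hash function is used. The printer is wrapped in with-standard-io-syntax so the string does not depend on whatever the printer variables happen to be bound to.

```lisp
;; Hypothetical sketch: payloads are assumed to be alists with string keys.
(defun shape-of (payload)
  "Replace each value in the PAYLOAD alist with the name of its type."
  (mapcar (lambda (pair)
            (cons (car pair)
                  (typecase (cdr pair)
                    (string "string")
                    (number "number")
                    (t "other"))))
          payload))

(defun canonical-string (shape)
  "Print SHAPE under the standard printer settings."
  (with-standard-io-syntax
    (write-to-string shape)))

(defun fnv1a-32 (string)
  "32-bit FNV-1a style hash over the character codes of STRING.
An illustrative stand-in for a real digest."
  (let ((hash 2166136261))
    (loop for ch across string
          do (setf hash (logand (* (logxor hash (char-code ch)) 16777619)
                                #xFFFFFFFF)))
    hash))

(defun schema-hash-sketch (payload)
  "Hash the shape of PAYLOAD, ignoring the concrete values."
  (fnv1a-32 (canonical-string (shape-of payload))))
```

Two payloads with the same keys and value types hash identically here, whatever the values are. Key order still matters in this sketch; sorting the shape by key before printing would remove that, if order-independence is wanted.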

Lesson. Silent error suppression in production is worse than the alternative. The fear of leaking backtraces to clients is legitimate — but the fix is to log them where operators can see them, not to swallow them entirely. A 500 with no observability is a strictly worse mystery than a 500 with information.

Three bugs, in a line

Each bug was only visible after the previous one was fixed.

Wrong hostname in the RUNBOOK — hidden behind the fact that the RUNBOOK had never been run.
Missing retry wrapper on amqp-declare-queue-pair — hidden behind the fact that every neighboring AMQP call site had the wrapper and survived the same wedge.
schema-hash crashing on JSON encoding — hidden behind production error suppression that had hidden every 500 we’d ever served.

Each of these had been latent for months. None had been caught by a test. All three would have shipped straight to the first real customer who tried to use queues.

What forced them out wasn’t clever testing or great observability. It was the release gate — a forcing function that exercised a production code path nobody had ever stressed. We didn’t find these bugs. The gate found them, because the only alternative to fixing them was not shipping.

What to take to your own backend

Run your release gates against the production hostname at least once before you need them. The first real run of a gate is when you discover the gate has bugs. Rehearse it.

Watch for defensive-pattern drift. If you have a retry wrapper, a sanitizer, an auth check that’s supposed to be on every call site — it isn’t. Write a lint rule, or at minimum a comment at the wrapper’s definition listing where it must be used. The call site you add later is the one that won’t have it.
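
A lint rule for this does not have to be elaborate. As a hypothetical sketch, the function below walks a quoted source form and reports guarded operators that are called outside the wrapper. The names unwrapped-calls, exchange-declare, queue-declare, and call-with-retry are placeholders, and the walk is purely syntactic (no macroexpansion), so it can report false positives, for example on a binding that shares a guarded name.

```lisp
;; Hypothetical sketch of a syntactic lint over quoted source forms.
(defun unwrapped-calls (form guarded wrapper)
  "Return, in source order, the operators from GUARDED that head a form
inside FORM without being nested under a call to WRAPPER."
  (let ((found '()))
    (labels ((walk (x inside)
               (when (consp x)
                 (let* ((op (car x))
                        (inside (or inside (eq op wrapper))))
                   (when (and (symbolp op) (not inside) (member op guarded))
                     (push op found))
                   ;; Visit every element; a dotted tail is ignored.
                   (loop for tail = x then (cdr tail)
                         while (consp tail)
                         do (walk (car tail) inside))))))
      (walk form nil))
    (nreverse found)))

;; (unwrapped-calls '(progn (exchange-declare) (queue-declare))
;;                  '(exchange-declare queue-declare)
;;                  'call-with-retry)
;; reports both operators; wrapping the body in (call-with-retry "label"
;; (lambda () ...)) makes the report empty.
```

Run over every top-level form in the AMQP module as part of CI, a check in this spirit would flag the call site that gets added later.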

Never swallow errors silently in production. “Don’t leak backtraces to clients” and “don’t log errors at all” are different decisions that are easy to make at once by accident. Log them where operators can see them. Diagnostics-to-operators beats zero-visibility every time.

FFI plus signal-based timeouts equals no timeout. If your language does timeouts with signals and your dependency is a blocking C library that retries on EINTR, your timeouts are decoration. Use the kernel-level mechanism, or design around the block with retries instead.

Release gates are unreasonably useful. Even a gate that is nothing more than “curl this URL, check the JSON” is exercising routing, auth, the database, the broker, and your logging all at once. If anything on that path is broken, the gate trips. That is the entire point of a gate, and it is worth far more than its tiny size suggests.

Worth it

By the time we tagged v0.4.0, the backend had a more robust AMQP declare path (every customer creating a queue benefits, not just the selftest), production error visibility (every future 500, not just this one), and a schema-hash that actually works (anyone using queues at all).

A “small version bump” took three hours to ship and left the whole system meaningfully better than it found it. We’ll take that trade every release.

If you want to see the surface all this was gating, the queue docs cover publish, pull, ack, and DLQ — and the error reference is, fittingly, in much better shape than it was a week ago.
