2026 OpenClaw Gateway Kubernetes Deployment & Acceptance Runbook: Version Pinning, Resource Quotas, bind=lan, port-forward, and Typical OOM/NotReady Rollback (FAQ + Remote Physical Mac Bare-Metal Contrast)
Platform and SRE teams shipping OpenClaw Gateway on Kubernetes often stall sign-off on image drift, misaligned probes and bind addresses, port-forward smoke tests that do not match real traffic, and whether to roll back or retune on OOM/NotReady. This article provides a scannable Kubernetes vs remote physical Mac matrix, a seven-step runbook, change-ticket-ready thresholds, and a symptom-based FAQ.
1. Introduction: why gateway acceptance on Kubernetes needs network and cgroup evidence
On bare metal, “the port is up” often equals a successful bind. On Kubernetes, the same log line still passes through Service endpoints, kube-proxy (or your CNI datapath), NetworkPolicy, and cgroup memory accounting. If OpenClaw Gateway keeps a 127.0.0.1 mental model from a laptop, you get false negatives: curl into the Pod works while traffic via Service fails, or readiness stays red while the process is alive.
This guide chains evidence you can sign: image digest and Helm values hash, requests/limits aligned with OOM events, bind=lan consistent with targetPort, and port-forward smoke tests cross-checked with in-cluster probes. If you are also evaluating macOS node placement for global latency, use this matrix as the parent template for “same version, two tracks” acceptance.
2. Three pain points: version drift, bind vs probes, quotas and noisy traffic
- Version drift and irreproducibility: Production uses `:latest` or tags without digests; two weeks later the same tag rebuilds with different behavior. Rollbacks cannot prove the old ReplicaSet matches the incident binary.
- Bind address, Service, and probes in conflict: The gateway listens on loopback while readiness hits the Pod IP; or `bind=lan` is correct but NetworkPolicy only allows the Ingress CIDR and kubelet probes are dropped, so NotReady coexists with a traffic blackout.
- Resource quotas and hidden cost: Missing requests let Pods schedule "successfully" until the node packs tight and delayed OOM follows; tiny limits kill the process at tool-call peaks with exit code 137 and little in the logs, making it hard to separate leaks from normal spikes without metrics.
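The first and third pain points can be made concrete in a Deployment fragment. A minimal sketch; the registry path, digest placeholder, and memory figures below are illustrative assumptions, not values from the OpenClaw project:

```yaml
# Hypothetical Deployment fragment: pin by digest, declare explicit resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-gateway
spec:
  template:
    spec:
      containers:
        - name: gateway
          # A digest, not a floating tag: the rollback target is provably this binary.
          image: registry.example.com/openclaw/gateway@sha256:<build-digest>
          resources:
            requests:
              memory: "1536Mi"   # near the observed P95 resident working set
              cpu: "500m"        # keeps the scheduler off already saturated nodes
            limits:
              memory: "2048Mi"   # headroom for tool-call and JSON-buffer spikes
```

With requests set, delayed node-level OOM becomes a visible scheduling constraint instead of a surprise two weeks after launch.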
3. Decision matrix: Kubernetes vs remote physical Mac bare metal
Align “who wins under which constraint” so launchd habits are not pasted verbatim into Pods.
| Dimension | Kubernetes (Deployment + Service) | Remote physical Mac / launchd |
|---|---|---|
| Listen bind | Use `bind=lan` (or `0.0.0.0`) so the Service can reach the Pod; loopback only for same-Pod sidecars | Often `127.0.0.1` behind nginx/Caddy terminating TLS |
| Version pinning | Image digest + chart version + values hash in the change record | Checksum + lockfiles + launchd plist version fields |
| Isolation | cgroup OOMKilled and CPU throttling are auditable | Unified memory and swap policy; watch memory pressure and thermal throttling |
| Ad-hoc acceptance | `kubectl port-forward` for smoke tests, not a substitute for in-cluster paths | Local curl or SSH tunnels; shorter path, fewer replica angles |
| Typical rollback | `kubectl rollout undo` or a pinned previous digest | Replace binary/image tag + `launchctl kickstart -k`; mind single-instance locks |
4. Seven-step runbook (bind=lan and port-forward)
- Pin versions: CI writes the image `repo@sha256:…`, the Helm chart version, and the `values.yaml` git SHA into the ticket; block production pipelines on floating tags.
- Declare resources: Set `requests.memory` near the P95 resident working set; `limits.memory` must cover tool calls and JSON buffers; CPU requests keep the scheduler off already saturated nodes.
- Align bind and ports: If traffic enters via Service/Ingress, configure `bind=lan` (or a documented dual-stack listen) and verify that `containerPort`, `targetPort`, and probe ports match.
- Configure probes: Readiness uses the same protocol/host/path tuple as real traffic; add `initialDelaySeconds` or a `startupProbe` for cold starts so skills loading does not flip the Pod to NotReady.
- port-forward smoke test: From an ops machine run `kubectl port-forward deploy/openclaw-gateway 18789:18789` (replace the port), complete a minimal health check and one tool call; repeat an in-cluster probe and record whether both paths agree.
- Observe and alert: Tie restart count, OOMKilled, readiness=false duration, 5xx rate, and gateway queue depth to one dashboard; keep 24h before/after change windows.
- Bare-metal contrast sign-off: On a remote physical Mac, repeat key health signals with the same digest using native install or Compose, and document deltas—see 2026 OpenClaw on Windows and Linux: PowerShell vs WSL2, Enterprise HTTPS Proxy, Node Pinning, Remote macOS Gateway Runbook for client-to-macOS gateway alignment.
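Steps 3 and 4 of the runbook above can be sketched in one manifest. This is a hedged illustration: the `--bind=lan` flag spelling, the `/healthz` path, and port 18789 are assumptions to adapt to your chart, and the container fragment belongs inside the Deployment's pod template:

```yaml
# Hypothetical container fragment (inside the Deployment pod template):
# align bind address, container port, and probe ports.
containers:
  - name: gateway
    args: ["--bind=lan"]         # assumed flag; must listen beyond loopback
    ports:
      - name: http
        containerPort: 18789     # must equal the Service targetPort
    startupProbe:                # absorbs cold start / skills loading
      httpGet:
        path: /healthz           # assumed health path
        port: http
      failureThreshold: 12
      periodSeconds: 5           # up to ~60s before readiness takes over
    readinessProbe:
      httpGet:
        path: /healthz           # same tuple that real traffic uses
        port: http
      periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: openclaw-gateway
spec:
  selector:
    app: openclaw-gateway
  ports:
    - port: 80
      targetPort: http           # resolves to containerPort 18789 by name
```

Naming the port and referencing it from both probes and the Service is what keeps `containerPort`, `targetPort`, and probe ports from drifting apart in later edits.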
5. Cite-ready thresholds and parameters
- Image pinning: Production tickets include digest and build id; rollbacks cross-check incident timestamps.
- Memory headroom: With observed tool-call spikes, keep limits at least ~25–40% above explainable P95 resident; prefer throttling before blind doubling.
- Probe startup: Gateways with >30s cold start should use startupProbe or ≥40–60s grace, aligned with OpenClaw workspace/skills load time.
- port-forward: Smoke only; sign-off SLOs must include in-cluster Service DNS and Ingress/TLS paths.
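The memory-headroom threshold above can be turned into a small review helper. A minimal sketch, assuming you already export the P95 resident set in MiB from your metrics stack; the function name and the 64 MiB rounding granularity are choices made here, not an OpenClaw convention:

```python
def recommended_memory_limit(p95_resident_mib: float, headroom: float = 0.30) -> int:
    """Derive a limits.memory candidate from the observed P95 resident working set.

    headroom: fraction above P95; the runbook suggests staying in the ~25-40% band
    rather than blindly doubling. Returns MiB rounded up to a 64 MiB increment so
    the value maps cleanly to a Kubernetes limit.
    """
    if not 0.25 <= headroom <= 0.40:
        raise ValueError("keep headroom in the ~25-40% band; retune, don't double")
    raw = p95_resident_mib * (1 + headroom)
    return int(-(-raw // 64) * 64)  # ceiling to the next 64 MiB

# Example: 1400 MiB P95 with 30% headroom -> 1820 MiB, rounded up to 1856 MiB
print(recommended_memory_limit(1400))
```

A change ticket can then record both the observed P95 and the derived limit, making the headroom decision auditable instead of folklore.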
6. FAQ: OOM, NotReady, and rollback
Does bind=lan widen exposure?
Listening on a non-loopback address inside the Pod does not equal public Internet exposure; surface area is defined by Service type, Ingress, NetworkPolicy, and egress policy. Audit “process bind” and “who can route to the Pod” separately.
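Auditing "who can route to the Pod" can be expressed as a NetworkPolicy. A sketch under stated assumptions: the ingress controller namespace label and port 18789 are placeholders, and whether kubelet probe traffic is subject to policy at all depends on your CNI, so verify probes still pass after applying anything like this:

```yaml
# Hypothetical NetworkPolicy: bind=lan inside the Pod, reachability narrowed here.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: openclaw-gateway-ingress
spec:
  podSelector:
    matchLabels:
      app: openclaw-gateway
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx  # assumed ingress namespace
      ports:
        - protocol: TCP
          port: 18789
```

This separation is the point of the FAQ answer: the process bind stays wide inside the Pod, while the routable surface is declared and reviewable in policy.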
First step on OOMKilled?
Read kubectl describe pod Last State, node memory pressure, and exit 137; correlate with concurrent tool calls and payload sizes to avoid mistaking spikes for leaks.
NotReady but logs say listening—what to fix first?
Check probe URL port/path, missing startupProbe, NetworkPolicy blocking kubelet sources, and whether HTTP routes mount only after ready.
How to document rollback for compliance?
Keep the previous digest and Helm revision; after kubectl rollout undo, run in-cluster health and minimal business handshake and attach evidence to the ticket.
7. Why aligning the same gateway version on Mac mini is easier
Kubernetes covers replicas, rolling updates, and quota audits; remote physical Mac nodes—often Mac mini M4—remain the practical default for signing, Screen Sharing, and Apple-ecosystem integration. Running the same digest under launchd before promoting the cluster image reduces late-stage bind and probe surprises.
On macOS, Unix tooling and SSH work out of the box; Apple Silicon unified memory keeps long-lived gateway processes stable versus many small PCs, and ~4W idle-class power makes 7×24 contrast tests affordable. Gatekeeper, SIP, and FileVault reduce malware risk versus typical Windows fleet images. If you want equivalent health signals on cluster and bare metal, a quiet, efficient Mac mini M4 is a strong reference node.
If you are ready to validate this runbook on real hardware, Mac mini M4 is a cost-effective standard contrast node to run in parallel with production clusters.
Use Mac mini as your off-cluster “golden” contrast node
Sign off the same gateway build on remote macOS first, then promote the Kubernetes image—fewer probe and bind incidents.