2026 OpenClaw Gateway Prometheus Metrics & Grafana Dashboards: Enable /metrics Scraping, Alert Thresholds & False-Positive Triage on Remote Physical Mac 7×24 (Copy-Paste Scrape, Panel Queries + FAQ)
Platform and SRE teams hosting OpenClaw Gateway on remote physical Macs often see the gateway look healthy over SSH while Prometheus reports up=0, or Grafana “night spikes” page everyone at 3am. This post gives a scrape-topology and alert-forgiveness decision matrix, a seven-step reproducible runbook, copy-paste scrape / PromQL / alert snippets, and an FAQ.
1. Introduction & metric name prefixes
This guide is for teams running OpenClaw on rented physical Macs (for example ZoneMac nodes). You need HTTP latency, error rates, and process health as queryable, auditable, alertable time series—not only ad-hoc curl over SSH.
Metric names vary by OpenClaw version and runtime (Node.js prom-client, OpenTelemetry, or built-in gateway stats). The PromQL below assumes typical HTTP server metrics such as `http_request_duration_seconds_*` and `http_requests_total` (or `*_bucket`/`_sum`/`_count`). Run `curl -s localhost:<port>/metrics | grep -E 'http_|process_'` on your host and replace names in queries to match reality.
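To do that alignment offline, you can inventory the metric families an endpoint advertises from a saved snapshot. The sketch below parses a stand-in exposition payload; the sample metrics are placeholders for illustration, not OpenClaw's actual output:

```shell
# Inventory metric families from a saved /metrics snapshot.
# In practice: curl -s http://127.0.0.1:<port>/metrics > metrics.txt
# The payload below is a stand-in for illustration.
cat > metrics.txt <<'EOF'
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{status="200"} 1027
# HELP http_request_duration_seconds Request latency histogram.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 980
# HELP process_cpu_seconds_total Total CPU time.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.5
EOF
# Print "family type" pairs so you can align PromQL with reality.
grep '^# TYPE' metrics.txt | awk '{print $3, $4}'
# → http_requests_total counter
#   http_request_duration_seconds histogram
#   process_cpu_seconds_total counter
```

Diffing this inventory before and after a gateway upgrade is also a cheap way to catch silently renamed metrics before dashboards go blank.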
If you are still hardening the install and process baseline, start with OpenClaw Installation Guide: Mac, Windows & Linux (2026); for Node tuning and file descriptors, see How to Run OpenClaw Efficiently on Mac: Environment & Performance.
2. Pain points
- Constraint: bind address vs. where Prometheus runs. If the gateway listens only on `127.0.0.1` but Prometheus is in a container or on a separate observability host, scrapes fail with `up=0` while local curl on the Mac still works.
- Hidden cost: high-cardinality labels and aggressive `scrape_interval`. Putting raw paths or API-key fragments into labels explodes TSDB cardinality; 1s scrapes over Wi‑Fi or tunnels amplify jitter into Grafana “hair” charts.
- Stability and false positives: macOS power saving and deploy windows. Display sleep, disk sleep, or a hot reload can spike p95 briefly; alerts with `for: 0s` will wake on-call for noise.
3. Scraping & alerts: decision matrix
Before go-live, align on “who scrapes from which network namespace” and whether alerts should respect business hours.
| Dimension | Local Prometheus / vmagent | Centralized remote Prometheus |
|---|---|---|
| Network path | Same machine or bridge as the gateway—easiest to use `127.0.0.1` | Gateway must listen on the LAN or expose read-only metrics via a reverse proxy / Tailscale |
| Operational load | Per-node config; upgrades touch many files | Rules and dashboards in one place; depends on stable reachability |
| Typical fit | Single-tenant, strong isolation, first observability loop | Multi-region pools with unified SLOs and alert routing |

| Alert style | Best for | Main risk |
|---|---|---|
| for: 2–5m + rate-based expr | HTTP 5xx, timeouts, elevated error ratio | Detects sustained outages a few minutes later (acceptable if SLA allows) |
| for: 0 + up probe | Process gone, port not accepting | Noisy during rolling restarts unless silenced or merged |
| Time-based silence | Known maintenance, backups, GC spikes | Forgotten silences hide real incidents |
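For the time-based-silence row, Alertmanager can mute a route during a recurring maintenance window instead of relying on ad-hoc silences someone must remember to expire. A minimal sketch, assuming a receiver named `oncall` (the names and times are placeholders to adapt):

```yaml
# alertmanager.yml fragment: recurring maintenance mute.
time_intervals:
  - name: nightly-maintenance
    time_intervals:
      - times:
          - start_time: "03:00"
            end_time: "03:30"
route:
  receiver: oncall
  routes:
    - matchers:
        - job="openclaw-gateway"
      mute_time_intervals:
        - nightly-maintenance
      receiver: oncall
```

Unlike a manual silence, a mute interval re-arms itself outside the window, so a real 4am incident still pages.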
4. Seven-step runbook (remote physical Mac)
1. Confirm endpoint & bind. Enable metrics in the gateway config; check `metrics_path` (often `/metrics`) and the listen IP. For remote scrapes, avoid `127.0.0.1`-only unless the scraper shares that network namespace.
2. Local smoke test. Run `curl -sS http://127.0.0.1:<port>/metrics | head -n 40` and verify HELP/TYPE lines for HTTP and process metrics.
3. Choose scraper placement. Single-node PoC: co-located Prometheus. Many nodes: vmagent remote-write or centralized Prometheus with firewall allowlists.
4. Author scrape configs (below). Use `scrape_interval: 15s`–`30s`, a sensible `scrape_timeout`, and stable `instance` relabeling.
5. Wire Grafana. Add the Prometheus datasource; build panels for global RPS, p95, 5xx ratio, `up`, and scrape duration.
6. Define alerts & routes. Use different `for` durations for error ratios vs. raw `up`; use Alertmanager silences or CI-driven maintenance flags during deploys.
7. Align macOS 7×24 settings. Prefer wired Ethernet, disable disk sleep, fix timezone & NTP. Separate permission and SecretsRef issues from metrics: when the scrape looks healthy but the app returns 500, check application logs before blaming Prometheus.
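Step 7 can be applied with stock macOS tooling. The commands below are a sketch to adapt, not a one-size-fits-all baseline; the values are examples, they require `sudo`, and `systemsetup` may additionally need Full Disk Access on recent macOS:

```shell
# Keep a headless Mac awake for 7x24 scraping (example values).
sudo pmset -a sleep 0          # never system-sleep
sudo pmset -a disksleep 0      # never sleep disks
sudo pmset -a displaysleep 10  # display may sleep; scraping is unaffected
# Keep clocks trustworthy so Grafana panels and local logs line up.
sudo systemsetup -settimezone "UTC"
sudo sntp -sS time.apple.com
```

Verify with `pmset -g` after a reboot; some MDM profiles silently reset power settings.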
5. Copy-paste scrape, PromQL & alerts
5.1 Prometheus scrape_configs snippet
Replace targets and port with your gateway listen address. Behind a reverse proxy, allow /metrics through and disable caching for that path.
```yaml
scrape_configs:
  - job_name: openclaw-gateway
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
    static_configs:
      - targets:
          - '127.0.0.1:18789'
        labels:
          env: prod
          role: gateway
    # Optional: copy the scrape address into a stable instance label.
    # relabel_configs:
    #   - source_labels: [__address__]
    #     target_label: instance
```
5.2 Grafana panel queries (replace metric names to match your host)
- Requests per second (RPS): `sum(rate(http_requests_total{job="openclaw-gateway"}[5m]))`
- p95 latency (histogram): `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="openclaw-gateway"}[5m])) by (le))`
- 5xx ratio (example): `sum(rate(http_requests_total{job="openclaw-gateway",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="openclaw-gateway"}[5m]))`
- Scrape health: `up{job="openclaw-gateway"}` and `scrape_duration_seconds{job="openclaw-gateway"}`
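If the 5xx ratio feeds both a panel and an alert, precomputing it as a recording rule keeps the two consistent and the queries cheap. A sketch, assuming the metric names above (the rule name follows the `level:metric:operations` convention and is ours, not a standard):

```yaml
groups:
  - name: openclaw_gateway_records
    rules:
      - record: job:openclaw_http_5xx:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="openclaw-gateway",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="openclaw-gateway"}[5m]))
```

Panels and alerts then query `job:openclaw_http_5xx:ratio_rate5m` instead of repeating the division.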
5.3 Alert rules (Prometheus rule_files)
```yaml
groups:
  - name: openclaw_gateway
    rules:
      - alert: GatewayScrapeDown
        expr: up{job="openclaw-gateway"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw metrics endpoint not reachable"
      - alert: GatewayHigh5xxRatio
        expr: sum(rate(http_requests_total{job="openclaw-gateway",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="openclaw-gateway"}[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "5xx ratio > 5% (adjust labels to your schema)"
```
If your metrics use `code` or a custom outcome label instead of `status`, rewrite the selectors accordingly—validate non-empty vectors in Grafana Explore before promoting them to alerts.
6. Typical false-positive triage
| Symptom | Suspect first | Mitigation |
|---|---|---|
| Intermittent `up=0` | Tunnel reconnect, Wi‑Fi power save, loopback-only bind | Use Ethernet, fix the bind address, lengthen `for` for tunnel-dependent jobs |
| Nighttime p95 spikes | Disk sleep, Time Machine, Spotlight | Disable disk sleep; move backups; require “two consecutive windows” in alert expr |
| 5xx alerts during deploy | Rolling restart, readiness not yet true | CI-triggered Alertmanager silence; split readiness vs. metrics scrape jobs |
| TSDB memory growth | Path or user id in labels | Reduce cardinality in app; metric_relabel_configs to drop labels |
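The cardinality row can also be enforced at scrape time when the application can't be changed quickly. This sketch drops a hypothetical high-cardinality `path` label and any series carrying an `api_key` label (both label names are examples, not OpenClaw's actual schema):

```yaml
scrape_configs:
  - job_name: openclaw-gateway
    static_configs:
      - targets: ['127.0.0.1:18789']
    metric_relabel_configs:
      # Strip a raw-path label from every series before ingestion.
      - action: labeldrop
        regex: path
      # Drop whole series that leak an API-key fragment into labels.
      - source_labels: [api_key]
        regex: .+
        action: drop
```

Scrape-time dropping protects the TSDB, but the right long-term fix is still removing the label at the source.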
7. Cite-ready thresholds
- `scrape_interval`: 15s in production (do not scrape faster than your gateway’s own tail latency without reason); 30s on intercontinental links.
- `scrape_timeout`: keep at or below roughly 2/3 of `scrape_interval`; start at 10s for tunneled paths.
- Alert `for`: availability-style rules from 2m; ratio rules from 5m, then tighten after aligning with SLO burn rates.
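Once an SLO exists, the `for`-based ratio rule can graduate to multiwindow burn-rate alerting, where a short and a long window must both be hot before paging. A sketch following the Google SRE workbook pattern, assuming a 99.9% availability SLO (error budget 0.001) and the metric names from section 5:

```yaml
groups:
  - name: openclaw_gateway_slo
    rules:
      - alert: GatewayErrorBudgetFastBurn
        # 14.4x burn rate over both 5m and 1h: pages only when the spike
        # is sharp AND sustained, which filters single-scrape noise.
        expr: |
          (
            sum(rate(http_requests_total{job="openclaw-gateway",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="openclaw-gateway"}[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="openclaw-gateway",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{job="openclaw-gateway"}[1h]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
```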
8. FAQ
Q: Should I put Basic Auth in front of /metrics?
A: Yes, if the path must cross an untrusted or multi-tenant network. Prefer private networks + mTLS or SSH tunnels and keep auth complexity at the edge.
Q: Does macOS timezone affect PromQL?
A: Prometheus stores UTC; pick the display timezone in Grafana. If you correlate with local-time logs, use one consistent timezone to avoid “alert before log line” confusion.
Q: Can I add node_exporter for host metrics?
A: Yes—add a separate job_name: node to correlate gateway 5xx with disk fullness or CPU saturation.
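A minimal companion job for that correlation might look like the fragment below; 9100 is node_exporter's default port, and the `role` label is an example:

```yaml
scrape_configs:
  - job_name: node
    scrape_interval: 15s
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          role: gateway-host
```

Sharing a label such as `role` between the gateway and host jobs makes side-by-side Grafana panels a one-line template variable.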
9. Wrap-up & node choice
Once OpenClaw Gateway HTTP and process metrics live in Prometheus and Grafana, failures on 24/7 physical nodes shift from “SSH and guess” to checking whether up, error rates, and p95 move together. The recurring themes are matching bind addresses with scraper network paths, using for on alerts, and stripping macOS power-saving and deploy windows out of SLO noise.
This “gateway plus observability stack” workflow fits macOS naturally: launchd, logs, and Unix tooling share one stack; Apple Silicon idles at very low power—ideal for always-on gateways. Versus small x86 boxes at similar price, you usually get better stability and power efficiency, while Gatekeeper, SIP, and FileVault harden credentials and tunnel endpoints—exactly what remote gateway operators care about.
If you want the scrape configs and dashboards from this post running on quiet, efficient, 24/7-capable hardware, Mac mini M4 remains one of the best value entry points in 2026—get a remote physical Mac through ZoneMac and fold gateway metrics into your production baseline in one pass.
Run OpenClaw and Prometheus end-to-end on real Mac hardware?
ZoneMac offers multi-region physical Macs for 24/7 gateways and observability rollouts—on-demand capacity with the same acceptance bar as the scrape configs in this guide.