SLO and availability

Use this runbook when an OpenBao synthetic probe, availability burn-rate, or synthetic latency alert fires. These alerts point to user-facing availability or latency symptoms from a selected probe path.

Before you begin

Get access to Prometheus or the metrics backend that evaluates the alert.
Get access to the OpenBao SLO and availability dashboard.
Get access to the OpenBao overview and Kubernetes platform dashboards.
Know which synthetic probe target, network location, and OpenBao endpoint the alert represents.
Know the approved availability SLO target for the affected environment.

Confirm the alert

Check which availability alert fired.

```
ALERTS{alertstate="firing", alertname=~"OpenBao(SyntheticProbeFailing|AvailabilityFastBurn|AvailabilitySlowBurn|SyntheticProbeLatencyElevated)"}
```

Open the OpenBao SLO and availability dashboard.

Confirm the selected synthetic probe target.

```
probe_success{job=~".*openbao.*"}
```

Confirm the probe duration.

```
probe_duration_seconds{job=~".*openbao.*"}
```

Investigate probe failure

Check whether every OpenBao probe target fails or only one target fails.

```
min_over_time(probe_success{job=~".*openbao.*"}[5m])
```

Check whether Prometheus can still scrape OpenBao metrics.

```
up{job=~"openbao.*"}
```

Check OpenBao cluster state.

```
openbao:core_active:sum
openbao:core_unsealed:sum
openbao:autopilot_healthy:max
```

Query the same health path from a comparable network location.

```
curl -fsS http://<openbao_address>/v1/sys/health
```

- <openbao_address>: The address that the affected synthetic probe uses. Run this check only when you can safely reach that address from the same network path.
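
The health endpoint reports node state through its HTTP status code, and curl -f exits non-zero for any status of 400 or above, so a failed command does not always mean the node is down. As a minimal sketch, the hypothetical Python helper below maps status codes to a state. It assumes the endpoint's default status codes for the active, standby, uninitialized, and sealed states, with no query-parameter overrides; verify them against the API documentation for your OpenBao version.

```python
# Hypothetical helper: map an HTTP status from /v1/sys/health to a node state.
# Assumption: the endpoint uses its default status codes (no overrides).
HEALTH_STATES = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    501: "not initialized",
    503: "sealed",
}


def describe_health(status_code: int) -> str:
    """Return a short description for a sys/health HTTP status code."""
    return HEALTH_STATES.get(status_code, f"unexpected status {status_code}")


def curl_f_would_fail(status_code: int) -> bool:
    """curl -f exits non-zero for HTTP status codes of 400 and above."""
    return status_code >= 400


if __name__ == "__main__":
    for code in (200, 429, 503):
        print(code, describe_health(code), curl_f_would_fail(code))
```

To see the status code itself, drop -f from the curl command and add -o /dev/null -w '%{http_code}'.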

Investigate error-budget burn

Check short-window burn.

```
(1 - avg_over_time(probe_success{job=~".*openbao.*"}[5m])) / 0.001
```

Check one-hour burn.

```
(1 - avg_over_time(probe_success{job=~".*openbao.*"}[1h])) / 0.001
```

Check six-hour burn.

```
(1 - avg_over_time(probe_success{job=~".*openbao.*"}[6h])) / 0.001
```

Confirm that the 99.9 percent target in the generated alert matches the approved SLO for the affected environment. If it does not match, treat the generated alert as a reference signal and use the local SLO policy for incident severity.
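
The divisor 0.001 in the burn-rate queries is the error budget for a 99.9 percent target (1 - 0.999). As a minimal sketch of the same arithmetic for other targets, the hypothetical Python helpers below compute a burn rate from a probe success ratio and estimate how long the budget would last at that rate. The function names and the 30-day SLO window are assumptions for illustration, not part of the generated alerts.

```python
def burn_rate(success_ratio: float, slo_target: float = 0.999) -> float:
    """Observed error ratio divided by the error budget (1 - slo_target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if not 0.0 <= success_ratio <= 1.0:
        raise ValueError("success_ratio must be between 0 and 1")
    return (1.0 - success_ratio) / (1.0 - slo_target)


def hours_until_budget_exhausted(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant.

    window_days is an assumed SLO window; use the window in your SLO policy.
    """
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    # avg_over_time(probe_success[1h]) of 0.99 against a 99.9 percent target
    rate = burn_rate(0.99, 0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"hours to exhaustion: {hours_until_budget_exhausted(rate):.0f}")
```

A success ratio of 0.99 against a 99.9 percent target burns budget about 10 times faster than the sustainable rate, which would spend an assumed 30-day budget in about 72 hours.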

Investigate synthetic latency

Compare current probe duration with the recent baseline.

```
avg_over_time(probe_duration_seconds{job=~".*openbao.*"}[5m])
avg_over_time(probe_duration_seconds{job=~".*openbao.*"}[1h])
```
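
As a rough illustration of this comparison, the sketch below divides the 5-minute average by the 1-hour average. The helper names and the default factor of 2 are hypothetical; choose a threshold from your own baseline.

```python
def latency_ratio(current_seconds: float, baseline_seconds: float) -> float:
    """Ratio of the 5-minute average to the 1-hour average probe duration."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return current_seconds / baseline_seconds


def is_elevated(current_seconds: float, baseline_seconds: float,
                factor: float = 2.0) -> bool:
    """True when the current duration is at least `factor` times the baseline.

    The default factor is an illustrative assumption, not an alert threshold.
    """
    return latency_ratio(current_seconds, baseline_seconds) >= factor
```

The 1-hour window includes the most recent 5 minutes, so a sustained spike also raises the baseline and shrinks the ratio over time.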

Compare with OpenBao request latency.

```
openbao:core_handle_request:avg5m
openbao:core_handle_login_request:avg5m
openbao:core_check_token:avg5m
```

Check storage, audit, and runtime context.

```
openbao:barrier_get:avg5m
openbao:audit_log_request:avg5m
openbao:runtime_sys_bytes:max
```

Search operational logs for correlated symptoms.

```
{log_stream="openbao.operational"} |~ "(?i)(timeout|unavailable|latency|slow|error|failed|connection refused)"
```

Restore the baseline

If all probes fail and OpenBao metrics are also unavailable, use the metrics scrape and cluster-health runbooks before you tune SLO alerts.
If probes fail but OpenBao metrics and health are normal, investigate the load balancer, DNS, TLS, route, NetworkPolicy, proxy, and probe location.
If probe latency correlates with OpenBao request latency, investigate storage, audit, auth backend, token checks, runtime, and Raft health.
If only one probe location fails, route the incident through the owner of that network path or region while continuing to watch the global budget.
If the alert target does not represent the approved SLO path, update the synthetic probe contract and alert selector after the incident.
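
The conditions above can overlap, so more than one may apply at the same time. As a minimal sketch, the hypothetical Python function below returns every matching recommendation for a set of observations. The field names are assumptions for illustration, and the post-incident step about the probe contract is left out because it does not depend on live signals.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Observations gathered in the investigation steps (hypothetical names)."""
    failing_targets: int       # probe targets whose probe_success is 0
    total_targets: int         # probe targets that report probe_success
    metrics_up: bool           # Prometheus can still scrape OpenBao metrics
    health_normal: bool        # cluster state and the health path look normal
    latency_correlates: bool   # probe latency tracks OpenBao request latency


def next_steps(s: Signals) -> list[str]:
    """Return every recommendation from the list above that applies."""
    steps = []
    all_failing = s.total_targets > 0 and s.failing_targets == s.total_targets
    if all_failing and not s.metrics_up:
        steps.append("Use the metrics scrape and cluster-health runbooks first.")
    if s.failing_targets > 0 and s.metrics_up and s.health_normal:
        steps.append("Investigate load balancer, DNS, TLS, route, "
                     "NetworkPolicy, proxy, and probe location.")
    if s.latency_correlates:
        steps.append("Investigate storage, audit, auth backend, token checks, "
                     "runtime, and Raft health.")
    if s.failing_targets == 1 and s.total_targets > 1:
        steps.append("Route the incident to the owner of that network path "
                     "or region and keep watching the global budget.")
    return steps
```

For example, one failing target out of three with healthy metrics and cluster state matches both the network-path investigation and the single-location routing step.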

Verify the result

Confirm probes return to success.

```
min_over_time(probe_success{job=~".*openbao.*"}[5m])
```

Confirm short-window burn returns toward zero.

```
(1 - avg_over_time(probe_success{job=~".*openbao.*"}[5m])) / 0.001
```

Confirm probe duration returns toward baseline.

```
avg_over_time(probe_duration_seconds{job=~".*openbao.*"}[5m])
```

Confirm OpenBao internal latency is stable.

```
openbao:core_handle_request:avg5m
openbao:core_handle_login_request:avg5m
openbao:core_check_token:avg5m
```

Wait for the alert window to pass and confirm the alert resolves.

Troubleshooting

Synthetic probe metrics are empty

Confirm that the synthetic probe job exists and that Prometheus scrapes it. The generated dashboard and alerts expect probe_success and probe_duration_seconds.

The probe succeeds but users still report failures

The probe path may be too narrow, or the probe location may not match the user network path. Add probes for the affected route, region, or load balancer path, but keep the label set small and fixed so that the new probes do not create high-cardinality series.

Burn-rate alerts do not match local SLO policy

The generated alerts assume a 99.9 percent availability target, which is why the burn-rate queries divide by 0.001. Update the alert contract, including that divisor, when your approved target is different.

Probe latency is high but OpenBao latency is normal

Investigate network, DNS, TLS, load balancer, proxy, and probe-location dependencies before changing OpenBao.

What’s next

Use the OpenBao SLO and availability dashboard to inspect availability, burn rate, and latency context.
Use the OpenBao overview dashboard for cluster, request, audit, and runtime context.
Use the OpenBao Kubernetes platform dashboard when the symptom may come from Kubernetes, PVCs, nodes, or collectors.
Use the OpenBao metrics scrape failing runbook when the observability path is also degraded.
Use the Synthetic probe example to configure the optional probe metric surface.

Source: Prometheus documents blackbox-style multi-target probes with probe_success and probe_duration_seconds in the Prometheus multi-target exporter guide. Prometheus documents alerting rules and the "for" clause in the Prometheus alerting rules documentation.