Observability for operator and workload
OpenBao Operator has two observability layers: the operator control plane itself, and the OpenBao workload it renders. Use this page to wire both layers into your monitoring stack, choose the scrape model your platform already supports, and focus on the signals that matter for upgrades, backups, and recovery.
Decision matrix
Observe the right surface
| Surface | Configure it through | Use it for | Watch for |
|---|---|---|---|
| Operator metrics | Helm or Kustomize settings on the operator installation | Controller and provisioner reconcile health, errors, and platform-level backup or upgrade counters. | The endpoint is HTTPS and RBAC-protected. Your scraper needs both network reachability and permission to GET /metrics. |
| OpenBao workload telemetry | The spec.observability.metrics block on each OpenBaoCluster, plus optional spec.telemetry overrides. | Application-level metrics from the OpenBao Pods themselves. | Configure this separately from operator metrics. Enabling one layer does not configure the other. |
| Logs and health probes | Operator install values such as log level and health probe settings. | Fast incident triage when the issue is not obvious from metrics alone. | Use debug logging intentionally and temporarily, then return to the normal log level. |
| Dashboards and alerts | Grafana assets under config/grafana/ and your own Prometheus or Alertmanager rules. | A small, repeatable operator cockpit for upgrades, backups, and cluster readiness. | Use dashboards for context and alerts for time-sensitive failures. |
Wire operator metrics
- Prometheus Operator
- VictoriaMetrics Operator
- Plain Prometheus
Configure
Enable operator metrics with ServiceMonitor resources
metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: prometheus-k8s
        namespace: monitoring
  serviceMonitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true
This is the cleanest path when Prometheus Operator is already your cluster standard. It creates ServiceMonitors for the controller and, in multi-tenant mode, the provisioner metrics Services.
Configure
Enable operator metrics with VMServiceScrape resources
metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: vmagent
        namespace: monitoring
  victoriaMetrics:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true
Use this when VictoriaMetrics is your standard scrape controller. The same HTTPS and RBAC constraints still apply to the metrics endpoint.
Configure
Scrape the operator metrics services directly
scrape_configs:
  - job_name: openbao-operator-controller
    scheme: https
    metrics_path: /metrics
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - <controller-metrics-service>.<operator-namespace>.svc:8443
  - job_name: openbao-operator-provisioner
    scheme: https
    metrics_path: /metrics
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - <provisioner-metrics-service>.<operator-namespace>.svc:8443
Use this path when you do not run a scrape operator. Grant the Prometheus ServiceAccount permission to GET /metrics, and keep the TLS assumptions explicit in the scrape configuration.
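If the operator installation does not already bind your scraper (for example through the metrics.rbac.subjects values used in the operator-based options), the permission to GET /metrics can be granted with standard Kubernetes RBAC. The following is a minimal sketch, not the manifest the operator ships: the object names, the prometheus ServiceAccount, and the monitoring namespace are all assumptions to adapt to your cluster.

```yaml
# Sketch only: every name below is a placeholder, not a value shipped by the operator.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openbao-operator-metrics-reader      # hypothetical name
rules:
  # /metrics is a non-resource URL, so it is granted through nonResourceURLs.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-openbao-operator-metrics  # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: openbao-operator-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus       # assumed ServiceAccount used by your Prometheus Pods
    namespace: monitoring  # assumed namespace
```

The bearer token that Prometheus presents in the scrape configuration must belong to the ServiceAccount named in the binding; otherwise the scrape will be rejected even though the endpoint is reachable.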
If operator network policy is enabled, the monitoring namespace must carry the labels expected by networkPolicy.metricsAllowedNamespaceLabels so the scraper can actually reach the HTTPS metrics service.
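As an illustration, labeling the monitoring namespace might look like the sketch below. The label key and value here are hypothetical; use whatever pairs your installation sets in networkPolicy.metricsAllowedNamespaceLabels.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    # Hypothetical label. Replace it with the key/value pairs configured in
    # networkPolicy.metricsAllowedNamespaceLabels for your operator installation.
    example.com/openbao-metrics-scraper: "true"
```

If scrapes time out rather than return an authorization error, a missing or mismatched namespace label is a likely cause to check before revisiting RBAC.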
Enable OpenBao workload telemetry deliberately
Configure
Turn on workload telemetry for an OpenBaoCluster
apiVersion: openbao.org/v1alpha1
kind: OpenBaoCluster
metadata:
  name: prod-cluster
spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: "30s"
        scrapeTimeout: "10s"
This enables the OpenBao telemetry stanza with safe defaults and creates a Prometheus Operator ServiceMonitor when that is the scrape model you use. Reach for spec.telemetry when you need lower-level OpenBao telemetry tuning.
Reference table
Promote a small set of signals first
| Concern | What to watch | Why it matters |
|---|---|---|
| Availability | openbao_cluster_ready_replicas and cluster conditions such as Available or Degraded | This tells you whether the cluster is serving traffic rather than only whether Pods exist. |
| Steady read pool | openbao_cluster_read_replicas_desired, _ready, _registered, _healthy, plus read-replica conditions such as ReadServingAvailable and ReadReplicasAutopilotHealthy | This tells you whether the steady read tier exists, has actually joined, and is still healthy enough for the topology you placed it in. |
| Backup freshness | openbao_backup_last_success_timestamp, openbao_backup_consecutive_failures, and the backup status conditions | These signals show whether the snapshots you plan to restore are current and successful. |
| Upgrade safety | openbao_upgrade_in_progress, openbao_upgrade_failure_total, and rollback counters | These signals distinguish orchestrated upgrade activity from normal steady state. |
| Controller health | openbao_reconcile_errors_total and sustained reconcile duration spikes | This exposes control-plane failures before they become cluster-wide drift or stalled operations. |
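Some of these signals are easier to chart once they are precomputed. The sketch below records backup age and read-replica shortfall using only metric names from the table above. The rule and record names are hypothetical, and the subtraction assumes the desired and healthy series carry identical label sets.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-recordings   # hypothetical name
spec:
  groups:
    - name: openbao-operator-recordings
      rules:
        # Seconds since the last successful backup.
        - record: openbao:backup_age_seconds
          expr: time() - openbao_backup_last_success_timestamp
        # Read replicas that are desired but not currently healthy.
        - record: openbao:read_replicas_shortfall
          expr: openbao_cluster_read_replicas_desired - openbao_cluster_read_replicas_healthy
```

Recorded series like these keep dashboard queries short and give alert expressions a single name to reference.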
Apply
Start with focused alert rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-alerts
spec:
  groups:
    - name: openbao-operator
      rules:
        - alert: OpenBaoClusterDown
          expr: openbao_cluster_ready_replicas == 0
          for: 1m
        - alert: OpenBaoBackupStale
          expr: time() - openbao_backup_last_success_timestamp > 86400
          for: 15m
        - alert: OpenBaoReadReplicaPoolDegraded
          expr: openbao_cluster_read_replicas_desired > openbao_cluster_read_replicas_healthy
          for: 10m
        - alert: OpenBaoReadReplicaPoolNotRegistered
          expr: openbao_cluster_read_replicas_desired > openbao_cluster_read_replicas_registered
          for: 10m
        - alert: OpenBaoReconcileErrors
          expr: rate(openbao_reconcile_errors_total[5m]) > 0.1
          for: 10m
Keep the first alert set small. Availability, backup freshness, sustained read-pool degradation, and sustained reconcile failure are the highest-value starting signals.
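Once the first set has proven stable, the backup and upgrade counters from the signal table are natural next candidates. The following is a sketch under stated assumptions: the thresholds, the time window, the severity labels, and the rule names are illustrative choices, not operator defaults.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-alerts-extended   # hypothetical name
spec:
  groups:
    - name: openbao-operator-extended
      rules:
        - alert: OpenBaoBackupFailingRepeatedly
          # Assumed threshold: three failed backups in a row.
          expr: openbao_backup_consecutive_failures >= 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: OpenBao backups have failed several times in a row.
        - alert: OpenBaoUpgradeFailed
          # Assumed window: any new upgrade failure within the last hour.
          expr: increase(openbao_upgrade_failure_total[1h]) > 0
          labels:
            severity: critical
          annotations:
            summary: An OpenBao upgrade failure was recorded in the last hour.
```

Tune the threshold and window to your backup schedule and upgrade cadence before routing these alerts to a pager.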
Dashboards, logs, and health
Apply
Install the bundled Grafana dashboards
kubectl apply -k config/grafana -n monitoring
The per-feature dashboards under config/grafana/dashboards/ are the better starting point. The old monolithic dashboard still exists, but it is no longer the recommended default.
Configure
Raise log detail temporarily during investigation
controller:
  extraArgs:
    - --zap-log-level=debug
    - --zap-stacktrace-level=error
Use debug logging only long enough to capture the behavior you need. Reset to the normal log level once the incident or rollout check is complete.
Build dashboards that show both surfaces together and still make it obvious whether a failure is in the operator control plane or in the OpenBao workload itself.
config/grafana/dashboards/overview.json now shows desired, ready, registered, and Autopilot-healthy read-replica counts next to the existing cluster-level signals. Use that view for the first operational pass, then build more topology-specific dashboards if your placement strategy needs them.
Keep the operational loop tight
You are reading docs for version 0.2.x.