Observe both the operator and the workload before you call the cluster ready.
OpenBao Operator has two observability layers: the operator control plane itself, and the OpenBao workload it renders. Use this page to wire both layers into your monitoring stack, choose the scrape model your platform already supports, and promote only the signals that help you operate upgrades, backups, and recovery.
## Decision matrix

Observe the right surface:
| Surface | Configure it through | Use it for | Watch for |
|---|---|---|---|
| Operator metrics | Helm or Kustomize settings on the operator installation | Controller and provisioner reconcile health, errors, and platform-level backup or upgrade counters. | The endpoint is HTTPS and RBAC-protected. Your scraper needs both network reachability and permission to GET /metrics. |
| OpenBao workload telemetry | The spec.observability.metrics block on each OpenBaoCluster, plus optional spec.telemetry overrides. | Application-level metrics from the OpenBao Pods themselves. | This is separate from operator metrics. Do not assume enabling one layer automatically covers the other. |
| Logs and health probes | Operator install values such as log level and health probe settings. | Fast incident triage when the issue is not obvious from metrics alone. | Use debug logging intentionally and temporarily. Do not leave broad debug enabled as the long-term default. |
| Dashboards and alerts | Grafana assets under config/grafana/ and your own Prometheus or Alertmanager rules. | A small, repeatable operator cockpit for upgrades, backups, and cluster readiness. | Dashboards should support decisions. They should not become an excuse to avoid explicit alerts on the failure modes that matter. |
## Wire operator metrics
- Prometheus Operator
- VictoriaMetrics Operator
- Plain Prometheus
**Configure:** enable operator metrics with `ServiceMonitor` resources.
```yaml
metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: prometheus-k8s
        namespace: monitoring
  serviceMonitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true
```
This is the cleanest path when Prometheus Operator is already your cluster standard. It creates ServiceMonitors for the controller and, in multi-tenant mode, the provisioner metrics Services.
**Configure:** enable operator metrics with `VMServiceScrape` resources.
```yaml
metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: vmagent
        namespace: monitoring
  victoriaMetrics:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true
```
Use this when VictoriaMetrics is your standard scrape controller. The same HTTPS and RBAC constraints still apply to the metrics endpoint.
**Configure:** scrape the operator metrics Services directly.
```yaml
scrape_configs:
  - job_name: openbao-operator-controller
    scheme: https
    metrics_path: /metrics
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - <controller-metrics-service>.<operator-namespace>.svc:8443
  - job_name: openbao-operator-provisioner
    scheme: https
    metrics_path: /metrics
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - <provisioner-metrics-service>.<operator-namespace>.svc:8443
```
Use this only when you do not run a scrape operator. Keep the ServiceAccount permission to GET /metrics and the TLS assumptions explicit.
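If you are not using the chart's `metrics.rbac` block to grant that access, the permission usually takes the shape of a ClusterRole on the `/metrics` non-resource URL bound to the scraper's ServiceAccount. A sketch, assuming the role name and the `prometheus-k8s` ServiceAccount; substitute the names your platform actually uses:

```yaml
# Sketch only: the role and binding names are assumptions,
# not objects the operator chart is guaranteed to create.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openbao-operator-metrics-reader
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: openbao-operator-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: openbao-operator-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s
    namespace: monitoring
```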
If operator network policy is enabled, the monitoring namespace must carry the labels expected by networkPolicy.metricsAllowedNamespaceLabels so the scraper can actually reach the HTTPS metrics service.
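As a concrete illustration, assume the operator was installed with `networkPolicy.metricsAllowedNamespaceLabels` selecting `openbao.org/metrics-scraper: "true"` (an illustrative key, not a default). The monitoring namespace would then need that label:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    # Must match the key/value configured in
    # networkPolicy.metricsAllowedNamespaceLabels; this value is illustrative.
    openbao.org/metrics-scraper: "true"
```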
## Enable OpenBao workload telemetry deliberately
**Configure:** turn on workload telemetry for an `OpenBaoCluster`.
```yaml
apiVersion: openbao.org/v1alpha1
kind: OpenBaoCluster
metadata:
  name: prod-cluster
spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: "30s"
        scrapeTimeout: "10s"
```
This enables the OpenBao telemetry stanza with safe defaults and creates a Prometheus Operator ServiceMonitor when that is the scrape model you use. Use spec.telemetry only when you need lower-level OpenBao telemetry tuning.
## Reference table

Promote a small set of signals first:
| Concern | What to watch | Why it matters |
|---|---|---|
| Availability | openbao_cluster_ready_replicas and cluster conditions such as Available or Degraded | This tells you whether the cluster is actually serving, not just whether Pods exist. |
| Backup freshness | openbao_backup_last_success_timestamp, openbao_backup_consecutive_failures, and the backup status conditions | Restore is only as real as the last backup you can prove succeeded. |
| Upgrade safety | openbao_upgrade_in_progress, openbao_upgrade_failure_total, and rollback counters | Upgrades should be observable as controlled workflows, not silent StatefulSet churn. |
| Controller health | openbao_reconcile_errors_total and sustained reconcile duration spikes | This exposes control-plane failures before they become cluster-wide drift or stalled operations. |
**Apply:** start with focused alert rules.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-alerts
spec:
  groups:
    - name: openbao-operator
      rules:
        - alert: OpenBaoClusterDown
          expr: openbao_cluster_ready_replicas == 0
          for: 1m
        - alert: OpenBaoBackupStale
          expr: time() - openbao_backup_last_success_timestamp > 86400
          for: 15m
        - alert: OpenBaoReconcileErrors
          expr: rate(openbao_reconcile_errors_total[5m]) > 0.1
          for: 10m
```
Keep the first alert set small. Availability, backup freshness, and sustained reconcile failure are the signals that change operator behavior fastest.
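Once those rules fire reliably, severity labels and annotations make them routable through Alertmanager. A sketch extending the backup rule; the label and annotation values are illustrative, not prescribed by the operator:

```yaml
- alert: OpenBaoBackupStale
  expr: time() - openbao_backup_last_success_timestamp > 86400
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "No successful OpenBao backup in over 24 hours"
    description: "Check openbao_backup_consecutive_failures and the backup status conditions before trusting any restore path."
```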
## Dashboards, logs, and health
**Apply:** install the bundled Grafana dashboards.
```shell
kubectl apply -k config/grafana -n monitoring
```
The per-feature dashboards under config/grafana/dashboards/ are the better starting point. The old monolithic dashboard still exists, but it is no longer the recommended default.
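If you prefer to vendor only the per-feature dashboards into your own overlay rather than applying the whole directory, a kustomization along these lines works; the path comes from the repository layout above, while the overlay itself is an assumption:

```yaml
# Sketch: an overlay that pulls in only the per-feature dashboards.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
resources:
  - config/grafana/dashboards
```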
**Configure:** raise log detail temporarily during investigation.
```yaml
controller:
  extraArgs:
    - --zap-log-level=debug
    - --zap-stacktrace-level=error
```
Use debug logging only long enough to capture the behavior you need. Reset to the normal log level once the incident or rollout check is complete.
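Resetting usually means dropping the override entirely, or pinning it back explicitly. A minimal sketch, assuming the normal level is `info`:

```yaml
controller:
  extraArgs:
    # Back to the routine operating level after the investigation.
    - --zap-log-level=info
```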
## Keep the operational loop tight

The most useful dashboards show both surfaces together, but they should still make it obvious whether a failure is in the operator control plane or in the OpenBao workload itself.