Version: 0.2.x

Observability for operator and workload

OpenBao Operator has two observability layers: the operator control plane itself, and the OpenBao workload it renders. Use this page to wire both layers into your monitoring stack, choose the scrape model your platform already supports, and focus on the signals that matter for upgrades, backups, and recovery.

Observe the right surface.
Surface	Configure it through	Use it for	Watch for
Operator metrics	Helm or Kustomize settings on the operator installation	Controller and provisioner reconcile health, errors, and platform-level backup or upgrade counters.	The endpoint is HTTPS and RBAC-protected. Your scraper needs both network reachability and permission to GET `/metrics`.
OpenBao workload telemetry	The `spec.observability.metrics` block on each `OpenBaoCluster`, plus optional `spec.telemetry` overrides.	Application-level metrics from the OpenBao Pods themselves.	Configure this separately from operator metrics. Enabling one layer does not configure the other.
Logs and health probes	Operator install values such as log level and health probe settings.	Fast incident triage when the issue is not obvious from metrics alone.	Use debug logging intentionally and temporarily, then return to the normal log level.
Dashboards and alerts	Grafana assets under `config/grafana/` and your own Prometheus or Alertmanager rules.	A small, repeatable operator cockpit for upgrades, backups, and cluster readiness.	Use dashboards for context and alerts for time-sensitive failures.

Wire operator metrics

Pick the variant that matches your scrape model. The three examples below cover, in order: Prometheus Operator, VictoriaMetrics Operator, and plain Prometheus.

# Prometheus Operator
metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: prometheus-k8s
        namespace: monitoring
  serviceMonitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true

This is the cleanest path when Prometheus Operator is already your cluster standard. It creates ServiceMonitors for the controller and, in multi-tenant mode, the provisioner metrics Services.

# VictoriaMetrics Operator
metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: vmagent
        namespace: monitoring
  victoriaMetrics:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true

Use this when VictoriaMetrics is your standard scrape controller. The same HTTPS and RBAC constraints still apply to the metrics endpoint.

# Plain Prometheus
scrape_configs:
  - job_name: openbao-operator-controller
    scheme: https
    metrics_path: /metrics
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - <controller-metrics-service>.<operator-namespace>.svc:8443

  - job_name: openbao-operator-provisioner
    scheme: https
    metrics_path: /metrics
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - <provisioner-metrics-service>.<operator-namespace>.svc:8443

Use this path when you do not run a scrape operator. Keep the ServiceAccount permission to GET `/metrics` and the TLS assumptions explicit.
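One way to make that permission explicit is a ClusterRole that grants `get` on the `/metrics` non-resource URL, bound to the ServiceAccount your Prometheus runs as. The manifest below is a minimal sketch using standard Kubernetes RBAC, not the operator's shipped configuration: the object name `openbao-operator-metrics-reader` and the `prometheus` ServiceAccount in `monitoring` are placeholders. If the operator's own metrics RBAC settings already bind your scraper, you may not need it.

```yaml
# Hypothetical names throughout; adjust to your Prometheus installation.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openbao-operator-metrics-reader
rules:
  # Non-resource URL rule: permits HTTP GET on /metrics for the bound subjects.
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: openbao-operator-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: openbao-operator-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus        # assumed ServiceAccount name
    namespace: monitoring   # assumed namespace
```

Non-resource URL rules are only valid in a ClusterRole, which is why this sketch is cluster-scoped rather than a namespaced Role.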

Network policy still applies to scrapers

If operator network policy is enabled, the monitoring namespace must carry the labels expected by `networkPolicy.metricsAllowedNamespaceLabels` so the scraper can actually reach the HTTPS metrics service.
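As an illustration, the namespace side of that requirement could look like the manifest below. The label key and value are placeholders invented for this sketch; substitute whatever pair your operator installation is configured to expect.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    # Placeholder label, not a value defined by the operator.
    # It must match the labels your operator network policy expects.
    example.com/openbao-metrics-scraper: "true"
```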

Enable OpenBao workload telemetry deliberately

apiVersion: openbao.org/v1alpha1
kind: OpenBaoCluster
metadata:
  name: prod-cluster
spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: "30s"
        scrapeTimeout: "10s"

This enables the OpenBao telemetry stanza with safe defaults and creates a Prometheus Operator ServiceMonitor when that is the scrape model you use. Reach for `spec.telemetry` when you need lower-level OpenBao telemetry tuning.

Promote a small set of signals first.
Concern	What to watch	Why it matters
Availability	`openbao_cluster_ready_replicas` and cluster conditions such as `Available` or `Degraded`	This tells you whether the cluster is serving traffic rather than only whether Pods exist.
Steady read pool	`openbao_cluster_read_replicas_desired`, `_ready`, `_registered`, `_healthy`, plus read-replica conditions such as `ReadServingAvailable` and `ReadReplicasAutopilotHealthy`	This tells you whether the steady read tier exists, has actually joined, and is still healthy enough for the topology you placed it in.
Backup freshness	`openbao_backup_last_success_timestamp`, `openbao_backup_consecutive_failures`, and the backup status conditions	These signals show whether the snapshots you plan to restore are current and successful.
Upgrade safety	`openbao_upgrade_in_progress`, `openbao_upgrade_failure_total`, and rollback counters	These signals distinguish orchestrated upgrade activity from normal steady state.
Controller health	`openbao_reconcile_errors_total` and sustained reconcile duration spikes	This exposes control-plane failures before they become cluster-wide drift or stalled operations.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-alerts
spec:
  groups:
    - name: openbao-operator
      rules:
        - alert: OpenBaoClusterDown
          expr: openbao_cluster_ready_replicas == 0
          for: 1m
        - alert: OpenBaoBackupStale
          expr: time() - openbao_backup_last_success_timestamp > 86400
          for: 15m
        - alert: OpenBaoReadReplicaPoolDegraded
          expr: openbao_cluster_read_replicas_desired > openbao_cluster_read_replicas_healthy
          for: 10m
        - alert: OpenBaoReadReplicaPoolNotRegistered
          expr: openbao_cluster_read_replicas_desired > openbao_cluster_read_replicas_registered
          for: 10m
        - alert: OpenBaoReconcileErrors
          expr: rate(openbao_reconcile_errors_total[5m]) > 0.1
          for: 10m

Keep the first alert set small. Availability, backup freshness, sustained read-pool degradation, and sustained reconcile failure are the highest-value starting signals.
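If you route alerts through Alertmanager, each rule usually also needs a severity label and a short annotation. The sketch below adds both to the backup-freshness rule, keeping the same expression and 24-hour threshold. The `severity` value, the annotation wording, and the object and group names are assumptions for illustration, not values shipped with the operator.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-alerts-routed   # hypothetical name
spec:
  groups:
    - name: openbao-operator-routed      # hypothetical group name
      rules:
        - alert: OpenBaoBackupStale
          # Seconds since the last successful backup, compared against 24 hours.
          expr: time() - openbao_backup_last_success_timestamp > 86400
          for: 15m
          labels:
            severity: warning            # assumed routing label
          annotations:
            # $value is the expression result (seconds since last success);
            # humanizeDuration renders it as a readable duration.
            summary: "Last successful OpenBao backup was {{ $value | humanizeDuration }} ago."
```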

Dashboards, logs, and health

kubectl apply -k config/grafana -n monitoring

The per-feature dashboards under `config/grafana/dashboards/` are the better starting point. The old monolithic dashboard still exists, but it is no longer the recommended default.

controller:
  extraArgs:
    - --zap-log-level=debug
    - --zap-stacktrace-level=error

Use debug logging only long enough to capture the behavior you need. Reset to the normal log level once the incident or rollout check is complete.
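Resetting could look like the values below. This sketch assumes `info` is the normal level for your installation; check the default your install actually uses. Removing the `--zap-log-level` override entirely is an equally reasonable way to return to the default.

```yaml
controller:
  extraArgs:
    - --zap-log-level=info          # assumed normal level; confirm your default
    - --zap-stacktrace-level=error
```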

Keep operator metrics and workload telemetry separate in your dashboards

Build dashboards that show both surfaces together and still make it obvious whether a failure is in the operator control plane or in the OpenBao workload itself.

The overview dashboard now includes the steady read pool

`config/grafana/dashboards/overview.json` now shows desired, ready, registered, and Autopilot-healthy read-replica counts next to the existing cluster-level signals. Use that view for the first operational pass, then build more topology-specific dashboards if your placement strategy needs them.

Keep the operational loop tight

Configure backups: Configure backup telemetry before restore depends on it.

Plan upgrades: Use upgrade metrics and rollback signals to make rollout behavior observable before the first version change.

Troubleshoot the cluster: Move from baseline telemetry into symptom-driven incident routing when the service stops behaving normally.
