Version: 0.2.x

Decision matrix

Observe the right surface

| Surface | Configure it through | Use it for | Watch for |
| --- | --- | --- | --- |
| OpenBao workload telemetry | The spec.observability.metrics block on each OpenBaoCluster, plus optional spec.telemetry overrides. | Application-level metrics from the OpenBao Pods themselves. | Configure this separately from operator metrics. Enabling one layer does not configure the other. |
| Logs and health probes | Operator install values such as log level and health probe settings. | Fast incident triage when the issue is not obvious from metrics alone. | Use debug logging intentionally and temporarily, then return to the normal log level. |
| Dashboards and alerts | Grafana assets under config/grafana/ and your own Prometheus or Alertmanager rules. | A small, repeatable operator cockpit for upgrades, backups, and cluster readiness. | Use dashboards for context and alerts for time-sensitive failures. |

Wire operator metrics

Configure

Enable operator metrics with ServiceMonitor resources

yaml

metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: prometheus-k8s
        namespace: monitoring
  serviceMonitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true

This is the cleanest path when Prometheus Operator is already your cluster standard. It creates ServiceMonitors for the controller and, in multi-tenant mode, the provisioner metrics Services.

Network policy still applies to scrapers

If operator network policy is enabled, the monitoring namespace must carry the labels expected by networkPolicy.metricsAllowedNamespaceLabels so the scraper can actually reach the HTTPS metrics service.
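As a sketch of what that labeling can look like, the manifest below adds a label to the monitoring namespace. The label key and value are hypothetical placeholders, not values taken from the chart. The real ones must match whatever networkPolicy.metricsAllowedNamespaceLabels is set to in your operator install values.

```yaml
# Sketch only: the label key and value below are hypothetical.
# Replace them with the pair that
# networkPolicy.metricsAllowedNamespaceLabels expects in your install.
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    example.com/metrics-scraper: "true"
```

If scrapes still fail after labeling, compare the labels on the namespace with the chart value before changing anything on the Prometheus side.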

Enable OpenBao workload telemetry deliberately

Configure

Turn on workload telemetry for an OpenBaoCluster

yaml

apiVersion: openbao.org/v1alpha1
kind: OpenBaoCluster
metadata:
  name: prod-cluster
spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: "30s"
        scrapeTimeout: "10s"

This enables the OpenBao telemetry stanza with safe defaults and creates a Prometheus Operator ServiceMonitor when that is the scrape model you use. Reach for spec.telemetry when you need lower-level OpenBao telemetry tuning.

Reference table

Promote a small set of signals first

| Concern | What to watch | Why it matters |
| --- | --- | --- |
| Steady read pool | openbao_cluster_read_replicas_desired, _ready, _registered, _healthy, plus read-replica conditions such as ReadServingAvailable and ReadReplicasAutopilotHealthy | This tells you whether the steady read tier exists, has actually joined, and is still healthy enough for the topology you placed it in. |
| Backup freshness | openbao_backup_last_success_timestamp, openbao_backup_consecutive_failures, and the backup status conditions | These signals show whether the snapshots you plan to restore are current and successful. |
| Upgrade safety | openbao_upgrade_in_progress, openbao_upgrade_failure_total, and rollback counters | These signals distinguish orchestrated upgrade activity from normal steady state. |
| Controller health | openbao_reconcile_errors_total and sustained reconcile duration spikes | This exposes control-plane failures before they become cluster-wide drift or stalled operations. |

Apply

Start with focused alert rules

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-alerts
spec:
  groups:
    - name: openbao-operator
      rules:
        - alert: OpenBaoClusterDown
          expr: openbao_cluster_ready_replicas == 0
          for: 1m
        - alert: OpenBaoBackupStale
          expr: time() - openbao_backup_last_success_timestamp > 86400
          for: 15m
        - alert: OpenBaoReadReplicaPoolDegraded
          expr: openbao_cluster_read_replicas_desired > openbao_cluster_read_replicas_healthy
          for: 10m
        - alert: OpenBaoReadReplicaPoolNotRegistered
          expr: openbao_cluster_read_replicas_desired > openbao_cluster_read_replicas_registered
          for: 10m
        - alert: OpenBaoReconcileErrors
          expr: rate(openbao_reconcile_errors_total[5m]) > 0.1
          for: 10m

Keep the first alert set small. Availability, backup freshness, sustained read-pool degradation, and sustained reconcile failure are the highest-value starting signals.
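The starter rules carry no labels or annotations, so Alertmanager has nothing to route on and responders get no context. A minimal sketch of the OpenBaoBackupStale rule with that metadata added might look like the following. The severity value and the annotation wording are illustrative assumptions, not part of the shipped rules.

```yaml
# Sketch: one entry in the rules list, with routing metadata added.
# The severity value and annotation text are assumptions; adapt them
# to your own Alertmanager routing conventions.
- alert: OpenBaoBackupStale
  # 86400 seconds = 24 hours since the last successful backup
  expr: time() - openbao_backup_last_success_timestamp > 86400
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "No successful OpenBao backup in the last 24 hours"
    # $value is the expression result: seconds since the last success
    description: "Last successful backup was {{ $value | humanizeDuration }} ago."
```

The same pattern applies to the other rules. Keeping the label set identical across them makes Alertmanager routing simpler.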

Dashboards, logs, and health

Apply

Install the bundled Grafana dashboards

bash

kubectl apply -k config/grafana -n monitoring

The per-feature dashboards under config/grafana/dashboards/ are the better starting point. The old monolithic dashboard still exists, but it is no longer the recommended default.

Configure

Raise log detail temporarily during investigation

yaml

controller:
  extraArgs:
    - --zap-log-level=debug
    - --zap-stacktrace-level=error

Use debug logging only long enough to capture the behavior you need. Reset to the normal log level once the incident or rollout check is complete.

Keep operator metrics and workload telemetry separate in your dashboards

Build dashboards that show both surfaces together and still make it obvious whether a failure is in the operator control plane or in the OpenBao workload itself.

The overview dashboard now includes the steady read pool

config/grafana/dashboards/overview.json now shows desired, ready, registered, and Autopilot-healthy read-replica counts next to the existing cluster-level signals. Use that view for the first operational pass, then build more topology-specific dashboards if your placement strategy needs them.
