Version: 0.1.0

Decision matrix

Observe the right surface

| Surface | Configure it through | Use it for | Watch for |
| --- | --- | --- | --- |
| OpenBao workload telemetry | The `spec.observability.metrics` block on each `OpenBaoCluster`, plus optional `spec.telemetry` overrides. | Application-level metrics from the OpenBao Pods themselves. | This is separate from operator metrics. Do not assume enabling one layer automatically covers the other. |
| Logs and health probes | Operator install values such as log level and health probe settings. | Fast incident triage when the issue is not obvious from metrics alone. | Use debug logging intentionally and temporarily. Do not leave broad debug enabled as the long-term default. |
| Dashboards and alerts | Grafana assets under `config/grafana/` and your own Prometheus or Alertmanager rules. | A small, repeatable operator cockpit for upgrades, backups, and cluster readiness. | Dashboards should support decisions. They should not become an excuse to avoid explicit alerts on the failure modes that matter. |

Wire operator metrics

Configure

Enable operator metrics with ServiceMonitor resources

```yaml
metrics:
  enabled: true
  rbac:
    enabled: true
    subjects:
      - name: prometheus-k8s
        namespace: monitoring
  serviceMonitor:
    enabled: true
    namespace: monitoring
    interval: 30s
    scrapeTimeout: 10s
    tlsConfig:
      insecureSkipVerify: true
```

This is the cleanest path when Prometheus Operator is already your cluster standard. It creates ServiceMonitors for the controller metrics Service and, in multi-tenant mode, for the provisioner metrics Services.

Network policy still applies to scrapers

If operator network policy is enabled, the monitoring namespace must carry the labels expected by networkPolicy.metricsAllowedNamespaceLabels so the scraper can actually reach the HTTPS metrics service.
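A minimal sketch of what that namespace labeling can look like. The label key and value here are assumptions for illustration only: the real key must match whatever your install sets in `networkPolicy.metricsAllowedNamespaceLabels`.

```yaml
# Hypothetical example: replace the label with the exact key/value
# configured in networkPolicy.metricsAllowedNamespaceLabels.
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    openbao.org/metrics-access: "true"  # assumed label, not a documented default
```

If the label is missing, scrapes typically fail with connection timeouts rather than TLS errors, which is a useful hint that the network policy, not the certificate, is the blocker.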

Enable OpenBao workload telemetry deliberately

Configure

Turn on workload telemetry for an OpenBaoCluster

```yaml
apiVersion: openbao.org/v1alpha1
kind: OpenBaoCluster
metadata:
  name: prod-cluster
spec:
  observability:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true
        interval: "30s"
        scrapeTimeout: "10s"
```

This enables the OpenBao telemetry stanza with safe defaults and creates a Prometheus Operator ServiceMonitor when that is the scrape model you use. Use spec.telemetry only when you need lower-level OpenBao telemetry tuning.
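As a sketch of what a lower-level override might look like: OpenBao inherits its telemetry stanza from upstream, where a Prometheus retention window is configurable at the server level. The field name under `spec.telemetry` below is an assumption, not a documented key; verify it against the `OpenBaoCluster` CRD reference before relying on it.

```yaml
apiVersion: openbao.org/v1alpha1
kind: OpenBaoCluster
metadata:
  name: prod-cluster
spec:
  observability:
    metrics:
      enabled: true
  telemetry:
    # Assumed field mirroring OpenBao's telemetry stanza; check the CRD
    # schema for the exact name before applying.
    prometheusRetentionTime: "1m"
```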

Reference table

Promote a small set of signals first

| Concern | What to watch | Why it matters |
| --- | --- | --- |
| Backup freshness | `openbao_backup_last_success_timestamp`, `openbao_backup_consecutive_failures`, and the backup status conditions | Restore is only as real as the last backup you can prove succeeded. |
| Upgrade safety | `openbao_upgrade_in_progress`, `openbao_upgrade_failure_total`, and rollback counters | Upgrades should be observable as controlled workflows, not silent StatefulSet churn. |
| Controller health | `openbao_reconcile_errors_total` and sustained reconcile duration spikes | This exposes control-plane failures before they become cluster-wide drift or stalled operations. |

Apply

Start with focused alert rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-operator-alerts
spec:
  groups:
    - name: openbao-operator
      rules:
        - alert: OpenBaoClusterDown
          expr: openbao_cluster_ready_replicas == 0
          for: 1m
        - alert: OpenBaoBackupStale
          expr: time() - openbao_backup_last_success_timestamp > 86400
          for: 15m
        - alert: OpenBaoReconcileErrors
          expr: rate(openbao_reconcile_errors_total[5m]) > 0.1
          for: 10m
```

Keep the first alert set small. Availability, backup freshness, and sustained reconcile failure are the signals that change operator behavior fastest.
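If these alerts route through Alertmanager, attaching `labels` and `annotations` to each rule keeps the routing explicit. A sketch for one rule; the `severity` value is a common convention, not something the operator requires:

```yaml
- alert: OpenBaoBackupStale
  expr: time() - openbao_backup_last_success_timestamp > 86400
  for: 15m
  labels:
    severity: critical  # conventional value; match your Alertmanager routes
  annotations:
    summary: "No successful OpenBao backup in over 24 hours"
```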

Dashboards, logs, and health

Apply

Install the bundled Grafana dashboards

```bash
kubectl apply -k config/grafana -n monitoring
```

The per-feature dashboards under config/grafana/dashboards/ are the better starting point. The old monolithic dashboard still exists, but it is no longer the recommended default.
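If you only want a subset, a small kustomization of your own can pick individual dashboards instead of applying the whole directory. The file names below are illustrative, not the actual contents of `config/grafana/dashboards/`:

```yaml
# kustomization.yaml — a sketch; the dashboard file names are
# hypothetical, so list the real files you find in the directory.
namespace: monitoring
resources:
  - dashboards/backup.yaml
  - dashboards/upgrade.yaml
```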

Configure

Raise log detail temporarily during investigation

```yaml
controller:
  extraArgs:
    - --zap-log-level=debug
    - --zap-stacktrace-level=error
```

Use debug logging only long enough to capture the behavior you need. Reset to the normal log level once the incident or rollout check is complete.
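Reverting is the same values block with the steady-state level restored. This assumes `info` is your normal default; substitute whatever level your install started with:

```yaml
controller:
  extraArgs:
    - --zap-log-level=info  # steady-state default assumed here
```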

Keep operator metrics and workload telemetry separate in your dashboards

The most useful dashboards show both surfaces together, but they should still make it obvious whether a failure is in the operator control plane or in the OpenBao workload itself.

Keep the operational loop tight
