OpenBao Operator integration contract

Use this reference when you connect OpenBao clusters managed by dc-tec/openbao-operator to this observability reference architecture. It defines the workload telemetry, resource, label, dashboard, alert, and log contract that lets the operator and this repository stay complementary.

Contract scope

This page defines the contract for observing OpenBao workloads that the operator creates. It does not define the operator control-plane metrics, controller dashboards, backup controller alerts, restore workflows, upgrade workflows, or tenant admission signals.

Use this contract when you need one of these outcomes:

  • The operator renders OpenBao workload telemetry in a way this repository can consume.
  • This repository publishes dashboards, alerts, and runbooks that work against operator-managed clusters.
  • Platform teams can decide which resources belong to the operator and which resources belong to the observability delivery pipeline.

Validate field availability against the operator version you run. If an operator version uses different API field names, preserve the behavior in this contract rather than the literal field path.

Ownership model

Keep ownership separate so an operator upgrade does not silently redefine workload observability semantics.

| Area | Owner | Contract expectation |
| --- | --- | --- |
| OpenBao lifecycle | OpenBao Operator | The operator owns StatefulSets, Services, TLS wiring, unseal mode, self-init, backups, restores, upgrades, read replicas, and OpenBaoCluster status. |
| Workload telemetry configuration | OpenBao Operator | The operator renders OpenBao telemetry, metrics listener configuration, metrics Services, and optional scrape resources from OpenBaoCluster configuration. |
| OpenBao signal semantics | This repository | This repository owns metric, log, audit, dashboard, alert, and runbook intent for the OpenBao workload. |
| Artifact delivery | Platform pipeline | The platform applies generated Prometheus, Loki, and Grafana artifacts from this repository or ports their intent to another backend. |
| Operator control-plane observability | OpenBao Operator | Operator dashboards and alerts cover reconciliation, backups, restores, upgrades, tenant onboarding, and read-replica lifecycle. |

Do not treat operator health as OpenBao workload health. A healthy controller can manage a sealed or leaderless OpenBao cluster, and a serving OpenBao cluster can still have a stalled backup, restore, upgrade, or read-replica workflow.

OpenBaoCluster configuration contract

The operator-facing configuration must let you express these OpenBao workload observability decisions.

| Concern | Expected operator surface | Required behavior |
| --- | --- | --- |
| Enable workload metrics | spec.observability.metrics.enabled | Render OpenBao Prometheus telemetry and create the workload metrics Service when enabled. |
| Choose scrape profile | spec.observability.metrics.scrapeProfile | Support Active for the secure baseline and AllNodes for per-node HA/Raft visibility. |
| Configure a metrics-only listener | spec.observability.metrics.metricsOnlyListener | Render a dedicated OpenBao listener for metrics collection when all-node scraping or listener separation requires it. |
| Control metrics listener access | metricsOnlyListener.unauthenticatedMetricsAccess | Allow unauthenticated metrics only for a private metrics path with network isolation. |
| Create scrape resources | spec.observability.metrics.serviceMonitor | Create a Prometheus Operator ServiceMonitor when the platform uses Prometheus Operator. |
| Configure scrape authentication | serviceMonitor.authorization.credentialsSecret | Reference a Secret containing a scoped OpenBao token for authenticated /v1/sys/metrics access. |
| Configure scrape TLS | serviceMonitor.tlsConfig | Reference a CA ConfigMap or Secret and set serverName when TLS validates a service DNS name. |
| Configure metric prefix | spec.telemetry.metricsPrefix | Allow either the default vault source prefix or an explicit openbao prefix. |
| Configure Prometheus retention | spec.telemetry.prometheusRetentionTime | Retain OpenBao metrics in memory for longer than the scrape interval so samples do not expire between scrapes. |
| Configure declarative audit devices | spec.audit[] | Render file, syslog, socket, or HTTP audit devices without requiring imperative post-start API calls. |
| Configure read replicas | spec.readReplicas | Expose enough workload and status context for read-replica dashboards and alerts to separate quorum health from read capacity. |

Keep the secure active scrape as the default production baseline. Add all-node scraping only when the platform explicitly needs standby, sealed-node, follower, read-replica, or per-node runtime visibility.
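
As a rough illustration, the secure active baseline could be expressed on an OpenBaoCluster along these lines. The field paths come from the table above. The apiVersion, the resource names, the enabled toggle under serviceMonitor, the nested keys under credentialsSecret, and every value are assumptions, so check them against the CRD of the operator version you run.

```yaml
# Sketch only: field paths follow the configuration table; apiVersion, names,
# nested keys, and values are assumptions. Validate against your operator's CRD.
apiVersion: openbao.org/v1alpha1        # assumed API group and version
kind: OpenBaoCluster
metadata:
  name: bao-prod                        # hypothetical cluster name
  namespace: openbao                    # hypothetical namespace
spec:
  telemetry:
    metricsPrefix: vault                # or "openbao"
    prometheusRetentionTime: 2m         # illustrative; longer than the scrape interval
  observability:
    metrics:
      enabled: true
      scrapeProfile: Active             # secure production baseline
      serviceMonitor:
        enabled: true                   # assumed toggle
        authorization:
          credentialsSecret:
            name: bao-prod-metrics-token   # hypothetical Secret holding a scoped token
            key: token                     # assumed key layout
        tlsConfig:
          serverName: bao-prod.openbao.svc # assumed; match the certificate's DNS name
```

Switching scrapeProfile to AllNodes in a sketch like this would also require the metricsOnlyListener settings, which are left out here because the baseline does not need them.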

Metrics resource contract

The operator can own the Kubernetes resources that expose OpenBao workload metrics, or the platform can apply equivalent resources. Either path must preserve the same labels, target shape, and scrape semantics.

| Resource | Expected shape | Required behavior |
| --- | --- | --- |
| Metrics Service | <cluster-name>-metrics in the OpenBaoCluster namespace. | Expose port https-metrics and select the correct pods for the chosen scrape profile. |
| Active metrics Service | ClusterIP Service that selects the active OpenBao pod. | Select openbao-active: "true" when Kubernetes service registration supplies that label. |
| All-node metrics Service | Headless Service with publishNotReadyAddresses: true. | Select every OpenBao pod that should expose metrics, including sealed or not-yet-ready pods. |
| ServiceMonitor | <cluster-name>-metrics in the OpenBaoCluster namespace. | Scrape /v1/sys/metrics with format=prometheus from the metrics Service. |
| ServiceMonitor endpoint | Port https-metrics, path /v1/sys/metrics, and format=prometheus. | Include interval, timeout, authorization, TLS, and relabeling based on the operator configuration. |
| All-node relabeling | Prometheus target labels pod and node. | Preserve pod and node context for HA/Raft, runtime, and read-replica diagnostics. |

Use ServiceMonitor when Prometheus Operator is your platform standard. If you use plain Prometheus, VictoriaMetrics, Grafana Agent, Grafana Alloy, or another collector, preserve the same endpoint, parameters, labels, and target profile.
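
For platforms that apply these resources themselves, the Active profile might look roughly like the following pair of manifests. The Service and ServiceMonitor fields are standard Kubernetes and Prometheus Operator API fields. The cluster name, namespace, port number, Secret, and ConfigMap are hypothetical, and the sketch assumes metrics are served on the API listener.

```yaml
# Sketch only: names, namespace, port number, and Secret/ConfigMap references
# are hypothetical placeholders.
apiVersion: v1
kind: Service
metadata:
  name: bao-prod-metrics
  namespace: openbao
  labels:
    app.kubernetes.io/name: openbao
    app.kubernetes.io/instance: bao-prod
    app.kubernetes.io/component: metrics
    openbao.org/cluster: bao-prod
    openbao.org/scrape-profile: Active
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: openbao
    app.kubernetes.io/instance: bao-prod
    openbao-active: "true"            # active pod only
  ports:
    - name: https-metrics
      port: 8200                      # assumes the API listener serves metrics
      targetPort: 8200
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: bao-prod-metrics
  namespace: openbao
  labels:
    openbao.org/cluster: bao-prod
    openbao.org/scrape-profile: Active
spec:
  selector:
    matchLabels:
      openbao.org/cluster: bao-prod
      app.kubernetes.io/component: metrics
  endpoints:
    - port: https-metrics
      scheme: https
      path: /v1/sys/metrics
      params:
        format: ["prometheus"]
      interval: 30s                   # illustrative values
      scrapeTimeout: 10s
      authorization:
        type: Bearer
        credentials:
          name: bao-prod-metrics-token   # hypothetical Secret with a scoped token
          key: token
      tlsConfig:
        ca:
          configMap:
            name: bao-prod-ca            # hypothetical CA bundle
            key: ca.crt
        serverName: bao-prod.openbao.svc # must match a name in the server certificate
```

Another collector would carry the same port, path, format parameter, bearer token, and CA settings in its own scrape configuration.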

Label contract

Use labels for stable source identity and routing. Do not use labels to expose request paths, secret paths, token accessors, entity identifiers, auth accessors, client addresses, or unbounded policy names.

| Label | Expected value | Applies to | Purpose |
| --- | --- | --- | --- |
| app.kubernetes.io/name | openbao | Workload resources. | Identifies the application. |
| app.kubernetes.io/instance | OpenBaoCluster name. | Workload resources. | Identifies the cluster instance. |
| app.kubernetes.io/managed-by | openbao-operator | Operator-managed resources. | Identifies the lifecycle owner. |
| openbao.org/cluster | OpenBaoCluster name. | Operator-managed OpenBao resources. | Provides a stable cluster identity for selectors and dashboards. |
| app.kubernetes.io/component | metrics on metrics resources. | Metrics Service and scrape resources. | Distinguishes metrics exposure from API Services. |
| openbao.org/component | metrics on metrics resources. | Metrics Service and scrape resources. | Gives operator-owned resources an OpenBao-specific component label. |
| openbao.org/scrape-profile | Active or AllNodes. | Metrics Service and scrape resources. | Identifies the scrape profile. |
| openbao.org/workload-pool | voter or read-replica when used. | Workload pods and Services. | Separates quorum participants from read-replica pools. |
| openbao-active | "true" on the active OpenBao pod. | Active scrape selector. | Lets the active metrics Service target exactly one active pod. |

Prometheus and Loki labels do not need to match every Kubernetes label. Promote only the bounded dimensions needed for routing, dashboards, and incident triage.
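
One way to promote only bounded dimensions with Prometheus Operator is sketched below. The targetLabels list copies selected Service labels onto the scraped targets, and the relabelings add pod and node from Kubernetes service discovery metadata. The resource names are hypothetical, and authentication and TLS settings are omitted to keep the fragment short.

```yaml
# Sketch only: names are hypothetical; authorization and tlsConfig omitted.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: bao-prod-metrics
  namespace: openbao
spec:
  selector:
    matchLabels:
      openbao.org/cluster: bao-prod
      app.kubernetes.io/component: metrics
  # Copy bounded Service labels onto every target.
  targetLabels:
    - openbao.org/cluster
    - openbao.org/scrape-profile
  endpoints:
    - port: https-metrics
      scheme: https
      path: /v1/sys/metrics
      params:
        format: ["prometheus"]
      relabelings:
        # Preserve per-pod and per-node context for all-node diagnostics.
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```

Prometheus label names cannot contain dots or slashes, so the promoted labels typically appear in sanitized form, for example openbao_org_cluster. Confirm the exact names in your Prometheus target view before building selectors on them.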

Active scrape contract

The Active scrape profile is the production baseline.

| Requirement | Contract |
| --- | --- |
| Target count | One target per OpenBao cluster. |
| Target selection | Active OpenBao pod only. |
| Authentication | Prefer a scoped OpenBao token with access to sys/metrics. |
| Listener | Use the API listener or a dedicated authenticated metrics-only listener. |
| Service shape | ClusterIP Service that does not publish not-ready addresses. |
| Best dashboard coverage | Overview, audit health, token and lease health, request health, and high-level HA state. |
| Known limitation | Standby, sealed-node, follower, read-replica, and per-node runtime detail is incomplete. |

Use this profile when you need the lowest-risk metrics exposure model. It is also the right fallback when an older operator version or platform profile does not support a private all-node listener.

All-node scrape contract

The AllNodes scrape profile is an advanced profile for HA/Raft, runtime, and read-replica diagnostics.

| Requirement | Contract |
| --- | --- |
| Target count | One target per selected OpenBao pod. |
| Target selection | All OpenBao pods in the selected workload pool. |
| Listener | Dedicated metrics-only listener on every selected pod. |
| Service shape | Headless Service with publishNotReadyAddresses: true. |
| Standby behavior | Standby metrics need a private metrics path that OpenBao permits on standby nodes. |
| Network control | Restrict the metrics listener to Prometheus or an equivalent collector path. |
| Best dashboard coverage | HA/Raft, runtime/storage, read-replica, sealed-node, standby, and per-node diagnostics. |

If you enable unauthenticated metrics access for all-node scraping, network isolation becomes part of the security boundary. Use NetworkPolicy, private routing, firewall rules, mTLS proxying, or sidecar-local scraping so ordinary clients cannot reach the metrics-only listener.
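
A minimal sketch of the all-node Service shape combined with NetworkPolicy isolation follows. It assumes a metrics-only listener on port 9102, a Prometheus instance running in a namespace named monitoring with the app.kubernetes.io/name: prometheus pod label, and the default OpenBao API and cluster ports 8200 and 8201. All names and the metrics port are hypothetical.

```yaml
# Sketch only: names, namespaces, and the metrics port are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: bao-prod-metrics
  namespace: openbao
  labels:
    app.kubernetes.io/component: metrics
    openbao.org/cluster: bao-prod
    openbao.org/scrape-profile: AllNodes
spec:
  clusterIP: None                     # headless: one target per pod
  publishNotReadyAddresses: true      # keep sealed or not-yet-ready pods scrapeable
  selector:
    app.kubernetes.io/name: openbao
    app.kubernetes.io/instance: bao-prod
  ports:
    - name: https-metrics
      port: 9102                      # hypothetical metrics-only listener port
      targetPort: 9102
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bao-prod-metrics-ingress
  namespace: openbao
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: openbao
      app.kubernetes.io/instance: bao-prod
  policyTypes: ["Ingress"]
  ingress:
    # Keep API and cluster traffic reachable; tighten these in a separate review.
    - ports:
        - protocol: TCP
          port: 8200
        - protocol: TCP
          port: 8201
    # Only the collector may reach the metrics-only listener.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - protocol: TCP
          port: 9102
```

Because a NetworkPolicy that selects the OpenBao pods denies any ingress it does not list, the first rule keeps the API and cluster ports open. Without it, this policy would also cut off ordinary client and Raft traffic.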

Audit and log contract

The operator can configure audit devices, but the platform still owns log collection, audit archive delivery, retention, and access control.

| Stream | Source | Contract |
| --- | --- | --- |
| openbao.operational | OpenBao container logs. | Keep separate from operator controller logs. |
| openbao.completed_requests | OpenBao completed request logs when enabled. | Treat as temporary troubleshooting data, not as an audit-log replacement. |
| openbao.audit | OpenBao audit devices used for investigation. | Restrict access and preserve request/response entries for security workflows. |
| openbao.audit_archive | Compliance or long-term audit archive path. | Keep separate from short-term Loki or dashboard exploration. |
| Operator logs | Operator controller and provisioner logs. | Keep in operator-owned dashboards and runbooks. |

For file audit devices, the audit path must be stable enough for the collector to tail after pod restarts. If the operator renders audit devices from spec.audit[], the platform must still provision the volume, file permissions, collector mount, archive path, and access policy.

Leave ordinary OpenBao operational logs on stderr/stdout for Kubernetes workloads unless you deliberately mount and manage a writable log volume. A configured operational log file without a compatible mount can prevent OpenBao from starting. This is separate from file audit devices, which need explicit storage and access controls because they contain security records.
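
If the platform collects file audit logs with Promtail, the separation between the audit stream and other logs could be sketched as below. The file path, the log_stream label name, and the cluster value are assumptions chosen for illustration; a different collector such as Grafana Alloy would express the same intent in its own configuration.

```yaml
# Promtail fragment, sketch only: path and label names are assumptions.
scrape_configs:
  - job_name: openbao-audit
    static_configs:
      - targets: [localhost]
        labels:
          log_stream: openbao.audit          # hypothetical label carrying the stream name
          cluster: bao-prod                  # bounded identity label
          __path__: /openbao/audit/audit.log # hypothetical stable audit file path
```

Only bounded labels are attached here. Request paths, token accessors, and client addresses stay inside the log line rather than becoming Loki labels.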

Dashboard contract

Use dashboard families that keep control-plane and workload questions distinct.

| Dashboard family | Owner | Use it for |
| --- | --- | --- |
| Operator dashboards | OpenBao Operator | Reconcile health, controller errors, backup freshness, restore state, upgrade progress, read-replica lifecycle, and CR status. |
| OpenBao workload dashboards | This repository | OpenBao availability, seal state, active node count, request latency, HA/Raft health, runtime pressure, token and lease pressure, audit health, and security investigation. |
| Platform dashboards | Platform team | Pod readiness, restarts, node pressure, PVC pressure, NetworkPolicy reachability, collector health, and Prometheus target health. Use the generated Kubernetes platform dashboard as the reference workload-context view. |

When you link dashboards together, pass bounded context such as cluster, Kubernetes namespace, pod, node, scrape profile, and source prefix. Do not pass request paths, secret paths, token accessors, entity identifiers, auth accessors, or client addresses as dashboard variables.

Alert contract

Keep alert ownership clear even when a single incident involves both the operator and the OpenBao workload.

| Alert class | Owner | First question |
| --- | --- | --- |
| Operator lifecycle alerts | Operator or platform team. | Is the controller failing to converge the desired state? |
| OpenBao workload alerts | OpenBao service owner. | Is the OpenBao workload unhealthy after the desired state exists? |
| Security investigation alerts | Security or secrets platform responders. | Is there evidence of audit failure, risky activity, or missing security records? |
| Platform alerts | Platform team. | Is Kubernetes, storage, network, DNS, or collection infrastructure preventing either layer from working? |

Runbooks can link across ownership boundaries, but the alert name and primary owner should not change. This keeps paging, escalation, and post-incident review clear.
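
To illustrate how a fixed owner can travel with an alert, the sketch below attaches ownership labels to a workload alert in a PrometheusRule. It is not one of the generated rules from this repository. The alert name, the owner and alert_class labels, and the threshold are hypothetical, and the expression assumes the vault source prefix.

```yaml
# Sketch only: alert name, routing labels, and timing are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bao-prod-workload-alerts
  namespace: openbao
spec:
  groups:
    - name: openbao-workload
      rules:
        - alert: OpenBaoSealed
          expr: vault_core_unsealed == 0   # openbao_core_unsealed with the openbao prefix
          for: 5m
          labels:
            severity: critical
            alert_class: workload          # stays fixed across linked runbooks
            owner: openbao-service-owner   # hypothetical Alertmanager routing label
          annotations:
            summary: An OpenBao node reports a sealed state.
```

Alertmanager routes can then match on the owner label, so a linked operator runbook never changes who gets paged first.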

Minimum acceptance checklist

An operator-managed OpenBao cluster fits this contract when all of these checks pass:

  • The OpenBao workload has Prometheus telemetry enabled.
  • The metrics source prefix is documented as vault or openbao.
  • The active scrape exposes exactly one healthy target per cluster.
  • The all-node scrape, when enabled, exposes one target per selected OpenBao pod and preserves pod and node labels.
  • The metrics listener is authenticated, privately reachable, or both.
  • The metrics Service and scrape resource carry stable cluster and scrape profile labels.
  • OpenBao operational logs and operator logs land in different streams.
  • Audit logs land in a restricted stream and have a separate archive decision.
  • Generated Prometheus, Loki, and Grafana artifacts from this repository are validated against the operator-managed staging cluster.
  • Operator control-plane alerts and OpenBao workload alerts route to owners who can act on their first diagnostic step.

Compatibility notes

Raw OpenBao metrics do not expose a consistent cluster label across every metric family and deployment profile. Use the generated recording rules from this repository before you make dashboards or alerts depend on normalized cluster-level signals.

OpenBao deployments commonly emit vault_* metrics unless you configure an openbao metrics prefix. This repository generates artifacts for both source prefixes.

The local OpenBao fixture validates basic Raft non-voter behavior with one read replica. Operator-managed read replicas still need live Kubernetes validation before you page on operator-specific role labels. Use all-node scraping for diagnosis, then keep quorum alerts separate from read-capacity alerts.

What’s next