OpenBao Operator integration contract

Use this reference when you connect OpenBao clusters managed by dc-tec/openbao-operator to this observability reference architecture. It defines the workload telemetry, resource, label, dashboard, alert, and log contract that lets the operator and this repository stay complementary.

Contract scope

This page defines the contract for observing OpenBao workloads that the operator creates. It does not define the operator control-plane metrics, controller dashboards, backup controller alerts, restore workflows, upgrade workflows, or tenant admission signals.

Use this contract when you need one of these outcomes:

The operator renders OpenBao workload telemetry in a way this repository can consume.
This repository publishes dashboards, alerts, and runbooks that work against operator-managed clusters.
Platform teams can decide which resources belong to the operator and which resources belong to the observability delivery pipeline.

Validate field availability against the operator version you run. If an operator version uses different API field names, preserve the behavior in this contract rather than the literal field path.

Ownership model

Keep ownership separate so an operator upgrade does not silently redefine workload observability semantics.

Area	Owner	Contract expectation
OpenBao lifecycle	OpenBao Operator	The operator owns StatefulSets, Services, TLS wiring, unseal mode, self-init, backups, restores, upgrades, read replicas, and `OpenBaoCluster` status.
Workload telemetry configuration	OpenBao Operator	The operator renders OpenBao telemetry, metrics listener configuration, metrics Services, and optional scrape resources from `OpenBaoCluster` configuration.
OpenBao signal semantics	This repository	This repository owns metric, log, audit, dashboard, alert, and runbook intent for the OpenBao workload.
Artifact delivery	Platform pipeline	The platform applies generated Prometheus, Loki, and Grafana artifacts from this repository or ports their intent to another backend.
Operator control-plane observability	OpenBao Operator	Operator dashboards and alerts cover reconciliation, backups, restores, upgrades, tenant onboarding, and read-replica lifecycle.

Do not treat operator health as OpenBao workload health. A healthy controller can manage a sealed or leaderless OpenBao cluster, and a serving OpenBao cluster can still have a stalled backup, restore, upgrade, or read-replica workflow.

OpenBaoCluster configuration contract

The operator-facing configuration must let you express these OpenBao workload observability decisions.

Concern	Expected operator surface	Required behavior
Enable workload metrics	`spec.observability.metrics.enabled`	Render OpenBao Prometheus telemetry and create the workload metrics Service when enabled.
Choose scrape profile	`spec.observability.metrics.scrapeProfile`	Support `Active` for the secure baseline and `AllNodes` for per-node HA/Raft visibility.
Configure a metrics-only listener	`spec.observability.metrics.metricsOnlyListener`	Render a dedicated OpenBao listener for metrics collection when all-node scraping or listener separation requires it.
Control metrics listener access	`metricsOnlyListener.unauthenticatedMetricsAccess`	Allow unauthenticated metrics only for a private metrics path with network isolation.
Create scrape resources	`spec.observability.metrics.serviceMonitor`	Create a Prometheus Operator `ServiceMonitor` when the platform uses Prometheus Operator.
Configure scrape authentication	`serviceMonitor.authorization.credentialsSecret`	Reference a Secret containing a scoped OpenBao token for authenticated `/v1/sys/metrics` access.
Configure scrape TLS	`serviceMonitor.tlsConfig`	Reference a CA ConfigMap or Secret and set `serverName` when TLS validates a service DNS name.
Configure metric prefix	`spec.telemetry.metricsPrefix`	Allow either the default `vault` source prefix or an explicit `openbao` prefix.
Configure Prometheus retention	`spec.telemetry.prometheusRetentionTime`	Keep OpenBao metrics in memory for longer than the scrape interval so that samples do not expire between scrapes.
Configure declarative audit devices	`spec.audit[]`	Render file, syslog, socket, or HTTP audit devices without requiring imperative post-start API calls.
Configure read replicas	`spec.readReplicas`	Expose enough workload and status context for read-replica dashboards and alerts to separate quorum health from read capacity.

Keep the secure active scrape as the default production baseline. Add all-node scraping only when the platform explicitly needs standby, sealed-node, follower, read-replica, or per-node runtime visibility.
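
The configuration surface in the table above could be expressed roughly as follows. This is a minimal sketch, not a definitive manifest: the `apiVersion`, the resource names, the `name` and `key` sub-fields under `credentialsSecret`, and the retention value are assumptions for illustration. Check the CRD of the operator version you run before you rely on any field path.

```yaml
# Sketch only. Field paths come from the table above; everything marked
# "assumed" or "hypothetical" is not confirmed by this contract.
apiVersion: openbao.org/v1alpha1        # assumed API group and version
kind: OpenBaoCluster
metadata:
  name: example-cluster                 # hypothetical
  namespace: openbao                    # hypothetical
spec:
  observability:
    metrics:
      enabled: true
      scrapeProfile: Active             # secure production baseline
      serviceMonitor:
        authorization:
          credentialsSecret:
            name: example-cluster-metrics-token   # hypothetical Secret with a scoped token
            key: token                            # assumed sub-field
        tlsConfig:
          # The CA reference is omitted because its exact shape is version-specific.
          serverName: example-cluster-metrics.openbao.svc   # hypothetical service DNS name
  telemetry:
    metricsPrefix: vault                # or openbao
    prometheusRetentionTime: 24h        # assumed value and duration format
```

The sketch leaves out `metricsOnlyListener` on purpose, because the `Active` profile does not require a dedicated listener.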

Metrics resource contract

The operator can own the Kubernetes resources that expose OpenBao workload metrics, or the platform can apply equivalent resources. Either path must preserve the same labels, target shape, and scrape semantics.

Resource	Expected shape	Required behavior
Metrics Service	`<cluster-name>-metrics` in the `OpenBaoCluster` namespace.	Expose port `https-metrics` and select the correct pods for the chosen scrape profile.
Active metrics Service	ClusterIP Service that selects the active OpenBao pod.	Select `openbao-active: "true"` when Kubernetes service registration supplies that label.
All-node metrics Service	Headless Service with `publishNotReadyAddresses: true`.	Select every OpenBao pod that should expose metrics, including sealed or not-yet-ready pods.
ServiceMonitor	`<cluster-name>-metrics` in the `OpenBaoCluster` namespace.	Scrape `/v1/sys/metrics` with `format=prometheus` from the metrics Service.
ServiceMonitor endpoint	Port `https-metrics`, path `/v1/sys/metrics`, and `format=prometheus`.	Include interval, timeout, authorization, TLS, and relabeling based on the operator configuration.
All-node relabeling	Prometheus target labels `pod` and `node`.	Preserve pod and node context for HA/Raft, runtime, and read-replica diagnostics.

Use ServiceMonitor when Prometheus Operator is your platform standard. If you use plain Prometheus, VictoriaMetrics, Grafana Agent, Grafana Alloy, or another collector, preserve the same endpoint, parameters, labels, and target profile.
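
For a Prometheus Operator platform, the ServiceMonitor rows above might translate into a resource like the one below. It is a sketch under stated assumptions: the cluster name, namespace, Secret name, ConfigMap name, interval, and timeout are hypothetical, and the operator may render a different but equivalent resource.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-cluster-metrics          # <cluster-name>-metrics, hypothetical cluster name
  namespace: openbao                     # hypothetical OpenBaoCluster namespace
  labels:
    app.kubernetes.io/name: openbao
    app.kubernetes.io/instance: example-cluster
    openbao.org/cluster: example-cluster
    openbao.org/component: metrics
    openbao.org/scrape-profile: Active
spec:
  selector:
    matchLabels:
      openbao.org/cluster: example-cluster
      openbao.org/component: metrics
  endpoints:
    - port: https-metrics
      scheme: https
      path: /v1/sys/metrics
      params:
        format:
          - prometheus
      interval: 30s                      # hypothetical
      scrapeTimeout: 10s                 # hypothetical
      authorization:
        type: Bearer
        credentials:                     # Secret in the same namespace as the ServiceMonitor
          name: example-cluster-metrics-token   # hypothetical
          key: token                            # hypothetical
      tlsConfig:
        ca:
          configMap:
            name: example-cluster-ca     # hypothetical CA bundle
            key: ca.crt
        serverName: example-cluster-metrics.openbao.svc   # hypothetical
```

If you use another collector, carry over the same port, path, `format=prometheus` parameter, and label set rather than this resource kind.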

Label contract

Use labels for stable source identity and routing. Do not use labels to expose request paths, secret paths, token accessors, entity identifiers, auth accessors, client addresses, or unbounded policy names.

Label	Expected value	Applies to	Purpose
`app.kubernetes.io/name`	`openbao`	Workload resources.	Identifies the application.
`app.kubernetes.io/instance`	`OpenBaoCluster` name.	Workload resources.	Identifies the cluster instance.
`app.kubernetes.io/managed-by`	`openbao-operator`	Operator-managed resources.	Identifies the lifecycle owner.
`openbao.org/cluster`	`OpenBaoCluster` name.	Operator-managed OpenBao resources.	Provides a stable cluster identity for selectors and dashboards.
`app.kubernetes.io/component`	`metrics` on metrics resources.	Metrics Service and scrape resources.	Distinguishes metrics exposure from API Services.
`openbao.org/component`	`metrics` on metrics resources.	Metrics Service and scrape resources.	Gives operator-owned resources an OpenBao-specific component label.
`openbao.org/scrape-profile`	`Active` or `AllNodes`.	Metrics Service and scrape resources.	Identifies the scrape profile.
`openbao.org/workload-pool`	`voter` or `read-replica` when used.	Workload pods and Services.	Separates quorum participants from read-replica pools.
`openbao-active`	`"true"` on the active OpenBao pod.	Active scrape selector.	Lets the active metrics Service target exactly one active pod.

Prometheus and Loki labels do not need to match every Kubernetes label. Promote only the bounded dimensions needed for routing, dashboards, and incident triage.
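
As an illustration of the label table, an Active metrics Service could carry the labels like this. The cluster name, namespace, and port number are assumptions; the label keys and values follow the table.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-cluster-metrics          # hypothetical cluster name
  namespace: openbao                     # hypothetical
  labels:
    app.kubernetes.io/name: openbao
    app.kubernetes.io/instance: example-cluster
    app.kubernetes.io/managed-by: openbao-operator
    app.kubernetes.io/component: metrics
    openbao.org/cluster: example-cluster
    openbao.org/component: metrics
    openbao.org/scrape-profile: Active
spec:
  type: ClusterIP
  selector:
    openbao.org/cluster: example-cluster
    openbao-active: "true"               # set on the active pod by service registration
  ports:
    - name: https-metrics
      port: 8200                         # assumed: the API listener port
      targetPort: 8200
```

Every label value here is bounded by the number of clusters and profiles, which is what keeps the label set safe to promote into Prometheus or Loki.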

Active scrape contract

The `Active` scrape profile is the production baseline.

Requirement	Contract
Target count	One target per OpenBao cluster.
Target selection	Active OpenBao pod only.
Authentication	Prefer a scoped OpenBao token with access to `sys/metrics`.
Listener	Use the API listener or a dedicated authenticated metrics-only listener.
Service shape	ClusterIP Service that does not publish not-ready addresses.
Best dashboard coverage	Overview, audit health, token and lease health, request health, and high-level HA state.
Known limitation	Standby, sealed-node, follower, read-replica, and per-node runtime detail is incomplete.

Use this profile when you need the lowest-risk metrics exposure model. It is also the right fallback when an older operator version or platform profile does not support a private all-node listener.
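
The target-count requirement above can be checked with a rule along these lines. This is a minimal sketch, assuming Prometheus Operator is in use and that targets carry its default `namespace` and `service` labels. The rule name, alert name, severity, and resource names are hypothetical, and this is not one of the alerts generated by this repository.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-cluster-active-scrape    # hypothetical
  namespace: openbao                     # hypothetical
spec:
  groups:
    - name: openbao-active-scrape.example
      rules:
        - alert: OpenBaoActiveScrapeTargetCount   # hypothetical alert name
          # Fires when the healthy target count is not exactly one,
          # or when no target series exists at all.
          expr: |
            sum(up{namespace="openbao", service="example-cluster-metrics"}) != 1
            or
            absent(up{namespace="openbao", service="example-cluster-metrics"})
          for: 5m
          labels:
            severity: warning            # hypothetical
          annotations:
            summary: Active scrape for example-cluster does not have exactly one healthy target.
```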

All-node scrape contract

The `AllNodes` scrape profile is an advanced profile for HA/Raft, runtime, and read-replica diagnostics.

Requirement	Contract
Target count	One target per selected OpenBao pod.
Target selection	All OpenBao pods in the selected workload pool.
Listener	Dedicated metrics-only listener on every selected pod.
Service shape	Headless Service with `publishNotReadyAddresses: true`.
Standby behavior	Standby metrics need a private metrics path that OpenBao permits on standby nodes.
Network control	Restrict the metrics listener to Prometheus or an equivalent collector path.
Best dashboard coverage	HA/Raft, runtime/storage, read-replica, sealed-node, standby, and per-node diagnostics.
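
The Service shape and relabeling requirements for this profile might look like the following. It is a sketch: the names, namespace, and metrics-only listener port `9101` are hypothetical, and authorization and TLS settings are left out for brevity.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-cluster-metrics          # hypothetical cluster name
  namespace: openbao                     # hypothetical
  labels:
    openbao.org/cluster: example-cluster
    openbao.org/component: metrics
    openbao.org/scrape-profile: AllNodes
spec:
  clusterIP: None                        # headless: one endpoint per pod
  publishNotReadyAddresses: true         # keep sealed or not-yet-ready pods scrapeable
  selector:
    app.kubernetes.io/name: openbao
    openbao.org/cluster: example-cluster
  ports:
    - name: https-metrics
      port: 9101                         # hypothetical metrics-only listener port
      targetPort: 9101
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-cluster-metrics
  namespace: openbao
spec:
  selector:
    matchLabels:
      openbao.org/cluster: example-cluster
      openbao.org/component: metrics
  endpoints:
    - port: https-metrics
      scheme: https
      path: /v1/sys/metrics
      params:
        format:
          - prometheus
      # authorization and tlsConfig omitted for brevity
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
```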

If you enable unauthenticated metrics access for all-node scraping, network isolation becomes part of the security boundary. Use NetworkPolicy, private routing, firewall rules, mTLS proxying, or sidecar-local scraping so ordinary clients cannot reach the metrics-only listener.
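
One way to express the NetworkPolicy option is sketched below. The monitoring namespace, the Prometheus pod label, and the port numbers are assumptions. Because a policy that selects pods isolates them for ingress, the sketch also keeps the API and cluster ports open; a real deployment would likely tighten that second rule or rely on policies the operator already manages.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-cluster-metrics-ingress  # hypothetical
  namespace: openbao                     # hypothetical
spec:
  podSelector:
    matchLabels:
      openbao.org/cluster: example-cluster
  policyTypes:
    - Ingress
  ingress:
    # Metrics-only listener: reachable only from Prometheus pods
    # in the monitoring namespace (both selectors must match).
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # hypothetical namespace
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus        # assumed collector label
      ports:
        - protocol: TCP
          port: 9101                     # hypothetical metrics-only listener port
    # API and cluster traffic stays reachable so this policy
    # does not cut off clients or Raft peers.
    - ports:
        - protocol: TCP
          port: 8200                     # assumed API port
        - protocol: TCP
          port: 8201                     # assumed cluster port
```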

Audit and log contract

The operator can configure audit devices, but the platform still owns log collection, audit archive delivery, retention, and access control.

Stream	Source	Contract
`openbao.operational`	OpenBao container logs.	Keep separate from operator controller logs.
`openbao.completed_requests`	OpenBao completed request logs when enabled.	Treat as temporary troubleshooting data, not as an audit-log replacement.
`openbao.audit`	OpenBao audit devices used for investigation.	Restrict access and preserve request/response entries for security workflows.
`openbao.audit_archive`	Compliance or long-term audit archive path.	Keep separate from short-term Loki or dashboard exploration.
Operator logs	Operator controller and provisioner logs.	Keep in operator-owned dashboards and runbooks.

For file audit devices, the audit path must be stable enough for the collector to tail after pod restarts. If the operator renders audit devices from `spec.audit[]`, the platform must still provision the volume, file permissions, collector mount, archive path, and access policy.
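
A declarative file audit device might be expressed as shown below. Treat every field under `spec.audit` as an assumption: this contract only requires that the operator can render the device, not this exact schema. The `apiVersion`, resource names, and file path are also hypothetical.

```yaml
apiVersion: openbao.org/v1alpha1        # assumed API group and version
kind: OpenBaoCluster
metadata:
  name: example-cluster                 # hypothetical
  namespace: openbao                    # hypothetical
spec:
  audit:
    - type: file                        # assumed field; file, syslog, socket, or HTTP per the contract
      path: file                        # assumed field: mount path of the audit device
      options:                          # assumed field: options passed to the device
        # Hypothetical stable path that the collector tails after pod restarts.
        file_path: /var/log/openbao/audit/audit.log
```

The volume behind that path, its permissions, and the collector mount are not part of this fragment and remain a platform responsibility.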

Leave ordinary OpenBao operational logs on stderr/stdout for Kubernetes workloads unless you deliberately mount and manage a writable log volume. A configured operational log file without a compatible mount can prevent OpenBao from starting. This is separate from file audit devices, which need explicit storage and access controls because they contain security records.

Dashboard contract

Use dashboard families that keep control-plane and workload questions distinct.

Dashboard family	Owner	Use it for
Operator dashboards	OpenBao Operator	Reconcile health, controller errors, backup freshness, restore state, upgrade progress, read-replica lifecycle, and CR status.
OpenBao workload dashboards	This repository	OpenBao availability, seal state, active node count, request latency, HA/Raft health, runtime pressure, token and lease pressure, audit health, and security investigation.
Platform dashboards	Platform team	Pod readiness, restarts, node pressure, PVC pressure, NetworkPolicy reachability, collector health, and Prometheus target health. Use the generated Kubernetes platform dashboard as the reference workload-context view.

When you link dashboards together, pass bounded context such as cluster, Kubernetes namespace, pod, node, scrape profile, and source prefix. Do not pass request paths, secret paths, token accessors, entity identifiers, auth accessors, or client addresses as dashboard variables.

Alert contract

Keep alert ownership clear even when a single incident involves both the operator and the OpenBao workload.

Alert class	Owner	First question
Operator lifecycle alerts	Operator or platform team.	Is the controller failing to converge the desired state?
OpenBao workload alerts	OpenBao service owner.	Is the OpenBao workload unhealthy after the desired state exists?
Security investigation alerts	Security or secrets platform responders.	Is there evidence of audit failure, risky activity, or missing security records?
Platform alerts	Platform team.	Is Kubernetes, storage, network, DNS, or collection infrastructure preventing either layer from working?

Runbooks can link across ownership boundaries, but the alert name and primary owner should not change. This keeps paging, escalation, and post-incident review clear.
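
If alerts carry an ownership label, Alertmanager routing can keep the primary owner fixed. The fragment below is a sketch: the `owner` label, its values, and the receiver names are hypothetical, and receiver integrations are omitted.

```yaml
route:
  receiver: platform-team                # default owner for unmatched alerts
  group_by: [alertname, namespace]
  routes:
    - matchers:
        - 'owner="openbao-operator"'     # operator lifecycle alerts
      receiver: platform-team
    - matchers:
        - 'owner="openbao-workload"'     # OpenBao workload alerts
      receiver: openbao-service-owner
    - matchers:
        - 'owner="security"'             # security investigation alerts
      receiver: security-responders
receivers:
  - name: platform-team
  - name: openbao-service-owner
  - name: security-responders
```

Routing on a dedicated label rather than on the alert name lets runbooks cross-link freely while the paging target stays the same.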

Minimum acceptance checklist

An operator-managed OpenBao cluster fits this contract when all of these checks pass:

The OpenBao workload has Prometheus telemetry enabled.
The metrics source prefix is documented as `vault` or `openbao`.
The active scrape exposes exactly one healthy target per cluster.
The all-node scrape, when enabled, exposes one target per selected OpenBao pod and preserves pod and node labels.
The metrics listener is authenticated, privately reachable, or both.
The metrics Service and scrape resource carry stable cluster and scrape profile labels.
OpenBao operational logs and operator logs land in different streams.
Audit logs land in a restricted stream and have a separate archive decision.
Generated Prometheus, Loki, and Grafana artifacts from this repository are validated against the operator-managed staging cluster.
Operator control-plane alerts and OpenBao workload alerts route to owners who can act on their first diagnostic step.

Compatibility notes

Raw OpenBao metrics do not expose a consistent cluster label across every metric family and deployment profile. Use the generated recording rules from this repository before you make dashboards or alerts depend on normalized cluster-level signals.

OpenBao deployments commonly emit `vault_*` metrics unless you configure an `openbao` metrics prefix. This repository generates artifacts for both source prefixes.
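
To illustrate how two source prefixes can feed one signal, a recording rule can take whichever series exists. This is only a sketch of the idea: the rule and group names are hypothetical and do not reflect the rules this repository actually generates, and the `openbao_core_unsealed` name assumes the configured prefix replaces `vault` directly.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-prefix-normalization     # hypothetical
  namespace: openbao                     # hypothetical
spec:
  groups:
    - name: openbao-prefix-normalization.example
      rules:
        - record: openbao:core_unsealed  # hypothetical normalized series name
          # Use the vault-prefixed series when present,
          # otherwise fall back to the openbao-prefixed one.
          expr: vault_core_unsealed or openbao_core_unsealed
```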

The local OpenBao fixture validates basic Raft non-voter behavior with one read replica. Operator-managed read replicas still need live Kubernetes validation before you page on operator-specific role labels. Use all-node scraping for diagnosis, then keep quorum alerts separate from read-capacity alerts.

What’s next

Use OpenBao Operator companion profile to understand the high-level companion model.
Use Operator-managed OpenBao examples when you need patch-based examples for active scraping, all-node scraping, declarative audit devices, and generated artifact adoption.
Use Configure a secure metrics scrape for the active scrape baseline.
Use Configure an all-node metrics scrape for HA/Raft and read-replica diagnostics.
Use Metric compatibility matrix before you rely on feature-specific metrics.
Use Namespaces and scale observability before you expose namespace or read-replica dimensions broadly.
Use OpenBao Kubernetes platform dashboard for workload pod, PVC, node, collector, and Kubernetes event context.
Use Loki label strategy for OpenBao before you promote log fields to labels.