Make the cluster boring before you call it production.
Use this checklist after the first cluster path succeeds and before teams depend on the service. The goal is to lock down the security posture, protect data, prove observability, and confirm that the operator reports a clean converged state.
Decision matrix
Production gates
| Gate | What must be true | Why it matters | Go deeper |
|---|---|---|---|
| Security posture | The cluster runs the hardened path: secure profile, external seal, deliberate TLS, and self-init with real auth methods. | The defaults that help evaluation can become long-lived risk in production. | Security profiles, self-init, and workload TLS configuration. |
| Durability | Storage, replica count, and scheduled backups are all deliberate and already tested. | Upgrades, restore workflows, and voter recovery all assume the data path is stable. | Backups, storage, and topology spread. |
| Observability | Metrics, logs, and alerts reach the systems operators actually watch. | Incidents are slower and riskier when the first debugging session starts after go-live. | Observability and network egress configuration. |
| Cluster readiness | The status conditions show healthy convergence and no unresolved integration blockers. | A production launch should start from a stable status surface, not an optimistic assumption. | Use the final verification commands on this page. |
Lock down the security baseline
- Set `spec.profile: Hardened` so the workload starts from the strict controller posture rather than the evaluation defaults.
- Use a non-static external seal such as Transit, cloud KMS, `ocikms`, `kmip`, or `pkcs11`. Do not keep long-lived unseal keys in Kubernetes Secrets for the production path.
- Confirm your Kubernetes cluster already encrypts Secrets at rest. The operator cannot compensate for an unencrypted control plane.
- Use `ACME` or `External` TLS for public or shared edges. Avoid `OperatorManaged` certificates for public-facing production entry points.
- Enable `spec.selfInit` and configure real user authentication in `spec.selfInit.requests` so the first operator-driven bootstrap does not end in a lockout.
- If you rely on operator lifecycle auth for backups and upgrades, enable `spec.selfInit.oidc.enabled: true` or deliberately provision the equivalent JWT roles yourself.
A cluster that initializes successfully is not automatically ready for production. The production gate is the combination of security hardening, backup readiness, and clean status conditions, not the fact that pods started once.
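The baseline above can be sketched as a single cluster spec. This is an illustrative fragment, not a canonical manifest: only `spec.profile`, `spec.selfInit`, `spec.selfInit.requests`, and `spec.selfInit.oidc.enabled` are field paths named on this page, while the `apiVersion`, the seal block, and the TLS block are assumed shapes you should check against the CRD reference.

```yaml
# Hypothetical OpenBaoCluster manifest illustrating the hardened baseline.
# apiVersion and the seal/tls field names are assumptions about the CRD shape.
apiVersion: openbao.org/v1alpha1   # assumption: use the version your installed CRD serves
kind: OpenBaoCluster
metadata:
  name: prod-bao
  namespace: bao-prod
spec:
  profile: Hardened                # strict controller posture, not evaluation defaults
  seal:
    type: transit                  # non-static external seal (Transit, cloud KMS, ocikms, kmip, pkcs11)
  tls:
    mode: External                 # ACME or External for public or shared edges
  selfInit:
    enabled: true
    oidc:
      enabled: true                # operator lifecycle auth for backups and upgrades
    requests: []                   # configure real user auth methods here to avoid a lockout
```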
Enforce the tenant guardrails
- Verify the ValidatingAdmissionPolicies and related guardrails are installed and enforced, including:
  - `openbao-validate-openbaocluster`
  - `openbao-validate-openbao-tenant`
  - `openbao-validate-openbaorestore`
  - `openbao-lock-controller-statefulset-mutations`
  - `openbao-lock-managed-resource-mutations`
  - `openbao-enforce-managed-image-digests`
  - `openbao-restrict-provisioner-rbac`
  - `openbao-restrict-provisioner-namespace-mutations`
  - `openbao-restrict-provisioner-tenant-governance`
  - `openbao-restrict-controller-rbac`
  - `openbao-restrict-controller-secret-writes`
- Confirm that the operator namespace, tenant onboarding flow, and shared-controller trust boundaries match the tenancy model you chose during Get Started.
Inspect the control-plane baseline

```shell
kubectl get validatingadmissionpolicy | grep openbao
kubectl get deploy -n <operator-namespace>
kubectl get openbaotenant -A
```
The exact number of policies and controller Deployments depends on the features you enabled, but the OpenBao guardrail set should be visible before you bring real tenants onto the platform.
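To make that check repeatable, a sketch of a script that fails fast when any of the listed guardrail policies is absent. The policy names come from the list above; the script itself is illustrative and requires cluster access:

```shell
#!/bin/sh
# Fail if any expected OpenBao guardrail policy is absent from the cluster.
set -eu
for policy in \
  openbao-validate-openbaocluster \
  openbao-validate-openbao-tenant \
  openbao-validate-openbaorestore \
  openbao-lock-controller-statefulset-mutations \
  openbao-lock-managed-resource-mutations \
  openbao-enforce-managed-image-digests \
  openbao-restrict-provisioner-rbac \
  openbao-restrict-provisioner-namespace-mutations \
  openbao-restrict-provisioner-tenant-governance \
  openbao-restrict-controller-rbac \
  openbao-restrict-controller-secret-writes
do
  kubectl get validatingadmissionpolicy "$policy" >/dev/null \
    || { echo "missing guardrail: $policy" >&2; exit 1; }
done
echo "all guardrail policies present"
```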
Make the cluster durable
- Set explicit CPU and memory `requests` and `limits`. A cluster that only works under zero pressure is not production-ready.
- Choose a low-latency StorageClass and set `spec.storage.storageClassName` explicitly for new clusters. The effective storage class is not something you want to discover by accident after PVC creation.
- Use at least three replicas for a highly available Raft cluster and verify the Kubernetes nodes span the intended zones or failure domains.
- Configure scheduled backups and test a restore path before the first risky upgrade.
- Confirm `spec.network.egressRules` allow the cluster to reach the services it really depends on: cloud KMS, OIDC discovery, backup storage, and any external gateway edges.
Treat backup success and restore confidence as part of the launch checklist, not as follow-up work for a later sprint.
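The durability items above can be sketched as one spec fragment. Only `spec.storage.storageClassName` and `spec.network.egressRules` are field paths named on this page; the replica, resource, backup, and egress-rule shapes below are assumptions to be checked against the API reference.

```yaml
# Hypothetical durability settings for an OpenBaoCluster spec.
spec:
  replicas: 3                           # three-node Raft, spread across failure domains
  resources:
    requests: { cpu: "1", memory: 2Gi } # explicit requests: no accidental best-effort pods
    limits:   { cpu: "2", memory: 4Gi }
  storage:
    storageClassName: fast-ssd          # set explicitly; never rely on the cluster default
  backup:                               # assumed field shape for scheduled backups
    schedule: "0 */6 * * *"             # cron; test a restore before the first risky upgrade
  network:
    egressRules:                        # assumed rule shape: reach KMS, OIDC, backup storage
      - host: kms.example.internal
        port: 443
```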
Prove observability and operational response
- Configure metrics scraping through Prometheus Operator (`ServiceMonitor`) or VictoriaMetrics Operator (`VMServiceScrape`).
- Grant the scraping identity permission to read `/metrics` and keep TLS verification strict in production.
- Make sure structured logs including `cluster_name` and `cluster_namespace` reach the log system your operators actually use.
- Alert on backup staleness, degradation, reconciliation failures, and other conditions that should wake a human before tenants feel the failure.
Verify the cluster before routing traffic
Inspect the final readiness surface

```shell
kubectl describe openbaocluster <name> -n <namespace>
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{.status.phase}{"\n"}{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```
Run both commands from the target namespace so you can see the reconciler status, recent events, and the final condition set in one pass.
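If the launch is driven by a pipeline, `kubectl wait` can block on the same status surface instead of polling by hand. The condition names come from this page; the timeout is illustrative:

```shell
# Block until the operator reports the production gates; fail the pipeline otherwise.
kubectl wait openbaocluster/<name> -n <namespace> \
  --for=condition=Available --timeout=10m
kubectl wait openbaocluster/<name> -n <namespace> \
  --for=condition=ProductionReady --timeout=10m
```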
Reference table
Signals to see before go-live
| Signal | Healthy state | Why it is important |
|---|---|---|
| Phase | Running | The cluster has converged past bootstrap and is not stuck in an intermediate lifecycle state. |
| Available | True | The workload is up and the operator believes the service is available to consumers. |
| ProductionReady | True | This is the clearest signal that the cluster passed the production-readiness gate. |
| Integration-specific conditions | Healthy for the features you enabled, such as CloudUnsealIdentityReady, GatewayIntegrationReady, APIServerNetworkReady, or BackupConfigurationReady. | These conditions expose dependency problems that may not show up as plain pod readiness failures. |
Continue operating
This version tracks a prerelease build. Features and behavior may change before the next stable release.