Start with the symptom, then decide whether you need a fix or a recovery path.
Most day-2 incidents begin as configuration drift, edge integration failures, or capability mismatches. Use this page to capture the failing surface, route the symptom to the right fix, and escalate into recovery only when normal convergence is no longer realistic.
Inspect
Capture the failure surface first
```shell
kubectl describe openbaocluster <name> -n <namespace>
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>
```
Start here before you jump into a fix. The cluster status, recent events, and the specific pod failures usually tell you whether the problem is inside the workload, at the edge, or in cluster policy.
Decision matrix
Choose the first troubleshooting route
| Symptom | Start here | Likely surface | Escalate when |
|---|---|---|---|
| Probe failures with x509 or hostname mismatch | Check the service hostname, SNI source, and the SANs on the serving certificate. | TLS and probe configuration. | Pods still cannot become Ready after the hostname and certificate path are corrected. |
| ACME challenge or join failures | Check DNS, SAN coverage, and whether the chosen ACME endpoint is actually reachable from the right network. | ACME domain planning or private/public edge reachability. | The cluster cannot issue or validate certs and the workload remains blocked behind the edge. |
| Gateway integration stays degraded | Inspect passthrough mode, Gateway listener compatibility, and Gateway programming state. | Gateway API integration. | Traffic still cannot reach the workload after the Gateway listener model is corrected. |
| Kubernetes API calls fail in a hardened cluster | Check apiServerCIDR, apiServerEndpointIPs, and the actual egress behavior of your CNI. | Cluster network policy and API egress. | The operator cannot reconcile core resources because it cannot talk to the Kubernetes API reliably. |
| Node hardening or capability mismatch | Check AppArmor and other workload hardening assumptions against the cluster capabilities. | Node capability mismatch. | The runtime cannot support the hardened profile you selected and the workload cannot start cleanly. |
Common failure routes
1. Probe TLS failures: readiness or liveness probes fail because the serving cert does not match the name the probe actually uses.
2. ACME domain and reachability: certificates do not issue, the private ACME name does not resolve, or the public CA cannot hit the endpoint you exposed.
3. Gateway passthrough mismatch: the Gateway listener mode or controller capability does not match the TLS mode the cluster expects.
4. Kubernetes API egress: the hardened network policy path blocks the controller or workload from reaching the API server.
5. Node hardening mismatch: AppArmor or another hardening requirement is not available on the target cluster.
6. Switch to recovery: normal troubleshooting is no longer enough because leadership, restore, or rollback paths are now in play.
Probe TLS failures
Symptom
Pod events show probe failures such as `x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs`.
What it usually means
The serving certificate includes DNS SANs only, but the probe is connecting through loopback or another name the cert does not cover.
Check first
- `spec.gateway.hostname`
- `spec.ingress.host`
- whether an external Service exists
- whether the selected hostname is present in the certificate SANs
Fix
Make sure the operator has a real hostname it can use for probe SNI and that the certificate includes that hostname. If the certificate comes from an external PKI, reissue it with the correct SAN set instead of weakening probe verification.
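As a concrete sketch, pinning the probe SNI hostname might look like the fragment below. The `spec.gateway.hostname` field comes from the checklist above; the hostname value is an illustrative placeholder, not a confirmed default:

```yaml
spec:
  gateway:
    # Assumption: this name must appear as a DNS SAN on the serving
    # certificate, so probes verify against it instead of 127.0.0.1
    hostname: openbao.example.internal
```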
ACME domain and reachability
Private ACME CA does not resolve
If `ConditionDegraded=True` with reason `ACMEDomainNotResolvable`, the configured `spec.tls.acme.domains` likely does not resolve from cluster DNS.
Use an internal domain such as `<cluster>-acme.<namespace>.svc` for private in-cluster ACME issuers, and make sure CoreDNS understands any non-`.svc` override you chose for local clusters.
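A minimal sketch of the private-issuer domain planning, following the `<cluster>-acme.<namespace>.svc` pattern described above (the cluster and namespace names are placeholders):

```yaml
spec:
  tls:
    acme:
      # Internal name that cluster DNS (CoreDNS) can resolve
      domains:
        - mycluster-acme.openbao.svc
```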
ACME join or certificate verification fails
If Raft join shows `certificate signed by unknown authority` or server-name mismatch errors:
- make sure `spec.tls.acme.domains` contains names that are actually present in the issued cert
- for private ACME CAs, set `spec.configuration.acmeCARoot`
- mount `pki-ca.crt` alongside that root so the operator can use it for `retry_join` and probe verification
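Put together, a hedged sketch of the private-CA settings from the list above. Only the field names follow this page; the domain and the mount path for `acmeCARoot` are illustrative assumptions:

```yaml
spec:
  tls:
    acme:
      # Names here must match SANs in the issued certificate
      domains:
        - mycluster-acme.openbao.svc
  configuration:
    # Root of the private ACME CA; mount pki-ca.crt alongside it so the
    # operator can use it for retry_join and probe verification
    acmeCARoot: /etc/openbao/tls/acme-ca.crt
```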
Public ACME cannot reach the endpoint
If the logs show `Timeout during connect`, secondary validation errors, or repeated `tls-alpn-01` failures:
- expose the hardened hostname on a dedicated public passthrough listener on port `443`
- do not source-restrict the ACME validation path to a single client IP
- keep restricted admin edges separate from the public challenge edge if you need both
Gateway passthrough mismatch
If `ConditionDegraded=True` with reason `ACMEGatewayNotConfiguredForPassthrough`, or `GatewayIntegrationReady=False` with reasons such as `GatewayListenerIncompatible`, `GatewayFeatureUnsupported`, or `GatewayNotProgrammed`, start at the Gateway listener model.
For `tls.mode: ACME`:
- set `spec.gateway.tlsPassthrough: true`
- make sure the referenced Gateway has a `TLS` listener with `tls.mode: Passthrough`
- verify the Gateway controller actually supports the required route feature
TLS termination at the Gateway breaks OpenBao's ability to complete ACME challenges. Fix the passthrough model first instead of debugging the workload Pods.
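A matching Gateway listener, sketched with the standard Gateway API types (the class name, namespace, and hostname are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: openbao-gateway
  namespace: openbao
spec:
  gatewayClassName: example-gateway-class   # placeholder: your controller's class
  listeners:
    - name: openbao-passthrough
      protocol: TLS
      port: 443
      hostname: openbao.example.com
      tls:
        # Passthrough hands the TLS stream to OpenBao unterminated, which
        # keeps tls-alpn-01 ACME challenges working end to end
        mode: Passthrough
```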
Kubernetes API egress
If `APIServerNetworkReady=False` with reason `APIServerNetworkConfigurationInvalid`, or if API calls fail in a hardened cluster:
- set `spec.network.apiServerCIDR` when the in-cluster service VIP cannot be discovered or you want an explicit allow-list
- add `spec.network.apiServerEndpointIPs` when your CNI enforces egress on post-DNAT traffic and the service-VIP path still fails
- treat `APIServerEndpointIPsRecommended` as advice, not automatically as a hard failure
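As a sketch, an explicit API-egress configuration using the fields above. The CIDR and endpoint IPs are illustrative values for a typical cluster, not defaults; substitute what `kubectl get endpoints kubernetes` reports for your environment:

```yaml
spec:
  network:
    # Allow-list for the kube-apiserver service VIP range
    apiServerCIDR: 10.96.0.0/16
    # Post-DNAT apiserver endpoint IPs, for CNIs that enforce egress
    # after DNAT rewrites the service VIP
    apiServerEndpointIPs:
      - 172.18.0.2
      - 172.18.0.3
```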
Inspect
Inspect the current API-network conditions
```shell
kubectl get openbaocluster <name> -n <namespace> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\t"}{.reason}{"\n"}{end}'
```
This gives you the condition reasons in one pass so you can see whether the controller is failing on service-VIP discovery, endpoint-IP guidance, or a more direct API connectivity problem.
Node hardening mismatch
If `ConditionNodeSecurityCapabilityMismatch=True` with reason `AppArmorUnsupported`, the hardened profile expects a node capability your environment does not provide.
For evaluation or constrained dev clusters only, you can disable AppArmor:
Configure
Disable AppArmor for unsupported dev clusters
```yaml
spec:
  workloadHardening:
    appArmorEnabled: false
```
Do not use this as a quiet production workaround. The right production fix is to run on nodes that satisfy the hardening profile you selected.
Switch to recovery
Normal troubleshooting stops being the right tool when the incident is no longer about a single configuration defect and is instead about returning the service to a safe, known state.
Reference table
When to move from troubleshooting into recovery
| Trigger | Go next | Why |
|---|---|---|
| The cluster cannot elect or keep a leader | No-leader recovery | This is no longer just a config issue. You need a controlled Raft recovery path. |
| The cluster remains sealed and the normal unseal path is not recovering it | Sealed-cluster recovery | The incident has crossed into availability recovery, not ordinary convergence repair. |
| A blue-green or rollback path failed and the controller needs operator action | Failed rollback recovery | Version change failure modes need a dedicated incident path to avoid making the situation worse. |
| The fastest safe recovery is to restore from a snapshot | Restore operations | You are now operating a destructive recovery workflow and should stop iterating on local fixes. |
External references
Escalate or go deeper
This version tracks a prerelease build. Features and behavior may change before the next stable release.