Troubleshooting
This page maps common symptoms to likely causes and recovery steps. For the comprehensive failure-mode catalog with detection signals, mitigations, and impact analysis, see Architecture: Failure Modes.
Start with the least destructive checks:
curl -sf http://127.0.0.1:8082/live
curl -sf http://127.0.0.1:8082/ready
curl -sf http://127.0.0.1:8081/metrics | grep -E 'openbao_kms_status_key_id_hash|openbao_kms_status_cache_age_seconds'
bao-kms-provider doctor \
--config /etc/openbao-kms/config.yaml \
--encryption-config /etc/kubernetes/encryption-config.yaml
Do not change identity-bearing fields, recreate Transit keys, or change Kubernetes encryption configuration until the failing layer is known.
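The read-only checks above can be wrapped so that each layer reports pass or fail without stopping at the first error. This is a minimal sketch: run_check and triage are hypothetical helper names, not part of bao-kms-provider, and the commands are the ones already listed on this page.

```shell
#!/bin/sh
# Run one read-only check and report the result without aborting the script.
# run_check and triage are illustrative helpers, not provider commands.
run_check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok: $name"
    else
        echo "FAIL: $name"
        return 1
    fi
}

# Run every check in order and return non-zero if any of them failed.
triage() {
    rc=0
    run_check "plugin liveness" curl -sf http://127.0.0.1:8082/live || rc=1
    run_check "plugin readiness" curl -sf http://127.0.0.1:8082/ready || rc=1
    run_check "provider doctor" bao-kms-provider doctor \
        --config /etc/openbao-kms/config.yaml \
        --encryption-config /etc/kubernetes/encryption-config.yaml || rc=1
    return $rc
}

# Usage on a control-plane node:
#   triage
```

The first FAIL line is usually the layer to investigate before touching configuration or keys.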
API Server Cannot Connect To KMS
Symptoms:
- kube-apiserver logs mention KMS endpoint connection failure,
- /run/openbao-kms/kms.sock is missing,
- the plugin service or static pod is not running.
Check:
systemctl status bao-kms-provider.service
journalctl -u kubelet --since -10m
crictl ps --name bao-kms-provider
ls -l /run/openbao-kms
bao-kms-provider doctor --config /etc/openbao-kms/config.yaml
Use the systemd command for host-service deployments. Use kubelet and container-runtime tooling for static-pod deployments.
Recovery:
- Start or restart the plugin.
- Fix socket directory ownership and mode (see Deployment: Linux Identity Model).
- Confirm the API server endpoint path matches server.socketPath in the provider configuration.
- Restart kube-apiserver if it does not reconnect.
Socket Permission Denied
Symptoms:
- the socket exists,
- kube-apiserver cannot connect,
- permission denied errors appear in API server or plugin logs.
Check:
ls -ld /run/openbao-kms
ls -l /run/openbao-kms/kms.sock
getent group openbao-kms-socket
Recovery:
- Ensure the API server runtime identity is a member of openbao-kms-socket.
- Set the runtime directory group to openbao-kms-socket and mode 2750.
- In static-pod mode, ensure the numeric socket group GID matches spec.securityContext.supplementalGroups and server.socketGroup.
- Set the socket mode to 0660.
- Restart the plugin.
- Restart kube-apiserver if it does not reconnect.
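Before changing anything, the modes and group named in the recovery steps can be verified read-only. The following is a sketch under stated assumptions: it relies on a Linux stat that supports -c (GNU coreutils or BusyBox), and the helper names normalize_mode, has_mode, has_group, and check_socket_permissions are hypothetical.

```shell
#!/bin/sh
# Read-only comparison of the runtime directory and socket against the
# modes and group named in this section. Helper names are illustrative.

# Normalize an octal mode string so that "0660" and "660" compare equal.
normalize_mode() {
    printf '%o\n' "0$1"
}

# Succeed if the path in $1 has exactly the octal mode in $2.
has_mode() {
    path=$1
    expected=$2
    actual=$(stat -c '%a' "$path") || return 2
    [ "$(normalize_mode "$actual")" = "$(normalize_mode "$expected")" ]
}

# Succeed if the path in $1 is owned by the group name in $2.
has_group() {
    [ "$(stat -c '%G' "$1")" = "$2" ]
}

# Report every mismatch and return non-zero if any was found.
check_socket_permissions() {
    dir=/run/openbao-kms
    sock=$dir/kms.sock
    group=openbao-kms-socket
    rc=0
    has_mode "$dir" 2750 || { echo "directory mode is not 2750: $dir"; rc=1; }
    has_group "$dir" "$group" || { echo "directory group is not $group: $dir"; rc=1; }
    has_mode "$sock" 0660 || { echo "socket mode is not 0660: $sock"; rc=1; }
    return $rc
}

# Usage on a control-plane node:
#   check_socket_permissions
```

The sketch only reads metadata; it does not verify group membership of the API server identity, which still needs the getent check above.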
OpenBao Unavailable Or Sealed
Symptoms:
- plugin /ready returns non-200,
- KMS Status is unhealthy,
- OpenBao request errors appear in metrics,
- decrypt or encrypt operations time out.
Check:
bao status
curl -sf http://127.0.0.1:8082/ready
bao-kms-provider doctor --config /etc/openbao-kms/config.yaml
Recovery:
- Restore OpenBao reachability.
- Unseal or repair OpenBao.
- Verify TLS and DNS.
- Run bao-kms-provider verify-key --config /etc/openbao-kms/config.yaml.
- Restart the plugin only if it does not recover on its own after OpenBao is healthy.
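To give the plugin a bounded amount of time to recover on its own once OpenBao is healthy, the readiness probe can be polled before deciding to restart. This is a minimal sketch: wait_until is a hypothetical helper, and the attempt count and interval in the usage comment are arbitrary example values.

```shell
#!/bin/sh
# Poll a probe command until it succeeds or the attempts run out.
# wait_until is an illustrative helper, not part of bao-kms-provider.
#   $1 = maximum attempts, $2 = seconds between attempts, rest = probe command
wait_until() {
    attempts=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
    done
    return 1
}

# Usage (example values, about one minute in total):
#   wait_until 30 2 curl -sf http://127.0.0.1:8082/ready \
#       || echo "plugin still not ready; a restart may be needed"
```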
Transit Profile Fails Closed
Symptoms:
- plugin /ready returns non-200,
- KMS Status is unhealthy,
- doctor or verify-key reports transit.profile failure,
- API server writes fail while existing reads may continue from cache.
Check:
bao-kms-provider verify-key --config /etc/openbao-kms/config.yaml
bao-kms-provider doctor --config /etc/openbao-kms/config.yaml
Recovery:
- Read the finding impact prefix.
- For cryptographic_safety, restore the validated Transit profile before routing Kubernetes writes through the provider.
- For api_server_availability, repair the setting that prevents required encrypt or decrypt operations, such as key deletion, unsupported operations, or version restrictions.
- Re-run verify-key.
- Confirm /ready and KMS Status return healthy before restarting or reloading kube-apiserver.
Fail-closed behavior prevents new encryption under unvalidated settings, but it can also make API server writes unavailable until OpenBao metadata is repaired.
Auth Login Fails
Symptoms:
- OpenBao auth errors appear in plugin logs,
- plugin /ready returns non-200,
- token refresh failures.
Check:
- for JWT auth, the JWT file exists and is readable by the plugin process,
- for JWT auth, the JWT exp claim is not near expiry,
- for JWT auth, iss, aud, and sub claims match the OpenBao role configuration,
- for JWT auth, OpenBao has the current signing keys through JWKS, OIDC discovery, or pinned public keys,
- for certificate auth, the OpenBao listener requests client certificates,
- for certificate auth, the certificate is not expired, has client-auth usage, and matches the configured role constraints,
- for PKCS#11 auth, the module path, token label, key label, and PIN file are correct,
- host, OpenBao, issuer, and CA clocks are synchronized.
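The exp check for JWT auth can be done without extra tooling by decoding the token payload. The sketch below assumes base64 -d and grep -o are available (GNU or BusyBox). jwt_exp and jwt_seconds_left are hypothetical helpers, and the JWT file path in the usage comment is a placeholder, not a documented default. It reads the claim only and does not verify the signature.

```shell
#!/bin/sh
# Print the exp claim (Unix seconds) of the JWT passed as $1.
# Illustrative helper; it does not validate the token signature.
jwt_exp() {
    # The payload is the second dot-separated segment, base64url-encoded.
    payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
    # base64url omits padding; restore it so base64 -d accepts the input.
    case $(( ${#payload} % 4 )) in
        2) payload="${payload}==" ;;
        3) payload="${payload}=" ;;
    esac
    printf '%s' "$payload" | base64 -d 2>/dev/null |
        grep -oE '"exp"[[:space:]]*:[[:space:]]*[0-9]+' |
        head -n 1 |
        grep -oE '[0-9]+$'
}

# Print the seconds remaining before expiry (negative if already expired).
jwt_seconds_left() {
    exp=$(jwt_exp "$1")
    [ -n "$exp" ] || return 1
    echo $(( exp - $(date +%s) ))
}

# Usage (placeholder path):
#   jwt_seconds_left "$(cat /path/to/plugin.jwt)"
```

A small or negative result points at token issuance or clock skew rather than at OpenBao role constraints.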
Recovery:
- Replace or restore the configured auth material.
- Fix OpenBao auth role constraints if they are wrong.
- Fix issuer, JWKS, OIDC discovery, certificate authority, or PKCS#11 reachability.
- Restart the plugin if the current in-memory token does not recover. The provider re-reads auth material before re-login.
Transit Key Missing
Symptoms:
- verify-key fails,
- OpenBao metadata read returns not found,
- decrypt or encrypt operations fail.
Recovery:
- Confirm the Transit mount path and key name match the provider configuration.
- Confirm the OpenBao namespace if applicable.
- Confirm the token policy grants metadata read on the configured key path.
- If the key was deleted, restore the OpenBao backup containing the original key. See Disaster Recovery: Transit Key Loss.
Do not recreate the key with the same name and expect old data to decrypt. Recreated keys produce a new lineage; old ciphertext is bound to the previous lineage.
Unknown Key ID
Symptoms:
- decrypt is rejected before the Transit call,
- the metric openbao_kms_decrypt_key_id_errors_total increases,
- old Kubernetes objects fail to read after a configuration change.
Likely causes:
- provider name changed,
- cluster ID changed,
- OpenBao instance ID changed,
- Transit mount ID changed,
- key lineage ID changed,
- the local key registry state or checkpoint is missing, corrupted, or rolled back,
- the object was encrypted by a different provider.
Recovery:
- Restore the original identity-bearing configuration; see Configuration: Identity-Bearing Fields.
- Restore the key registry state file and checkpoint if they were lost.
- Verify active and historical key snapshots are present.
- Restart the plugin.
- Retry the Kubernetes read.
After Transit rotation, current preview releases do not support synthesizing replacement registry state. Restore the state/checkpoint pair from backup or a known-good peer with matching identity scope; otherwise the provider fails closed.
Transit Version Creation Time Changed
Symptoms:
- /ready, doctor, or verify-key reports that Transit version creation time changed,
- KMS Status is unhealthy and does not publish a key_id,
- OpenBao was restored, imported, or modified before the failure.
The provider stores the first observed Transit version creation time in local state and compares it with live OpenBao metadata after Unix-second normalization. Sub-second precision differences are tolerated, but a different Unix second is treated as identity drift.
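The comparison rule can be illustrated with two timestamps that differ only below one second. This sketch is not the provider's implementation. It assumes both values are RFC 3339 strings in UTC with a Z suffix, and the helper names truncate_to_second and same_unix_second are hypothetical.

```shell
#!/bin/sh
# Illustrates the comparison rule only; assumes RFC 3339 UTC timestamps.

# Drop the fractional-seconds part:
#   2024-05-01T10:00:00.123Z -> 2024-05-01T10:00:00Z
truncate_to_second() {
    printf '%s\n' "$1" | sed -E 's/\.[0-9]+//'
}

# Succeed when both timestamps fall in the same whole second.
same_unix_second() {
    [ "$(truncate_to_second "$1")" = "$(truncate_to_second "$2")" ]
}

# Usage:
#   same_unix_second 2024-05-01T10:00:00.123Z 2024-05-01T10:00:00.987Z  # tolerated
#   same_unix_second 2024-05-01T10:00:00.999Z 2024-05-01T10:00:01.000Z  # drift
```

The second usage line differs by one millisecond but crosses a second boundary, which is the case treated as identity drift.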
Recovery:
- Confirm the OpenBao backup contains the original Transit key material and metadata.
- Restore the matching provider state file and checkpoint.
- Confirm the configured Transit key lineage ID still identifies the restored key lineage.
- Run bao-kms-provider verify-key --config /etc/openbao-kms/config.yaml.
- Keep the provider stopped if the original metadata cannot be restored.
Do not edit key_id, creation timestamps, or local state by hand to force a
match. That can make Kubernetes objects reference a key epoch that cannot
decrypt them.
Intermediate Transit Version Metadata Missing
Symptoms:
- rotation-plan, verify-rotation, /ready, or startup reports invalid Transit metadata,
- the reported Transit latest_version jumped over one or more versions,
- KMS Status is unhealthy and does not publish a key_id.
Likely causes:
- multiple Transit rotations happened before every control-plane node converged,
- OpenBao restore or import omitted an intermediate version creation timestamp,
- Transit metadata is temporarily inconsistent during restore.
Recovery:
- Stop further Transit rotations.
- Keep all existing Transit versions decryptable.
- Restore compatible OpenBao metadata that includes every intermediate version creation timestamp at Unix-second precision.
- Restore the provider state file and checkpoint from backup or a known-good peer if local state was lost.
- Run bao-kms-provider rotation-plan --config /etc/openbao-kms/config.yaml on every control-plane node.
- Resume only after every node reports a healthy, converged active key_id hash.
Do not synthesize intermediate snapshots by hand. If the provider cannot prove the skipped version identities from OpenBao metadata and local state, it fails closed to avoid advertising a registry that might not decrypt Kubernetes data.
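When triaging a jump in latest_version, it helps to list exactly which versions lack metadata. The following sketch reads version numbers that do have a creation timestamp and prints the ones missing between the lowest and highest seen. find_version_gaps is a hypothetical helper, and how the version list is extracted from OpenBao metadata is deliberately left out. The output is diagnostic only and is not input for rebuilding registry state.

```shell
#!/bin/sh
# Print every version number missing between the lowest and highest
# version read from stdin (one number per line, any order).
# Illustrative helper; it reports gaps and does not repair anything.
find_version_gaps() {
    sort -n | awk '
        NR == 1 { prev = $1; next }
        {
            # Emit each number skipped between consecutive versions.
            for (v = prev + 1; v < $1; v++) print v
            prev = $1
        }
    '
}

# Usage:
#   printf '1\n2\n4\n' | find_version_gaps
```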
AAD Mismatch
Symptoms:
- decrypt rejects the object with an AAD error,
- the Transit decrypt call returns an authentication failure if validation reaches Transit.
Likely causes:
- annotations were modified or corrupted,
- provider, cluster, or key scope changed,
- a bug in canonical AAD serialization.
Recovery:
- Do not disable AAD globally.
- Compare object annotations with the expected key snapshot hashes.
- Restore the correct configuration.
- Do not modify code or local state to bypass AAD; that is unsafe as an incident response.
- File a bug if canonical serialization changed unexpectedly.
Status Key ID Differs From Encrypt Key ID
Symptoms:
- the API server marks the plugin unhealthy,
- encrypt responses are discarded by the API server,
- KMS v2 conformance tests fail.
Likely causes:
- a race in active key snapshot handling,
- a rotation promotion bug,
- multiple plugin instances running with inconsistent configuration,
- Transit metadata observed inconsistently between probes.
Recovery:
- Stop any in-progress rotation; see Operations: Rotation.
- Compare configuration on every control-plane node.
- Compare plugin versions across nodes.
- Restart the affected plugin instance.
- Roll back the plugin only if the older version supports the current key_id and AAD formats.
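Comparing configuration or plugin versions across nodes reduces to checking whether one value per node is identical everywhere. The sketch below counts distinct values. count_distinct_values is a hypothetical helper, the node names and SSH access in the usage comment are assumptions, and a checksum comparison is only meaningful if the configuration is expected to be byte-identical on every node.

```shell
#!/bin/sh
# Read "<node> <value>" lines on stdin and print the number of distinct
# values. A result of 1 means every node reported the same value.
# Illustrative helper, not part of bao-kms-provider.
count_distinct_values() {
    awk 'NF >= 2 { seen[$2] = 1 } END { n = 0; for (k in seen) n++; print n }'
}

# Usage (hypothetical node names, assumes SSH access to each node):
#   for node in cp-1 cp-2 cp-3; do
#       sum=$(ssh "$node" sha256sum /etc/openbao-kms/config.yaml | cut -d' ' -f1)
#       printf '%s %s\n' "$node" "$sum"
#   done | count_distinct_values
```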
min_decryption_version Raised Too Early
Symptoms:
- old objects fail to decrypt,
- old key_id references fail after rotation,
- OpenBao decrypt returns version restriction errors.
Recovery:
- Lower min_decryption_version if the old key version still exists and policy allows it.
- Restore an OpenBao backup if the old key version no longer exists.
- Rerun storage migration only after reads through the KMS path are healthy; see Operations: Rotation.
- Verify old backups are either expired or still decryptable.
If old key material no longer exists, restore the OpenBao backup. See Disaster Recovery: Transit Key Loss.
Do not treat verify-rotation as proof that raising
min_decryption_version was safe. It reports local registry and Transit
metadata only.
Static Pod Image Missing
Symptoms:
- kubelet cannot start the plugin static pod,
- image pull errors appear in kubelet logs,
- the socket is missing.
Recovery:
- Load the image on the node.
- Use the immutable digest already present locally.
- Set image pull policy appropriately for air-gapped environments.
- Restart kubelet if needed.
See Deployment: Static Pod Deployment for the image preload and digest-pinning rules.
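A quick way to confirm that a static pod manifest references an immutable image is to test the image string for a sha256 digest. This is a minimal sketch: is_digest_pinned is a hypothetical helper and the image reference in the usage comment is a placeholder.

```shell
#!/bin/sh
# Succeed when an image reference is pinned by a sha256 digest rather
# than a mutable tag. Illustrative helper only.
is_digest_pinned() {
    printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Usage (placeholder reference):
#   is_digest_pinned "registry.example/bao-kms-provider@sha256:<digest>" \
#       || echo "image is not digest-pinned"
```

On nodes that use a CRI runtime, crictl images --digests lists the digests already present locally, which can be compared with the reference in the manifest.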
Identity Fallback Issues
If identity fallback remains enabled too long:
- plaintext writes become more likely after future misconfiguration,
- audits may miss resources that were never migrated.
If identity fallback is removed too early:
- old plaintext objects may become unreadable depending on the provider set and migration state.
Recovery:
- Restore the last known-good EncryptionConfiguration.
- Restart or reload kube-apiserver.
- Complete resource migration; see Kubernetes Encryption Config: Migrate Existing Resources.
- Remove the fallback after migration verification.
Do Not Do This During Incidents
- Do not delete encrypted etcd data.
- Do not recreate Transit keys with the same name.
- Do not change the provider name to clear errors.
- Do not raise min_decryption_version.
- Do not hand-edit or invent key registry state after rotation.
- Do not log plaintext or full ciphertext.
- Do not disable AAD globally.