Failure Modes
This page catalogs the failure modes the design considers. Each row pairs a failure scenario with its detection signal, mitigation, and recovery action. For the operator runbooks that act on these signals, see Operations: Troubleshooting and Operations: Disaster Recovery.
How To Use This Page
Each row in the catalog answers four questions:
- What goes wrong?
- How does an operator detect it?
- What design control mitigates it?
- What recovery action restores service?
Two columns flag operational severity:
- Startup block? indicates whether the API server would be unable to decrypt previously encrypted resources during startup while this failure is active.
- Data loss risk indicates whether the failure can leave Kubernetes resources permanently unrecoverable.
Bootstrap And Runtime
| Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk |
|---|---|---|---|---|---|---|---|
| Plugin unavailable | Service not installed, crash, disabled | API server cannot reach KMS | systemd or kubelet status, socket missing, KMS unhealthy | Restart policy, health checks | Start plugin, fix configuration | Yes | No |
| Socket unavailable | Directory missing, listener failed | API server cannot call KMS | API server logs, /live failure | Pre-created runtime directory, safe socket setup | Fix path or permissions, restart plugin | Yes | No |
| Kubelet or container runtime unavailable for static pod | Host boot failure | Plugin static pod cannot start | kubelet or CRI logs | systemd mode, local image cache | Fix kubelet or CRI, or run plugin as host service | Yes | No |
| systemd ordering wrong | Plugin starts after API server | API server fails or retries | Boot logs | Before=kubelet.service where appropriate, tested units | Correct unit dependencies | Yes | No |
| Stale socket | Crash left socket path | Startup failure or wrong listener | Socket check | Safe stale cleanup | Remove verified-dead socket | Yes | No |
| Wrong socket permissions | Socket mode or group excludes the API server user | API server cannot connect, KMS unavailable | API server permission errors | Mode and group validation | Fix group or mode | Yes | No |
| SELinux or AppArmor block | Host policy denies socket or file | KMS unavailable | Audit logs | Policy profiles, tests | Adjust policy | Yes | No |
| Configuration file permissions unsafe | World-readable or world-writable | Secret or topology exposure or tamper | Startup validation | Fail closed | Fix permissions | Yes | No |
| Plugin crash loop | Bug, bad configuration, OpenBao error path | KMS unavailable | Service logs | Supervisor, tests | Fix configuration or bug | Yes | No |
| Image unavailable for static pod | Pull failure, air gap | Plugin not started | kubelet events or logs | Preloaded image, IfNotPresent or Never | Load image | Yes | No |
| Package upgrade restarts systemd plugin | Maintenance event | Transient KMS outage | Service logs | Controlled rollout | Restart one node at a time | Possible | No |
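
The stale-socket and permission rows above can be checked with a short probe. The following is a minimal sketch, not the plugin's actual startup logic: the function names are hypothetical, and it assumes a Linux host where connecting to a socket path with no listener fails with a connection-refused error.

```python
import os
import socket
import stat


def socket_state(path: str) -> str:
    """Classify a Unix socket path as missing, not-a-socket, stale, live, or unknown."""
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.settimeout(1.0)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        # The path exists but nothing is listening: a crash left it behind.
        return "stale"
    except OSError:
        # Permission denied, timeout, or another error: do not guess.
        return "unknown"
    finally:
        probe.close()
    return "live"


def remove_if_stale(path: str) -> bool:
    """Remove the socket path only when the probe verified it is dead.

    There is a window between the probe and the unlink, so this sketch
    assumes a single supervisor owns the runtime directory.
    """
    if socket_state(path) == "stale":
        os.unlink(path)
        return True
    return False


def world_accessible(path: str) -> bool:
    """Return True when any 'other' permission bit is set on the path."""
    return bool(stat.S_IMODE(os.lstat(path).st_mode) & 0o007)
```

A result of `unknown` is deliberately not treated as stale, which keeps the cleanup on the fail-closed side when the probe itself cannot connect for reasons such as a permission or policy denial.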
OpenBao And Transit
| Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk |
|---|---|---|---|---|---|---|---|
| OpenBao unavailable | Network, DNS, load balancer, outage | Encrypt and decrypt fail | Readiness, metrics, OpenBao request errors | HA OpenBao, local routing, retries | Restore OpenBao reachability | Yes for encrypted data | No |
| OpenBao sealed | Manual seal, restart not unsealed | Transit unavailable | OpenBao health, plugin readiness | Auto-unseal, alerting | Unseal or restore OpenBao | Yes | No, unless key unavailable permanently |
| OpenBao inside same protected cluster | Circular dependency | KMS unavailable before API server | Bootstrap failure | External management plane | Start OpenBao independently or restore an external service | Yes | Possible if unrecoverable |
| Audit backend pressure | OpenBao audit device slow or failing | Transit latency or errors | OpenBao metrics, plugin latency | HA audit sinks, capacity planning | Repair audit backend | Possible | No |
| OpenBao leader failover | HA event | Transient errors or latency | OpenBao status, plugin retries | HA tuning, bounded retries | Wait or fix cluster | Possible | No |
| TLS certificate expired | Certificate not renewed | Plugin cannot connect | TLS errors | Certificate monitoring | Renew certificate, reload plugin | Yes | No |
| DNS or LB misrouting | Wrong backend or stale DNS | Auth or Transit errors | TLS or SNI errors, metadata mismatch | Pinned CA and SNI, instance ID checks | Fix DNS or load balancer | Yes | No |
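
The certificate monitoring control for the expired-certificate row can be illustrated with a small classifier. This is a sketch under stated assumptions: the function name and the 30-day renewal window are hypothetical, and the `not_after` string is assumed to be in the format that Python's `ssl` module reports for a peer certificate.

```python
import ssl


def certificate_status(not_after: str, now: float,
                       renew_within: float = 30 * 86400) -> str:
    """Classify a certificate by the time left before its notAfter date.

    not_after uses the form "Jan  5 09:34:43 2018 GMT"; now is seconds
    since the epoch. The 30-day window is an illustrative default.
    """
    remaining = ssl.cert_time_to_seconds(not_after) - now
    if remaining <= 0:
        return "expired"
    if remaining <= renew_within:
        return "renew-soon"
    return "ok"
```

Alerting on `renew-soon` rather than `expired` gives operators time to renew and reload the plugin before the failure becomes a startup block.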
Transit Key Material
| Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk |
|---|---|---|---|---|---|---|---|
| Transit key deleted | Destructive admin action | Old ciphertext undecryptable | Metadata read fails, decrypt failures | deletion_allowed=false, no delete permission | Restore OpenBao backup with key material | Yes | Yes if no valid backup |
| Transit key soft-deleted | Key archived or disabled | Encrypt and decrypt fail | Metadata state, decrypt errors | Change control | Restore key if possible | Yes | No if restored |
| Transit key recreated same name | Key lineage lost | Old data undecryptable; key_id collision risk | Lineage mismatch, decrypt failures | Key lineage ID, delete protection | Restore original key; do not accept new lineage | Yes | Yes if original key lost |
| min_decryption_version raised too early | Operator error | Old ciphertext undecryptable | Decrypt failures for old key_id values | Verify migration first | Lower setting if key versions still exist | Yes | Possible |
| Key backup missing | Disaster restore lacks Transit key versions | Data undecryptable | DR test failure | Coordinated OpenBao backups | Restore from valid backup | Yes | Yes |
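
The "verify migration first" control for min_decryption_version amounts to a simple precondition. The sketch below assumes the operator already has an inventory of the Transit key versions still referenced by stored ciphertext; how that inventory is gathered is out of scope here, and the function name is hypothetical.

```python
from typing import Iterable, List


def blocking_versions(versions_in_use: Iterable[int],
                      proposed_min: int) -> List[int]:
    """Return the key versions that would become undecryptable.

    Any version below proposed_min that is still in use blocks the
    change; an empty list means the change is safe for this inventory.
    """
    return sorted(v for v in set(versions_in_use) if v < proposed_min)
```

For example, with versions 1, 2, and 3 still in use, a proposed minimum of 3 is blocked by versions 1 and 2 until the affected resources are rewritten.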
Authentication And Issuer State
| Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk |
|---|---|---|---|---|---|---|---|
| JWT expired and API server down | Protected cluster issued JWT and cannot renew | OpenBao login fails | Auth metrics, JWT expiry check | External issuer, sufficient TTL, file refresh | Replace JWT from external issuer or restore enough API server function to issue a token | Yes | No |
| JWT file missing | Provisioning error | Login impossible | Startup validation | File checks, configuration management | Restore JWT file | Yes | No |
| JWT wrong audience | Issuer or configuration mismatch | Login denied | Auth error | Bound audience tests | Issue correct JWT or fix role | Yes | No |
| JWT wrong subject or claims | Role mismatch | Login denied | Auth error | Claim binding documentation | Issue correct JWT or fix role | Yes | No |
| Issuer changed | OIDC or JWT issuer rotation | Login denied | Auth logs | Planned overlap | Update OpenBao configuration and JWT source | Yes | No |
| JWKS rotated | New signing key unknown | Login denied | JWT auth errors | JWKS monitoring, overlapping keys | Refresh JWKS or OpenBao configuration | Yes | No |
| OpenBao cannot reach JWKS or OIDC discovery | Network failure | Login denied or cache expiry | Auth errors | Pinned public keys for recovery | Restore discovery or configure keys | Possible | No |
| Clock skew | Host, OpenBao, or issuer clocks differ | JWT invalid | Auth errors, NTP alerts | NTP or chrony, leeway | Fix clocks | Yes | No |
| Revoked JWT still cryptographically valid | JWT auth lacks TokenReview | Token may be accepted until expiry | Hard to detect | Short TTL, external issuer controls | Rotate JWT and signing keys if needed | No immediate | No |
| Certificate expired | Certificate or SVID not renewed | OpenBao cert login fails | Auth metrics, certificate TTL metric | Certificate monitoring, SPIFFE or PKCS#11 renewal process | Renew certificate or restore SPIFFE/PKCS#11 issuer | Yes | No |
| Certificate identity drift | Wrong certificate, SPIFFE ID, or trust domain | Local validation or OpenBao role rejects login | Auth error | Exact local identity checks and OpenBao role constraints | Restore correct certificate identity or update config and role together before use | Yes | No |
| PKCS#11 module or token unavailable | Module path, token label, key label, PIN, or hardware failure | Login impossible | Startup validation, auth error | Provider-only PIN file, token monitoring | Restore module or token, fix labels, replace hardware through planned procedure | Yes | No |
| SPIFFE Workload API unavailable | SPIFFE agent or socket unavailable | Login impossible | Auth error, certificate TTL metric approaches zero | Independent SPIFFE availability and socket permissions | Restore SPIFFE agent or socket access | Yes | No |
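
Several JWT rows above (expiry, wrong audience, clock skew) can be caught locally before a login attempt. The following is a minimal sketch of such a pre-check, with hypothetical function names. It reads claims without verifying the signature, so it is a diagnostic aid only and never a substitute for the validation OpenBao performs.

```python
import base64
import json
from typing import List


def decode_claims(token: str) -> dict:
    """Decode the payload of a compact JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def precheck(claims: dict, expected_audience: str, now: float,
             leeway: float = 0.0) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    exp = claims.get("exp")
    if exp is None:
        problems.append("missing exp")
    elif now - leeway >= exp:
        problems.append("expired")
    nbf = claims.get("nbf")
    if nbf is not None and now + leeway < nbf:
        problems.append("not yet valid")
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if expected_audience not in audiences:
        problems.append("wrong audience")
    return problems
```

The `leeway` parameter mirrors the leeway control in the clock skew row: a token that looks expired by a few seconds on one host may still be accepted elsewhere, so the pre-check should use the same tolerance as the verifier.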
KMS Contract, Registry, And Decrypt
| Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk |
|---|---|---|---|---|---|---|---|
| Status key_id differs from encrypt response | Race or bug | API server discards encrypt result, marks unhealthy | API server logs, plugin metrics | Snapshot consistency | Fix bug, restart with stable registry | Possible | No |
| key_id flip-flops | Unstable rotation observation | Stale marking oscillates | Metrics or logs hash changes | Stable observation count, activation delay | Pin active key, fix watcher | Possible | No |
| Unknown key_id on decrypt | Configuration or provider changed, old data | Decrypt rejected | Decrypt key_id errors | Preserve key history | Restore old configuration or registry | Yes for affected data | Possible |
| Registry state missing | State file removed or first startup after restore | Provider may need to rebuild snapshots before serving | doctor, verify-key, rotation-plan, startup logs with auto-bootstrap eligibility reason | Strict state file and checkpoint; auto-bootstrap only from initial Transit metadata | Restore state file and checkpoint from backup or a known-good peer; after rotation or brownfield import, current preview releases fail closed rather than rebuilding state | Possible | No |
| Registry state corrupt or rolled back | Disk corruption, unsafe restore, replayed file | Startup or probe fails closed | State load errors, hash mismatch, rollback error | State hash chain, checkpoint, permission checks, monotonic generation | Restore last valid state file and checkpoint from backup or a known-good peer; controlled recover-state is deferred | Yes | No |
| Registry state and checkpoint both replaced | Privileged host compromise or unsafe full-directory restore | Local replay guard can be bypassed by a self-consistent pair | Node comparison, backup metadata, host integrity monitoring | Host filesystem controls, immutable backups, external evidence; local checkpoint is not tamper-proof | Rebuild trust from known-good backup or peer evidence; investigate host compromise | Possible | Possible |
| Missing or malformed annotations | Bug, corruption, or mismatched provider | Decrypt rejected when AAD required | AAD validation metrics | AAD required in the current release line | Restore matching configuration or recover from a valid backup | Yes for affected data | No if recoverable backup exists |
| AAD mismatch | Wrong cluster, key, or provider metadata | Decrypt rejected | AAD error | Stable configuration, validation | Restore matching configuration | Yes for affected data | Possible |
| API server decrypt storm | Startup with many encrypted objects | Latency, timeouts | Duration metrics, API server logs | OpenBao capacity, local routing, soak evidence; batching remains future work | Scale OpenBao, tune timeout, reduce retries; implement batching only if release evidence requires it | Yes if severe | No |
| Provider name changed | EncryptionConfiguration drift | Old encrypted data may not match provider | API server errors | Immutable provider name | Restore old name or configure migration | Yes for affected data | Possible |
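
The stable observation count control for the key_id flip-flop row can be sketched as a small debouncer. This illustrates the general technique only and is not the plugin's implementation; the class name and the `required` parameter are hypothetical.

```python
from typing import Optional


class StableKeyId:
    """Switch the active key_id only after consecutive matching observations."""

    def __init__(self, initial: str, required: int) -> None:
        self.active = initial
        self._required = required
        self._candidate: Optional[str] = None
        self._count = 0

    def observe(self, key_id: str) -> str:
        """Record one observation and return the key_id to report."""
        if key_id == self.active:
            # Seeing the active key again cancels any pending switch.
            self._candidate, self._count = None, 0
        elif key_id == self._candidate:
            self._count += 1
        else:
            self._candidate, self._count = key_id, 1
        if self._candidate is not None and self._count >= self._required:
            self.active = self._candidate
            self._candidate, self._count = None, 0
        return self.active
```

With `required=3`, a single stray observation of a new key_id does not change the reported value, so the API server does not see the key_id oscillate and repeatedly mark data stale.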
Kubernetes Encryption Scope And Migration
| Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk |
|---|---|---|---|---|---|---|---|
| identity fallback left enabled permanently | Migration incomplete | Plaintext writes possible if provider order changes or KMS is unavailable with identity first | Configuration audit | Remove after migration | Rewrite resources and remove fallback | No | Confidentiality loss |
| identity fallback removed too early | Old plaintext or misordered data | Reads may fail depending on provider set | API errors | Verify migration first | Restore fallback temporarily and migrate | Possible | No |
| Only some resources encrypted | Configuration scope incomplete | Unprotected resources in etcd | Encryption configuration review | Explicit resource list | Update configuration, rewrite resources | No | Confidentiality loss |
| Existing resources not rewritten | Encryption only applies on write | Old data remains under old provider or plaintext | Audit and migration checks | Storage migration | Rewrite resources | No | Confidentiality loss |
| Mixed plaintext and encrypted backups | Backups taken across migration | Inconsistent confidentiality | Backup audit | Backup labeling | Handle according to sensitivity | No | Confidentiality loss |
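
The configuration audit named in the rows above can be sketched as a check over an already-parsed EncryptionConfiguration. This is a minimal illustration with a hypothetical function name and finding labels. It assumes the configuration has been loaded into a dictionary, relies on the Kubernetes rule that the first provider in a list is used for writes, and does not expand wildcard resource entries.

```python
from typing import Iterable, List, Tuple


def audit_encryption_config(config: dict,
                            required: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (resources, finding) pairs for risky or missing entries."""
    findings = []
    covered = set()
    for entry in config.get("resources", []):
        names = entry.get("resources", [])
        covered.update(names)
        # Each provider is a single-key mapping such as {"kms": {...}}.
        providers = [next(iter(p)) for p in entry.get("providers", []) if p]
        label = ",".join(names)
        if providers and providers[0] == "identity":
            findings.append((label, "identity-first"))
        elif "identity" in providers:
            findings.append((label, "identity-fallback-enabled"))
    for missing in sorted(set(required) - covered):
        findings.append((missing, "not-covered"))
    return findings
```

An `identity-first` finding means new writes are stored as plaintext, `identity-fallback-enabled` means the migration fallback has not yet been removed, and `not-covered` flags a resource that the configuration does not encrypt at all.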