Failure Modes

This page catalogs the failure modes the design considers. Each row pairs a failure scenario with its cause, impact, detection signal, mitigating control, and recovery action. For the operator runbooks that act on these signals, see Operations: Troubleshooting and Operations: Disaster Recovery.

How To Use This Page

Each row in the catalog answers four questions:

  • What goes wrong?
  • How does an operator detect it?
  • What design control mitigates it?
  • What recovery action restores service?

Two columns flag operational severity:

  • Blocks API server startup? (the "Startup block?" column) indicates whether the failure, while active, prevents the API server from decrypting previously encrypted resources during startup.
  • Permanent data loss risk? (the "Data loss risk" column) indicates whether the failure can leave Kubernetes resources permanently unrecoverable.

Bootstrap And Runtime

Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk
Plugin unavailable | Service not installed, crash, disabled | API server cannot reach KMS | systemd or kubelet status, socket missing, KMS unhealthy | Restart policy, health checks | Start plugin, fix configuration | Yes | No
Socket unavailable | Directory missing, listener failed | API server cannot call KMS | API server logs, /live failure | Pre-created runtime directory, safe socket setup | Fix path or permissions, restart plugin | Yes | No
Kubelet or container runtime unavailable for static pod | Host boot failure | Plugin static pod cannot start | kubelet or CRI logs | systemd mode, local image cache | Fix kubelet or CRI, or run plugin as host service | Yes | No
systemd ordering wrong | Plugin starts after API server | API server fails or retries | Boot logs | Before=kubelet.service where appropriate, tested units | Correct unit dependencies | Yes | No
Stale socket | Crash left socket path | Startup failure or wrong listener | Socket check | Safe stale cleanup | Remove verified-dead socket | Yes | No
Wrong socket permissions | API server cannot connect | KMS unavailable | API server permission errors | Mode and group validation | Fix group or mode | Yes | No
SELinux or AppArmor block | Host policy denies socket or file | KMS unavailable | Audit logs | Policy profiles, tests | Adjust policy | Yes | No
Configuration file permissions unsafe | World-readable or world-writable | Secret or topology exposure or tamper | Startup validation | Fail closed | Fix permissions | Yes | No
Plugin crash loop | Bug, bad configuration, OpenBao error path | KMS unavailable | Service logs | Supervisor, tests | Fix configuration or bug | Yes | No
Image unavailable for static pod | Pull failure, air gap | Plugin not started | kubelet events or logs | Preloaded image, IfNotPresent or Never | Load image | Yes | No
Package upgrade restarts systemd plugin | Maintenance event | Transient KMS outage | Service logs | Controlled rollout | Restart one node at a time | Possible | No
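
The stale-socket row calls for removing only a verified-dead socket. The following is a minimal Python sketch of one way to make that check, assuming a Unix stream socket and that a refused connection means no process is listening. The function name remove_if_stale and the 0.5 second timeout are illustrative and are not taken from the plugin.

```python
import os
import socket
import stat


def remove_if_stale(path: str, timeout: float = 0.5) -> bool:
    """Remove `path` only if it is a Unix socket with no live listener.

    Hypothetical helper for illustration. Returns True if a stale socket
    file was removed, False if nothing was removed.
    """
    try:
        st = os.lstat(path)
    except FileNotFoundError:
        return False  # nothing to clean up
    if not stat.S_ISSOCK(st.st_mode):
        # Refuse to delete regular files, symlinks, or directories.
        raise ValueError(f"{path} is not a Unix socket; refusing to remove")

    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.settimeout(timeout)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        # No process is accepting on this path: treat the file as dead.
        os.unlink(path)
        return True
    except OSError:
        # Timeouts, permission errors, and similar are ambiguous: keep it.
        return False
    else:
        return False  # a live listener answered; leave the socket alone
    finally:
        probe.close()
```

The sketch fails safe: anything other than an explicit connection refusal leaves the path in place, so an operator decides what to do with ambiguous cases.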

OpenBao And Transit

Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk
OpenBao unavailable | Network, DNS, load balancer, outage | Encrypt and decrypt fail | Readiness, metrics, OpenBao request errors | HA OpenBao, local routing, retries | Restore OpenBao reachability | Yes for encrypted data | No
OpenBao sealed | Manual seal, restart not unsealed | Transit unavailable | OpenBao health, plugin readiness | Auto-unseal, alerting | Unseal or restore OpenBao | Yes | No, unless key unavailable permanently
OpenBao inside same protected cluster | Circular dependency | KMS unavailable before API server | Bootstrap failure | External management plane | Start OpenBao independently or restore an external service | Yes | Possible if unrecoverable
Audit backend pressure | OpenBao audit device slow or failing | Transit latency or errors | OpenBao metrics, plugin latency | HA audit sinks, capacity planning | Repair audit backend | Possible | No
OpenBao leader failover | HA event | Transient errors or latency | OpenBao status, plugin retries | HA tuning, bounded retries | Wait or fix cluster | Possible | No
TLS certificate expired | Certificate not renewed | Plugin cannot connect | TLS errors | Certificate monitoring | Renew certificate, reload plugin | Yes | No
DNS or LB misrouting | Wrong backend or stale DNS | Auth or Transit errors | TLS or SNI errors, metadata mismatch | Pinned CA and SNI, instance ID checks | Fix DNS or load balancer | Yes | No
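
The leader-failover row lists bounded retries as a control. As a rough illustration of what "bounded" can mean, the Python sketch below retries a call a fixed number of times with capped exponential backoff. TransientError, call_with_bounded_retries, and all default values are hypothetical names and numbers chosen for this example; they do not describe the plugin's actual retry policy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as a failover."""


def call_with_bounded_retries(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `op` up to `attempts` times, backing off between failures.

    Only TransientError is retried; any other exception propagates at once.
    The final TransientError is re-raised so the caller can fail the request.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == attempts:
                raise  # retry budget exhausted
            sleep(min(delay, max_delay))  # cap the backoff
            delay *= 2
    raise AssertionError("unreachable")
```

Keeping both the attempt count and the delay capped means a failover adds a known worst-case latency to each request instead of an open-ended stall.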

Transit Key Material

Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk
Transit key deleted | Destructive admin action | Old ciphertext undecryptable | Metadata read fails, decrypt failures | deletion_allowed=false, no delete permission | Restore OpenBao backup with key material | Yes | Yes if no valid backup
Transit key soft-deleted | Key archived or disabled | Encrypt and decrypt fail | Metadata state, decrypt errors | Change control | Restore key if possible | Yes | No if restored
Transit key recreated same name | Key lineage lost | Old data undecryptable; key_id collision risk | Lineage mismatch, decrypt failures | Key lineage ID, delete protection | Restore original key; do not accept new lineage | Yes | Yes if original key lost
min_decryption_version raised too early | Operator error | Old ciphertext undecryptable | Decrypt failures for old key_id values | Verify migration first | Lower setting if key versions still exist | Yes | Possible
Key backup missing | Disaster restore lacks Transit key versions | Data undecryptable | DR test failure | Coordinated OpenBao backups | Restore from valid backup | Yes | Yes
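
For the min_decryption_version row, "verify migration first" amounts to confirming that no stored ciphertext still depends on a key version below the proposed minimum. The Python sketch below expresses that check. It assumes the operator already has an inventory of the Transit key versions still referenced by stored data; how that inventory is gathered is outside this sketch, and versions_blocking_raise is a hypothetical name.

```python
from typing import Iterable, List


def versions_blocking_raise(
    proposed_min: int, versions_in_use: Iterable[int]
) -> List[int]:
    """Return the in-use key versions that would become undecryptable.

    `versions_in_use` is an assumed inventory of Transit key versions that
    stored ciphertext still references. An empty result means raising
    min_decryption_version to `proposed_min` would not strand any data
    covered by that inventory.
    """
    if proposed_min < 0:
        raise ValueError("proposed_min must not be negative")
    return sorted({v for v in versions_in_use if v < proposed_min})
```

For example, with versions 1 through 4 still in use, a proposed minimum of 3 returns [1, 2], which signals that resources under those versions need rewriting before the setting is changed.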

Authentication And Issuer State

Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk
JWT expired and API server down | Protected cluster issued JWT and cannot renew | OpenBao login fails | Auth metrics, JWT expiry check | External issuer, sufficient TTL, file refresh | Replace JWT from external issuer or restore enough API server function to issue a token | Yes | No
JWT file missing | Provisioning error | Login impossible | Startup validation | File checks, configuration management | Restore JWT file | Yes | No
JWT wrong audience | Issuer or configuration mismatch | Login denied | Auth error | Bound audience tests | Issue correct JWT or fix role | Yes | No
JWT wrong subject or claims | Role mismatch | Login denied | Auth error | Claim binding documentation | Issue correct JWT or fix role | Yes | No
Issuer changed | OIDC or JWT issuer rotation | Login denied | Auth logs | Planned overlap | Update OpenBao configuration and JWT source | Yes | No
JWKS rotated | New signing key unknown | Login denied | JWT auth errors | JWKS monitoring, overlapping keys | Refresh JWKS or OpenBao configuration | Yes | No
OpenBao cannot reach JWKS or OIDC discovery | Network failure | Login denied or cache expiry | Auth errors | Pinned public keys for recovery | Restore discovery or configure keys | Possible | No
Clock skew | Host, OpenBao, or issuer clocks differ | JWT invalid | Auth errors, NTP alerts | NTP or chrony, leeway | Fix clocks | Yes | No
Revoked JWT still cryptographically valid | JWT auth lacks TokenReview | Token may be accepted until expiry | Hard to detect | Short TTL, external issuer controls | Rotate JWT and signing keys if needed | No immediate | No
Certificate expired | Certificate or SVID not renewed | OpenBao cert login fails | Auth metrics, certificate TTL metric | Certificate monitoring, SPIFFE or PKCS#11 renewal process | Renew certificate or restore SPIFFE/PKCS#11 issuer | Yes | No
Certificate identity drift | Wrong certificate, SPIFFE ID, or trust domain | Local validation or OpenBao role rejects login | Auth error | Exact local identity checks and OpenBao role constraints | Restore correct certificate identity or update config and role together before use | Yes | No
PKCS#11 module or token unavailable | Module path, token label, key label, PIN, or hardware failure | Login impossible | Startup validation, auth error | Provider-only PIN file, token monitoring | Restore module or token, fix labels, replace hardware through planned procedure | Yes | No
SPIFFE Workload API unavailable | SPIFFE agent or socket unavailable | Login impossible | Auth error, certificate TTL metric approaches zero | Independent SPIFFE availability and socket permissions | Restore SPIFFE agent or socket access | Yes | No
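
Several rows above rely on a JWT expiry check as the detection signal. A monitoring-only version of that check could look like the Python sketch below, which reads the standard exp claim from a compact JWT and reports the remaining lifetime minus a safety margin. It deliberately does not verify the signature, so it is suitable for alerting and not for authentication. The function name and the safety_margin parameter are assumptions made for this example.

```python
import base64
import json
import time
from typing import Optional


def jwt_seconds_remaining(
    token: str, now: Optional[float] = None, safety_margin: float = 0.0
) -> float:
    """Return seconds until the token's `exp`, minus `safety_margin`.

    Monitoring helper only: the signature is NOT verified here. A result
    at or below zero means the token should be treated as needing renewal.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a compact JWT with three segments")
    payload = parts[1]
    # base64url segments omit padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or isinstance(exp, bool):
        raise ValueError("token has no numeric exp claim")
    current = time.time() if now is None else now
    return exp - current - safety_margin
```

Choosing a safety margin larger than the expected clock skew between the host and OpenBao gives the alert a chance to fire before logins start failing.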

KMS Contract, Registry, And Decrypt

Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk
Status key_id differs from encrypt response | Race or bug | API server discards encrypt result, marks unhealthy | API server logs, plugin metrics | Snapshot consistency | Fix bug, restart with stable registry | Possible | No
key_id flip-flops | Unstable rotation observation | Stale marking oscillates | Metrics or logs hash changes | Stable observation count, activation delay | Pin active key, fix watcher | Possible | No
Unknown key_id on decrypt | Configuration or provider changed, old data | Decrypt rejected | Decrypt key_id errors | Preserve key history | Restore old configuration or registry | Yes for affected data | Possible
Registry state missing | State file removed or first startup after restore | Provider may need to rebuild snapshots before serving | doctor, verify-key, rotation-plan, startup logs with auto-bootstrap eligibility reason | Strict state file and checkpoint; auto-bootstrap only from initial Transit metadata | Restore state file and checkpoint from backup or a known-good peer; after rotation or brownfield import, current preview releases fail closed rather than rebuilding state | Possible | No
Registry state corrupt or rolled back | Disk corruption, unsafe restore, replayed file | Startup or probe fails closed | State load errors, hash mismatch, rollback error | State hash chain, checkpoint, permission checks, monotonic generation | Restore last valid state file and checkpoint from backup or a known-good peer; controlled recover-state is deferred | Yes | No
Registry state and checkpoint both replaced | Privileged host compromise or unsafe full-directory restore | Local replay guard can be bypassed by a self-consistent pair | Node comparison, backup metadata, host integrity monitoring | Host filesystem controls, immutable backups, external evidence; local checkpoint is not tamper-proof | Rebuild trust from known-good backup or peer evidence; investigate host compromise | Possible | Possible
Missing or malformed annotations | Bug, corruption, or mismatched provider | Decrypt rejected when AAD required | AAD validation metrics | AAD required in the current release line | Restore matching configuration or recover from a valid backup | Yes for affected data | No if recoverable backup exists
AAD mismatch | Wrong cluster, key, or provider metadata | Decrypt rejected | AAD error | Stable configuration, validation | Restore matching configuration | Yes for affected data | Possible
API server decrypt storm | Startup with many encrypted objects | Latency, timeouts | Duration metrics, API server logs | OpenBao capacity, local routing, soak evidence; batching remains future work | Scale OpenBao, tune timeout, reduce retries; implement batching only if release evidence requires it | Yes if severe | No
Provider name changed | EncryptionConfiguration drift | Old encrypted data may not match provider | API server errors | Immutable provider name | Restore old name or configure migration | Yes for affected data | Possible
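
The key_id flip-flop row names a stable observation count as its control. One way to picture that idea is a small debouncer that switches the advertised key_id only after the same new value has been seen several times in a row. The Python sketch below is an illustration under that assumption; StableKeyObserver and its default threshold of 3 are hypothetical and do not describe the plugin's registry logic.

```python
from typing import Optional


class StableKeyObserver:
    """Advertise a new key_id only after `required` consecutive sightings."""

    def __init__(self, active: str, required: int = 3) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.active = active
        self.required = required
        self._candidate: Optional[str] = None
        self._count = 0

    def observe(self, key_id: str) -> str:
        """Record one observation and return the key_id to advertise."""
        if key_id == self.active:
            # Seeing the active key again cancels any pending candidate.
            self._candidate, self._count = None, 0
            return self.active
        if key_id == self._candidate:
            self._count += 1
        else:
            # A different candidate restarts the count from one.
            self._candidate, self._count = key_id, 1
        if self._count >= self.required:
            self.active = key_id
            self._candidate, self._count = None, 0
        return self.active
```

With this shape, an observation stream that alternates between two values never changes the advertised key_id, which is the oscillation the row is meant to prevent.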

Kubernetes Encryption Scope And Migration

Failure mode | Cause | Impact | Detection | Control | Recovery | Startup block? | Data loss risk
identity fallback left enabled permanently | Migration incomplete | Plaintext writes possible if provider order changes or KMS is unavailable with identity first | Configuration audit | Remove after migration | Rewrite resources and remove fallback | No | Confidentiality loss
identity fallback removed too early | Old plaintext or misordered data | Reads may fail depending on provider set | API errors | Verify migration first | Restore fallback temporarily and migrate | Possible | No
Only some resources encrypted | Configuration scope incomplete | Unprotected resources in etcd | Encryption configuration review | Explicit resource list | Update configuration, rewrite resources | No | Confidentiality loss
Existing resources not rewritten | Encryption only applies on write | Old data remains under old provider or plaintext | Audit and migration checks | Storage migration | Rewrite resources | No | Confidentiality loss
Mixed plaintext and encrypted backups | Backups taken across migration | Inconsistent confidentiality | Backup audit | Backup labeling | Handle according to sensitivity | No | Confidentiality loss
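
The configuration audit named for the identity fallback rows can be partly automated. The Python sketch below walks an EncryptionConfiguration that has already been parsed into a dictionary and reports each resource rule where identity is the first provider (new writes are stored in plaintext) or is still present as a fallback. It assumes the standard layout of a top-level resources list whose entries carry resources and providers lists; the function name and the finding labels are made up for this example.

```python
from typing import Any, Dict, List, Tuple


def audit_identity_fallback(config: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (resource names, finding) pairs for rules that use identity.

    Findings are "identity-first" when identity is the write provider and
    "identity-fallback" when it appears later in the provider list.
    """
    findings: List[Tuple[str, str]] = []
    for rule in config.get("resources", []):
        names = ",".join(rule.get("resources", []))
        # Each provider entry is a single-key mapping, e.g. {"kms": {...}}.
        providers = [next(iter(p)) for p in rule.get("providers", []) if p]
        if not providers:
            continue
        if providers[0] == "identity":
            findings.append((names, "identity-first"))
        elif "identity" in providers:
            findings.append((names, "identity-fallback"))
    return findings
```

An empty result only shows that the configuration no longer lists identity; it says nothing about whether existing resources were rewritten, which the "Existing resources not rewritten" row covers separately.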