Audit archive degraded
Use this runbook when the OpenBaoAuditArchiveDegraded alert fires because the
durable audit archive path is enabled but missing, stale, failing, or
dead-lettering records. The steps help you protect audit evidence while you
restore archive delivery.
Before you begin
- Get access to Prometheus or the alert evaluator that reads archive health metrics.
- Get access to the collector, gateway, SIEM, object-store writer, or security pipeline that emits archive health metrics.
- Get access to the local OpenBao audit file or replay source.
- Get security approval before changing archive retention, archive writer credentials, dead-letter handling, or audit-device configuration.
[!WARNING] Audit records are security evidence. Do not paste raw audit records into tickets, chat, or public logs while you investigate archive delivery.
Understand the alert metrics
The alert uses reference health metrics from the archive delivery pipeline, not OpenBao server metrics. Publish these metrics from your collector, archive gateway, SIEM forwarder, object-store writer, or another controlled component.
| Metric | Type | Meaning |
|---|---|---|
| openbao_audit_archive_enabled | Gauge | Set to 1 only when this environment expects archive delivery. Leave absent or 0 for local and exempt environments. |
| openbao_audit_archive_delivery_success | Gauge | Set to 1 when the archive path is currently healthy and 0 when delivery is degraded. |
| openbao_audit_archive_last_success_timestamp_seconds | Gauge | Unix timestamp for the last successful archive delivery or acknowledgement. |
| openbao_audit_archive_delivery_failures_total | Counter | Count of failed archive writes, rejected batches, or failed delivery acknowledgements. |
| openbao_audit_archive_dead_letter_records_total | Counter | Count of records sent to a dead-letter path instead of the durable archive. |
If your backend exposes different metric names, add recording rules that map them to these reference names before you enable this alert. The audit archive health example shows both a small exporter and a recording-rule mapping pattern.
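As a rough illustration of what a component could publish, the following sketch renders the five reference metrics in the Prometheus text exposition format. It is not the audit archive health example mentioned above. The function name `render_archive_metrics` and the `values` mapping are assumptions made for this sketch, and how the output is served (HTTP endpoint, textfile collector, or otherwise) is left to your pipeline.

```python
# Reference metric names and types, copied from the table above.
REFERENCE_METRICS = {
    "openbao_audit_archive_enabled": "gauge",
    "openbao_audit_archive_delivery_success": "gauge",
    "openbao_audit_archive_last_success_timestamp_seconds": "gauge",
    "openbao_audit_archive_delivery_failures_total": "counter",
    "openbao_audit_archive_dead_letter_records_total": "counter",
}


def render_archive_metrics(values):
    """Render reference metrics in the Prometheus text exposition format.

    `values` maps a reference metric name to a number (hypothetical input
    shape). Metrics missing from `values` are omitted from the output, which
    corresponds to "leave absent" for local and exempt environments.
    """
    lines = []
    for name, metric_type in REFERENCE_METRICS.items():
        if name not in values:
            continue
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {values[name]}")
    # One sample per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

Calling `render_archive_metrics({"openbao_audit_archive_enabled": 1})` yields a `# TYPE` line followed by one sample line. An exempt environment can pass an empty mapping and publish nothing.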
Confirm the degradation
Confirm that archive delivery is enabled for this environment.
    max(openbao_audit_archive_enabled)
Check the current archive delivery status.
    min(openbao_audit_archive_delivery_success)
Check how long it has been since the last successful archive delivery.
    time() - max(openbao_audit_archive_last_success_timestamp_seconds)
Check delivery failures over the alert window.
    sum(increase(openbao_audit_archive_delivery_failures_total[15m]))
Check dead-lettered records over the alert window.
    sum(increase(openbao_audit_archive_dead_letter_records_total[15m]))
Record whether the alert is caused by missing health metrics, stale delivery, failed delivery, or dead-lettered records.
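One way to record the cause consistently is to map the five query results to the cause names used in this section. The sketch below is a hypothetical helper, not the alert rule itself. The 300 second staleness bound is an assumption borrowed from the verification query later in this runbook, so adjust it to match your alert definition.

```python
from typing import List, Optional

# Assumption: mirrors the 300 s bound used in the verification step.
STALE_AFTER_SECONDS = 300


def classify_archive_alert(
    enabled: Optional[float],
    delivery_success: Optional[float],
    seconds_since_success: Optional[float],
    failures_increase: Optional[float],
    dead_letter_increase: Optional[float],
) -> List[str]:
    """Map the five query results to the causes named in this section.

    Pass None for any query that returned no series.
    """
    if not enabled:  # metric absent or 0: archive delivery is not expected
        return ["not_enabled"]
    causes = []
    if delivery_success is None or seconds_since_success is None:
        causes.append("missing_health_metrics")
    if seconds_since_success is not None and seconds_since_success > STALE_AFTER_SECONDS:
        causes.append("stale_delivery")
    if delivery_success == 0 or (failures_increase or 0) > 0:
        causes.append("failed_delivery")
    if (dead_letter_increase or 0) > 0:
        causes.append("dead_lettered_records")
    return causes
```

More than one cause can apply at once, so the helper returns a list rather than a single label.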
Check OpenBao audit health
Confirm that OpenBao is still writing audit records.
    sum(increase(${p}_audit_log_request_failure[5m]))
${p}: Metric prefix for your deployment. Use vault for the OpenBao default prefix or openbao when you configured metrics_prefix = "openbao".
Check response audit failures.
    sum(increase(${p}_audit_log_response_failure[5m]))
If either counter increases, use Audit request and response failures before you focus on downstream archive delivery.
For file audit devices, confirm that the local audit file is still growing and that the volume has enough space to buffer records while archive delivery is degraded.
    stat <audit_log_file>
<audit_log_file>: Full path to the OpenBao audit log file.
    df -h <audit_log_directory>
<audit_log_directory>: Directory that contains the audit log file.
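If you prefer to script the same check, the sketch below takes two snapshots of the audit file and compares them. It uses only the Python standard library. The function names and the snapshot dictionary shape are assumptions made for this example, and the interval between snapshots is up to you.

```python
import os
import shutil


def audit_file_snapshot(audit_log_file: str) -> dict:
    """Return the audit file's size and mtime plus free space on its volume."""
    st = os.stat(audit_log_file)
    directory = os.path.dirname(os.path.abspath(audit_log_file))
    usage = shutil.disk_usage(directory)
    return {
        "size_bytes": st.st_size,
        "mtime": st.st_mtime,
        "free_bytes": usage.free,
        "free_fraction": usage.free / usage.total,
    }


def is_growing(before: dict, after: dict) -> bool:
    """True when the file grew or was modified between two snapshots."""
    return (
        after["size_bytes"] > before["size_bytes"]
        or after["mtime"] > before["mtime"]
    )
```

Take one snapshot, wait long enough for OpenBao to handle at least one request, then take a second and pass both to `is_growing`. The sketch reads only file metadata, never file contents, so it does not expose raw audit records.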
Restore archive delivery
Keep the local audit file or replay source intact. Do not delete collector positions, buffered files, queues, or dead-letter records until security responders approve the recovery plan.
Restore collector health when the collector cannot read the local audit file or send batches to the archive path.
Restore archive backend connectivity when object storage, SIEM ingestion, or the archive gateway is unavailable.
Rotate or restore archive writer credentials when authentication failures cause delivery errors.
Fix parser, schema, size, or policy errors when records are rejected or dead-lettered.
Replay records from the local audit file, collector queue, or dead-letter path after the delivery path is healthy.
Increase local buffer capacity or collector throughput when backlog grows faster than the archive path can drain.
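To judge how urgent the last step is, a back-of-the-envelope estimate of the remaining buffer time can help. The sketch below assumes you can measure the audit write rate and the archive drain rate in bytes per second. Both inputs and the function name are hypothetical, and the estimate treats both rates as constant.

```python
from typing import Optional


def seconds_until_buffer_full(
    free_bytes: float,
    ingest_bytes_per_second: float,
    drain_bytes_per_second: float,
) -> Optional[float]:
    """Estimate how long the local buffer lasts while the backlog grows.

    Returns None when the archive path drains at least as fast as records
    arrive, because the backlog is then not growing.
    """
    net_growth = ingest_bytes_per_second - drain_bytes_per_second
    if net_growth <= 0:
        return None
    return free_bytes / net_growth
```

For example, with 10 GiB free, 2 MiB/s of incoming audit data, and 1 MiB/s of drain, the estimate is 10240 seconds, or a little under three hours.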
Verify the result
Confirm that the archive path reports healthy delivery.
    min(openbao_audit_archive_delivery_success) == 1
Confirm that the last successful delivery is recent.
    time() - max(openbao_audit_archive_last_success_timestamp_seconds) < 300
Confirm that delivery failures and dead-lettered records stop increasing.
    sum(increase(openbao_audit_archive_delivery_failures_total[15m]))
    sum(increase(openbao_audit_archive_dead_letter_records_total[15m]))
Confirm that the local backlog drains and that replayed records reach the archive backend.
Confirm that OpenBao audit failure counters are not increasing.
Wait for the alert window to pass and confirm that OpenBaoAuditArchiveDegraded resolves.
Troubleshooting
The alert fires in a local or exempt environment
Do not publish openbao_audit_archive_enabled=1 for environments that do not
require durable archive delivery. Leave the metric absent or set it to 0 to keep the alert quiet.
Delivery success is healthy but the timestamp is stale
Check whether the archive writer updates the timestamp only on successful batches. If the environment is quiet, add an audited archive canary or update the writer to report explicit heartbeat delivery.
Dead-letter records increase
Inspect dead-letter metadata without exposing raw audit records. Common causes include parser changes, schema changes, record-size limits, SIEM policy rejections, and object-store permission errors.
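A minimal sketch of metadata-only triage follows. It assumes each dead-letter entry is available as a mapping with a `reason` and a `size_bytes` field. That shape is hypothetical, so adapt the field names to whatever your collector or gateway actually records. The function reads only those two fields and never touches the record payload.

```python
from typing import Dict, Iterable, Mapping


def summarize_dead_letters(entries: Iterable[Mapping]) -> Dict[str, dict]:
    """Count dead-lettered entries and track the largest size per reason.

    Only the hypothetical `reason` and `size_bytes` metadata fields are read.
    Any raw audit payload in an entry is ignored and never returned.
    """
    summary: Dict[str, dict] = {}
    for entry in entries:
        reason = str(entry.get("reason", "unknown"))
        size = int(entry.get("size_bytes", 0))
        bucket = summary.setdefault(reason, {"count": 0, "max_size_bytes": 0})
        bucket["count"] += 1
        bucket["max_size_bytes"] = max(bucket["max_size_bytes"], size)
    return summary
```

The per-reason counts are safe to share in a ticket because they carry no audit content. A dominant reason points toward one of the common causes listed above.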
Loki is healthy but archive delivery is degraded
Treat this as an evidence-retention incident. Loki exploration does not replace the durable archive path.
The archive backend is in maintenance
Silence the alert only when the security owner confirms the maintenance window, local backlog capacity, replay plan, and evidence handling.
What’s next
- Use Audit archive reference design to validate the archive pattern and failure modes.
- Use the audit archive health example to publish the reference metrics expected by this alert.
- Use Audit log stream missing when the short-term audit exploration stream is missing.
- Use Audit request and response failures when OpenBao reports audit-device failures.
- Use Audit canary missing when the Loki-backed audit canary is absent.
Source: OpenBao documents audit devices and audit blocking behavior in the OpenBao audit device documentation. The reference archive metrics and failure model come from the Audit archive reference design.