Audit archive degraded

Use this runbook when the OpenBaoAuditArchiveDegraded alert fires because the durable audit archive path is enabled but missing, stale, failing, or dead-lettering records. The steps help you protect audit evidence while you restore archive delivery.

Before you begin

- Get access to Prometheus or the alert evaluator that reads archive health metrics.
- Get access to the collector, gateway, SIEM, object-store writer, or security pipeline that emits archive health metrics.
- Get access to the local OpenBao audit file or replay source.
- Get security approval before changing archive retention, archive writer credentials, dead-letter handling, or audit-device configuration.

> [!WARNING]
> Audit records are security evidence. Do not paste raw audit records into tickets, chat, or public logs while you investigate archive delivery.

Understand the alert metrics

The alert uses reference health metrics from the archive delivery pipeline, not OpenBao server metrics. Publish these metrics from your collector, archive gateway, SIEM forwarder, object-store writer, or another controlled component.

| Metric | Type | Meaning |
| --- | --- | --- |
| `openbao_audit_archive_enabled` | Gauge | Set to `1` only when this environment expects archive delivery. Leave absent or `0` for local and exempt environments. |
| `openbao_audit_archive_delivery_success` | Gauge | Set to `1` when the archive path is currently healthy and `0` when delivery is degraded. |
| `openbao_audit_archive_last_success_timestamp_seconds` | Gauge | Unix timestamp for the last successful archive delivery or acknowledgement. |
| `openbao_audit_archive_delivery_failures_total` | Counter | Count of failed archive writes, rejected batches, or failed delivery acknowledgements. |
| `openbao_audit_archive_dead_letter_records_total` | Counter | Count of records sent to a dead-letter path instead of the durable archive. |

If your backend exposes different metric names, add recording rules that map them to these reference names before you enable this alert. The audit archive health example shows both a small exporter and a recording-rule mapping pattern.
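
As a rough illustration of what a component publishing these metrics has to emit, the following Python sketch renders the five reference metrics in the Prometheus text exposition format. It is not the audit archive health example itself. The `ArchiveHealth` container and `render_metrics` function are hypothetical names used only here, and the sketch assumes you serve the returned text from an endpoint that Prometheus scrapes.

```python
from dataclasses import dataclass


@dataclass
class ArchiveHealth:
    """Snapshot of archive delivery state (hypothetical container for this sketch)."""
    enabled: bool
    delivery_ok: bool
    last_success_unix: float
    failures_total: int
    dead_letter_total: int


def render_metrics(h: ArchiveHealth) -> str:
    """Render the five reference metrics in Prometheus text exposition format."""
    samples = [
        ("openbao_audit_archive_enabled", "gauge", int(h.enabled)),
        ("openbao_audit_archive_delivery_success", "gauge", int(h.delivery_ok)),
        ("openbao_audit_archive_last_success_timestamp_seconds", "gauge", h.last_success_unix),
        ("openbao_audit_archive_delivery_failures_total", "counter", h.failures_total),
        ("openbao_audit_archive_dead_letter_records_total", "counter", h.dead_letter_total),
    ]
    lines = []
    for name, kind, value in samples:
        # One TYPE line followed by one unlabelled sample per metric.
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

The two counters must only ever increase for the lifetime of the emitting process, otherwise `increase()` in the alert queries will misreport failures.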

Confirm the degradation

Confirm that archive delivery is enabled for this environment.
```
max(openbao_audit_archive_enabled)
```

Check the current archive delivery status.

```
min(openbao_audit_archive_delivery_success)
```

Check how long it has been since the last successful archive delivery.

```
time() - max(openbao_audit_archive_last_success_timestamp_seconds)
```

Check delivery failures over the alert window.

```
sum(
  increase(openbao_audit_archive_delivery_failures_total[15m])
)
```

Check dead-lettered records over the alert window.

```
sum(
  increase(openbao_audit_archive_dead_letter_records_total[15m])
)
```

Record whether the alert is caused by missing health metrics, stale delivery, failed delivery, or dead-lettered records.
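
To make the recorded cause consistent between responders, the query results above can be mapped to the four causes with a small helper. This is a minimal Python sketch, not the alert rule itself. The function name is hypothetical, `None` stands for a query that returned no data, and the 300-second staleness threshold is an assumption borrowed from the verification query later in this runbook, so adjust it to match your alert rule.

```python
from typing import List, Optional


def classify_degradation(
    enabled: Optional[float],
    delivery_success: Optional[float],
    staleness_seconds: Optional[float],
    failures_increase: Optional[float],
    dead_letter_increase: Optional[float],
    max_staleness_seconds: float = 300.0,  # assumed threshold, not taken from the alert rule
) -> List[str]:
    """Return the likely causes of the alert, given the results of the queries above."""
    # Archive delivery is not expected here, so there is nothing to classify.
    if enabled is None or enabled < 1:
        return []
    causes: List[str] = []
    if delivery_success is None or staleness_seconds is None:
        causes.append("missing health metrics")
    if staleness_seconds is not None and staleness_seconds > max_staleness_seconds:
        causes.append("stale delivery")
    # Counters may be absent until the first failure, so treat None as zero.
    if delivery_success == 0 or (failures_increase or 0) > 0:
        causes.append("failed delivery")
    if (dead_letter_increase or 0) > 0:
        causes.append("dead-lettered records")
    return causes
```

More than one cause can apply at once, for example failed delivery together with dead-lettered records.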

Check OpenBao audit health

Confirm that OpenBao is still writing audit records by checking request audit failures.
```
sum(
  increase(${p}_audit_log_request_failure[5m])
)
```
- `${p}`: Metric prefix for your deployment. Use `vault` for the OpenBao default prefix or `openbao` when you configured `metrics_prefix = "openbao"`.

Check response audit failures.

```
sum(
  increase(${p}_audit_log_response_failure[5m])
)
```

If either counter increases, use Audit request and response failures before you focus on downstream archive delivery.
For file audit devices, confirm that the local audit file is still growing and that the volume has enough space to buffer records while archive delivery is degraded.
```
stat <audit_log_file>
```
- `<audit_log_file>`: Full path to the OpenBao audit log file.
```
df -h <audit_log_directory>
```
- `<audit_log_directory>`: Directory that contains the audit log file.
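
If you want to repeat the `stat` and `df` checks on a schedule, the same information is available from the Python standard library. The sketch below is one possible approach under stated assumptions. The function name is hypothetical, and the 300-second write age and 20% free-space thresholds are illustrative values rather than recommendations from this runbook.

```python
import os
import shutil
import time
from typing import Optional


def audit_file_status(
    audit_log_file: str,
    max_age_seconds: float = 300.0,   # assumed threshold for "still growing"
    min_free_fraction: float = 0.2,   # assumed threshold for "enough space"
    now: Optional[float] = None,
) -> dict:
    """Report size, last-write age, and free space for the audit log volume."""
    st = os.stat(audit_log_file)
    usage = shutil.disk_usage(os.path.dirname(os.path.abspath(audit_log_file)))
    now = time.time() if now is None else now
    age = now - st.st_mtime
    free_fraction = usage.free / usage.total
    return {
        "size_bytes": st.st_size,
        "seconds_since_write": age,
        "recently_written": age <= max_age_seconds,
        "free_fraction": free_fraction,
        "enough_space": free_fraction >= min_free_fraction,
    }
```

A quiet OpenBao server writes few audit records, so an old modification time is only a hint. Compare `size_bytes` across two calls, or send a test request, before you conclude that audit writes have stopped.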

Restore archive delivery

- Keep the local audit file or replay source intact. Do not delete collector positions, buffered files, queues, or dead-letter records until security responders approve the recovery plan.
- Restore collector health when the collector cannot read the local audit file or send batches to the archive path.
- Restore archive backend connectivity when object storage, SIEM ingestion, or the archive gateway is unavailable.
- Rotate or restore archive writer credentials when authentication failures cause delivery errors.
- Fix parser, schema, size, or policy errors when records are rejected or dead-lettered.
- Replay records from the local audit file, collector queue, or dead-letter path after the delivery path is healthy.
- Increase local buffer capacity or collector throughput when backlog grows faster than the archive path can drain.
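
To judge how urgent the last step is, you can estimate how long the local buffer lasts at the current rates. The following Python sketch is simple arithmetic under the assumption that ingest and drain rates stay constant. The function name and parameters are hypothetical, and you supply the rates from your own collector or volume metrics.

```python
from typing import Optional


def seconds_until_buffer_full(
    free_bytes: float,
    ingest_bytes_per_second: float,
    drain_bytes_per_second: float,
) -> Optional[float]:
    """Estimate the time left before the local buffer fills at constant rates.

    Returns None when the archive path drains at least as fast as records arrive,
    which means the backlog is not growing.
    """
    net_growth = ingest_bytes_per_second - drain_bytes_per_second
    if net_growth <= 0:
        return None
    return free_bytes / net_growth
```

For example, with 3,600,000 bytes free, 2,000 bytes per second arriving, and 1,000 bytes per second draining, the buffer fills in about one hour.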

Verify the result

Confirm that the archive path reports healthy delivery.
```
min(openbao_audit_archive_delivery_success) == 1
```

Confirm that the last successful delivery is recent.

```
time() - max(openbao_audit_archive_last_success_timestamp_seconds) < 300
```

Confirm that delivery failures and dead-lettered records stop increasing.

```
sum(
  increase(openbao_audit_archive_delivery_failures_total[15m])
)
```

```
sum(
  increase(openbao_audit_archive_dead_letter_records_total[15m])
)
```

Confirm that the local backlog drains and that replayed records reach the archive backend.
Confirm that OpenBao audit failure counters are not increasing.
Wait for the alert window to pass and confirm that OpenBaoAuditArchiveDegraded resolves.

Troubleshooting

The alert fires in a local or exempt environment

Do not publish `openbao_audit_archive_enabled` with a value of `1` for environments that do not require durable archive delivery. Leave the metric absent or set it to `0` to keep the alert quiet.

Delivery success is healthy but the timestamp is stale

Check whether the archive writer updates the timestamp only on successful batches. If the environment is quiet, add an audited archive canary or update the writer to report explicit heartbeat delivery.
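
One way a writer could report explicit heartbeat delivery is sketched below in Python. This is an illustration under assumptions, not the behavior of any specific collector. The `LastSuccessTracker` class, the `send_heartbeat` callable, and the 60-second interval are all hypothetical. The timestamp advances only when a real batch or a heartbeat is acknowledged by the archive backend.

```python
import time
from typing import Callable, Optional


class LastSuccessTracker:
    """Track the value to publish as the last-success timestamp gauge."""

    def __init__(
        self,
        heartbeat_interval_seconds: float = 60.0,  # assumed interval
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.heartbeat_interval_seconds = heartbeat_interval_seconds
        self._clock = clock
        self.last_success: Optional[float] = None

    def record_batch(self, acknowledged: bool) -> None:
        """Advance the timestamp only when the archive acknowledged the batch."""
        if acknowledged:
            self.last_success = self._clock()

    def maybe_heartbeat(self, send_heartbeat: Callable[[], bool]) -> bool:
        """Send a heartbeat when no delivery succeeded within the interval.

        Returns True when a heartbeat was attempted. The timestamp advances
        only if send_heartbeat reports an acknowledgement.
        """
        now = self._clock()
        due = (
            self.last_success is None
            or now - self.last_success >= self.heartbeat_interval_seconds
        )
        if not due:
            return False
        if send_heartbeat():
            self.last_success = now
        return True
```

Because a failed heartbeat leaves the timestamp unchanged, a quiet environment with a broken archive path still goes stale and still triggers the alert.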

Dead-letter records increase

Inspect dead-letter metadata without exposing raw audit records. Common causes include parser changes, schema changes, record-size limits, SIEM policy rejections, and object-store permission errors.
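
The Python sketch below shows one way to summarize dead-letter causes without printing record contents. It assumes a hypothetical dead-letter format of one JSON object per line with a `reason` field. Your pipeline's dead-letter format may differ, so treat the field name and the function name as placeholders.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_dead_letters(lines: Iterable[str]) -> Counter:
    """Count dead-letter entries by reason without echoing any record payload."""
    reasons: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Count the entry, but never print the raw line.
            reasons["unparseable_entry"] += 1
            continue
        # "reason" is an assumed metadata field; only its value is kept.
        reason = entry.get("reason") if isinstance(entry, dict) else None
        reasons[str(reason) if reason else "unknown"] += 1
    return reasons
```

Only the per-reason counts leave the function, so you can share the summary in a ticket. Confirm first that the reason values in your pipeline do not embed audit record fields.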

Loki is healthy but archive delivery is degraded

Treat this as an evidence-retention incident. Loki exploration does not replace the durable archive path.

The archive backend is in maintenance

Silence the alert only when the security owner confirms the maintenance window, local backlog capacity, replay plan, and evidence handling.

What’s next

- Use Audit archive reference design to validate the archive pattern and failure modes.
- Use the audit archive health example to publish the reference metrics expected by this alert.
- Use Audit log stream missing when the short-term audit exploration stream is missing.
- Use Audit request and response failures when OpenBao reports audit-device failures.
- Use Audit canary missing when the Loki-backed audit canary is absent.

Source: OpenBao documents audit devices and audit blocking behavior in the OpenBao audit device documentation. The reference archive metrics and failure model come from the Audit archive reference design.