Audit archive degraded

Use this runbook when the OpenBaoAuditArchiveDegraded alert fires. The alert means durable audit archiving is enabled but its health metrics are missing, the last successful delivery is stale, delivery is failing, or records are being dead-lettered. The steps help you protect audit evidence while you restore archive delivery.

Before you begin

  • Get access to Prometheus or the alert evaluator that reads archive health metrics.
  • Get access to the collector, gateway, SIEM, object-store writer, or security pipeline that emits archive health metrics.
  • Get access to the local OpenBao audit file or replay source.
  • Get security approval before changing archive retention, archive writer credentials, dead-letter handling, or audit-device configuration.

Warning: Audit records are security evidence. Do not paste raw audit records into tickets, chat, or public logs while you investigate archive delivery.

Understand the alert metrics

The alert uses reference health metrics from the archive delivery pipeline, not OpenBao server metrics. Publish these metrics from your collector, archive gateway, SIEM forwarder, object-store writer, or another controlled component.

| Metric | Type | Meaning |
| --- | --- | --- |
| openbao_audit_archive_enabled | Gauge | Set to 1 only when this environment expects archive delivery. Leave absent or 0 for local and exempt environments. |
| openbao_audit_archive_delivery_success | Gauge | Set to 1 when the archive path is currently healthy and 0 when delivery is degraded. |
| openbao_audit_archive_last_success_timestamp_seconds | Gauge | Unix timestamp for the last successful archive delivery or acknowledgement. |
| openbao_audit_archive_delivery_failures_total | Counter | Count of failed archive writes, rejected batches, or failed delivery acknowledgements. |
| openbao_audit_archive_dead_letter_records_total | Counter | Count of records sent to a dead-letter path instead of the durable archive. |

If your backend exposes different metric names, add recording rules that map them to these reference names before you enable this alert. The audit archive health example shows both a small exporter and a recording-rule mapping pattern.
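
As an illustration of what publishing these metrics can look like, the following minimal sketch renders the five reference metrics in the Prometheus text exposition format. It is not the audit archive health example mentioned above. The `ArchiveHealth` class and `render_metrics` function are hypothetical names, and how your writer tracks these values is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ArchiveHealth:
    """Hypothetical snapshot of archive delivery state kept by the writer."""
    enabled: bool
    delivery_ok: bool
    last_success_unix: float
    failures_total: int
    dead_letter_total: int


def render_metrics(h: ArchiveHealth) -> str:
    """Render the five reference metrics in Prometheus text exposition format."""
    samples = [
        ("openbao_audit_archive_enabled", "gauge", int(h.enabled)),
        ("openbao_audit_archive_delivery_success", "gauge", int(h.delivery_ok)),
        ("openbao_audit_archive_last_success_timestamp_seconds", "gauge",
         h.last_success_unix),
        ("openbao_audit_archive_delivery_failures_total", "counter",
         h.failures_total),
        ("openbao_audit_archive_dead_letter_records_total", "counter",
         h.dead_letter_total),
    ]
    lines = []
    for name, kind, value in samples:
        # One TYPE line followed by one sample line per metric.
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Serve the returned text from whatever HTTP endpoint your scraper already reads. In local or exempt environments, omit the exporter or report `enabled=False` so that the alert stays quiet.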

Confirm the degradation

  1. Confirm that archive delivery is enabled for this environment.

    max(openbao_audit_archive_enabled)
    
  2. Check the current archive delivery status.

    min(openbao_audit_archive_delivery_success)
    
  3. Check how long it has been since the last successful archive delivery.

    time() - max(openbao_audit_archive_last_success_timestamp_seconds)
    
  4. Check delivery failures over the alert window.

    sum(
      increase(openbao_audit_archive_delivery_failures_total[15m])
    )
    
  5. Check dead-lettered records over the alert window.

    sum(
      increase(openbao_audit_archive_dead_letter_records_total[15m])
    )
    
  6. Record whether the alert is caused by missing health metrics, stale delivery, failed delivery, or dead-lettered records.
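
The classification in step 6 can be sketched as a small function over the five query results. This is an illustration only: `classify_degradation` is a hypothetical name, `None` stands for a query that returned no data, and the 300-second staleness threshold is an assumption borrowed from the verification query in this runbook, not from the alert rule itself.

```python
from typing import Optional


def classify_degradation(
    enabled: Optional[float],
    success: Optional[float],
    last_success_age_s: Optional[float],
    failures_15m: Optional[float],
    dead_letters_15m: Optional[float],
    stale_after_s: float = 300.0,  # assumption: align with your alert rule
) -> list[str]:
    """Map the results of the five confirmation queries to likely causes."""
    if not enabled:
        # Absent or 0 means this environment does not expect archive delivery.
        return ["not_enabled"]
    causes = []
    if success is None or last_success_age_s is None:
        causes.append("missing_health_metrics")
    if success == 0 or (failures_15m or 0) > 0:
        causes.append("failed_delivery")
    if last_success_age_s is not None and last_success_age_s > stale_after_s:
        causes.append("stale_delivery")
    if (dead_letters_15m or 0) > 0:
        causes.append("dead_lettered_records")
    return causes
```

More than one cause can apply at once, so record every label that the function returns.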

Check OpenBao audit health

  1. Check request audit failures to confirm that OpenBao is still writing audit records.

    sum(
      increase(${p}_audit_log_request_failure[5m])
    )
    
    • ${p}: Metric prefix for your deployment. Use vault, the OpenBao default prefix, or openbao if you set metrics_prefix = "openbao".

  2. Check response audit failures.

    sum(
      increase(${p}_audit_log_response_failure[5m])
    )
    
  3. If either counter increases, follow the Audit request and response failures runbook before you focus on downstream archive delivery.

  4. For file audit devices, confirm that the local audit file is still growing and that the volume has enough space to buffer records while archive delivery is degraded. Run stat twice, a short interval apart, and compare the size and modification time.

    stat <audit_log_file>
    
    • <audit_log_file>: Full path to the OpenBao audit log file.

    df -h <audit_log_directory>
    
    • <audit_log_directory>: Directory that contains the audit log file.
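
The stat and df checks above can also be scripted. The following sketch uses only the Python standard library; the function names and the dictionary layout are hypothetical, and the interval between the two snapshots is up to you.

```python
import os
import shutil


def audit_file_snapshot(path: str) -> dict:
    """Capture file size, modification time, and free space on its volume."""
    st = os.stat(path)
    usage = shutil.disk_usage(os.path.dirname(os.path.abspath(path)))
    return {
        "size": st.st_size,
        "mtime": st.st_mtime,
        "free_bytes": usage.free,
        "total_bytes": usage.total,
    }


def is_growing(before: dict, after: dict) -> bool:
    """Return True when the file grew between two snapshots.

    A smaller size usually means the file was rotated between snapshots,
    so treat a False result as a prompt to look manually.
    """
    return after["size"] > before["size"]
```

Take one snapshot, wait long enough for at least one request to reach OpenBao, take a second snapshot, and compare them with `is_growing`.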

Restore archive delivery

  1. Keep the local audit file or replay source intact. Do not delete collector positions, buffered files, queues, or dead-letter records until security responders approve the recovery plan.

  2. Restore collector health when the collector cannot read the local audit file or send batches to the archive path.

  3. Restore archive backend connectivity when object storage, SIEM ingestion, or the archive gateway is unavailable.

  4. Rotate or restore archive writer credentials when authentication failures cause delivery errors.

  5. Fix parser, schema, size, or policy errors when records are rejected or dead-lettered.

  6. Replay records from the local audit file, collector queue, or dead-letter path after the delivery path is healthy.

  7. Increase local buffer capacity or collector throughput when backlog grows faster than the archive path can drain.
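
Step 7 is a capacity question: how long until the local buffer fills if the backlog keeps growing? A rough estimate, assuming roughly constant rates that you measure yourself, is free space divided by the net growth rate. The function name and parameters below are hypothetical.

```python
import math


def seconds_until_buffer_full(
    free_bytes: float,
    ingest_bytes_per_s: float,
    drain_bytes_per_s: float,
) -> float:
    """Estimate time until the buffer volume fills at constant rates."""
    net_growth = ingest_bytes_per_s - drain_bytes_per_s
    if net_growth <= 0:
        # The archive path drains at least as fast as records arrive.
        return math.inf
    return free_bytes / net_growth
```

For example, 10 GiB of free space with 2 MiB/s of new records and 1 MiB/s of drain leaves 10240 seconds, a little under three hours, before the volume fills.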

Verify the result

  1. Confirm that the archive path reports healthy delivery.

    min(openbao_audit_archive_delivery_success) == 1
    
  2. Confirm that the last successful delivery is recent.

    time() - max(openbao_audit_archive_last_success_timestamp_seconds) < 300
    
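  3. Confirm that delivery failures and dead-lettered records stop increasing. Both queries should return 0 once a full 15-minute window has passed without new failures or dead-lettered records.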

    sum(
      increase(openbao_audit_archive_delivery_failures_total[15m])
    )
    
    sum(
      increase(openbao_audit_archive_dead_letter_records_total[15m])
    )
    
  4. Confirm that the local backlog drains and that replayed records reach the archive backend.

  5. Confirm that OpenBao audit failure counters are not increasing.

  6. Wait for the alert window to pass and confirm that OpenBaoAuditArchiveDegraded resolves.
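
If you run the verification queries through the Prometheus HTTP API instead of the UI, the results arrive as JSON. The sketch below reads the first sample from an instant-vector response and evaluates the metric checks from steps 1 to 3. The function names are hypothetical, the 300-second limit mirrors the query in step 2, and fetching the responses is left to your own tooling.

```python
from typing import Optional


def first_sample(response: dict) -> Optional[float]:
    """Return the first sample of an instant-vector result, or None."""
    data = response.get("data", {})
    if response.get("status") != "success":
        return None
    if data.get("resultType") != "vector":
        return None
    result = data.get("result", [])
    if not result:
        return None
    # Each sample is [unix_timestamp, "value as string"].
    return float(result[0]["value"][1])


def recovery_checks(
    success: Optional[float],
    age_s: Optional[float],
    failures_15m: Optional[float],
    dead_letters_15m: Optional[float],
    max_age_s: float = 300.0,
) -> dict:
    """Evaluate the metric checks; missing data never counts as healthy."""
    return {
        "delivery_healthy": success == 1,
        "recent_success": age_s is not None and age_s < max_age_s,
        "failures_stopped": failures_15m == 0,
        "dead_letters_stopped": dead_letters_15m == 0,
    }
```

Backlog drain and replay arrival in step 4 still need to be confirmed in the collector and the archive backend, because these metrics do not cover them.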

Troubleshooting

The alert fires in a local or exempt environment

Do not publish openbao_audit_archive_enabled=1 for environments that do not require durable archive delivery. Use absence or 0 to keep the alert quiet.

Delivery success is healthy but the timestamp is stale

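A healthy delivery gauge with a stale timestamp usually means that no batches were delivered recently. This is expected when the archive writer updates the timestamp only on successful batches and the environment is quiet. Check whether the writer behaves this way. If it does, add an audited archive canary or update the writer to report explicit heartbeat delivery.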
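
One way to picture the heartbeat option is a tracker that treats every acknowledgement from the archive backend the same way, whether it answers a real batch or a heartbeat. This is a sketch under that assumption; `DeliveryTracker` and `record_ack` are hypothetical names, not part of any OpenBao or collector API.

```python
import time
from typing import Callable


class DeliveryTracker:
    """Track the values behind the success gauge and the timestamp gauge."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._clock = clock
        self.last_success_unix = 0.0
        self.delivery_ok = False

    def record_ack(self, acknowledged: bool) -> None:
        """Call for every batch or heartbeat that the backend answers."""
        self.delivery_ok = acknowledged
        if acknowledged:
            # Only a real acknowledgement may advance the timestamp.
            self.last_success_unix = self._clock()
```

Send the heartbeat through the same delivery path as real batches. A timestamp updated by a local timer would hide exactly the failures this alert is meant to catch.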

Dead-letter records increase

Inspect dead-letter metadata without exposing raw audit records. Common causes include parser changes, schema changes, record-size limits, SIEM policy rejections, and object-store permission errors.
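
To inspect metadata without exposing record contents, count dead-lettered records by rejection reason and print only the counts. The sketch below assumes a hypothetical dead-letter format: one JSON envelope per line with a `reason` field. Your pipeline's format will differ, so adapt the field name accordingly.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_dead_letters(lines: Iterable[str]) -> Counter:
    """Count dead-lettered records by reason without returning any payload."""
    reasons = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            envelope = json.loads(line)
        except json.JSONDecodeError:
            reasons["unparseable_envelope"] += 1
            continue
        if not isinstance(envelope, dict):
            reasons["unparseable_envelope"] += 1
            continue
        # Read only the hypothetical "reason" field; ignore everything else.
        reasons[str(envelope.get("reason", "unknown"))] += 1
    return reasons
```

Only the reason labels and their counts leave the function, so the summary is safer to share in a ticket than the records themselves, provided the reason text itself contains no audit data.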

Loki is healthy but archive delivery is degraded

Treat this as an evidence-retention incident. Loki exploration does not replace the durable archive path.

The archive backend is in maintenance

Silence the alert only when the security owner confirms the maintenance window, local backlog capacity, replay plan, and evidence handling.

What’s next

Source: OpenBao documents audit devices and audit blocking behavior in the OpenBao audit device documentation. The reference archive metrics and failure model come from the Audit archive reference design.