ScadaBridge/src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/FallbackAuditWriter.cs
Joseph Doherty 635461c0fd chore(audit): ScadaBridge C7 — perf re-baseline + CollapseAuditLogToCanonical projection test + index-test fix + dead-cref cleanup (Task 2.5)
Perf re-baseline (HotPathLatencyTests): empirical p95 on Apple M-series Release
build: 4KB DetailsJson slow path ≈14 µs, small-DetailsJson no-redactors ≈2 µs,
true no-op fast path ≈0 µs. Thresholds updated: 200 µs / 30 µs / 5 µs (≈15×
headroom for contested CI runners). Old thresholds (50 µs / 10 µs) were set for
the pre-C3 typed-field path; canonical JSON parse+rewrite is empirically faster.
Adds a third test (Filter_Apply_NoDetailsJson_FastPath) that asserts same-instance
return on the DetailsJson-null + within-cap fast path. Env-var overrides retained.

CollapseAuditLogToCanonicalMigrationTests (new): three MSSQL-gated [SkippableFact]
tests verifying Action/Category/Outcome projection, NULL Actor, DetailsJson codec
round-trip, and all six persisted computed columns (Kind/Status/SourceSiteId/
ExecutionId/ParentExecutionId) for ApiOutbound, InboundAuthFailure, and Failed-
status rows.

AddAuditLogTableMigrationTests: rename CreatesFiveNamedIndexes →
CreatesNineNamedIndexes; expand coverage from 5 original indexes to all 9 named
non-clustered indexes present after CollapseAuditLogToCanonical (adds
IX_AuditLog_Execution, IX_AuditLog_ParentExecution, IX_AuditLog_Node_Occurred,
UX_AuditLog_EventId).

Dead-cref cleanup: zero references to the deleted IAuditPayloadFilter /
DefaultAuditPayloadFilter / SafeDefaultAuditPayloadFilter types remain in any
.cs file (source or test). 26 occurrences across 13 files replaced with correct
references to IAuditRedactor / ScadaBridgeAuditRedactor / SafeDefaultAuditRedactor
or reworded as plain prose.

Residual sweep: no unused transitional code found beyond the acknowledged
"C3 transitional shim" comments on IngestedAtUtc stamping (active code, not dead).
2026-06-02 14:59:23 -04:00

using Microsoft.Extensions.Logging;
using ZB.MOM.WW.Audit;
using ZB.MOM.WW.ScadaBridge.AuditLog.Redaction;
using ZB.MOM.WW.ScadaBridge.Commons.Interfaces.Services;
using IAuditWriter = ZB.MOM.WW.ScadaBridge.Commons.Interfaces.Services.IAuditWriter;
namespace ZB.MOM.WW.ScadaBridge.AuditLog.Site;
/// <summary>
/// Composes the primary <see cref="SqliteAuditWriter"/> with a drop-oldest
/// <see cref="RingBufferFallback"/>. Audit writes are best-effort by contract
/// (see <see cref="IAuditWriter"/>) — a primary failure must NEVER bubble out
/// to the calling script. Failed events are stashed in the ring; on the next
/// successful primary write the ring is drained back through the primary in
/// FIFO order.
/// </summary>
/// <remarks>
/// <para>
/// Each primary failure increments <see cref="IAuditWriteFailureCounter"/> so
/// Site Health Monitoring can surface a sustained outage as
/// <c>SiteAuditWriteFailures</c> (Bundle G).
/// </para>
/// <para>
/// Errors raised by the ring drain on recovery are logged and swallowed, and
/// the event that failed is re-enqueued at the tail of the ring, so we don't
/// loop the failure mode. The trigger event itself succeeded, and retrying
/// the drain on the NEXT successful write is the recovery path.
/// </para>
/// </remarks>
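/// <example>
/// A minimal usage sketch, not the production wiring (which goes through
/// <see cref="ServiceCollectionExtensions.AddAuditLog"/>). It assumes that
/// <c>primary</c>, <c>ring</c>, <c>failureCounter</c>, <c>logger</c> and
/// <c>evt</c> already exist; how each of them is constructed is outside this
/// file and is not shown here.
/// <code>
/// // Redactor argument omitted, so SafeDefaultAuditRedactor.Instance is used.
/// IAuditWriter writer = new FallbackAuditWriter(primary, ring, failureCounter, logger);
///
/// // Completes normally even if the primary throws: the failure is counted
/// // and logged, and the redacted event is stashed in the ring for a later drain.
/// await writer.WriteAsync(evt);
/// </code>
/// </example>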
public sealed class FallbackAuditWriter : IAuditWriter
{
    private readonly IAuditWriter _primary;
    private readonly RingBufferFallback _ring;
    private readonly IAuditWriteFailureCounter _failureCounter;
    private readonly ILogger<FallbackAuditWriter> _logger;
    private readonly IAuditRedactor _redactor;
    private readonly SemaphoreSlim _drainGate = new(1, 1);

    /// <summary>
    /// Bundle C (M5-T6) wires the singleton <see cref="IAuditRedactor"/>
    /// here so every event written via the site hot path is truncated +
    /// header/body/SQL-param redacted before it hits both the primary SQLite
    /// writer AND the ring fallback. The parameter is optional (defaults to
    /// the always-safe <see cref="SafeDefaultAuditRedactor"/>) so the long
    /// tail of test composition roots that don't care about the redactor need
    /// no change; the production
    /// <see cref="ServiceCollectionExtensions.AddAuditLog"/> registration
    /// always passes the real redactor through.
    /// </summary>
    /// <param name="primary">The primary audit writer (typically the SQLite writer).</param>
    /// <param name="ring">Drop-oldest ring buffer used to stash events when the primary fails.</param>
    /// <param name="failureCounter">Counter incremented on each primary failure for health reporting.</param>
    /// <param name="logger">Logger for diagnostics.</param>
    /// <param name="redactor">Optional canonical redactor applied before writing; null means the always-safe default.</param>
    public FallbackAuditWriter(
        IAuditWriter primary,
        RingBufferFallback ring,
        IAuditWriteFailureCounter failureCounter,
        ILogger<FallbackAuditWriter> logger,
        IAuditRedactor? redactor = null)
    {
        _primary = primary ?? throw new ArgumentNullException(nameof(primary));
        _ring = ring ?? throw new ArgumentNullException(nameof(ring));
        _failureCounter = failureCounter ?? throw new ArgumentNullException(nameof(failureCounter));
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));

        // AuditLog-008: never default to a null redactor; over-redact instead.
        // C3 (Task 2.5): wired via the canonical IAuditRedactor seam.
        // SafeDefaultAuditRedactor performs HTTP header redaction with the
        // hard-coded sensitive defaults (Authorization, X-Api-Key, Cookie,
        // Set-Cookie) on the DetailsJson summaries so a test composition root
        // that doesn't bind the real options never persists those headers
        // verbatim. The full ScadaBridgeAuditRedactor (truncation + body /
        // SQL-param redaction) is wired by AddAuditLog and takes precedence.
        _redactor = redactor ?? SafeDefaultAuditRedactor.Instance;
    }

    /// <inheritdoc />
    public async Task WriteAsync(AuditEvent evt, CancellationToken ct = default)
    {
        ArgumentNullException.ThrowIfNull(evt);

        // Redact once, up-front. The redacted event flows BOTH to the primary
        // and (on failure) to the ring buffer, so a primary outage that
        // drains later still hands the SqliteAuditWriter a row that has
        // already been truncated and redacted. The redactor contract is
        // "MUST NOT throw". AuditLog-008: _redactor is now non-null (defaults
        // to SafeDefaultAuditRedactor so header redaction is always applied
        // even in composition roots that don't wire the real redactor).
        var filtered = _redactor.Apply(evt);

        try
        {
            await _primary.WriteAsync(filtered, ct).ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            // Primary down: record the failure, stash in the ring, return
            // success to the caller. Audit-write failures NEVER abort the
            // user-facing action (alog.md §7). DO NOT attempt the ring drain
            // here: the primary is throwing, so draining would just scramble
            // FIFO order across re-enqueues.
            _failureCounter.Increment();
            _logger.LogWarning(ex,
                "Primary audit writer threw; routing EventId {EventId} to drop-oldest ring.",
                filtered.EventId);

            // Ring stores the filtered copy so the eventual drain replays a
            // payload that has already been capped/redacted: no second
            // filter pass is needed on recovery, and there is no risk of the
            // ring holding the raw oversized blob in memory.
            _ring.TryEnqueue(filtered);
            return;
        }

        // Primary succeeded, so opportunistically drain anything that piled
        // up in the ring during the outage. Best-effort: a failure during the
        // drain re-enqueues the popped event and is logged; the next
        // successful write will retry. Drain order in the audit log is
        // therefore: <triggering event>, <backlog FIFO>.
        if (_ring.Count > 0)
        {
            await TryDrainRingAsync(ct).ConfigureAwait(false);
        }
    }

    private async Task TryDrainRingAsync(CancellationToken ct)
    {
        // Serialise drains so two concurrent recoveries don't double-replay.
        if (!await _drainGate.WaitAsync(0, ct).ConfigureAwait(false))
        {
            return;
        }

        try
        {
            // Pull only what is currently buffered; do NOT wait for new events.
            // We iterate over a snapshot of Count so the drain terminates even
            // under concurrent enqueues.
            var pending = _ring.Count;
            for (var i = 0; i < pending; i++)
            {
                if (!_ring.TryDequeue(out var queued))
                {
                    break;
                }

                try
                {
                    await _primary.WriteAsync(queued, ct).ConfigureAwait(false);
                }
                catch (Exception ex)
                {
                    // Primary fell over again. Putting the event back at the
                    // head of the queue is impossible with Channel<T>, so
                    // route it to the tail (drop-oldest preserves the
                    // most-recent picture).
                    _failureCounter.Increment();
                    _logger.LogWarning(ex,
                        "Ring drain re-throw on EventId {EventId}; re-enqueuing.",
                        queued.EventId);
                    _ring.TryEnqueue(queued);
                    break;
                }
            }
        }
        finally
        {
            _drainGate.Release();
        }
    }
}