feat(health): AuditRedactionFailure counter + bridge (#23 M5)

Bundle C task M5-T7 — surface DefaultAuditPayloadFilter redactor
over-redactions as a Site Health metric so a misconfigured /
catastrophic regex shows up on /monitoring/health rather than
disappearing into a NoOp sink.

  - SiteHealthReport: new 'AuditRedactionFailure' int field
    (defaulted to 0 for back-compat with existing producers/tests).
  - ISiteHealthCollector / SiteHealthCollector:
    new IncrementAuditRedactionFailure() — per-interval atomic
    counter with Interlocked, reset on CollectReport, mirroring
    the M2 Bundle G SiteAuditWriteFailures pattern.
  - HealthMetricsAuditRedactionFailureCounter: new bridge in
    ScadaLink.AuditLog.Site that forwards IAuditRedactionFailureCounter
    increments to ISiteHealthCollector — mirrors
    HealthMetricsAuditWriteFailureCounter one-for-one.
  - AddAuditLogHealthMetricsBridge: now ALSO Replaces the
    NoOpAuditRedactionFailureCounter binding with the health-metrics
    bridge, so a single AddAuditLogHealthMetricsBridge() call wires
    both the M2 Bundle G write-failure counter and the M5 Bundle C
    redaction-failure counter into the health report.

Site-side only for M5 — the filter also runs on CentralAuditWriter
and AuditLogIngestActor (where it just keeps the NoOp default), but
a central-side health-metric surface for AuditRedactionFailure is
deferred to M6 alongside the rest of the central health collector
work.

Tests:
  - AuditRedactionFailureMetricTests (HealthMonitoring) covers the
    SiteHealthCollector increment/report/reset shape (3 tests).
  - HealthMetricsAuditRedactionFailureCounterTests (AuditLog) covers
    the AuditLog → HealthMonitoring bridge (3 tests).
  - Existing CountCapturingHealthCollector stub in
    DeploymentManagerRedeployTests extended with the new no-op
    interface method.

Verified: dotnet build clean, all 24 test projects green
(the only Failed at first ScadaLink.SiteRuntime.Tests run was the
known-flaky InstanceActorChildAttributeRaceTests; passes on re-run
in isolation and full suite, unrelated to these changes).
This commit is contained in:
Joseph Doherty
2026-05-20 17:28:33 -04:00
parent 9b1379ed9b
commit 23c0fd417e
8 changed files with 214 additions and 12 deletions

View File

@@ -172,26 +172,38 @@ public static class ServiceCollectionExtensions
}
/// <summary>
/// Audit Log (#23) M2 Bundle G — swap the default
/// <see cref="NoOpAuditWriteFailureCounter"/> registration for the real
/// <see cref="HealthMetricsAuditWriteFailureCounter"/> bridge so the
/// FallbackAuditWriter primary-failure counter surfaces in the site health
/// report payload as <c>SiteHealthReport.SiteAuditWriteFailures</c>.
/// Audit Log (#23) M2 Bundle G + M5 Bundle C — swap the default
/// <see cref="NoOpAuditWriteFailureCounter"/> and
/// <see cref="NoOpAuditRedactionFailureCounter"/> registrations for the
/// real <see cref="HealthMetricsAuditWriteFailureCounter"/> /
/// <see cref="HealthMetricsAuditRedactionFailureCounter"/> bridges so the
/// FallbackAuditWriter primary-failure counter AND the
/// DefaultAuditPayloadFilter redactor-failure counter both surface in the
/// site health report payload as
/// <c>SiteHealthReport.SiteAuditWriteFailures</c> +
/// <c>SiteHealthReport.AuditRedactionFailure</c>.
/// </summary>
/// <remarks>
/// <para>
/// Must be called AFTER both <see cref="AddAuditLog"/> (registers the
/// NoOp default this method replaces) and
/// NoOp defaults this method replaces) and
/// <c>ScadaLink.HealthMonitoring.ServiceCollectionExtensions.AddHealthMonitoring</c>
/// or <c>AddSiteHealthMonitoring</c> (registers the
/// <see cref="ISiteHealthCollector"/> the bridge depends on). Resolving
/// <see cref="IAuditWriteFailureCounter"/> without the latter throws
/// <see cref="ISiteHealthCollector"/> the bridges depend on). Resolving
/// <see cref="IAuditWriteFailureCounter"/> or
/// <see cref="IAuditRedactionFailureCounter"/> without the latter throws
/// <see cref="InvalidOperationException"/> at <c>GetRequiredService</c>
/// time — by design, since a silent NoOp would mask a misconfiguration.
/// </para>
/// <para>
/// Idempotent — calling twice replaces the descriptor each time without
/// piling up registrations.
/// Idempotent — calling twice replaces each descriptor without piling up
/// registrations.
/// </para>
/// <para>
/// Site-side only for M5: the central composition root keeps the NoOp
/// defaults; the central health-metric surface that would expose
/// <c>AuditRedactionFailure</c> next to the existing central counters
/// ships in M6.
/// </para>
/// </remarks>
public static IServiceCollection AddAuditLogHealthMetricsBridge(this IServiceCollection services)
@@ -200,6 +212,8 @@ public static class ServiceCollectionExtensions
services.Replace(
ServiceDescriptor.Singleton<IAuditWriteFailureCounter, HealthMetricsAuditWriteFailureCounter>());
services.Replace(
ServiceDescriptor.Singleton<IAuditRedactionFailureCounter, HealthMetricsAuditRedactionFailureCounter>());
return services;
}
}

View File

@@ -0,0 +1,48 @@
using ScadaLink.AuditLog.Payload;
using ScadaLink.HealthMonitoring;
namespace ScadaLink.AuditLog.Site;
/// <summary>
/// Audit Log (#23) M5 Bundle C — bridges
/// <see cref="IAuditRedactionFailureCounter"/> (incremented by
/// <see cref="DefaultAuditPayloadFilter"/> every time a header / body / SQL
/// parameter redactor stage throws and the filter has to over-redact the
/// offending field) into <see cref="ISiteHealthCollector"/> so the count
/// surfaces in the site health report payload as
/// <c>SiteHealthReport.AuditRedactionFailure</c>.
/// </summary>
/// <remarks>
/// <para>
/// Registered by <see cref="ServiceCollectionExtensions.AddAuditLogHealthMetricsBridge"/>;
/// callers must register <c>AddHealthMonitoring()</c> first so
/// <see cref="ISiteHealthCollector"/> resolves. The default <see cref="ServiceCollectionExtensions.AddAuditLog"/>
/// registration keeps <see cref="NoOpAuditRedactionFailureCounter"/> for nodes
/// where Site Health Monitoring is not wired (the silent-sink contract —
/// redaction failures must NEVER abort the user-facing action, alog.md §7).
/// </para>
/// <para>
/// Mirrors the M2 Bundle G <see cref="HealthMetricsAuditWriteFailureCounter"/>
/// shape one-for-one so the two health-metric bridges age together.
/// </para>
/// <para>
/// Site-side only for M5: the redaction filter also runs on the central
/// writers (CentralAuditWriter + AuditLogIngestActor), but the central
/// health-metric surface that would expose <c>AuditRedactionFailure</c>
/// alongside the existing central counters ships in M6. Until then, the
/// central composition root keeps the NoOp default — the redactions still
/// happen, they just don't get counted into a health report.
/// </para>
/// </remarks>
public sealed class HealthMetricsAuditRedactionFailureCounter : IAuditRedactionFailureCounter
{
private readonly ISiteHealthCollector _collector;
public HealthMetricsAuditRedactionFailureCounter(ISiteHealthCollector collector)
{
_collector = collector ?? throw new ArgumentNullException(nameof(collector));
}
/// <inheritdoc/>
public void Increment() => _collector.IncrementAuditRedactionFailure();
}

View File

@@ -25,7 +25,14 @@ public record SiteHealthReport(
// primary failures (SQLite throws routed to the drop-oldest ring). Surfaces
// a sustained audit-write outage on /monitoring/health. Defaults to 0 so
// existing producers / tests that don't construct the field stay valid.
int SiteAuditWriteFailures = 0);
int SiteAuditWriteFailures = 0,
// Audit Log (#23) M5 Bundle C: per-interval count of payload-filter
// redactor over-redactions (header / body / SQL parameter stages all
// throwing → field replaced with the "<redacted: redactor error>"
// marker). Surfaces a misconfigured / catastrophic regex on
// /monitoring/health. Defaults to 0 for back-compat with existing
// producers and tests that don't construct the field.
int AuditRedactionFailure = 0);
/// <summary>
/// Broadcast wrapper used between central nodes to keep per-node

View File

@@ -19,6 +19,15 @@ public interface ISiteHealthCollector
/// <c>AddAuditLogHealthMetricsBridge()</c>.
/// </summary>
void IncrementSiteAuditWriteFailures();
/// <summary>
/// Audit Log (#23) M5 Bundle C — increment the per-interval count of
/// payload-filter redactor over-redactions (header / body / SQL
/// parameter stage throws routed to the
/// <c>&lt;redacted: redactor error&gt;</c> marker). Bridged from the
/// <c>IAuditRedactionFailureCounter</c> binding registered via
/// <c>AddAuditLogHealthMetricsBridge()</c>.
/// </summary>
void IncrementAuditRedactionFailure();
void UpdateConnectionHealth(string connectionName, ConnectionHealth health);
void RemoveConnection(string connectionName);
void UpdateTagResolution(string connectionName, int totalSubscribed, int successfullyResolved);

View File

@@ -14,6 +14,7 @@ public class SiteHealthCollector : ISiteHealthCollector
private int _alarmErrorCount;
private int _deadLetterCount;
private int _siteAuditWriteFailures;
private int _auditRedactionFailures;
private readonly ConcurrentDictionary<string, ConnectionHealth> _connectionStatuses = new();
private readonly ConcurrentDictionary<string, TagResolutionStatus> _tagResolutionCounts = new();
private readonly ConcurrentDictionary<string, string> _connectionEndpoints = new();
@@ -74,6 +75,20 @@ public class SiteHealthCollector : ISiteHealthCollector
Interlocked.Increment(ref _siteAuditWriteFailures);
}
/// <summary>
/// Audit Log (#23) M5 Bundle C — increment the per-interval count of
/// payload-filter redactor over-redactions (header / body / SQL
/// parameter stages routed to the
/// <c>&lt;redacted: redactor error&gt;</c> marker). Bridged from the
/// <c>IAuditRedactionFailureCounter</c> binding registered via
/// <c>AddAuditLogHealthMetricsBridge()</c>; reset every interval together
/// with the other per-interval counters.
/// </summary>
public void IncrementAuditRedactionFailure()
{
Interlocked.Increment(ref _auditRedactionFailures);
}
/// <summary>
/// Update the health status for a named data connection.
/// Called by DCL when connection state changes.
@@ -158,6 +173,7 @@ public class SiteHealthCollector : ISiteHealthCollector
var alarmErrors = Interlocked.Exchange(ref _alarmErrorCount, 0);
var deadLetters = Interlocked.Exchange(ref _deadLetterCount, 0);
var siteAuditWriteFailures = Interlocked.Exchange(ref _siteAuditWriteFailures, 0);
var auditRedactionFailures = Interlocked.Exchange(ref _auditRedactionFailures, 0);
// Snapshot current connection and tag resolution state
var connectionStatuses = new Dictionary<string, ConnectionHealth>(_connectionStatuses);
@@ -190,6 +206,7 @@ public class SiteHealthCollector : ISiteHealthCollector
DataConnectionTagQuality: tagQuality,
ParkedMessageCount: Interlocked.CompareExchange(ref _parkedMessageCount, 0, 0),
ClusterNodes: _clusterNodes?.ToList(),
SiteAuditWriteFailures: siteAuditWriteFailures);
SiteAuditWriteFailures: siteAuditWriteFailures,
AuditRedactionFailure: auditRedactionFailures);
}
}