Files
ScadaBridge/code-reviews/NotificationOutbox/findings.md
T
Joseph Doherty 5d2386cc9d fix(transport): close bundle security + plaintext-retention gaps (4 findings)
T-003: move the unlock lockout server-side. The 3-strike counter used to
live in the Razor page only — a second tab / CLI caller could re-upload
the same bytes and grind PBKDF2 indefinitely. The counter now lives in
IBundleSessionStore, keyed by ContentHash, so retries against identical
bundle bytes are throttled regardless of client. BundleLockedException
surfaces the new typed error path.

T-005: bind the manifest's non-derivative fields into AES-GCM AAD. A
SHA-256 of the manifest (with ContentHash + Encryption normalised to
sentinels) is now passed to AesGcm.Encrypt / .Decrypt, so a tampered
SourceEnvironment / ExportedBy / CreatedAtUtc on a stolen bundle yields
an authentication-tag mismatch instead of slipping past the Step-4
typo-resistant confirmation gate.

T-006: cap zip entry count, decompressed length, and compression ratio
in LoadAsync's envelope validator BEFORE any payload is decompressed,
using ZipArchiveEntry.Length / .CompressedLength. New TransportOptions
fields default to 4 entries / 200 MB / 50x ratio.

T-007: clear decrypted plaintext on the ApplyAsync failure path and zero
the buffer on success before removing the session, so a 100 MB
DecryptedContent doesn't sit in memory for the 30-min TTL after a failed
apply. A BundleSessionEvictionService BackgroundService now also drives
EvictExpired periodically so abandoned sessions clear without needing a
fresh Get() call to trigger lazy eviction.

Also resolves NO-010 — the misleading "writer never throws" XML doc was
the same code+comment my prior NO-004 await-the-writer fix already
rewrote.
2026-05-28 04:14:07 -04:00

26 KiB

Code Review — NotificationOutbox

Field Value
Module src/ScadaLink.NotificationOutbox
Design doc docs/requirements/Component-NotificationOutbox.md
Status Reviewed
Last reviewed 2026-05-28
Reviewer claude-agent
Commit reviewed 1eb6e97
Open findings 10

Summary

NotificationOutbox is a small, focused module — one ~985-line actor (NotificationOutboxActor), a strongly-typed options class, an INotificationDeliveryAdapter seam, and the single concrete EmailNotificationDeliveryAdapter. The Akka.NET conventions are textbook: every async path is wrapped with PipeTo, the dispatcher uses an in-flight guard cleared on DispatchComplete, the sender is captured before crossing the await, and the actor isolates per-notification failures so one bad row never aborts a batch. Test coverage is broad — ingest, dispatch, query, retry/discard, purge, KPI, and the new audit-emission paths (B2 attempts + B3 terminals) all have dedicated test files — and the audit-write-failure-never-aborts-delivery contract is explicitly asserted.

The dominant theme is trust-boundary leakage between Outbox, NotificationService, and ConfigurationDatabase. The outbox inherits two known defects from its sibling modules that are reachable through EmailNotificationDeliveryAdapter: the OAuth2 SASL empty-user bug (NS-021) ships every M365 send with user="", and the InsertIfNotExistsAsync check-then-act race (CD-015) lives on the outbox's ack-after-persist hot path. Neither is a defect of code under src/ScadaLink.NotificationOutbox/, but both are surfaced here because production dispatch and ingest go through these exact lines. A secondary theme is dispatcher-fire-and-forget audit writes (_ = _auditWriter.WriteAsync(...)) that can race the per-sweep scope dispose under the wrong DI graph, and a few smaller drifts: the dispatcher passes CancellationToken.None to adapter delivery (no graceful shutdown for in-flight SMTP sends), the StuckAgeThreshold XML-doc describes a behavior the design explicitly forbids (display-only, never reclaim), the MaxRetries boundary check uses >= against a config value that can be zero (immediate park on first transient failure), and several NotificationOutboxOptions fields are documented in code but absent from Component-NotificationOutbox.md. No Critical findings; two High, six Medium, two Low.

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs Yes MaxRetries zero/negative immediately parks (NotificationOutbox-002); StuckAgeThreshold XML doc contradicts design (NotificationOutbox-009); Guid.TryParse accepts compact "N" ids emitted by sites.
2 Akka.NET conventions Yes PipeTo / sender-capture / in-flight guard pattern is correctly applied throughout. Fire-and-forget _ = _auditWriter.WriteAsync(...) raises a scope-lifetime concern (NotificationOutbox-004).
3 Concurrency & thread safety Yes Actor state mutated only on actor thread. Inherited CD-015 race on InsertIfNotExistsAsync (NotificationOutbox-005) is the only race; the dispatcher's in-flight guard correctly serializes sweeps.
4 Error handling & resilience Yes Outer try/catch on RunDispatchPass/RunPurgePass keeps the in-flight guard sane; per-notification isolation is correct. CT not threaded into delivery (NotificationOutbox-003).
5 Security Yes Inherited OAuth2 empty-user (NotificationOutbox-001) reachable through the adapter. No new credential or trust-boundary issues introduced by the outbox itself.
6 Performance & resource management Yes Dispatch interval & batch size are simple polling; ResolveAdapters rebuilds the lookup per sweep (NotificationOutbox-006). No leaks.
7 Design-document adherence Yes NotificationOutboxOptions.DispatchBatchSize, DeliveredKpiWindow, PurgeInterval are not in the design doc (NotificationOutbox-007).
8 Code organization & conventions Yes Options class lives in the component project (correct); DI extension lives in the component (correct); adapter is scoped, actor singleton — interaction correctly documented in ServiceCollectionExtensions. No issues.
9 Testing coverage Yes Solid actor-behaviour coverage. Missing tests for FallbackMaxRetries / empty-SMTP-config dispatch path (NotificationOutbox-008).
10 Documentation & comments Yes XML on StuckAgeThreshold misleading (NotificationOutbox-009); XML on dispatcher's audit _ = fire-and-forget says "writer never throws" but EmitAttemptAudit still wraps in try/catch — comment contradicts itself (NotificationOutbox-010).

Findings

NotificationOutbox-001 — EmailNotificationDeliveryAdapter inherits the OAuth2 empty-user SASL bug (NS-021) on the M365 send path

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.NotificationOutbox/Delivery/EmailNotificationDeliveryAdapter.cs:185-191 (calls smtp.AuthenticateAsync("oauth2", token)); root cause in src/ScadaLink.NotificationService/MailKitSmtpClientWrapper.cs:76-79

Description

EmailNotificationDeliveryAdapter.SendAsync resolves an OAuth2 access token via _tokenService.GetTokenAsync(...) and then calls await smtp.AuthenticateAsync(config.AuthType, credentials, cancellationToken); on ISmtpClientWrapper. The production implementation (MailKitSmtpClientWrapper) constructs new SaslMechanismOAuth2("", credentials) — an empty user-name field — which Microsoft 365 SMTP rejects with 535 5.7.3 Authentication unsuccessful. The sibling NotificationService finding NS-021 documents this in full; the outbox is the new home for delivery on central, so every OAuth2 send that the outbox dispatches hits this code path. The defect is therefore reachable here even though the offending constructor lives in the NotificationService project, and the central-only redesign means this is now the only delivery path in production. Existing outbox tests do not catch it because they all substitute ISmtpClientWrapper and assert only that AuthenticateAsync is invoked with ("oauth2", "<token>") — the real SaslMechanismOAuth2 is never instantiated. OAuth2TokenService.GetTokenAsync is explicitly wired to login.microsoftonline.com/.../oauth2/v2.0/token with scope=https://outlook.office365.com/.default, so M365 SMTP is the intended target — and is precisely the relay that requires the user field to be populated.

Recommendation

Track the NS-021 fix and add an outbox-side regression test once the wrapper signature is widened. Concretely, when ISmtpClientWrapper.AuthenticateAsync is extended to accept the sender mailbox (or a dedicated oauth2UserName parameter), update EmailNotificationDeliveryAdapter.SendAsync to pass config.FromAddress, and add a test in EmailNotificationDeliveryAdapterTests that asserts the OAuth2 path forwards the sender identity. Until then, surface the same finding here so the outbox is not treated as resolved when NS-021 fires.

Resolution

Unresolved.

NotificationOutbox-002 — Dispatcher parks on first transient failure when SmtpConfiguration.MaxRetries == 0

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:348-360

Description

The transient-failure branch increments RetryCount then evaluates if (notification.RetryCount >= maxRetries) notification.Status = NotificationStatus.Parked;. maxRetries is read from the central SmtpConfiguration.MaxRetries column, which has no enforced lower bound and is not validated by the outbox. A row whose MaxRetries is 0 (or any negative value) immediately satisfies 1 >= 0 on the very first transient failure, so the notification is parked without a single retry — directly contradicting the design doc's "fixed retry interval, reuse central SMTP max-retry-count" intent, where a configured value of zero would naturally read as "never retry, fail straight to permanent". SetupSmtpRetryPolicy in the dispatch tests always supplies a positive value, so this path is not exercised.

Additionally, an operator who clears the SMTP config row drops into the FallbackMaxRetries = 10 / FallbackRetryDelay = 1 min path (ResolveRetryPolicyAsync line 251); that path is also untested — see NotificationOutbox-008. The operational result is that a single bad SMTP config value silently halves the outbox's delivery guarantees.

Recommendation

Validate MaxRetries at the read point: treat a non-positive value as either the configured fallback (current FallbackMaxRetries = 10) or — preferred — surface the mis-configuration to the operator via a health metric and refuse to dispatch until the row is corrected. Either way, add a test that asserts the dispatcher's behaviour for MaxRetries == 0 and MaxRetries < 0.

Resolution

Unresolved.

NotificationOutbox-003 — Dispatcher does not propagate a CancellationToken into delivery; in-flight SMTP sends cannot be cancelled on shutdown

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:334, src/ScadaLink.NotificationOutbox/Delivery/INotificationDeliveryAdapter.cs:22

Description

DeliverOneAsync calls var outcome = await adapter.DeliverAsync(notification); — the second CancellationToken parameter on INotificationDeliveryAdapter.DeliverAsync is left at its default(CancellationToken) value, meaning CancellationToken.None. EmailNotificationDeliveryAdapter.SendAsync then threads that None token into smtp.ConnectAsync, smtp.AuthenticateAsync, and smtp.SendAsync. The consequence is that during a coordinated cluster shutdown (singleton handover, drain) any in-flight SMTP send is uncancellable and the dispatcher's sweep must wait for the underlying socket/SMTP timeout (SmtpConfiguration.ConnectionTimeoutSeconds) before the sweep's task completes and DispatchComplete lowers the in-flight guard. With the default connect-timeout values this is on the order of tens of seconds per notification in the in-progress batch, blocking CoordinatedShutdown.

The adapter implementations clearly expect a token — the contract type is CancellationToken cancellationToken = default everywhere — so this is a wiring gap, not a missing interface.

Recommendation

Wire a per-sweep CancellationTokenSource linked to the actor's lifecycle (cancel in PostStop) and pass its token into DeliverAsync. A linked source per sweep also bounds individual deliveries by the configured connection timeout when a more explicit per-attempt budget is wanted. Add a test that cancels mid-DeliverAsync and asserts the dispatcher completes promptly and the row is left non-terminal (Pending/Retrying unchanged) for the next sweep.

Resolution

Unresolved.

NotificationOutbox-004 — EmitAttemptAudit/EmitTerminalAudit fire-and-forget pattern can outlive the per-sweep DI scope

Severity Medium
Category Akka.NET conventions
Status Resolved
Location src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:425-435, 463-485

Description

Both emission helpers issue _ = _auditWriter.WriteAsync(evt); — discarding the returned task. CentralAuditWriter.WriteAsync opens its own await using var scope = _services.CreateAsyncScope(); and resolves a scoped IAuditLogRepository (verified at src/ScadaLink.AuditLog/Central/CentralAuditWriter.cs:118-121), so the writer is defensively scope-independent. However the dispatcher already holds a per-sweep using var scope = _serviceProvider.CreateScope(); and the per-notification UpdateAsync runs in that scope. The fire-and-forget pattern means:

  1. The dispatcher's outer scope can be disposed (sweep done, DispatchComplete piped) while the audit WriteAsync task is still running on a different scope it owns — works today only because the writer creates its own scope.
  2. A faulted unobserved task is silently lost: if CentralAuditWriter.WriteAsync itself were ever made async void or refactored to not internally try/catch, the dispatcher would never see the fault and the audit row would vanish without the _logger.LogWarning reaching the operator.
  3. The XML-doc above EmitAttemptAudit says "PipeTo is not used because the writer never throws" — but the surrounding try { _ = _auditWriter.WriteAsync(evt); } catch (Exception ex) will only catch a synchronous throw from the task construction, not the awaited body of WriteAsync. The comment understates the risk: the catch is structurally unreachable for the documented failure mode.

The system actually wants the invariant "audit write never affects delivery" (verified by the AuditWriter_Throws_…StillSucceeds tests). That invariant is better expressed by await-ing the writer inside the actor's outer try/catch (the dispatcher already swallows per-notification exceptions) than by a discard-task, which couples the lifetime of the dispatcher's scope to that of the audit task through whatever scope graph the writer happens to use today.

Recommendation

Either await _auditWriter.WriteAsync(evt) inside the existing try/catch (the preferred fix — preserves the invariant, plays well with the per-sweep scope, and makes the catch block actually reachable), or — if a true fire-and-forget remains desired — capture the returned task and attach a continuation that calls _logger.LogWarning on faulted to keep diagnostics intact. Either way, fix the "writer never throws" XML-doc to match the implementation.

Resolution

Unresolved.

NotificationOutbox-005 — Ingest persistence inherits the CD-015 check-then-act race; under contention the second writer throws and the site retries

Severity Medium
Category Concurrency & thread safety
Status Open
Location src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:127-132 (caller); root cause in src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs:33-45

Description

HandleSubmitPersistAsync calls repository.InsertIfNotExistsAsync(notification) on INotificationOutboxRepository. The current implementation (src/ScadaLink.ConfigurationDatabase/Repositories/NotificationOutboxRepository.cs) does a check-then-act with no duplicate-key catch — documented as CD-015 (High, Open). The Notification Outbox's documented contract is "at-least-once handoff with ack-after-persist plus insert-if-not-exists on NotificationId" (CLAUDE.md, Component-NotificationOutbox.md §Ingest & Idempotency), and the duplicate-insert race is the expected contention pattern — the site retries the same submission after a lost ack. As written, the loser surfaces a SqlException (2627 PK violation) wrapped in DbUpdateException, propagates through PipeTo's failure projection as a NotificationSubmitAck { Accepted: false, Error: "... PRIMARY KEY ..." }, the site treats the ack as a forwarding failure and forwards the message again, re-entering the same race. If the contending pair keeps racing this can livelock.

The actor side is fine — PipeTo's success/failure projection correctly forwards the exception message. The repository side needs the standard 2601/2627 → no-op pattern that AuditLog and SiteCall already use. This finding tracks the outbox-side visibility of the CD-015 defect so a re-review of NotificationOutbox surfaces it even if the reader has not yet read the ConfigurationDatabase findings.

Recommendation

Track CD-015 to resolution. As a defense-in-depth complement here, consider treating a duplicate-key DbUpdateException in the actor's ingest failure projection as Accepted: true so a lost ack between persisted-by-the-first-writer and ack-back does not produce a permanent re-forward loop — but the cleanest fix remains the CD-015 raw-SQL IF NOT EXISTS … INSERT with 2601/2627 catch in NotificationOutboxRepository.

Resolution

Unresolved.

NotificationOutbox-006 — ResolveAdapters rebuilds the NotificationType → adapter dictionary on every dispatch sweep

Severity Low
Category Performance & resource management
Status Open
Location src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:267-277

Description

Every dispatch sweep calls ResolveAdapters(scope.ServiceProvider) which enumerates scopedServices.GetServices<INotificationDeliveryAdapter>() and builds a fresh Dictionary<NotificationType, INotificationDeliveryAdapter>. Adapter registration is decided at startup (AddNotificationOutbox registers EmailNotificationDeliveryAdapter); the registration set does not change at runtime. With a default DispatchInterval = 10s and only ever one entry today, the allocation overhead is trivial — but the comment "the last adapter registered for a given type wins, mirroring DI's last-wins resolution semantics" elevates this to a behaviour contract, and the per-sweep dictionary construction obscures the lookup's identity from one sweep to the next, making any future stateful adapter (rate limiter, circuit breaker) silently lose its state.

The same issue is the reason EmailNotificationDeliveryAdapter is scoped — it holds a scoped INotificationRepository. A trivial cache-the-types-but-resolve- the-instance fix is possible: cache the set of declared NotificationType values and look up each adapter by GetService<INotificationDeliveryAdapter>() filtered by Type per sweep.

Recommendation

Document the per-sweep contract explicitly ("each sweep gets a fresh adapter instance per the scoped DI contract — adapters must not carry state across sweeps") in the actor XML, or — preferred — cache only the types at startup (PreStart) and resolve the scoped instance per sweep, so future adapters with stateful intent (timeouts, circuit breakers) cannot accidentally lose state.

Resolution

Unresolved.

NotificationOutbox-007 — NotificationOutboxOptions.DispatchBatchSize, DeliveredKpiWindow, and PurgeInterval are not in the design document

Severity Medium
Category Design-document adherence
Status Open
Location src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:13, :22, :25; docs/requirements/Component-NotificationOutbox.md:152-160

Description

Component-NotificationOutbox.md §Configuration enumerates three options: dispatch interval, stuck-age threshold, and terminal-row retention window. The implemented NotificationOutboxOptions adds three additional fields:

  • DispatchBatchSize (default 100) — caps the per-sweep claim size, but is invisible to anyone reading only the spec.
  • PurgeInterval (default 1 day) — the design doc says "daily purge" as if the cadence is fixed; in code it is configurable.
  • DeliveredKpiWindow (default 1 min) — the KPI section says "Delivered (last interval)" without saying how long "last interval" is or that it is configurable.

The design doc also asserts "Delivery max-retry-count and retry interval are not part of NotificationOutboxOptions — they are reused from the central SMTP configuration" (line 160) — implementation honours this. But the three additions above are dead text in the design doc. The KPI dashboard cadence and the dispatch batch size are both operationally important values an operator/engineer will hunt for; their absence from the spec is design drift.

Recommendation

Add the three fields to Component-NotificationOutbox.md §Configuration with their defaults, or remove them from the implementation if they were meant to be fixed constants. Cross-link DeliveredKpiWindow from the §Monitoring "Delivered (last interval)" KPI bullet so a reader sees what controls the bucket length.

Resolution

Unresolved.

NotificationOutbox-008 — FallbackMaxRetries / FallbackRetryDelay path is unreachable in production AND untested

Severity Low
Category Testing coverage
Status Open
Location src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:29-31, :251-259; tests in tests/ScadaLink.NotificationOutbox.Tests/NotificationOutboxActorDispatchTests.cs

Description

ResolveRetryPolicyAsync falls back to FallbackMaxRetries = 10 and FallbackRetryDelay = 1 min when notificationRepository.GetAllSmtpConfigurationsAsync() returns an empty list (no SMTP configuration row). The comment correctly observes that delivery itself will then return Permanent("No SMTP configuration available") from EmailNotificationDeliveryAdapter.cs:78-81, so the fallback retry policy never actually retries anything — the row is permanently parked on first attempt regardless of retry count or delay.

This produces three concerns. (1) The fallback is essentially dead code — the retry policy values are never consulted in practice because delivery always fails permanently before the retry branch is reached. (2) The fallback can be reached after a previously-deployed SMTP config is deleted, which is precisely the moment an operator needs accurate audit trails; the row will say Parked with LastError = "No SMTP configuration available" but the audit signal "retry policy fell back to defaults" is invisible. (3) Tests never exercise either the fallback path or the empty-SMTP-config dispatch path: SetupSmtpRetryPolicy always supplies a config in every dispatch test.

Recommendation

Add a regression test that runs a dispatch sweep with no SMTP config row and asserts the row is parked with the documented error. Optionally remove the fallback constants if parking-with-no-config is the intended operational signal; document the choice in the actor XML so a maintainer does not "fix" the unreachable code.

Resolution

Unresolved.

NotificationOutbox-009 — StuckAgeThreshold XML-doc says "in-progress notification is re-claimed" — contradicts the design's display-only stuck detection

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.NotificationOutbox/NotificationOutboxOptions.cs:15-16

Description

/// <summary>Age past which an in-progress notification is considered stuck and re-claimed.</summary>
public TimeSpan StuckAgeThreshold { get; set; } = TimeSpan.FromMinutes(10);

The implementation never reclaims anything based on StuckAgeThreshold. It is used only as a cutoff for the stuck-count KPI (StuckCutoff/IsStuck in NotificationOutboxActor.cs:932-942) and as a StuckCutoff filter on paginated queries. The design doc is explicit: "A notification is stuck if it is Pending or Retrying and older than a configurable age threshold (default 10 minutes). Detection is display-only — a count KPI and a row badge. There is no automated escalation or alerting" (Component-NotificationOutbox.md:143-145). A maintainer reading the XML and expecting "re-claim" behaviour will be surprised twice — once when no re-claim happens, and once when they go looking for the re-claim code and find none.

Recommendation

Rewrite the XML to match the design: "Age past which a still-Pending/Retrying notification is counted as stuck on the KPI tile and the per-row badge. Display-only — does not affect dispatch."

Resolution

Unresolved.

NotificationOutbox-010 — Comment claims PipeTo is not used "because the writer never throws"; the surrounding try/catch is dead-letter for the documented failure mode

Severity Medium
Category Documentation & comments
Status Resolved
Location src/ScadaLink.NotificationOutbox/NotificationOutboxActor.cs:469-477

Description

try
{
    var evt = BuildNotifyDeliverEvent(notification, now, AuditStatus.Attempted, errorMessage)
        with { DurationMs = durationMs };
    // Fire-and-forget — we do NOT await: the dispatcher loop must not
    // be blocked by audit IO, and the writer swallows its own faults.
    // PipeTo is not used because the writer never throws.
    _ = _auditWriter.WriteAsync(evt);
}
catch (Exception ex)
{
    _logger.LogWarning(ex, "Failed to emit Attempted audit row …");
}

The XML-doc on EmitAttemptAudit is internally inconsistent and structurally incorrect: (1) if "the writer never throws" then the surrounding try/catch is unreachable and dead code; (2) if the writer can throw (and the catch is meaningful) then "never throws" is wrong. In practice the catch only ever fires on a synchronous throw from the writer's task construction — never on a fault in the awaited body — because the discarded task is not observed. The current behaviour matches the design intent ("audit failure NEVER aborts delivery"), but the comment misleads the next reader on the why.

This is the same root cause as NotificationOutbox-004 — they target the same lines from different angles (NotificationOutbox-004 is the scope-lifetime / fire-and-forget Akka concern, NotificationOutbox-010 is the doc/comment-clarity concern). Closing NotificationOutbox-004 by switching to await resolves both.

Recommendation

If await-ing the writer (recommended fix per NotificationOutbox-004): delete the "PipeTo is not used because the writer never throws" line entirely and let the try/catch's behaviour speak for itself. If keeping fire-and-forget: rewrite the comment to "fire-and-forget by design (the writer is responsible for its own failure handling); the surrounding try/catch only catches the synchronous task-construction throw and is otherwise unreachable."

Resolution

Unresolved.