Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:
Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
during the bounded-retry window aborts cleanly.
Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
when any row's idempotent insert is still being retried (per-EventId
retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
SPLIT loop — by class-doc construction the catch could only mask real
failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
per-interval counters before sending, restore via new
ISiteHealthCollector.AddIntervalCounters on transport failure so counts
aren't silently lost.
Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
with a bounded SweepShutdownWaitTimeout (10s).
Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
own try/catch so a throw doesn't leak the relay actor or _activeStreams
entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
(auth-failure exit-code contract unified).
Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
DeleteTimedOut audit entries (mirrors DeployFailed pattern).
Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.
20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
Each finding is a focused validation guard or upper bound at a trust boundary.
Highlights:
- Commons-015: EncryptionMetadata ctor now validates Algorithm (AES-256-GCM
only), Kdf (PBKDF2-SHA256 only), Iterations ([100k, 10M]), non-null Salt/IV.
- Transport-004: new BundleUnlockRateLimiter (sliding-window, per-key,
singleton) wired into BundleImporter.LoadAsync; over-budget callers see
BundleUnlockRateLimitedException. Per-bundle 3-strike + per-window cap.
- ESG-022: ExternalSystemClient.InvokeHttpAsync allow-lists the documented
GET/POST/PUT/PATCH/DELETE set (case-insensitive); unknown verbs throw.
- SEL-015: SiteEventLogger queue now bounded (10k cap, DropOldest); dropped
events fault their Task and increment FailedWriteCount so the drop is
observable instead of an unbounded memory growth.
- SEL-017: EventLogQueryService clamps caller-supplied PageSize to a new
MaxQueryPageSize cap (default 500) so int.MaxValue can't OOM the host.
- SEL-020: LogEventAsync rejects severities outside {Info, Warning, Error}
(matches SQLite BINARY-collation query filter).
- InboundAPI-020: ContentType "json" check now case-insensitive
(application/JSON no longer slips through as not-json).
- InboundAPI-024: _knownBadMethods capped at 1000 entries (drops new entries
once full); per-request DB lookup remains the correctness path.
- SR-025: HandleSetStaticAttribute validates the attribute name against the
deployed config; unknown names now return Success=false instead of
leaking orphan override rows into the SQLite store.
- TE-021: MoveTemplateAsync runs the sibling-name-collision check at the
destination, mirroring TemplateFolderService.MoveFolderAsync.
- TE-022: LockEnforcer's once-locked-stays-locked rule now also covers
LockedInDerived (was previously only IsLocked).
New regression tests across 8 test projects (EncryptionMetadata, rate
limiter, ESG client allow-list, SEL bounded channel / PageSize clamp /
severity validation, InboundAPI ContentType + bad-methods cap, SiteRT
unknown-attribute, TemplateEngine MoveTemplate + LockedInDerived).
Build clean; affected suites all green. README regenerated: 93 open (was 104).
Note: a separate manual re-run was needed for the SiteEventLogging hunk
because its initial subagent's source edits never landed on disk despite
reporting success (file-collision-style failure mode).
UTC invariant + culture-safety fixes across UI form binding, audit entity
hydrate, and locale-dependent parses. Highlights:
- CentralUI-026/027: AuditFilterBar / SiteCallsReport / NotificationReport /
EventLogs now apply SpecifyKind(Local) + ToUniversalTime() at form submit
so browser-local datetime-local inputs aren't silently treated as UTC.
- Commons-019: AuditEvent.OccurredAtUtc / IngestedAtUtc init-setters
re-tag any incoming DateTime as Kind=Utc, documenting the invariant.
- CD-018: AuditLogEntityTypeConfiguration adds UTC ValueConverters on the
*Utc DateTime columns so EF hydrate yields Kind=Utc (SQL Server's
datetime2 has no Kind metadata, so reads were returning Unspecified).
- CD-020: GetPartitionBoundariesOlderThanAsync now SpecifyKind(Utc) on the
raw-ADO read, matching the existing defence in AuditLogPartitionMaintenance.
- SEL-021: EventLogQueryService.DateTimeOffset.Parse now uses
InvariantCulture + AssumeUniversal | AdjustToUniversal.
- SR-023: Convert.ToDouble in ScriptActor + AlarmActor (4 sites) now
passes InvariantCulture so non-US locales don't mis-parse string values.
- HM-020: CentralHealthAggregator.MarkHeartbeat anchors LastHeartbeatAt to
max(receivedAt, now) on offline→online so a stale receivedAt can't
leave a recovered site one tick from re-going-offline.
3 new tests added (AuditLog UTC converter, AuditFilterBar/EventLogs/
NotificationReport-touching CentralUI tests already cover Apply paths,
heartbeat offline→online). Build clean; ConfigurationDatabase 236,
Commons 330, HealthMonitoring 71, SiteRuntime 301, SiteEventLogging 50,
CentralUI 50 — all green. README regenerated: 104 open (was 112).
Transport-001: template Overwrite now diff-and-merges the bundle's
Attributes / Alarms / Scripts onto the target template via three private
helpers (SyncTemplateAttributesAsync / SyncTemplateAlarmsAsync /
SyncTemplateScriptsAsync). Each helper emits one audit row per detected
add / update / delete and feeds the post-merge state into the existing
ResolveAlarmScriptLinks and ResolveCompositionEdges passes.
Transport-002: external-system Overwrite now syncs the Methods collection
via a parallel SyncExternalSystemMethodsAsync helper mirroring the T-001
shape, with ExternalSystemMethodAdded / Updated / Deleted audit rows.
Both fixes are covered by new integration tests in BundleImporterApplyTests.
README regenerated — open findings dropped from 146 to 136; all 10 open
High findings are now closed (0 Critical, 0 High, 46 Medium, 90 Low
remaining).
Comm-016: delete dead HandleConnectionStateChanged + _debugSubscriptions /
_inProgressDeployments tracking + ConnectionStateChanged message record.
Disconnect detection is owned by the transport layers (gRPC keepalive PING
~25s; Ask-timeout at CommunicationService). Updates the
Component-Communication.md design doc to make that explicit.
SnF-018: NotificationForwarder.DeliverAsync now discards a corrupt buffered
payload (Warning log + return true) instead of returning false and parking
the row — honoring the design's "notifications do not park" invariant.
DM-018: reconciliation no longer force-sets Enabled, preserving an
intentional Disabled state after central failover.
ESG-018: DeliverBufferedAsync (both ExternalSystemClient + DatabaseGateway)
catches JsonException and returns false, turning a corrupt buffered row
into a parked operation instead of a retry-forever poison message.
InboundAPI-022: register ActiveNodeGate as IActiveNodeGate in the Central
DI branch so standby-node gating is actually wired up in production.
NS-019: remove orphaned NotificationDeliveryService /
INotificationDeliveryService / NotificationResult; central notification
delivery now lives entirely in NotificationOutbox.
SEL-016: normalise From/To filters to UTC before ISO-string compare so
non-UTC DateTimeOffset clients no longer get spuriously excluded events.
TE-017: include Description on attributes/alarms and a HashableConnections
projection (protocol, endpoint JSON, failover count) in the revision hash
and DiffService; staleness detection now catches description-only and
connection-endpoint edits.
Transport-001 and Transport-002 (also High) remain Open — they're being
handled in a follow-up batch because both touch BundleImporter.cs and
must serialise.
CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL
IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the
at-least-once livelock on the site→central notification handoff.
DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent
same-tag subscribes don't orphan an adapter handle; delete the latent
dead _subscriptionHandles dictionary; stop double-counting
_totalSubscribed when an unresolved tag is promoted via another instance;
release adapter handles on mid-flight unsubscribe; gate the
tag-resolution retry timer with IsTimerActive so subscribe bursts don't
reset it into starvation.
SR-020: add _terminatingActorsByName shadow so a third deploy arriving
during a pending redeploy doesn't crash on InvalidActorNameException —
displaced senders get a Failed/superseded response and the latest
command wins on Terminated.
SR-024: split OperationTrackingStore reads from writes (fresh
SqliteConnection per GetStatusAsync) so long writes don't block status
queries; rewrite Dispose to drop the sync-over-async bridge that could
deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the
dispose-once flag race-safe across both paths.
T-003: move the unlock lockout server-side. The 3-strike counter used to
live in the Razor page only — a second tab / CLI caller could re-upload
the same bytes and grind PBKDF2 indefinitely. The counter now lives in
IBundleSessionStore, keyed by ContentHash, so retries against identical
bundle bytes are throttled regardless of client. BundleLockedException
surfaces the new typed error path.
T-005: bind the manifest's non-derivative fields into AES-GCM AAD. A
SHA-256 of the manifest (with ContentHash + Encryption normalised to
sentinels) is now passed to AesGcm.Encrypt / .Decrypt, so a tampered
SourceEnvironment / ExportedBy / CreatedAtUtc on a stolen bundle yields
an authentication-tag mismatch instead of slipping past the Step-4
typo-resistant confirmation gate.
T-006: cap zip entry count, decompressed length, and compression ratio
in LoadAsync's envelope validator BEFORE any payload is decompressed,
using ZipArchiveEntry.Length / .CompressedLength. New TransportOptions
fields default to 4 entries / 200 MB / 50x ratio.
T-007: clear decrypted plaintext on the ApplyAsync failure path and zero
the buffer on success before removing the session, so a 100 MB
DecryptedContent doesn't sit in memory for the 30-min TTL after a failed
apply. A BundleSessionEvictionService BackgroundService now also drives
EvictExpired periodically so abandoned sessions clear without needing a
fresh Get() call to trigger lazy eviction.
Also resolves NO-010 — the misleading "writer never throws" XML doc was
the same code+comment my prior NO-004 await-the-writer fix already
rewrote.
NS-021/NO-001: thread FromAddress into XOAUTH2 so M365 stops rejecting
sends with 535 5.7.3. Added an additive oauth2UserName parameter on
ISmtpClientWrapper.AuthenticateAsync; both NotificationService and
NotificationOutbox now pass config.FromAddress.
NO-002: clamp non-positive SmtpConfiguration.MaxRetries/RetryDelay to the
1-min / 10-attempt fallback with a Warning so a misconfigured row no
longer parks transient failures on the first attempt or burn-loops.
NO-003: route a lifecycle-scoped CancellationToken from the
NotificationOutboxActor through the dispatch sweep into the adapter so
in-flight SMTP sends abort on PostStop instead of blocking
CoordinatedShutdown for the full SMTP timeout per row.
NO-004: await the central audit writer inside the existing try/catch
instead of fire-and-forget so the audit task can't outlive the per-sweep
DI scope and writer faults reach the operator log instead of being
silently dropped.
Two AuditLog integration tests seeded RetryDelay = TimeSpan.Zero to force
immediate re-claim on the second tick; updated them to 1 ms so they keep
the same intent without tripping the NO-002 clamp.
Resolves the auth-theme batch from the 2026-05-28 baseline review (8 findings
across Security/CentralUI/ManagementService/CLI). The most consequential gaps:
NotificationReport + SiteCallsReport now route through SiteScopeService so a
site-scoped Deployment user cannot see or act on other sites' rows (CUI-028);
QueryAuditLogCommand is no longer "any authenticated user" — gated Admin-only
to match /api/audit/query's strictness (MS-018); RoleMapper preserves the
broader grant when a user is in both an unscoped and scoped Deployment LDAP
group, instead of silently narrowing to the scoped set (Sec-016); and the
dead SiteScopeRequirement/Handler are deleted so SiteScopeService is
unambiguously the sole site-scoping mechanism (Sec-017). Pending findings:
172 → 164.
Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.
regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
A heartbeat-registered site that has never sent a full report now has
LastReportReceivedAt = null instead of the year-0001 sentinel. TimestampDisplay
accepts DateTimeOffset? and renders null as a placeholder ('awaiting first
report') rather than a ~2000-year-stale date. Cross-module: HealthMonitoring +
CentralUI.
Inbound-API bearer credentials are no longer persisted in plaintext. ApiKey now
holds a KeyHash (peppered HMAC-SHA256); the key is shown once at creation and
only its hash is stored. Lookup and validation hash the presented candidate.
Cross-module: Commons (ApiKey, ApiKeyHasher), ConfigurationDatabase (mapping +
HashApiKeyValue migration), InboundAPI (ApiKeyValidator), ManagementService
(key creation), CentralUI (ApiKeys.razor). Existing keys must be re-issued.