Commit Graph

36 Commits

Author SHA1 Message Date
Joseph Doherty 819f1b4665 fix(validation): close Theme 3 — 11 input-validation / unbounded-input findings
Each finding is a focused validation guard or upper bound at a trust boundary.
Highlights:
- Commons-015: EncryptionMetadata ctor now validates Algorithm (AES-256-GCM
  only), Kdf (PBKDF2-SHA256 only), Iterations ([100k, 10M]), non-null Salt/IV.
- Transport-004: new BundleUnlockRateLimiter (sliding-window, per-key,
  singleton) wired into BundleImporter.LoadAsync; over-budget callers see
  BundleUnlockRateLimitedException. Per-bundle 3-strike + per-window cap.
- ESG-022: ExternalSystemClient.InvokeHttpAsync allow-lists the documented
  GET/POST/PUT/PATCH/DELETE set (case-insensitive); unknown verbs throw.
- SEL-015: SiteEventLogger queue now bounded (10k cap, DropOldest); dropped
  events fault their Task and increment FailedWriteCount so the drop is
  observable instead of an unbounded memory growth.
- SEL-017: EventLogQueryService clamps caller-supplied PageSize to a new
  MaxQueryPageSize cap (default 500) so int.MaxValue can't OOM the host.
- SEL-020: LogEventAsync rejects severities outside {Info, Warning, Error}
  (matches SQLite BINARY-collation query filter).
- InboundAPI-020: ContentType "json" check now case-insensitive
  (application/JSON no longer slips through as not-json).
- InboundAPI-024: _knownBadMethods capped at 1000 entries (drops new entries
  once full); per-request DB lookup remains the correctness path.
- SR-025: HandleSetStaticAttribute validates the attribute name against the
  deployed config; unknown names now return Success=false instead of
  leaking orphan override rows into the SQLite store.
- TE-021: MoveTemplateAsync runs the sibling-name-collision check at the
  destination, mirroring TemplateFolderService.MoveFolderAsync.
- TE-022: LockEnforcer's once-locked-stays-locked rule now also covers
  LockedInDerived (was previously only IsLocked).

New regression tests across 8 test projects (EncryptionMetadata, rate
limiter, ESG client allow-list, SEL bounded channel / PageSize clamp /
severity validation, InboundAPI ContentType + bad-methods cap, SiteRT
unknown-attribute, TemplateEngine MoveTemplate + LockedInDerived).
Build clean; affected suites all green. README regenerated: 93 open (was 104).

Note: a separate manual re-run was needed for the SiteEventLogging hunk
because its initial subagent's source edits never landed on disk despite
reporting success (file-collision-style failure mode).
2026-05-28 06:58:25 -04:00
Joseph Doherty 344379a40a fix(utc/locale): close Theme 2 — 8 UTC / time / locale findings
UTC invariant + culture-safety fixes across UI form binding, audit entity
hydrate, and locale-dependent parses. Highlights:
- CentralUI-026/027: AuditFilterBar / SiteCallsReport / NotificationReport /
  EventLogs now apply SpecifyKind(Local) + ToUniversalTime() at form submit
  so browser-local datetime-local inputs aren't silently treated as UTC.
- Commons-019: AuditEvent.OccurredAtUtc / IngestedAtUtc init-setters
  re-tag any incoming DateTime as Kind=Utc, documenting the invariant.
- CD-018: AuditLogEntityTypeConfiguration adds UTC ValueConverters on the
  *Utc DateTime columns so EF hydrate yields Kind=Utc (SQL Server's
  datetime2 has no Kind metadata, so reads were returning Unspecified).
- CD-020: GetPartitionBoundariesOlderThanAsync now SpecifyKind(Utc) on the
  raw-ADO read, matching the existing defence in AuditLogPartitionMaintenance.
- SEL-021: EventLogQueryService.DateTimeOffset.Parse now uses
  InvariantCulture + AssumeUniversal | AdjustToUniversal.
- SR-023: Convert.ToDouble in ScriptActor + AlarmActor (4 sites) now
  passes InvariantCulture so non-US locales don't mis-parse string values.
- HM-020: CentralHealthAggregator.MarkHeartbeat anchors LastHeartbeatAt to
  max(receivedAt, now) on offline→online so a stale receivedAt can't
  leave a recovered site one tick from re-going-offline.

3 new tests added (AuditLog UTC converter, AuditFilterBar/EventLogs/
NotificationReport-touching CentralUI tests already cover Apply paths,
heartbeat offline→online). Build clean; ConfigurationDatabase 236,
Commons 330, HealthMonitoring 71, SiteRuntime 301, SiteEventLogging 50,
CentralUI 50 — all green. README regenerated: 104 open (was 112).
2026-05-28 06:36:44 -04:00
Joseph Doherty 487859bff0 docs+code: close Theme 1 — 24 design-doc / XML-doc drift findings
Doc/XML-comment drift + small adherence fixes across 17 modules. Highlights:
- Host-017: site CoordinatedShutdown ordering — SiteStreamGrpcServer gains
  CancelAllStreams() (refuse new streams, cancel active), wired into
  Program.cs site branch via ApplicationStopping.
- InboundAPI-021: ParentExecutionId now travels on RouteToGet/SetAttributes
  symmetric with RouteToCallRequest; RouteHelper stamps from _parentExecutionId.
- ClusterInfra-012: ClusterOptionsValidator now requires both seed nodes.
- Comm-018: SiteCommunicationActor.HeartbeatMessage.IsActive derived from
  cluster leader check (was hardcoded true).
- DM-020: reconciliation audit row attributes the current user, not prior deployer.
- SEL-019: EventLogPurgeService early-exits on standby via active-node check.
- Plus comment/XML-doc accuracy fixes across AuditLog, ConfigurationDatabase,
  NotificationOutbox, SiteRuntime, SiteCallAudit; doc refreshes for Component-
  Commons / -ManagementService / -CLI / -ExternalSystemGateway / -HealthMonitoring
  / -Transport / -ConfigurationDatabase; CD-023 index-name doc alignment.

11 new regression tests (RouteHelper x4, SiteStreamGrpcServer x2,
ClusterOptionsValidator x1, SiteCommunicationActor x1, DeploymentService x1,
EventLogPurgeService x3). Build clean (0 warnings); InboundAPI/Communication/
Host suites all green. README regenerated: 112 open (was 136).
2026-05-28 06:28:31 -04:00
Joseph Doherty e3ca9af1be fix(transport): Overwrite resolution now syncs child collections (2 findings)
Transport-001: template Overwrite now diff-and-merges the bundle's
Attributes / Alarms / Scripts onto the target template via three private
helpers (SyncTemplateAttributesAsync / SyncTemplateAlarmsAsync /
SyncTemplateScriptsAsync). Each helper emits one audit row per detected
add / update / delete and feeds the post-merge state into the existing
ResolveAlarmScriptLinks and ResolveCompositionEdges passes.

Transport-002: external-system Overwrite now syncs the Methods collection
via a parallel SyncExternalSystemMethodsAsync helper mirroring the T-001
shape, with ExternalSystemMethodAdded / Updated / Deleted audit rows.

Both fixes are covered by new integration tests in BundleImporterApplyTests.
README regenerated — open findings dropped from 146 to 136; all 10 open
High findings are now closed (0 Critical, 0 High, 46 Medium, 90 Low
remaining).
2026-05-28 05:54:03 -04:00
Joseph Doherty f936f55f51 fix(concurrency): close 8 race / thread-safety findings across CD, DCL, SR
CD-015: rewrite NotificationOutboxRepository.InsertIfNotExistsAsync as raw-SQL
IF NOT EXISTS … INSERT with SqlException 2601/2627 catch, ending the
at-least-once livelock on the site→central notification handoff.

DCL-018/019/020/021/022: add _subscribesInFlight guard so concurrent
same-tag subscribes don't orphan an adapter handle; delete the latent
dead _subscriptionHandles dictionary; stop double-counting
_totalSubscribed when an unresolved tag is promoted via another instance;
release adapter handles on mid-flight unsubscribe; gate the
tag-resolution retry timer with IsTimerActive so subscribe bursts don't
reset it into starvation.

SR-020: add _terminatingActorsByName shadow so a third deploy arriving
during a pending redeploy doesn't crash on InvalidActorNameException —
displaced senders get a Failed/superseded response and the latest
command wins on Terminated.

SR-024: split OperationTrackingStore reads from writes (fresh
SqliteConnection per GetStatusAsync) so long writes don't block status
queries; rewrite Dispose to drop the sync-over-async bridge that could
deadlock on a non-reentrant SyncContext; Interlocked.Exchange makes the
dispose-once flag race-safe across both paths.
2026-05-28 05:20:13 -04:00
Joseph Doherty 5d2386cc9d fix(transport): close bundle security + plaintext-retention gaps (4 findings)
T-003: move the unlock lockout server-side. The 3-strike counter used to
live in the Razor page only — a second tab / CLI caller could re-upload
the same bytes and grind PBKDF2 indefinitely. The counter now lives in
IBundleSessionStore, keyed by ContentHash, so retries against identical
bundle bytes are throttled regardless of client. BundleLockedException
surfaces the new typed error path.

T-005: bind the manifest's non-derivative fields into AES-GCM AAD. A
SHA-256 of the manifest (with ContentHash + Encryption normalised to
sentinels) is now passed to AesGcm.Encrypt / .Decrypt, so a tampered
SourceEnvironment / ExportedBy / CreatedAtUtc on a stolen bundle yields
an authentication-tag mismatch instead of slipping past the Step-4
typo-resistant confirmation gate.

T-006: cap zip entry count, decompressed length, and compression ratio
in LoadAsync's envelope validator BEFORE any payload is decompressed,
using ZipArchiveEntry.Length / .CompressedLength. New TransportOptions
fields default to 4 entries / 200 MB / 50x ratio.

T-007: clear decrypted plaintext on the ApplyAsync failure path and zero
the buffer on success before removing the session, so a 100 MB
DecryptedContent doesn't sit in memory for the 30-min TTL after a failed
apply. A BundleSessionEvictionService BackgroundService now also drives
EvictExpired periodically so abandoned sessions clear without needing a
fresh Get() call to trigger lazy eviction.

Also resolves NO-010 — the misleading "writer never throws" XML doc was
the same code+comment my prior NO-004 await-the-writer fix already
rewrote.
2026-05-28 04:14:07 -04:00
Joseph Doherty 291274ae76 fix(notifications): close OAuth2 SMTP + dispatcher resilience gaps (5 findings)
NS-021/NO-001: thread FromAddress into XOAUTH2 so M365 stops rejecting
sends with 535 5.7.3. Added an additive oauth2UserName parameter on
ISmtpClientWrapper.AuthenticateAsync; both NotificationService and
NotificationOutbox now pass config.FromAddress.

NO-002: clamp non-positive SmtpConfiguration.MaxRetries/RetryDelay to the
1-min / 10-attempt fallback with a Warning so a misconfigured row no
longer parks transient failures on the first attempt or burn-loops.

NO-003: route a lifecycle-scoped CancellationToken from the
NotificationOutboxActor through the dispatch sweep into the adapter so
in-flight SMTP sends abort on PostStop instead of blocking
CoordinatedShutdown for the full SMTP timeout per row.

NO-004: await the central audit writer inside the existing try/catch
instead of fire-and-forget so the audit task can't outlive the per-sweep
DI scope and writer faults reach the operator log instead of being
silently dropped.

Two AuditLog integration tests seeded RetryDelay = TimeSpan.Zero to force
immediate re-claim on the second tick; updated them to 1 ms so they keep
the same intent without tripping the NO-002 clamp.
2026-05-28 03:54:43 -04:00
Joseph Doherty e536178323 fix(security): close auth & site-scoping gaps across 8 findings
Resolves the auth-theme batch from the 2026-05-28 baseline review (8 findings
across Security/CentralUI/ManagementService/CLI). The most consequential gaps:
NotificationReport + SiteCallsReport now route through SiteScopeService so a
site-scoped Deployment user cannot see or act on other sites' rows (CUI-028);
QueryAuditLogCommand is no longer "any authenticated user" — gated Admin-only
to match /api/audit/query's strictness (MS-018); RoleMapper preserves the
broader grant when a user is in both an unscoped and scoped Deployment LDAP
group, instead of silently narrowing to the scoped set (Sec-016); and the
dead SiteScopeRequirement/Handler are deleted so SiteScopeService is
unambiguously the sole site-scoping mechanism (Sec-017). Pending findings:
172 → 164.
2026-05-28 03:35:29 -04:00
Joseph Doherty f93b7b99bb code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97
Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
Joseph Doherty 722773f2b5 docs(code-reviews): regenerate index — all 66 re-review findings resolved 2026-05-17 05:43:08 -04:00
Joseph Doherty f23513c30b docs(code-reviews): regenerate index after resolving 64 of 66 re-review findings 2026-05-17 03:18:47 -04:00
Joseph Doherty 0ba4e49e11 docs(code-reviews): re-review batch 4 at 39d737e — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine
11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.
2026-05-17 00:51:58 -04:00
Joseph Doherty 3b3760f026 docs(code-reviews): re-review batch 3 at 39d737e — Host, InboundAPI, ManagementService, NotificationService, Security
21 new findings: Host-012..015, InboundAPI-014..017, ManagementService-014..017, NotificationService-014..018, Security-012..015.
2026-05-17 00:48:25 -04:00
Joseph Doherty 89636e2bbf docs(code-reviews): re-review batch 2 at 39d737e — ConfigurationDatabase, DataConnectionLayer, DeploymentManager, ExternalSystemGateway, HealthMonitoring
17 new findings: ConfigurationDatabase-012..014, DataConnectionLayer-014..017, DeploymentManager-015..017, ExternalSystemGateway-015..017, HealthMonitoring-013..016.
2026-05-17 00:45:10 -04:00
Joseph Doherty e49846603e docs(code-reviews): re-review batch 1 at 39d737e — CentralUI, CLI, ClusterInfrastructure, Commons, Communication
17 new findings: CentralUI-020..025, CLI-014..016, ClusterInfrastructure-009..010, Commons-013..014, Communication-012..015.
2026-05-17 00:41:21 -04:00
Joseph Doherty 39d737ebd6 docs(code-reviews): regenerate index — all low/medium findings resolved 2026-05-17 00:04:56 -04:00
Joseph Doherty 13a33a6c78 docs(code-reviews): regenerate index after batch 4 low/medium fixes 2026-05-16 22:32:31 -04:00
Joseph Doherty a3d359fff7 docs(code-reviews): regenerate index after batch 3 low/medium fixes 2026-05-16 22:24:03 -04:00
Joseph Doherty 3f19371017 docs(code-reviews): regenerate index after batch 2 low/medium fixes 2026-05-16 22:14:46 -04:00
Joseph Doherty 25a05af05d docs(code-reviews): regenerate index after batch 1 low/medium fixes 2026-05-16 22:04:44 -04:00
Joseph Doherty bc88a36435 docs(code-reviews): regenerate index after batch 4 medium fixes 2026-05-16 21:44:11 -04:00
Joseph Doherty 49fb85e92e docs(code-reviews): regenerate index after batch 3 medium fixes 2026-05-16 21:22:01 -04:00
Joseph Doherty 016bdf9c3c docs(code-reviews): regenerate index after batch 2 medium fixes 2026-05-16 21:11:24 -04:00
Joseph Doherty 8fc04d43c2 docs(code-reviews): regenerate index after batch 1 medium fixes; fix CentralUI-014 severity field format 2026-05-16 20:58:29 -04:00
Joseph Doherty 658b659c0c docs(code-reviews): regenerate index — all High findings resolved or re-triaged 2026-05-16 20:12:24 -04:00
Joseph Doherty 2ba5d5d578 docs(code-reviews): regenerate index after batch 4 High fixes; normalize re-triaged SF-002 severity field 2026-05-16 19:57:54 -04:00
Joseph Doherty 1ae11d1135 docs(code-reviews): regenerate index after batch 3 High fixes; fix regen-readme.py to parse the Won't Fix status 2026-05-16 19:48:17 -04:00
Joseph Doherty d30ded7e72 docs(code-reviews): regenerate index after batch 2 High fixes 2026-05-16 19:40:40 -04:00
Joseph Doherty d7630d80fe docs(code-reviews): regenerate index after batch 1 High fixes 2026-05-16 19:33:11 -04:00
Joseph Doherty d8f99ba781 docs(code-reviews): add regen-readme.py to generate the review index
README.md is now generated from the per-module findings.md files by
code-reviews/regen-readme.py (discovers modules, parses each finding's
severity/status, rebuilds the Pending Findings and Module Status tables).
Run with --check to fail when README.md is stale (CI-friendly).

REVIEW-PROCESS.md section 5 now points to the script instead of describing
a manual edit, and README.md carries a generated-file banner.
2026-05-16 19:18:18 -04:00
Joseph Doherty 91438dcc1b fix(store-and-forward): create the SQLite database directory on init (StoreAndForward-014)
StoreAndForwardStorage.InitializeAsync opened a SqliteConnection against the
configured SqliteDbPath (default ./data/store-and-forward.db) without ensuring
the parent directory exists. SQLite creates the database file but not its
directory, so when data/ was absent the connection failed with
"SQLite Error 14: unable to open database file" — aborting the site host's
RegisterSiteActors at StoreAndForwardService.StartAsync.

This was the root cause of the six failing SiteActorPathTests. Production
masked it because the Docker image / deployment creates data/.

InitializeAsync now calls EnsureDatabaseDirectoryExists, which parses the
connection string and creates the parent directory of a file-backed database
(in-memory databases and bare filenames are skipped).

Regression test InitializeAsync_FileInMissingDirectory_CreatesDirectory fails
against the pre-fix code. Host suite now 155/155 green (was 149/155).
2026-05-16 19:13:00 -04:00
Joseph Doherty 61253e3269 fix(store-and-forward): resolve S&F delivery + replication wiring (3 Critical findings)
Resolves StoreAndForward-001, ExternalSystemGateway-001, NotificationService-001
— one systemic gap where buffered messages were persisted but never delivered,
and the active node never replicated its buffer to the standby.

Delivery handlers (ExternalSystemGateway-001 / NotificationService-001):
- AkkaHostedService registers delivery handlers for the ExternalSystem,
  CachedDbWrite and Notification categories after StoreAndForwardService starts;
  each resolves its scoped consumer in a fresh DI scope.
- ExternalSystemClient, DatabaseGateway and NotificationDeliveryService each
  gain a DeliverBufferedAsync method: re-resolve the target and re-attempt
  delivery, returning true/false/throwing per the transient-vs-permanent contract.
- EnqueueAsync gains an attemptImmediateDelivery flag; CachedCallAsync and
  NotificationDeliveryService.SendAsync pass false (they already attempted
  delivery themselves) so registering a handler does not dispatch twice.

Replication (StoreAndForward-001):
- ReplicationService is injected into StoreAndForwardService; a new BufferAsync
  helper replicates every enqueue, and successful-retry removes and parks are
  replicated too. Fire-and-forget, no-op when replication is disabled.

Tests: StoreAndForwardReplicationTests (Add/Remove/Park observed),
attemptImmediateDelivery behaviour, and DeliverBufferedAsync paths for each
consumer. Full solution builds; StoreAndForward/ExternalSystemGateway/
NotificationService suites green.
2026-05-16 18:58:11 -04:00
Joseph Doherty a9bd7ee37c fix(central-ui): resolve CentralUI-001 — enforce script trust model before sandbox execution
ScriptAnalysisService.RunInSandboxAsync compiled and executed arbitrary
user C# in the central host process with no trust-model enforcement — the
forbidden-API set was only a Monaco editor diagnostic. A Design-role user
could run System.IO/Process/Reflection/network code on the central node.

Added a Roslyn semantic gate (EnforceTrustModel) invoked after compilation
and before script.RunAsync, and on nested shared scripts in callSharedFunc;
a script referencing any forbidden API is rejected before it runs.

Reworked FindForbiddenApiUsages: it now resolves every identifier against
the semantic model and checks types and members, so a fully-qualified call
(System.IO.File.WriteAllText) is caught — the pre-fix check only inspected
the leftmost identifier and missed that shape. This is a static semantic
gate, not a process sandbox.

Adds gate regression tests that fail against the pre-fix code, plus a
clean-script test guarding against over-blocking.
2026-05-16 18:41:12 -04:00
Joseph Doherty a9ceba00d0 fix(communication): resolve Communication-001 — early stream termination handling
DebugStreamService.StartStreamAsync awaited the initial debug snapshot inside
a try whose only handler was catch (OperationCanceledException). When the
stream terminated before the snapshot arrived, onTerminatedWrapper completed
the await with an InvalidOperationException that escaped the catch — the
caller got a raw, untranslated exception and the service did no teardown of
its own on that path.

Replaced with catch (Exception): it removes the session entry, sends
StopDebugStream to the bridge actor via the local reference (deterministic
teardown, idempotent), and throws a descriptive exception — TimeoutException
for the 30s timeout, otherwise an InvalidOperationException naming the
instance/site and wrapping the cause.

Re-triaged Critical -> Medium: the originally-claimed multi-minute site-side
resource leak does not occur (the bridge actor self-terminates on every
onTerminated path). Adds the first DebugStreamService test, which fails
against the pre-fix code.
2026-05-16 18:32:52 -04:00
Joseph Doherty 239bee3bc4 fix(data-connection): resolve DataConnectionLayer-001 — off-thread actor state mutation
HandleSubscribe spawned a Task.Run that mutated DataConnectionActor private
state (_subscriptionIds, _subscriptionsByInstance, _totalSubscribed,
_resolvedTags, _unresolvedTags) from a thread-pool thread, racing the actor's
own message loop — a data race on non-thread-safe Dictionary/HashSet and
non-atomic counters.

Restructured HandleSubscribe to follow the actor's existing PipeTo(Self)
pattern: the background task now performs only adapter I/O and pipes a
SubscribeCompleted message to Self; all subscription-state mutation happens
in the new HandleSubscribeCompleted handler on the actor thread (wired into
the Connected, Connecting and Reconnecting states).

Adds DCL001_ConcurrentSubscribes_DoNotCorruptSubscriptionCounters (30x30
concurrent subscribes) which fails against the pre-fix code and passes after.
2026-05-16 18:26:43 -04:00
Joseph Doherty 977d7369a7 docs: add code review process and baseline review of all 19 modules
Establishes a per-module code review workflow under code-reviews/ and
records the 2026-05-16 baseline review (commit 9c60592): 241 findings
across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low).
This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00