Commit Graph

30 Commits

Author SHA1 Message Date
Joseph Doherty 55f46e7c92 perf: close Theme 6 — 11 allocation / N+1 / lock-contention findings
Well-localised perf fixes across 8 modules.

Lock decoupling / SQL streaming:
- AuditLog-005: SqliteAuditWriter gains dedicated read-only _readConnection
  (+ _readLock) backed by WAL journal mode. GetBacklogStatsAsync,
  ReadPendingAsync, ReadPendingSinceAsync, ReadForwardedAsync no longer
  contend with the hot-path INSERT lock — backlog probes on a 30s timer
  can't stall the writer under multi-hundred-K Pending backlog.
- SEL-022: dropped Cache=Shared from SiteEventLogger's default connection
  string (single-connection logger; mode was dormant config).

Memory / streaming:
- CLI-019: bundle export streams base64 in 1 MB-aligned chunks via
  Convert.TryFromBase64Chars straight into the FileStream — no more
  full-bundle byte[] allocation.
- CentralUI-031: TransportImport now stages the upload to a per-session
  temp file under Path.GetTempPath() (replaces in-memory byte[] field);
  page implements IDisposable to delete the temp file on reset / new
  upload / dispose. Per-circuit working set drops from ~100 MB to ~80 KB.

N+1 hoisting:
- Transport-008: added ITemplateEngineRepository.GetTemplatesWithChildrenAsync
  bulk method; BundleImporter.PreviewAsync calls it once instead of per-
  template-name. Single query with .Include(...).AsSplitQuery().
- DM-023: BuildDeployArtifactsCommandAsync's per-site loop now references
  a pre-fetched GlobalArtifactSnapshot (shared scripts, external systems,
  DB connections, notification lists, SMTP) instead of re-querying per site.
- MgmtSvc-023: HandleQueryDeployments unfiltered branch uses one
  GetAllInstancesAsync bulk load + Dictionary<int,int?> lookup (was a
  GetInstanceByIdAsync per record).

Small allocations / per-tick rebuilds:
- InboundAPI-019: AuditWriteMiddleware gates EnableBuffering() on
  RequestHasBody() so GET/HEAD/DELETE/TRACE/OPTIONS and Content-Length:0
  requests skip the FileBufferingReadStream allocation.
- NotifOutbox-006: ResolveAdapters dictionary now cached on
  _adaptersCache (built lazily on first sweep) + actor-lifetime
  _adaptersScope; ResolveAdapters no longer rebuilds per dispatch tick.

Verify-only:
- Comm-017: Confirmed _inProgressDeployments was deleted by Comm-016 in
  commit ac96b83 — marked Resolved with that attribution. No code change.

Doc-correction:
- NS-022: Updated MailKitSmtpClientWrapper XML doc to spell out single-
  connection / per-delivery-factory contract (option (b) — transient
  client per Send — rejected because it re-handshakes TLS per email).

10+ new regression tests across 8 test projects. Build clean; affected
suites all green. README regenerated: 54 open (was 65).
2026-05-28 07:47:24 -04:00
Joseph Doherty 6ae0fea558 fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings
Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop — by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
2026-05-28 07:13:28 -04:00
Joseph Doherty 819f1b4665 fix(validation): close Theme 3 — 11 input-validation / unbounded-input findings
Each finding is a focused validation guard or upper bound at a trust boundary.
Highlights:
- Commons-015: EncryptionMetadata ctor now validates Algorithm (AES-256-GCM
  only), Kdf (PBKDF2-SHA256 only), Iterations ([100k, 10M]), non-null Salt/IV.
- Transport-004: new BundleUnlockRateLimiter (sliding-window, per-key,
  singleton) wired into BundleImporter.LoadAsync; over-budget callers see
  BundleUnlockRateLimitedException. Per-bundle 3-strike + per-window cap.
- ESG-022: ExternalSystemClient.InvokeHttpAsync allow-lists the documented
  GET/POST/PUT/PATCH/DELETE set (case-insensitive); unknown verbs throw.
- SEL-015: SiteEventLogger queue now bounded (10k cap, DropOldest); dropped
  events fault their Task and increment FailedWriteCount so the drop is
  observable instead of an unbounded memory growth.
- SEL-017: EventLogQueryService clamps caller-supplied PageSize to a new
  MaxQueryPageSize cap (default 500) so int.MaxValue can't OOM the host.
- SEL-020: LogEventAsync rejects severities outside {Info, Warning, Error}
  (matches SQLite BINARY-collation query filter).
- InboundAPI-020: ContentType "json" check now case-insensitive
  (application/JSON no longer slips through as not-json).
- InboundAPI-024: _knownBadMethods capped at 1000 entries (drops new entries
  once full); per-request DB lookup remains the correctness path.
- SR-025: HandleSetStaticAttribute validates the attribute name against the
  deployed config; unknown names now return Success=false instead of
  leaking orphan override rows into the SQLite store.
- TE-021: MoveTemplateAsync runs the sibling-name-collision check at the
  destination, mirroring TemplateFolderService.MoveFolderAsync.
- TE-022: LockEnforcer's once-locked-stays-locked rule now also covers
  LockedInDerived (was previously only IsLocked).

New regression tests across 8 test projects (EncryptionMetadata, rate
limiter, ESG client allow-list, SEL bounded channel / PageSize clamp /
severity validation, InboundAPI ContentType + bad-methods cap, SiteRT
unknown-attribute, TemplateEngine MoveTemplate + LockedInDerived).
Build clean; affected suites all green. README regenerated: 93 open (was 104).

Note: a separate manual re-run was needed for the SiteEventLogging hunk
because its initial subagent's source edits never landed on disk despite
reporting success (file-collision-style failure mode).
2026-05-28 06:58:25 -04:00
Joseph Doherty 487859bff0 docs+code: close Theme 1 — 24 design-doc / XML-doc drift findings
Doc/XML-comment drift + small adherence fixes across 17 modules. Highlights:
- Host-017: site CoordinatedShutdown ordering — SiteStreamGrpcServer gains
  CancelAllStreams() (refuse new streams, cancel active), wired into
  Program.cs site branch via ApplicationStopping.
- InboundAPI-021: ParentExecutionId now travels on RouteToGet/SetAttributes
  symmetric with RouteToCallRequest; RouteHelper stamps from _parentExecutionId.
- ClusterInfra-012: ClusterOptionsValidator now requires both seed nodes.
- Comm-018: SiteCommunicationActor.HeartbeatMessage.IsActive derived from
  cluster leader check (was hardcoded true).
- DM-020: reconciliation audit row attributes the current user, not prior deployer.
- SEL-019: EventLogPurgeService early-exits on standby via active-node check.
- Plus comment/XML-doc accuracy fixes across AuditLog, ConfigurationDatabase,
  NotificationOutbox, SiteRuntime, SiteCallAudit; doc refreshes for Component-
  Commons / -ManagementService / -CLI / -ExternalSystemGateway / -HealthMonitoring
  / -Transport / -ConfigurationDatabase; CD-023 index-name doc alignment.

11 new regression tests (RouteHelper x4, SiteStreamGrpcServer x2,
ClusterOptionsValidator x1, SiteCommunicationActor x1, DeploymentService x1,
EventLogPurgeService x3). Build clean (0 warnings); InboundAPI/Communication/
Host suites all green. README regenerated: 112 open (was 136).
2026-05-28 06:28:31 -04:00
Joseph Doherty 1eb6e972b0 docs: add XML doc comments across src + Sister Projects section in CLAUDE.md
Bulk CommentChecker pass: fills in <param>/<inheritdoc> tags on public
APIs across all 23 src/ projects so the doc-coverage gate is green. Also
adds a Sister Projects section to CLAUDE.md pointing at the MxAccess
Gateway and OtOpcUa sibling repos, and gitignores local credential
captures (*login*.txt) and the wonder-app-vd03 deploy/ artifacts.
2026-05-28 01:55:24 -04:00
Joseph Doherty e6ccee1a16 refactor(inboundapi): pool the request audit buffer + reset Position in finally 2026-05-23 09:46:53 -04:00
Joseph Doherty 7d87994ac0 feat(inboundapi): bound audit capture at InboundMaxBytes (memory safety)
AuditWriteMiddleware previously buffered the FULL request and response
bodies into memory and only let DefaultAuditPayloadFilter trim them
after persistence. A 500 MiB upload allocated 500 MiB of MemoryStream
plus 1 GiB of UTF-16 string transiently before the filter pulled it
back to the 1 MiB inbound ceiling — the cap was real on the persisted
row but not at the capture site.

Inject IOptionsMonitor<AuditLogOptions> and read InboundMaxBytes
per-request (same convention as DefaultAuditPayloadFilter so a live
config change picks up the next request). The request reader now pulls
at most cap + 1 bytes into a UTF-8 byte-safe-truncated string and
rewinds the stream so the endpoint handler still sees the full body.
The response wrap is a new CapturedResponseStream that forwards every
Write / WriteAsync to the real sink (the client still receives all
bytes) while capturing at most cap + 1 bytes for the audit copy. The
middleware now sets PayloadTruncated itself when either body hit the
cap; the filter still OR's its own determination on top.

Adds a project reference from ScadaLink.InboundAPI to
ScadaLink.AuditLog so AuditLogOptions resolves. AuditLog does NOT
reference InboundAPI back, so no cycle is introduced.

Tests:
 - All 21 existing AuditWriteMiddlewareTests still pass (the helper
   gains an optional AuditLogOptions argument; default is the standard
   1 MiB ceiling so existing small-body tests are unaffected).
 - MiddlewareOrderTests' construction site updated for the new ctor
   arg; a StaticAuditLogOptionsMonitor file-local double mirrors the
   InboundChannelCapTests pattern.
 - New RequestBody_AboveInboundMaxBytes_TruncatedToCap_PayloadTruncatedTrue
   pins a 4 KiB cap against a 20 KB body: audit copy <= 4 KiB,
   PayloadTruncated = true, downstream handler reads the full 20 KB.
 - New ResponseBody_AboveInboundMaxBytes_TruncatedToCap_ClientStillReceivesAllBytes_PayloadTruncatedTrue
   pins the same shape on the response side: client sink receives
   20 KB, audit copy <= 4 KiB, PayloadTruncated = true.

InboundAPI test count: 133 -> 135.
2026-05-23 09:25:00 -04:00
Joseph Doherty a8d2e13d4e feat(inboundapi): AuditWriteMiddleware captures response body on ApiInbound audit rows 2026-05-23 06:00:24 -04:00
Joseph Doherty dc2c73b07d refactor(inboundapi): fail-fast on missing inbound ExecutionId stash 2026-05-21 17:26:49 -04:00
Joseph Doherty d8453bfba2 feat(inboundapi): mint inbound ExecutionId early, carry it as RouteToCallRequest.ParentExecutionId 2026-05-21 17:22:13 -04:00
Joseph Doherty cfd8f1ecf4 feat(auditlog): inbound audit rows carry ExecutionId 2026-05-21 15:44:17 -04:00
Joseph Doherty aadb1fd72a refactor(auditlog): rename audit correlation field, add cross-helper tests 2026-05-21 13:57:17 -04:00
Joseph Doherty 8243f61e96 feat(auditlog): per-script-execution correlation id on sync audit rows 2026-05-21 13:46:34 -04:00
Joseph Doherty 1c862989b4 feat(inbound): register AuditWriteMiddleware in pipeline (#23 M4) 2026-05-20 16:35:13 -04:00
Joseph Doherty 3c3f7770c1 feat(inbound): AuditWriteMiddleware emitting InboundRequest/InboundAuthFailure (#23 M4) 2026-05-20 16:35:03 -04:00
Joseph Doherty 7da303d7bb fix(configuration-database): resolve ConfigurationDatabase-012 — store inbound-API keys as HMAC-SHA256 hashes
Inbound-API bearer credentials are no longer persisted in plaintext. ApiKey now
holds a KeyHash (peppered HMAC-SHA256); the key is shown once at creation and
only its hash is stored. Lookup and validation hash the presented candidate.
Cross-module: Commons (ApiKey, ApiKeyHasher), ConfigurationDatabase (mapping +
HashApiKeyValue migration), InboundAPI (ApiKeyValidator), ManagementService
(key creation), CentralUI (ApiKeys.razor). Existing keys must be re-issued.
2026-05-17 05:42:52 -04:00
Joseph Doherty 73a393076a fix(inbound-api): resolve InboundAPI-014..017 — return-value validation, reflection-gateway hardening, deadline-bound routed calls, RouteHelper test coverage 2026-05-17 03:18:33 -04:00
Joseph Doherty 8dd74121c3 fix(inbound-api): resolve InboundAPI-012 — move ParameterDefinition POCO to ScadaLink.Commons (Types/InboundApi) 2026-05-17 00:04:56 -04:00
Joseph Doherty 858fe24add fix(inbound-api): resolve InboundAPI-009,010,011,013 — cache failed compiles, reject unknown body fields, close enumeration oracle, drop misnamed factory; InboundAPI-007,012 flagged 2026-05-16 22:24:03 -04:00
Joseph Doherty da955042aa fix(inbound-api): resolve InboundAPI-002,004,006,008 — disconnect vs timeout, body size limit, active-node gate; surface InboundAPI-007 2026-05-16 21:22:01 -04:00
Joseph Doherty 6f4efdfa2e fix(inbound-api): resolve InboundAPI-001/003/005 — concurrent handler cache, constant-time API key compare, script trust-model enforcement 2026-05-16 19:47:17 -04:00
Joseph Doherty 9c60592632 build: adopt NuGet Central Package Management
Move all package versions into Directory.Packages.props so every project
resolves a single consistent version. Consolidates the Roslyn packages
(Microsoft.CodeAnalysis.CSharp.Scripting/Workspaces) onto 5.0.0, which
resolves the pre-existing NU1608 version-skew error in the test projects.
2026-05-16 15:56:30 -04:00
Joseph Doherty 295150751f feat(scripts): realign Test Run with runtime API, add anonymous-object calls and instance binding
The Test Run sandbox and Monaco analysis modelled a script API that had
drifted from the site runtime's ScriptGlobals, so real scripts failed to
compile in Test Run. Realign both to the runtime surface
(Instance/Scripts/ExternalSystem/Attributes/Children/Parent) and drop the
duplicate ScriptHost stub so the two cannot diverge again.

- Script calls (Scripts.CallShared, Instance.CallScript, Route.To().Call)
  accept an anonymous object instead of a hand-built dictionary, via a
  shared ScriptArgs normalizer; existing dictionary calls still compile.
- Test Run can optionally bind to a deployed instance, so Instance/
  Attributes/CallScript route to it cross-site; adds site-side
  RouteToGetAttributes/RouteToSetAttributes handlers.
- Adds Test Run panels to the API method and template script editors.
- Fixes the TestDatabaseQuery seed script, which queried a table that
  never existed.

Also commits unrelated in-progress work already in the tree: the health
monitoring report loop, site streaming changes, and the Admin/Design
data-connection and SMTP page reorganization.
2026-05-16 03:37:56 -04:00
Joseph Doherty 161dc406ed feat(scripts): add typed Parameters.Get<T>() helpers for script API
Replace raw dictionary casting with ScriptParameters wrapper that provides
Get<T>, Get<T?>, Get<T[]>, and Get<List<T>> with clear error messages,
numeric conversion, and JsonElement support for Inbound API parameters.
2026-03-22 15:47:18 -04:00
Joseph Doherty da683d4fe9 fix: lazy-compile API method scripts and prefix composed alarm trigger attributes
- InboundScriptExecutor lazy-compiles scripts on first request, solving
  the multi-node problem where methods created via CLI/UI were only compiled
  on the ManagementActor's node, not the node handling the HTTP request.
- ManagementActor hot-registers API method scripts on create/update/delete
  for the local node.
- FlatteningService prefixes the "attribute" field in composed alarm trigger
  configs with the composition instance name so alarms evaluate against the
  correct path-qualified attribute (e.g. CoolingTank.Level not Level).
2026-03-18 09:30:12 -04:00
Joseph Doherty 78fbb13df7 feat: wire Inbound API Route.To().Call() to site instance scripts and add Roslyn compilation
Completes the Inbound API → site script call chain by adding RouteToCallRequest
handlers in SiteCommunicationActor and DeploymentManagerActor. Also replaces the
placeholder dispatch table in InboundScriptExecutor with Roslyn compilation of
API method scripts at startup, enabling user-defined inbound API methods to call
instance scripts across the cluster.
2026-03-18 08:43:13 -04:00
Joseph Doherty b659978764 Phase 8: Production readiness — failover tests, security hardening, sandboxing, deployment docs
- WP-1-3: Central/site failover + dual-node recovery tests (17 tests)
- WP-4: Performance testing framework for target scale (7 tests)
- WP-5: Security hardening (LDAPS, JWT key length, no secrets in logs) (11 tests)
- WP-6: Script sandboxing adversarial tests (28 tests, all forbidden APIs)
- WP-7: Recovery drill test scaffolds (5 tests)
- WP-8: Observability validation (structured logs, correlation IDs, metrics) (6 tests)
- WP-9: Message contract compatibility (forward/backward compat) (18 tests)
- WP-10: Deployment packaging (installation guide, production checklist, topology)
- WP-11: Operational runbooks (failover, troubleshooting, maintenance)
92 new tests, all passing. Zero warnings.
2026-03-16 22:12:31 -04:00
Joseph Doherty 8c2091dc0a Phase 0 WP-0.10–0.12: Host skeleton, options classes, sample configs, and execution framework
- WP-0.10: Role-based Host startup (Central=WebApplication, Site=generic Host),
  15 component AddXxx() extension methods, MapCentralUI/MapInboundAPI stubs
- WP-0.11: 12 per-component options classes with config binding
- WP-0.12: Sample appsettings for central and site topologies
- Add execution procedure and checklist template to generate_plans.md
- Add phase-0-checklist.md for execution tracking
- Resolve all 21 open questions from plan generation
- Update IDataConnection with batch ops and IAsyncDisposable
57 tests pass, zero warnings.
2026-03-16 18:59:07 -04:00
Joseph Doherty fed5f5a82c Add .gitignore and remove tracked build artifacts (bin/obj) 2026-03-16 18:38:00 -04:00
Joseph Doherty 34190e1347 Phase 0 WP-0.1: Create .NET 10 solution structure with all 17 component projects
17 source projects (Commons + Host + 15 components) and 17 xUnit test projects.
SLNX format, net10.0, nullable enabled, warnings as errors. All components
reference Commons; Host references all components. Builds and tests clean.
2026-03-16 18:37:36 -04:00