Commit Graph

14 Commits

Author SHA1 Message Date
Joseph Doherty 6ae0fea558 fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings
Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop — by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
2026-05-28 07:13:28 -04:00
Joseph Doherty 487859bff0 docs+code: close Theme 1 — 24 design-doc / XML-doc drift findings
Doc/XML-comment drift + small adherence fixes across 17 modules. Highlights:
- Host-017: site CoordinatedShutdown ordering — SiteStreamGrpcServer gains
  CancelAllStreams() (refuse new streams, cancel active), wired into
  Program.cs site branch via ApplicationStopping.
- InboundAPI-021: ParentExecutionId now travels on RouteToGet/SetAttributes
  symmetric with RouteToCallRequest; RouteHelper stamps from _parentExecutionId.
- ClusterInfra-012: ClusterOptionsValidator now requires both seed nodes.
- Comm-018: SiteCommunicationActor.HeartbeatMessage.IsActive derived from
  cluster leader check (was hardcoded true).
- DM-020: reconciliation audit row attributes the current user, not prior deployer.
- SEL-019: EventLogPurgeService early-exits on standby via active-node check.
- Plus comment/XML-doc accuracy fixes across AuditLog, ConfigurationDatabase,
  NotificationOutbox, SiteRuntime, SiteCallAudit; doc refreshes for Component-
  Commons / -ManagementService / -CLI / -ExternalSystemGateway / -HealthMonitoring
  / -Transport / -ConfigurationDatabase; CD-023 index-name doc alignment.

11 new regression tests (RouteHelper x4, SiteStreamGrpcServer x2,
ClusterOptionsValidator x1, SiteCommunicationActor x1, DeploymentService x1,
EventLogPurgeService x3). Build clean (0 warnings); InboundAPI/Communication/
Host suites all green. README regenerated: 112 open (was 136).
2026-05-28 06:28:31 -04:00
Joseph Doherty a768135237 fix(communication): resolve Communication-012..015 — endpoint-aware gRPC client cache, address-change recreation, correlation-id validation, node-flip tests 2026-05-17 03:18:17 -04:00
Joseph Doherty 31a6995d24 fix(communication): resolve Communication-004..008 — Resume supervision, gRPC option wiring, address-load logging, sync dispose, flap detection 2026-05-16 20:58:03 -04:00
Joseph Doherty 301e7fb854 fix(communication): resolve Communication-002/003 — gRPC reconnect stream cleanup and subscription map safety 2026-05-16 19:33:09 -04:00
Joseph Doherty 751248feb6 feat(alarms): HiLo trigger type with per-band level, hysteresis, messages, overrides
Adds a new HiLo alarm trigger type with four configurable setpoints
(LoLo / Lo / Hi / HiHi). Each setpoint carries an optional priority,
deadband (for hysteresis), and operator message. The site runtime emits
AlarmStateChanged with an AlarmLevel field so consumers can differentiate
warning vs critical bands.

Plumbing:
  - new AlarmLevel enum + AlarmStateChanged.Level/Message init properties
  - AlarmTriggerEditor (Blazor) gets a HiLo render with severity tinting
  - AlarmTriggerConfigCodec extracted from the editor for testability
  - sitestream.proto carries level + message over gRPC
  - SemanticValidator enforces numeric attribute, setpoint ordering,
    non-negative deadband
  - on-trigger scripts get an Alarm global (Name/Level/Priority/Message)
    so notification routing can branch by severity
  - per-instance InstanceAlarmOverride entity + EF migration + flattening
    step + CLI commands; HiLo overrides merge setpoint-by-setpoint, binary
    types whole-replace
  - DebugView shows a Level badge + per-band message tooltip
  - App.razor auto-reloads on permanent Blazor circuit failure
  - docker/regen-proto.sh automates the proto regen workflow (the linux/arm64
    protoc segfault means generated files are checked in for now)
2026-05-13 03:23:32 -04:00
Joseph Doherty 416a03b782 feat: complete gRPC streaming channel — site host, docker config, docs, integration tests
Switch site host to WebApplicationBuilder with Kestrel HTTP/2 gRPC server,
add GrpcPort/keepalive config, wire SiteStreamManager as ISiteStreamSubscriber,
expose gRPC ports in docker-compose, add site seed script, update all 10
requirement docs + CLAUDE.md + README.md for the new dual-transport architecture.
2026-03-21 12:38:33 -04:00
Joseph Doherty 3fe3c4161b test: add proto contract, cleanup verification, and regression guardrail tests 2026-03-21 12:36:27 -04:00
Joseph Doherty 2cd43b6992 feat: update DebugStreamBridgeActor to use gRPC for streaming events
After receiving the initial snapshot via ClusterClient, the bridge actor
now opens a gRPC server-streaming subscription via SiteStreamGrpcClient
for ongoing AttributeValueChanged/AlarmStateChanged events. Adds NodeA/
NodeB failover with max 3 retries, retry count reset on successful event,
and IWithTimers-based reconnect scheduling.

- DebugStreamBridgeActor: gRPC stream after snapshot, reconnect state machine
- DebugStreamService: inject SiteStreamGrpcClientFactory, resolve gRPC addresses
- ServiceCollectionExtensions: register SiteStreamGrpcClientFactory singleton
- SiteStreamGrpcClient: make SubscribeAsync/Unsubscribe virtual for testability
- SiteStreamGrpcClientFactory: make GetOrCreate virtual for testability
- New test suite: DebugStreamBridgeActorTests (8 tests)
2026-03-21 12:14:24 -04:00
Joseph Doherty 25a6022f7b feat: add SiteStreamGrpcClient and SiteStreamGrpcClientFactory
Per-site gRPC client for central-side streaming subscriptions to site
servers. SiteStreamGrpcClient manages server-streaming calls with
keepalive, converts proto events to domain types, and supports
cancellation via Unsubscribe. SiteStreamGrpcClientFactory caches one
client per site identifier.

Includes InternalsVisibleTo for test access to conversion helpers and
comprehensive unit tests for event mapping, quality/alarm-state
conversion, unsubscribe behavior, and factory caching.
2026-03-21 12:06:38 -04:00
Joseph Doherty 55a05914d0 feat: add SiteStreamGrpcServer with Channel<T> bridge and stream limits
- Define ISiteStreamSubscriber interface for decoupling from SiteRuntime
- Implement SiteStreamGrpcServer (inherits SiteStreamServiceBase) with:
  - Readiness gate (SetReady)
  - Max concurrent stream enforcement
  - Duplicate correlationId replacement (cancels previous stream)
  - StreamRelayActor creation per subscription
  - Bounded Channel<SiteStreamEvent> bridge (1000 capacity, drop-oldest)
  - Clean teardown: unsubscribe, stop actor, remove tracking entry
- Identity-safe cleanup using ConcurrentDictionary.TryRemove(KeyValuePair)
  to prevent replacement streams from being removed by predecessor cleanup
- 7 unit tests covering reject-not-ready, max-streams, duplicate cancel,
  cleanup-on-cancel, subscribe/remove lifecycle, event forwarding
2026-03-21 11:52:31 -04:00
Joseph Doherty d70bbbe739 feat: add StreamRelayActor bridging Akka events to gRPC proto channel 2026-03-21 11:48:04 -04:00
Joseph Doherty deb58e1f17 feat: add sitestream.proto definition and generated gRPC stubs
Proto3 definition with SiteStreamService (server streaming), Quality and
AlarmStateEnum enums with UNSPECIFIED=0, google.protobuf.Timestamp for
cross-platform timestamps. Pre-generated C# stubs checked in (no protoc
at build time). 10 roundtrip tests covering serialization, oneof
discrimination, and Timestamp<->DateTimeOffset conversion.
2026-03-21 11:41:01 -04:00
Joseph Doherty 826cfbee31 feat: add sitestream.proto definition and generated gRPC stubs
Define the SiteStreamService proto for real-time instance event
streaming (attribute value changes, alarm state changes) from site
nodes to central. Add pre-generated C# stubs following the existing
LmxProxy pattern, gRPC NuGet packages with FrameworkReference for
ASP.NET Core server types, and proto roundtrip tests.
2026-03-21 11:37:39 -04:00