fix(review): full code-review remediation — 5 High + Medium/Low across 16 modules

Remediation from the full per-module code review at 4307c381 (findings recorded
separately in code-reviews/).

Highs fixed:
- DeploymentManager-025/SiteRuntime-031: stop broadcasting notification lists + SMTP
  configs (incl. credentials) to sites; site purges already-persisted rows on apply
  (enforces the central-only delivery design; clears plaintext SMTP creds at rest).
- DataConnectionLayer-023: guard the native-alarm subscribe path against the
  mid-flight-unsubscribe adapter-feed leak (mirrors the DCL-021 tag-path fix).
- SiteEventLogging-024: normalize From/To query bounds to UTC (the -016 fix the
  audit trail claimed but never committed).
- KpiHistory-001: add an in-flight guard to the recorder sample tick.
- ScriptAnalysis-001: harden the trust analyzer's TPA-absent fallback (resolve
  forbidden anchors in the minimal reference set; warn on degraded mode) — anchors
  added to validation references only, never the compile gate.
(InboundAPI-026 left to the feat/ipsen-movein effort per owner decision.)

Medium/Low: DM-026 deterministic deploy-status tiebreaker; SR-027/028/029/030
native-alarm leak/phantom-active/delete-during-redeploy fixes; AL-013/014/016;
TE-024 (folder-mutation audit rows now persisted)/025; SF-025 gauge-provider
clear-on-stop; ESG-025/026; SEC-023/024/025; SCA-007/008/009; plus doc/test
accuracy COM-023/024, HOST-025/026, HM-024/025, NS-027/028.

Full-solution build 0 warnings; ~3560 tests across 18 touched suites green.
This commit is contained in:
Joseph Doherty
2026-06-20 17:55:12 -04:00
parent 4307c38117
commit fd618cf1dc
52 changed files with 2239 additions and 313 deletions
+23 -11
View File
@@ -82,14 +82,19 @@ Both central and site clusters. Each side has communication actors that handle m
The streaming protocol is defined in `sitestream.proto` (`src/ZB.MOM.WW.ScadaBridge.Communication/Protos/sitestream.proto`):
- **Service**: `SiteStreamService` with a single RPC `SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent)`.
- **Messages**: `InstanceStreamRequest` (correlation_id, instance_unique_name), `SiteStreamEvent` (correlation_id, oneof event: `AttributeValueUpdate`, `AlarmStateUpdate`).
- **Service**: `SiteStreamService` — hosted on each site node by `SiteStreamGrpcServer` — exposes five RPCs. One is the original real-time **server-streaming** subscription; the other four are **unary request/response** calls added by the Audit Log (#23) and Site Call Audit (#22) components. A unary call is request/response and is distinct from the command/control ClusterClient channel — gRPC on this service is no longer real-time-stream-only:
- `SubscribeInstance(InstanceStreamRequest) returns (stream SiteStreamEvent)` — the real-time debug stream (§6); the only server-streaming RPC.
- `IngestAuditEvents(AuditEventBatch) returns (IngestAck)` — central-side **ingest** receiving surface for Audit Log (#23) telemetry; routes the batch to the central `AuditLogIngestActor` proxy and returns the accepted `EventId`s. (The production *push* path is still ClusterClient via `ClusterClientSiteAuditClient`; this RPC is the gRPC-receiving counterpart.)
- `IngestCachedTelemetry(CachedTelemetryBatch) returns (IngestAck)` — ingest receiving surface for the combined cached-call telemetry packet (audit row + `SiteCalls` operational upsert written in one transaction).
- `PullAuditEvents(PullAuditEventsRequest) returns (PullAuditEventsResponse)` — central→site **reconciliation pull** for the Audit Log self-heal feed; the site serves `Pending`/`Forwarded` rows from its `ISiteAuditQueue`.
- `PullSiteCalls(PullSiteCallsRequest) returns (PullSiteCallsResponse)` — central→site reconciliation pull for the Site Call Audit (#22) self-heal feed; the site serves operation-tracking rows changed since a cursor from its `IOperationTrackingStore`. A separate RPC from `PullAuditEvents` because the tracking store is the operational source of truth, distinct from the site audit queue.
- **Messages**: `InstanceStreamRequest` (correlation_id, instance_unique_name), `SiteStreamEvent` (correlation_id, oneof event: `AttributeValueUpdate`, `AlarmStateUpdate`); `AuditEventDto`/`AuditEventBatch`/`IngestAck` for ingest; `CachedTelemetryPacket`/`CachedTelemetryBatch` (each packet pairing an `AuditEventDto` with a `SiteCallOperationalDto`); `PullAuditEventsRequest`/`PullAuditEventsResponse` and `PullSiteCallsRequest`/`PullSiteCallsResponse` (each request carries `since_utc` + `batch_size`; each response carries `more_available` to signal a saturated batch).
- The `oneof event` pattern is extensible — future event types (health metrics, connection state changes) are added as new fields without breaking existing consumers.
- Proto field numbers are never reused. Old clients ignore unknown `oneof` variants.
- Proto field numbers are never reused; new RPCs and message fields are appended additively. Old clients ignore unknown `oneof` variants.
#### Enriched AlarmStateUpdate (Native Alarm Mirror)
`AlarmStateUpdate` carries the read-only native alarm mirror (Computed, native OPC UA, and native MxAccess Gateway alarms) to central over the **existing gRPC real-time stream** — no new transport, no command/control round-trip. The message was extended **additively**: existing fields 17 are unchanged, and fields 821 carry the enriched native-alarm state. Old clients that only read fields 17 continue to work; new fields are populated only where the source provides them.
`AlarmStateUpdate` carries the read-only native alarm mirror (Computed, native OPC UA, and native MxAccess Gateway alarms) to central over the **existing gRPC real-time stream** — no new transport, no command/control round-trip. The message was extended **additively**: existing fields 17 are unchanged, and fields 823 carry the enriched native-alarm state. Old clients that only read fields 17 continue to work; new fields are populated only where the source provides them.
| Field | # | Type | Meaning |
|-------|---|------|---------|
@@ -107,11 +112,14 @@ The streaming protocol is defined in `sitestream.proto` (`src/ZB.MOM.WW.ScadaBri
| `original_raise_time` | 19 | Timestamp | First-raise time of the underlying condition (nullable on the wire). |
| `current_value` | 20 | string | Current process value associated with the alarm. |
| `limit_value` | 21 | string | Limit / setpoint value that the alarm evaluates against. |
| `native_source_canonical_name` | 22 | string | Native binding canonical name; empty for computed alarms. |
| `is_configured_placeholder` | 23 | bool | Marks a quiet-binding placeholder row. **Snapshot-only** — see the relay note below; on the live gRPC stream this is always `false`. |
- **Server-side mapping (`StreamRelayActor.HandleAlarmStateChanged`)**: maps the enriched domain `AlarmStateChanged` event — `Kind` + `AlarmConditionState` + native metadata — out to the proto `AlarmStateUpdate`. The nullable `original_raise_time` is emitted only when present, and `shelve_state` is mapped from the domain shelve enum to its wire string via a new **`AlarmShelveStateCodec`** (string↔enum, defaulting to `Unshelved`). The domain `Confirmed` (`bool?`) is collapsed to a definite bool for field 11.
- **Client-side mapping (`SiteStreamGrpcClient.ConvertToDomainEvent`)**: reconstructs the domain `AlarmStateChanged` from the proto — `Kind` is parsed via `ParseAlarmKind`, the `Condition` is rebuilt with `severity` taken from the existing wire `priority`, and native metadata is repopulated from fields 821 — so central-side consumers receive the same domain event the site emitted.
- **Placeholder rows are dropped at the relay**: `is_configured_placeholder` (field 23) is a **Debug View snapshot-only** concept emitted by `InstanceActor.BuildAlarmStatesSnapshot` for quiet bindings — it is never a real alarm transition (its timestamp may be `DateTimeOffset.MinValue`, the Protobuf `Timestamp` lower boundary). `StreamRelayActor.HandleAlarmStateChanged` therefore returns early — **never relaying a placeholder row to the live gRPC stream** — so field 23 is always `false` on the live stream and only ever carries `true` in the snapshot path.
- **Client-side mapping (`SiteStreamGrpcClient.ConvertToDomainEvent`)**: reconstructs the domain `AlarmStateChanged` from the proto — `Kind` is parsed via `ParseAlarmKind`, the `Condition` is rebuilt with `severity` taken from the existing wire `priority`, and native metadata is repopulated from fields 823 (`native_source_canonical_name``NativeSourceCanonicalName`, `is_configured_placeholder``IsConfiguredPlaceholder`) — so central-side consumers receive the same domain event the site emitted.
> **Regeneration is manual (macOS-only).** `sitestream.proto` is **not** auto-compiled: the `<Protobuf>` include is commented out in the `.csproj`, and the generated C# is **vendored** under `SiteStreamGrpc/`. To regenerate after editing the proto: toggle the `<Protobuf>` include on, build so `Grpc.Tools` regenerates the C#, copy the generated files into `SiteStreamGrpc/`, then re-comment the include. Adding fields 821 followed this process.
> **Regeneration is manual (macOS-only).** `sitestream.proto` is **not** auto-compiled: the `<Protobuf>` include is commented out in the `.csproj`, and the generated C# is **vendored** under `SiteStreamGrpc/`. To regenerate after editing the proto: toggle the `<Protobuf>` include on, build so `Grpc.Tools` regenerates the C#, copy the generated files into `SiteStreamGrpc/`, then re-comment the include. Adding `AlarmStateUpdate` fields 823 and the four unary RPCs (`IngestAuditEvents`, `IngestCachedTelemetry`, `PullAuditEvents`, `PullSiteCalls`) plus their message types followed this process.
#### gRPC Connection Keepalive
@@ -161,9 +169,9 @@ Keepalive settings are configurable via `CommunicationOptions`:
- **Pattern**: Fire-and-forget telemetry with a periodic reconciliation pull.
- The site **Store-and-Forward Engine** emits a `CachedCallTelemetry` message to central on **every** cached-call lifecycle transition (`Pending → Retrying → Delivered / Parked / Failed / Discarded`). The first telemetry event for an operation carries its initial status — `Pending` when a transient failure has buffered the call, or directly `Delivered`/`Failed` for a cached call that never buffers. The message carries the `TrackedOperationId`, source site, `Kind` (the `TrackedOperationKind` enum), target summary, status, retry count, last error, key timestamps, and source provenance.
- Emission is **best-effort and at-least-once**, **idempotent on `TrackedOperationId`** — central's Site Call Audit component ingests with insert-if-not-exists then upsert-on-newer-status, so a re-sent or out-of-order event is harmless.
- **Reconciliation pull**: because telemetry is best-effort, the central **Site Call Audit** component periodically — and on site reconnect — issues a `CachedCallReconcileRequest` to each site; the site replies with a `CachedCallReconcileResponse` carrying all tracking rows changed since a cursor. Any telemetry missed during a disconnect self-heals through this pull.
- **Reconciliation pull**: because telemetry is best-effort, the central **Site Call Audit** component periodically — and on site reconnect — pulls the changed rows back from each site over the **`PullSiteCalls` unary gRPC RPC** on `SiteStreamService` (not a ClusterClient round-trip). Central sends a `PullSiteCallsRequest` (`since_utc` cursor + `batch_size`); the site reads its `IOperationTrackingStore` and replies with a `PullSiteCallsResponse` carrying the matching operation-tracking rows (as `SiteCallOperationalDto`s) plus a `more_available` flag that signals a saturated batch so central advances the cursor and pulls again. Any telemetry missed during a disconnect self-heals through this pull. The Audit Log (#23) reconciliation feed uses the sibling `PullAuditEvents` RPC the same way.
- Central audit is an **eventually-consistent mirror** — the site's operation tracking table remains the source of truth for cached-call status (`Tracking.Status(id)` is always answered site-locally).
- **Transport**: ClusterClient (site→central command/control), consistent with how other site→central messages are sent.
- **Transport**: the *push* telemetry emission rides **ClusterClient** (site→central command/control), consistent with how other site→central messages are sent; the *reconciliation pull* rides the **gRPC** unary `PullSiteCalls` RPC (central→site request/response). The two paths are complementary — push is the fast, best-effort feed; pull is the slower self-heal backfill.
## Topology
@@ -251,7 +259,7 @@ Each request/response pattern has a default timeout that can be overridden in co
| 5. Recipe/Command Delivery | 30 seconds | Fire-and-forget with ack |
| 8. Remote Queries | 30 seconds | Querying parked messages or event logs |
| 9. Notification Submission | 30 seconds | Fire-and-forget with ack; central acks after persisting the row |
| 10. Cached Call Telemetry | 30 seconds | Reconciliation pull is request/response; telemetry emission itself is fire-and-forget |
| 10. Cached Call Telemetry | 30 seconds | Telemetry emission (ClusterClient) is fire-and-forget; the reconciliation pull is the unary gRPC `PullSiteCalls` request/response (its deadline is the gRPC call timeout, not the Akka ask) |
Timeouts use the Akka.NET **ask pattern**. If no response is received within the timeout, the caller receives a timeout failure.
@@ -302,6 +310,9 @@ Disconnect is detected at the **transport layer**, never via an application-leve
- **Cluster Infrastructure**: Manages node roles and failover detection.
- **Configuration Database**: Provides site node addresses (NodeAAddress, NodeBAddress for Akka remoting; GrpcNodeAAddress, GrpcNodeBAddress for gRPC streaming) for address resolution.
- **Site Runtime (SiteStreamManager)**: The SiteStreamGrpcServer subscribes to SiteStreamManager to receive real-time events for gRPC delivery.
- **`ISiteAuditQueue` (site-local)**: Handed to `SiteStreamGrpcServer` (post-construction, on site roles) so the `PullAuditEvents` RPC can read the site's `Pending`/`Forwarded` audit rows to serve the Audit Log (#23) reconciliation pull. Null when not wired (central-only host) — the handler then returns an empty response.
- **`IOperationTrackingStore` (site-local)**: Handed to `SiteStreamGrpcServer` (post-construction, on site roles) so the `PullSiteCalls` RPC can read operation-tracking rows changed since a cursor to serve the Site Call Audit (#22) reconciliation pull. Null when not wired — the handler returns an empty response.
- **`AuditLogIngestActor` proxy (central)**: Handed to `SiteStreamGrpcServer` after the central cluster singleton starts; the `IngestAuditEvents` / `IngestCachedTelemetry` RPCs route ingested batches to it. Null when not yet wired — the handler returns an empty `IngestAck` so the caller treats it as transient and retries.
## Interactions
@@ -309,7 +320,8 @@ Disconnect is detected at the **transport layer**, never via an application-leve
- **Site Runtime**: Receives deployments, lifecycle commands, and artifact updates. Provides debug view data.
- **Central UI**: Debug view requests and remote queries flow through communication.
- **Health Monitoring**: Receives periodic health reports from sites.
- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication. Also emits `CachedCallTelemetry` and answers `CachedCallReconcileRequest` pulls, and receives relayed `RetryParkedOperation` / `DiscardParkedOperation` commands.
- **Site Call Audit (central)**: Receives cached-call telemetry and reconciliation responses; issues reconciliation pulls and relays parked-operation Retry/Discard commands to sites through communication.
- **Store-and-Forward Engine (site)**: Parked message queries/commands are routed through communication. Also emits `CachedCallTelemetry` (push, ClusterClient) and serves the `PullSiteCalls` gRPC reconciliation pull from its `IOperationTrackingStore`, and receives relayed `RetryParkedOperation` / `DiscardParkedOperation` commands.
- **Site Call Audit (central)**: Receives cached-call telemetry and issues the `PullSiteCalls` gRPC reconciliation pulls to sites; relays parked-operation Retry/Discard commands to sites through communication.
- **Audit Log (#23)**: Sites forward audit-event telemetry (push) and serve the `PullAuditEvents` gRPC reconciliation pull from their `ISiteAuditQueue`; the central `AuditLogIngestActor` is the ingest target for both the push path and the combined cached-call telemetry packet.
- **Site Event Logging**: Event log queries are routed through communication.
- **Management Service**: The ManagementActor is registered with ClusterClientReceptionist on central nodes. The CLI communicates with the ManagementActor via ClusterClient, which is a separate channel from inter-cluster remoting.