diff --git a/docs/plans/2026-06-18-hosts-rows-abcip-nested-hygiene-design.md b/docs/plans/2026-06-18-hosts-rows-abcip-nested-hygiene-design.md new file mode 100644 index 00000000..cca008d6 --- /dev/null +++ b/docs/plans/2026-06-18-hosts-rows-abcip-nested-hygiene-design.md @@ -0,0 +1,166 @@ +# Hosts per-driver rows + AbCip nested-struct + Galaxy hygiene — Design + +**Date:** 2026-06-18 +**Branch:** `feat/hosts-rows-abcip-nested-hygiene` (off master `f59680fa`) +**Backlog items:** `stillpending.md` §A #8 (Hosts per-driver rows), #6 (AbCip nested-struct), #3/#13 + #10 (Galaxy stale-comment hygiene + reconcile) + +A three-component backlog phase. One live-provable AdminUI feature, one offline-provable driver +enhancement, one doc/comment hygiene sweep. The three touch disjoint projects (AdminUI / Driver.AbCip / +Driver.Galaxy + docs), so they are independently implementable. + +The triggering follow-up ("cursor-based paging WITHIN a single timestamp / oversized tie clusters") is +**stale** — it shipped at `2e6c6d3a` (`HistoryPaging.SliceTieCluster`, wired in `OtOpcUaNodeManager.cs:1928`, +tested in `HistoryPagingTests.cs`). This phase handles the "+ backlog" half. + +--- + +## Standing constraints (in force) + +- **NO Commons wire/proto contract change**, NO interface/Core.Abstractions contract change, NO EF + migration, NO bUnit (Razor proven only by live `/run`). +- Stage by explicit path, never `git add .`; never stage the never-stage files (`sql_login.txt`, + `src/Server/.../Host/pki/`, `pending.md`, `current.md`, `stillpending.md`, `docker-dev/docker-compose.yml`). +- No force-push, no `--no-verify`. Finish = merge to master + push. +- `dangerouslyDisableSandbox: true` for all build/test/rig commands. + +--- + +## Component A — `/hosts` per-driver-instance status (backlog #8) + +### The constraint that shapes the architecture + +`DriverHealthChanged` (the `driver-health` DPS message, `Commons/Messages/Drivers/DriverHealthChanged.cs`) +carries `ClusterId` + `DriverInstanceId` + health, but **no hosting-node identity**. The snapshot store +(`InMemoryDriverStatusSnapshotStore`) keys by `DriverInstanceId` alone, so in a redundancy pair the two +nodes' health snapshots overwrite each other. `ClusterNode.NodeId` is a *logical* id (`"LINE3-OPCUA-A"`), +not the Akka member address — the runtime keys "this node" by that logical id +(`DriverHostActor.cs:428` filters `s.NodeId == _localNode.Value`). + +Consequences: +- True per-Akka-**member** driver health would need a hosting-node field on `DriverHealthChanged` (a + Commons change — forbidden) plus a node-keyed store, OR a per-node `DriverHostActor`-children Ask (new + cluster actor messaging). Both are out of scope here. +- The **reliable join key both sides share is `ClusterId`.** + +### Approach (no Commons change) + +- **Reuse the existing health pipeline unchanged** (`DriverStatusSignalRBridge` → snapshot store). +- **Add one AdminUI-internal store method:** `IReadOnlyCollection GetAll()` on + `IDriverStatusSnapshotStore`, implemented in `InMemoryDriverStatusSnapshotStore` by snapshotting the + backing `ConcurrentDictionary` values. (AdminUI-internal interface — **not** Commons.) +- **New "Driver Instances" section on `/hosts`**, grouped by `ClusterId`: + - Each cluster-group header lists that cluster's nodes from ConfigDB `ClusterNode` (logical NodeId + Host). + - The body lists that cluster's live driver snapshots, enriched with `Name` + `DriverType` by joining + ConfigDB `DriverInstance` on `DriverInstanceId`. Each row: a live status chip (the same state→color + mapping `DriverStatusPanel` uses), last successful read, last error, 5-min error count. + - Live-refreshed by subscribing to the store's `SnapshotChanged` event (in-process — dodges the Traefik + self-hub trap, same pattern as `DriverStatusPanel`). +- **Keep the existing Akka member topology rows as-is** — they are authoritative for Akka-level + Up/Unreachable/Leader, which the DB does not know. Two sections, each authoritative for its own data. + +### Testing + +Extract the grouping/enrichment into a **pure view-model builder**: +`HostsDriverView.Build(snapshots, clusterNodes, driverInstances) -> IReadOnlyList`. +xUnit + Shouldly: group-by-cluster, Name/DriverType enrichment, unknown-driver (snapshot with no matching +`DriverInstance`) fallback, empty inputs. The Razor is a thin shell over the builder — proven only by live +`/run` (no bUnit), per established pattern. + +### Live `/run` (Component A) + +Rebuild **both** central-1 AND central-2 (the `:9200` AdminUI is Traefik round-robined across both — a +half-deploy round-robins old/new code). Deploy a Modbus driver (Galaxy too if convenient), open +`http://localhost:9200/hosts`, confirm the Driver Instances section lists the deployed drivers grouped by +cluster with live state, then drive a Reconnect from a driver page and confirm the chip updates. + +### Deferred follow-up (noted, not built) + +True per-Akka-member nesting needs node identity on the health feed (Commons field) or a per-node +`DriverHostActor`-children Ask. Out of scope under the no-Commons constraint. + +--- + +## Component B — AbCip nested-struct member expansion (backlog #6) + +### Finding (simpler than the backlog framed) + +The backlog said the CIP Template Object member block "carries no nested template id." It actually **does**: +the per-member info `u16` uses the *same encoding as the Symbol Object* — bit 15 = struct flag, lower 12 +bits = template instance id for a struct member (the codebase's own comment in `CipTemplateObjectDecoder` +documents this). The prior member-expansion fix (`4e141402`/`4a7b0fde`) **discarded** that id and recursed +with `templateInstanceId: null` (`AbCipDriver.cs:~1179`), leaving a test-only `SeedDiscoveredUdtShapeForTest` +seam. So the close is to stop discarding it — **no new CIP query needed**. + +### Approach (driver-internal only) + +- Carry `uint? NestedTemplateId` on the `AbCipUdtMember` record (driver-internal — not Commons). +- `CipTemplateObjectDecoder` sets it from `info & 0x0FFF` when the struct flag (bit 15) is set; `null` for + scalar members. +- The driver threads `member.NestedTemplateId` into the existing `FetchUdtShapeAsync` (recursion site + `AbCipDriver.cs:~1179`) instead of `null`. Reuses `IAbCipTemplateReader.ReadAsync("@udt/{id}")` — zero new + CIP primitives. + +### Testing (fully offline) + +Via the injectable `IAbCipTemplateReader` / `FakeTemplateReader`: feed a Template Object blob with a struct +member whose info = `0x8000 | nestedId`; assert (1) the decoder extracts `NestedTemplateId == nestedId`, +(2) the driver fetches the nested shape (`@udt/{nestedId}`) so nested members become addressable. Add a +scalar-member case asserting `NestedTemplateId == null`. Live-verify stays infra-gated (no nested-UDT +ControlLogix rig) — the unit test pins the real risk (decode + threading). + +### Risk + fallback + +Load-bearing assumption: struct-member lower-12 = template id. The codebase comment asserts it and the test +pins it. If it proved false on real controller bytes, the fallback is the backlog's original per-member +controller query — but the direct path is the confident choice. + +--- + +## Component C — Backlog hygiene + reconcile (backlog #3/#13 + #10) + +### Galaxy stale comments (comment-only — verify each claim against current code first) + +Rewrite 5 sites that reference an unshipped "PR 4.W", a never-added `Galaxy:Backend` switch, and the +**retired** "legacy-host" backend (PR 7.2 retired Galaxy.Host/Proxy/Shared): + +- `GalaxyDriver.cs:~52` — `IGalaxyDataReader` "until then... legacy-host backend handles reads in + production" → reads ARE supported in production now (standard driver, Phase A shipped). +- `GalaxyDriver.cs:~92` — `_ownedMxSession` "PR 4.W — production runtime owned by InitializeAsync" → + confirm it's built in `InitializeAsync` now, relabel to present tense. +- `GalaxyDriver.cs:~669` — `DiscoverAsync` "PR 4.W — also refresh the per-platform probe watcher's + membership after discovery" → confirm wired, drop the forward-ref. +- `GalaxyDriverFactoryExtensions.cs:~19` — "PR 4.W will add a server-side `Galaxy:Backend` switch" → never + landed and won't (legacy "Galaxy" proxy type retired; only `GalaxyMxGateway`). Replace with the shipped + reality. +- `HostStatusAggregator.cs:~21` — "re-raises OnHostStatusChanged as the driver-level event (wired in PR + 4.W)" → confirm wired, drop the forward-ref. + +Each rewrite must verify the claim against the current code before asserting it (e.g. confirm +`_ownedMxSession` really is built in `InitializeAsync`); convert to accurate present-tense or a real TODO. + +### Reconcile + +- `stillpending.md` (never-staged): mark item #10 SHIPPED (the `ctx`-receiver guard is in + `ScriptAnalysisService.cs:224`, `70e1bde9`); record the #6/#8 outcomes; tidy. +- Update the two memory files (`MEMORY.md`, `project_stillpending_backlog.md`). + +--- + +## Task slicing (independent → parallelizable) + +| Task | Component | Project | Class | Parallel with | +|---|---|---|---|---| +| T1 | AbCip nested-struct (decoder + record + driver threading + tests) | Driver.AbCip | standard | T2, T4 | +| T2 | Hosts store `GetAll()` + pure grouping view-model + tests | AdminUI | standard | T1, T4 | +| T3 | Hosts Razor "Driver Instances" section (uses T2) | AdminUI | small | none (after T2) | +| T4 | Galaxy stale-comment rewrites | Driver.Galaxy | trivial | T1, T2 | +| T5 | Reconcile stillpending #10/#6/#8 + memory | docs (never-staged) | trivial | none | +| T6 | Build + driver/AdminUI tests + live `/run` + finish (merge+push) | — | small | none | + +Parallel implementers will use **worktree isolation** (or serial commits) per the shared-tree git-race +lesson. T3 depends on T2; T5/T6 run at the end. + +## Done = + +Build clean + `dotnet test` green (Driver.AbCip + AdminUI + Driver.Galaxy) + Component A live `/run` pass + +merged to master + pushed.