docs(plans): design — Hosts per-driver rows + AbCip nested-struct + Galaxy hygiene

This commit is contained in:
Joseph Doherty
2026-06-18 11:26:09 -04:00
parent f59680fa48
commit fec0891584
@@ -0,0 +1,166 @@
# Hosts per-driver rows + AbCip nested-struct + Galaxy hygiene — Design
**Date:** 2026-06-18
**Branch:** `feat/hosts-rows-abcip-nested-hygiene` (off master `f59680fa`)
**Backlog items:** `stillpending.md` §A #8 (Hosts per-driver rows), #6 (AbCip nested-struct), #3/#13 + #10 (Galaxy stale-comment hygiene + reconcile)
A three-component backlog phase. One live-provable AdminUI feature, one offline-provable driver
enhancement, one doc/comment hygiene sweep. The three touch disjoint projects (AdminUI / Driver.AbCip /
Driver.Galaxy + docs), so they are independently implementable.
The triggering follow-up ("cursor-based paging WITHIN a single timestamp / oversized tie clusters") is
**stale** — it shipped at `2e6c6d3a` (`HistoryPaging.SliceTieCluster`, wired in `OtOpcUaNodeManager.cs:1928`,
tested in `HistoryPagingTests.cs`). This phase handles the "+ backlog" half.
---
## Standing constraints (in force)
- **NO Commons wire/proto contract change**, NO interface/Core.Abstractions contract change, NO EF
migration, NO bUnit (Razor proven only by live `/run`).
- Stage by explicit path, never `git add .`; never stage the never-stage files (`sql_login.txt`,
`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `stillpending.md`, `docker-dev/docker-compose.yml`).
- No force-push, no `--no-verify`. Finish = merge to master + push.
- `dangerouslyDisableSandbox: true` for all build/test/rig commands.
---
## Component A — `/hosts` per-driver-instance status (backlog #8)
### The constraint that shapes the architecture
`DriverHealthChanged` (the `driver-health` DPS message, `Commons/Messages/Drivers/DriverHealthChanged.cs`)
carries `ClusterId` + `DriverInstanceId` + health, but **no hosting-node identity**. The snapshot store
(`InMemoryDriverStatusSnapshotStore`) keys by `DriverInstanceId` alone, so in a redundancy pair the two
nodes' health snapshots overwrite each other. `ClusterNode.NodeId` is a *logical* id (`"LINE3-OPCUA-A"`),
not the Akka member address — the runtime keys "this node" by that logical id
(`DriverHostActor.cs:428` filters `s.NodeId == _localNode.Value`).
Consequences:
- True per-Akka-**member** driver health would need a hosting-node field on `DriverHealthChanged` (a
Commons change — forbidden) plus a node-keyed store, OR a per-node `DriverHostActor`-children Ask (new
cluster actor messaging). Both are out of scope here.
- The **reliable join key both sides share is `ClusterId`.**
### Approach (no Commons change)
- **Reuse the existing health pipeline unchanged** (`DriverStatusSignalRBridge` → snapshot store).
- **Add one AdminUI-internal store method:** `IReadOnlyCollection<DriverHealthChanged> GetAll()` on
`IDriverStatusSnapshotStore`, implemented in `InMemoryDriverStatusSnapshotStore` by snapshotting the
backing `ConcurrentDictionary` values. (AdminUI-internal interface — **not** Commons.)
- **New "Driver Instances" section on `/hosts`**, grouped by `ClusterId`:
- Each cluster-group header lists that cluster's nodes from ConfigDB `ClusterNode` (logical NodeId + Host).
- The body lists that cluster's live driver snapshots, enriched with `Name` + `DriverType` by joining
ConfigDB `DriverInstance` on `DriverInstanceId`. Each row: a live status chip (the same state→color
mapping `DriverStatusPanel` uses), last successful read, last error, 5-min error count.
- Live-refreshed by subscribing to the store's `SnapshotChanged` event (in-process — dodges the Traefik
self-hub trap, same pattern as `DriverStatusPanel`).
- **Keep the existing Akka member topology rows as-is** — they are authoritative for Akka-level
Up/Unreachable/Leader, which the DB does not know. Two sections, each authoritative for its own data.
### Testing
Extract the grouping/enrichment into a **pure view-model builder**:
`HostsDriverView.Build(snapshots, clusterNodes, driverInstances) -> IReadOnlyList<ClusterDriverGroup>`.
xUnit + Shouldly: group-by-cluster, Name/DriverType enrichment, unknown-driver (snapshot with no matching
`DriverInstance`) fallback, empty inputs. The Razor is a thin shell over the builder — proven only by live
`/run` (no bUnit), per established pattern.
### Live `/run` (Component A)
Rebuild **both** central-1 AND central-2 (the `:9200` AdminUI is Traefik round-robined across both — a
half-deploy round-robins old/new code). Deploy a Modbus driver (Galaxy too if convenient), open
`http://localhost:9200/hosts`, confirm the Driver Instances section lists the deployed drivers grouped by
cluster with live state, then drive a Reconnect from a driver page and confirm the chip updates.
### Deferred follow-up (noted, not built)
True per-Akka-member nesting needs node identity on the health feed (Commons field) or a per-node
`DriverHostActor`-children Ask. Out of scope under the no-Commons constraint.
---
## Component B — AbCip nested-struct member expansion (backlog #6)
### Finding (simpler than the backlog framed)
The backlog said the CIP Template Object member block "carries no nested template id." It actually **does**:
the per-member info `u16` uses the *same encoding as the Symbol Object* — bit 15 = struct flag, lower 12
bits = template instance id for a struct member (the codebase's own comment in `CipTemplateObjectDecoder`
documents this). The prior member-expansion fix (`4e141402`/`4a7b0fde`) **discarded** that id and recursed
with `templateInstanceId: null` (`AbCipDriver.cs:~1179`), leaving a test-only `SeedDiscoveredUdtShapeForTest`
seam. So the close is to stop discarding it — **no new CIP query needed**.
### Approach (driver-internal only)
- Carry `uint? NestedTemplateId` on the `AbCipUdtMember` record (driver-internal — not Commons).
- `CipTemplateObjectDecoder` sets it from `info & 0x0FFF` when the struct flag (bit 15) is set; `null` for
scalar members.
- The driver threads `member.NestedTemplateId` into the existing `FetchUdtShapeAsync` (recursion site
`AbCipDriver.cs:~1179`) instead of `null`. Reuses `IAbCipTemplateReader.ReadAsync("@udt/{id}")` — zero new
CIP primitives.
### Testing (fully offline)
Via the injectable `IAbCipTemplateReader` / `FakeTemplateReader`: feed a Template Object blob with a struct
member whose info = `0x8000 | nestedId`; assert (1) the decoder extracts `NestedTemplateId == nestedId`,
(2) the driver fetches the nested shape (`@udt/{nestedId}`) so nested members become addressable. Add a
scalar-member case asserting `NestedTemplateId == null`. Live-verify stays infra-gated (no nested-UDT
ControlLogix rig) — the unit test pins the real risk (decode + threading).
### Risk + fallback
Load-bearing assumption: struct-member lower-12 = template id. The codebase comment asserts it and the test
pins it. If it proved false on real controller bytes, the fallback is the backlog's original per-member
controller query — but the direct path is the confident choice.
---
## Component C — Backlog hygiene + reconcile (backlog #3/#13 + #10)
### Galaxy stale comments (comment-only — verify each claim against current code first)
Rewrite 5 sites that reference an unshipped "PR 4.W", a never-added `Galaxy:Backend` switch, and the
**retired** "legacy-host" backend (PR 7.2 retired Galaxy.Host/Proxy/Shared):
- `GalaxyDriver.cs:~52``IGalaxyDataReader` "until then... legacy-host backend handles reads in
production" → reads ARE supported in production now (standard driver, Phase A shipped).
- `GalaxyDriver.cs:~92``_ownedMxSession` "PR 4.W — production runtime owned by InitializeAsync" →
confirm it's built in `InitializeAsync` now, relabel to present tense.
- `GalaxyDriver.cs:~669``DiscoverAsync` "PR 4.W — also refresh the per-platform probe watcher's
membership after discovery" → confirm wired, drop the forward-ref.
- `GalaxyDriverFactoryExtensions.cs:~19` — "PR 4.W will add a server-side `Galaxy:Backend` switch" → never
landed and won't (legacy "Galaxy" proxy type retired; only `GalaxyMxGateway`). Replace with the shipped
reality.
- `HostStatusAggregator.cs:~21` — "re-raises OnHostStatusChanged as the driver-level event (wired in PR
4.W)" → confirm wired, drop the forward-ref.
Each rewrite must verify the claim against the current code before asserting it (e.g. confirm
`_ownedMxSession` really is built in `InitializeAsync`); convert to accurate present-tense or a real TODO.
### Reconcile
- `stillpending.md` (never-staged): mark item #10 SHIPPED (the `ctx`-receiver guard is in
`ScriptAnalysisService.cs:224`, `70e1bde9`); record the #6/#8 outcomes; tidy.
- Update the two memory files (`MEMORY.md`, `project_stillpending_backlog.md`).
---
## Task slicing (independent → parallelizable)
| Task | Component | Project | Class | Parallel with |
|---|---|---|---|---|
| T1 | AbCip nested-struct (decoder + record + driver threading + tests) | Driver.AbCip | standard | T2, T4 |
| T2 | Hosts store `GetAll()` + pure grouping view-model + tests | AdminUI | standard | T1, T4 |
| T3 | Hosts Razor "Driver Instances" section (uses T2) | AdminUI | small | none (after T2) |
| T4 | Galaxy stale-comment rewrites | Driver.Galaxy | trivial | T1, T2 |
| T5 | Reconcile stillpending #10/#6/#8 + memory | docs (never-staged) | trivial | none |
| T6 | Build + driver/AdminUI tests + live `/run` + finish (merge+push) | — | small | none |
Parallel implementers will use **worktree isolation** (or serial commits) per the shared-tree git-race
lesson. T3 depends on T2; T5/T6 run at the end.
## Done =
Build clean + `dotnet test` green (Driver.AbCip + AdminUI + Driver.Galaxy) + Component A live `/run` pass +
merged to master + pushed.