Files
lmxopcua/docs/plans/2026-06-18-hosts-rows-abcip-nested-hygiene-design.md
T

9.4 KiB

Hosts per-driver rows + AbCip nested-struct + Galaxy hygiene — Design

Date: 2026-06-18 Branch: feat/hosts-rows-abcip-nested-hygiene (off master f59680fa) Backlog items: stillpending.md §A #8 (Hosts per-driver rows), #6 (AbCip nested-struct), #3/#13 + #10 (Galaxy stale-comment hygiene + reconcile)

A three-component backlog phase. One live-provable AdminUI feature, one offline-provable driver enhancement, one doc/comment hygiene sweep. The three touch disjoint projects (AdminUI / Driver.AbCip / Driver.Galaxy + docs), so they are independently implementable.

The triggering follow-up ("cursor-based paging WITHIN a single timestamp / oversized tie clusters") is stale — it shipped at 2e6c6d3a (HistoryPaging.SliceTieCluster, wired in OtOpcUaNodeManager.cs:1928, tested in HistoryPagingTests.cs). This phase handles the "+ backlog" half.


Standing constraints (in force)

  • NO Commons wire/proto contract change, NO interface/Core.Abstractions contract change, NO EF migration, NO bUnit (Razor proven only by live /run).
  • Stage by explicit path, never git add .; never stage the never-stage files (sql_login.txt, src/Server/.../Host/pki/, pending.md, current.md, stillpending.md, docker-dev/docker-compose.yml).
  • No force-push, no --no-verify. Finish = merge to master + push.
  • dangerouslyDisableSandbox: true for all build/test/rig commands.

Component A — /hosts per-driver-instance status (backlog #8)

The constraint that shapes the architecture

DriverHealthChanged (the driver-health DPS message, Commons/Messages/Drivers/DriverHealthChanged.cs) carries ClusterId + DriverInstanceId + health, but no hosting-node identity. The snapshot store (InMemoryDriverStatusSnapshotStore) keys by DriverInstanceId alone, so in a redundancy pair the two nodes' health snapshots overwrite each other. ClusterNode.NodeId is a logical id ("LINE3-OPCUA-A"), not the Akka member address — the runtime keys "this node" by that logical id (DriverHostActor.cs:428 filters s.NodeId == _localNode.Value).

Consequences:

  • True per-Akka-member driver health would need a hosting-node field on DriverHealthChanged (a Commons change — forbidden) plus a node-keyed store, OR a per-node DriverHostActor-children Ask (new cluster actor messaging). Both are out of scope here.
  • The reliable join key both sides share is ClusterId.

Approach (no Commons change)

  • Reuse the existing health pipeline unchanged (DriverStatusSignalRBridge → snapshot store).
  • Add one AdminUI-internal store method: IReadOnlyCollection<DriverHealthChanged> GetAll() on IDriverStatusSnapshotStore, implemented in InMemoryDriverStatusSnapshotStore by snapshotting the backing ConcurrentDictionary values. (AdminUI-internal interface — not Commons.)
  • New "Driver Instances" section on /hosts, grouped by ClusterId:
    • Each cluster-group header lists that cluster's nodes from ConfigDB ClusterNode (logical NodeId + Host).
    • The body lists that cluster's live driver snapshots, enriched with Name + DriverType by joining ConfigDB DriverInstance on DriverInstanceId. Each row: a live status chip (the same state→color mapping DriverStatusPanel uses), last successful read, last error, 5-min error count.
    • Live-refreshed by subscribing to the store's SnapshotChanged event (in-process — dodges the Traefik self-hub trap, same pattern as DriverStatusPanel).
  • Keep the existing Akka member topology rows as-is — they are authoritative for Akka-level Up/Unreachable/Leader, which the DB does not know. Two sections, each authoritative for its own data.

Testing

Extract the grouping/enrichment into a pure view-model builder: HostsDriverView.Build(snapshots, clusterNodes, driverInstances) -> IReadOnlyList<ClusterDriverGroup>. xUnit + Shouldly: group-by-cluster, Name/DriverType enrichment, unknown-driver (snapshot with no matching DriverInstance) fallback, empty inputs. The Razor is a thin shell over the builder — proven only by live /run (no bUnit), per established pattern.

Live /run (Component A)

Rebuild both central-1 AND central-2 (the :9200 AdminUI is Traefik round-robined across both — a half-deploy round-robins old/new code). Deploy a Modbus driver (Galaxy too if convenient), open http://localhost:9200/hosts, confirm the Driver Instances section lists the deployed drivers grouped by cluster with live state, then drive a Reconnect from a driver page and confirm the chip updates.

Deferred follow-up (noted, not built)

True per-Akka-member nesting needs node identity on the health feed (Commons field) or a per-node DriverHostActor-children Ask. Out of scope under the no-Commons constraint.


Component B — AbCip nested-struct member expansion (backlog #6)

Finding (simpler than the backlog framed)

The backlog said the CIP Template Object member block "carries no nested template id." It actually does: the per-member info u16 uses the same encoding as the Symbol Object — bit 15 = struct flag, lower 12 bits = template instance id for a struct member (the codebase's own comment in CipTemplateObjectDecoder documents this). The prior member-expansion fix (4e141402/4a7b0fde) discarded that id and recursed with templateInstanceId: null (AbCipDriver.cs:~1179), leaving a test-only SeedDiscoveredUdtShapeForTest seam. So the close is to stop discarding it — no new CIP query needed.

Approach (driver-internal only)

  • Carry uint? NestedTemplateId on the AbCipUdtMember record (driver-internal — not Commons).
  • CipTemplateObjectDecoder sets it from info & 0x0FFF when the struct flag (bit 15) is set; null for scalar members.
  • The driver threads member.NestedTemplateId into the existing FetchUdtShapeAsync (recursion site AbCipDriver.cs:~1179) instead of null. Reuses IAbCipTemplateReader.ReadAsync("@udt/{id}") — zero new CIP primitives.

Testing (fully offline)

Via the injectable IAbCipTemplateReader / FakeTemplateReader: feed a Template Object blob with a struct member whose info = 0x8000 | nestedId; assert (1) the decoder extracts NestedTemplateId == nestedId, (2) the driver fetches the nested shape (@udt/{nestedId}) so nested members become addressable. Add a scalar-member case asserting NestedTemplateId == null. Live-verify stays infra-gated (no nested-UDT ControlLogix rig) — the unit test pins the real risk (decode + threading).

Risk + fallback

Load-bearing assumption: struct-member lower-12 = template id. The codebase comment asserts it and the test pins it. If it proved false on real controller bytes, the fallback is the backlog's original per-member controller query — but the direct path is the confident choice.


Component C — Backlog hygiene + reconcile (backlog #3/#13 + #10)

Galaxy stale comments (comment-only — verify each claim against current code first)

Rewrite 5 sites that reference an unshipped "PR 4.W", a never-added Galaxy:Backend switch, and the retired "legacy-host" backend (PR 7.2 retired Galaxy.Host/Proxy/Shared):

  • GalaxyDriver.cs:~52IGalaxyDataReader "until then... legacy-host backend handles reads in production" → reads ARE supported in production now (standard driver, Phase A shipped).
  • GalaxyDriver.cs:~92_ownedMxSession "PR 4.W — production runtime owned by InitializeAsync" → confirm it's built in InitializeAsync now, relabel to present tense.
  • GalaxyDriver.cs:~669DiscoverAsync "PR 4.W — also refresh the per-platform probe watcher's membership after discovery" → confirm wired, drop the forward-ref.
  • GalaxyDriverFactoryExtensions.cs:~19 — "PR 4.W will add a server-side Galaxy:Backend switch" → never landed and won't (legacy "Galaxy" proxy type retired; only GalaxyMxGateway). Replace with the shipped reality.
  • HostStatusAggregator.cs:~21 — "re-raises OnHostStatusChanged as the driver-level event (wired in PR 4.W)" → confirm wired, drop the forward-ref.

Each rewrite must verify the claim against the current code before asserting it (e.g. confirm _ownedMxSession really is built in InitializeAsync); convert to accurate present-tense or a real TODO.

Reconcile

  • stillpending.md (never-staged): mark item #10 SHIPPED (the ctx-receiver guard is in ScriptAnalysisService.cs:224, 70e1bde9); record the #6/#8 outcomes; tidy.
  • Update the two memory files (MEMORY.md, project_stillpending_backlog.md).

Task slicing (independent → parallelizable)

Task Component Project Class Parallel with
T1 AbCip nested-struct (decoder + record + driver threading + tests) Driver.AbCip standard T2, T4
T2 Hosts store GetAll() + pure grouping view-model + tests AdminUI standard T1, T4
T3 Hosts Razor "Driver Instances" section (uses T2) AdminUI small none (after T2)
T4 Galaxy stale-comment rewrites Driver.Galaxy trivial T1, T2
T5 Reconcile stillpending #10/#6/#8 + memory docs (never-staged) trivial none
T6 Build + driver/AdminUI tests + live /run + finish (merge+push) small none

Parallel implementers will use worktree isolation (or serial commits) per the shared-tree git-race lesson. T3 depends on T2; T5/T6 run at the end.

Done =

Build clean + dotnet test green (Driver.AbCip + AdminUI + Driver.Galaxy) + Component A live /run pass + merged to master + pushed.