diff --git a/docs/plans/2026-06-18-galaxy-writer-handle-sharing-design.md b/docs/plans/2026-06-18-galaxy-writer-handle-sharing-design.md new file mode 100644 index 00000000..06cd485b --- /dev/null +++ b/docs/plans/2026-06-18-galaxy-writer-handle-sharing-design.md @@ -0,0 +1,138 @@ +# Galaxy writer ⇄ subscription-registry item-handle sharing — Design + +> Brainstormed 2026-06-18. Backlog item: `stillpending.md` §2.4 — "Galaxy — writer +> item-handle cache not shared with the subscription registry" +> (`Driver.Galaxy/Runtime/GatewayGalaxyDataWriter.cs`). Off master `70e1bde9`. + +## Problem (verified in code) + +The Galaxy data writer's item-handle cache is **already shipped and reconnect-wired**: +`GatewayGalaxyDataWriter` holds `_itemHandles` (fullRef → MXAccess hItem) plus +`_supervisedHandles`, and `GalaxyDriver.ReopenAsync` invalidates them after a session +recreate (`GalaxyDriver.cs:298`, commits `f05b5d79` / `f77488ee`). So "add a writer cache" +is **not** the open work. + +What is open is exactly what the backlog says: the writer cache is **isolated from the +`SubscriptionRegistry`**. The registry already records `TagBinding(FullReference, ItemHandle)` +for every live subscription, and — the load-bearing fact — both subscribe-handles (from +`SubscribeBulk`) and write-handles (from the writer's `AddItem`) are issued against the **same +MXAccess server registration** (`GalaxyMxSession.ServerHandle`). Within one registration an +hItem is shared across the AddItem / Advise / Write call paths, so a handle the gateway returned +for a subscription is reusable by a `Write`. + +Today the writer never consults the registry, so the **first write to an already-subscribed tag +pays a redundant `AddItem` round-trip** (subsequent writes hit the writer's own `_itemHandles`). +This is an `_Optimization._` — a first-write-only latency saving, plus unification of handle +provenance (one fewer source of truth for "which hItem is this tag"). + +## Load-bearing premise (proven live, not assumed) + +> *A subscription's item handle is usable for a no-login supervisory `Write` that commits.* + +If this were false, borrowing the subscribed handle would **regress writes** for subscribed +tags. The MXAccess Toolkit semantics say it holds (one hItem per item under a registration, +shared across Advise/Write), but this phase does **not** assume it — the live-gw test is the +**merge gate**, not decoration. + +## Approach + +Three options were considered: + +1. **Delegate seam (chosen).** The writer ctor gains an optional + `Func? subscribedHandleSource = null`; `GalaxyDriver` wires it to the + registry's new `TryResolveItemHandle`. This is symmetric with the writer's existing + `securityResolver` delegate, preserves the writer's "independently testable" property + (`null` ⇒ today's exact behavior), and is the smallest surface. +2. **Direct `SubscriptionRegistry` reference** in the writer. Same assembly, registry is + test-constructible — but couples the writer to a concrete collaborator with no upside over (1). +3. **Unified shared cache object** both sides read/write. Matches the literal "wire bindings + into `_itemHandles`" phrasing but is the most invasive (changes the writer's ownership + model). YAGNI. + +→ **Approach 1.** AdminUI untouched. No Commons/proto/EF/migration change. No bUnit. + +## Resolution rule (self-healing — no stale-handle regression) + +`GatewayGalaxyDataWriter.EnsureItemHandleAsync(fullRef)` becomes: + +1. `_itemHandles` hit (the writer `AddItem`'d this tag itself before) → use it. +2. else `subscribedHandleSource?.Invoke(fullRef)` returns a **live** handle → use it, but + **do NOT store it in `_itemHandles`**. The registry owns the borrowed handle's lifecycle + (including reconnect `Rebind`), so after a reconnect the writer always re-borrows the *fresh* + handle on the next write — there is no stale-cache window introduced by the borrow. +3. else `AddItem` + store in `_itemHandles` (today's path), incrementing an `AddItemCallCount` + test seam. + +`AdviseSupervisory` is unchanged: the writer still supervisory-advises the (possibly borrowed) +hItem once per handle via `_supervisedHandles`. Borrowing only skips the `AddItem` round-trip — +which is the entire point of the optimization. The borrowed item already carries the +subscriber's data-change advise; supervisory advise is an additional mode on the same item. + +The decision in steps 1–2 is extracted into a synchronous internal seam +`TryResolveCachedOrBorrowed(fullRef) -> int?` so it is unit-testable **without a live session** +(the SDK `MxGatewaySession` is sealed with an internal ctor and cannot be faked — see Testing). +`EnsureItemHandleAsync` calls the seam first and only `AddItem`s on a null result. + +## Registry change + +`SubscriptionRegistry` gains a forward lookup: + +- A `ConcurrentDictionary _itemHandleByFullRef` (`StringComparer.OrdinalIgnoreCase`, + matching the writer's cache), maintained incrementally in `Register` / `Rebind` (add the + `fullRef → handle` for each `binding.ItemHandle > 0`) and best-effort dropped in `Remove` / + `Rebind`. +- `public int? TryResolveItemHandle(string fullRef)` that returns the mapped handle **only if + `_subscribersByItemHandle` still contains it** — a liveness guard. This means even a lingering + forward-map entry can never hand out a dead handle, because `_subscribersByItemHandle` is the + already-authoritative live-handle set. The guard de-risks the removal bookkeeping: the forward + map is an index, not the source of truth. + +The event-dispatch hot path (`ResolveSubscribers`) is **untouched** — the forward map is consulted +only on writes (rare relative to events) and mutated only on subscribe/unsubscribe/rebind (which +already do O(bindings) work). + +`GalaxyDriver` passes `_subscriptions.TryResolveItemHandle` as the writer's +`subscribedHandleSource` when it constructs the production writer. + +## Error handling + +Unchanged. A borrowed handle that fails a write surfaces its own Bad status via the existing +`TranslateReply` path; the writer does not cache borrowed handles, so the next write simply +re-resolves (re-borrow if still live, else `AddItem`). `TryResolveItemHandle` returns `null` +cleanly for unknown / dead handles. + +## Testing + +**Unit (xUnit + Shouldly, `Driver.Galaxy.Tests`) — no live session required:** + +- `GatewayGalaxyDataWriter`: seed `subscribedHandleSource` → `TryResolveCachedOrBorrowed` + returns the borrowed handle and leaves `CachedItemHandleCount == 0`; a `_itemHandles` hit wins + over the source; a `null` source ⇒ no borrow (returns null). (Mirrors the existing + `SeedHandleCachesForTest` / count-seam pattern, since the gw session cannot be faked.) +- `SubscriptionRegistry`: `TryResolveItemHandle` returns the handle after `Register`; returns + `null` after `Remove`; returns the **fresh** handle after `Rebind`; the liveness guard returns + `null` when the handle is absent from `_subscribersByItemHandle`. + +**Live gw (the merge gate)** — extend `GatewayGalaxyLiveReopenAndWriteTests` (skip-gated; runs +against `MXGW_ENDPOINT=http://10.100.0.48:5120` with the key sourced via the durable +`docker exec … printenv GALAXY_MXGW_API_KEY` recipe, never echoed). Add an internal +`AddItemCallCount` seam on the writer, then: + +- Subscribe a real tag (registry now holds its hItem) → write it through the registry-wired + writer → assert the write **commits** (Good status + value persists) **and `AddItemCallCount == 0`**. +- Control: write a **non-subscribed** tag → assert `AddItemCallCount == 1`. + +This proves both halves: the borrowed handle is write-usable (premise holds) **and** the redundant +`AddItem` is actually skipped. + +## Deferred / out of scope + +- Reverse direction (subscriber borrowing the writer's `AddItem` handles) — subscriber handles + come from `SubscribeBulk`; no need. +- Unifying the two caches into one shared object (Approach 3). +- Any AdminUI / Commons / proto / EF change. + +## Done = + +Build clean + `dotnet test` (Driver.Galaxy) green + the live-gw test proves a subscribed-tag +write commits with zero `AddItem` → merge to master + push.