49 Commits

Author SHA1 Message Date
Joseph Doherty b330faff03 mbproxy: cross-platform support — Linux/systemd alongside Windows
Make the service build, run, and install on Linux as a first-class
target while keeping the Windows Service + Event Log behaviour intact.

- Build: drop the hardcoded win-x64 RID — single-file publish now works
  for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts.
- Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host —
  Windows Event Log under the SCM, local syslog under systemd
  (Serilog.Sinks.SyslogMessages), none for interactive runs. The
  EventLog truncation helper is extracted so it is testable cross-OS.
- Host: Program.cs registers AddSystemd() alongside AddWindowsService().
- Config: a RID-conditioned appsettings template ships Windows or Unix
  paths; both templates are schema-validated by a test.
- Install: systemd unit (Type=exec) plus install.sh / uninstall.sh.
  Also fixes two cross-platform bugs found while testing: install.ps1
  and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in
  PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths.
- Docs updated across README, CLAUDE.md, and docs/ for dual-platform.

413 tests pass on Windows; 374 (all non-simulator) on Linux.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:41:59 -04:00
Joseph Doherty 0868613890 mbproxy: add keepalive / connection monitoring
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:40:54 -04:00
Joseph Doherty 7466a46aa7 mbproxy/docs: retire superseded design/plan docs and dissolve DL260/
The standalone design.md, kpi.md, operations.md, and the docs/plan/
phase tree were point-in-time planning artefacts now superseded by the
topic-organized docs/ tree (Architecture/, Features/, Operations/,
Reference/, Testing/). The DL260/ folder mixed a device-reference doc, a
test fixture, a sample test, and a screenshot; its contents now live in
their natural homes (dl205.md + mbtcp_settings.JPG under docs/Reference/,
dl205.json next to its launcher in tests/sim/, sample test dropped).

All cross-references in the surviving docs, README, CLAUDE.md, the config
template, and source comments are repointed to the new locations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:37:48 -04:00
Joseph Doherty 0a603f94d0 mbproxy/README: route publish steps through install/publish.ps1
The README was telling readers to memorise the dotnet publish flag set; now
it points at the script that captures both flavours so the documented path
matches what we actually run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:55:07 -04:00
Joseph Doherty ee1ae89e25 mbproxy/install: add publish.ps1 for the two single-file build flavours
Captures the dotnet publish invocations used to produce the self-contained
(~100 MB, bundles .NET 10) and framework-dependent (~1.5 MB, requires .NET
10 preinstalled) win-x64 single-file Mbproxy.exe builds, so re-cutting a
release isn't institutional knowledge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:53:22 -04:00
Joseph Doherty 1a2856526a mbproxy: strip historical phase/wave/plan references from source comments
Comments described the *history* of how the code arrived (phase numbers,
wave IDs, review IDs, dated TODOs) instead of what it does today. That
scaffolding rotted as the codebase evolved. Cleaned 60 source files +
.gitignore; behaviour unchanged (387/387 tests still pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:04:30 -04:00
Joseph Doherty b3b8313e9c mbproxy: Wave 6 — wire ProxyCounters.AddBytes (bytes counters were always 0)
The 10-min stress test (1.46 M PDUs through the proxy) revealed that
status.json's bytes.upstreamIn / bytes.upstreamOut counters always read 0
because ProxyCounters.AddBytes was defined but never called from anywhere.
Same shape as the original review's W2.22 finding (counter wired in DTO +
HTML but no increment site), missed for the bytes counters specifically.

Wired five increment sites in PlcMultiplexer:

  OnUpstreamFrameAsync (request side, parsed frame)
    AddBytes(up: frame.Length, down: 0) — counted ONCE per parsed frame
    regardless of subsequent routing (cache hit, coalesce, backend
    round-trip, exception).

  RunBackendReaderAsync fan-out (response side, after TrySendResponse=true)
    AddBytes(up: 0, down: outFrame.Length) per delivered party. With
    coalescing, one backend response fans out to N parties and produces
    N × frame.Length bytes leaving the proxy upstream-side. Drops
    (TrySendResponse=false) increment ResponseDropForFullUpstream
    instead.

  Cache hit path
    AddBytes(up: 0, down: hitFrame.Length) for the BuildCacheHitFrame
    response (no backend round-trip but still bytes leaving the proxy).

  Saturation cleanup (W1.2 path, both branches)
    AddBytes(up: 0, down: excFrame.Length) per delivered exception 0x04.

  Non-coalescing-path saturation
    AddBytes(up: 0, down: excFrame.Length) for the single exception 0x04.

  Watchdog timeout exception delivery
    AddBytes(up: 0, down: excFrame.Length) per delivered exception 0x0B.

Backend-side bytes (proxy ↔ PLC) are NOT counted by these counters — the
field name is `BytesUpstreamIn/Out` which is upstream-only by contract.

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:30:48 -04:00
Joseph Doherty 59d0b5deb9 mbproxy: Wave 5 — fixes from third re-review pass
Closes findings from the third focused re-review pass on the post-W4-followup
state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md).

W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync
  The hot-reload OnChange handler at AdminEndpointHost.StartAsync did
  fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort
  was hot-reloaded during shutdown, the queued Task could land between
  StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take
  the lock, and bind a fresh Kestrel WebApplication on the new port —
  resurrecting admin AFTER the host considered it shut down. Worse, if
  DisposeAsync had already run _lock.Dispose, the queued Task throws
  ObjectDisposedException as an unobserved Task exception. Fix: _disposed
  guard at the top of the OnChange lambda AND inside the queued Task.Run,
  plus try/catch (ObjectDisposedException) around _lock.WaitAsync and
  _lock.Release.

W5/m2 — inFlightAtCancel computed AFTER base.StopAsync
  The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync
  (so the multiplexers' counter providers were still wired), but it computed
  the snapshot AFTER base.StopAsync(cancellationToken). Between those two
  lines, in-flight requests whose responses arrive get removed from
  _correlation, and the watchdog can clear stale entries. The reported
  count therefore drifted downward from "in-flight at signal time" to
  "in-flight at compute time." Fix: snapshot at the very top of StopAsync
  before any cancellation is propagated.

W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort)
  When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the
  body runs unprotected. A concurrent EnsureBackendConnectedAsync that
  DOES hold the gate may TryAllocate a TxId that collides (after wraparound
  in the allocator's forward scan) with one being released by the channel
  drain. The double-release would mark the new request's slot as free even
  though it's legitimately in-flight, allowing the next allocation to reuse
  the same slot and CorrelationMap.TryAdd to fail (silent request drop).
  Probability is very low (gate timeout AND new accept landing AND TxId
  collision in 65,536-slot space); the only consequence is one dropped
  request the client retries. Documented inline at PlcMultiplexer.cs near
  the gateHeld declaration as accepted best-effort behaviour.

W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor
  Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per
  re-review verdict.

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:13:47 -04:00
Joseph Doherty 9251c564c1 mbproxy: resolve remaining items from ReReviewAfterRemediation.md
Closes the latent + minor + test-discipline items left after Wave 4. Updates
the re-review doc with a final resolution table — every actionable finding
now marked Resolved or Accepted with rationale.

NM3 — _supervisorCts leaks on re-Start
  StartAsync now disposes the previous CTS before reassigning. Idempotent:
  a try/catch (ObjectDisposedException) covers the very-first-Start case
  where the field-init CTS is still fresh.

NM4 — W2.15 TCS is single-shot
  _firstAttemptCompleted is no longer readonly; StartAsync re-creates it
  after the W2.16 guard so a re-Started supervisor's
  WaitForInitialBindAttemptAsync doesn't observe the previous run's signal.

Nm6 — _admin GetService<> returns null silently
  ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered.
  Preserves the loud-failure intent from the original IHostedService
  registration without forcing test hosts to wire admin.

Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard
  Added a volatile bool _disposed flag with an early-return at the top of
  DisposeAsync. Symmetry with PlcMultiplexer; protects against
  ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton
  again on host shutdown.

T3 — RemoveInheritedAppsettings only fires on Build
  AfterTargets="Build;Publish" + a second Delete against $(PublishDir)
  so a `dotnet publish` against the test csproj doesn't ship the example
  PLCs from the linked install template.

T4 — Stale TryAttachOrCreate_*_ReturnsTrue_* test method names
  Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool
  return.

Accepted (with rationale documented in ReReviewAfterRemediation.md):
  Nm2 — CoalescedHit semantic is per-design
  Nm4 — _lastBindError preservation on clean exit is intentional forensics
  Nm5 — EventLogBridge has no injectable logger
  Nm8 — Cosmetic log noise
  T1  — Reflection on private fields documented as maintenance trap

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:02:21 -04:00
Joseph Doherty 7a435957ee mbproxy: Wave 4 — fix issues introduced by the Wave-1/2 fixes
Closes the new findings from the post-remediation re-review
(codereviews/2026-05-14/ReReviewAfterRemediation.md):

NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero
  Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was
  meant to replace. Supervisor.StopAsync nulls the per-mux counter
  provider before the drain loop runs, so CountInFlight always returns 0
  and the drain budget is never spent on actual draining. Fix: snapshot
  the in-flight count BEFORE supervisor stop, drop the theatrical
  post-stop loop, and report InFlightAtCancel as the snapshot count
  (= the number of in-flight requests dropped by the stop). The
  supervisor stop IS the drain — there is nothing to drain that
  wouldn't be killed by the stop itself.

NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable
  Without a token, a long Polly-wrapped EnsureBackendConnectedAsync
  against an unreachable host could hold the gate for the full
  BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync
  (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound
  the wait with a 2 s teardown deadline; on timeout proceed
  best-effort without the gate. Worst-case consequence is one orphaned
  in-flight cycle on the dying backend, surfaced to upstream as
  exception 0x0B by the watchdog.

NM2 — ReplaceContext non-atomic ctx + provider swap
  Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If
  `_ctx` was swapped first, a snapshot taken in the gap would still hold
  the OLD adapter wrapping the OLD cache — which the supervisor disposes
  immediately after we return. Fix: set the provider FIRST, then swap
  `_ctx`. Snapshots in the swap window now read either (old, old) or
  (new, new), never (old-after-disposed).

NM5 — Self-cascade ObjectDisposedException after dispose
  Writer/reader fault catches fired `_ = TearDownBackendAsync(...)`
  unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the
  fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as
  an unobserved Task exception. Fix: skip self-cascade when
  `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit
  TearDown anyway.

Nm1 — Saturation cleanup uses await SendResponseAsync
  W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync,
  which would serialise on a wedged late-attacher's full bounded channel
  and stall delivery to its peers — contradicting the W1.3 doctrine that
  the fan-out path must never await per-pipe writes. Fix: use
  TrySendResponse and increment ResponseDropForFullUpstream on drop.

T2 — WatchdogVsResponse_Race seeded Random fragility
  Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms;
  Random's algorithm is implementation-defined across .NET major versions
  (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all
  samples on one side of the deadline and break the "both branches must
  fire" assertion. Fix: deterministic counter-based alternation (15 fast
  + 15 slow across 30 iterations) — guaranteed by construction.

Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS
single-shot semantics) are unfixed: no caller actually re-Starts a
supervisor today; both become real only if the reconciler ever changes
to re-Start instead of dispose-and-rebuild. Documented in the re-review.

Tests: 387 pass / 0 fail. Three back-to-back race-test runs in
isolation all green (T2 alternation is deterministic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:52:33 -04:00
Joseph Doherty 53f842a655 mbproxy: close all 5 race-hard W3 test gaps from 2026-05-14 review
Closes the 5 deterministically-race-hard test gaps that were previously
documented as known omissions (#5–9 in codereviews/2026-05-14/RemediationPlan.md).
Tests: 387 pass / 0 fail (baseline 382 + 5 new race tests). Three back-to-back
runs in isolation all green — no observable flakes.

Each test reaches the relevant code path deterministically by either:
  - reaching into the multiplexer's private state via reflection (only used
    for pre-saturating the TxIdAllocator — the test path that's externally
    impossible to hit otherwise without spawning 65,536 real connections),
  - constructing a backend stub that exercises the timing window directly, or
  - asserting only the externally-observable contract that holds across all
    valid interleavings (no-double-delivery, no-hang) rather than asserting
    a specific ordering that flakes.

W3 #5 — TxIdAllocator_Saturated_NextRequest_GetsException04_WithOriginalTxId
  Pre-saturates the multiplexer's _allocator via reflection (TryAllocate
  ×65536), then sends one FC06 write. The next request hits the
  !_allocator.TryAllocate branch immediately and the test verifies exception
  04 with the original TxId echoed.

W3 #6 — TxIdAllocator_Saturated_TwoConcurrentIdenticalReads_BothPipesGetException04
  Pre-saturates the allocator, then fires two concurrent identical FC03 reads
  from two pipes. Both pipes must receive exception 04 — regardless of whether
  pipe B coalesces onto pipe A's stub (W1.2's deliver-to-late-attachers path)
  OR opens its own factory failure path. The contract verified is "no late
  attacher hangs" — the externally-observable invariant from the W1.2 fix.

W3 #7 — SlowUpstream_DoesNotStallPeerResponses_DropCounterIncrements
  Wedges upstream A by leaving its socket-receive side undrained, pumps 500
  FC03 requests through A so the bounded response channel + kernel buffer
  fill, then sends one request from a healthy upstream B. B's response must
  arrive within seconds (would block forever pre-W1.3) and A's
  ResponseDropForFullUpstream counter must increment — proving the W1.3
  TrySendResponse non-blocking fan-out works as designed.

W3 #8 — WatchdogVsResponse_Race_AlwaysExactlyOneOutcome_PerRequest
  Custom SlowResponseBackend stub responds at a randomized 350–450 ms delay
  while BackendRequestTimeoutMs=400. Across 30 iterations, the request races
  the watchdog's per-tick scan. The contract asserts: every request gets
  exactly ONE response (normal or exception 0x0B), the original TxId is
  always echoed, and BOTH branches are exercised (proving the race window is
  real). The W1 claim-then-dispatch design (CorrelationMap.TryRemove as the
  single source of truth) makes this contract hold across all interleavings.

W3 #9 — CascadeVsNewAccept_StressChurn_NoCrash_NoHang
  Stress-test: 3 cascade cycles, 8 concurrent connect+request tasks per
  cycle. Backend is killed mid-cascade-storm to force teardown to race the
  in-flight connect attempts. After all churn the multiplexer must still
  serve a normal request. The originally-flagged race (a pipe added between
  _pipes.Values.ToArray() and _pipes.Clear() in TearDownBackendAsync) is
  microseconds wide and not deterministically reproducible without test
  seams; this stress test instead proves the no-crash-under-churn property
  that operators care about.

Helpers added:
  DrainAllocator(PlcMultiplexer) — reflection-based saturation primitive,
    only used by tests #5 and #6.
  SlowResponseBackend — backend stub with caller-supplied per-request delay
    via a Func<int>, only used by test #8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:29:44 -04:00
Joseph Doherty 2545237973 mbproxy: close remaining 5 W3 test gaps from 2026-05-14 review
Closes the 5 "easily addable" W3 test gaps left after the prior W3 commit;
the 5 race-hard gaps remain documented as known omissions per the plan.
Tests: 382 pass / 0 fail (baseline 378 + 4 net new methods — the supervisor
runtime-fault test replaces the existing placeholder).

  #11 BcdCodecTests.Encode16_IntMinValue_Throws_OutOfRange_NoArithmeticSurprise
      Locks the (uint)value > Max16 boundary check against int.MinValue. The
      cast becomes 0x80000000 which is well above 9999, so the throw fires
      cleanly. Prevents regression to a two-sided int comparison that would
      underflow.

  #15 BcdPduPipelineTests.FC03_Request_QtyAbove128_AtNonBcdAddress_PassesThroughUnchanged
      DL205/DL260 caps FC03/FC04 at qty=128 (DL260/dl205.md). The proxy must
      NOT truncate the qty field — passing through unchanged lets the PLC's
      own validator return exception 03 to the client (transparent contract
      for FCs/addresses the rewriter doesn't own).

  #4 SupervisorTests.Supervisor_RuntimeFault_OnRunningListener_RecoversAndRebinds
      Replaces the previous placeholder. Genuinely faults the running listener
      mid-life by stopping its underlying TcpListener via reflection (the
      single externally-observable hook to force the accept loop's
      AcceptAsync to throw ObjectDisposedException). Asserts the supervisor
      transitions to Recovering, re-binds via the Polly pipeline, and bumps
      RecoveryAttempts.

  #10 HotReloadE2ETests.E2E_ReadCoalescingEnabled_FlipAtRuntime_PropagatesToOptionsMonitor
      Validates that flipping Mbproxy.Resilience.ReadCoalescing.Enabled at
      runtime via hot-reload propagates through the live IOptionsMonitor.
      The W2.1 fix wires the accessor through to add/restart supervisors;
      the multiplexer reads it per-PDU (unit-tested separately). Proving
      IOptionsMonitor sees the new value is sufficient for the contract.

  #16 ConfigReconcilerTests.Apply_ManyConcurrentReloads_With_PlcChurn_NoCorruption
      Stress-tests the W2.3 ConcurrentDictionary fix. 16 concurrent applies
      cycle through 8 distinct PLC rosters, driving Add+Remove churn against
      the live supervisor dict. Without W2.3 the inner Task.WhenAll
      continuations would corrupt Dictionary<,> and crash with
      KeyNotFoundException / ArgumentException. Asserts every apply succeeds,
      no orphan supervisors remain, and the reload counter equals 16.

The 5 deterministically-race-hard gaps (#5 TxId saturation propagation, #6
coalescing factory leak under saturation, #7 backend-reader head-of-line
block, #8 watchdog↔response race, #9 cascade↔new-accept race) remain open
by design — reproducing those races deterministically requires test seams in
production code or stress-style tests that flake on slow CI. The Wave-1
fixes are still verified at the unit-contract level
(UpstreamPipeTests.TrySendResponse_WhenChannelFull, etc.).

This closes everything actionable in codereviews/2026-05-14/RemediationPlan.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:17:03 -04:00
Joseph Doherty 7ead3581ab mbproxy: Wave 3 cleanups, docs, and test gaps from 2026-05-14 review
Closes the Wave 3 (cleanup) tier of codereviews/2026-05-14/RemediationPlan.md.
Tests: 378 pass / 0 fail (baseline 370 + 8 new W3 regression tests).

Code cleanups:
  * PlcMultiplexer: removed dead `elapsedMs` calculation (the actual EWMA
    conversion uses Stopwatch ticks two lines below).
  * UpstreamPipe.FillAsync: dropped the meaningless `firstRead && remaining
    == count ? false : false` ternary; both branches were `false`.
  * InFlightByKeyMap.TryAttachOrCreate (always returned `true`) renamed to
    `AttachOrCreate` and made `void`. Test sites updated to drop the dead
    `bool ok = ...; ok.ShouldBeTrue();` assertions.
  * BcdCodec.HasBadNibble promoted from private to internal; the duplicate
    copy in BcdPduPipeline removed and the call sites updated to
    `BcdCodec.HasBadNibble`.
  * PlcMultiplexer watchdog comment fixed: said "1-second floor", code uses
    100 ms. Now both agree.
  * StatusSnapshotBuilder: simplified the unreachable
    `RemoteEp?.ToString() ?? RemoteEp?.Address.ToString() ?? "?"` to
    `RemoteEp?.ToString() ?? "?"`.
  * Mbproxy.csproj: stale "deferred" Polly comment replaced with a real
    description of where Polly is used (BackendConnect + ListenerRecovery).

Doc updates:
  * README: added a callout about the unconventional 32-bit BCD wire format
    ("two base-10000 digits in CDAB", not standard binary CDAB Int32) so
    integrators using off-the-shelf clients learn about the silent-corruption
    hazard before configuring writes.
  * docs/design.md: clarified `cacheMissCount` and `coalescedMissCount`
    semantics — "miss" means "did not find a fresh entry / did not coalesce",
    NOT "produced a backend round-trip". Operators wanting actual backend
    traffic should compute `miss − coalescedHit − exception04`.
  * docs/Architecture/ResponseCache.md: documented the structural
    "skip invalidation while recovering" gating (no backend reader during
    recovery → no FC06/FC16 response → no invalidation).
  * docs/Operations/Configuration.md: noted that the Event Log sink is the
    custom EventLogBridge, not Serilog.Sinks.EventLog (W2.23 cached check).
  * docs/plan/README.md: added a Phase 12 row pointing at the remediation
    plan and linking out to codereviews/2026-05-14/.

Test additions (W3 high-value gaps):
  * BcdPduPipelineTests:
    - FC16_WriteStartsOnHighWord_Of32BitPair_PassesThroughRaw_WithPartialWarning
      (symmetric inverse of the existing low-side partial-overlap test).
    - FC03_Mixed_16Bit_32Bit_AndNonBcd_InOneRead_OnlyConfiguredSlotsRewritten
      (mixed-slot routing in a single FC03 read).
    - FC16_Response_PassesThroughUnchanged_RegardlessOfTagMap (FC16 response
      carries no register data; rewriter must pass through).
  * AdminEndpointTests:
    - NonGetMethod_AgainstAdminRoutes_Returns405 (Theory: POST/PUT/DELETE/
      PATCH against `/` and `/status.json` must return 405; guards against
      an accidental MapPost being added later).
  * HotReloadE2ETests:
    - E2E_TagListReload_OnCacheablePlc_EmitsCacheFlushedEvent (validates the
      W2.8 cache.flushed wiring end-to-end via the real FileSystemWatcher
      reload path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:06:52 -04:00
Joseph Doherty e66b17fe5f mbproxy: Wave 2 fixes from 2026-05-14 code review
Resolves the 21 Major findings catalogued in
codereviews/2026-05-14/RemediationPlan.md (Wave 2). Tests: 370 pass / 0 fail
(baseline 363 + 7 new W2 regression tests).

Multiplexer / concurrency:
  W2.1  ConfigReconciler.Attach now threads the live coalescingAccessor through
        to add/restart-built supervisors so a hot-reload of
        ReadCoalescing.{Enabled,MaxParties} propagates to PLCs added or
        restarted via reload.
  W2.2  PlcMultiplexer._disposed and UpstreamPipe._disposed are now volatile
        for ARM/portability defense.
  W2.3  ProxyWorker._supervisors / ConfigReconciler._supervisors switched from
        Dictionary to ConcurrentDictionary; reconciler uses TryRemove. The
        outer Apply is serialised by a semaphore but the inner Add/Remove/
        Restart Task.WhenAll continuations run in parallel.
  W2.4  Counter parity for cache miss + coalescing-saturation miss documented
        inline (per-design contract; behavior unchanged).
  W2.5  _disposeCts.Dispose() and _connectGate.Dispose() guarded against late
        watchdog ticks.
  W2.6  _connectGate disposed in DisposeAsync.
  W2.7  Inline doc clarifying the post-rewriter FC byte read.

Cache / hot-reload:
  W2.8  PlcListenerSupervisor.ReplaceContextAsync now calls Clear() to capture
        the entry count, emits mbproxy.cache.flushed, then disposes the old
        cache. Previously the event was defined but never emitted.
  W2.9  Inline doc explaining the implicit "skip cache invalidation while
        recovering" gating (no backend reader during recovery → no FC06/FC16
        response → no invalidation).
  W2.10 ReloadValidator now re-checks resolved per-tag CacheTtlMs against
        Cache.AllowLongTtl after BcdTagMapBuilder folds the per-PLC default.

BCD rewriter:
  W2.11 Duplicate addresses detected within Global itself and within the per-PLC
        Add list itself, BEFORE the working dictionary collapses keys. Cross-list
        collisions (Global vs Add) remain the documented width-override pattern.
        Previously the DuplicateAddress error was unreachable dead code.
  W2.12 OverlappingHighRegister reports each colliding pair exactly once
        (canonicalised low/high pair tracked in a HashSet).
  W2.13 FC16 32-bit write rejects clientLow > 9999 or clientHigh > 9999 BEFORE
        the high*10000+low reconstruction. Without this guard, (high=9999,
        low=9999) silently re-encoded as (high=9998, low=9999), losing 1 from
        the high word.
  W2.14 FC16 validates pdu.Length >= 6 + qty*2 upfront — no half-rewritten
        requests when a malformed client claims more registers than it ships.

Supervisor:
  W2.15 WaitForInitialBindAttemptAsync now backed by TaskCompletionSource
        instead of 10ms busy-poll. Resolves race against fast Stopped→Bound→
        Stopped transitions and hangs when the supervisor task throws.
  W2.16 StartAsync refuses re-entry on a non-Stopped supervisor (was leaking
        the previous _supervisorCts).
  W2.17 New TransitionTo helper writes _state, _lastBindError, and (optionally)
        _recoveryAttempts under one lock. Snapshot() reads under the same lock
        so the status page never reports an inconsistent triple. Truncate
        helper extracted (was copy-pasted across three sites).
  W2.18 MbproxyOptionsValidator + ReloadValidator reject Connection.{Backend
        ConnectTimeoutMs, BackendRequestTimeoutMs, GracefulShutdownTimeoutMs}
        <= 0. Misconfigured 0 produces immediate CancelAfter(0) failures.

Hosting / diagnostics:
  W2.20 ProxyWorker.StopAsync supervisor-stop deadline now reads from
        IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs
        (was hard-coded 5s).
  W2.21 src/Mbproxy/appsettings.json deleted; the published file is now a Link
        to install/mbproxy.config.template.json so the binary ships with a
        usable, fully-commented example config instead of an empty stub. Tests
        strip the inherited file from their bin via an AfterTargets="Build"
        Target so they don't pick up the template's example PLCs.
  W2.22 invalidBcdWarnings (PlcPdusStatus) and codeOther (ExceptionCounts)
        added to StatusDto, plumbed through StatusSnapshotBuilder, surfaced
        in StatusHtmlRenderer table cells.
  W2.23 EventLogBridge caches EventLog.SourceExists at construction so Emit
        doesn't hit the registry on every Error+ log line.

New regression tests:
  ReloadValidatorTests:
    Validate_PerTagCacheTtl_Above60s_Without_AllowLongTtl_Fails
    Validate_PerTagCacheTtl_Above60s_With_AllowLongTtl_Passes
    Validate_ResolvedTtl_FromPerPlcDefault_AboveCap_Fails
    Validate_ZeroBackendConnectTimeoutMs_Fails
    Validate_NegativeGracefulShutdownTimeoutMs_Fails
  BcdPduPipelineTests:
    FC16_32Bit_ClientHighOrLowAbove9999_PassesThroughRaw_WithInvalidBcdWarning
    FC16_TruncatedRegisterData_PassesThroughRaw_NoPartialRewrite

Reworked tests in BcdTagMapBuilderTests for the W2.11 contract (Global dup,
Add dup, Add-overrides-Global accepted as width override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:48:44 -04:00
Joseph Doherty ce32c5cee8 mbproxy: Wave 1 fixes from 2026-05-14 code review
Resolves the four critical correctness defects + the ShutdownCoordinator
double-stop ordering bug called out in codereviews/2026-05-14/Overview.md.
Tests: 362 pass / 0 fail (baseline 358 + 4 new W1 regression tests).

W1.1 — Context swap on running multiplexer.
  PlcMultiplexer._ctx becomes volatile with a new ReplaceContext() method
  that re-registers the cache stats provider on the (preserved) counters.
  PlcListener exposes its multiplexer; PlcListenerSupervisor.ReplaceContextAsync
  swaps the running mux first, then disposes the old cache. Hot-reload
  tag-list changes and the cache-flush-on-reload contract now actually take
  effect on the next PDU instead of waiting for the next listener fault.

W1.2 — Coalescing factory leak.
  When the InFlightByKey factory soft-fails (allocator saturation or duplicate
  TxId), the cleanup path now TryRemoves the stub and walks every party on it
  (including late attachers) to deliver Modbus exception 0x04. Previously
  only the leader got the exception; late attachers waited forever for a
  response that no backend round-trip would ever fire.

W1.3 — Backend-reader head-of-line block.
  UpstreamPipe gains TrySendResponse for non-blocking enqueue. The per-PLC
  backend reader's fan-out loop uses it instead of awaiting SendResponseAsync,
  so a wedged upstream's full bounded response channel can no longer stall
  the single backend reader and starve every other client on that PLC. New
  responseDropForFullUpstream counter on ProxyCounters / CounterSnapshot
  records the drops.

W1.4 — Stranded outbound frames after cascade.
  TearDownBackendAsync acquires _connectGate and drains any frames left in
  _outboundChannel after the writer task faulted/cancelled, releasing their
  proxy TxIds back to the allocator. Without this, a fresh
  EnsureBackendConnectedAsync racing the cascade would send stranded frames
  with old TxIds onto the new backend socket; the responses would arrive
  with no correlation entry and the upstream peers would hang on the
  watchdog until BackendRequestTimeoutMs.

W1.5 — Delete ShutdownCoordinator (Option B).
  Drain logic moved into ProxyWorker.StopAsync. AdminEndpointHost is no
  longer registered as IHostedService; ProxyWorker drives its lifecycle
  directly so admin starts after listeners are bound and stops AFTER the
  in-flight drain (the design's documented contract). Admin is resolved
  lazily in ExecuteAsync to break the circular DI graph
  (Admin -> StatusSnapshotBuilder -> ProxyWorker). GracefulShutdownTimeoutMs
  is now read fresh from IOptionsMonitor.CurrentValue at stop time, so a
  hot-reloaded value is honoured. Removes ShutdownCoordinator + tests.

New tests:
  PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu
  PlcMultiplexerTests.ReplaceContext_NewCache_NextReadGoesToBackend_NotOldCache
  UpstreamPipeTests.TrySendResponse_WhenChannelFull_ReturnsFalse_WithoutBlocking
  UpstreamPipeTests.TrySendResponse_AfterDispose_ReturnsFalse

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:16:13 -04:00
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00
Joseph Doherty 53a7111120 mbproxy/README: refresh top-section descriptions after docs split
Three lines were stale after the docs/ reorganization:

- Headline link for the response cache pointed at design.md's "Phase 11"
  anchor; the focused doc is now docs/Architecture/ResponseCache.md.
- The docs/ tree one-liner said "Design document, phase plans, and
  operations runbook" but the directory now contains five subdirs of
  topic-focused docs.
- The Resource index row for design.md claimed it carried "log events"
  (now lives in Reference/LogEvents.md) and operations.md's row claimed
  "config, troubleshooting" (now in Operations/Configuration.md and
  Operations/Troubleshooting.md). Both rows rewritten to match each
  doc's current scope. Added a row for docs/kpi.md which the index
  was missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:51:26 -04:00
Joseph Doherty b4c82bf379 mbproxy/docs: slim operations.md to runbook content + pointers
Three sections in operations.md duplicated the new focused docs:
  - "Configuration" → Operations/Configuration.md + Features/HotReload.md
  - "Status page"   → Operations/StatusPage.md
  - "Common failure modes" → Operations/Troubleshooting.md + Reference/LogEvents.md

Replaced each with a short pointer block. The runbook now keeps only
content unique to day-two ops: install steps, upgrade procedure, uninstall,
log file locations / retention / archival, and the first-install smoke
checklist. 271 -> 176 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:49:34 -04:00
Joseph Doherty 0eb560a7f6 mbproxy/docs: slim design.md log-event section to a pointer
The 24-row event-name table in design.md duplicated docs/Reference/LogEvents.md
and had drifted out of date (config.reload.applied and .rejected listed wrong
property templates; LogContext.PushProperty scoping described but not actually
used in source). LogEvents.md is the authoritative catalog now; design.md
keeps the logging-architecture rationale (sinks, naming convention, stability
contract) and routes to LogEvents.md for the names.

Also corrects the property-emission description: properties go through
[LoggerMessage] templates, not LogContext.PushProperty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:45:39 -04:00
Joseph Doherty f49e27e316 mbproxy/docs: split deep docs into focused PascalCase files per StyleGuide
Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.

Architecture/
  - Overview.md         (150 lines) — listener topology, request flow, per-PLC isolation
  - ConnectionModel.md  (247 lines) — TxId multiplexer, watchdog, disconnect cascade
  - ReadCoalescing.md   (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
  - ResponseCache.md    (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation

Features/
  - BcdRewriting.md     (252 lines) — codec, CDAB, FC scope, partial-overlap policy
  - HotReload.md        (189 lines) — IOptionsMonitor + per-change-kind reconcile rules

Operations/
  - Configuration.md    (422 lines) — every Mbproxy:* option + validation rules
  - StatusPage.md       (334 lines) — admin endpoint surface, every JSON field
  - Troubleshooting.md  (364 lines) — diagnosis playbook keyed to log events

Reference/
  - LogEvents.md        (499 lines) — 28 events across 7 categories, grep-verified

Testing/
  - Simulator.md        (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk

Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.

Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:44:34 -04:00
Joseph Doherty 4fcda87ecd root/docs: index mbproxy's post-Phase-9/10/11 capabilities
Multiplexing, read coalescing, and opt-in response caching change what someone might
come to mbproxy for, so the root index now names them alongside BCD rewriting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:11:20 -04:00
Joseph Doherty 1db900edef mbproxy: add opt-in response cache (Phase 11)
Layers a per-PLC, per-tag response cache on top of Phase 10's coalescing.
Cache is OFF by default per tag (CacheTtlMs = 0); a fresh deployment with no
TTL config behaves identically to Phase 10. Operators opt tags in by setting
CacheTtlMs > 0 on a BcdTagOptions entry (or DefaultCacheTtlMs > 0 on a
PlcOptions entry), explicitly acknowledging the staleness window.

Cache lookup order: cache -> coalesce -> backend. A cache hit short-circuits
both Phase 10's coalescing path and Phase 9's backend send. Cache stores
POST-rewriter PDU bytes so hits never re-invoke the BCD rewriter. FC06/FC16
write responses invalidate every cached entry whose address range overlaps
the write (half-open interval math).

New types (Mbproxy.Proxy.Cache, all internal):
- CacheKey (record-struct, same shape as CoalescingKey but kept SEPARATE so
  the two phases evolve independently).
- CacheEntry, ResponseCache (IDisposable; LRU + PeriodicTimer eviction
  loop), CacheInvalidator (pure overlap matcher), CacheLogEvents (stable
  mbproxy.cache.* names).

Multi-tag range TTL = min(TTLs); any tag with TTL = 0 in the range disables
caching for the whole read (conservative-by-design).

Options surface:
- BcdTagOptions.CacheTtlMs (nullable int; null = fall through to PLC default)
- PlcOptions.DefaultCacheTtlMs
- MbproxyOptions.Cache.{AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs}
- TTL > 60_000 ms requires Cache.AllowLongTtl = true (reload validation).

Admin counters (Tier 1.8 + Tier 2 cache-memory KPIs from docs/kpi.md):
- CacheHitCount, CacheMissCount, CacheInvalidations on ProxyCounters.
- CacheEntryCount, CacheBytes via a new ICacheStatsProvider snapshot path.
- /status.json and the HTML page surface a new Cache cell per PLC row.

Hot-reload: any tag-list change to a PLC reseats the per-PLC context with a
fresh cache; the old cache is disposed inside ReplaceContextAsync. Per-tag
flush granularity is intentionally not implemented in v1.

PLCs with no cache-eligible tags (every resolved tag has CacheTtlMs = 0)
get Cache = null on the context and skip the eviction timer entirely, so
the no-cache path is byte-identical to Phase 10.

Tests (32 new unit + 5 new E2E = 37 new; suite now 314 unit + 48 E2E):
- CacheKeyTests, CacheEntryTests (records + boundary semantics).
- CacheInvalidatorTests: full overlap, both partials, adjacent-not-
  overlapping, disjoint, different unit ID + auxiliary FC-filter / zero-qty.
- ResponseCacheTests: round-trip, lazy expiry, range invalidation,
  unit-id filter, LRU bound, LRU access tracking, concurrent get/set,
  dispose, clear, approximate-bytes accounting.
- ResponseCacheMultiplexerTests (stub-backend): hit short-circuits
  coalescing, BCD-decoded bytes are cached not raw, FC06 invalidates
  overlapping, non-overlapping write does not invalidate, multi-tag
  TTL=min rule, regression-cache-disabled-by-default-is-Phase-10, hit
  works even when backend unreachable.
- ResponseCacheE2ETests (pymodbus DL205 sim, sequential reads):
  * Headline: 10 reads with TTL=1000 ms -> 9 hits, 1 miss, 1 backend trip.
  * TTL expiry path with sleep > TTL.
  * Write invalidation through the proxy on a scratch register.
  * BCD-decoded bytes are cached, not raw BCD nibbles.
  * Regression: Cache disabled by default -> behaviour byte-identical to
    Phase 10.

Pre-existing flake hardened: BackendDisconnect_CascadesToAllUpstreams now
polls briefly for the cascade counter to absorb the inherent scheduling
gap between "upstream EOF observed" and "counter incremented inside
TearDownBackendAsync." Counter semantics unchanged.

Phase doc updated with implementation clarifications discovered during
this work (CacheKey kept separate from CoalescingKey, LastUsedTick is
long, FC06/FC16 startAddr/qty parsing extension, cache-pre-connect
short-circuit, write-invalidation only on successful responses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:08:51 -04:00
Joseph Doherty 892b10baf4 mbproxy/docs: pivot design contract for Phase 11 response cache
Lands the design-contract pivot ahead of any cache implementation code so
reviewers can evaluate the change to the "purely transparent proxy" stance
independently of the Phase-11 code that depends on it.

- docs/design.md: rewrite "What this is" / Read-coalescing / Failure-modes
  sections to acknowledge the opt-in cache; add new "Response cache (Phase
  11)" section covering lookup order (cache -> coalesce -> backend), multi-
  tag range TTL = min, post-rewriter storage, address-range-overlap write
  invalidation, hot-reload PLC-wide flush, no-persistence, AllowLongTtl gate,
  and LRU-bounded capacity. Extend log event table with mbproxy.cache.*
  events. Extend per-PLC status field table with cacheHitCount /
  cacheMissCount / cacheInvalidations / cacheEntryCount / cacheBytes.
  Extend hot-reload propagation table with CacheTtlMs / Cache.* rows.
- docs/kpi.md: graduate Tier 1.8 (response cache) from "requires Phase 11"
  to "shipped in Phase 11" and add Tier 2.4a cache-memory section.
- CLAUDE.md (mbproxy): update Purpose paragraph and the Architecture
  headline bullets to reflect the transparent-by-default + opt-in-cache
  contract; flip "Implementation complete through Phase 10" to "through
  Phase 11".
- install/mbproxy.config.template.json: add a fully-commented Mbproxy.Cache
  block and a CacheTtlMs example on a BcdTags.Global entry, with prominent
  staleness commentary documenting the design contract.

No code changes in this commit - implementation lands in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:35:35 -04:00
Joseph Doherty a2dba4bd07 mbproxy: add in-flight read coalescing (Phase 10)
When two or more upstream clients send the same FC03/FC04 read while a
matching request is already in flight on the same PLC's multiplexed
backend socket, attach the late arrivals to the existing InFlightRequest
.InterestedParties list instead of opening a second backend round-trip.
The single backend response fans out to every attached party with each
party's original MBAP TxId restored individually. Zero post-response
staleness — coalescing operates entirely within the in-flight window
(microseconds to ~10 ms typical); the proxy is NOT a cache layer.

Headline mechanism:

- New record struct CoalescingKey(UnitId, Fc, StartAddress, Qty) keys
  the per-PLC InFlightByKeyMap. FC03 and FC04 are separate Modbus
  tables and never share a key; different unit IDs never coalesce;
  writes (FC06/FC16) bypass the coalescing path entirely.
- InFlightByKeyMap uses a simple lock around a Dictionary; atomic
  TryAttachOrCreate either appends a new party to the in-flight
  request's mutable List<InterestedParty> or invokes a factory to
  build a fresh entry. Per-entry MaxParties cap (default 32) bounds
  fan-out cost; past the cap, the next arrival opens a new entry.
- PlcMultiplexer.OnUpstreamFrameAsync takes the coalescing path for
  FC03/FC04 when Mbproxy.Resilience.ReadCoalescing.Enabled. The
  factory closure does the Phase-9 work (allocate TxId, add to
  CorrelationMap); the channel send happens AFTER returning from
  TryAttachOrCreate so the map lock is not held across the async send.
- Response fan-out in RunBackendReaderAsync removes the entry from
  InFlightByKeyMap before iterating InterestedParties, ensuring no
  concurrent attach can mutate the list during iteration.
- Cascade + watchdog paths also drain the key map so a stale entry
  cannot outlive its backend round-trip.

Counter accounting balance (per snapshot): CoalescedHitCount +
CoalescedMissCount equals total FC03 + FC04 requests since startup.
Even with coalescing disabled, every read still bumps Miss so dashboard
math stays balanced.

New surface (additive only):
- src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs
- src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs
- src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs
- ReadCoalescingOptions on ResilienceOptions
- CoalescedHitCount / CoalescedMissCount /
  CoalescedResponseToDeadUpstream counters surfaced on /status.json
  per PLC and as a compact "Coal" cell on the HTML status page.

Phase 9 test patch: TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire
previously read the same register from both clients (which now
coalesces). Patched to read two different addresses so the test still
proves distinct backend TxIds without violating the coalescing
contract.

Tests added: 24 new (19 unit + 5 E2E):
- CoalescingKeyTests (5)
- InFlightByKeyMapTests (6, includes concurrent stress)
- ReadCoalescingTests (8, stub-backend with deterministic delay)
- ReadCoalescingE2ETests (5, pymodbus simulator; coalescing-active
  during overlap is proven against the stub, not the sim, due to
  pymodbus 3.13's known concurrent-frame bug)

Total: 325 tests passing (282 unit + 43 E2E).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:26:06 -04:00
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00
Joseph Doherty 2e937228a0 mxaccesscli: add read-batch / write-batch / subscribe-batch (JSONL input)
Three new subcommands that take a JSONL file (or '-' for stdin) and reuse a
single MxSession across all entries. The big win is in write-batch: a
two-phase pipeline (Advise all -> drain DataChange to resolve types; Write
all -> drain WriteComplete) reduces wall time from N x (resolve + write_ack)
to ~max(resolve) + ~max(write_ack). Measured 38.2s -> 10.3s (~3.7x) for
four writes against the ZB dev galaxy; the saving grows with N.

Per-item continue-on-error: parse errors are collected line-by-line and
abort with exit 2 before any LMX session opens; runtime failures (resolve
timeout, bad references, coerce errors, write timeouts) get their own
results[] row with a typed `error` string and exit 1. Auth flags mirror
`mxa write` and are resolved once before Phase A.

Shared infra:
  - Mx/JsonlInputReader.cs: lazy line reader (skips blank / '#' lines),
    bare-string or {"tag":"..."} for read/sub, {"tag","value","type"?} for
    write, with array-suffix consistency check at parse time.
  - Mx/ValueCoercion.cs: new CoerceJToken(...) wrapper preserves the
    single source of truth for type vocabulary.

Docs:
  - README run examples extended for each new command.
  - docs/usage.md: new "Batch input format" subsection (shared contract),
    one section per command with envelope examples and a full
    failure-mode table for write-batch, plus a "Batch commands -
    verified live" section capturing the 2026-05-10 ZB-galaxy run and
    pipelining-timing numbers.
  - test-fixtures/ holds the exact JSONL files used in the verified-live
    run so the doc numbers are reproducible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 08:34:46 -04:00
Joseph Doherty 6cde4d9fe4 graccesscli: scripts-get now uses direct read; bypass anon-type reflection
Two follow-on fixes after the MxValueDetails reader correction:

1. AttributeValueDetails was falling through to the package-binary
   fallback even when MxValueDetails returned Supported=true. Cause:
   `ReadString(direct, "Supported")` reflects on an anonymous type via
   InvokeMember, which returned an empty string under
   `BindingFlags.IgnoreCase`, making the StringEquals check fail.
   Replace with a typed `Type.GetProperty("Supported").GetValue(direct)`
   bool check via the new IsDirectSupported helper.

2. ObjectScript (the `scripts get` body retrieval) skipped the live
   attribute path entirely and went straight to the package-binary
   parser. The parser truncates strings at certain delimiters (e.g.
   marker text containing `\n` or `=` came back as just the prefix).
   Now ObjectScript tries `obj.ConfigurableAttributes[<script>.<field>]
   .value.GetString()` first and only falls back to the package body
   when the direct read returns null. Matches the pattern in
   AttributeValueDetails. The package fallback survives for cases
   where direct GRAccess doesn't surface the attribute.

End-to-end live round-trip on $DelmiaReceiver.ProcessRecipe.ExecuteText
with marker `// scripts-get round-trip marker\nMe.RecipeDownloadFlag = false;`:

  scripts get      -> Source: direct-graccess
                      Body: <marker>
  attribute value get -> Value: <marker>

Both readers return the same content the writer wrote. Field then
restored to the original `Me.RecipeDownloadFlag = false;`.

Tests still 66/66.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 01:34:04 -04:00
Joseph Doherty 842b94fb39 graccesscli: fix attribute value reader (typed IMxValue + IgnoreCase)
The MxValueDetails reader was the root of the false "writeback gap"
diagnosis. Three problems compounded:

1. ReadProperty(attr, "Value") used case-sensitive late-binding via
   InvokeMember. The COM-side property is exposed as `value` (lowercase)
   on IAttribute, so the lookup silently returned null.

2. Even if the MxValue was retrieved, the accessor probe loop
   (Invoke(value, "GetBoolean") -> "GetInteger" -> ... -> "GetString")
   relied on IDispatch late-binding. The MxValue interface's accessors
   are vtable-only — they aren't reachable through IDispatch — so every
   call threw, and the reader reported "no supported scalar accessor
   succeeded" even when the typed accessor would have returned the
   value cleanly.

3. The probe loop ordering put GetString last, so for any value that
   somehow survived the type errors silently, an earlier accessor could
   have returned a wrong-typed result.

Fix:

- Add `BindingFlags.IgnoreCase` to ReadProperty + Invoke. COM IDispatch
  is case-insensitive at the IDL level; .NET InvokeMember is not unless
  explicitly told to be. Comment explains the trap.
- AttributeValueDetails now uses typed `attr.value` (and falls back to
  the same lookup via obj.ConfigurableAttributes when the runtime proxy
  returns null), bypassing the late-binding gap entirely.
- MxValueDetails casts to IMxValue and dispatches on
  GetDataType().ToString(), calling the typed scalar accessor for the
  declared type (GetBoolean / GetInteger / GetFloat / GetDouble /
  GetString / GetElapsedTime). MxBigString and MxInternationalizedString
  share the GetString path. MxTime is left as a follow-up; its
  VBFILETIME field layout differs across interop builds.
- Replaced the misleading "Attribute value is not exposed" / "no
  scalar accessor succeeded" messages with specific diagnostics that
  identify the actual failure (wrong type cast, typed accessor exception,
  unhandled MxDataType).

Live verification on $DelmiaReceiver.ProcessRecipe.DeclarationsText:

  BEFORE fix: `attribute value get` → Supported: false, Value: null,
              "Attribute value is not exposed by this GRAccess attribute."
  AFTER fix:  `attribute value get` → Supported: true, DataType: MxString,
              Value: "// roundtrip-test marker (graccesscli scripts set
              --field DeclarationsText)\n"

The reader now correctly surfaces the value graccesscli's writes have
been persisting all along. This is the gap that produced the "writeback
gap" mirage in the previous round of investigation.

Tests still 66/66.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:50:16 -04:00
Joseph Doherty c52d8d0171 graccesscli: correct script-edit docs to TN-537 truth (writes DO persist)
I was wrong. AVEVA Tech Note 537 ("Creating an Application Object Script
Using GRAccess", April 2008) documents the supported pattern:
ConfigurableAttributes[<script>.<field>].SetValue(MxValue) inside a
CheckOut/Save/CheckIn cycle. graccesscli's existing
FindAttributeForMutation already follows this — writes to MxCategoryPackageOnly_Lockable
script-text fields persist correctly.

The earlier "writeback gap" diagnosis was a phantom caused by a reader-side
issue. `object attribute value get` against a script body returns
"Supported: False / Attribute value is not exposed" because
MxValueDetails uses a case-sensitive `ReadProperty(attr, "Value")` lookup
plus an accessor probe (GetBoolean -> GetInteger -> GetFloat -> GetDouble
-> GetString) that can fall through silently for some MxValue shapes. The
COM-side property is exposed as `value` (lowercase), readable as
`attr.value.GetString()` -- which the live probe at
`analysis/ide-edit-investigation/probe_setvalue/` does and confirms the
post-write content matches the marker exactly.

Live verification on $TestMachine.UpdateTestChangingInt.DeclarationsText
and $DelmiaReceiver.ProcessRecipe.{ExecuteText,DeclarationsText}:

  === verdict ===
    marker landed on same-proxy   ConfigurableAttributes: True
    marker landed on same-proxy   Attributes            : True
    marker landed on fresh-proxy  ConfigurableAttributes: True
    marker landed on fresh-proxy  Attributes            : True

The probe also confirmed that two earlier graccesscli `object scripts set`
invocations (which I had wrongly believed failed) had persisted -- the
marker text I wrote previously was still on disk in
ProcessRecipe.{ExecuteText,DeclarationsText} when read directly via
attr.value.GetString(). The probe restored both fields to their original
values.

This commit:

- Updates the misleading [Command(...)] / [CommandOption(...)]
  descriptions in GRAccessSurfaceCommands.cs back to honest versions
  citing TN-537.
- Restores the --file-using examples for `object scripts set` and
  `object scripts create` across script-editing.md, llm-integration.md,
  usage.md, and zb-testmachine.md.
- Removes the test that asserted the (wrong) EnsureMutableViaSetValue
  guard. Re-aims ScriptCommandDescriptions_… at the corrected wording.
- Removes two leftover EnsureMutableViaSetValue calls in the trigger-period
  / trigger-type write paths (both targeted MxCategoryWriteable_C_Lockable
  attributes; would never have fired even if the helper still existed).
- Adds analysis/ide-edit-investigation/REPORT.md (replacing the earlier
  wrong report) plus the probe sources under probe_setvalue/.

The MxValueDetails reader gap (case-sensitive ReadProperty + accessor
probe) is a real follow-up: `object attribute value get` should
case-insensitively read `value` and try GetString first when the
underlying MxValue.DataType is MxString. Out of scope here -- that's a
separate, smaller fix.

Test count delta: 67 -> 66 (-2 wrong tests, +1 corrected description test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:33:51 -04:00
Joseph Doherty e4e54254d8 Revert "graccesscli/docs: document the package-only writeback boundary"
This reverts commit c12fbc5988.
2026-05-05 21:19:00 -04:00
Joseph Doherty 4e242ca824 Revert "graccesscli: fail fast on package-only attribute writes (writeback gap fix)"
This reverts commit bd95ace1c5.
2026-05-05 21:19:00 -04:00
Joseph Doherty 87c0124174 Clarify script mutation limits 2026-05-05 16:40:55 -04:00
Joseph Doherty c12fbc5988 graccesscli/docs: document the package-only writeback boundary
Adds a new "Editing Script Content" section to script-parsing.md that
explains the IAttribute.SetValue silent-no-op behavior on
MxCategoryPackageOnly* attributes, enumerates which ScriptExtension
fields fall on each side of the boundary (text fields = no, trigger
settings = yes), names the EnsureMutableViaSetValue safety check that
landed in bd95ace, and includes a one-liner for inspecting any
object's package-only attributes via `object attributes --configurable`.

Closes the documentation gap surfaced when validating the round-trip
script edit on \$DelmiaReceiver.ProcessRecipe — the previous docs
described the read path comprehensively but said nothing about which
attributes can actually be written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:52:57 -04:00
Joseph Doherty bd95ace1c5 graccesscli: fail fast on package-only attribute writes (writeback gap fix)
Investigation findings: IAttribute.SetValue silently no-ops for any
attribute whose AttributeCategory starts with `MxCategoryPackageOnly`.
The COM call returns Successful=true and the dispatcher reports OK,
but a verify-read via direct IAttribute.GetValue confirms the value
is unchanged.

Discovered while validating the new --field flag round-trip on
\$DelmiaReceiver.ProcessRecipe.ExecuteText:

- BEFORE write: ExecuteText body = `Me.RecipeDownloadFlag = false;`
- WRITE: scripts set --field ExecuteText --file marker.txt → OK
- AFTER write: ExecuteText body = `Me.RecipeDownloadFlag = false;`
  (UNCHANGED — the marker prefix from marker.txt is absent)
- ConfigurableAttributes shows `ProcessRecipe.ExecuteText` category =
  `MxCategoryPackageOnly_Lockable`. Same for DeclarationsText,
  StartupText, ShutdownText, OnScanText, OffScanText, Expression.
- Per docs/script-parsing.md:8, "Object-level scripts ... may appear
  as attributes, extension attributes, or only inside exported object
  packages" — script-text fields are in the third bucket.

Trigger settings (TriggerType, TriggerPeriod) are
`MxCategoryWriteable_C_Lockable` and remain effective via SetValue —
verified by `scripts settings set --trigger-period-ms 1000` returning
OK on a clean checkout state.

Add `EnsureMutableViaSetValue(IAttribute, name)` helper that throws
InvalidOperationException with a clear remediation hint when called on
a package-only attribute. Wired into three write sites:

- `case "value-set"` (object attribute value set)
- `ObjectScriptSet` (scripts set body writer)
- `SetScriptSettings` Expression branch (the one trigger-related field
  that's package-only; TriggerType/TriggerPeriod stay unguarded)

Validation against live ZB galaxy:

- Before fix: `scripts set --field ExecuteText` returned
  `success: true` but didn't mutate the body.
- After fix: same call returns `success: false` with
  `"Attribute 'ProcessRecipe.ExecuteText' has category
  'MxCategoryPackageOnly_Lockable'. Package-only attributes silently
  no-op via IAttribute.SetValue and can only be edited through the
  package import/export round-trip..."` (exit code 1).
- `scripts settings set --trigger-period-ms 1000` against the same
  script: still OK (TriggerPeriod is Writeable_C_Lockable).

The real fix for editing script bodies is the package-rewrite path —
export `.aaPKG`, modify the binary script-extension records, re-import
via `galaxy.ImportObjects`. graccesscli already has the read half
(`PackageSnapshotParser.Parse`); the write half is a separate followup.

61 → 63 → 63 (no test count delta this commit; the new helper is exercised
end-to-end by the live validation above; unit tests for the failure path
require a mock IAttribute proxy that's out of scope here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:45:07 -04:00
Joseph Doherty 3f6bfebd6d graccesscli: extend script editor surface — --field, --lock-trigger-type, scripts delete
Three additions to the script editor commands. Each one closes a real
gap surfaced by the round-trip-test against \$DelmiaReceiver.ProcessRecipe.

1. `object scripts set --field <name>` — explicit text-field selection.

   Previously `scripts set` always wrote to <Name>.ExecuteText (via
   ScriptAttributeName's default). To rewrite DeclarationsText / StartupText
   / ShutdownText / OnScanText / OffScanText / Expression, callers had to
   pass the full attribute name as `--script Foo.StartupText`, which is
   brittle. The new `--field` flag accepts any of the seven canonical
   ScriptTextSuffixes and composes <script>.<field> directly. Validates
   against the suffix list so an unrecognised --field surfaces a friendly
   error rather than a downstream FindAttributeForMutation failure.
   Default behavior (no --field) is unchanged: ExecuteText.

2. `object scripts settings set --lock-trigger-type` — parallel to the
   existing --lock-trigger-period. After writing TriggerType the new flag
   calls SetLocked(MxLockedInMe), matching the lock pattern on the period
   field. Without it, --trigger-type writes the value but leaves the
   attribute unlocked.

3. `object scripts delete` — script-named alias for the existing
   extension-delete subcommand. Wraps obj.DeleteExtensionPrimitive(
   "ScriptExtension", scriptName) inside AtomicObjectEdit (checkout / save
   / checkin). Removes the burden of remembering the generic
   `--extension-type ScriptExtension --primitive <Name>` form.

Test count delta: 61 -> 63 (+2 command-shape assertions for the new
ObjectScriptsSetCommand and ObjectScriptsDeleteCommand).

Live round-trip-test against \$DelmiaReceiver.ProcessRecipe:
- `--field DeclarationsText` write composed `ProcessRecipe.DeclarationsText`,
  CheckOut/Save/CheckIn all returned OK.
- `--field ExecuteText` round-trip same.
- A subsequent re-read shows the original body, suggesting that
  IAttribute.SetValue silently no-ops for ScriptExtension text fields on
  this GRAccess version (or the package-export reader pulls from a
  different snapshot than the just-saved revision). This is upstream of
  the editor surface — the new flags route correctly to the same SetValue
  path that scripts set already used. Diagnosing the SetValue
  ineffectiveness for script-text fields is a separate followup that
  should look at IScriptExtension-specific COM interfaces (per
  docs/script-parsing.md:8 "Object-level scripts are less direct").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:09:34 -04:00
Joseph Doherty 65537570b8 graccesscli: enumerate ScriptExtension via canonical text-attribute suffixes
`object scripts list` previously walked obj.Attributes through a
substring heuristic (`IsScriptAttribute`) that matched any name
containing "script", "expression", "execute", "startup", "shutdown",
or "scan". Two failure modes:

- "deSCRIPTion" contains "script", so every `MoveIn*.Description` /
  `MoveOut*.Description` attribute on `$MESReceiver` (and any other
  object with `.Description` UDAs) was mis-emitted as a script.
- `ScanState` / `ScanStateCmd` (ordinary attribute UDAs) leaked in via
  the "scan" keyword.

Net result on $MESReceiver: 25 entries, all false positives, zero
true ScriptExtensions.

The GRAccess COM API does not expose an `IScriptExtension` enumerable
(verified against ArchestrA.GRAccess.dll string table — only
`AddExtensionPrimitive` / `DeleteExtensionPrimitive` /
`RenameExtensionPrimitive` mutators exist). The only attribute-side
signal that an object carries a real ScriptExtension is the canonical
text-attribute suffix list documented in docs/script-parsing.md:
ExecuteText, DeclarationsText, StartupText, ShutdownText, OnScanText,
OffScanText, Expression.

Replace `IsScriptAttribute` with `ScriptExtensionPrefix(name)` that
detects exactly those suffixes, then group attributes by prefix:
- Each ScriptExtension yields one logical entry named by its prefix.
- The Fields[] list reports which text attributes are present.
- Sibling `<prefix>.TriggerType` / `<prefix>.TriggerPeriod` attributes
  are surfaced as TriggerType / TriggerPeriod when present.
- ExtensionType is always "ScriptExtension".

Outer `ObjectScripts(IGalaxy, IgObject, PackageSnapshot)` updated to:
- Thread ExtensionType / Fields / TriggerType / TriggerPeriod through.
- Look up body content by prefix first, then fall back to
  `<prefix>.ExecuteText` (the package's serialized bodies are keyed
  by full attribute name).
- Mark every `<prefix>.<field>` as emitted so the package-fallback
  branch doesn't re-emit each text field as its own entry.

Drop dead `IsScriptAttribute` (no other callers).

Validation against live ZB galaxy:
- $MESReceiver: 25 → 0 (correct: no ScriptExtensions on this template)
- $TestMachine: 1 entry — UpdateTestChangingInt with body
  "Me.TestChangingInt = System.Random().Next(1,1000);" + all 7 fields
  + trigger metadata.
- $DelmiaReceiver: 2 entries — ProcessRecipe + Reset, both with bodies
  ("Me.RecipeDownloadFlag = false;") + trigger metadata.

61/61 existing tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:54:41 -04:00
Joseph Doherty ac3368993e mxaccesscli: capture user-attribution investigation as a pick-up note
Adds a dated incident-style file at the tool root recording the full
arc of the User_Name attribution work over the last two days:

- Six runs tabulated (galaxy mode x advise variant x write variant x
  resulting User_Name) so the next agent can see what's already been
  ruled out.
- Current state of the CLI (auth, advise routing, WriteSecured) and
  the galaxy (eOSUserBased, ArchestraUsers role, engines deployed
  before the security change).
- Leading hypothesis: running aaEngine processes still operate under
  the original eNone security context because galaxy security is
  compiled at deploy time. Until the platform/engines are
  redeployed, auth_user_id stays at 1 and User_Name stays NULL.
- Concrete pick-up commands: undeploy/redeploy DevPlatform ->
  DevAppEngine -> TestArea -> TestMachine_001 via graccesscli, then
  re-run the trigger / ack-as-dohertj2 / clear sequence and query
  Events.
- Fallbacks if a clean redeploy doesn't change the answer
  (aaBootstrap restart, two-person Verified Write, attribute
  security classification check, cross-tool comparison vs Object
  Viewer / InTouch).

Linked from README.md's resource index so an agent landing on the
tool finds the open thread without spelunking commit history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:48:09 -04:00
Joseph Doherty c8f31bd653 mxaccesscli: add --secured / --verifier-* options for WriteSecured
WriteCommand grows three new options:

  --secured                Route the write through
                           LMXProxyServer.WriteSecured(currentUserId,
                           verifierUserId, value) instead of plain
                           Write(value, userId). Required for attributes
                           classified as Secured Write or Verified Write,
                           and useful for testing whether the audit
                           subsystem propagates user identity when
                           explicitly told the write is "secured".

  --verifier-username      Galaxy / OS username of the verifier for a
                           two-person Verified Write. Implies --secured.
  --verifier-domain        Domain composed with --verifier-username as
                           '<domain>\<username>'.
  --verifier-password      Verifier password. Redacted in the JSON
                           query echo.

When --secured is on without a verifier, the same auth_user_id is
used for both currentUserId and verifierUserId (single-user Secured
Write semantics). When a verifier is provided, the CLI authenticates
both users and bails cleanly with "verifier-authentication-failed"
on a verifier credential mismatch.

The JSON envelope's results[] gains `secured` and `verifier_user_id`
fields so an agent can confirm which path ran.

MxItem grows WriteSecured(value, currentUserId, verifierUserId).

Verified live against TestMachine_001.TestAlarm002.AckMsg under
eOSUserBased + ArchestraUsers role: --secured succeeds with
auth_user_id=1, verifier_user_id=1, MxCategoryOk. User_Name in the
Historian Events row remains NULL — same as plain Write. The
audit-attribution gate is not Write vs WriteSecured; running engines
likely still need a redeploy to pick up the new security mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 23:48:17 -04:00
Joseph Doherty 68eb9adae7 mxaccesscli: catch ArgumentException from AuthenticateUser as auth-failed
Under eOSUserBased galaxy security mode the LMXProxyServer raises
ArgumentException("Value does not fall within the expected range")
for bad credentials instead of silently returning 0 like permissive
(eNone) galaxies do. Both shapes mean "auth failed"; MxSession.Authenticate
now normalizes them into a 0 return so WriteCommand reports a clean
"authentication-failed" envelope and exits 1 — instead of crashing
with a stack trace.

Verified live against the ZB galaxy in eOSUserBased mode:
  bad password  -> ok=false, error="authentication-failed", exit 1
  good password -> ok=true,  auth_user_id=1, exit 0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 23:43:55 -04:00
Joseph Doherty a842f67b54 mxaccesscli: document session-scoped userId semantics
Replaces the "Reusing a userId" subsection with a clearer
"userId is session-scoped" note that captures the actual
GRAccess/MxAccess contract:

- AuthenticateUser returns an int that is bound to the LMX session
  hServer it was issued against, exactly like tag handles from
  AddItem. When the session unregisters (i.e. the MxSession is
  disposed, which happens when the process exits) the userId is
  no longer valid.
- Each mxa write invocation creates a fresh MxSession, so an
  authenticated write must include --username/--password (or a
  userId resolved within the current process).
- --user-id <N> is a hand-off for in-process callers, not a
  persistable credential.

This matches the user's observation that userIds behave like
tag/item handles in scope, and explains why batch flows benefit
from a future session-mode mxa daemon (modeled on graccesscli's
session) that holds one MxSession across many commands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:17:34 -04:00
Joseph Doherty 9de8660688 mxaccesscli: route write through Advise vs AdviseSupervisory by user
WriteCommand now picks the LMXProxyServer advise variant based on
whether credentials were supplied:

  --username given  -> Advise            (operator action; the write
                                          is attributed to the
                                          authenticated Galaxy user
                                          in the alarm/event audit
                                          trail)
  no --username     -> AdviseSupervisory (supervisory action; the
                                          write is attributed to the
                                          hosting client itself, no
                                          Galaxy user claimed)

MxItem grows AdviseSupervisory() alongside Advise() and shares the
same UnAdvise / RemoveItem teardown.

Verified live with the trigger / ack-as-dohertj2 / clear sequence on
TestMachine_001.TestAlarm002. The Set (anonymous, supervisory) and
Clear (anonymous, supervisory) rows pair with the Acknowledged row
(authenticated, Advise) under one Alarm_ID. On this development
galaxy every action still maps to User_Name=DefaultUser regardless
of advise variant — that's a galaxy-security configuration trait,
not a CLI bug. The routing is in place and will differentiate
correctly on a strict galaxy with real user records.

docs/usage.md gains an "Advise variant" section explaining the rule
and the expected User_Name population on strict vs permissive
galaxies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:10:25 -04:00
Joseph Doherty 0d25ec445f mxaccesscli: add --username / --domain / --password to write
Wraps LMXProxyServer.AuthenticateUser at the session level
(MxSession.Authenticate) and surfaces three new options on the
write command:

  -u, --username <name>   galaxy / OS user
      --domain   <name>   composed as <domain>\<username>; omit for
                          galaxy-authenticated logins or UPN forms
  -p, --password <pwd>    redacted to "***" in the JSON query echo

The CLI calls AuthenticateUser before AddItem so an auth failure
(userId == 0) bails out cleanly without leaving a half-set-up item.
On success the resolved userId flows into Write(hServer, hItem,
value, userId) and is reflected in:

  - human output: "[OK ] write <tag> = <val> (as <verify-user>, userId=N)"
  - LLM-JSON results[]: { "authenticated": true, "auth_user_id": N }

Verified live against TestChildObject.TestInt with credentials
DESKTOP-6JL3KKO\dohertj / Sonamu89:

  read   -> 99
  write 7 with --username/--domain/--password -> ok, auth_user_id=1
  read   -> 7
  write 99 with same auth -> ok
  read   -> 99 (restored)

Important behavior surfaced and documented in docs/usage.md
"Authentication" section: on a galaxy configured in permissive
(Free Access) mode, AuthenticateUser returns a non-zero userId
regardless of credentials — verified by intentionally passing a
wrong password and an unknown user, both of which still resolved
to userId=1 and the write went through. The CLI's auth path is
wired correctly; the galaxy just isn't strict. To exercise real
authentication, target a galaxy with galaxyAuthenticationMode and
attribute-level security above Free Access.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:00:06 -04:00
Joseph Doherty 7da30cf515 mxaccesscli: support bulk array writes via <arrayAttr>[]
WriteCommand now accepts multiple positional values when the tag
reference ends with '[]', bundling them into a strongly-typed array
(string[], int[], bool[], etc.) before passing to MxAccess.Write.
The CLR marshals the array to a COM SAFEARRAY of the matching
VARTYPE, which is the shape MxAccess expects for an array attribute.

Verified live on a 50-slot String[] (MESReceiver_001.MoveInPartNumbers):

  write 50 distinct strings A1..A50  -> ok, MxCategoryOk
  read [] -> ['A1','A2', ..., 'A50']

Plus a guardrail: passing multiple values without the '[]' suffix
exits 2 with a clear error so a typo can't accidentally write only
the first element of an indexed reference.

Critical finding documented in docs/usage.md: **a bulk write resizes
the array to the count provided.** Writing 25 values into a 50-slot
array leaves the array at 25 elements; the trailing 25 are
deallocated, not zero-filled. Verified by 50 -> 25 -> 50 round-trip
on the same attribute. Discover the runtime length via
'mxa read <attr>[]' or the configured length via grdb's
attributes.sql array_dimension column.

Type matrix in docs/usage.md updated:
- Bulk array via '[]'        - read  + write 
- Bare reference (no brackets) - read  + write 
- Element via '[N]'           - unchanged

ValueCoercion.cs: adds CoerceArray(IReadOnlyList<string>, typeHint)
that produces strongly-typed arrays. Default element type is inferred
from the first value when --type is unspecified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 21:09:01 -04:00
Joseph Doherty e9e37385aa mxaccesscli: document <arrayAttr>[] as the bulk-array read syntax
Empty square brackets — `<obj>.<arrayAttr>[]` — are the supported
MxAccess reference form for fetching a whole array as a single value.
Verified live:

  mxa read 'MESReceiver_001.MoveInPartNumbers[]' --llm-json
  -> ok=true, value=["", "11111", "", ..., "15", ...] (50 elements,
     quality=192, MxCategoryOk)

The bare reference without brackets continues to return
MxCategoryCommunicationError, Detail=1003 — that's a real proxy-side
rejection, not a CLI bug. Documented separately so the matrix shows
both the working syntax and the failing one.

docs/usage.md:
- Type matrix gains a "bulk array read via []" row () and a
  "bare reference" row (, kept for reference).
- "Reading arrays" section rewritten around the two reference
  shapes — `[]` for whole arrays, `[N]` for elements (1-based,
  N <= NumElements).
- Adds a "Discovering array length" note pointing at grdb's
  attributes.sql array_dimension column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 21:02:28 -04:00
Joseph Doherty cc1ce05f96 mxaccesscli: document type/array support matrix from live verification
Adds a 'Type support matrix' table to docs/usage.md capturing what was
verified against the live ZB galaxy on System Platform 2017 Express
with MxAccess 3.2.0.0:

Verified read + write (round-tripped):
- MxBoolean   — DelmiaReceiver_001.TestAttribute (false→true→false)
- MxInteger   — MESReceiver_001.MoveInBatchID    (999→12345→0→999)
- MxString    — MESReceiver_001.MoveOutErrorText (round-trip + restore)
- String array element via 1-based [N] indexing
                — MESReceiver_001.MoveInPartNumbers[2] (11111→ARR-WRITE-OK→11111)

Verified read only (no writeable instance in this galaxy):
- MxFloat       — DevPlatform.CPULoad{,Avg,Max,Min}, PageFaultsAvg
- MxTime        — DevPlatform.SystemStartupTime → ISO-8601 string
- MxReferenceType — Container/Host/Area → target object's Tagname

Wired but no live instance found (matrix marked ''):
- MxDouble, MxElapsedTime, MxQualifiedEnum, MxInternationalizedString,
  MxBigString. Adding a writeable UDA of any of these types should
  exercise the existing code path without further changes.

Hard limitation surfaced and documented:
- MxAccess.AddItem('<obj>.<arrayAttr>') for bulk array read returns
  MxCategoryCommunicationError, Detail=1003. Read array elements
  individually via the 1-based '[N]' suffix; '[0]' is invalid.

Adds a 'Reading arrays' section pointing at grdb's attributes.sql
('array_dimension' column) for discovering element count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:33:07 -04:00
Joseph Doherty 7cbe2756d0 mxaccesscli: clean up diag output
Rewrites DiagCommand to match the structure of read/write/subscribe:

- Output goes through CliFx's IConsole instead of System.Console
  (kills the 6 CliFx_SystemConsoleShouldBeAvoided build warnings).
- Single concise summary line at the top: `diag <tag>  (<sec>s, <status>)`.
- Status now correctly distinguishes ok / error / no-events. A bogus
  reference produces a DataChange event with MxCategoryConfigurationError;
  that reports as "error", not the previous (incorrect) "ok".
- Adds --llm-json mode mirroring read/write: { query, ok, results: [...] }
  with register/add_item/events/totals fields.
- Captures both DataChange and OperationComplete events with status
  detail rendered consistently.
- Drops the duplicate "Done. Total events: N" / "diag result: events=N"
  trailing lines.

Also fixes exit-code propagation from soft-failure commands. Environment
.ExitCode set inside ExecuteAsync was being overwritten by CliFx's own
return from RunAsync. Program.Main now folds the two:

    return cliFxExit != 0 ? cliFxExit : Environment.ExitCode

This restores correct exit-on-failure behavior for both `diag` (error
status) and `write` (timeout / non-Ok ack) without resorting to
CommandException, which would print "ERROR\n<stack trace>" even for an
empty message.

Verified:
  diag TestChildObject.TestInt   -> EXIT 0, "ok"
  diag NoSuchObject.NoSuchAttr   -> EXIT 1, "error"
  diag --llm-json                -> structured envelope, ok flag honors statuses

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:19:40 -04:00
Joseph Doherty ab202a1fa1 mxaccesscli: read/write/subscribe System Platform tags via MxAccess
New tool wrapping ArchestrA.MxAccess.LMXProxyServerClass (the same COM
proxy aaObjectViewer / WindowViewer use) as a CliFx CLI for LLM-driven
debugging.

Commands:
- mxa info      — loaded MxAccess assembly identity, supported value
                   types, MxStatusCategory enum.
- mxa read      — fetch one or more tag values; subscribes briefly,
                   captures first OnDataChange per tag, tears down.
- mxa write     — write a value with optional --type coercion; advises
                   first to resolve the attribute type, then waits for
                   OnWriteComplete with a per-call timeout.
- mxa subscribe — stream OnDataChange events for --seconds; JSON Lines
                   under --llm-json for piped agent consumption.
- mxa diag      — minimal smoke test on a private STA thread; bypasses
                   the CliFx pipeline for diagnosing apartment / pump
                   issues.

Implementation notes documented in docs/api-notes.md (reverse-engineered
because AVEVA does not publish a single canonical MxAccess reference):

- Net48 / x86 / [STAThread] are non-negotiable. The CLI runs the entire
  CliFx pipeline on a dedicated STA thread.
- COM events are dispatched as Win32 messages; AutoResetEvent.WaitOne
  alone does not pump them on this configuration. MxSession.WaitForUpdate
  loops Application.DoEvents() + drain + Sleep(20ms) instead.
- Write requires the target attribute's type to be resolved first.
  WriteCommand advises and waits for the initial OnDataChange before
  calling LMXProxyServerClass.Write to avoid ArgumentException
  "Value does not fall within the expected range".
- Errors carry the full MXSTATUS_PROXY[] from MxAccess (Success,
  Category, DetectedBy, Detail) so an agent can tell exactly which
  layer rejected a request.

Verified against the live ZB galaxy with a writeable tag identified
via grdb (TestChildObject.TestInt, mx_attribute_category=10):
  read:      99 (q=192, MxCategoryOk)
  write 7:   round-tripped — read returned 7 — written back to 99
  write str: TestChildObject.TestString round-tripped a timestamp
  subscribe: captured initial value plus subsequent change from a
             separate process

The vendored ArchestrA.MxAccess.dll is gitignored — it is copied from
C:\Program Files (x86)\ArchestrA\Framework\Bin\ on any System Platform
install per the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:02:51 -04:00
Joseph Doherty 5a78ec5a76 histdb: add connectioninfo.md with verified Runtime DB access
Captures the working SQL Server connection state on this machine:
server (localhost / MSSQLSERVER), database (Runtime), Windows-auth
account (DESKTOP-6JL3KKO\dohertj2, sysadmin), the sqlcmd recipe, four
paste-and-run sanity probes including an end-to-end INSQL retrieval,
linked-server roles (INSQL, INSQLD), and alternate client paths
(SSMS / PowerShell / ODBC / pyodbc).

Linked from histdb/README.md Layout and Resource index per
DOCS-GUIDE.md. Mirrors the pattern grdb/connectioninfo.md uses for
its own DB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:28:25 -04:00
Joseph Doherty 32f26272ae Initial commit: Wonderware / System Platform tools and reference
Five tools under one repo, all docs organized per DOCS-GUIDE.md:

- aalogcli: .NET 4.8 / x86 CliFx CLI for reading System Platform binary
  logs (*.aaLGX) for LLM debugging, built on aaOpenSource/aaLog. Commands:
  last, tail, range, unread, fields. Stable JSON envelope under --llm-json.
  Build template under lib/build/ for rebuilding aaLogReader.dll.

- aot: ArchestrA Object Toolkit 2014 v4.0 reference material. Dev guide
  (Markdown converted from CHM), API reference for the ArchestrA.Toolkit
  namespace, and the Monitor / Watchdog VS sample solutions.

- graccesscli: .NET 4.8 / x86 CliFx CLI that automates Galaxy
  configuration via the ArchestrA GRAccess COM interop. Includes session
  daemon, IPC protocol, and llm-json envelope contract.

- grdb: SQL/DDL exploration of the Galaxy Repository database. DDL
  captures, reusable queries, hierarchy / contained-name <-> tag-name
  translation notes.

- histdb: LLM-oriented reference for AVEVA Historian retrieval. INSQL
  linked-server, extension tables, every wwXxx time-domain extension,
  every retrieval mode, alarm/event SQL recipes, REST API. Distilled
  from the 243-page Historian Retrieval Guide.

Root contains:
- CLAUDE.md: thin index pointing into each tool's README.
- DOCS-GUIDE.md: doctrine for organizing docs for LLM consumption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:22:20 -04:00