wwtools

Author	SHA1	Message	Date
Joseph Doherty	b222362ce0	mbproxy: remediate the 2026-05-16 code-review findings Fixes every finding from the codereviews/2026-05-16 multi-agent review (2 Critical, 20 Major, 38 Minor) and adds that review to the repo. Highlights: dashboard XSS escape; response cache invalidated on the write request (not just the response); ReloadValidator now runs at startup so port collisions / duplicate names / malformed Resilience profiles fail fast; AdminPort 0 genuinely disables the admin endpoint; PlcListener accept-loop faults propagate to the supervisor's faulted path; reconciler Restart builds before removing; Resilience pipelines are restart-only from a frozen snapshot; multiplexer connect-race leak, watchdog party-list snapshot, backend-response and FC16 framing validation; frontend reconnect retry and util.js load guard; plus the log-event/doc drift sweep and test-port hygiene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 18:08:06 -04:00
Joseph Doherty	554b05d28c	mbproxy: fix dashboard review findings, add named BCD tags + fleet config Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 03:39:39 -04:00
Joseph Doherty	59d0b5deb9	mbproxy: Wave 5 — fixes from third re-review pass Closes findings from the third focused re-review pass on the post-W4-followup state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md). W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync The hot-reload OnChange handler at AdminEndpointHost.StartAsync did fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort was hot-reloaded during shutdown, the queued Task could land between StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take the lock, and bind a fresh Kestrel WebApplication on the new port — resurrecting admin AFTER the host considered it shut down. Worse, if DisposeAsync had already run _lock.Dispose, the queued Task throws ObjectDisposedException as an unobserved Task exception. Fix: _disposed guard at the top of the OnChange lambda AND inside the queued Task.Run, plus try/catch (ObjectDisposedException) around _lock.WaitAsync and _lock.Release. W5/m2 — inFlightAtCancel computed AFTER base.StopAsync The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync (so the multiplexers' counter providers were still wired), but it computed the snapshot AFTER base.StopAsync(cancellationToken). Between those two lines, in-flight requests whose responses arrive get removed from _correlation, and the watchdog can clear stale entries. The reported count therefore drifted downward from "in-flight at signal time" to "in-flight at compute time." Fix: snapshot at the very top of StopAsync before any cancellation is propagated. W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort) When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the body runs unprotected. A concurrent EnsureBackendConnectedAsync that DOES hold the gate may TryAllocate a TxId that collides (after wraparound in the allocator's forward scan) with one being released by the channel drain. The double-release would mark the new request's slot as free even though it's legitimately in-flight, allowing the next allocation to reuse the same slot and CorrelationMap.TryAdd to fail (silent request drop). Probability is very low (gate timeout AND new accept landing AND TxId collision in 65,536-slot space); the only consequence is one dropped request the client retries. Documented inline at PlcMultiplexer.cs near the gateHeld declaration as accepted best-effort behaviour. W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per re-review verdict. Tests: 387 pass / 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 07:13:47 -04:00
Joseph Doherty	9251c564c1	mbproxy: resolve remaining items from ReReviewAfterRemediation.md Closes the latent + minor + test-discipline items left after Wave 4. Updates the re-review doc with a final resolution table — every actionable finding now marked Resolved or Accepted with rationale. NM3 — _supervisorCts leaks on re-Start StartAsync now disposes the previous CTS before reassigning. Idempotent: a try/catch (ObjectDisposedException) covers the very-first-Start case where the field-init CTS is still fresh. NM4 — W2.15 TCS is single-shot _firstAttemptCompleted is no longer readonly; StartAsync re-creates it after the W2.16 guard so a re-Started supervisor's WaitForInitialBindAttemptAsync doesn't observe the previous run's signal. Nm6 — _admin GetService<> returns null silently ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered. Preserves the loud-failure intent from the original IHostedService registration without forcing test hosts to wire admin. Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard Added a volatile bool _disposed flag with an early-return at the top of DisposeAsync. Symmetry with PlcMultiplexer; protects against ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton again on host shutdown. T3 — RemoveInheritedAppsettings only fires on Build AfterTargets="Build;Publish" + a second Delete against $(PublishDir) so a `dotnet publish` against the test csproj doesn't ship the example PLCs from the linked install template. T4 — Stale TryAttachOrCreate__ReturnsTrue_ test method names Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool return. Accepted (with rationale documented in ReReviewAfterRemediation.md): Nm2 — CoalescedHit semantic is per-design Nm4 — _lastBindError preservation on clean exit is intentional forensics Nm5 — EventLogBridge has no injectable logger Nm8 — Cosmetic log noise T1 — Reflection on private fields documented as maintenance trap Tests: 387 pass / 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 07:02:21 -04:00
Joseph Doherty	7a435957ee	mbproxy: Wave 4 — fix issues introduced by the Wave-1/2 fixes Closes the new findings from the post-remediation re-review (codereviews/2026-05-14/ReReviewAfterRemediation.md): NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was meant to replace. Supervisor.StopAsync nulls the per-mux counter provider before the drain loop runs, so CountInFlight always returns 0 and the drain budget is never spent on actual draining. Fix: snapshot the in-flight count BEFORE supervisor stop, drop the theatrical post-stop loop, and report InFlightAtCancel as the snapshot count (= the number of in-flight requests dropped by the stop). The supervisor stop IS the drain — there is nothing to drain that wouldn't be killed by the stop itself. NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable Without a token, a long Polly-wrapped EnsureBackendConnectedAsync against an unreachable host could hold the gate for the full BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound the wait with a 2 s teardown deadline; on timeout proceed best-effort without the gate. Worst-case consequence is one orphaned in-flight cycle on the dying backend, surfaced to upstream as exception 0x0B by the watchdog. NM2 — ReplaceContext non-atomic ctx + provider swap Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If `_ctx` was swapped first, a snapshot taken in the gap would still hold the OLD adapter wrapping the OLD cache — which the supervisor disposes immediately after we return. Fix: set the provider FIRST, then swap `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new), never (old-after-disposed). NM5 — Self-cascade ObjectDisposedException after dispose Writer/reader fault catches fired `_ = TearDownBackendAsync(...)` unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as an unobserved Task exception. Fix: skip self-cascade when `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit TearDown anyway. Nm1 — Saturation cleanup uses await SendResponseAsync W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync, which would serialise on a wedged late-attacher's full bounded channel and stall delivery to its peers — contradicting the W1.3 doctrine that the fan-out path must never await per-pipe writes. Fix: use TrySendResponse and increment ResponseDropForFullUpstream on drop. T2 — WatchdogVsResponse_Race seeded Random fragility Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms; Random's algorithm is implementation-defined across .NET major versions (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all samples on one side of the deadline and break the "both branches must fire" assertion. Fix: deterministic counter-based alternation (15 fast + 15 slow across 30 iterations) — guaranteed by construction. Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS single-shot semantics) are unfixed: no caller actually re-Starts a supervisor today; both become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Documented in the re-review. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green (T2 alternation is deterministic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 06:52:33 -04:00
Joseph Doherty	f2c6669444	mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan Eight area-focused reviews (BCD rewriter, multiplexer, response cache, supervisor + hot-reload, admin + diagnostics, hosting + options, test suite) plus an Overview that prioritises findings across areas, and a RemediationPlan that groups the work into three waves with per-item file:line citations and regression-test sketches. Findings call out: hot-reload tag-list/cache changes that don't reach the running multiplexer, a coalescing factory leak that hangs late attachers, backend-reader head-of-line block on a wedged upstream, stranded outbound frames after cascade, and ShutdownCoordinator double-stop ordering. Plus the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB, not standard binary), unreachable BcdValidationError.DuplicateAddress, mbproxy.cache.flushed event that's defined but never emitted, and missing test coverage for Cache.AllowLongTtl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 05:15:34 -04:00
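
Commit f2c6669444 describes the 32-bit BCD wire format as "two base-10000 digits in CDAB". The sketch below shows one possible reading of that phrase, in Python rather than the repo's own language. It rests on assumptions the log does not confirm: that each 16-bit register is packed BCD (four decimal nibbles), and that CDAB means the low-order word arrives first. The function names `unpack_bcd16` and `decode_bcd32_cdab` are hypothetical and do not come from the repo.

```python
def unpack_bcd16(register: int) -> int:
    """Decode one 16-bit register as four packed-BCD nibbles (0..9999).

    Assumption: registers are nibble-packed BCD; the commit log does not say.
    """
    if not 0 <= register <= 0xFFFF:
        raise ValueError("register out of 16-bit range")
    value = 0
    for shift in (12, 8, 4, 0):
        nibble = (register >> shift) & 0xF
        if nibble > 9:
            raise ValueError(f"invalid BCD nibble {nibble:#x}")
        value = value * 10 + nibble
    return value


def decode_bcd32_cdab(first: int, second: int) -> int:
    """Combine two registers as two base-10000 digits.

    Assumption: CDAB means the first register on the wire is the
    low-order base-10000 digit and the second is the high-order one.
    """
    low = unpack_bcd16(first)
    high = unpack_bcd16(second)
    return high * 10_000 + low


if __name__ == "__main__":
    # Registers 0x5678 then 0x1234 decode to 1234 * 10000 + 5678.
    print(decode_bcd32_cdab(0x5678, 0x1234))
```

Under these assumptions the largest representable value is 99,999,999, which is the practical difference from a plain 32-bit binary integer.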

6 Commits
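
The T2 fix in commit 7a435957ee replaces a seeded random delay with counter-based alternation, so both sides of the 400 ms watchdog deadline are exercised by construction. A minimal Python sketch of that idea follows; it is not the repo's test code. The name `response_delays` and the specific 350 ms and 449 ms values are illustrative assumptions, chosen only to sit inside the [350, 450) ms window the commit mentions.

```python
WATCHDOG_MS = 400  # deadline quoted in the commit message


def response_delays(iterations: int = 30,
                    fast_ms: int = 350,
                    slow_ms: int = 449) -> list[int]:
    """Alternate fast and slow delays by iteration index.

    Even iterations finish before the watchdog, odd ones after it, so
    the split does not depend on any random number generator.
    """
    return [fast_ms if i % 2 == 0 else slow_ms for i in range(iterations)]


if __name__ == "__main__":
    delays = response_delays()
    fast = sum(1 for d in delays if d < WATCHDOG_MS)
    slow = sum(1 for d in delays if d >= WATCHDOG_MS)
    print(fast, slow)
```

With 30 iterations this yields 15 delays below the deadline and 15 at or above it, matching the "15 fast + 15 slow" split the commit describes, and it stays stable across runtime upgrades because no generator algorithm is involved.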