mbproxy: resolve remaining items from ReReviewAfterRemediation.md

Closes the latent + minor + test-discipline items left after Wave 4. Updates
the re-review doc with a final resolution table — every actionable finding
now marked Resolved or Accepted with rationale.

NM3 — _supervisorCts leaks on re-Start
  StartAsync now disposes the previous CTS before reassigning. Idempotent:
  a try/catch (ObjectDisposedException) covers the very-first-Start case
  where the field-init CTS is still fresh.

NM4 — W2.15 TCS is single-shot
  _firstAttemptCompleted is no longer readonly; StartAsync re-creates it
  after the W2.16 guard so a re-Started supervisor's
  WaitForInitialBindAttemptAsync doesn't observe the previous run's signal.

Nm6 — _admin GetService<> returns null silently
  ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered.
  Preserves the loud-failure intent from the original IHostedService
  registration without forcing test hosts to wire admin.

Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard
  Added a volatile bool _disposed flag with an early-return at the top of
  DisposeAsync. Symmetry with PlcMultiplexer; protects against
  ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton
  again on host shutdown.

T3 — RemoveInheritedAppsettings only fires on Build
  AfterTargets="Build;Publish" + a second Delete against $(PublishDir)
  so a `dotnet publish` against the test csproj doesn't ship the example
  PLCs from the linked install template.

T4 — Stale TryAttachOrCreate_*_ReturnsTrue_* test method names
  Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool
  return.

Accepted (with rationale documented in ReReviewAfterRemediation.md):
  Nm2 — CoalescedHit semantic is per-design
  Nm4 — _lastBindError preservation on clean exit is intentional forensics
  Nm5 — EventLogBridge has no injectable logger
  Nm8 — Cosmetic log noise
  T1  — Reflection on private fields documented as maintenance trap

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-14 07:02:21 -04:00
parent 7a435957ee
commit 9251c564c1
6 changed files with 134 additions and 87 deletions
@@ -4,99 +4,113 @@ Re-review of the codebase after the six-commit remediation of the original 2026-
**Scope:** every src/ and tests/ change in `53a7111..HEAD` (37 files, ~+2000/700 lines). **Scope:** every src/ and tests/ change in `53a7111..HEAD` (37 files, ~+2000/700 lines).
## Status
> **All actionable findings resolved.** Wave 4 (`7a43595`) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (this commit) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. Remaining items are **Accepted** with rationale (explained below) — they're deliberate trade-offs or doc-only nuances that don't warrant code changes.
>
> **Final test count:** 387 pass / 0 fail. Race tests stable across 3 isolated runs.
## Headline ## Headline
The remediation is structurally sound. **One new critical finding** — the W1.5 drain loop is structurally always-zero — inherited the very bug it was meant to replace. **Five new major findings** cluster in two areas: The remediation was structurally sound. The re-review found:
- **W1.4 cascade gating** introduced two disposal-deadlock and exception-swallowing paths around the new `_connectGate.WaitAsync()` call inside `TearDownBackendAsync`. - **1 critical finding** — the W1.5 drain loop inherited the very `ShutdownCoordinator` bug it was meant to replace. **Resolved** by Wave 4 (`7a43595`) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop.
- **W1.1 + W1.5 lifecycle ordering** introduced a non-atomic context swap (snapshot reads can see a disposed cache during reseat) and silently single-shot supervisor semantics from the W2.15 + W2.16 combination. - **5 major findings** clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. **All resolved** by Wave 4 + Wave 4-followup.
- **8 minor findings** + **4 test-discipline findings**. Most resolved; the rest accepted with documented rationale.
Plus a handful of test-discipline gaps from the new race-hard tests (reflection on private field names, RNG seed fragility across runtime versions). ## Resolution table
## New critical findings (1) | ID | Severity | Finding | Status | Commit |
|----|----------|---------|--------|--------|
| **NC1** | Critical | `ProxyWorker.StopAsync` drain loop structurally always-zero — inherited the original `ShutdownCoordinator` bug | ✅ **Resolved** | `7a43595` |
| **NM1** | Major | `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable — disposal can be blocked indefinitely | ✅ **Resolved** | `7a43595` |
| **NM2** | Major | `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can read a disposed cache | ✅ **Resolved** | `7a43595` |
| **NM3** | Major | `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard | ✅ **Resolved** | (W4-followup) |
| **NM4** | Major | W2.15 TCS never re-armed — supervisor effectively single-shot | ✅ **Resolved** | (W4-followup) |
| **NM5** | Major | Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal | ✅ **Resolved** | `7a43595` |
| **Nm1** | Minor | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe | ✅ **Resolved** | `7a43595` |
| **Nm2** | Minor | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04 | ⚪ **Accepted** | (doc) |
| **Nm3** | Minor | Both supervisor-stop and drain shared `gracefulMs` budget | ✅ **Resolved** | `7a43595` (drain deleted) |
| **Nm4** | Minor | `finally` preserves `_lastBindError` after clean cancellation | ⚪ **Accepted** | (by design) |
| **Nm5** | Minor | `EventLogBridge` no startup log of armed state | ⚪ **Accepted** | (low value) |
| **Nm6** | Minor | `_admin` lazy resolution returns `null` silently if registration absent | ✅ **Resolved** | (W4-followup) |
| **Nm7** | Minor | `AdminEndpointHost.DisposeAsync` no `_disposed` guard | ✅ **Resolved** | (W4-followup) |
| **Nm8** | Minor | `TearDownBackendAsync` cosmetic log-noise on queued cascades | ⚪ **Accepted** | (cosmetic) |
| **T1** | Test | Reflection coupling on private field names | ⚪ **Accepted** | (commented in code) |
| **T2** | Test | `WatchdogVsResponse_Race` seeded `Random` cross-runtime fragility | ✅ **Resolved** | `7a43595` |
| **T3** | Test | `RemoveInheritedAppsettings` only fires on Build, not Publish | ✅ **Resolved** | (W4-followup) |
| **T4** | Test | Stale `TryAttachOrCreate_*_ReturnsTrue_*` test method names after W3 dropped the bool | ✅ **Resolved** | (W4-followup) |
### NC1. `ProxyWorker.StopAsync` drain loop is structurally always-zero — inherited the original `ShutdownCoordinator` bug it replaced **Resolved: 13/18. Accepted: 5/18.**
**File:** `src/Mbproxy/Proxy/ProxyWorker.cs:276-310` (introduced by W1.5)
**Path of execution:** ---
1. `await Task.WhenAll(stopTasks)` — each `PlcListenerSupervisor.StopAsync` cancels its inner CTS, the listener's `RunAsync` exits, the OperationCanceledException catch calls `await listener.DisposeAsync()`, which calls `_multiplexer.DisposeAsync()`, which calls `_ctx.Counters.SetMultiplexProvider(null)` (`PlcMultiplexer.cs:224`).
2. *Then* the drain loop begins. `CountInFlight` reads `supervisor.CurrentCounters.Snapshot().InFlightCount`, and `ProxyCounters.Snapshot()` (`ProxyCounters.cs:403-404`) returns `provider?.InFlightCount ?? 0`. The provider is now `null`. So `total == 0` on the first iteration and the loop exits immediately.
The drain budget (`gracefulMs`) is never spent on actual draining. `inFlightAtCancel` will always log as `0`. The `mbproxy.shutdown.complete` event under load will report `InFlightAtCancel=0` even when in-flight requests were forcibly cancelled. ## Resolved findings — what landed
**This is the same defect the original review (`AdminAndDiagnostics.md:C1`) called out in the deleted `ShutdownCoordinator`.** The W1.5 relocation faithfully copied the broken sequencing. ### NC1 — `ProxyWorker.StopAsync` drain loop is structurally always-zero
**Resolution:** `7a43595` snapshots `inFlightAtCancel = CountInFlight()` BEFORE calling `supervisor.StopAsync(...)`. The post-stop drain loop and `drainCts` are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. `mbproxy.shutdown.complete` now reports a meaningful "requests dropped by stop" count.
**Fix options:** ### NM1 — `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable
- (a) Snapshot in-flight counts BEFORE calling `supervisors.StopAsync`, then move drain-with-real-multiplexers into a phase BEFORE supervisor stop (i.e. stop accepting new connections, drain in-flight, then stop supervisors). **Resolution:** `7a43595` bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; `gateHeld` flag tracks whether we acquired it so the `finally` only releases when appropriate. `ObjectDisposedException` from a disposed semaphore short-circuits to a clean return.
- (b) Accept that supervisor stop *is* the drain (matches today's behavior) and remove the theatrical loop + `InFlightAtCancel` field from the log event.
## New major findings (5) ### NM2 — `ReplaceContext` non-atomic ctx + provider swap
**Resolution:** `7a43595` swapped the order: provider FIRST, then `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed).
### NM1. `TearDownBackendAsync` acquires `_connectGate` with no token — disposal can be blocked indefinitely ### NM3 — `_supervisorCts` leaks across `StartAsync` re-entry
**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:333` (introduced by W1.4) **Resolution:** W4-followup. `StartAsync` now does `try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { }` before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh.
`await _connectGate.WaitAsync().ConfigureAwait(false)` does not honour `_disposeCts.Token`. The dispose-driven teardown calls this after cancelling `_disposeCts`, but if a long Polly-wrapped `EnsureBackendConnectedAsync` is mid-connect against an unreachable host (waiting up to `BackendConnectTimeoutMs`), DisposeAsync blocks here for that duration. This blocks `ProxyWorker.StopAsync` past its drain budget. ### NM4 — W2.15 TCS never re-armed (supervisor single-shot)
**Resolution:** W4-followup. `_firstAttemptCompleted` is now non-readonly and re-created in `StartAsync` after the W2.16 guard. A re-Started supervisor's `WaitForInitialBindAttemptAsync` no longer observes the previous run's signal.
**Fix:** pass `_disposeCts.Token` (or a CTS that cancels on dispose) to `WaitAsync`, accept the `OperationCanceledException`, and skip teardown work if disposal already completed. ### NM5 — Self-cascade `ObjectDisposedException` after dispose
**Resolution:** `7a43595` gated the writer + reader fault-path `_ = TearDownBackendAsync(...)` calls behind `if (!_disposeCts.IsCancellationRequested)`. DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant *and* exception-throwing.
### NM2. `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can see a disposed cache ### Nm1 — Saturation cleanup uses `await SendResponseAsync`
**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:173-184` (introduced by W1.1) **Resolution:** `7a43595` replaced the per-party `await SendResponseAsync` with `TrySendResponse` and `IncrementResponseDropForFullUpstream` on drop. Same doctrine as W1.3.
Snapshot path (`ProxyCounters.Snapshot()` called from the admin endpoint) reads `_cacheStatsProvider` independently of `_ctx`. Sequence: ### Nm3 — `gracefulMs` used twice
1. Thread A enters `ReplaceContext`, writes `_ctx = newContext` (line 177). **Resolution:** Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is `gracefulMs + 2 s admin`.
2. Thread B (admin) snapshots, reads `_cacheStatsProvider` (still the OLD adapter wrapping the OLD cache).
3. Thread A runs `SetCacheStatsProvider(newCacheAdapter)`.
4. Supervisor's `ReplaceContextAsync` proceeds to `oldCache.Clear()` then `oldCache.Dispose()`.
A snapshot during the swap reports `cacheEntryCount`/`cacheBytes` from a cache the live multiplexer no longer references. Worse, if the snapshot races past step 4, the stats adapter holds a disposed `ResponseCache` and `Count`/`ApproximateBytes` may throw. **Same shape of race as the original review's "snapshot inconsistency" pattern**, re-opened on a new path by the W1.1 fix. ### Nm6 — `_admin` lazy resolution silent null
**Resolution:** W4-followup. `ProxyWorker.ExecuteAsync` now logs a Warning when `GetService<AdminEndpointHost>()` returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts.
**Fix:** swap the order (provider first, then `_ctx`) inside a short lock, OR have `ResponseCache.Count`/`ApproximateBytes` no-op on a disposed instance. ### Nm7 — `AdminEndpointHost.DisposeAsync` no `_disposed` guard
**Resolution:** W4-followup. Added a `volatile bool _disposed` flag and a guard at the top of `DisposeAsync`; symmetry with `PlcMultiplexer`.
### NM3. `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard ### T2 — `WatchdogVsResponse_Race` seeded `Random` fragility
**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:140, 437-446` **Resolution:** `7a43595` replaced `new Random(12345) → rng.Next(350, 450)` with a counter-based alternation `(n & 1) == 1 ? 350 : 450`. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime.
W2.16's guard prevents re-Start while busy, but it does not address the leaking previous CTS the comment claims to fix. `StopAsync` sets `_state = Stopped` first, then awaits `_supervisorTask`. The previous `_supervisorCts` is disposed only in `DisposeAsync`, not in `StopAsync`. A re-Start pattern (Stop → Start without Dispose) leaks one CTS per cycle. ### T3 — `RemoveInheritedAppsettings` only on Build
**Resolution:** W4-followup. `AfterTargets="Build;Publish"` and a second `Delete` against `$(PublishDir)appsettings.json` (guarded by `Condition="'$(PublishDir)' != ''"` so a plain Build doesn't trip it).
Today no caller actually re-Starts a supervisor (ConfigReconciler's remove path uses `DisposeAsync`), so the leak is **latent**. Worth fixing now or documenting. ### T4 — Stale `TryAttachOrCreate_*` test names
**Resolution:** W4-followup. Renamed three test methods to drop `Try` and `_ReturnsTrue_` to match the W3 `AttachOrCreate` (void-returning) signature.
### NM4. W2.15 TCS is never re-armed; W2.16 + TCS together make the supervisor effectively single-shot ---
**File:** `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:74-75, 132-147`
`_firstAttemptCompleted` is `readonly` and set in the field initialiser. Once `TrySetResult` fires (anywhere in `RunSupervisorAsync` including the `finally`), it stays completed forever. A restarted supervisor would have `WaitForInitialBindAttemptAsync` return immediately on the *previous* run's signal, regardless of the new run's bind status. Combined with the comment claiming "the supervisor's state machine has exactly one Start", the intent appears to be single-shot — but neither the guard nor the TCS enforces it cleanly. ## Accepted findings — what wasn't fixed and why
Today no caller re-Starts so this is **latent**. Either re-arm the TCS in `RunSupervisorAsync` or have W2.16 refuse re-Start unconditionally. ### Nm2 — `CoalescedHit` increments for late attachers that get exception 04
**Why accepted:** The counter is correct per the design contract (`coalescedHit + coalescedMiss = total FC03/FC04 requests`). A late attacher *did* coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at `PlcMultiplexer.cs:809-812` (W2.4 doc clarification covers this).
### NM5. Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal ### Nm4 — `_lastBindError` preserved after clean cancellation
**File:** `src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs:451, 648` (introduced by W1.4 + W2.5) **Why accepted:** Intentional. The field's contract is "last fault while running" — useful for operators triaging *why* a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The `finally` block at `PlcListenerSupervisor.cs:435-436` already documents this with a comment.
After `DisposeAsync` runs `_connectGate.Dispose()`, any in-flight self-cascade (`_ = TearDownBackendAsync(...)` from writer/reader fault paths) reaches `WaitAsync()` and throws `ObjectDisposedException`. The `try/finally` will then run `Release()` on a disposed semaphore in the `finally`, also throwing. Both throws happen on a fire-and-forget Task — they become unobserved exceptions on TPL's UnobservedTaskException event. Noisy but not fatal. ### Nm5 — `EventLogBridge` no startup log of armed state
**Why accepted:** The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use `Log.ForContext` or similar). The class doc at `EventLogBridge.cs:18-20` already tells operators "must be created by `install.ps1` before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using `Log.Logger`) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change.
**Fix:** wrap the body in `try { … } catch (ObjectDisposedException) { }` or short-circuit on `_disposed` before re-entering teardown. ### Nm8 — `TearDownBackendAsync` log-noise on queued cascades
**Why accepted:** Cosmetic. Each cascade logs its own `BackendDisconnected` event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here.
## New minor findings (8) ### T1 — Reflection on private field names in race tests
**Why accepted:** The two reflection sites (`DrainAllocator` in `PlcMultiplexerTests`, the runtime-fault test in `SupervisorTests`) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only `internal` accessors that pollute production code, (b) exposing fields via `[InternalsVisibleTo]` (which Mbproxy.csproj already does for the test project — but the *fields* themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this.
| # | File:line | Finding | ---
|---|-----------|---------|
| Nm1 | `PlcMultiplexer.cs:842` (W1.2 cleanup) | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe. A single wedged late-attacher stalls saturation delivery to all others — contradicts the W1.3 doctrine. Use `TrySendResponse` here too. |
| Nm2 | `PlcMultiplexer.cs:810` | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04. Counter shows hits that produced no real coalesced response. Document or decrement. |
| Nm3 | `ProxyWorker.cs:269-310` | Both supervisor-stop and (theatrical) drain share `gracefulMs`. Worst-case shutdown is `2 * gracefulMs + 2s admin` — operators expect single budget. |
| Nm4 | `PlcListenerSupervisor.cs:433-446` | `finally` writes `_state = Stopped` directly, preserving `_lastBindError` from a prior fault. After clean cancellation operators see "last bind error" that was never the cause of stopping. Doc the field as "last fault while running" or clear on clean exit. |
| Nm5 | `EventLogBridge.cs:35-45` | W2.23 cached `_sourceExists` is correct, but no startup log confirms armed state. If `install.ps1` registers source after service starts, every Error+ event is silently dropped until restart. Add a one-line startup log. |
| Nm6 | `ProxyWorker.cs:224-235` | W1.5 lazy `_admin = _services.GetService<AdminEndpointHost>()` returns `null` silently if `AddMbproxyAdmin()` is forgotten. Previous `IHostedService` registration would have errored loudly during host build. Use `GetRequiredService<>` or log a warning. |
| Nm7 | `AdminEndpointHost.cs:200-210` | `DisposeAsync` lacks a `_disposed` guard; `ProxyWorker.StopAsync` calls `StopAsync` then DI disposes again. Operations are idempotent today but symmetry with `PlcMultiplexer` would be cleaner. |
| Nm8 | `PlcMultiplexer.cs:415` | `TearDownBackendAsync` log `BackendDisconnected` event fires per cascade; queued cascades on the gate each log their own event with mostly-zero counts. Cosmetic noise. |
## Test-discipline findings (4)
| # | File:line | Finding |
|---|-----------|---------|
| T1 | `PlcMultiplexerTests.cs:1152-1165`, `SupervisorTests.cs:1779-1789` | `DrainAllocator` and the runtime-fault test reflect on private field names (`_allocator`, `_currentListener`, `_listener`). A rename refactor breaks them at run-time, not compile-time. Recommend `[InternalsVisibleTo]` test seam OR comments on each reflected field warning that tests depend on the name. |
| T2 | `PlcMultiplexerTests.cs:1379-1380` (W3 #8) | `WatchdogVsResponse_Race` uses `new Random(12345)` over `[350, 450)` ms with watchdog at 400 ms; asserts BOTH branches hit. `Random` with a seed is implementation-defined across major .NET versions (legacy → Xoshiro128 in .NET 6). On a runtime upgrade the seed could land all 30 samples on one side and the test breaks. Replace with deterministic alternation: `int delay = (i & 1) == 0 ? 350 : 450;`. |
| T3 | `Mbproxy.Tests.csproj:38-40` (W2.21) | `RemoveInheritedAppsettings` Target uses `AfterTargets="Build"` and deletes from `$(OutputPath)`. Won't fire on `dotnet publish`. Test projects rarely publish but worth `AfterTargets="Build;Publish"` + a second Delete against `$(PublishDir)appsettings.json`. |
| T4 | `InFlightByKeyMapTests.cs:46, 75, 112` | After W3 removed the `bool` return, test method names still read `TryAttachOrCreate_..._ReturnsTrue_WasNewTrue`. Decorative but misleading for grep. |
## Verified clean (sampled, not exhaustive) ## Verified clean (sampled, not exhaustive)
The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these:
- **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name). - **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
- **W2.1 coalescingAccessor propagation** — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor). - **W2.1 coalescingAccessor propagation** — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
- **W2.13 OOR check** — multiplication is bounded by the guard; even worst-case `9999 * 10_000 + 9999 = 99_989_999` fits in int32 without overflow. - **W2.13 OOR check** — multiplication is bounded by the guard; even worst-case `9999 * 10_000 + 9999 = 99_989_999` fits in int32 without overflow.
@@ -104,23 +118,12 @@ After `DisposeAsync` runs `_connectGate.Dispose()`, any in-flight self-cascade (
- **W2.10 resolved-TTL re-check** — `BcdTagMapBuilder.Build` is called exactly once per PLC at validation; no duplicate work. - **W2.10 resolved-TTL re-check** — `BcdTagMapBuilder.Build` is called exactly once per PLC at validation; no duplicate work.
- **W2.18 ConnectionOptions validation** — both `MbproxyOptionsValidator` and `ReloadValidator` reject `<= 0`; no bypass path. - **W2.18 ConnectionOptions validation** — both `MbproxyOptionsValidator` and `ReloadValidator` reject `<= 0`; no bypass path.
- **W3 `HasBadNibble` dedupe** — clean; the codec's internal helper is the single source of truth. - **W3 `HasBadNibble` dedupe** — clean; the codec's internal helper is the single source of truth.
- **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` (bind-success, bind-failure-into-recovery, run-failure-into-recovery, finally) — no hang on `WaitForInitialBindAttemptAsync` for the first run. - **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` — no hang on `WaitForInitialBindAttemptAsync` for the first run. (W4 / NM4 added the re-arm for subsequent Starts.)
- **W2.17 TransitionTo lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples. - **W2.17 `TransitionTo` lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples.
- **`TxIdAllocator.Release` double-call is benign** (`TxIdAllocator.cs:121-129` checks `if (_inUse[id])`); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe. - **`TxIdAllocator.Release` double-call is benign** (`TxIdAllocator.cs:121-129` checks `if (_inUse[id])`); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
- **W1.1 in-PDU snapshot consistency** — `OnUpstreamFrameAsync` reads `_ctx.Cache` and `_ctx.TagMap.ResolveCacheTtlMs` non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream `WithCurrentRequest` snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent. - **W1.1 in-PDU snapshot consistency** — `OnUpstreamFrameAsync` reads `_ctx.Cache` and `_ctx.TagMap.ResolveCacheTtlMs` non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream `WithCurrentRequest` snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
- **W2.7 cache-FC byte sourced from post-rewriter buffer** — correct; the rewriter never touches the FC byte but the source must remain `frame[…]` to capture the exception bit. - **W2.7 cache-FC byte sourced from post-rewriter buffer** — correct; the rewriter never touches the FC byte but the source must remain `frame[…]` to capture the exception bit.
## Recommended next actions ## Closed
If you want to keep the issue tracker tidy, these are the items that should land as a small follow-up commit: The 2026-05-14 review series — original review → 4 remediation waves → re-review → wave-4 → wave-4-followup — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.
1. **NC1** — fix or remove the theatrical drain loop. Smallest change: delete the loop and the `InFlightAtCancel` field from the log.
2. **NM1** — pass `_disposeCts.Token` to `_connectGate.WaitAsync` in `TearDownBackendAsync`.
3. **NM2** — wrap the `_ctx` write + provider re-registration in a single short lock, or null-guard the cache adapter against disposed-cache reads.
4. **NM5** — wrap the self-cascade body in `try { … } catch (ObjectDisposedException) { }`.
5. **Nm1** — use `TrySendResponse` in the W1.2 saturation cleanup loop to avoid wedge-by-wedged-attacher.
6. **T2** — replace the seeded `Random` in `WatchdogVsResponse_Race` with deterministic alternation.
NM3 and NM4 are latent (no current caller exercises supervisor re-Start). They become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Defer or document.
Everything else is documentation / cosmetic / minor cleanup.
+13 -1
View File
@@ -44,6 +44,13 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
// Protects concurrent Start/Stop calls (hot-reload + StopAsync racing). // Protects concurrent Start/Stop calls (hot-reload + StopAsync racing).
private readonly SemaphoreSlim _lock = new(1, 1); private readonly SemaphoreSlim _lock = new(1, 1);
// Phase 12 (W4 / Nm7) — idempotency flag for DisposeAsync. ProxyWorker.StopAsync
// calls our StopAsync explicitly; the DI container then disposes the singleton on
// host shutdown. Without this flag the second pass would Dispose `_lock` twice and
// re-dispose the change registration (both currently safe but symmetry with
// PlcMultiplexer prevents future regression).
private volatile bool _disposed;
// Current configured port — used to detect changes on hot-reload. // Current configured port — used to detect changes on hot-reload.
private int _currentPort; private int _currentPort;
@@ -199,14 +206,19 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
public async ValueTask DisposeAsync() public async ValueTask DisposeAsync()
{ {
if (_disposed) return;
_disposed = true;
_optionsChangeRegistration?.Dispose(); _optionsChangeRegistration?.Dispose();
_lock.Dispose(); _optionsChangeRegistration = null;
if (_app is { } app) if (_app is { } app)
{ {
_app = null; _app = null;
await app.DisposeAsync().ConfigureAwait(false); await app.DisposeAsync().ConfigureAwait(false);
} }
try { _lock.Dispose(); } catch (ObjectDisposedException) { /* race-safe */ }
} }
// ── Logging ────────────────────────────────────────────────────────────── // ── Logging ──────────────────────────────────────────────────────────────
+10
View File
@@ -233,6 +233,16 @@ internal sealed partial class ProxyWorker : BackgroundService
_logger.LogError(ex, "Admin endpoint failed to start: {Message}", ex.Message); _logger.LogError(ex, "Admin endpoint failed to start: {Message}", ex.Message);
} }
} }
else
{
// Phase 12 (W4 / Nm6) — surface the absence. The previous IHostedService
// registration would have hard-errored in DI if AddMbproxyAdmin() was missing
// from Program.cs; the W1.5 lazy lookup returns null silently. A single warning
// makes a botched composition observable without blocking startup.
_logger.LogWarning(
"Admin endpoint not registered (AddMbproxyAdmin() missing from composition). " +
"Status page will be unavailable; service continues without it.");
}
// ── 6. Keep the worker alive until the host signals stop ───────────────────── // ── 6. Keep the worker alive until the host signals stop ─────────────────────
// Supervisors run their own background loops; ExecuteAsync just waits. // Supervisors run their own background loops; ExecuteAsync just waits.
@@ -71,7 +71,13 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
// for the first time (reached Bound or Recovering). Replaces the previous busy-poll // for the first time (reached Bound or Recovering). Replaces the previous busy-poll
// implementation in WaitForInitialBindAttemptAsync, which raced fast Stopped→Bound→ // implementation in WaitForInitialBindAttemptAsync, which raced fast Stopped→Bound→
// Stopped transitions and never exited if the supervisor task threw inside Polly. // Stopped transitions and never exited if the supervisor task threw inside Polly.
private readonly TaskCompletionSource _firstAttemptCompleted = new( //
// Phase 12 (W4 / NM4) — non-readonly so StartAsync can re-arm it for a re-Started
// supervisor. Without re-arming, a restart-after-stop scenario would have
// WaitForInitialBindAttemptAsync return immediately on the previous run's signal,
// never observing the new run's bind status. No production caller currently re-Starts,
// but the supervisor's state machine should be consistent.
private TaskCompletionSource _firstAttemptCompleted = new(
TaskCreationOptions.RunContinuationsAsynchronously); TaskCreationOptions.RunContinuationsAsynchronously);
// ── Public surface ──────────────────────────────────────────────────────────────────── // ── Public surface ────────────────────────────────────────────────────────────────────
@@ -132,16 +138,28 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
public Task StartAsync(CancellationToken ct) public Task StartAsync(CancellationToken ct)
{ {
// Phase 12 (W2.16) — refuse to re-Start an already-running or already-disposed // Phase 12 (W2.16) — refuse to re-Start an already-running or already-disposed
// supervisor. The original code reassigned _supervisorCts unconditionally, which // supervisor. After Stop the state machine returns to Stopped and StartAsync
// leaked the previous CTS and could leave a zombie task running against an // can re-arm; W4/NM3+NM4 below ensure the per-Start state (CTS, TCS) is fresh
// unobserved token. The supervisor's state machine has exactly one Start. // each time so no leak or stale signal carries across cycles.
if (_disposed) if (_disposed)
throw new ObjectDisposedException(nameof(PlcListenerSupervisor)); throw new ObjectDisposedException(nameof(PlcListenerSupervisor));
if (_state != SupervisorState.Stopped || !_supervisorTask.IsCompleted) if (_state != SupervisorState.Stopped || !_supervisorTask.IsCompleted)
throw new InvalidOperationException( throw new InvalidOperationException(
$"Supervisor for Plc='{_plc.Name}' has already been started."); $"Supervisor for Plc='{_plc.Name}' has already been started.");
// Phase 12 (W4 / NM3) — dispose the previous CTS before reassigning. The original
// code overwrote _supervisorCts unconditionally, leaking the prior CTS on every
// re-Start cycle (and any registrations linked to it). Idempotent: ObjectDisposed
// catch covers the very-first-Start case where the field-init CTS is still fresh.
try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { /* fresh */ }
_supervisorCts = CancellationTokenSource.CreateLinkedTokenSource(ct); _supervisorCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
// Phase 12 (W4 / NM4) — re-arm the first-attempt TCS so a re-Started supervisor
// doesn't immediately observe the previous run's signal in
// WaitForInitialBindAttemptAsync.
_firstAttemptCompleted = new TaskCompletionSource(
TaskCreationOptions.RunContinuationsAsynchronously);
_supervisorTask = Task.Run(() => RunSupervisorAsync(_supervisorCts.Token), CancellationToken.None); _supervisorTask = Task.Run(() => RunSupervisorAsync(_supervisorCts.Token), CancellationToken.None);
return Task.CompletedTask; return Task.CompletedTask;
} }
@@ -34,9 +34,13 @@
<!-- Phase 12 (W2.21) — the linked appsettings.json from Mbproxy.csproj propagates to <!-- Phase 12 (W2.21) — the linked appsettings.json from Mbproxy.csproj propagates to
the test bin via the project reference. Tests build their own in-memory the test bin via the project reference. Tests build their own in-memory
configurations and must not pick up the shipped template's example PLCs. The configurations and must not pick up the shipped template's example PLCs. The
Target deletes the inherited file from the test bin after build. --> Target deletes the inherited file from both the build output AND the publish
<Target Name="RemoveInheritedAppsettings" AfterTargets="Build"> payload (W4 / T3 — adds Publish so a `dotnet publish` against this csproj for a
packaged self-test would not leak the template's example PLCs into the published
bundle). -->
<Target Name="RemoveInheritedAppsettings" AfterTargets="Build;Publish">
<Delete Files="$(OutputPath)appsettings.json" Condition="Exists('$(OutputPath)appsettings.json')" /> <Delete Files="$(OutputPath)appsettings.json" Condition="Exists('$(OutputPath)appsettings.json')" />
<Delete Files="$(PublishDir)appsettings.json" Condition="'$(PublishDir)' != '' AND Exists('$(PublishDir)appsettings.json')" />
</Target> </Target>
</Project> </Project>
@@ -43,7 +43,7 @@ public sealed class InFlightByKeyMapTests
} }
[Fact] [Fact]
public async Task TryAttachOrCreate_NewKey_CallsFactory_ReturnsTrue_WasNewTrue() public async Task AttachOrCreate_NewKey_CallsFactory_WasNewTrue()
{ {
var pipe = MakePipe(); var pipe = MakePipe();
try try
@@ -72,7 +72,7 @@ public sealed class InFlightByKeyMapTests
} }
[Fact] [Fact]
public async Task TryAttachOrCreate_ExistingKey_AppendsParty_ReturnsTrue_WasNewFalse() public async Task AttachOrCreate_ExistingKey_AppendsParty_WasNewFalse()
{ {
var pipeA = MakePipe(); var pipeA = MakePipe();
var pipeB = MakePipe(); var pipeB = MakePipe();
@@ -109,7 +109,7 @@ public sealed class InFlightByKeyMapTests
} }
[Fact] [Fact]
public async Task TryAttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend() public async Task AttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend()
{ {
var pipeA = MakePipe(); var pipeA = MakePipe();
var pipeB = MakePipe(); var pipeB = MakePipe();