f2c6669444
Eight area-focused reviews (BCD rewriter, multiplexer, response cache, supervisor + hot-reload, admin + diagnostics, hosting + options, test suite) plus an Overview that prioritises findings across areas, and a RemediationPlan that groups the work into three waves with per-item file:line citations and regression-test sketches. Findings call out: hot-reload tag-list/cache changes that don't reach the running multiplexer, a coalescing factory leak that hangs late attachers, backend-reader head-of-line block on a wedged upstream, stranded outbound frames after cascade, and ShutdownCoordinator double-stop ordering. Plus the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB, not standard binary), unreachable BcdValidationError.DuplicateAddress, mbproxy.cache.flushed event that's defined but never emitted, and missing test coverage for Cache.AllowLongTtl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
237 lines
26 KiB
Markdown
237 lines
26 KiB
Markdown
# Remediation Plan — 2026-05-14 Code Review Findings
|
||
|
||
Plan for resolving every finding raised in the 2026-05-14 review. Items are grouped into three waves by impact, with each item naming the file:line, the problem, the suggested change, and the regression test that should lock the fix in place.
|
||
|
||
**Conventions:**
|
||
- "Wave 1" = ship before next deploy. Production-observable correctness defects.
|
||
- "Wave 2" = next sprint. Contract gaps, concurrency primitives, telemetry accuracy.
|
||
- "Wave 3" = opportunistic. Cleanup, dead code, doc, missing tests.
|
||
- Effort estimates assume a developer already familiar with the multiplexer's lifecycle. ⚙ = code change, 📝 = doc change, ✅ = test addition.
|
||
|
||
**Dependency map** (Wave 1 items have load-bearing relationships):
|
||
|
||
```
|
||
W1.1 (context swap) ──┬──► W1.2 (cache.flushed emission)
|
||
├──► W2.1 (ConfigReconciler coalescingAccessor)
|
||
├──► W2.7 (skip-invalidation-while-recovering)
|
||
└──► Tests ✅ (cache flush on tag-list reload)
|
||
|
||
W1.5 (ShutdownCoordinator ownership) ──► W2.4 (IOptionsMonitor switch)
|
||
└► W2.6 (drop hardcoded 5s deadline)
|
||
|
||
W2.3 (ConfigReconciler thread safety) ──► W3 test (concurrent reload assertion)
|
||
```
|
||
|
||
Other items are independent and can be parallelised.
|
||
|
||
---
|
||
|
||
## Wave 1 — Critical Correctness (4 items)
|
||
|
||
### W1.1 — Plumb context updates into the running `PlcMultiplexer` ⚙ ✅
|
||
|
||
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:50,101,109` and call sites at `:497-509`. Driven by `Proxy/Supervision/PlcListenerSupervisor.cs:199-216`.
|
||
|
||
**Problem:** `ReplaceContextAsync` swaps `_currentContext` on the supervisor, but `PlcMultiplexer` captured `_ctx = perPlcContext` at construction and never re-reads. So tag-list reloads, the cache-flush-on-tag-list contract, and any other context-derived state are visible only on the next listener fault. Worse, the old cache is disposed while the running mux still uses it.
|
||
|
||
**Change:** Make `_ctx` a `volatile` field on `PlcMultiplexer` (or wrap it in a reference type the supervisor can swap atomically). Add a public `ReplaceContext(PerPlcContext newContext)` method on the multiplexer; have `ReplaceContextAsync` call it before disposing the old context. Every place inside the multiplexer that reads `_ctx.TagMap` / `_ctx.Cache` / `_ctx.CacheInvalidator` must read the field fresh, not snapshot it into a local.
|
||
|
||
**Tests:**
|
||
- ✅ `PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu` — set up a mux, send one FC03, swap the context with a different tag map, send a second FC03, assert different rewriting on the second response.
|
||
- ✅ `PlcMultiplexerTests.ReplaceContext_FlushesOldCache` — populate the old cache, swap context, assert next read goes to backend (cache miss, not a hit on the disposed cache).
|
||
- ✅ `HotReloadE2ETests.E2E_GlobalTagListReload_FlushesPlcCache` — pre-seed cache via FC03, save modified `appsettings.json`, assert `cacheEntryCount` for that PLC drops to 0 and the next read goes through.
|
||
|
||
**Effort:** ~1 day. The mechanical change is small; the risk is making sure every read-site of `_ctx.*` is migrated and no field is hoisted into a constructor capture.
|
||
|
||
---
|
||
|
||
### W1.2 — Fix coalescing factory leak ⚙ ✅
|
||
|
||
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:692-733`, `Proxy/Multiplexing/InFlightByKeyMap.cs:62-83`.
|
||
|
||
**Problem:** When `_correlation.TryAdd` or `_allocator.TryAllocate` fails inside the `_inFlightByKey.TryAttachOrCreate` factory, the code stores the stub `InFlightRequest` under the key. Late attachers join the stub's `InterestedParties` list. The post-lock cleanup at `:724` only sends the saturation exception to the leader; late attachers wait forever.
|
||
|
||
**Change:** Either (a) inside the factory, do not return / store on failure — let `TryAttachOrCreate` propagate the error so no other thread sees the stub; or (b) on the cleanup path, walk `inFlight.InterestedParties` and deliver the exception 0x04 frame to every attached party before removing the entry. Option (b) is the smaller diff. Also remove the dead `bool` return from `TryAttachOrCreate` (`InFlightByKeyMap.cs:62`) — it always returns `true`.
|
||
|
||
**Tests:**
|
||
- ✅ `PlcMultiplexerTests.CoalescingFactoryFails_AllAttachersGetException` — stub the allocator to return false, fire 5 concurrent identical FC03 reads, assert all 5 receive exception 0x04 with their original TxIds.
|
||
- ✅ `InFlightByKeyMapTests.AttachOrCreate_FactoryThrows_DoesNotStoreStub` — drive the factory to throw and assert no entry remains in the map.
|
||
|
||
**Effort:** ~half day.
|
||
|
||
---
|
||
|
||
### W1.3 — Decouple backend reader from per-upstream backpressure ⚙ ✅
|
||
|
||
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:545`.
|
||
|
||
**Problem:** `await party.Pipe.SendResponseAsync(outFrame, ct)` runs synchronously inside the single backend reader task. If one upstream's bounded response channel (capacity 16) is full, the reader stalls — every other client on that PLC stops receiving responses.
|
||
|
||
**Change:** Replace the synchronous `await` with a non-blocking `TryWrite` (drop-on-full, with a counter `responseDropForFullUpstream`), or fan out per-party with `_ = Task.Run(() => party.Pipe.SendResponseAsync(...))`. Drop-on-full is the cleaner semantics: a wedged upstream loses responses but the rest of the fleet keeps moving; the upstream's watchdog or socket closure will collect it. Add a counter for the dropped responses so it's observable on the status page.
|
||
|
||
**Tests:**
|
||
- ✅ `PlcMultiplexerTests.SlowUpstream_DoesNotStallPeerResponses` — wedge one upstream's channel reader, fire reads from a second upstream, assert the second upstream receives responses on time and the wedged upstream's drop counter increments.
|
||
|
||
**Effort:** ~half day. Decide drop-vs-fanout first; the rest is mechanical.
|
||
|
||
---
|
||
|
||
### W1.4 — Drain or replace `_outboundChannel` on cascade ⚙ ✅
|
||
|
||
**File:** `Proxy/Multiplexing/PlcMultiplexer.cs:362-385,423-428`.
|
||
|
||
**Problem:** When the backend writer faults at `:380` and triggers a cascade, frames already enqueued in `_outboundChannel` are stranded. After reconnect, a new writer task drains them into the *fresh* backend with old proxy TxIds whose correlation entries were dropped by `TearDownBackendAsync`. The PLC responds; the reader silently drops the response (`:423-428`); upstream peers wait for the watchdog to time them out — observable as a multi-second pause for every client whose request was in flight at the cascade moment.
|
||
|
||
**Change:** On cascade, complete `_outboundChannel.Writer.Complete()` before tearing down the writer task. In `EnsureBackendConnectedAsync`, recreate the channel as part of bringing up a new backend connection. Alternative: drain-and-discard (`while (TryRead) { releaseTxId; }`) before re-running the writer.
|
||
|
||
**Tests:**
|
||
- ✅ `PlcMultiplexerTests.Cascade_StrandedOutboundFrames_AreDiscarded` — fill the outbound channel, force a cascade mid-drain, reconnect, assert the next request gets a normal response and the stranded frames did not produce orphaned TxIds (counter `responseToUnknownTxId` stays at 0 for the new connection).
|
||
|
||
**Effort:** ~1 day. Care needed to not accidentally race the channel recreation against fast reconnects.
|
||
|
||
---
|
||
|
||
### W1.5 — Resolve `ShutdownCoordinator` double-stop ⚙ ✅
|
||
|
||
**File:** `Program.cs:60-66`, `Diagnostics/ShutdownCoordinator.cs:147`, `Proxy/ProxyWorker.cs:196-217`, `Admin/AdminEndpointHost.cs`.
|
||
|
||
**Problem:** `ProxyWorker` and `AdminEndpointHost` are both `IHostedService`. Their `StopAsync` runs unconditionally during host shutdown. The coordinator on `ApplicationStopping` *also* calls `StopAsync` on both — so each is stopped twice. The drain logic in the coordinator therefore runs against a connection set that is already empty, and `mbproxy.shutdown.complete` always reports `InFlightAtCancel = 0`. The existing `ShutdownE2ETests.cs:113` doesn't actually assert the coordinator ran.
|
||
|
||
**Change:** Pick one of:
|
||
- **(A)** Make `ProxyWorker` and `AdminEndpointHost` *not* implement `IHostedService` — register them as singletons only — and let the coordinator drive their entire lifecycle (start in `ApplicationStarted`, stop in `ApplicationStopping`).
|
||
- **(B)** Move the drain logic into `ProxyWorker.StopAsync` and delete the coordinator entirely. Simpler if the only thing the coordinator does is order + drain + admin-stop.
|
||
|
||
(B) is the smaller change and matches how worker services are typically structured.
|
||
|
||
**Tests:**
|
||
- ✅ Update `ShutdownE2ETests` to assert `mbproxy.shutdown.complete` fires with a positive `InFlightAtCancel` when an in-flight request is dropped at deadline.
|
||
- ✅ Assert `mbproxy.shutdown.complete` fires exactly once.
|
||
|
||
**Effort:** ~1 day if (B); ~2 days if (A) because every IHostedService gets unwound.
|
||
|
||
---
|
||
|
||
## Wave 2 — Major Fixes (20 items)
|
||
|
||
Grouped by file/area for batched PRs.
|
||
|
||
### Multiplexer / Concurrency
|
||
|
||
| # | File:line | Problem | Change | Test |
|
||
|---|-----------|---------|--------|------|
|
||
| W2.1 | `Configuration/ConfigReconciler.cs:294-304,378-388` ↔ `Proxy/ProxyWorker.cs:129-130` | Add/restart paths construct `PlcListenerSupervisor` without the `coalescingAccessor` `ProxyWorker` builds at startup. Hot-reload of `ReadCoalescing.Enabled` doesn't propagate to PLCs added or restarted via reload. | Pass the same `coalescingAccessor` into `ConfigReconciler.Attach` and use it in the add/restart constructors. | ✅ `ConfigReconcilerTests.AddedPlc_HonoursCurrentReadCoalescingFlag` — reload with `Enabled=false`, add a PLC, assert that PLC's reads do not coalesce. |
|
||
| W2.2 | `Proxy/Multiplexing/PlcMultiplexer.cs:86,142,181,216,566`; `UpstreamPipe.cs:52,77` | `_disposed` is non-volatile but read on the hot path. On weakly-ordered platforms (ARM) stale-false reads after disposal are permitted. | Mark both fields `volatile`, or replace with `Interlocked.CompareExchange(ref _disposedFlag, 1, 0)` pattern. The whole project ships win-x64 today so this is correctness defense for future portability. | (existing dispose tests are sufficient; verify via inspection.) |
|
||
| W2.3 | `Configuration/ConfigReconciler.cs:236-238,269-271,306,390` | `_supervisors` `Dictionary<,>` is mutated from concurrent `Task.Run` continuations inside `Task.WhenAll`. Outer apply is serialised by a semaphore but the inner tasks are not. | Either `ConcurrentDictionary<,>`, or collect results from `Task.WhenAll` first then mutate the dict on a single thread. | ✅ Replace existing `Concurrent_Reloads_Serialised` with one that drives genuinely-concurrent add/remove via stub supervisors and asserts no `KeyNotFoundException` / corruption. |
|
||
| W2.4 | `Proxy/Multiplexing/PlcMultiplexer.cs:721-733` (counter parity); `:626` (cache miss double-count) | Counter semantics drift from the docs: saturation increments `coalescedMissCount` but produces no backend round-trip; `cacheMiss` is incremented even when the request later coalesces onto a peer's in-flight read. | Two options: (a) restructure increments to fire only on the actual outcome (move `IncrementCoalescedMiss` after the successful send; demote cache-miss to "cache-eligible request" and add a separate `cacheElevatedToBackend` counter), or (b) leave behaviour unchanged and update `docs/Reference/LogEvents.md` + `docs/design.md` to match. (b) is the lower-effort fix. | ✅ Either way, add `PlcMultiplexerTests.CounterSemantics_DocumentedContract` that asserts the chosen rule. |
|
||
| W2.5 | `Proxy/Multiplexing/PlcMultiplexer.cs:209,350` | `_disposeCts` disposed while watchdog/cancel paths may still race against it. `oldCts?.Dispose()` at `:350` races with reader's `ct.IsCancellationRequested` check. | Wrap `_disposeCts.Dispose()` in a `try { ... } catch (ObjectDisposedException) { }` at `:209`; gate `oldCts?.Dispose()` at `:350` behind the join of the reader task. | (existing dispose tests cover the happy path; consider a stress test that interleaves Dispose with active requests.) |
|
||
| W2.6 | `Proxy/Multiplexing/PlcMultiplexer.cs:283` | `_connectGate` `SemaphoreSlim` never disposed. | Add to `Dispose`. | (no test needed.) |
|
||
| W2.7 | `Proxy/Multiplexing/PlcMultiplexer.cs:468` | Reads the FC byte from the post-rewriter buffer. Correct today (FC byte never rewritten) but fragile. | Use `inFlight.Fc` from the request side. | (existing tests cover.) |
|
||
|
||
### Cache / Hot-reload
|
||
|
||
| # | File:line | Problem | Change | Test |
|
||
|---|-----------|---------|--------|------|
|
||
| W2.8 | `Proxy/Cache/CacheLogEvents.cs:67-76`, supervisor reseat path | `mbproxy.cache.flushed` event defined and documented but never emitted. After W1.1 the reseat path can call `cache.Clear()` instead of disposing — that's the natural emission point. | After W1.1, replace `cache.Dispose()` in `ReplaceContextAsync` with `cache.Clear(reason: "tag-list-reload")` and have `Clear` emit the log event. | ✅ `ResponseCacheTests.Clear_EmitsFlushedEvent_WithReason`; ✅ `HotReloadE2ETests` add an assertion that `mbproxy.cache.flushed` fires on the tag-list reload. |
|
||
| W2.9 | `Proxy/Multiplexing/PlcMultiplexer.cs:497-509` (currently no gate) | "Skip cache invalidations when listener is `recovering`" (`design.md:203`) is unimplemented. Implicit gating holds today (no backend → no FC06/FC16 response → no invalidation), but is undocumented and will break if any new code path produces a write response off the live backend. | Either (a) add an explicit check on `SupervisorState.Recovering` before calling `_invalidator.Invalidate(...)`, or (b) accept the implicit gating and add a comment in `PlcMultiplexer.cs` near the invalidation call referencing `design.md:203`. (b) is simpler and matches reality. | ✅ Doc-only fix → no test. (a) → fault-injection test that synthesises a response during recovery and asserts no invalidation. |
|
||
| W2.10 | `Configuration/ReloadValidator.cs:80` | `ReloadValidator` runs `BcdTagMapBuilder.Build` and discards the result; the resolved per-tag `CacheTtlMs` after the per-PLC default fold is never re-checked against `Cache.AllowLongTtl`. | After Build, walk `result.Map.Values` and re-run the long-TTL gate against resolved values. | ✅ Add to `ReloadValidatorTests`: tag with explicit null TTL inheriting a per-PLC default of 60_001 ms is rejected when `AllowLongTtl=false`. |
|
||
|
||
### BCD Rewriter
|
||
|
||
| # | File:line | Problem | Change | Test |
|
||
|---|-----------|---------|--------|------|
|
||
| W2.11 | `Bcd/BcdTagMapBuilder.cs:84-103` | `DuplicateAddress` validation is unreachable — input collapses through a `Dictionary` first. Design contract "reject duplicate addresses within a single PLC's resolved list" is silently violated. | Detect duplicates *before* the dictionary collapse: walk the `Add` list with a `HashSet<ushort>`, collect duplicates, return them as `DuplicateAddress` errors. Same for the merged `Global ∪ Add − Remove` resolution where Add overrides Global at the same address (intentional width-override path stays valid; only collisions within Add itself are errors). | ✅ `BcdTagMapBuilderTests.Build_DuplicateAddressInAddList_ReportsError` (currently the test acknowledges the gap). |
|
||
| W2.12 | `Bcd/BcdTagMapBuilder.cs:113-127` | `OverlappingHighRegister` reports each pair twice (once from each direction). | Track reported pairs in `HashSet<(ushort,ushort)>`, dedupe symmetric reports. | ✅ `BcdTagMapBuilderTests.Build_TwoOverlappingPairs_ReportsExactlyOneErrorPerPair`. |
|
||
| W2.13 | `Proxy/BcdPduPipeline.cs:206-211` | FC16 32-bit write silently mutates values when `clientLow > 9999` — multiplication math accepts the OOR value and re-encodes a different number. | Validate `clientLow < 10_000 && clientHigh < 10_000` before the multiplication; on failure, increment `invalidBcdWarnings` and pass through raw. | ✅ `BcdPduPipelineTests.FC16_32Bit_ClientHighOrLowOOR_PassesThroughRaw_WithInvalidBcdWarning`. |
|
||
| W2.14 | `Proxy/BcdPduPipeline.cs:159` | FC16 `byteCount` field at `pdu[5]` is parsed but discarded. Malformed claims of `qty=10` with 4 bytes of register data silently process half the request. | Add `if (pdu.Length < 6 + qty*2) return;` at the top of `ProcessFc16Request` (and increment a counter or log). | ✅ `BcdPduPipelineTests.FC16_TruncatedRegisterData_PassesThroughRawWithoutHalfRewrite`. |
|
||
|
||
### Supervisor
|
||
|
||
| # | File:line | Problem | Change | Test |
|
||
|---|-----------|---------|--------|------|
|
||
| W2.15 | `Proxy/Supervision/PlcListenerSupervisor.cs:137-144` | `WaitForInitialBindAttemptAsync` busy-polls every 10 ms, races a fast `Stopped→Bound→Stopped`, never exits if the supervisor task throws. | Replace polling loop with a `TaskCompletionSource<bool>` set on the first state transition out of `Stopped`. | ✅ `SupervisorTests.WaitForInitialBindAttemptAsync_ResolvesEvenWhenBindThrows` — drive a synthetic bind exception and assert the wait returns within ~50 ms. |
|
||
| W2.16 | `Proxy/Supervision/PlcListenerSupervisor.cs:126` | `_supervisorCts` reassigned on `StartAsync` without disposing the previous instance. | Add a guard (`throw if not Stopped`) or dispose the previous CTS before reassigning. | (covered by inspection.) |
|
||
| W2.17 | `Proxy/Supervision/PlcListenerSupervisor.cs:177-180` | `Snapshot()` reads three fields independently → status page can show inconsistent state/lastBindError/recoveryAttempts triples. | Read into locals under a single lock, or use a single `volatile` struct that's swapped atomically. | ✅ `SupervisorTests.Snapshot_DuringTransition_IsInternallyConsistent` — fire many transitions in parallel and assert no inconsistent triple ever observed. |
|
||
| W2.18 | `Options/ConnectionOptions.cs` and `MbproxyOptions.cs` validators | `Connection.{BackendConnectTimeoutMs,BackendRequestTimeoutMs,GracefulShutdownTimeoutMs}` are not validated for `> 0`. A misconfigured 0 produces immediate `CancelAfter(0)` failures. | Add `> 0` checks to `MbproxyOptionsValidator` and `ReloadValidator`. | ✅ `MbproxyOptionsBindingTests.Validation_RejectsZeroOrNegativeTimeouts`. |
|
||
|
||
### Hosting / Diagnostics
|
||
|
||
| # | File:line | Problem | Change | Test |
|
||
|---|-----------|---------|--------|------|
|
||
| W2.19 | `Diagnostics/ShutdownCoordinator.cs:93,102,118`; `Program.cs:42` | Coordinator binds `IOptions<MbproxyOptions>` (singleton snapshot). A hot-reloaded `GracefulShutdownTimeoutMs` is silently ignored. | Switch to `IOptionsMonitor<MbproxyOptions>`; read `.CurrentValue.Connection.GracefulShutdownTimeoutMs` inside `ShutdownAsync`. After W1.5 verify the coordinator still exists; if it was removed, this fix migrates to wherever drain logic lives. | ✅ `ShutdownE2ETests.E2E_HotReloadedShutdownTimeout_IsHonoured`. |
|
||
| W2.20 | `Diagnostics/ShutdownCoordinator.cs:147` | Hardcoded `TimeSpan.FromSeconds(5)` for supervisor stop deadline regardless of `GracefulShutdownTimeoutMs`. | Use the configured value. | (covered by W2.19's test if extended.) |
|
||
| W2.21 | `src/Mbproxy/appsettings.json` ↔ `install/mbproxy.config.template.json` | Shipped `appsettings.json` is an empty stub. Fresh `Mbproxy.exe` from publish folder is a no-op service. | Either mirror the install template into `src/Mbproxy/appsettings.json`, or set `<None Update="appsettings.json"><CopyToOutputDirectory>Never</CopyToOutputDirectory></None>` in the csproj so the publish folder doesn't ship a stub. | ✅ `HostSmokeTests.Host_StartsWithCheckedInAppsettings_AndAtLeastOneListenerBinds`. |
|
||
| W2.22 | `Admin/StatusDto.cs:102`, `Admin/StatusSnapshotBuilder.cs:127-131` | `pdus.invalidBcdWarnings` and `backendExceptions.codeOther` are tracked in `CounterSnapshot` but never surfaced via the JSON DTO or HTML renderer. Dead increment paths. | Add the two fields to `PlcPdusStatus` / `ExceptionCounts` and wire them through `StatusSnapshotBuilder` and `StatusHtmlRenderer`. | ✅ `AdminEndpointTests.StatusJson_IncludesInvalidBcdWarnings_AfterPipelineWarning`; ✅ analogous for `codeOther`. |
|
||
| W2.23 | `Diagnostics/EventLogBridge.cs:46` | `EventLog.SourceExists` registry hit on every Error+ log line. | Cache the result in a `bool _sourceExists` field set in the constructor (one-shot). Document that source registration changes after process start require a service restart. | (covered by inspection; optional perf-sensitive E2E.) |
|
||
| W2.24 | `Proxy/ProxyWorker.cs:202-217` | Hard-coded 5-second deadline in `StopAsync`. Coexists with the coordinator's deadline; reader of just `ProxyWorker` would miss the contract. | If W1.5 keeps the coordinator: add a comment pointing at it. If W1.5 deletes the coordinator: replace the 5s with `IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs`. | (covered by W2.19's test.) |
|
||
|
||
---
|
||
|
||
## Wave 3 — Cleanup, Doc-only, Test Gaps (low priority)
|
||
|
||
### Cleanup (small ⚙)
|
||
|
||
- `Proxy/Multiplexing/PlcMultiplexer.cs:445-450` — dead `elapsedMs` calculation, never used. Remove.
|
||
- `Proxy/Multiplexing/UpstreamPipe.cs:262-263` — meaningless conditional `firstRead && remaining == count ? false : false`. Replace with `return false`.
|
||
- `Proxy/Multiplexing/InFlightByKeyMap.cs:62` — `TryAttachOrCreate` always returns `true`; drop the dead `bool`. (Resolved as part of W1.2.)
|
||
- `Bcd/BcdCodec.cs:103-110` and `Proxy/BcdPduPipeline.cs:455-459` — duplicated `HasBadNibble`. Promote codec's helper to `internal` and call it from the pipeline.
|
||
- `Proxy/Supervision/PlcListenerSupervisor.cs:268,323,346` — copy-pasted exception-message truncation. Extract `Truncate(string, int)` helper.
|
||
- `Proxy/Multiplexing/PlcMultiplexer.cs:844-846` — comment says "1-second floor"; code uses 100 ms. Fix comment.
|
||
- `Configuration/ConfigReconciler.cs:428` — duplicated tag-delta computation. Either expose a delta count from `ReloadPlan.Compute` or drop the log enrichment.
|
||
- `Mbproxy.csproj:42` — comment describes Polly as "deferred" but it's actively used. Remove stale comment.
|
||
- `Admin/StatusSnapshotBuilder.cs:69` — unreachable `?.Address.ToString()` fallback. Simplify to `p.RemoteEp?.ToString() ?? "?"`.
|
||
|
||
### Doc-only (📝)
|
||
|
||
- `README.md` (operator-facing) — add a callout describing the 32-bit BCD wire format as "two base-10000 digits in CDAB", not standard binary CDAB Int32. Standard Modbus clients (NModbus, FluentModbus, Wonderware DAServer) will silently corrupt values > 9999 if they assume normal CDAB-int32 semantics. *(BcdRewriter M1.)*
|
||
- `docs/design.md` and `docs/Reference/LogEvents.md` — clarify counter semantics for cache-miss-during-coalescing per the choice made in W2.4. *(Mux M9.)*
|
||
- `docs/Operations/Configuration.md` — note that the Event Log sink is a custom `EventLogBridge`, not `Serilog.Sinks.EventLog`, so reviewers don't add a duplicate sink. *(Hosting minor.)*
|
||
- `Proxy/ResponseCache.md` (or `docs/Architecture/ResponseCache.md`) — document the implicit invalidation-skip-during-recovery gating. *(Cache M1 / Supervisor M1, paired with W2.9 option (b).)*
|
||
- `docs/plan/README.md` — add a phase entry for "Phase 12: Review remediation 2026-05-14" pointing at this file, so the phase history stays complete.
|
||
|
||
### Test gaps (✅)
|
||
|
||
The following design contracts are unverified by any current test. Add them in priority order:
|
||
|
||
1. ✅ `ReloadValidatorTests` — `Cache.AllowLongTtl=false` rejects `CacheTtlMs > 60_000` (per-tag and per-PLC-default forms). Paired with W2.10. *(Cache test gap; design.md.)*
|
||
2. ✅ `ReloadValidatorTests` — `Cache.AllowLongTtl=true` happy-path acceptance.
|
||
3. ✅ `HotReloadE2ETests` — tag-list reload flushes the affected PLC's cache (paired with W1.1 + W2.8).
|
||
4. ✅ `SupervisorTests` — replace placeholder `Supervisor_RuntimeFault_TriggersRecovery` with a test that genuinely faults the accept loop and asserts state goes `Bound → Recovering → Bound`. *(Supervisor test gap.)*
|
||
5. ✅ `PlcMultiplexerTests` — TxId allocator saturation propagates a clean Modbus exception (0x06 or equivalent) to upstream, doesn't hang. *(Mux test gap.)*
|
||
6. ✅ `PlcMultiplexerTests` — coalescing factory leak under saturation with multiple attachers. (Paired with W1.2.)
|
||
7. ✅ `PlcMultiplexerTests` — backend-reader head-of-line block by a slow upstream. (Paired with W1.3.)
|
||
8. ✅ `PlcMultiplexerTests` — watchdog↔response race (response arriving in the same tick as `SnapshotOlderThan`).
|
||
9. ✅ `PlcMultiplexerTests` — cascade racing with a brand-new accept (does the new pipe survive or land in a zombie multiplexer mid-teardown?).
|
||
10. ✅ `HotReloadE2ETests` — flip `ReadCoalescing.Enabled` at runtime (not start-up) and assert subsequent reads stop coalescing. (Paired with W2.1.)
|
||
11. ✅ `BcdCodecTests` — `Encode16(int.MinValue)` doesn't blow up.
|
||
12. ✅ `BcdPduPipelineTests` — write range ending exactly mid-32-bit-pair on the high address (inverse of existing low-side test).
|
||
13. ✅ `BcdPduPipelineTests` — mixed 16-bit + 32-bit + non-BCD slots in a single FC03 read.
|
||
14. ✅ `BcdPduPipelineTests` — FC16 *response* handling (currently only the request is tested).
|
||
15. ✅ `ProxyForwardingTests` or similar — `qty > 128` FC03 read passes through unchanged; PLC's exception 03 surfaces to the client.
|
||
16. ✅ `ConfigReconcilerTests` — replace the soft "concurrent reloads serialised" test with one that drives genuinely-concurrent applies and asserts no overlap (using a fake supervisor with a "claim count" assertion).
|
||
17. ✅ `AdminEndpointTests` — POST/PUT/DELETE/HEAD against `/` and `/status.json` return 405 (verifies no unintended methods are bound).
|
||
|
||
---
|
||
|
||
## Out of Scope
|
||
|
||
Findings that this plan deliberately does **not** address:
|
||
|
||
- **Mux C2 / C3** (cache-hit-during-cascade visibility model). Per the design these are acceptable behaviours; the Multiplexer review downgraded them.
|
||
- Any finding that recommended adding a new feature (e.g. an `Mbproxy.AdminBindAddress` config knob) or new defense-in-depth security hardening — those were trimmed from the reviews per direction on 2026-05-14.
|
||
|
||
---
|
||
|
||
## Suggested Sequencing
|
||
|
||
A focused remediation pass might split as:
|
||
|
||
| Sprint | Items | Outcome |
|
||
|--------|-------|---------|
|
||
| Sprint 1 (~1 week) | W1.1, W1.2, W1.3, W1.4 | The four critical multiplexer/hot-reload defects are gone. Hot-reload finally works as documented. |
|
||
| Sprint 2 (~1 week) | W1.5, W2.19, W2.20, W2.24 + the related test (ShutdownE2ETests) | Shutdown path is single-owner and respects the configured timeout. |
|
||
| Sprint 3 (~1 week) | W2.1, W2.3, W2.8, W2.9, W2.10, W2.21 + cache/supervisor tests | All the major hot-reload + cache contract gaps closed. |
|
||
| Sprint 4 (~1 week) | All remaining W2 items (BCD, supervisor, multiplexer minor) | Counter accuracy, dispose hygiene, validator coverage. |
|
||
| Sprint 5 (~3 days) | Wave 3 cleanup + doc + tests | Suite + docs caught up. |
|
||
|
||
After Sprint 1 the production risk profile drops sharply. After Sprint 3 every documented design contract is enforced. Wave 3 is opportunistic and can be folded into ongoing work.
|