fix(server): resolve Low code-review findings (Server-004,006,008,012,014,015)

- Server-004: pass the role-derived display name to UserIdentity's base
  ctor (the SDK's DisplayName has no public setter) and drop the dead
  Display property; make RoleBasedIdentity internal sealed.
- Server-006: derive a bounded CancellationToken from the SDK's
  OperationContext.OperationDeadline in OnReadValue / OnWriteValue so a
  stalled driver call can no longer pin the request thread.
- Server-008: mark handled slots via CallMethodRequest.Processed = true
  in RouteScriptedAlarmMethodCalls (the SDK skips on Processed, not on a
  Good error slot).
- Server-012: PeerHttpProbeLoop.ProbeAsync stops mutating client.Timeout
  per call; uses a per-request CancellationTokenSource linked to the
  shutdown token instead.
- Server-014: wire SealedBootstrap into Program.cs via AddSealedBootstrap
  + OpcUaServerService so the generation-sealed cache + stale-config flag
  + resilient reader actually run; /healthz now reflects cache-fallback
  state.
- Server-015: replace the stale 'PR 16 / PR 17 minimum-viable scope'
  class summaries on OtOpcUaServer and OpcUaServerOptions with the
  shipped LDAP + anonymous-role + configurable security-profile prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-23 07:24:20 -04:00
parent 2b33b64a58
commit 6134050ceb
14 changed files with 698 additions and 40 deletions


@@ -7,7 +7,7 @@
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
-| Open findings | 6 |
+| Open findings | 0 |
## Checklist coverage
@@ -74,13 +74,13 @@
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:187-200` |
-| Status | Open |
+| Status | Resolved |
**Description:** `RoleBasedIdentity` declares its own `Display` property, but the base `UserIdentity` already has a settable `DisplayName`. `DriverNodeManager.ResolveCallUser`/`RouteScriptedAlarmMethodCalls` read the base `DisplayName`, never `Display`. Since the ctor passes only `userName` to base, `DisplayName` resolves to the username — so scripted-alarm Ack/Confirm/Shelve audit entries record the raw username, not the LDAP-resolved display name the comment promises. `Display` is dead code.
**Recommendation:** Drop `Display`; set the base `DisplayName = displayName ?? userName;`. Verify `ResolveCallUser` yields the resolved display name.
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — re-triaged: in the pinned SDK version (1.5.374.126) `UserIdentity.DisplayName` is a sealed-virtual auto-property with no public setter, so the base `DisplayName = …` assignment the original recommendation suggested won't compile. Instead the fix passes `displayName ?? userName` as the first arg to the base `UserIdentity(string, string)` ctor — the SDK seeds `DisplayName` from that arg internally — and removes the dead `Display` property. `RoleBasedIdentity` is now `internal sealed` so `DriverNodeManager.ResolveCallUser` can be unit-tested against the production identity type. Regression tests `RoleBasedIdentityTests.DisplayName_returns_LDAP_resolved_display_name_when_present`, `DisplayName_falls_back_to_userName_when_LDAP_display_name_is_null`, and `ResolveCallUser_yields_LDAP_resolved_display_name` cover the behaviour.
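
A minimal sketch of the constructor-seeding pattern the Server-004 resolution describes. `StandInUserIdentity` and `RoleBasedIdentitySketch` are hypothetical stand-ins, not the SDK's `UserIdentity` or the production `RoleBasedIdentity`. The empty second constructor argument is a placeholder, since the resolution does not say what the fix passes there.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the SDK base type: DisplayName is get-only
// and can only be seeded through the constructor.
class StandInUserIdentity
{
    public StandInUserIdentity(string userName, string password)
    {
        DisplayName = userName;
    }

    public string DisplayName { get; }
}

// Hypothetical stand-in for RoleBasedIdentity. Because the base property has
// no setter, the resolved display name is passed as the first ctor argument.
sealed class RoleBasedIdentitySketch : StandInUserIdentity
{
    public RoleBasedIdentitySketch(string userName, string? displayName)
        : base(displayName ?? userName, string.Empty) // second arg is a placeholder
    {
    }
}
```

With this shape, `new RoleBasedIdentitySketch("jdoe", null).DisplayName` falls back to the username, which is the fallback case the regression tests named above are said to cover.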
### Server-005
| Field | Value |
@@ -102,13 +102,13 @@
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:478-482, 1342-1348` |
-| Status | Open |
+| Status | Resolved |
**Description:** `OnReadValue`/`OnWriteValue` are synchronous stack hooks that block on async driver calls via `.GetAwaiter().GetResult()` with `CancellationToken.None`. With `MaxRequestThreadCount = 100`, a burst of reads/writes into a stalled driver pins request threads for the full pipeline timeout, exhausting the pool and stalling unrelated sessions. The call cannot be cancelled by a client timeout.
**Recommendation:** Derive a `CancellationToken` from the `OperationContext` / `TransportQuotas.OperationTimeout` so a stuck driver call is abandoned. Longer term, use the stack's async service overrides if available.
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — added `DriverNodeManager.DeriveOperationCancellation(ISystemContext, TimeSpan fallback)` helper that reads `SystemContext.OperationContext.OperationDeadline` (which the stack sets from the client's `RequestHeader.TimeoutHint`). `OnReadValue` and `OnWriteValue` now pass `cts.Token` to `_invoker.ExecuteAsync` / `ExecuteWriteAsync` instead of `CancellationToken.None`, and surface `BadTimeout` (instead of `BadInternalError`) when the deadline fires. Handles both the SDK's sentinel deadlines: `DateTime.MinValue` (no deadline plumbed) and `DateTime.MaxValue` (TimeoutHint=0, the SDK default) collapse to a 30-s fallback. A deadline > Int32.MaxValue ms in the future also clamps to the fallback so the read path never throws `ArgumentOutOfRangeException` from inside `CancellationTokenSource(TimeSpan)`. Regression tests in `DriverNodeManagerCancellationTests` cover all five paths (future / past / missing / MinValue / MaxValue).
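
The deadline handling in the Server-006 resolution could look roughly like the sketch below. It is an illustration under stated assumptions, not the production helper: `OperationCancellationSketch.FromDeadline` is a hypothetical name, it takes the deadline and the current time as plain `DateTime` values rather than an `ISystemContext`, and returning an already-cancelled source for a past deadline is an assumption the resolution text does not confirm.

```csharp
using System;
using System.Threading;

static class OperationCancellationSketch
{
    // Hypothetical helper: maps an operation deadline to a bounded
    // CancellationTokenSource, falling back when no usable deadline exists.
    public static CancellationTokenSource FromDeadline(
        DateTime deadlineUtc, DateTime nowUtc, TimeSpan fallback)
    {
        // Sentinel deadlines (none plumbed / no timeout hint) use the fallback.
        if (deadlineUtc == DateTime.MinValue || deadlineUtc == DateTime.MaxValue)
            return new CancellationTokenSource(fallback);

        TimeSpan remaining = deadlineUtc - nowUtc;

        // Assumption: a deadline that has already passed cancels immediately.
        if (remaining <= TimeSpan.Zero)
        {
            var expired = new CancellationTokenSource();
            expired.Cancel();
            return expired;
        }

        // Clamp very distant deadlines so the TimeSpan constructor cannot
        // throw ArgumentOutOfRangeException.
        if (remaining.TotalMilliseconds > int.MaxValue)
            return new CancellationTokenSource(fallback);

        return new CancellationTokenSource(remaining);
    }
}
```

Passing the current time as a parameter keeps the sketch deterministic under test. The caller owns the returned source and should dispose it once the driver call completes.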
### Server-007
| Field | Value |
@@ -130,13 +130,13 @@
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:736` |
-| Status | Open |
+| Status | Resolved |
**Description:** `RouteScriptedAlarmMethodCalls` marks a handled slot by setting `errors[i] = ServiceResult.Good`, assuming `base.Call` skips non-null *Good* error slots. The stack and `GateCallMethodRequests` only ever pre-populate *Bad* slots; the skip-on-Good assumption is not a guaranteed SDK contract. If `base.Call` re-dispatches, the engine method and the stack's built-in Part 9 handler both fire — double transition.
**Recommendation:** Verify against the pinned SDK whether `base.Call` skips Good-pre-populated slots. If not, exclude routed slots from `methodsToCall` before `base.Call`. Add a test asserting exactly-once engine transition for a routed Acknowledge.
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — verified against the pinned SDK (DeepWiki query against OPCFoundation/UA-.NETStandard): `CustomNodeManager2.Call` / `CallInternalAsync` skip slots whose `CallMethodRequest.Processed` flag is `true`, not slots whose `errors[i]` is a non-Bad `ServiceResult`. `RouteScriptedAlarmMethodCalls` now sets `request.Processed = true` on every handled slot — success, `ArgumentException`, and generic exception paths — so `base.Call` never re-dispatches a routed Acknowledge / Confirm / AddComment to the stack's built-in Part 9 handler. Regression tests in `ScriptedAlarmMethodRoutingProcessedFlagTests` assert `Processed` is `true` after each engine path and `false` for slots the helper passes through to `base.Call`.
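
The exactly-once guarantee in the Server-008 resolution can be illustrated with stand-in types. Everything below is hypothetical: `StandInCallRequest` and `RoutingSketch` only model the skip-on-`Processed` behaviour described above and do not reproduce the SDK's `CallMethodRequest` or `CustomNodeManager2`. The set of routed method names is taken from the resolution text for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK request type.
sealed class StandInCallRequest
{
    public string MethodName { get; set; } = "";
    public bool Processed { get; set; }
}

static class RoutingSketch
{
    // Illustrative set of methods the engine handles itself.
    static readonly HashSet<string> Routed =
        new HashSet<string> { "Acknowledge", "Confirm", "AddComment" };

    // Runs the engine for routed slots and marks them Processed on every
    // path (success, ArgumentException, generic exception).
    public static void Route(
        IList<StandInCallRequest> requests, Action<StandInCallRequest> engine)
    {
        foreach (var request in requests)
        {
            if (!Routed.Contains(request.MethodName))
                continue; // left for the base dispatcher

            try
            {
                engine(request);
            }
            catch (ArgumentException)
            {
                // A real handler would record a Bad result for this slot here.
            }
            catch (Exception)
            {
                // Likewise for unexpected failures.
            }
            finally
            {
                request.Processed = true;
            }
        }
    }

    // Models the base dispatcher: it skips slots already marked Processed.
    public static void BaseCall(
        IList<StandInCallRequest> requests, Action<StandInCallRequest> builtIn)
    {
        foreach (var request in requests)
        {
            if (request.Processed)
                continue;
            builtIn(request);
        }
    }
}
```

Setting the flag in `finally` means a routed slot is never offered to the base dispatcher, even when the engine call throws.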
### Server-009
| Field | Value |
@@ -186,13 +186,13 @@
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79` |
-| Status | Open |
+| Status | Resolved |
**Description:** `ProbeAsync` creates an `IHttpClientFactory` client and mutates `client.Timeout` on every 2-second probe tick. The timeout belongs on the request or on the named-client registration, not set per call on a factory-vended instance.
**Recommendation:** Configure the timeout once via `AddHttpClient(HttpClientName).ConfigureHttpClient(...)`, or use a per-request linked `CancellationTokenSource(_options.HttpProbeTimeout)`; drop the per-call `client.Timeout` mutation.
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — `ProbeAsync` no longer mutates `client.Timeout`. Replaced with a per-call `CancellationTokenSource(_options.HttpProbeTimeout)` linked to the loop's shutdown token; `GetAsync` consumes the linked token so the per-request deadline is enforced via cancellation instead of via the factory-vended `HttpClient` instance. Regression test `PeerHttpProbeLoopTests.Tick_does_not_mutate_factory_vended_client_Timeout` asserts the timeout-on-client mutation is gone.
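
A minimal sketch of the per-request deadline pattern from the Server-012 resolution. `ProbeSketch.ProbeAsync` is a hypothetical stand-in for the production method. It uses `CancellationTokenSource.CreateLinkedTokenSource` plus `CancelAfter`, which is one way to combine a timeout with a shutdown token and may differ from how the actual fix builds its linked source. Treating any failure as "probe failed" is also an assumption.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class ProbeSketch
{
    // Hypothetical probe: enforces the deadline through cancellation and
    // never touches client.Timeout on the shared HttpClient instance.
    public static async Task<bool> ProbeAsync(
        HttpClient client, Uri uri, TimeSpan timeout, CancellationToken shutdown)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(shutdown);
        cts.CancelAfter(timeout);

        try
        {
            using var response =
                await client.GetAsync(uri, cts.Token).ConfigureAwait(false);
            return response.IsSuccessStatusCode;
        }
        catch (OperationCanceledException) when (!shutdown.IsCancellationRequested)
        {
            // The per-request deadline fired; shutdown was not requested.
            return false;
        }
        catch (HttpRequestException)
        {
            // Peer unreachable or connection refused.
            return false;
        }
    }
}
```

When the shutdown token itself is cancelled, the exception filter lets `OperationCanceledException` propagate so the calling loop can stop rather than record a failed probe.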
### Server-013
| Field | Value |
@@ -214,13 +214,13 @@
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs` |
-| Status | Open |
+| Status | Resolved |
**Description:** `SealedBootstrap` claims in its xml-doc to "close release blocker #2" by consuming the generation-sealed cache + resilient reader + stale-config flag, but `Program.cs` registers and uses `NodeBootstrap` instead. `SealedBootstrap` is never registered in DI nor referenced by `OpcUaServerService` — it and its `StaleConfigFlag` plumbing are dead in the production wire-up; the release blocker remains open in practice.
**Recommendation:** Either register `SealedBootstrap` (with `GenerationSealedCache`/`ResilientConfigReader`/`StaleConfigFlag`) and wire `StaleConfigFlag` into the health host, or delete `SealedBootstrap` and correct the release-readiness doc.
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — added `ServerWiring.AddSealedBootstrap` DI helper that registers `GenerationSealedCache` (rooted at a `.sealed` sibling of `NodeOptions.LocalCachePath`), `StaleConfigFlag`, `ResilientConfigReader`, and `SealedBootstrap`. `Program.cs` calls it after `AddSingleton<NodeBootstrap>()`; `OpcUaServerService` now consumes `SealedBootstrap` instead of `NodeBootstrap`; `OpcUaApplicationHost` is constructed with `staleConfigFlag` resolved from DI so `/healthz`'s `usingStaleConfig` reflects the cache-fallback state. The legacy `NodeBootstrap` registration stays for back-compat with the integration tests that construct it directly. Regression test `SealedBootstrapWiringTests.SealedBootstrap_and_its_dependencies_are_registered_in_DI` asserts the registrations compose without missing-service exceptions; `SealedBootstrap.cs`'s xml-doc updated to describe the live wire-up rather than the deferred plan.
### Server-015
| Field | Value |
@@ -228,10 +228,10 @@
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26` |
-| Status | Open |
+| Status | Resolved |
**Description:** `OtOpcUaServer`'s class doc still says "PR 16 minimum-viable scope ... no security ... LDAP + security profiles are deferred." `OpcUaServerOptions`'s says "PR 17 minimum-viable scope: no LDAP, no security profiles beyond None." Both are stale — the class now does LDAP UserName auth, anonymous-role mapping, and a configurable security profile. A reader would wrongly conclude the server has no authentication.
**Recommendation:** Update both class summaries to describe current behaviour and drop the "deferred to a future PR" language.
-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-23 — rewrote both class summaries. `OtOpcUaServer` now describes the live LDAP UserName / Anonymous identity-token flow, the `RoleBasedIdentity` wrapper, and the configurable `SecurityProfile` driven by `OpcUaServerOptions`. `OpcUaServerOptions` now describes endpoint + identity + PKI + health + LDAP + anonymous-role surfaces and points at `docs/security.md`. The stale "PR 16 / PR 17 minimum-viable scope" and "deferred to their own PR" language is gone.