# Code Review — Server | Field | Value | |---|---| | Module | `src/Server/ZB.MOM.WW.OtOpcUa.Server` | | Reviewer | Claude Code | | Review date | 2026-05-22 | | Commit reviewed | `76d35d1` | | Status | Reviewed | | Open findings | 0 | ## Checklist coverage | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | Server-001, Server-002, Server-003 | | 2 | OtOpcUa conventions | Server-004 | | 3 | Concurrency & thread safety | Server-005, Server-006 | | 4 | Error handling & resilience | Server-007, Server-008 | | 5 | Security | Server-009, Server-010, Server-011 | | 6 | Performance & resource management | Server-012 | | 7 | Design-document adherence | Server-013 | | 8 | Code organization & conventions | Server-014 | | 9 | Testing coverage | No issues found | | 10 | Documentation & comments | Server-015 | ## Findings ### Server-001 | Field | Value | |---|---| | Severity | Critical | | Category | Correctness & logic bugs | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:1791` | | Status | Resolved | **Description:** `WriteNodeIdUnknown` calls itself unconditionally as its first statement, then sets `errors[i]`. Unbounded recursion with no base case overflows the stack. Called from all four `HistoryRead*` overrides whenever a HistoryRead targets a node whose `NodeId` cannot be resolved to a driver full reference. Any client issuing such a HistoryRead triggers an uncatchable `StackOverflowException` that terminates the process — a remotely-triggerable DoS. **Recommendation:** Replace the self-call with the result-slot assignment mirroring `WriteUnsupported`/`WriteInternalError`: `results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown };` then `errors[i] = StatusCodes.BadNodeIdUnknown;`. **Resolution:** Resolved 2026-05-22 — replaced the unconditional self-call in `WriteNodeIdUnknown` with the result-slot assignment (`results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown }`), mirroring `WriteUnsupported`/`WriteInternalError`; the helper is now `internal` for testability. Regression test `DriverNodeManagerHistoryMappingTests.WriteNodeIdUnknown_returns_BadNodeIdUnknown_without_unbounded_recursion` runs the helper on a small-stack worker thread and asserts it returns promptly with `BadNodeIdUnknown`. ### Server-002 | Field | Value | |---|---| | Severity | High | | Category | Correctness & logic bugs | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs:60-63` | | Status | Resolved | **Description:** `IsAllowed` does `if (decision.IsAllowed) return true; return !_strictMode;`. When a session carries resolved LDAP groups and the evaluator returns an explicit deny, lax mode (default) overrides it to `true`. The lax fallback is intended only for sessions lacking LDAP groups / missing tries, but here it also nullifies authored `NodeAcl` deny rules for fully-resolved sessions. Per-tag deny ACLs do nothing until `StrictMode` is on. **Recommendation:** Distinguish "indeterminate / no grant" from "explicit deny." Fall through to `!_strictMode` only when indeterminate; an explicit deny returns `false` regardless of mode. Extend `AuthorizeDecision` with an `IsExplicitDeny` flag if needed. **Resolution:** Resolved 2026-05-22 — `AuthorizationGate.IsAllowed` now switches on the evaluator's `AuthorizationVerdict`: `Allow` returns true, `Denied` (explicit deny rule matched) returns false in both strict and lax mode, and only the indeterminate `NotGranted` case falls through to `!_strictMode`. The existing `AuthorizationVerdict.Denied` tri-state member is now honoured rather than collapsed into the lax fallback. Regression tests `ExplicitDeny_LaxMode_Denies` / `ExplicitDeny_StrictMode_Denies` / `NotGranted_LaxMode_Allows` / `NotGranted_StrictMode_Denies` in `AuthorizationGateTests` cover all four verdict×mode combinations via a fixed-verdict evaluator stub. ### Server-003 | Field | Value | |---|---| | Severity | Medium | | Category | Correctness & logic bugs | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Phase7/RingBufferHistoryWriter.cs:96-119` | | Status | Resolved | **Description:** `ReadRawAsync`'s XML doc claims "newest-first," but `TagRingBuffer.Snapshot()` returns oldest-to-newest and the loop preserves that order — so results are oldest-first. Also `maxValuesPerNode` is capped against total buffer size *before* the `[startUtc, endUtc)` filter, so a paged read returns the oldest in-window samples, contradicting the doc and usual HistoryRead expectations. **Recommendation:** Make code and doc agree on ordering (raw HistoryRead is normally ascending source-timestamp). Apply `maxValuesPerNode` to the in-window count, not the whole buffer. **Resolution:** Resolved 2026-05-22 — corrected XML doc from "newest-first" to "oldest-first (ascending source timestamp, matching OPC UA Part 11 §6.4 raw-values default)"; moved `maxValuesPerNode` cap inside the time-window loop so the limit applies only to in-window results, not the whole buffer snapshot. ### Server-004 | Field | Value | |---|---| | Severity | Low | | Category | OtOpcUa conventions | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:187-200` | | Status | Resolved | **Description:** `RoleBasedIdentity` declares its own `Display` property, but the base `UserIdentity` already has a settable `DisplayName`. `DriverNodeManager.ResolveCallUser`/`RouteScriptedAlarmMethodCalls` read the base `DisplayName`, never `Display`. Since the ctor passes only `userName` to base, `DisplayName` resolves to the username — so scripted-alarm Ack/Confirm/Shelve audit entries record the raw username, not the LDAP-resolved display name the comment promises. `Display` is dead code. **Recommendation:** Drop `Display`; set the base `DisplayName = displayName ?? userName;`. Verify `ResolveCallUser` yields the resolved display name. **Resolution:** Resolved 2026-05-23 — re-triaged: in the pinned SDK version (1.5.374.126) `UserIdentity.DisplayName` is a sealed-virtual auto-property with no public setter, so the base `DisplayName = …` assignment the original recommendation suggested won't compile. Instead the fix passes `displayName ?? userName` as the first arg to the base `UserIdentity(string, string)` ctor — the SDK seeds `DisplayName` from that arg internally — and removes the dead `Display` property. `RoleBasedIdentity` is now `internal sealed` so `DriverNodeManager.ResolveCallUser` can be unit-tested against the production identity type. Regression tests `RoleBasedIdentityTests.DisplayName_returns_LDAP_resolved_display_name_when_present`, `DisplayName_falls_back_to_userName_when_LDAP_display_name_is_null`, and `ResolveCallUser_yields_LDAP_resolved_display_name` cover the behaviour. ### Server-005 | Field | Value | |---|---| | Severity | Medium | | Category | Concurrency & thread safety | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Alarms/AlarmConditionService.cs:166`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:303-311` | | Status | Resolved | **Description:** `OnValueChanged` raises `TransitionRaised` on the value-change thread; the subscriber `OnAlarmServiceTransition` drives `ConditionSink.OnTransition` → `alarm.ReportEvent`. `DriverNodeManager.Dispose` detaches the handler but does not synchronise against an in-flight `Invoke`. The service is process-shared across drivers, so a transition can dispatch to a `ConditionSink` whose `DriverNodeManager` is concurrently being disposed → `ReportEvent` on a torn-down node manager. **Recommendation:** Guard `OnAlarmServiceTransition` with a `_disposed` check under `Lock` before `sink.OnTransition`. Document that handlers must tolerate invocation during their owner's disposal. **Resolution:** Resolved 2026-05-22 — added `_nodeManagerDisposed` field; `Dispose(bool)` now sets it under `Lock` before detaching the handler; `OnAlarmServiceTransition` checks the flag under the same `Lock` and exits early, preventing forwarding to a sink after the node manager has begun disposal. ### Server-006 | Field | Value | |---|---| | Severity | Low | | Category | Concurrency & thread safety | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:478-482, 1342-1348` | | Status | Resolved | **Description:** `OnReadValue`/`OnWriteValue` are synchronous stack hooks that block on async driver calls via `.GetAwaiter().GetResult()` with `CancellationToken.None`. With `MaxRequestThreadCount = 100`, a burst of reads/writes into a stalled driver pins request threads for the full pipeline timeout, exhausting the pool and stalling unrelated sessions. The call cannot be cancelled by a client timeout. **Recommendation:** Derive a `CancellationToken` from the `OperationContext` / `TransportQuotas.OperationTimeout` so a stuck driver call is abandoned. Longer term, use the stack's async service overrides if available. **Resolution:** Resolved 2026-05-23 — added `DriverNodeManager.DeriveOperationCancellation(ISystemContext, TimeSpan fallback)` helper that reads `SystemContext.OperationContext.OperationDeadline` (which the stack sets from the client's `RequestHeader.TimeoutHint`). `OnReadValue` and `OnWriteValue` now pass `cts.Token` to `_invoker.ExecuteAsync` / `ExecuteWriteAsync` instead of `CancellationToken.None`, and surface `BadTimeout` (instead of `BadInternalError`) when the deadline fires. Handles both the SDK's sentinel deadlines: `DateTime.MinValue` (no deadline plumbed) and `DateTime.MaxValue` (TimeoutHint=0, the SDK default) collapse to a 30-s fallback. A deadline > Int32.MaxValue ms in the future also clamps to the fallback so the read path never throws `ArgumentOutOfRangeException` from inside `CancellationTokenSource(TimeSpan)`. Regression tests in `DriverNodeManagerCancellationTests` cover all five paths (future / past / missing / MinValue / MaxValue). ### Server-007 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:179-183` | | Status | Resolved | **Description:** `HealthEndpointsHost` is built without a `configDbHealthy` delegate, so the default `() => true` is used — `/healthz` always reports `configDbReachable = true` and never 503s on a DB outage. `_staleConfigFlag` is also never supplied by `Program.cs`, so the stale-config signal is inert too. `/healthz` degenerates to a pure liveness probe; operators get a false-healthy during a DB outage. **Recommendation:** Wire a real config-DB probe (cheap cached `SELECT 1`) into `HealthEndpointsHost`, and register `StaleConfigFlag` in `Program.cs`. Or move DB health to `/readyz` and drop the misleading `configDbReachable` field. **Resolution:** Resolved 2026-05-22 — added `Func? configDbHealthy` parameter to `OpcUaApplicationHost` (defaults null, backward-compatible); `Program.cs` constructs a `DbHealthCache` that calls `CanConnectAsync` every 10 s and caches the result, then passes `() => dbHealthCache.IsHealthy`; `/healthz` now reflects real DB reachability and returns 503 on a DB outage (unless stale-config cache is warm). ### Server-008 | Field | Value | |---|---| | Severity | Low | | Category | Error handling & resilience | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:736` | | Status | Resolved | **Description:** `RouteScriptedAlarmMethodCalls` marks a handled slot by setting `errors[i] = ServiceResult.Good`, assuming `base.Call` skips non-null *Good* error slots. The stack and `GateCallMethodRequests` only ever pre-populate *Bad* slots; the skip-on-Good assumption is not a guaranteed SDK contract. If `base.Call` re-dispatches, the engine method and the stack's built-in Part 9 handler both fire — double transition. **Recommendation:** Verify against the pinned SDK whether `base.Call` skips Good-pre-populated slots. If not, exclude routed slots from `methodsToCall` before `base.Call`. Add a test asserting exactly-once engine transition for a routed Acknowledge. **Resolution:** Resolved 2026-05-23 — verified against the pinned SDK (DeepWiki query against OPCFoundation/UA-.NETStandard): `CustomNodeManager2.Call` / `CallInternalAsync` skip slots whose `CallMethodRequest.Processed` flag is `true`, not slots whose `errors[i]` is a non-Bad `ServiceResult`. `RouteScriptedAlarmMethodCalls` now sets `request.Processed = true` on every handled slot — success, `ArgumentException`, and generic exception paths — so `base.Call` never re-dispatches a routed Acknowledge / Confirm / AddComment to the stack's built-in Part 9 handler. Regression tests in `ScriptedAlarmMethodRoutingProcessedFlagTests` assert `Processed` is `true` after each engine path and `false` for slots the helper passes through to `base.Call`. ### Server-009 | Field | Value | |---|---| | Severity | High | | Category | Security | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/LdapOptions.cs:44`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:74` | | Status | Resolved | **Description:** `AllowInsecureLdap` defaults to `true` (and `Program.cs` reads `?? true`); `UseTls` defaults to `false`. Out of the box, usernames and plaintext passwords are bound to LDAP over an unencrypted socket. A production deployment enabling LDAP without explicitly setting `AllowInsecureLdap=false` ships credentials in clear text on the server→LDAP hop. **Recommendation:** Default `AllowInsecureLdap` to `false` in both the property initializer and the `Program.cs` fallback. Log a startup warning when LDAP is enabled with `UseTls=false && AllowInsecureLdap=true`. **Resolution:** Resolved 2026-05-22 — `LdapOptions.AllowInsecureLdap` now defaults to `false` (secure-by-default) and `Program.cs`'s config fallback reads `?? false`. `Program.cs` logs a startup `Log.Warning` when LDAP is enabled with `UseTls=false && AllowInsecureLdap=true`, flagging the clear-text credential hop. Regression tests in `LdapOptionsTests` assert the new secure defaults. ### Server-010 | Field | Value | |---|---| | Severity | Medium | | Category | Security | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:59`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:284-291` | | Status | Resolved | **Description:** `AutoAcceptUntrustedClientCertificates` defaults to `true` (`Program.cs` reads `?? true`). `BuildConfiguration` wires a handler that accepts any client cert failing with `BadCertificateUntrusted`. A deployment that forgets to flip the flag accepts every untrusted client cert, defeating the PKI trust list. With the always-present `None` policy, the default posture is fully open. **Recommendation:** Default `AutoAcceptUntrustedClientCertificates` to `false`; keep auto-accept as opt-in dev convenience. `docs/security.md` already shows `false` — align code to doc. **Resolution:** Resolved 2026-05-22 — `OpcUaServerOptions.AutoAcceptUntrustedClientCertificates` property initialiser changed from `true` to `false` (secure by default, aligning with `docs/security.md`); `Program.cs` config fallback changed from `?? true` to `?? false`. ### Server-011 | Field | Value | |---|---| | Severity | Medium | | Category | Security | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:322-346` | | Status | Resolved | **Description:** `BuildUserTokenPolicies` advertises a `UserName` token policy only when `SecurityProfile == Basic256Sha256SignAndEncrypt && Ldap.Enabled`. With the default `SecurityProfile = None` and `Ldap.Enabled = true`, the LDAP authenticator is wired but no UserName policy is advertised — clients cannot present credentials; the only path in is Anonymous. The operator's intent is silently not honoured, with no diagnostic. **Recommendation:** Validate config at startup and warn/fail when `Ldap.Enabled = true` but no UserName policy is advertised. Allow UserName tokens on any non-None profile (they are stack-encrypted regardless, per `docs/security.md`). **Resolution:** Resolved 2026-05-22 — `BuildUserTokenPolicies` now advertises a `UserName` token policy whenever `Ldap.Enabled && SecurityProfile != None` (previously required `== Basic256Sha256SignAndEncrypt`); `StartAsync` logs a `LogWarning` at startup when `Ldap.Enabled = true` but `SecurityProfile = None`, surfacing the misconfiguration before clients connect. ### Server-012 | Field | Value | |---|---| | Severity | Low | | Category | Performance & resource management | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79` | | Status | Resolved | **Description:** `ProbeAsync` creates an `IHttpClientFactory` client and mutates `client.Timeout` on every 2-second probe tick. The timeout belongs on the request or on the named-client registration, not set per call on a factory-vended instance. **Recommendation:** Configure the timeout once via `AddHttpClient(HttpClientName).ConfigureHttpClient(...)`, or use a per-request linked `CancellationTokenSource(_options.HttpProbeTimeout)`; drop the per-call `client.Timeout` mutation. **Resolution:** Resolved 2026-05-23 — `ProbeAsync` no longer mutates `client.Timeout`. Replaced with a per-call `CancellationTokenSource(_options.HttpProbeTimeout)` linked to the loop's shutdown token; `GetAsync` consumes the linked token so the per-request deadline is enforced via cancellation instead of via the factory-vended `HttpClient` instance. Regression test `PeerHttpProbeLoopTests.Tick_does_not_mutate_factory_vended_client_Timeout` asserts the timeout-on-client mutation is gone. ### Server-013 | Field | Value | |---|---| | Severity | Medium | | Category | Design-document adherence | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:9-19`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:296-346`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:89` | | Status | Resolved | **Description:** `docs/security.md` documents 7 transport security profiles and `CLAUDE.md` references a `SecurityProfileResolver`. The code's `OpcUaSecurityProfile` enum has only `None` and `Basic256Sha256SignAndEncrypt`; `BuildSecurityPolicies` adds a policy only for the latter; `SecurityProfileResolver` does not exist in the repo (grep finds it only in docs). `Basic256Sha256-Sign` and all Aes profiles are unimplemented, and `Program.cs:89`'s `Enum.TryParse` silently selects `None` for an unrecognised profile string. **Recommendation:** Reconcile code and docs — implement the missing profiles + `SecurityProfileResolver`, or trim `docs/security.md` / `CLAUDE.md` to the two supported profiles. At minimum, log a warning when a configured `SecurityProfile` fails to parse instead of silently using `None`. **Resolution:** Resolved 2026-05-22 — replaced the silent `Enum.TryParse ?? None` fallback in `Program.cs` with a `ParseSecurityProfile` helper that produces a warning string listing supported profiles when the configured value is unrecognised; the warning is emitted via `Log.Warning` at startup before the host builds, making the misconfiguration immediately visible. Implementing the missing 5 profiles is tracked as a doc-to-code gap rather than a single finding fix. ### Server-014 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs` | | Status | Resolved | **Description:** `SealedBootstrap` claims in its xml-doc to "close release blocker #2" by consuming the generation-sealed cache + resilient reader + stale-config flag, but `Program.cs` registers and uses `NodeBootstrap` instead. `SealedBootstrap` is never registered in DI nor referenced by `OpcUaServerService` — it and its `StaleConfigFlag` plumbing are dead in the production wire-up; the release blocker remains open in practice. **Recommendation:** Either register `SealedBootstrap` (with `GenerationSealedCache`/`ResilientConfigReader`/`StaleConfigFlag`) and wire `StaleConfigFlag` into the health host, or delete `SealedBootstrap` and correct the release-readiness doc. **Resolution:** Resolved 2026-05-23 — added `ServerWiring.AddSealedBootstrap` DI helper that registers `GenerationSealedCache` (rooted at a `.sealed` sibling of `NodeOptions.LocalCachePath`), `StaleConfigFlag`, `ResilientConfigReader`, and `SealedBootstrap`. `Program.cs` calls it after `AddSingleton()`; `OpcUaServerService` now consumes `SealedBootstrap` instead of `NodeBootstrap`; `OpcUaApplicationHost` is constructed with `staleConfigFlag` resolved from DI so `/healthz`'s `usingStaleConfig` reflects the cache-fallback state. The legacy `NodeBootstrap` registration stays for back-compat with the integration tests that construct it directly. Regression test `SealedBootstrapWiringTests.SealedBootstrap_and_its_dependencies_are_registered_in_DI` asserts the registrations compose without missing-service exceptions; `SealedBootstrap.cs`'s xml-doc updated to describe the live wire-up rather than the deferred plan. ### Server-015 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26` | | Status | Resolved | **Description:** `OtOpcUaServer`'s class doc still says "PR 16 minimum-viable scope ... no security ... LDAP + security profiles are deferred." `OpcUaServerOptions`'s says "PR 17 minimum-viable scope: no LDAP, no security profiles beyond None." Both are stale — the class now does LDAP UserName auth, anonymous-role mapping, and a configurable security profile. A reader would wrongly conclude the server has no authentication. **Recommendation:** Update both class summaries to describe current behaviour and drop the "deferred to a future PR" language. **Resolution:** Resolved 2026-05-23 — rewrote both class summaries. `OtOpcUaServer` now describes the live LDAP UserName / Anonymous identity-token flow, the `RoleBasedIdentity` wrapper, and the configurable `SecurityProfile` driven by `OpcUaServerOptions`. `OpcUaServerOptions` now describes endpoint + identity + PKI + health + LDAP + anonymous-role surfaces and points at `docs/security.md`. The stale "PR 16 / PR 17 minimum-viable scope" and "deferred to their own PR" language is gone.