Files
lmxopcua/code-reviews/Server/findings.md
Joseph Doherty 6134050ceb fix(server): resolve Low code-review findings (Server-004,006,008,012,014,015)
- Server-004: pass the role-derived display name to UserIdentity's base
  ctor (the SDK's DisplayName has no public setter) and drop the dead
  Display property; make RoleBasedIdentity internal sealed.
- Server-006: derive a bounded CancellationToken from the SDK's
  OperationContext.OperationDeadline in OnReadValue / OnWriteValue so a
  stalled driver call can no longer pin the request thread.
- Server-008: mark handled slots via CallMethodRequest.Processed = true
  in RouteScriptedAlarmMethodCalls (the SDK skips on Processed, not on a
  Good error slot).
- Server-012: PeerHttpProbeLoop.ProbeAsync stops mutating client.Timeout
  per call; uses a per-request CancellationTokenSource linked to the
  shutdown token instead.
- Server-014: wire SealedBootstrap into Program.cs via AddSealedBootstrap
  + OpcUaServerService so the generation-sealed cache + stale-config flag
  + resilient reader actually run; /healthz now reflects cache-fallback
  state.
- Server-015: replace the stale 'PR 16 / PR 17 minimum-viable scope'
  class summaries on OtOpcUaServer and OpcUaServerOptions with the
  shipped LDAP + anonymous-role + configurable security-profile prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 07:24:20 -04:00

22 KiB
Raw Blame History

Code Review — Server

Field Value
Module src/Server/ZB.MOM.WW.OtOpcUa.Server
Reviewer Claude Code
Review date 2026-05-22
Commit reviewed 76d35d1
Status Reviewed
Open findings 0

Checklist coverage

# Category Result
1 Correctness & logic bugs Server-001, Server-002, Server-003
2 OtOpcUa conventions Server-004
3 Concurrency & thread safety Server-005, Server-006
4 Error handling & resilience Server-007, Server-008
5 Security Server-009, Server-010, Server-011
6 Performance & resource management Server-012
7 Design-document adherence Server-013
8 Code organization & conventions Server-014
9 Testing coverage No issues found
10 Documentation & comments Server-015

Findings

Server-001

Field Value
Severity Critical
Category Correctness & logic bugs
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:1791
Status Resolved

Description: WriteNodeIdUnknown calls itself unconditionally as its first statement, then sets errors[i]. Unbounded recursion with no base case overflows the stack. Called from all four HistoryRead* overrides whenever a HistoryRead targets a node whose NodeId cannot be resolved to a driver full reference. Any client issuing such a HistoryRead triggers an uncatchable StackOverflowException that terminates the process — a remotely-triggerable DoS.

Recommendation: Replace the self-call with the result-slot assignment mirroring WriteUnsupported/WriteInternalError: results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown }; then errors[i] = StatusCodes.BadNodeIdUnknown;.

Resolution: Resolved 2026-05-22 — replaced the unconditional self-call in WriteNodeIdUnknown with the result-slot assignment (results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown }), mirroring WriteUnsupported/WriteInternalError; the helper is now internal for testability. Regression test DriverNodeManagerHistoryMappingTests.WriteNodeIdUnknown_returns_BadNodeIdUnknown_without_unbounded_recursion runs the helper on a small-stack worker thread and asserts it returns promptly with BadNodeIdUnknown.

Server-002

Field Value
Severity High
Category Correctness & logic bugs
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs:60-63
Status Resolved

Description: IsAllowed does if (decision.IsAllowed) return true; return !_strictMode;. When a session carries resolved LDAP groups and the evaluator returns an explicit deny, lax mode (default) overrides it to true. The lax fallback is intended only for sessions lacking LDAP groups / missing tries, but here it also nullifies authored NodeAcl deny rules for fully-resolved sessions. Per-tag deny ACLs do nothing until StrictMode is on.

Recommendation: Distinguish "indeterminate / no grant" from "explicit deny." Fall through to !_strictMode only when indeterminate; an explicit deny returns false regardless of mode. Extend AuthorizeDecision with an IsExplicitDeny flag if needed.

Resolution: Resolved 2026-05-22 — AuthorizationGate.IsAllowed now switches on the evaluator's AuthorizationVerdict: Allow returns true, Denied (explicit deny rule matched) returns false in both strict and lax mode, and only the indeterminate NotGranted case falls through to !_strictMode. The existing AuthorizationVerdict.Denied tri-state member is now honoured rather than collapsed into the lax fallback. Regression tests ExplicitDeny_LaxMode_Denies / ExplicitDeny_StrictMode_Denies / NotGranted_LaxMode_Allows / NotGranted_StrictMode_Denies in AuthorizationGateTests cover all four verdict×mode combinations via a fixed-verdict evaluator stub.

Server-003

Field Value
Severity Medium
Category Correctness & logic bugs
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/Phase7/RingBufferHistoryWriter.cs:96-119
Status Resolved

Description: ReadRawAsync's XML doc claims "newest-first," but TagRingBuffer.Snapshot() returns oldest-to-newest and the loop preserves that order — so results are oldest-first. Also maxValuesPerNode is capped against total buffer size before the [startUtc, endUtc) filter, so a paged read returns the oldest in-window samples, contradicting the doc and usual HistoryRead expectations.

Recommendation: Make code and doc agree on ordering (raw HistoryRead is normally ascending source-timestamp). Apply maxValuesPerNode to the in-window count, not the whole buffer.

Resolution: Resolved 2026-05-22 — corrected XML doc from "newest-first" to "oldest-first (ascending source timestamp, matching OPC UA Part 11 §6.4 raw-values default)"; moved maxValuesPerNode cap inside the time-window loop so the limit applies only to in-window results, not the whole buffer snapshot.

Server-004

Field Value
Severity Low
Category OtOpcUa conventions
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:187-200
Status Resolved

Description: RoleBasedIdentity declares its own Display property, but the base UserIdentity already has a settable DisplayName. DriverNodeManager.ResolveCallUser/RouteScriptedAlarmMethodCalls read the base DisplayName, never Display. Since the ctor passes only userName to base, DisplayName resolves to the username — so scripted-alarm Ack/Confirm/Shelve audit entries record the raw username, not the LDAP-resolved display name the comment promises. Display is dead code.

Recommendation: Drop Display; set the base DisplayName = displayName ?? userName;. Verify ResolveCallUser yields the resolved display name.

Resolution: Resolved 2026-05-23 — re-triaged: in the pinned SDK version (1.5.374.126) UserIdentity.DisplayName is a sealed-virtual auto-property with no public setter, so the base DisplayName = … assignment the original recommendation suggested won't compile. Instead the fix passes displayName ?? userName as the first arg to the base UserIdentity(string, string) ctor — the SDK seeds DisplayName from that arg internally — and removes the dead Display property. RoleBasedIdentity is now internal sealed so DriverNodeManager.ResolveCallUser can be unit-tested against the production identity type. Regression tests RoleBasedIdentityTests.DisplayName_returns_LDAP_resolved_display_name_when_present, DisplayName_falls_back_to_userName_when_LDAP_display_name_is_null, and ResolveCallUser_yields_LDAP_resolved_display_name cover the behaviour.

Server-005

Field Value
Severity Medium
Category Concurrency & thread safety
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/Alarms/AlarmConditionService.cs:166, src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:303-311
Status Resolved

Description: OnValueChanged raises TransitionRaised on the value-change thread; the subscriber OnAlarmServiceTransition drives ConditionSink.OnTransitionalarm.ReportEvent. DriverNodeManager.Dispose detaches the handler but does not synchronise against an in-flight Invoke. The service is process-shared across drivers, so a transition can dispatch to a ConditionSink whose DriverNodeManager is concurrently being disposed → ReportEvent on a torn-down node manager.

Recommendation: Guard OnAlarmServiceTransition with a _disposed check under Lock before sink.OnTransition. Document that handlers must tolerate invocation during their owner's disposal.

Resolution: Resolved 2026-05-22 — added _nodeManagerDisposed field; Dispose(bool) now sets it under Lock before detaching the handler; OnAlarmServiceTransition checks the flag under the same Lock and exits early, preventing forwarding to a sink after the node manager has begun disposal.

Server-006

Field Value
Severity Low
Category Concurrency & thread safety
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:478-482, 1342-1348
Status Resolved

Description: OnReadValue/OnWriteValue are synchronous stack hooks that block on async driver calls via .GetAwaiter().GetResult() with CancellationToken.None. With MaxRequestThreadCount = 100, a burst of reads/writes into a stalled driver pins request threads for the full pipeline timeout, exhausting the pool and stalling unrelated sessions. The call cannot be cancelled by a client timeout.

Recommendation: Derive a CancellationToken from the OperationContext / TransportQuotas.OperationTimeout so a stuck driver call is abandoned. Longer term, use the stack's async service overrides if available.

Resolution: Resolved 2026-05-23 — added DriverNodeManager.DeriveOperationCancellation(ISystemContext, TimeSpan fallback) helper that reads SystemContext.OperationContext.OperationDeadline (which the stack sets from the client's RequestHeader.TimeoutHint). OnReadValue and OnWriteValue now pass cts.Token to _invoker.ExecuteAsync / ExecuteWriteAsync instead of CancellationToken.None, and surface BadTimeout (instead of BadInternalError) when the deadline fires. Handles both the SDK's sentinel deadlines: DateTime.MinValue (no deadline plumbed) and DateTime.MaxValue (TimeoutHint=0, the SDK default) collapse to a 30-s fallback. A deadline > Int32.MaxValue ms in the future also clamps to the fallback so the read path never throws ArgumentOutOfRangeException from inside CancellationTokenSource(TimeSpan). Regression tests in DriverNodeManagerCancellationTests cover all five paths (future / past / missing / MinValue / MaxValue).

Server-007

Field Value
Severity Medium
Category Error handling & resilience
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:179-183
Status Resolved

Description: HealthEndpointsHost is built without a configDbHealthy delegate, so the default () => true is used — /healthz always reports configDbReachable = true and never 503s on a DB outage. _staleConfigFlag is also never supplied by Program.cs, so the stale-config signal is inert too. /healthz degenerates to a pure liveness probe; operators get a false-healthy during a DB outage.

Recommendation: Wire a real config-DB probe (cheap cached SELECT 1) into HealthEndpointsHost, and register StaleConfigFlag in Program.cs. Or move DB health to /readyz and drop the misleading configDbReachable field.

Resolution: Resolved 2026-05-22 — added Func<bool>? configDbHealthy parameter to OpcUaApplicationHost (defaults null, backward-compatible); Program.cs constructs a DbHealthCache that calls CanConnectAsync every 10 s and caches the result, then passes () => dbHealthCache.IsHealthy; /healthz now reflects real DB reachability and returns 503 on a DB outage (unless stale-config cache is warm).

Server-008

Field Value
Severity Low
Category Error handling & resilience
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:736
Status Resolved

Description: RouteScriptedAlarmMethodCalls marks a handled slot by setting errors[i] = ServiceResult.Good, assuming base.Call skips non-null Good error slots. The stack and GateCallMethodRequests only ever pre-populate Bad slots; the skip-on-Good assumption is not a guaranteed SDK contract. If base.Call re-dispatches, the engine method and the stack's built-in Part 9 handler both fire — double transition.

Recommendation: Verify against the pinned SDK whether base.Call skips Good-pre-populated slots. If not, exclude routed slots from methodsToCall before base.Call. Add a test asserting exactly-once engine transition for a routed Acknowledge.

Resolution: Resolved 2026-05-23 — verified against the pinned SDK (DeepWiki query against OPCFoundation/UA-.NETStandard): CustomNodeManager2.Call / CallInternalAsync skip slots whose CallMethodRequest.Processed flag is true, not slots whose errors[i] is a non-Bad ServiceResult. RouteScriptedAlarmMethodCalls now sets request.Processed = true on every handled slot — success, ArgumentException, and generic exception paths — so base.Call never re-dispatches a routed Acknowledge / Confirm / AddComment to the stack's built-in Part 9 handler. Regression tests in ScriptedAlarmMethodRoutingProcessedFlagTests assert Processed is true after each engine path and false for slots the helper passes through to base.Call.

Server-009

Field Value
Severity High
Category Security
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/LdapOptions.cs:44, src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:74
Status Resolved

Description: AllowInsecureLdap defaults to true (and Program.cs reads ?? true); UseTls defaults to false. Out of the box, usernames and plaintext passwords are bound to LDAP over an unencrypted socket. A production deployment enabling LDAP without explicitly setting AllowInsecureLdap=false ships credentials in clear text on the server→LDAP hop.

Recommendation: Default AllowInsecureLdap to false in both the property initializer and the Program.cs fallback. Log a startup warning when LDAP is enabled with UseTls=false && AllowInsecureLdap=true.

Resolution: Resolved 2026-05-22 — LdapOptions.AllowInsecureLdap now defaults to false (secure-by-default) and Program.cs's config fallback reads ?? false. Program.cs logs a startup Log.Warning when LDAP is enabled with UseTls=false && AllowInsecureLdap=true, flagging the clear-text credential hop. Regression tests in LdapOptionsTests assert the new secure defaults.

Server-010

Field Value
Severity Medium
Category Security
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:59, src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:284-291
Status Resolved

Description: AutoAcceptUntrustedClientCertificates defaults to true (Program.cs reads ?? true). BuildConfiguration wires a handler that accepts any client cert failing with BadCertificateUntrusted. A deployment that forgets to flip the flag accepts every untrusted client cert, defeating the PKI trust list. With the always-present None policy, the default posture is fully open.

Recommendation: Default AutoAcceptUntrustedClientCertificates to false; keep auto-accept as opt-in dev convenience. docs/security.md already shows false — align code to doc.

Resolution: Resolved 2026-05-22 — OpcUaServerOptions.AutoAcceptUntrustedClientCertificates property initialiser changed from true to false (secure by default, aligning with docs/security.md); Program.cs config fallback changed from ?? true to ?? false.

Server-011

Field Value
Severity Medium
Category Security
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:322-346
Status Resolved

Description: BuildUserTokenPolicies advertises a UserName token policy only when SecurityProfile == Basic256Sha256SignAndEncrypt && Ldap.Enabled. With the default SecurityProfile = None and Ldap.Enabled = true, the LDAP authenticator is wired but no UserName policy is advertised — clients cannot present credentials; the only path in is Anonymous. The operator's intent is silently not honoured, with no diagnostic.

Recommendation: Validate config at startup and warn/fail when Ldap.Enabled = true but no UserName policy is advertised. Allow UserName tokens on any non-None profile (they are stack-encrypted regardless, per docs/security.md).

Resolution: Resolved 2026-05-22 — BuildUserTokenPolicies now advertises a UserName token policy whenever Ldap.Enabled && SecurityProfile != None (previously required == Basic256Sha256SignAndEncrypt); StartAsync logs a LogWarning at startup when Ldap.Enabled = true but SecurityProfile = None, surfacing the misconfiguration before clients connect.

Server-012

Field Value
Severity Low
Category Performance & resource management
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79
Status Resolved

Description: ProbeAsync creates an IHttpClientFactory client and mutates client.Timeout on every 2-second probe tick. The timeout belongs on the request or on the named-client registration, not set per call on a factory-vended instance.

Recommendation: Configure the timeout once via AddHttpClient(HttpClientName).ConfigureHttpClient(...), or use a per-request linked CancellationTokenSource(_options.HttpProbeTimeout); drop the per-call client.Timeout mutation.

Resolution: Resolved 2026-05-23 — ProbeAsync no longer mutates client.Timeout. Replaced with a per-call CancellationTokenSource(_options.HttpProbeTimeout) linked to the loop's shutdown token; GetAsync consumes the linked token so the per-request deadline is enforced via cancellation instead of via the factory-vended HttpClient instance. Regression test PeerHttpProbeLoopTests.Tick_does_not_mutate_factory_vended_client_Timeout asserts the timeout-on-client mutation is gone.

Server-013

Field Value
Severity Medium
Category Design-document adherence
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:9-19, src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:296-346, src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:89
Status Resolved

Description: docs/security.md documents 7 transport security profiles and CLAUDE.md references a SecurityProfileResolver. The code's OpcUaSecurityProfile enum has only None and Basic256Sha256SignAndEncrypt; BuildSecurityPolicies adds a policy only for the latter; SecurityProfileResolver does not exist in the repo (grep finds it only in docs). Basic256Sha256-Sign and all Aes profiles are unimplemented, and Program.cs:89's Enum.TryParse silently selects None for an unrecognised profile string.

Recommendation: Reconcile code and docs — implement the missing profiles + SecurityProfileResolver, or trim docs/security.md / CLAUDE.md to the two supported profiles. At minimum, log a warning when a configured SecurityProfile fails to parse instead of silently using None.

Resolution: Resolved 2026-05-22 — replaced the silent Enum.TryParse ?? None fallback in Program.cs with a ParseSecurityProfile helper that produces a warning string listing supported profiles when the configured value is unrecognised; the warning is emitted via Log.Warning at startup before the host builds, making the misconfiguration immediately visible. Implementing the missing 5 profiles is tracked as a doc-to-code gap rather than a single finding fix.

Server-014

Field Value
Severity Low
Category Code organization & conventions
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs
Status Resolved

Description: SealedBootstrap claims in its xml-doc to "close release blocker #2" by consuming the generation-sealed cache + resilient reader + stale-config flag, but Program.cs registers and uses NodeBootstrap instead. SealedBootstrap is never registered in DI nor referenced by OpcUaServerService — it and its StaleConfigFlag plumbing are dead in the production wire-up; the release blocker remains open in practice.

Recommendation: Either register SealedBootstrap (with GenerationSealedCache/ResilientConfigReader/StaleConfigFlag) and wire StaleConfigFlag into the health host, or delete SealedBootstrap and correct the release-readiness doc.

Resolution: Resolved 2026-05-23 — added ServerWiring.AddSealedBootstrap DI helper that registers GenerationSealedCache (rooted at a .sealed sibling of NodeOptions.LocalCachePath), StaleConfigFlag, ResilientConfigReader, and SealedBootstrap. Program.cs calls it after AddSingleton<NodeBootstrap>(); OpcUaServerService now consumes SealedBootstrap instead of NodeBootstrap; OpcUaApplicationHost is constructed with staleConfigFlag resolved from DI so /healthz's usingStaleConfig reflects the cache-fallback state. The legacy NodeBootstrap registration stays for back-compat with the integration tests that construct it directly. Regression test SealedBootstrapWiringTests.SealedBootstrap_and_its_dependencies_are_registered_in_DI asserts the registrations compose without missing-service exceptions; SealedBootstrap.cs's xml-doc updated to describe the live wire-up rather than the deferred plan.

Server-015

Field Value
Severity Low
Category Documentation & comments
Location src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21, src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26
Status Resolved

Description: OtOpcUaServer's class doc still says "PR 16 minimum-viable scope ... no security ... LDAP + security profiles are deferred." OpcUaServerOptions's says "PR 17 minimum-viable scope: no LDAP, no security profiles beyond None." Both are stale — the class now does LDAP UserName auth, anonymous-role mapping, and a configurable security profile. A reader would wrongly conclude the server has no authentication.

Recommendation: Update both class summaries to describe current behaviour and drop the "deferred to a future PR" language.

Resolution: Resolved 2026-05-23 — rewrote both class summaries. OtOpcUaServer now describes the live LDAP UserName / Anonymous identity-token flow, the RoleBasedIdentity wrapper, and the configurable SecurityProfile driven by OpcUaServerOptions. OpcUaServerOptions now describes endpoint + identity + PKI + health + LDAP + anonymous-role surfaces and points at docs/security.md. The stale "PR 16 / PR 17 minimum-viable scope" and "deferred to their own PR" language is gone.