Files
lmxopcua/code-reviews/Server/findings.md
Joseph Doherty adf794f791 fix(server): resolve High code-review findings (Server-002, Server-009)
Server-002 — AuthorizationGate lax-mode no longer overrides explicit deny.
IsAllowed now switches on the evaluator's AuthorizationVerdict: Allow -> true,
Denied (an authored deny rule matched) -> false in BOTH strict and lax mode,
and only the indeterminate NotGranted case falls through to !_strictMode.
Previously `if (decision.IsAllowed) return true; return !_strictMode;` let lax
mode (the default) nullify authored NodeAcl deny rules for fully-resolved
sessions. The tri-state AuthorizationVerdict.Denied member is now honoured.

Server-009 — LDAP is secure-by-default. LdapOptions.AllowInsecureLdap now
defaults to false (was true) and Program.cs's config fallback reads `?? false`
(was `?? true`), so an LDAP-enabled deployment will not bind credentials over
an unencrypted socket unless an operator explicitly opts in. Program.cs also
logs a startup warning when LDAP is enabled with UseTls=false and
AllowInsecureLdap=true, flagging the clear-text server->LDAP credential hop.

Regression tests: AuthorizationGateTests covers all four verdict x mode
combinations via a fixed-verdict evaluator stub; new LdapOptionsTests asserts
the secure defaults. Both Server and Server.Tests build clean; the 15 targeted
tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 06:11:06 -04:00

238 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — Server
| Field | Value |
|---|---|
| Module | `src/Server/ZB.MOM.WW.OtOpcUa.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 12 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Server-001, Server-002, Server-003 |
| 2 | OtOpcUa conventions | Server-004 |
| 3 | Concurrency & thread safety | Server-005, Server-006 |
| 4 | Error handling & resilience | Server-007, Server-008 |
| 5 | Security | Server-009, Server-010, Server-011 |
| 6 | Performance & resource management | Server-012 |
| 7 | Design-document adherence | Server-013 |
| 8 | Code organization & conventions | Server-014 |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Server-015 |
## Findings
### Server-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:1791` |
| Status | Resolved |
**Description:** `WriteNodeIdUnknown` calls itself unconditionally as its first statement, then sets `errors[i]`. Unbounded recursion with no base case overflows the stack. Called from all four `HistoryRead*` overrides whenever a HistoryRead targets a node whose `NodeId` cannot be resolved to a driver full reference. Any client issuing such a HistoryRead triggers an uncatchable `StackOverflowException` that terminates the process — a remotely-triggerable DoS.
**Recommendation:** Replace the self-call with the result-slot assignment mirroring `WriteUnsupported`/`WriteInternalError`: `results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown };` then `errors[i] = StatusCodes.BadNodeIdUnknown;`.
**Resolution:** Resolved 2026-05-22 — replaced the unconditional self-call in `WriteNodeIdUnknown` with the result-slot assignment (`results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown }`), mirroring `WriteUnsupported`/`WriteInternalError`; the helper is now `internal` for testability. Regression test `DriverNodeManagerHistoryMappingTests.WriteNodeIdUnknown_returns_BadNodeIdUnknown_without_unbounded_recursion` runs the helper on a small-stack worker thread and asserts it returns promptly with `BadNodeIdUnknown`.
### Server-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs:60-63` |
| Status | Resolved |
**Description:** `IsAllowed` does `if (decision.IsAllowed) return true; return !_strictMode;`. When a session carries resolved LDAP groups and the evaluator returns an explicit deny, lax mode (default) overrides it to `true`. The lax fallback is intended only for sessions lacking LDAP groups / missing tries, but here it also nullifies authored `NodeAcl` deny rules for fully-resolved sessions. Per-tag deny ACLs do nothing until `StrictMode` is on.
**Recommendation:** Distinguish "indeterminate / no grant" from "explicit deny." Fall through to `!_strictMode` only when indeterminate; an explicit deny returns `false` regardless of mode. Extend `AuthorizeDecision` with an `IsExplicitDeny` flag if needed.
**Resolution:** Resolved 2026-05-22 — `AuthorizationGate.IsAllowed` now switches on the evaluator's `AuthorizationVerdict`: `Allow` returns true, `Denied` (explicit deny rule matched) returns false in both strict and lax mode, and only the indeterminate `NotGranted` case falls through to `!_strictMode`. The existing `AuthorizationVerdict.Denied` tri-state member is now honoured rather than collapsed into the lax fallback. Regression tests `ExplicitDeny_LaxMode_Denies` / `ExplicitDeny_StrictMode_Denies` / `NotGranted_LaxMode_Allows` / `NotGranted_StrictMode_Denies` in `AuthorizationGateTests` cover all four verdict×mode combinations via a fixed-verdict evaluator stub.
### Server-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Phase7/RingBufferHistoryWriter.cs:96-119` |
| Status | Open |
**Description:** `ReadRawAsync`'s XML doc claims "newest-first," but `TagRingBuffer.Snapshot()` returns oldest-to-newest and the loop preserves that order — so results are oldest-first. Also `maxValuesPerNode` is capped against total buffer size *before* the `[startUtc, endUtc)` filter, so a paged read returns the oldest in-window samples, contradicting the doc and usual HistoryRead expectations.
**Recommendation:** Make code and doc agree on ordering (raw HistoryRead is normally ascending source-timestamp). Apply `maxValuesPerNode` to the in-window count, not the whole buffer.
**Resolution:** _(open)_
### Server-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:187-200` |
| Status | Open |
**Description:** `RoleBasedIdentity` declares its own `Display` property, but the base `UserIdentity` already has a settable `DisplayName`. `DriverNodeManager.ResolveCallUser`/`RouteScriptedAlarmMethodCalls` read the base `DisplayName`, never `Display`. Since the ctor passes only `userName` to base, `DisplayName` resolves to the username — so scripted-alarm Ack/Confirm/Shelve audit entries record the raw username, not the LDAP-resolved display name the comment promises. `Display` is dead code.
**Recommendation:** Drop `Display`; set the base `DisplayName = displayName ?? userName;`. Verify `ResolveCallUser` yields the resolved display name.
**Resolution:** _(open)_
### Server-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Alarms/AlarmConditionService.cs:166`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:303-311` |
| Status | Open |
**Description:** `OnValueChanged` raises `TransitionRaised` on the value-change thread; the subscriber `OnAlarmServiceTransition` drives `ConditionSink.OnTransition``alarm.ReportEvent`. `DriverNodeManager.Dispose` detaches the handler but does not synchronise against an in-flight `Invoke`. The service is process-shared across drivers, so a transition can dispatch to a `ConditionSink` whose `DriverNodeManager` is concurrently being disposed → `ReportEvent` on a torn-down node manager.
**Recommendation:** Guard `OnAlarmServiceTransition` with a `_disposed` check under `Lock` before `sink.OnTransition`. Document that handlers must tolerate invocation during their owner's disposal.
**Resolution:** _(open)_
### Server-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:478-482, 1342-1348` |
| Status | Open |
**Description:** `OnReadValue`/`OnWriteValue` are synchronous stack hooks that block on async driver calls via `.GetAwaiter().GetResult()` with `CancellationToken.None`. With `MaxRequestThreadCount = 100`, a burst of reads/writes into a stalled driver pins request threads for the full pipeline timeout, exhausting the pool and stalling unrelated sessions. The call cannot be cancelled by a client timeout.
**Recommendation:** Derive a `CancellationToken` from the `OperationContext` / `TransportQuotas.OperationTimeout` so a stuck driver call is abandoned. Longer term, use the stack's async service overrides if available.
**Resolution:** _(open)_
### Server-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:179-183` |
| Status | Open |
**Description:** `HealthEndpointsHost` is built without a `configDbHealthy` delegate, so the default `() => true` is used — `/healthz` always reports `configDbReachable = true` and never 503s on a DB outage. `_staleConfigFlag` is also never supplied by `Program.cs`, so the stale-config signal is inert too. `/healthz` degenerates to a pure liveness probe; operators get a false-healthy during a DB outage.
**Recommendation:** Wire a real config-DB probe (cheap cached `SELECT 1`) into `HealthEndpointsHost`, and register `StaleConfigFlag` in `Program.cs`. Or move DB health to `/readyz` and drop the misleading `configDbReachable` field.
**Resolution:** _(open)_
### Server-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:736` |
| Status | Open |
**Description:** `RouteScriptedAlarmMethodCalls` marks a handled slot by setting `errors[i] = ServiceResult.Good`, assuming `base.Call` skips non-null *Good* error slots. The stack and `GateCallMethodRequests` only ever pre-populate *Bad* slots; the skip-on-Good assumption is not a guaranteed SDK contract. If `base.Call` re-dispatches, the engine method and the stack's built-in Part 9 handler both fire — double transition.
**Recommendation:** Verify against the pinned SDK whether `base.Call` skips Good-pre-populated slots. If not, exclude routed slots from `methodsToCall` before `base.Call`. Add a test asserting exactly-once engine transition for a routed Acknowledge.
**Resolution:** _(open)_
### Server-009
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/LdapOptions.cs:44`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:74` |
| Status | Resolved |
**Description:** `AllowInsecureLdap` defaults to `true` (and `Program.cs` reads `?? true`); `UseTls` defaults to `false`. Out of the box, usernames and plaintext passwords are bound to LDAP over an unencrypted socket. A production deployment enabling LDAP without explicitly setting `AllowInsecureLdap=false` ships credentials in clear text on the server→LDAP hop.
**Recommendation:** Default `AllowInsecureLdap` to `false` in both the property initializer and the `Program.cs` fallback. Log a startup warning when LDAP is enabled with `UseTls=false && AllowInsecureLdap=true`.
**Resolution:** Resolved 2026-05-22 — `LdapOptions.AllowInsecureLdap` now defaults to `false` (secure-by-default) and `Program.cs`'s config fallback reads `?? false`. `Program.cs` logs a startup `Log.Warning` when LDAP is enabled with `UseTls=false && AllowInsecureLdap=true`, flagging the clear-text credential hop. Regression tests in `LdapOptionsTests` assert the new secure defaults.
### Server-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:59`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:284-291` |
| Status | Open |
**Description:** `AutoAcceptUntrustedClientCertificates` defaults to `true` (`Program.cs` reads `?? true`). `BuildConfiguration` wires a handler that accepts any client cert failing with `BadCertificateUntrusted`. A deployment that forgets to flip the flag accepts every untrusted client cert, defeating the PKI trust list. With the always-present `None` policy, the default posture is fully open.
**Recommendation:** Default `AutoAcceptUntrustedClientCertificates` to `false`; keep auto-accept as opt-in dev convenience. `docs/security.md` already shows `false` — align code to doc.
**Resolution:** _(open)_
### Server-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:322-346` |
| Status | Open |
**Description:** `BuildUserTokenPolicies` advertises a `UserName` token policy only when `SecurityProfile == Basic256Sha256SignAndEncrypt && Ldap.Enabled`. With the default `SecurityProfile = None` and `Ldap.Enabled = true`, the LDAP authenticator is wired but no UserName policy is advertised — clients cannot present credentials; the only path in is Anonymous. The operator's intent is silently not honoured, with no diagnostic.
**Recommendation:** Validate config at startup and warn/fail when `Ldap.Enabled = true` but no UserName policy is advertised. Allow UserName tokens on any non-None profile (they are stack-encrypted regardless, per `docs/security.md`).
**Resolution:** _(open)_
### Server-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79` |
| Status | Open |
**Description:** `ProbeAsync` creates an `IHttpClientFactory` client and mutates `client.Timeout` on every 2-second probe tick. The timeout belongs on the request or on the named-client registration, not set per call on a factory-vended instance.
**Recommendation:** Configure the timeout once via `AddHttpClient(HttpClientName).ConfigureHttpClient(...)`, or use a per-request linked `CancellationTokenSource(_options.HttpProbeTimeout)`; drop the per-call `client.Timeout` mutation.
**Resolution:** _(open)_
### Server-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:9-19`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:296-346`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:89` |
| Status | Open |
**Description:** `docs/security.md` documents 7 transport security profiles and `CLAUDE.md` references a `SecurityProfileResolver`. The code's `OpcUaSecurityProfile` enum has only `None` and `Basic256Sha256SignAndEncrypt`; `BuildSecurityPolicies` adds a policy only for the latter; `SecurityProfileResolver` does not exist in the repo (grep finds it only in docs). `Basic256Sha256-Sign` and all Aes profiles are unimplemented, and `Program.cs:89`'s `Enum.TryParse` silently selects `None` for an unrecognised profile string.
**Recommendation:** Reconcile code and docs — implement the missing profiles + `SecurityProfileResolver`, or trim `docs/security.md` / `CLAUDE.md` to the two supported profiles. At minimum, log a warning when a configured `SecurityProfile` fails to parse instead of silently using `None`.
**Resolution:** _(open)_
### Server-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs` |
| Status | Open |
**Description:** `SealedBootstrap` claims in its xml-doc to "close release blocker #2" by consuming the generation-sealed cache + resilient reader + stale-config flag, but `Program.cs` registers and uses `NodeBootstrap` instead. `SealedBootstrap` is never registered in DI nor referenced by `OpcUaServerService` — it and its `StaleConfigFlag` plumbing are dead in the production wire-up; the release blocker remains open in practice.
**Recommendation:** Either register `SealedBootstrap` (with `GenerationSealedCache`/`ResilientConfigReader`/`StaleConfigFlag`) and wire `StaleConfigFlag` into the health host, or delete `SealedBootstrap` and correct the release-readiness doc.
**Resolution:** _(open)_
### Server-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26` |
| Status | Open |
**Description:** `OtOpcUaServer`'s class doc still says "PR 16 minimum-viable scope ... no security ... LDAP + security profiles are deferred." `OpcUaServerOptions`'s says "PR 17 minimum-viable scope: no LDAP, no security profiles beyond None." Both are stale — the class now does LDAP UserName auth, anonymous-role mapping, and a configurable security profile. A reader would wrongly conclude the server has no authentication.
**Recommendation:** Update both class summaries to describe current behaviour and drop the "deferred to a future PR" language.
**Resolution:** _(open)_