docs(code-reviews): comprehensive per-module review pass at 76d35d1

Reviewed all 31 src/ production projects against the 10-category
checklist in REVIEW-PROCESS.md. Each module gets its own findings.md;
code-reviews/README.md is regenerated from them.

334 findings: 6 Critical, 46 High, 126 Medium, 156 Low.

Critical findings:
- Server-001: WriteNodeIdUnknown recurses unconditionally — a HistoryRead
  on an unresolvable node crashes the process (remote DoS).
- Admin-001/002: app-wide auth bypass (RouteView not AuthorizeRouteView)
  plus unauthenticated mutating routes.
- Core.Scripting-001: System.Environment reachable from operator scripts;
  Environment.Exit() terminates the server.
- Core.AlarmHistorian-001: rowIds/events parallel-list desync on a corrupt
  payload misapplies outcomes — silent alarm-event data loss.
- Driver.Galaxy-001: ReconnectSupervisor is built but never triggered, so
  a transient gateway drop permanently kills the event stream.

All findings are Status=Open; resolution is tracked per REVIEW-PROCESS.md
section 4. Review only — no source code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-22 05:20:27 -04:00
parent 76d35d1b9f
commit 8568f5cd85
32 changed files with 8134 additions and 2 deletions

View File

@@ -0,0 +1,237 @@
# Code Review — Server
| Field | Value |
|---|---|
| Module | `src/Server/ZB.MOM.WW.OtOpcUa.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
| Open findings | 15 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Server-001, Server-002, Server-003 |
| 2 | OtOpcUa conventions | Server-004 |
| 3 | Concurrency & thread safety | Server-005, Server-006 |
| 4 | Error handling & resilience | Server-007, Server-008 |
| 5 | Security | Server-009, Server-010, Server-011 |
| 6 | Performance & resource management | Server-012 |
| 7 | Design-document adherence | Server-013 |
| 8 | Code organization & conventions | Server-014 |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Server-015 |
## Findings
### Server-001
| Field | Value |
|---|---|
| Severity | Critical |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:1791` |
| Status | Open |
**Description:** `WriteNodeIdUnknown` calls itself unconditionally as its first statement, then sets `errors[i]`. Unbounded recursion with no base case overflows the stack. Called from all four `HistoryRead*` overrides whenever a HistoryRead targets a node whose `NodeId` cannot be resolved to a driver full reference. Any client issuing such a HistoryRead triggers an uncatchable `StackOverflowException` that terminates the process — a remotely-triggerable DoS.
**Recommendation:** Replace the self-call with the result-slot assignment mirroring `WriteUnsupported`/`WriteInternalError`: `results[i] = new OpcHistoryReadResult { StatusCode = StatusCodes.BadNodeIdUnknown };` then `errors[i] = StatusCodes.BadNodeIdUnknown;`.
**Resolution:** _(open)_
### Server-002
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/AuthorizationGate.cs:60-63` |
| Status | Open |
**Description:** `IsAllowed` does `if (decision.IsAllowed) return true; return !_strictMode;`. When a session carries resolved LDAP groups and the evaluator returns an explicit deny, lax mode (default) overrides it to `true`. The lax fallback is intended only for sessions lacking LDAP groups / missing tries, but here it also nullifies authored `NodeAcl` deny rules for fully-resolved sessions. Per-tag deny ACLs do nothing until `StrictMode` is on.
**Recommendation:** Distinguish "indeterminate / no grant" from "explicit deny." Fall through to `!_strictMode` only when indeterminate; an explicit deny returns `false` regardless of mode. Extend `AuthorizeDecision` with an `IsExplicitDeny` flag if needed.
**Resolution:** _(open)_
### Server-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Phase7/RingBufferHistoryWriter.cs:96-119` |
| Status | Open |
**Description:** `ReadRawAsync`'s XML doc claims "newest-first," but `TagRingBuffer.Snapshot()` returns oldest-to-newest and the loop preserves that order — so results are oldest-first. Also `maxValuesPerNode` is capped against total buffer size *before* the `[startUtc, endUtc)` filter, so a paged read returns the oldest in-window samples, contradicting the doc and usual HistoryRead expectations.
**Recommendation:** Make code and doc agree on ordering (raw HistoryRead is normally ascending source-timestamp). Apply `maxValuesPerNode` to the in-window count, not the whole buffer.
**Resolution:** _(open)_
### Server-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | OtOpcUa conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:187-200` |
| Status | Open |
**Description:** `RoleBasedIdentity` declares its own `Display` property, but the base `UserIdentity` already has a settable `DisplayName`. `DriverNodeManager.ResolveCallUser`/`RouteScriptedAlarmMethodCalls` read the base `DisplayName`, never `Display`. Since the ctor passes only `userName` to base, `DisplayName` resolves to the username — so scripted-alarm Ack/Confirm/Shelve audit entries record the raw username, not the LDAP-resolved display name the comment promises. `Display` is dead code.
**Recommendation:** Drop `Display`; set the base `DisplayName = displayName ?? userName;`. Verify `ResolveCallUser` yields the resolved display name.
**Resolution:** _(open)_
### Server-005
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Alarms/AlarmConditionService.cs:166`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:303-311` |
| Status | Open |
**Description:** `OnValueChanged` raises `TransitionRaised` on the value-change thread; the subscriber `OnAlarmServiceTransition` drives `ConditionSink.OnTransition``alarm.ReportEvent`. `DriverNodeManager.Dispose` detaches the handler but does not synchronise against an in-flight `Invoke`. The service is process-shared across drivers, so a transition can dispatch to a `ConditionSink` whose `DriverNodeManager` is concurrently being disposed → `ReportEvent` on a torn-down node manager.
**Recommendation:** Guard `OnAlarmServiceTransition` with a `_disposed` check under `Lock` before `sink.OnTransition`. Document that handlers must tolerate invocation during their owner's disposal.
**Resolution:** _(open)_
### Server-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:478-482, 1342-1348` |
| Status | Open |
**Description:** `OnReadValue`/`OnWriteValue` are synchronous stack hooks that block on async driver calls via `.GetAwaiter().GetResult()` with `CancellationToken.None`. With `MaxRequestThreadCount = 100`, a burst of reads/writes into a stalled driver pins request threads for the full pipeline timeout, exhausting the pool and stalling unrelated sessions. The call cannot be cancelled by a client timeout.
**Recommendation:** Derive a `CancellationToken` from the `OperationContext` / `TransportQuotas.OperationTimeout` so a stuck driver call is abandoned. Longer term, use the stack's async service overrides if available.
**Resolution:** _(open)_
### Server-007
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:179-183` |
| Status | Open |
**Description:** `HealthEndpointsHost` is built without a `configDbHealthy` delegate, so the default `() => true` is used — `/healthz` always reports `configDbReachable = true` and never 503s on a DB outage. `_staleConfigFlag` is also never supplied by `Program.cs`, so the stale-config signal is inert too. `/healthz` degenerates to a pure liveness probe; operators get a false-healthy during a DB outage.
**Recommendation:** Wire a real config-DB probe (cheap cached `SELECT 1`) into `HealthEndpointsHost`, and register `StaleConfigFlag` in `Program.cs`. Or move DB health to `/readyz` and drop the misleading `configDbReachable` field.
**Resolution:** _(open)_
### Server-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/DriverNodeManager.cs:736` |
| Status | Open |
**Description:** `RouteScriptedAlarmMethodCalls` marks a handled slot by setting `errors[i] = ServiceResult.Good`, assuming `base.Call` skips non-null *Good* error slots. The stack and `GateCallMethodRequests` only ever pre-populate *Bad* slots; the skip-on-Good assumption is not a guaranteed SDK contract. If `base.Call` re-dispatches, the engine method and the stack's built-in Part 9 handler both fire — double transition.
**Recommendation:** Verify against the pinned SDK whether `base.Call` skips Good-pre-populated slots. If not, exclude routed slots from `methodsToCall` before `base.Call`. Add a test asserting exactly-once engine transition for a routed Acknowledge.
**Resolution:** _(open)_
### Server-009
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Security/LdapOptions.cs:44`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:74` |
| Status | Open |
**Description:** `AllowInsecureLdap` defaults to `true` (and `Program.cs` reads `?? true`); `UseTls` defaults to `false`. Out of the box, usernames and plaintext passwords are bound to LDAP over an unencrypted socket. A production deployment enabling LDAP without explicitly setting `AllowInsecureLdap=false` ships credentials in clear text on the server→LDAP hop.
**Recommendation:** Default `AllowInsecureLdap` to `false` in both the property initializer and the `Program.cs` fallback. Log a startup warning when LDAP is enabled with `UseTls=false && AllowInsecureLdap=true`.
**Resolution:** _(open)_
### Server-010
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:59`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:284-291` |
| Status | Open |
**Description:** `AutoAcceptUntrustedClientCertificates` defaults to `true` (`Program.cs` reads `?? true`). `BuildConfiguration` wires a handler that accepts any client cert failing with `BadCertificateUntrusted`. A deployment that forgets to flip the flag accepts every untrusted client cert, defeating the PKI trust list. With the always-present `None` policy, the default posture is fully open.
**Recommendation:** Default `AutoAcceptUntrustedClientCertificates` to `false`; keep auto-accept as opt-in dev convenience. `docs/security.md` already shows `false` — align code to doc.
**Resolution:** _(open)_
### Server-011
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:322-346` |
| Status | Open |
**Description:** `BuildUserTokenPolicies` advertises a `UserName` token policy only when `SecurityProfile == Basic256Sha256SignAndEncrypt && Ldap.Enabled`. With the default `SecurityProfile = None` and `Ldap.Enabled = true`, the LDAP authenticator is wired but no UserName policy is advertised — clients cannot present credentials; the only path in is Anonymous. The operator's intent is silently not honoured, with no diagnostic.
**Recommendation:** Validate config at startup and warn/fail when `Ldap.Enabled = true` but no UserName policy is advertised. Allow UserName tokens on any non-None profile (they are stack-encrypted regardless, per `docs/security.md`).
**Resolution:** _(open)_
### Server-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79` |
| Status | Open |
**Description:** `ProbeAsync` creates an `IHttpClientFactory` client and mutates `client.Timeout` on every 2-second probe tick. The timeout belongs on the request or on the named-client registration, not set per call on a factory-vended instance.
**Recommendation:** Configure the timeout once via `AddHttpClient(HttpClientName).ConfigureHttpClient(...)`, or use a per-request linked `CancellationTokenSource(_options.HttpProbeTimeout)`; drop the per-call `client.Timeout` mutation.
**Resolution:** _(open)_
### Server-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Design-document adherence |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:9-19`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaApplicationHost.cs:296-346`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/Program.cs:89` |
| Status | Open |
**Description:** `docs/security.md` documents 7 transport security profiles and `CLAUDE.md` references a `SecurityProfileResolver`. The code's `OpcUaSecurityProfile` enum has only `None` and `Basic256Sha256SignAndEncrypt`; `BuildSecurityPolicies` adds a policy only for the latter; `SecurityProfileResolver` does not exist in the repo (grep finds it only in docs). `Basic256Sha256-Sign` and all Aes profiles are unimplemented, and `Program.cs:89`'s `Enum.TryParse` silently selects `None` for an unrecognised profile string.
**Recommendation:** Reconcile code and docs — implement the missing profiles + `SecurityProfileResolver`, or trim `docs/security.md` / `CLAUDE.md` to the two supported profiles. At minimum, log a warning when a configured `SecurityProfile` fails to parse instead of silently using `None`.
**Resolution:** _(open)_
### Server-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs` |
| Status | Open |
**Description:** `SealedBootstrap` claims in its xml-doc to "close release blocker #2" by consuming the generation-sealed cache + resilient reader + stale-config flag, but `Program.cs` registers and uses `NodeBootstrap` instead. `SealedBootstrap` is never registered in DI nor referenced by `OpcUaServerService` — it and its `StaleConfigFlag` plumbing are dead in the production wire-up; the release blocker remains open in practice.
**Recommendation:** Either register `SealedBootstrap` (with `GenerationSealedCache`/`ResilientConfigReader`/`StaleConfigFlag`) and wire `StaleConfigFlag` into the health host, or delete `SealedBootstrap` and correct the release-readiness doc.
**Resolution:** _(open)_
### Server-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26` |
| Status | Open |
**Description:** `OtOpcUaServer`'s class doc still says "PR 16 minimum-viable scope ... no security ... LDAP + security profiles are deferred." `OpcUaServerOptions`'s says "PR 17 minimum-viable scope: no LDAP, no security profiles beyond None." Both are stale — the class now does LDAP UserName auth, anonymous-role mapping, and a configurable security profile. A reader would wrongly conclude the server has no authentication.
**Recommendation:** Update both class summaries to describe current behaviour and drop the "deferred to a future PR" language.
**Resolution:** _(open)_